Until fairly recently, data scientists’ languages of choice for manipulating and making sense of data were Python, R, or MATLAB, which led to a split in the data science community and to duplicated effort across languages offering similar functionality. Then distributed technologies arrived, seemingly out of the blue, most of them built on a convenient and easy-to-deploy platform, the JVM.
Data scientists are now part of heterogeneous teams that face many problems and must work toward global solutions together, which brings a new responsibility: to be productive and agile so that their work can be integrated into platforms. This is why technologies like Apache Spark are so important and are gaining traction across different communities. And even though some bindings are available for legacy languages, all the creative new ways to analyze data are being developed in Scala.
Using a fully productive and reproducible environment combining the Spark Notebook and Docker, Andy Petrella and Dean Wampler explore what it means to do data science today and why Scala succeeds at coping with large and fast data where older languages fail. Andy and Dean then introduce and summarize the new methodologies and scientific advances in machine learning that use Scala as the main language, including Splash, the min-cut problem, OptiML, needle (DL), ADAM, and more. They also demonstrate how these programs work for data scientists by enabling interactivity, live reactivity, charting capabilities, and robustness in Scala, all of which were still missing from the legacy languages.
Andy Petrella is the CEO of Kensu, an analytics and AI governance company that created the Kensu Data Activity Manager (DAM), a first-of-its-kind governance, compliance, and performance (GCP) solution. He’s a mathematician turned distributed computing entrepreneur. Besides being a Scala and Spark trainer, Andy has participated in many projects built using Spark, Cassandra, and other distributed technologies in various fields including geospatial analysis, the IoT, and automotive and smart cities projects. Andy is the creator of the Spark Notebook, the only reactive and fully Scala notebook for Apache Spark. In 2015, Andy cofounded Data Fellas with Xavier Tordoir around their product the Agile Data Science Toolkit, which facilitates the productization of data science projects and guarantees their maintainability and sustainability over time. Andy is also a member of the program committee for the O’Reilly Strata Data Conference, Scala eXchange, Data Science eXchange, and Devoxx events.
Dean Wampler is an expert in streaming data systems, focusing on applications of machine learning and artificial intelligence (ML/AI). He is Head of Developer Relations at Anyscale, which is developing Ray for distributed Python, primarily for ML/AI. Previously, he was an engineering VP at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and he is the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent conference speaker and tutorial teacher, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He has a Ph.D. in Physics from the University of Washington.
©2016, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.