Until recently, data scientists’ languages of choice for manipulating and making sense of data were Python, R, and MATLAB, which led to a split in the data science community and duplicated effort across languages offering similar functionality. Then distributed technologies came out of the blue, most of them built on a convenient and easy-to-deploy platform, the JVM.
Data scientists are now part of heterogeneous teams that face many problems and must work toward global solutions together, which brings a new responsibility: to be productive and agile so that their work can be integrated into platforms. This is why technologies like Apache Spark are so important and are gaining traction across different communities. And even though some bindings are available for legacy languages, all the creative, new ways to analyze data are done in Scala.
Using a fully productive and reproducible environment combining the Spark Notebook and Docker, Andy Petrella and Dean Wampler explore what it means to do data science today and why Scala succeeds at coping with large and fast data where older languages fail. Andy and Dean then introduce and summarize the new methodologies and scientific advances in machine learning that use Scala as the main language, including Splash, the min-cut problem, OptiML, needle (DL), ADAM, and more, and demonstrate how these programs work for data scientists by enabling interactivity, live reactivity, charting capabilities, and robustness in Scala, things that were still missing from the legacy languages.
Andy Petrella is a mathematician turned distributed computing entrepreneur. Besides being a Scala/Spark trainer, Andy has participated in many projects built using Spark, Cassandra, and other distributed technologies in various fields, including geospatial analysis, the IoT, and automotive and smart cities projects. Andy is the creator of the Spark Notebook, the only reactive and fully Scala notebook for Apache Spark. In 2015, Andy cofounded Data Fellas with Xavier Tordoir around their product, the Agile Data Science Toolkit, which facilitates the productization of data science projects and guarantees their maintainability and sustainability over time. Andy is also a member of the program committee for the O’Reilly Strata, Scala eXchange, Data Science eXchange, and Devoxx events.
Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the Lightbend Fast Data Platform project, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent Strata speaker, he’s also the co-organizer of several conferences around the world and several user groups in Chicago.
©2016, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.