Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Scala: The unpredicted lingua franca for data science

Andy Petrella (Kensu), Dean Wampler (Lightbend)
12:05–12:45 Friday, 3/06/2016
Data science & advanced analytics
Location: Capital Suite 8/9 Level: Intermediate
Average rating: 3.12 out of 5 (8 ratings)

Prerequisite knowledge

Attendees should have basic but pragmatic knowledge of data science projects, an understanding of the legacy tooling for data science, and an interest in distributed technologies like Spark or Kafka as well as advances in science.

Description

Until fairly recently, data scientists’ languages of choice for manipulating and making sense of data were Python, R, or MATLAB, which led to a split in the data science community and duplication of effort across languages offering similar sets of functionality. Then distributed technologies came out of the blue, most using a convenient and easy-to-deploy platform, the JVM.

Data scientists are now part of heterogeneous teams that face many problems and must work toward global solutions together, which brings a new responsibility to be productive and agile so that their work can be integrated into platforms. This is why technologies like Apache Spark are so important and are gaining traction across different communities. And even though some bindings are available for legacy languages, all the creative, new ways to analyze data are done in Scala.

Using a fully productive and reproducible environment combining the Spark Notebook and Docker, Andy Petrella and Dean Wampler explore what it means to do data science today and why Scala succeeds at coping with large and fast data where older languages fail. Andy and Dean then introduce and summarize all the new methodologies and scientific advances in machine learning that use Scala as the main language, including Splash, the min-cut problem, OptiML, needle (DL), ADAM, and more, and demonstrate how these programs work for data scientists by enabling interactivity, live reactivity, charting capabilities, and robustness in Scala, things that were still missing from the legacy languages.

Andy Petrella

Kensu

Andy Petrella is a mathematician turned distributed computing entrepreneur. Besides being a Scala/Spark trainer, Andy has participated in many projects built using Spark, Cassandra, and other distributed technologies in various fields, including geospatial analysis, the IoT, and automotive and smart cities projects. Andy is the creator of the Spark Notebook, the only reactive and fully Scala notebook for Apache Spark. In 2015, Andy cofounded Data Fellas with Xavier Tordoir around their product, the Agile Data Science Toolkit, which facilitates the productization of data science projects and guarantees their maintainability and sustainability over time. Andy is also a member of the program committee for the O’Reilly Strata, Scala eXchange, Data Science eXchange, and Devoxx events.

Dean Wampler

Lightbend

Dean Wampler is an expert in streaming data systems, focusing on applications of ML/AI. Formerly, he was the vice president of fast data engineering at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent Strata speaker, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He has a Ph.D. in Physics from the University of Washington.