Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Scala: The unpredicted lingua franca for data science

Andy Petrella (Kensu), Dean Wampler (Anyscale)
12:05–12:45 Friday, 3/06/2016
Data science & advanced analytics
Location: Capital Suite 8/9
Level: Intermediate
Average rating: 3.12 (8 ratings)

Prerequisite knowledge

Attendees should have basic but pragmatic knowledge of data science projects, an understanding of the legacy tooling for data science, and an interest in distributed technologies like Spark or Kafka as well as advances in science.

Description

Until fairly recently, data scientists’ languages of choice for manipulating and making sense of data were Python, R, or MATLAB, which led to a split in the data science community and duplicated effort across languages offering similar sets of functionality. Then distributed technologies came out of the blue, most of them using a convenient and easy-to-deploy platform, the JVM.

Data scientists are now part of heterogeneous teams that face many problems and must work toward global solutions together, which brings a new responsibility to be productive and agile so that their work can be integrated into platforms. This is why technologies like Apache Spark are so important and are gaining traction across different communities. And even though some bindings are available for legacy languages, all the creative, new ways to analyze data are done in Scala.

Using a fully productive and reproducible environment combining the Spark Notebook and Docker, Andy Petrella and Dean Wampler explore what it means to do data science today and why Scala succeeds at coping with large and fast data where older languages fail. Andy and Dean then introduce and summarize the new methodologies and scientific advances in machine learning that use Scala as the main language, including Splash, the min-cut problem, OptiML, needle (DL), ADAM, and more. They also demonstrate how these programs work for data scientists by enabling interactivity, live reactivity, charting capabilities, and robustness in Scala, things that were still missing from the legacy languages.

Andy Petrella

Kensu

Andy Petrella is the CEO of Kensu, an analytics and AI governance company that created the Kensu Data Activity Manager (DAM), a first-of-its-kind governance, compliance, and performance (GCP) solution. He’s a mathematician turned distributed computing entrepreneur. Besides being a Scala and Spark trainer, Andy has participated in many projects built using Spark, Cassandra, and other distributed technologies in various fields, including geospatial analysis, the IoT, and automotive and smart cities projects. Andy is the creator of the Spark Notebook, the only reactive and fully Scala notebook for Apache Spark. In 2015, Andy cofounded Data Fellas with Xavier Tordoir around their product, the Agile Data Science Toolkit, which facilitates the productization of data science projects and guarantees their maintainability and sustainability over time. Andy is also a member of the program committee for the O’Reilly Strata Data Conference, Scala eXchange, Data Science eXchange, and Devoxx events.

Dean Wampler

Anyscale

Dean Wampler is an expert in streaming data systems, focusing on applications of machine learning and artificial intelligence (ML/AI). He is Head of Developer Relations at Anyscale, which is developing Ray for distributed Python, primarily for ML/AI. Previously, he was an engineering VP at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and he is the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent conference speaker and tutorial teacher, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He has a Ph.D. in Physics from the University of Washington.