Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Playing well together: Big data beyond the JVM with Spark and friends

Holden Karau (Google), Rachel Warren (Salesforce Einstein)
1:50pm–2:30pm Thursday, March 8, 2018
Average rating: 3.40 (5 ratings)

Who is this presentation for?

  • Data engineers

Prerequisite knowledge

  • A basic understanding of Spark, Kafka, or a similar system
  • A working knowledge of Python

What you'll learn

  • Understand how to work with Spark, Kafka, and other JVM-based big data systems in Python
  • Learn how developers can better support Python users


Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).

Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams. They also outline the challenges of pure Python solutions like Dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models to Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing and what the options are for (almost) completely ignoring the JVM in the big data space.

Python users will learn how to use systems like Spark more effectively and understand how the design is changing. JVM developers will gain an understanding of how to use Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.

Holden Karau


Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Rachel Warren

Salesforce Einstein

Rachel Warren is a software engineer and data scientist for Salesforce Einstein, where she is working on scaling and productionizing auto ML on Spark. Previously, Rachel was a machine learning engineer for Alpine Data, where she helped build a Spark auto-tuner to automatically configure Spark applications in new environments. A Spark enthusiast, she is the coauthor of High Performance Spark. Rachel is a climber, frisbee player, cyclist, and adventurer. Last year, she and her partner completed a thousand-mile off-road unassisted bicycle tour of Patagonia.


Holden Karau
03/09/2018 6:12am PST

I’ve uploaded the slides from this: