Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to Python code from data scientist and Python developers while avoiding the traditional trap of needing to rewrite everything.
Holden Karau is a transgender Canadian software engineer working in the bay area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.
Rachel Warren is a software engineer and data scientist for Salesforce Einstein, where she is working on scaling and productionizing auto ML on Spark. Previously, Rachel was a machine learning engineer for Alpine Data, where she helped build a Spark auto-tuner to automatically configure Spark applications in new environments. A Spark enthusiast, she is the coauthor of High Performance Spark. Rachel is a climber, frisbee player, cyclist, and adventurer. Last year, she and her partner completed a thousand-mile off-road unassisted bicycle tour of Patagonia.
Comments on this page are now closed.
For exhibition and sponsorship opportunities, email firstname.lastname@example.org
For information on trade opportunities with O'Reilly conferences, email email@example.com
View a complete list of Strata Data Conference contacts
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
I’ve uploaded the slides from this: https://www.slideshare.net/hkarau/big-data-beyond-the-jvm-strata-san-jose-2018