Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Schedule: Spark & beyond sessions

9:00am–5:00pm Monday, September 25 & Tuesday, September 26, 2017
Location: 1A 15/16/17
Secondary topics: Streaming
SOLD OUT
Joseph Kambourakis (Databricks)
Average rating: 5.00 (1 rating)
Joseph Kambourakis walks you through using Apache Spark to perform exploratory data analysis (EDA), developing machine learning pipelines, and using the APIs and algorithms available in the Spark MLlib DataFrames API.
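The tutorial's workflow can be sketched roughly as follows: EDA on a DataFrame, then a small MLlib pipeline built with the DataFrames API. This is an illustrative sketch only; the data path and column names (`data.csv`, `age`, `income`, `label`) are assumptions, not the tutorial's actual materials.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.classification.LogisticRegression

val spark = SparkSession.builder.appName("eda-pipeline").getOrCreate()

// Exploratory data analysis on a DataFrame
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data.csv")                      // assumed input file
df.printSchema()
df.describe("age", "income").show()     // assumed numeric columns

// A minimal ML pipeline using the DataFrames API
val indexer = new StringIndexer().setInputCol("label").setOutputCol("labelIdx")
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("labelIdx").setFeaturesCol("features")

val model = new Pipeline().setStages(Array(indexer, assembler, lr)).fit(df)
```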
9:00am–5:00pm Tuesday, September 26, 2017
Location: 1A 08/10
Secondary topics: Text
Brooke Wenig (Databricks)
Brooke Wenig introduces you to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case.
9:00am–12:30pm Tuesday, September 26, 2017
Location: 1A 12/14 Level: Intermediate
Secondary topics: Deep learning
Vartika Singh (Cloudera), Jeffrey Shmain (Cloudera)
Average rating: 2.50 (6 ratings)
Vartika Singh and Jeffrey Shmain walk you through various approaches using the machine learning algorithms available in Spark ML to understand and decipher meaningful patterns in real-world data. Vartika and Jeff also demonstrate how to leverage open source deep learning frameworks to run classification problems on image and text datasets with Spark.
9:00am–12:30pm Tuesday, September 26, 2017
Location: 1E 12/13 Level: Intermediate
Secondary topics: Architecture
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Silicon Valley Data Science)
Average rating: 3.27 (11 ratings)
What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
1:15pm–1:55pm Wednesday, September 27, 2017
Location: 1A 15/16/17 Level: Intermediate
Cheng Chang (Alluxio), Haoyuan Li (Alluxio)
Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that manages data across different storage systems. Many deployments use Alluxio with Spark because Alluxio helps Spark further accelerate applications. Haoyuan Li and Cheng Chang explain how Alluxio makes Spark more effective and share production deployments of Alluxio and Spark working together.
2:05pm–2:45pm Wednesday, September 27, 2017
Location: 1A 21/22 Level: Intermediate
Lucy Yu (MemSQL)
Average rating: 2.50 (6 ratings)
Lucy Yu demonstrates how to extend the Spark SQL abstraction to support more complex pushdown, such as group by, subqueries, and joins.
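For context, Spark's standard (v1) data source API only exposes column pruning and filter pushdown through the `PrunedFilteredScan` trait; pushing down aggregations, subqueries, or joins, as this session describes, means going beyond this hook. A minimal sketch of the standard interface (the relation class and its behavior here are illustrative, not the speaker's implementation):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Spark hands the relation the required columns and the filters it can
// push; richer pushdown (group by, joins) needs a custom extension.
class MyRelation(override val sqlContext: SQLContext,
                 override val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Translate `filters` into the remote system's query language here,
    // so rows are filtered at the source instead of inside Spark.
    ???
  }
}
```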
2:55pm–3:35pm Wednesday, September 27, 2017
Location: 1A 21/22 Level: Intermediate
Holden Karau (Google), Seth Hendrickson (Cloudera)
Average rating: 5.00 (1 rating)
Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. Holden Karau and Seth Hendrickson introduce Spark’s ML pipelines and explain how to extend them with your own custom algorithms. Even if you don't have your own algorithm to add, you'll leave with a deeper understanding of Spark's ML pipelines.
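The extension point the session describes is Spark's `Transformer`/`Estimator` abstraction: any class implementing those contracts can be dropped into a `Pipeline`. A minimal illustrative sketch (the stage and its column name are assumptions, not the speakers' example):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.lower
import org.apache.spark.sql.types.StructType

// Minimal custom pipeline stage: a Transformer that lower-cases a text
// column so it can be used in any Pipeline alongside built-in stages.
class LowerCaser(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("lowerCaser"))

  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("text", lower(ds.col("text")))   // assumed column name

  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): LowerCaser = defaultCopy(extra)
}
```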
2:55pm–3:35pm Wednesday, September 27, 2017
Location: 1E 09 Level: Beginner
Marc Carlson (Seattle Children's Research Institute), Sean Taylor (Seattle Children's Research Institute)
Average rating: 5.00 (1 rating)
Marc Carlson and Sean Taylor offer an overview of Project Rainier, which leverages the power of HDFS and the Hadoop and Spark ecosystem to help scientists at Seattle Children’s Research Institute quickly find new patterns and generate predictions that they can test later, accelerating important pediatric research and increasing scientific collaboration by highlighting where it is needed most.
5:25pm–6:05pm Wednesday, September 27, 2017
Location: 1A 08/10 Level: Advanced
Secondary topics: Media
Seth Hendrickson (Cloudera), DB Tsai (Netflix)
Average rating: 5.00 (1 rating)
Recent developments in Spark MLlib have given users the power to express a wider class of ML models and decrease model training times via the use of custom parameter optimization algorithms. Seth Hendrickson and DB Tsai explain when and how to use this new API and walk you through creating your own Spark ML optimizer. Along the way, they also share performance benefits and real-world use cases.
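The public face of this flexibility in Spark 2.x is the solver choice exposed on some estimators; the session goes further, plugging a custom optimizer in behind such an API. A hedged sketch using the existing `setSolver` parameter (the specific settings below are illustrative):

```scala
import org.apache.spark.ml.regression.LinearRegression

// Selecting the parameter optimization routine on a built-in estimator:
// "l-bfgs" uses iterative optimization, "normal" the normal-equation solver.
val lr = new LinearRegression()
  .setSolver("l-bfgs")
  .setMaxIter(50)
  .setRegParam(0.1)
```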
5:25pm–6:05pm Wednesday, September 27, 2017
Location: 1A 21/22 Level: Advanced
Adrian Popescu (Unravel Data Systems), Shivnath Babu (Unravel Data Systems)
A roadblock to the agility Spark promises is that application developers can get stuck on application failures and have a tough time finding and resolving the root cause. Adrian Popescu and Shivnath Babu explain how to use a root cause diagnosis algorithm and methodology to solve failure problems with ML and AI apps in Spark.
2:05pm–2:45pm Thursday, September 28, 2017
Location: 1A 08/10 Level: Intermediate
Viral Shah (Julia Computing), Stefan Karpinski (The Julia Language)
Spark is a fast and general engine for large-scale data. Julia is a fast and general engine for large-scale compute. Viral Shah and Stefan Karpinski explain how combining Julia's compute and Spark's data processing capabilities makes amazing things possible.
2:55pm–3:35pm Thursday, September 28, 2017
Location: 1A 21/22 Level: Advanced
Kimoon Kim (Pepperdata)
There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark.
4:35pm–5:15pm Thursday, September 28, 2017
Location: 1A 21/22 Level: Intermediate
Average rating: 5.00 (1 rating)
Common ETL jobs used for importing log data into Hadoop clusters require a considerable amount of resources, which varies based on the input size. Thiruvalluvan M G shares a set of techniques—involving an innovative use of Spark processing and exploiting features of Hadoop file formats—that not only make these jobs much more efficient but also work well with fixed amounts of resources.