Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Schedule: Spark & beyond sessions

A deep dive into an extremely popular big data framework: we’ll cover best practices, architectural considerations, and real-world case studies drawn from startups to large enterprises.

9:00am - 5:00pm Monday, March 13 & Tuesday, March 14
Location: 212 B
Bruce Martin (Cloudera)
Average rating: 4.00 (1 rating)
Bruce Martin walks you through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field. Join in to learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.
9:00am - 5:00pm Monday, March 13 & Tuesday, March 14
Location: 212 C
Secondary topics:  Streaming
Jacob D Parr (JParr Productions)
The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Jacob Parr employs hands-on exercises using various Wikipedia datasets to illustrate the range of programming paradigms Spark makes possible.
9:00am - 5:00pm Tuesday, March 14, 2017
Location: San Jose Ballroom, Marriott
Secondary topics:  Streaming, Text
Andy Konwinski (Databricks)
Average rating: 4.43 (7 ratings)
Andy Konwinski introduces you to Apache Spark 2.0 core concepts with a focus on Spark's machine-learning library, using text mining on real-world data as the primary end-to-end use case.
9:00am - 12:30pm Tuesday, March 14, 2017
Location: LL21 B Level: Intermediate
Dean Wampler (Lightbend)
Average rating: 5.00 (4 ratings)
Apache Spark is written in Scala, so many if not most data engineers adopting Spark are also adopting Scala, while most data scientists continue to use Python and R. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs.
9:00am - 12:30pm Tuesday, March 14, 2017
Location: 210 B/F Level: Intermediate
Vartika Singh (Cloudera), Jayant Shekhar (Sparkflows Inc.), Jeffrey Shmain (Cloudera)
Average rating: 3.83 (6 ratings)
Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various machine-learning algorithms available in the Spark framework (and beyond), showing how to understand and decipher meaningful patterns in real-world data in order to derive value.
9:00am - 12:30pm Tuesday, March 14, 2017
Location: 210 D/H Level: Intermediate
Secondary topics:  Architecture
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Silicon Valley Data Science)
Average rating: 4.60 (10 ratings)
What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
1:30pm - 5:00pm Tuesday, March 14, 2017
Location: 210 D/H
Secondary topics:  Architecture, Cloud
James Malone (Google), John Mikula (Google Cloud)
Average rating: 2.00 (6 ratings)
James Malone explores using managed Spark and Hadoop solutions in public clouds alongside cloud products for storage, analysis, and message queues to meet enterprise requirements via the Spark and Hadoop ecosystem.
11:00am - 11:40am Wednesday, March 15, 2017
Location: LL21 C/D
Reynold Xin (Databricks)
Average rating: 4.11 (9 ratings)
Reynold Xin looks back at the history of data systems, from filesystems, databases, and big data systems (e.g., MapReduce) to "small data" systems (e.g., R and Python), covering the pros and cons of each, the abstractions they provide, and the engines underneath. Reynold then shares lessons learned from this evolution, explains how Spark is developed, and offers a peek into the future of Spark.
11:50am - 12:30pm Wednesday, March 15, 2017
Location: LL21 C/D
Secondary topics:  Streaming
Michael Armbrust (Databricks), Tathagata Das (Databricks)
Average rating: 4.29 (7 ratings)
Apache Spark 2.0 introduced the core APIs for Structured Streaming, a new stream processing engine built on Spark SQL. Since then, the Spark team has focused its efforts on making the engine ready for production use. Michael Armbrust and Tathagata Das outline the major features of Structured Streaming, recipes for using them in production, and plans for new features in future releases.
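The incremental model behind Structured Streaming can be pictured without Spark itself: each micro-batch folds into running state, so the result always matches what a batch query over all data seen so far would return. A minimal pure-Python sketch of that idea (an illustration only, not the Structured Streaming API; all names are hypothetical):

```python
from collections import defaultdict

def update_counts(running, micro_batch):
    """Fold one micro-batch of words into a running word count,
    mirroring how an incremental engine keeps state between batches."""
    for word in micro_batch:
        running[word] += 1
    return running

state = defaultdict(int)
for batch in [["spark", "sql"], ["spark", "streaming"], ["sql"]]:
    state = update_counts(state, batch)

# After all batches, the state equals a batch query over the full input.
print(dict(state))  # {'spark': 2, 'sql': 2, 'streaming': 1}
```

The design point the sketch captures is that only the small per-key state is carried between batches, never the raw history.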
11:50am - 12:30pm Wednesday, March 15, 2017
Location: 230 A Level: Intermediate
Secondary topics:  Data Platform, Financial services, Geospatial
Jasjeet Thind (Zillow)
Average rating: 4.50 (2 ratings)
Zillow pioneered giving consumers unprecedented access to information about the housing market. Long gone are the days when you needed an agent to get comparables and prior sale and listing data. And with more data, data science has enabled more use cases. Jasjeet Thind explains how Zillow uses Spark and machine learning to transform real estate.
1:50pm - 2:30pm Wednesday, March 15, 2017
Location: LL21 C/D Level: Intermediate
Secondary topics:  Streaming
Holden Karau (IBM), Seth Hendrickson (Cloudera)
Average rating: 4.00 (8 ratings)
Structured Streaming is new in Apache Spark 2.0, and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model.
2:40pm - 3:20pm Wednesday, March 15, 2017
Location: LL21 C/D Level: Beginner
Secondary topics:  R
Edgar Ruiz (RStudio)
Average rating: 4.80 (5 ratings)
Sparklyr makes it easy and practical to analyze big data with R—you can filter and aggregate Spark DataFrames to bring data into R for analysis and visualization and use R to orchestrate distributed machine learning in Spark using Spark ML and H2O Sparkling Water. Edgar Ruiz walks you through these features and demonstrates how to use sparklyr to create R functions that access the full Spark API.
4:20pm - 5:00pm Wednesday, March 15, 2017
Location: LL21 C/D Level: Beginner
Secondary topics:  Architecture, Data Platform, Media
Average rating: 3.00 (3 ratings)
Spark powers various services in Bing, but the Bing team had to customize and extend Spark to cover its use cases and scale the implementation of Spark-based data pipelines to handle internet-scale data volumes. Kaarthik Sivashanmugam explores these use cases, covering the architecture of Spark-based data platforms, the challenges faced, and the customizations made to Spark to address them.
5:10pm - 5:50pm Wednesday, March 15, 2017
Location: LL21 C/D
Secondary topics:  Cloud
Anand Iyer (Cloudera), Eugene Fratkin (Cloudera)
Average rating: 5.00 (1 rating)
Both Spark workloads and use of the public cloud have been rapidly gaining adoption in mainstream enterprises. Anand Iyer and Eugene Fratkin discuss new developments in Spark and provide an in-depth discussion on the intersection between the latest Spark and cloud technologies.
11:00am - 11:40am Thursday, March 16, 2017
Location: LL21 C/D
Yin Huai (Databricks)
Average rating: 3.86 (7 ratings)
Just like any six-year-old, Apache Spark does not always do its job and can be hard to understand. Yin Huai looks at the top causes of job failures customers encountered in production and examines ways to mitigate such problems by modifying Spark. He also shares a methodology for improving resilience: a combination of monitoring and debugging techniques for users.
11:50am - 12:30pm Thursday, March 16, 2017
Location: 210 A/E
Secondary topics:  Deep learning, Hardcore Data Science
Joseph Bradley (Databricks), Tim Hunter (Databricks, Inc.)
Average rating: 3.75 (4 ratings)
Joseph Bradley and Tim Hunter share best practices for building deep learning pipelines with Apache Spark, covering cluster setup, data ingest, tuning clusters, and monitoring jobs—all demonstrated using Google’s TensorFlow library.
1:50pm - 2:30pm Thursday, March 16, 2017
Location: LL20 A
Secondary topics:  Platform, Streaming
Uber relies on making data-driven decisions at every level, and most of these decisions can benefit from faster data processing. Vinoth Chandar and Prasanna Rajaperumal introduce Hoodie, a newly open-sourced system at Uber that adds new incremental processing primitives to existing Hadoop technologies to provide near-real-time data at 10x reduced cost.
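The incremental-processing idea can be sketched in a few lines: upserts stamped with a commit time, plus a pull that returns only records changed since a given commit, so downstream jobs avoid rescanning the whole table. This is a hypothetical toy model of the concept, not Hoodie's actual API:

```python
table = {}       # record key -> (commit_ts, value)
commit_ts = 0

def upsert(records):
    """Apply a batch of (key, value) changes under a new commit timestamp."""
    global commit_ts
    commit_ts += 1
    for key, value in records:
        table[key] = (commit_ts, value)
    return commit_ts

def incremental_pull(since_ts):
    """Return only records committed after `since_ts` -- the primitive that
    gives consumers near-real-time data without full rescans."""
    return {k: v for k, (ts, v) in table.items() if ts > since_ts}

upsert([("trip1", "NYC"), ("trip2", "SF")])      # commit 1
upsert([("trip2", "SF-updated")])                # commit 2
upsert([("trip3", "LA")])                        # commit 3

print(incremental_pull(1))  # {'trip2': 'SF-updated', 'trip3': 'LA'}
```

A consumer that checkpointed after commit 1 sees only the updated and new records, which is what makes the 10x cost claim plausible for mostly-append workloads.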
1:50pm - 2:30pm Thursday, March 16, 2017
Location: LL21 C/D Level: Intermediate
Holden Karau (IBM), Joey Echeverria (Rocana)
Average rating: 3.67 (3 ratings)
Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging than on traditional distributed systems. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
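The lazy-evaluation pitfall this session covers has a plain-Python analogue: with generators, as with Spark transformations, a bad record only surfaces when something finally forces evaluation, so the error appears at the "action", far from the code that introduced it. A sketch (an analogy, not Spark code):

```python
def parse_ints(lines):
    # A "transformation": building the generator runs nothing,
    # so the bad record is latent here.
    return (int(line) for line in lines)

records = parse_ints(["1", "2", "oops"])  # no error raised yet

try:
    total = sum(records)  # the "action" forces evaluation and fails
except ValueError as exc:
    failure = str(exc)

print(failure)  # mentions 'oops', but the traceback points at sum(), not parse_ints()
```

This is why Spark stack traces often point at the action (`count`, `collect`) rather than the transformation that contains the bug.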
2:40pm - 3:20pm Thursday, March 16, 2017
Location: LL21 C/D Level: Advanced
Secondary topics:  Hardcore Data Science
Alexander Ulanov (Hewlett Packard Labs), Manish Marwah (Hewlett Packard Labs)
Alexander Ulanov and Manish Marwah explain how they implemented a scalable version of loopy belief propagation (BP) for Apache Spark, applying BP to large web-crawl data to infer the probability that websites are malicious. Applications of BP include fraud detection, malware detection, computer vision, and customer retention.
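As a toy illustration of the primitive being scaled here: a single sum-product message pass between two binary variables. The distributed version iterates such passes over a huge graph until beliefs converge; this sketch is hypothetical (made-up potentials, not the authors' code):

```python
# Binary variables A and B with a pairwise potential psi(a, b) and a
# unary potential on A (the local evidence).
psi = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}
unary_a = {0: 0.7, 1: 0.3}

# Sum-product message from A to B: m(b) = sum_a unary_a(a) * psi(a, b)
msg = {b: sum(unary_a[a] * psi[(a, b)] for a in (0, 1)) for b in (0, 1)}

# B's belief is the (normalized) incoming message.
z = sum(msg.values())
belief_b = {b: msg[b] / z for b in (0, 1)}
print(belief_b)
```

In a graph with cycles ("loopy" BP) every node repeats this exchange with all its neighbors each iteration, which maps naturally onto bulk message-passing frameworks like Spark's GraphX.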
4:20pm - 5:00pm Thursday, March 16, 2017
Location: LL21 C/D Level: Beginner
Secondary topics:  Financial services
Bryan Cheng (BlockCypher), Karen Hsu (BlockCypher)
Average rating: 5.00 (2 ratings)
Bryan Cheng and Karen Hsu describe how they built machine-learning and graph traversal systems on Apache Spark to help government organizations and private businesses stay informed in the brave new world of blockchain technology. Bryan and Karen also share lessons learned combining these two bleeding-edge technologies and explain how these techniques can be applied to private and federated chains.
4:20pm - 5:00pm Thursday, March 16, 2017
Location: LL21 E/F
Jiri Simsa (Alluxio), Calvin Jia (Alluxio)
Average rating: 4.67 (3 ratings)
Alluxio bridges Spark applications with various storage systems and further accelerates data-intensive applications. Gene Pang and Jiri Simsa introduce Alluxio, explain how Alluxio can help Spark be more effective, show benchmark results with Spark RDDs and DataFrames, and describe production deployments with both Alluxio and Spark working together.