Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Spark & Beyond conference sessions

Tuesday, March 29

9:00am–12:30pm Tuesday, 03/29/2016
Location: LL21 C/D
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers), Gary Dusbabek (Silicon Valley Data Science)
Average rating: 3.96 (49 ratings)
What are the essential components of a data platform? John Akred, Stephen O'Sullivan, and Gary Dusbabek explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
9:00am–5:00pm Tuesday, 03/29/2016
SOLD OUT
Location: LL21 E/F
Sameer Farooqui (Databricks)
Average rating: 3.90 (41 ratings)
The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Through hands-on examples, Sameer Farooqui explores various Wikipedia datasets to illustrate a variety of ideal programming paradigms.
9:00am–12:30pm Tuesday, 03/29/2016
Location: 210 A/E
Jayant Shekhar (Sparkflows Inc.), Amandeep Khurana (Cloudera), Krishna Sankar (U.S.Bank), Vartika Singh (Cloudera)
Average rating: 2.80 (45 ratings)
Jayant Shekhar, Amandeep Khurana, Krishna Sankar, and Vartika Singh guide participants through techniques for building machine-learning apps using Spark MLlib and Spark ML and demonstrate the principles of graph processing with Spark GraphX.

Wednesday, March 30

11:00am–11:40am Wednesday, 03/30/2016
Location: LL20 A
Robert Nishihara (University of California, Berkeley)
Average rating: 4.59 (17 ratings)
Robert Nishihara offers an overview of SparkNet, a framework for training deep networks in Spark using existing deep learning libraries (such as Caffe) for the backend. SparkNet gets an order of magnitude speedup from distributed training relative to Caffe on a single GPU, even in the regime in which communication is extremely expensive.
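The core trick for tolerating expensive communication, broadly the scheme described in the SparkNet paper, is to let each worker run SGD locally for many iterations and only occasionally average parameters across workers. The following is a toy, pure-Python sketch of that idea under stated assumptions (a one-parameter linear model, synthetic data, and hypothetical helper names; this is not SparkNet's actual API):

```python
import random

def local_sgd(w, data, lr=0.01, iters=50):
    """Run SGD locally on one worker's partition of (x, y) pairs,
    fitting y = w * x by minimizing squared error."""
    for _ in range(iters):
        x, y = random.choice(data)
        grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
        w -= lr * grad
    return w

def parameter_average_round(w, partitions, iters=50):
    """One communication round: every worker starts from the same
    broadcast weight, trains locally, and the results are averaged."""
    local_weights = [local_sgd(w, part, iters=iters) for part in partitions]
    return sum(local_weights) / len(local_weights)

random.seed(0)
# Synthetic noiseless data for y = 3x, split across 4 simulated workers.
data = [(0.1 * i, 0.3 * i) for i in range(1, 41)]
partitions = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(5):   # five infrequent communication rounds
    w = parameter_average_round(w, partitions)
# w converges close to the true slope of 3.0
```

Because each worker trains for 50 steps between synchronizations, communication happens 50x less often than in step-by-step synchronous SGD, which is the regime the session abstract refers to.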
11:00am–11:40am Wednesday, 03/30/2016
Location: 210 A/E
Reynold Xin (Databricks)
Average rating: 4.36 (28 ratings)
Reynold Xin reviews Spark’s adoption and development in 2015. Reynold then looks to the future to outline three major technology trends—the integration of streaming systems and enterprise data infrastructure, cloud computing and elasticity, and the rise of new hardware—discuss the major efforts to address these trends, and explore their implications for Spark users.
11:50am–12:30pm Wednesday, 03/30/2016
Location: 230 A
Tags: real-time
Bin Fan (Alluxio), Haojun Wang (Baidu)
Average rating: 4.25 (8 ratings)
Baidu runs Alluxio in production with hundreds of nodes managing petabytes of data. Bin Fan and Haojun Wang demonstrate how Alluxio improves big data analytics (ad hoc query)—Baidu experienced a 30x performance improvement—and explain how Baidu leverages Alluxio in its machine-learning architecture and how it uses Alluxio to manage heterogeneous storage resources.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: 210 A/E
Tags: featured
Dean Wampler (Lightbend)
Average rating: 4.57 (23 ratings)
The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.
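The memory inefficiency Tungsten targets is the overhead of boxed objects: each value carries its own object header and is reached through a pointer, versus a flat binary buffer of raw values. A rough, pure-Python analogy (Python's boxed ints stand in for JVM objects here; the numbers are illustrative, not JVM measurements):

```python
import sys
from array import array

n = 100_000

# Boxed layout: a list holds pointers to full int objects, each with
# its own object header (loosely analogous to java.lang.Integer on the JVM).
boxed = list(range(n))
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(i) for i in boxed)

# Packed layout: one contiguous buffer of 8-byte machine integers,
# the kind of flat, cache-friendly layout Tungsten's binary format targets.
packed = array("q", range(n))
packed_bytes = sys.getsizeof(packed)

ratio = boxed_bytes / packed_bytes   # typically several times larger
```

Beyond raw size, the packed layout also gives the garbage collector one buffer to track instead of a hundred thousand objects, which is the GC pressure the abstract mentions.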
4:20pm–5:00pm Wednesday, 03/30/2016
Location: 210 A/E
Tags: real-time
Alex Silva (Pluralsight)
Average rating: 3.94 (16 ratings)
Alex Silva outlines the implementation of a real-time analytics platform using microservices and a Scala stack that includes Kafka, Spark Streaming, Spray, and Akka. This infrastructure can process vast amounts of streaming data, ranging from video events to clickstreams and logs. The result is a powerful real-time data pipeline capable of flexible data ingestion and fast analysis.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: 210 A/E
Holden Karau (Independent)
Average rating: 4.11 (19 ratings)
Apache Spark is a fast, general engine for big data processing. As Spark jobs are used for more mission-critical tasks, it is important to have effective tools for testing and validation. Holden Karau details reasonable validation rules for production jobs and best practices for creating effective tests, as well as options for generating test data.
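One common validation rule for production jobs is a sanity check on record counts: the output of a job should be a plausible fraction of its input, so a job that silently drops nearly everything fails loudly instead of shipping bad data. A minimal sketch of such a rule in plain Python (the function name and thresholds are hypothetical, not from the talk or any Spark library):

```python
def validate_counts(input_count, output_count,
                    min_ratio=0.5, max_ratio=1.0):
    """Check that a job's output record count is a sane fraction of its
    input record count. Returns (ok, message) so callers can log or alert."""
    if input_count == 0:
        return False, "no input records read"
    ratio = output_count / input_count
    if ratio < min_ratio:
        return False, f"output/input ratio {ratio:.2f} below {min_ratio}"
    if ratio > max_ratio:
        return False, f"output/input ratio {ratio:.2f} above {max_ratio}"
    return True, "ok"

# A filtering job that drops 20% of records passes the check...
ok, msg = validate_counts(1000, 800)
# ...while one that silently discards almost everything fails it.
bad, bad_msg = validate_counts(1000, 3)
```

In a real Spark job the counts would come from accumulators or job metrics rather than being passed in directly.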

Thursday, March 31

11:00am–11:40am Thursday, 03/31/2016
Location: 210 A/E
Michael Armbrust (Databricks)
Average rating: 5.00 (11 ratings)
Michael Armbrust explores real-time analytics with Spark, from interactive queries to streaming.
11:50am–12:30pm Thursday, 03/31/2016
Location: 210 A/E
Tags: real-time
Tathagata Das (Databricks)
Average rating: 4.54 (13 ratings)
Tathagata Das introduces Streaming DataFrames, the next evolution of Spark Streaming. Streaming DataFrames unifies an additional dimension: interactive analysis. It also provides enhanced support for out-of-order (delayed) data, zero-latency decision making, and integration with existing enterprise data warehouses.
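Out-of-order support generally means a watermark: events may arrive late and still be counted in their event-time window, as long as they are not too far behind the newest event seen. A toy, pure-Python sketch of that idea under stated assumptions (integer-second timestamps; the class and parameter names are hypothetical, not the Streaming DataFrames API):

```python
from collections import defaultdict

class WindowedCounter:
    """Toy event-time aggregator illustrating watermark-style handling of
    out-of-order data: a late event is still counted if it is no more than
    `allowed_lateness` seconds behind the newest event time seen so far."""

    def __init__(self, window=10, allowed_lateness=5):
        self.window = window
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.counts = defaultdict(int)   # window start time -> event count

    def ingest(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness
        if event_time < watermark:
            return False                 # too late: event is dropped
        start = (event_time // self.window) * self.window
        self.counts[start] += 1
        return True

agg = WindowedCounter()
accepted = [agg.ingest(t) for t in [12, 17, 14, 25, 8]]
# 14 arrives out of order but within the lateness bound, so it still lands
# in the [10, 20) window; 8 is behind the watermark (20) and is dropped.
```

The trade-off the watermark encodes is that state for old windows can eventually be finalized and discarded, which is what makes unbounded streams tractable.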
1:50pm–2:30pm Thursday, 03/31/2016
Location: 210 A/E
Neelesh Salian (Stitch Fix)
Average rating: 2.93 (14 ratings)
Spark deployments have been growing over the past year. Neelesh Srinivas Salian explores common issues observed in Apache Spark cluster environments and offers guidelines for setting up a real-world environment when planning a Spark deployment in a cluster. Attendees can use these observations to improve the usability and supportability of Apache Spark in their projects.
4:20pm–5:00pm Thursday, 03/31/2016
Location: 210 A/E
Tags: health
Timothy Danford (Tamr, Inc.)
Average rating: 4.57 (7 ratings)
To keep up with the DNA-sequencing-technology revolution, bioinformaticians need more scalable tools for genomics analysis. Timothy Danford outlines one possible solution in a case study of a cancer genomics analysis pipeline implemented as part of ADAM, an open source genomics software project whose Apache Spark-based abstractions execute on commodity computing infrastructure.