Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Schedule: Spark & beyond sessions

9:00 - 17:00 Monday, 22 May & Tuesday, 23 May
Location: Capital Suite 7
Secondary topics:  Text Analysis and Mining
Zoltan Toth (Databricks)
Average rating: ****.
(4.00, 2 ratings)
The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Zoltan Toth employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible.
9:00 - 17:00 Monday, 22 May & Tuesday, 23 May
Location: Capital Suite 16
Kai Voigt (Cloudera)
Learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities. Using in-class simulations and exercises, Kai Voigt walks you through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field.
9:00–17:00 Tuesday, 23 May 2017
Location: Capital Suite 11
Secondary topics:  Text Analysis and Mining
Stephane Rion (Big Data Partnership)
Average rating: ****.
(4.00, 2 ratings)
Stephane Rion introduces you to Apache Spark 2.0 core concepts with a focus on Spark's machine-learning library, using text mining on real-world data as the primary end-to-end use case.
9:00–12:30 Tuesday, 23 May 2017
Location: Capital Suite 4
Level: Intermediate
Dean Wampler (Lightbend)
Average rating: ****.
(4.50, 2 ratings)
Apache Spark is written in Scala. Hence, many if not most data engineers adopting Spark are also adopting Scala, while most data scientists continue to use Python and R. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs.
9:00–12:30 Tuesday, 23 May 2017
Location: Capital Suite 8
Level: Intermediate
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Silicon Valley Data Science)
Average rating: ***..
(3.64, 14 ratings)
What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
9:00–17:00 Tuesday, 23 May 2017
Location: London Suite 2/3
Angie Ma (ASI), Ben Lorica (O'Reilly Media), Ira Cohen (Anodot), Yingsong Zhang (ASI Data Science), Ali Hürriyetoglu (Statistics Netherlands), Nelleke Oostdijk (Radboud University), Robin Senge (inovex GmbH), Mathew Salvaris (Microsoft), Miguel Gonzalez-Fierro (Microsoft), Amitai Armon (Intel), Yahav Shadmi (Intel), Kay Brodersen (Google), Ding Ding (Intel), Alan Mosca (Sendence | Birkbeck, University of London), Eduard Vazquez (Cortexica Vision Systems), Aida Mehonic (ASI Data Science), David Barber (Department of Computer Science, UCL)
A full day of hardcore data science, exploring emerging topics and new areas of study made possible by vast troves of raw data and cutting-edge architectures for analyzing and exploring information. Along the way, leading data science practitioners teach new techniques and technologies to add to your data science toolbox.
13:30–17:00 Tuesday, 23 May 2017
Location: Capital Suite 2/3
Level: Intermediate
Jeffrey Shmain (Cloudera), Jayant Shekhar (Sparkflows Inc.), Vartika Singh (Cloudera)
Average rating: ***..
(3.50, 4 ratings)
Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through various approaches using the machine-learning algorithms available in the Spark framework (and more) to understand and decipher meaningful patterns in real-world data.
13:30–17:00 Tuesday, 23 May 2017
Location: Capital Suite 9
Level: Intermediate
Douglas Ashton (Mango Solutions), Aimee Gott (Mango Solutions), Mark Sellors (Mango Solutions)
Average rating: *****
(5.00, 1 rating)
R is a top contender for statistics and machine learning, but Spark has emerged as the leader for in-memory distributed data analysis. Douglas Ashton, Aimee Gott, and Mark Sellors introduce Spark, cover data manipulation with Spark as a backend to dplyr and machine learning via MLlib, and explore RStudio's sparklyr package, giving you the power of Spark without having to leave your R session.
11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 12
Average rating: **...
(2.00, 5 ratings)
Herman van Hövell tot Westerflier looks back at the history of data systems, from filesystems, databases, and big data systems (e.g., MapReduce) to "small data" systems (e.g., R and Python), covering the pros and cons of each, the abstractions they provide, and the engines underneath. Herman then shares lessons learned from this evolution, explains how Spark is developed, and offers a peek...
12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 12
Level: Intermediate
Average rating: **...
(2.00, 3 ratings)
Security has been a large and growing aspect of distributed systems, specifically in the big data ecosystem, but it's an underappreciated topic within the Spark framework itself. Neelesh Srinivas Salian explains how detailed knowledge of setting up security and an awareness of which problems to look out for can help an organization move forward in the right way.
14:05–14:45 Wednesday, 24 May 2017
Location: Hall S21/23 (B)
Level: Intermediate
Seth Hendrickson (Cloudera)
Average rating: ***..
(3.57, 7 ratings)
There are many resources available for learning how to use Spark to build collaborative filtering models. However, there are relatively few that explain how to build a large-scale, end-to-end recommender system. Seth Hendrickson demonstrates how to create such a system using Spark Streaming, Spark ML, and Elasticsearch.
14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 12
Level: Intermediate
Holden Karau (IBM), Seth Hendrickson (Cloudera)
Average rating: ***..
(3.25, 8 ratings)
Structured Streaming is new in Apache Spark 2.0, and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model.
14:55–15:35 Wednesday, 24 May 2017
Location: Capital Suite 12
Level: Intermediate
Average rating: ***..
(3.31, 13 ratings)
Spark is now the de facto engine for big data processing. Vincent Van Steenbergen walks you through two real-world applications that use Spark to build functional machine-learning pipelines (wine price prediction and malware analysis), discussing the architecture and implementation and sharing the good, the bad, and the ugly experiences he had along the way.
16:35–17:15 Wednesday, 24 May 2017
Location: Capital Suite 12
Secondary topics:  Ecommerce, Financial services
Level: Intermediate
Harry Powell (Barclays), Raffael Strassnig (Barclays)
Average rating: ****.
(4.00, 6 ratings)
Harry Powell and Raffael Strassnig demonstrate how to model unobserved customer preferences over businesses by thinking about transactional data as a bipartite graph and then computing a new similarity metric, the expected degrees of separation, to characterize the full graph.
17:25–18:05 Wednesday, 24 May 2017
Location: Capital Suite 12
Level: Intermediate
Natalino Busa (Teradata)
Natalino Busa shares an implementation for classifying pictures based on Spark and Slider that was developed during the 2016 Yelp Restaurant Photo Classification challenge. Spark processes data and trains the ML model, which consists of deep learning and ensemble classification methods, while picture scoring is exposed via an API that is persisted and scaled with Slider.
17:25–18:05 Wednesday, 24 May 2017
Location: Capital Suite 13
Secondary topics:  Deep learning
Level: Intermediate
Chris Fregly (PipelineAI)
Average rating: ***..
(3.00, 1 rating)
Chris Fregly explores an often-overlooked area of machine learning and artificial intelligence (the real-time, end-user-facing "serving" layer in hybrid-cloud and on-premises deployment environments) and shares a production-ready environment for serving your notebook-based Spark ML and TensorFlow AI models in a highly scalable, highly available way.
11:15–11:55 Thursday, 25 May 2017
Location: Capital Suite 12
Level: Intermediate
Average rating: *****
(5.00, 1 rating)
Herman van Hövell tot Westerflier offers a deep dive into Spark SQL's Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how new and upcoming features are implemented using Catalyst.
12:05–12:45 Thursday, 25 May 2017
Location: Hall S21/23 (B)
Secondary topics:  Cloud
Level: Intermediate
Leah McGuire (Salesforce)
Average rating: ****.
(4.00, 2 ratings)
What if you had to build more models than there are data scientists in the world, a feat that enterprise companies serving hundreds of thousands of businesses often have to accomplish? Leah McGuire offers an overview of Salesforce's general-purpose machine-learning platform that automatically builds per-company optimized models for any given predictive problem at scale, beating out most hand-tuned models.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 8/9
Level: Intermediate
Bas Geerdink (ING)
Average rating: ***..
(3.25, 4 ratings)
As a data-driven enterprise, ING is heavily investing in big data, analytics, and stream processing. Bas Geerdink shares three use cases at ING and discusses their respective architectures and technology. All software is currently in production, running with modern tools such as Kafka, Cassandra, Spark, Flink, and H2O.ai.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 12
Level: Intermediate
Holden Karau (IBM)
Average rating: ****.
(4.75, 4 ratings)
Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging than on traditional distributed systems. Holden Karau explores how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
14:05–14:45 Thursday, 25 May 2017
Location: Capital Suite 12
Level: Intermediate
Nicolas Poggi (Barcelona Supercomputing-Microsoft Research Center)
Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance from major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as a baseline.
14:55–15:35 Thursday, 25 May 2017
Location: Capital Suite 12
Level: Beginner
Matthias Niehoff (codecentric AG)
Average rating: ****.
(4.00, 4 ratings)
Matthias Niehoff shares lessons learned from working with Spark, Cassandra, and the Spark-Cassandra connector, along with best practices drawn from his work on multiple big and fast data projects and the challenges encountered along the way.