Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Schedule: Data engineering and architecture sessions

9:00–12:30 Tuesday, 23 May 2017
Location: Capital Suite 8
Level: Intermediate
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Silicon Valley Data Science)
Average rating: ***..
(3.64, 14 ratings)
What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
9:00–12:30 Tuesday, 23 May 2017
Location: Capital Suite 9
Level: Intermediate
David Tishgart (Cloudera), Philip Langdale (Cloudera), Eugene Fratkin (Cloudera), Jennifer Wu (Cloudera)
Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Eugene Fratkin, Philip Langdale, David Tishgart, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.
13:30–17:00 Tuesday, 23 May 2017
Location: Capital Suite 8
Level: Advanced
Jonathan Seidman (Cloudera), Mark Grover (Lyft), Ted Malaska (Blizzard Entertainment)
Average rating: *****
(5.00, 6 ratings)
Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, and Mark Grover explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.
13:30–17:00 Tuesday, 23 May 2017
Location: Capital Suite 10
Level: Intermediate
John Mikula (Google Cloud)
Average rating: *....
(1.33, 3 ratings)
John Mikula explores using managed Spark and Hadoop solutions in public clouds alongside cloud products for storage, analysis, and message queues to meet enterprise requirements via the Spark and Hadoop ecosystem.
11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 10/11
Level: Intermediate
Average rating: ***..
(3.20, 5 ratings)
If you have Hadoop clusters in research or an early-stage data lake and are considering strategic vision and goals, this session is for you. Phillip Radley explains how to run Hadoop as a shared service, providing an enterprise-wide data platform hosting hundreds of projects securely and predictably.
12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 8/9
Level: Beginner
Michael Noll (Confluent)
Average rating: ****.
(4.00, 11 ratings)
Michael Noll explains how Apache Kafka helps you radically simplify your data processing architectures by building normal applications to serve your real-time processing needs rather than building clusters or similar special-purpose infrastructure, while still benefiting from properties typically associated exclusively with cluster technologies.
12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 10/11
Level: Intermediate
Ben Sharma (Zaloni)
Average rating: ***..
(3.83, 6 ratings)
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma offers lessons learned from the field to get you started.
14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 13
Level: Beginner
Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speedup for DFSIO on Hadoop and a 500% speedup for Terasort on Spark.
14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 15/16
Level: Intermediate
Mark Madsen (Third Nature)
Average rating: ***..
(3.33, 12 ratings)
Building a data lake involves more than installing and using Hadoop. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen discusses hidden design assumptions, reviews design principles to apply when building multiuse data infrastructure, and provides a reference architecture.
14:55–15:35 Wednesday, 24 May 2017
Location: Capital Suite 10/11
Level: Intermediate
Victor Zabalza (ASI Data Science)
Average rating: ***..
(3.75, 4 ratings)
Data exploration usually entails making endless one-use exploratory plots. Victor Zabalza shares a Python package based on dask execution graphs and interactive visualization in Jupyter widgets built to overcome this drudge work. Victor offers an overview of the tool and explains how it was built and why it will become essential in the first steps of every data science project.
17:25–18:05 Wednesday, 24 May 2017
Location: Capital Suite 10/11
Level: Beginner
Wojciech Biela (Teradata), Łukasz Osipiuk (Teradata)
Average rating: ****.
(4.00, 1 rating)
Wojciech Biela and Łukasz Osipiuk offer an introduction to Presto, an open source distributed analytical SQL engine that enables users to run interactive queries over their datasets stored in various data sources, and explore its applications in various big data problems.
17:25–18:05 Wednesday, 24 May 2017
Location: Capital Suite 13
Secondary topics: Deep learning
Level: Intermediate
Chris Fregly (PipelineAI)
Average rating: ***..
(3.00, 1 rating)
Chris Fregly explores an often-overlooked area of machine learning and artificial intelligence (the real-time, end-user-facing "serving" layer in hybrid-cloud and on-premises deployment environments) and shares a production-ready environment for serving your notebook-based Spark ML and TensorFlow AI models in a highly scalable, highly available way.
11:15–11:55 Thursday, 25 May 2017
Location: Capital Suite 8/9
Level: Beginner
Tyler Akidau (Google)
Average rating: ***..
(3.80, 5 ratings)
The world of big data involves an ever-changing field of players. Much as SQL is a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. Tyler Akidau explains how this vision has been realized and discusses the challenges that lie ahead.
11:15–11:55 Thursday, 25 May 2017
Location: Capital Suite 10/11
Level: Beginner
Aurélien Géron (Kiwisoft)
Average rating: *****
(5.00, 4 ratings)
Collaborative filtering is great for recommendations, yet it suffers from the cold-start problem. New content with no views is ignored, and new users get poor recommendations. Aurélien Géron shares a solution: knowledge graphs. With a knowledge graph, you can truly understand your users' interests and make better, more relevant recommendations.
11:15–11:55 Thursday, 25 May 2017
Location: Capital Suite 12
Level: Intermediate
Average rating: *****
(5.00, 1 rating)
Herman van Hövell tot Westerflier offers a deep dive into Spark SQL's Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how new and upcoming features are implemented using Catalyst.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 10/11
Level: Beginner
Xueyan Li (Qunar), Yupeng Fu (Alluxio)
Average rating: ***..
(3.00, 1 rating)
Alluxio, the first memory-speed virtual distributed storage system in the world, unifies the data from various under storage systems and presents a global namespace to various computation frameworks. Xueyan Li and Yupeng Fu explore how Alluxio has led to performance improvements averaging 300x at service peak time on stream processing workloads at Qunar.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 15/16
Level: Beginner
Sean Kandel (Trifacta)
Average rating: ***..
(3.00, 6 ratings)
Sean Kandel offers an overview of an entirely new approach to visualizing metadata and data lineage, explaining how to track how different attributes of data are derived during the data preparation process and the associated linkages across different elements in the data.
14:05–14:45 Thursday, 25 May 2017
Location: Capital Suite 10/11
Level: Advanced
Mark Grover (Lyft), Ted Malaska (Blizzard Entertainment)
Average rating: ****.
(4.00, 4 ratings)
Any nontrivial streaming app requires that you consider a number of important topics, but questions like how to manage offsets or state often go unanswered. Mark Grover and Ted Malaska share practices that no one talks about when you start writing a streaming app but that you'll inevitably need to learn along the way.
14:55–15:35 Thursday, 25 May 2017
Location: Capital Suite 8/9
Level: Intermediate
Rekha Joshi (Intuit)
Average rating: **...
(2.00, 5 ratings)
Performance and security are often at loggerheads. Rekha Joshi explains why and offers a deep dive into how performance and security are managed in some of the most intense and critical data platform services at Intuit.
14:55–15:35 Thursday, 25 May 2017
Location: Capital Suite 10/11
Level: Intermediate
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Average rating: ****.
(4.75, 4 ratings)
In most organizations, data is spread across multiple data sources, such as Hadoop/cloud storage, RDBMS, and NoSQL. Tomer Shiran and Jacques Nadeau offer an overview of Apache Arrow, an open source in-memory columnar technology that enables users to combine multiple data sources and expose them as a virtual data lake to users of Spark, SQL-on-Hadoop, Python, and R.
14:55–15:35 Thursday, 25 May 2017
Location: Capital Suite 12
Level: Beginner
Matthias Niehoff (codecentric AG)
Average rating: ****.
(4.00, 4 ratings)
Matthias Niehoff shares lessons learned working with Spark, Cassandra, and the Spark-Cassandra connector and best practices drawn from his work on multiple big and fast data projects, as well as challenges encountered along the way.
16:35–17:15 Thursday, 25 May 2017
Location: Capital Suite 10/11
Level: Beginner
The class of big data computations known as distributed merge trees was built to aggregate user information across multiple data sources in the media domain. Vijay Srinivas Agneeswaran explores prototypes built on top of Apache HAWQ, Druid, and Kinetica, one of the open source GPU databases. Results show that Kinetica on a single G2.8x node outperformed clusters of HAWQ and Druid nodes.
16:35–17:15 Thursday, 25 May 2017
Location: Capital Suite 13
Level: Intermediate
Arturo Bayo (Synergic Partners), Alvaro Fernandez Velando (Santander Spain)
Average rating: ****.
(4.50, 6 ratings)
Arturo Bayo and Alvaro Fernandez Velando explain how a data hub strategy helps clarify data sharing and governance in an organization and share one way to implement a data hub architecture using big data technology and resources that are already established in the enterprise.