Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Schedule: Big data and the Cloud sessions

9:00–12:30 Tuesday, 23 May 2017
Location: Capital Suite 9
Level: Intermediate
David Tishgart (Cloudera), Philip Langdale (Cloudera), Eugene Fratkin (Cloudera), Jennifer Wu (Cloudera)
Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Eugene Fratkin, Philip Langdale, David Tishgart, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.
9:00–12:30 Tuesday, 23 May 2017
Location: Capital Suite 10
Ian Meyers (Amazon Web Services (AWS)), Pratim Das (Amazon Web Services (AWS)), Ian Robinson (Amazon Web Services (AWS))
Average rating: *****
(5.00, 2 ratings)
Want to ramp up your knowledge of Amazon's big data web services and launch your first big data application on the cloud? Ian Meyers, Pratim Das, and Ian Robinson walk you through building a big data application in real time using a combination of open source technologies, including Apache Hadoop, Spark, and Zeppelin, as well as AWS managed services such as Amazon EMR, Amazon Kinesis, and more.
13:30–17:00 Tuesday, 23 May 2017
Location: Capital Suite 9
Level: Intermediate
Douglas Ashton (Mango Solutions), Aimee Gott (Mango Solutions), Mark Sellors (Mango Solutions)
Average rating: *****
(5.00, 1 rating)
R is a top contender for statistics and machine learning, but Spark has emerged as the leader for in-memory distributed data analysis. Douglas Ashton, Aimee Gott, and Mark Sellors introduce Spark, cover data manipulation with Spark as a backend to dplyr and machine learning via MLlib, and explore RStudio's sparklyr package, giving you the power of Spark without having to leave your R session.
13:30–17:00 Tuesday, 23 May 2017
Location: Capital Suite 10
Level: Intermediate
John Mikula (Google Cloud)
Average rating: *....
(1.33, 3 ratings)
John Mikula explores using managed Spark and Hadoop solutions in public clouds alongside cloud products for storage, analysis, and message queues to meet enterprise requirements via the Spark and Hadoop ecosystem.
11:15–11:55 Wednesday, 24 May 2017
Location: Hall S21/23 (B)
Level: Intermediate
Matthew Rocklin (Anaconda)
Average rating: ****.
(4.33, 3 ratings)
Dask parallelizes Python libraries like NumPy, pandas, and scikit-learn, bringing a popular data science stack to the world of distributed computing. Matthew Rocklin discusses the architecture of Dask and its current applications in the wild.
11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 10/11
Level: Intermediate
Average rating: ***..
(3.20, 5 ratings)
If you have Hadoop clusters in research or an early-stage data lake and are considering strategic vision and goals, this session is for you. Phillip Radley explains how to run Hadoop as a shared service, providing an enterprise-wide data platform hosting hundreds of projects securely and predictably.
11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 14
Level: Intermediate
Mark Donsky (Cloudera), Vikas Singh (Cloudera)
Average rating: ****.
(4.33, 9 ratings)
Big data needs governance, not just for compliance but also for data scientists. Governance empowers data scientists to find, trust, and use data on their own, yet it can be overwhelming to know where to start, especially if your big data environment spans beyond your enterprise to the cloud. Mark Donsky and Vikas Singh share a step-by-step approach to kickstart your big data governance.
14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 13
Level: Beginner
Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speedup for DFSIO on Hadoop and a 500% speedup for Terasort on Spark.
14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 7
Secondary topics: Deep learning
Level: Intermediate
Average rating: ***..
(3.33, 3 ratings)
Deep learning is one of the most exciting techniques in machine learning. Miguel González-Fierro explores the problem of image classification using ResNet, the deep neural network that surpassed human-level accuracy for the first time, and demonstrates how to create an end-to-end process to operationalize deep learning in computer vision for business problems using Microsoft R Server and GPU VMs.
16:35–17:15 Wednesday, 24 May 2017
Location: Capital Suite 13
Level: Intermediate
Daniel Bäurer (inovex GmbH), Sascha Askani (inovex GmbH)
Average rating: *****
(5.00, 1 rating)
Multiple challenges arise when distributed applications are provisioned in a containerized environment. Daniel Bäurer and Sascha Askani share a solution for distributed storage in cloud-native environments using Spark on Kubernetes.
16:35–17:15 Wednesday, 24 May 2017
Location: Capital Suite 15/16
Level: Beginner
Yuval Dvir (Google)
In an era when we are bombarded with data and tasks to finish, our ability to focus our attention becomes critical. When 70% of our code is for DevOps purposes and 90% of our data is dark, the cloud is a welcome, secure, and efficient relief. Yuval Dvir refutes common misconceptions about the cloud and explains why it's not a matter of "if" but "when" you'll move to the cloud.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 13
Level: Intermediate
Andrei Savu (Cloudera), Philip Langdale (Cloudera)
Cloudera Enterprise has made many focused optimizations in order to leverage all of the cloud-native capabilities of AWS for the CDH platform. Andrei Savu and Philip Langdale take you through all the ins and outs of successfully running end-to-end batch data engineering workflows in AWS and demonstrate a Cloudera on AWS data engineering workflow with a sample use case.
14:05–14:45 Thursday, 25 May 2017
Location: Capital Suite 12
Level: Intermediate
Nicolas Poggi (Barcelona Supercomputing-Microsoft Research Center)
Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance from major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as baseline.
14:05–14:45 Thursday, 25 May 2017
Location: Capital Suite 13
Level: Intermediate
Calum Murray (Intuit)
Average rating: ****.
(4.00, 1 rating)
As Intuit moves its SaaS platform from its own data centers to AWS, it will straddle both worlds for a period of time (and potentially indefinitely). Calum Murray looks at what straddling means for data and data systems.