Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Schedule: Big data and data science in the cloud sessions

9:00–17:00 Tuesday, 22 May 2018
Location: Capital Suite 11 Level: Intermediate
Carl Osipov (Google)
Carl Osipov walks you through building a complete machine learning pipeline from ingest, exploration, training, and evaluation to deployment and prediction.
11:15–11:55 Wednesday, 23 May 2018
Location: S11A Level: Intermediate
Stuart Pook (Criteo)
Average rating: ****.
(4.40, 5 ratings)
Criteo has a production cluster of 2K nodes running over 300K jobs a day in the company's own data centers. These clusters were meant to provide a redundant solution to Criteo's storage and compute needs. Stuart Pook offers an overview of the project, shares challenges and lessons learned, and discusses Criteo's progress in building another cluster to survive the loss of a full DC.
14:05–14:45 Wednesday, 23 May 2018
Location: S11A Level: Intermediate
Greg Rahn (Cloudera)
Average rating: ***..
(3.29, 7 ratings)
For many organizations, the next big data warehouse will be in the cloud. Greg Rahn shares considerations for evaluating the cloud for analytics and big data warehousing, including different architectural approaches to optimize price and performance.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Intermediate
Secondary topics:  Security and Privacy
Joshua Patterson (NVIDIA), Chau Dang (NVIDIA)
Joshua Patterson and Mike Wendt explain how NVIDIA used GPU-accelerated open source technologies to improve its cyberdefense platforms by leveraging software from the GPU Open Analytics Initiative (GOAI) and how the company accelerated anomaly detection with more efficient machine learning models, faster deployment, and more granular data exploration.
14:55–15:35 Wednesday, 23 May 2018
Location: S11A Level: Intermediate
Tomer Shiran (Dremio)
Average rating: ***..
(3.50, 2 ratings)
It's often impractical for organizations to physically consolidate all data into one system. Tomer Shiran offers an overview of Apache Arrow, an open source columnar, in-memory data representation that enables analytical systems and data sources to exchange and process data in real time, simplifying and accelerating data access without having to copy all data into one location.
16:35–17:15 Wednesday, 23 May 2018
Location: S11A Level: Intermediate
Paul Curtis (Weaveworks)
Average rating: ****.
(4.00, 2 ratings)
The flexibility advantage conferred by containers depends on their ephemeral nature, so it’s useful to keep containers stateless. However, many applications require state: access to a scalable persistence layer that supports real mutable files, tables, and streams. Paul Curtis demonstrates how to make containerized applications reliable, available, and performant, even with stateful applications.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines sessions, Media, Advertising, Entertainment
Sergey Ermolin (Intel), Olga Ermolin (MLS Listings)
Average rating: ****.
(4.00, 1 rating)
Aggregation of geospecific real estate databases results in duplicate entries for properties located near geographical boundaries. Sergey Ermolin and Olga Ermolin detail an approach for identifying duplicate entries via the analysis of images that accompany real estate listings that leverages a transfer learning Siamese architecture based on VGG-16 CNN topology.
17:25–18:05 Wednesday, 23 May 2018
Location: S11A Level: Beginner
Christopher Royles (Cloudera)
Average rating: ****.
(4.00, 1 rating)
Big data and cloud deployments return huge benefits in flexibility and economics but can also result in runaway costs and failed projects. Drawing on his production experience, Christopher Royles shares tips and best practices for determining initial sizing, strategic planning, and longer-term operation, helping you deliver an efficient platform, reduce costs, and implement a successful project.
17:25–18:05 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Beginner
Secondary topics:  Security and Privacy
Nikki Rouda (Cloudera), Nick Curcuru (Mastercard)
Average rating: ****.
(4.00, 2 ratings)
Having so many cloud-based analytics services available is a dream come true. However, it's a nightmare to manage proper security and governance across all those different services. Nikki Rouda and Nick Curcuru share advice on how to minimize the risk and effort in protecting and managing data for multidisciplinary analytics and explain how to avoid the hassle and extra cost of siloed approaches.
17:25–18:05 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Aljoscha Krettek (Ververica)
Average rating: ****.
(4.67, 3 ratings)
Aljoscha Krettek offers an overview of the modern stream processing space, details the challenges posed by stateful and event-time-aware stream processing, and shares core archetypes ("application blueprints") for stream processing drawn from real-world use cases with Apache Flink.
11:15–11:55 Thursday, 24 May 2018
Location: S11A Level: Intermediate
Secondary topics:  Data Platforms, E-commerce and Retail
Neelesh Salian (Stitch Fix)
Average rating: *....
(1.00, 1 rating)
Neelesh Srinivas Salian offers an overview of the compute infrastructure used by the data science team at Stitch Fix, covering the architecture, tools within the larger ecosystem, and the challenges that the team overcame along the way.
12:05–12:45 Thursday, 24 May 2018
Location: S11A Level: Advanced
Jacques Nadeau (Dremio)
Average rating: ****.
(4.00, 3 ratings)
Jacques Nadeau offers an overview of a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. You'll explore the system design and deployment architecture, learn how data science, analytical, and custom applications can all leverage the cache simultaneously, and see a live demo.
12:05–12:45 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Beginner
Secondary topics:  Managing and Deploying Machine Learning
Nanda Vijaydev (BlueData), Thomas Phelan (HPE BlueData)
Average rating: ****.
(4.17, 6 ratings)
In the past, you needed a high-end proprietary stack for advanced machine learning, but today, you can use open source machine learning and deep learning algorithms available with distributed computing technologies like Apache Spark and GPUs. Nanda Vijaydev and Thomas Phelan demonstrate how to deploy a TensorFlow and Spark with NVIDIA CUDA stack on Docker containers in a multitenant environment.
12:05–12:45 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines sessions
Adesh Rao (Qubole), Abhishek Somani (Qubole)
Average rating: ***..
(3.00, 2 ratings)
Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.
14:05–14:45 Thursday, 24 May 2018
Location: Capital Suite 2/3 Level: Intermediate
Secondary topics:  Telecom
Sven Loeffler (Deutsche Telekom)
Average rating: **...
(2.00, 1 rating)
Sven Löffler offers an overview of the Data Intelligence Hub, T-Systems's implementation of the Fraunhofer Industrial Data Space: a reference architecture for the standardized and secure data exchange between industries in the context of the internet of things.
14:55–15:35 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Data Platforms
Alvin Heib (Cloudera), Guy Leroux (Atos)
Alvin Heib and Guy Leroux offer an overview of ClickFox, a platform able to cope with high-performance analytical needs, from bits and bytes to solving customer needs, covering the platform's virtualization, big data, and analytical layers.
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Advanced
Secondary topics:  Data Integration and Data Pipelines sessions
Eugene Kirpichov (Google)
Average rating: ****.
(4.50, 2 ratings)
Apache Beam offers users a novel programming model in which the classic batch-streaming dichotomy is erased and ships with a rich set of I/O connectors to popular storage systems. Eugene Kirpichov explains why Beam has made these connectors flexible and modular, a key component of which is Splittable DoFn, a novel programming model primitive that unifies data ingestion between batch and streaming.
16:35–17:15 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Data Platforms
Naghman Waheed (Bayer Crop Science), Brian Arnold (Bayer)
Average rating: ****.
(4.50, 2 ratings)
There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto's Data Historian platform, which can ingest, store, and access datasets without compromising ease of use, governance, or security.
16:35–17:15 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Flavio Junqueira (Dell EMC)
Stream processing is in the spotlight. Enabling low-latency insights and actions out of continuously generated data is compelling to a number of application domains, and the ability to adapt to workload variations is critical to many applications. Flavio Junqueira explores Pravega, a stream store that scales streams automatically and enables applications to scale downstream by signaling changes.