Presented By O’Reilly and Cloudera

Make Data Work

21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Schedule: Big data and data science in the cloud sessions

9:00–17:00 Tuesday, 22 May 2018

Serverless machine learning with TensorFlow

Location: Capital Suite 11 Level: Intermediate

Carl Osipov (Google)

Carl Osipov walks you through building a complete machine learning pipeline from ingest, exploration, training, and evaluation to deployment and prediction.

11:15–11:55 Wednesday, 23 May 2018

The cloud is expensive, so build your own redundant Hadoop clusters.

Location: S11A Level: Intermediate

Stuart Pook (Criteo)

Average rating:

(4.40, 5 ratings)

Criteo has a production cluster of 2K nodes running over 300K jobs a day in the company's own data centers. These clusters were meant to provide a redundant solution to Criteo's storage and compute needs. Stuart Pook offers an overview of the project, shares challenges and lessons learned, and discusses Criteo's progress in building another cluster to survive the loss of a full DC.

14:05–14:45 Wednesday, 23 May 2018

Analytics in the cloud: Building a modern cloud-based big data warehouse

Location: S11A Level: Intermediate

Greg Rahn (Cloudera)

Average rating:

(3.29, 7 ratings)

For many organizations, the next big data warehouse will be in the cloud. Greg Rahn shares considerations for evaluating the cloud for analytics and big data warehousing, including different architectural approaches to optimize price and performance.

14:05–14:45 Wednesday, 23 May 2018

GPU-accelerated threat detection with GOAI

Location: Capital Suite 7 Level: Intermediate

Secondary topics: Security and Privacy

Joshua Patterson (NVIDIA), Chau Dang (NVIDIA)

Joshua Patterson and Mike Wendt explain how NVIDIA used GPU-accelerated open source technologies to improve its cyberdefense platforms by leveraging software from the GPU Open Analytics Initiative (GOAI) and how the company accelerated anomaly detection with more efficient machine learning models, faster deployment, and more granular data exploration.

14:55–15:35 Wednesday, 23 May 2018

Data science across data sources with Apache Arrow

Location: S11A Level: Intermediate

Tomer Shiran (Dremio)

Average rating:

(3.50, 2 ratings)

It's often impractical for organizations to physically consolidate all data into one system. Tomer Shiran offers an overview of Apache Arrow, an open source columnar, in-memory data representation that enables analytical systems and data sources to exchange and process data in real time, simplifying and accelerating data access without having to copy all data into one location.

16:35–17:15 Wednesday, 23 May 2018

Making stateless containers reliable and available even with stateful applications

Location: S11A Level: Intermediate

Paul Curtis (Weaveworks)

Average rating:

(4.00, 2 ratings)

The flexibility advantage conferred by containers depends on their ephemeral nature, so it’s useful to keep containers stateless. However, many applications require state: access to a scalable persistence layer that supports real mutable files, tables, and streams. Paul Curtis demonstrates how to make containerized applications reliable, available, and performant, even with stateful applications.

16:35–17:15 Wednesday, 23 May 2018

Using Siamese CNNs for removing duplicate entries from real estate listing databases

Location: Capital Suite 13 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines sessions, Media, Advertising, Entertainment

Sergey Ermolin (Intel), Olga Ermolin (MLS Listings)

Average rating:

(4.00, 1 rating)

Aggregation of geospecific real estate databases results in duplicate entries for properties located near geographical boundaries. Sergey Ermolin and Olga Ermolin detail an approach that identifies duplicate entries by analyzing the images that accompany real estate listings, using a transfer learning Siamese architecture based on the VGG-16 CNN topology.

17:25–18:05 Wednesday, 23 May 2018

Practical advice for driving down the cost of cloud big data platforms

Location: S11A Level: Beginner

Christopher Royles (Cloudera)

Average rating:

(4.00, 1 rating)

Big data and cloud deployments return huge benefits in flexibility and economics but can also result in runaway costs and failed projects. Drawing on his production experience, Christopher Royles shares tips and best practices for determining initial sizing, strategic planning, and longer-term operation, helping you deliver an efficient platform, reduce costs, and implement a successful project.

17:25–18:05 Wednesday, 23 May 2018

Security, governance, and cloud analytics, oh my!

Location: Capital Suite 7 Level: Beginner

Secondary topics: Security and Privacy

Nikki Rouda (Cloudera), Nick Curcuru (Mastercard)

Average rating:

(4.00, 2 ratings)

Having so many cloud-based analytics services available is a dream come true. However, it's a nightmare to manage proper security and governance across all those different services. Nikki Rouda and Nick Curcuru share advice on how to minimize the risk and effort in protecting and managing data for multidisciplinary analytics and explain how to avoid the hassle and extra cost of siloed approaches.

17:25–18:05 Wednesday, 23 May 2018

Stream processing for the practitioner: Blueprints for common stream processing use cases with Apache Flink

Location: Capital Suite 8/9 Level: Intermediate

Aljoscha Krettek (Ververica)

Average rating:

(4.67, 3 ratings)

Aljoscha Krettek offers an overview of the modern stream processing space, details the challenges posed by stateful and event-time-aware stream processing, and shares core archetypes (“application blueprints”) for stream processing drawn from real-world use cases with Apache Flink.

11:15–11:55 Thursday, 24 May 2018

Improving ad hoc and production workflows at Stitch Fix

Location: S11A Level: Intermediate

Secondary topics: Data Platforms, E-commerce and Retail

Neelesh Salian (Stitch Fix)

Average rating:

(1.00, 1 rating)

Neelesh Srinivas Salian offers an overview of the compute infrastructure used by the data science team at Stitch Fix, covering the architecture, tools within the larger ecosystem, and the challenges that the team overcame along the way.

12:05–12:45 Thursday, 24 May 2018

Setting up a lightweight distributed caching layer using Apache Arrow

Location: S11A Level: Advanced

Jacques Nadeau (Dremio)

Average rating:

(4.00, 3 ratings)

Jacques Nadeau offers an overview of a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. You'll explore the system design and deployment architecture, learn how data science, analytical, and custom applications can all leverage the cache simultaneously, and see a live demo.

12:05–12:45 Thursday, 24 May 2018

Deep learning with TensorFlow and Spark using GPUs and Docker containers

Location: Capital Suite 7 Level: Beginner

Secondary topics: Managing and Deploying Machine Learning

Nanda Vijaydev (BlueData), Thomas Phelan (HPE BlueData)

Average rating:

(4.17, 6 ratings)

In the past, you needed a high-end proprietary stack for advanced machine learning, but today, you can use open source machine learning and deep learning algorithms available with distributed computing technologies like Apache Spark and GPUs. Nanda Vijaydev and Thomas Phelan demonstrate how to deploy a TensorFlow and Spark with NVIDIA CUDA stack on Docker containers in a multitenant environment.

12:05–12:45 Thursday, 24 May 2018

Autonomous ETL with materialized views

Location: Capital Suite 8/9 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines sessions

Adesh Rao (Qubole), Abhishek Somani (Qubole)

Average rating:

(3.00, 2 ratings)

Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.

14:05–14:45 Thursday, 24 May 2018

The Data Intelligence Hub: On-demand Hadoop resource provisioning in Europe’s Industrial Data Space using Cloudera Altus

Location: Capital Suite 2/3 Level: Intermediate

Secondary topics: Telecom

Sven Loeffler (Deutsche Telekom)

Average rating:

(2.00, 1 rating)

Sven Löffler offers an overview of the Data Intelligence Hub, T-Systems's implementation of the Fraunhofer Industrial Data Space: a reference architecture for standardized and secure data exchange between industries in the context of the internet of things.

14:55–15:35 Thursday, 24 May 2018

ClickFox: Customer journey analytics powered by OpenStack and Cloudera

Location: S11B Level: Intermediate

Secondary topics: Data Platforms

Alvin Heib (Cloudera), Guy Leroux (Atos)

Alvin Heib and Guy Leroux offer an overview of ClickFox, a platform able to cope with high-performance analytical needs, from bits and bytes to solving customer needs, covering the platform's virtualization, big data, and analytical layers.

14:55–15:35 Thursday, 24 May 2018

Radically modular data ingestion APIs in Apache Beam

Location: Capital Suite 8/9 Level: Advanced

Secondary topics: Data Integration and Data Pipelines sessions

Eugene Kirpichov (Google)

Average rating:

(4.50, 2 ratings)

Apache Beam offers users a novel programming model that erases the classic batch-streaming dichotomy, and it ships with a rich set of I/O connectors to popular storage systems. Eugene Kirpichov explains why Beam has made these connectors flexible and modular. A key component of this design is Splittable DoFn, a novel programming model primitive that unifies data ingestion between batch and streaming.

16:35–17:15 Thursday, 24 May 2018

You call it data lake; we call it Data Historian.

Location: S11B Level: Intermediate

Secondary topics: Data Platforms

Naghman Waheed (Bayer Crop Science), Brian Arnold (Bayer)

Average rating:

(4.50, 2 ratings)

There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto's Data Historian platform, which can ingest, store, and access datasets without compromising ease of use, governance, or security.

16:35–17:15 Thursday, 24 May 2018

Stream scaling in Pravega

Location: Capital Suite 8/9 Level: Intermediate

Flavio Junqueira (Dell EMC)

Stream processing is in the spotlight. Enabling low-latency insights and actions out of continuously generated data is compelling to a number of application domains, and the ability to adapt to workload variations is critical to many applications. Flavio Junqueira explores Pravega, a stream store that scales streams automatically and enables applications to scale downstream by signaling changes.
