Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

Schedule: Storage sessions

9:00am–5:00pm Monday, March 25 & Tuesday, March 26, 2019

Building a serverless big data application on AWS

Data Engineering & Architecture
Location: 2018

Jorge Lopez (Amazon Web Services), Roy Hasson (Amazon Web Services), Rajeev Chakrabarti (Amazon Web Services), Jesse Gebhardt (Amazon Web Services), Gautam Srinivasan (Amazon Web Services), Anthony Nguyen (Amazon Web Services)

Average rating:

(4.50, 4 ratings)

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

9:00am–12:30pm Tuesday, March 26, 2019

Hands-on with Cloudera SDX: Setting up your own shared data experience

Data Engineering & Architecture
Location: 2008

Santosh Kumar (Cloudera), Andre Araujo (Cloudera), Wim Stoop (Cloudera)

Average rating:

(5.00, 1 rating)

Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar, Andre Araujo, and Wim Stoop offer an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with the skills to set up your own SDX.

1:30pm–5:00pm Tuesday, March 26, 2019

Architecture and algorithms for end-to-end streaming data processing

Data Engineering & Architecture
Location: 2005

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Average rating:

(2.67, 12 ratings)

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end pipeline for real-time data (messaging, compute, and storage), along with algorithms that extract insights such as heavy hitters and quantiles from data streams.
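
For readers unfamiliar with the term, a heavy-hitters query asks which items occur most often in a stream without storing the stream itself. The sketch below uses the Misra-Gries summary, one classic algorithm for this problem; it is an illustration chosen here, not necessarily the technique covered in the session, and the function name and sample stream are assumptions.

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n
    is guaranteed to appear among the returned keys. The counts are
    underestimates, so a second pass is needed for exact frequencies.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream of 10 events in which "a" occurs 5 times.
candidates = misra_gries(list("aababcadae"), k=3)
print(sorted(candidates))
```

Memory stays bounded by k regardless of stream length, which is the property that makes this family of algorithms attractive for high-velocity data.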

11:00am–11:40am Wednesday, March 27, 2019

Building the AI engine for retail in the new era

Data Engineering & Architecture
Location: 2002

Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)

Average rating:

(4.50, 4 ratings)

Jian Chang and Sanjian Chen outline the design of the AI engine on Alibaba's TSDB service, which enables fast and complex analytics of large-scale retail data. They then share a successful case study of the Fresh Hema Supermarket, a major “new retail” platform operated by Alibaba Group, highlighting solutions to the major technical challenges in data cleaning, storage, and processing.

11:00am–11:40am Wednesday, March 27, 2019

Live Aggregators: A scalable, cost-effective, and reliable way of aggregating billions of messages in real time

Data Engineering & Architecture
Location: 2006

Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)

Average rating:

(4.67, 3 ratings)

Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA), a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions because it runs on AWS Spot Instances and sustains 70% CPU utilization.

11:50am–12:30pm Wednesday, March 27, 2019

Deep learning beyond the learning

Data Engineering & Architecture
Location: 2008

Tobias Knaup (Mesosphere), Joerg Schad (ArangoDB)

Average rating:

(4.50, 2 ratings)

There are many great tutorials for training your deep learning models, but training is only a small part of the overall deep learning pipeline. Tobias Knaup and Joerg Schad offer an introduction to building a complete automated deep learning pipeline, starting with exploratory analysis and moving through training, model storage, model serving, and monitoring.

2:40pm–3:20pm Wednesday, March 27, 2019

Real-time analytics at Uber: Bring SQL into everything

Data Engineering & Architecture
Location: 2004

Zhenxiao Luo (Twitter)

Average rating:

(4.09, 11 ratings)

From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Zhenxiao Luo explains how Uber supports real-time analytics with deep learning on the fly, without any data copying.

4:20pm–5:00pm Wednesday, March 27, 2019

From flat files to deconstructed databases: The evolution and future of the big data ecosystem

Data Engineering & Architecture
Location: 2004

Julien Le Dem (WeWork)

Average rating:

(4.83, 6 ratings)

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.

5:10pm–5:50pm Wednesday, March 27, 2019

Persistent storage for machine learning in KubeFlow

Data Engineering & Architecture
Location: 2008

Skyler Thomas (MapR), Terry He (MapR Technologies)

Average rating:

(4.75, 4 ratings)

KubeFlow separates compute and storage to provide the ability to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud. Skyler Thomas and Terry He explore the problems of state and storage and explain how distributed persistent storage can logically extend the compute flexibility provided by KubeFlow.

11:00am–11:40am Thursday, March 28, 2019

Optimizing computing cluster resource utilization with an in-memory distributed filesystem

Data Engineering & Architecture
Location: 2008

Yue Li (MemVerge), Shouwei Chen (Rutgers University)

Average rating:

(5.00, 4 ratings)

JD.com recently designed a brand-new architecture to optimize Spark computing clusters. Yue Li and Shouwei Chen detail the problems the team faced when building it and explain how the company now benefits from the in-memory distributed filesystem.

11:00am–11:40am Thursday, March 28, 2019

The future of machine learning is decentralized

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall

Alex Ingerman (Google)

Average rating:

(4.67, 12 ratings)

Federated learning is an approach for training ML models across a fleet of participating devices without collecting their data in a central location. Alex Ingerman offers an overview of federated learning, compares traditional and federated ML workflows, and explores the current and upcoming use cases for decentralized machine learning, with examples from Google's deployment of this technology.
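
The idea of training without centralizing data can be illustrated with a toy federated averaging loop. This is a minimal sketch under stated assumptions, not Google's deployment: the one-parameter model y ≈ w * x, the function names, the learning rate, and the three client datasets are all hypothetical. Each client trains on its own data and only the resulting weight is sent to the server.

```python
def local_update(w, data, lr=0.1, epochs=5):
    """Run a few gradient steps for y ≈ w * x on one client's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global, clients):
    """One round: clients train locally, the server averages their weights."""
    total = sum(len(data) for data in clients)
    updates = [(local_update(w_global, data), len(data)) for data in clients]
    # Weight each client's result by its number of examples.
    return sum(w * n for w, n in updates) / total


# Hypothetical per-device datasets, all generated from y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]

w = 0.0
for _ in range(20):
    w = federated_round(w, clients)
print(round(w, 4))
```

The server never sees an (x, y) pair, only the locally trained weights. Production systems add pieces this sketch omits, such as client sampling and secure aggregation.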

11:00am–11:40am Thursday, March 28, 2019

Presto: Tuning performance of SQL-on-anything analytics

Data Engineering & Architecture
Location: 2004

Kamil Bajda-Pawlikowski (Starburst), Martin Traverso (Presto Software Foundation)

Average rating:

(3.33, 3 ratings)

Kamil Bajda-Pawlikowski and Martin Traverso explore Presto's recently introduced cost-based optimizer, which must account for heterogeneous inputs with differing and often incomplete data statistics, and detail use cases for Presto across several industries. They also share recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward.

11:50am–12:30pm Thursday, March 28, 2019

Journey to the cloud: Architecting for the cloud through customer stories

Data Engineering & Architecture
Location: 2001

Jason Wang (Cloudera), Sushant Rao (Cloudera)

Average rating:

(4.00, 2 ratings)

Jason Wang and Sushant Rao offer an overview of cloud architecture, then go into detail on core cloud paradigms like compute (virtual machines), cloud storage, authentication and authorization, and encryption and security. They conclude by bringing these concepts together through customer stories to demonstrate how real-world companies have leveraged the cloud for their big data platforms.

1:50pm–2:30pm Thursday, March 28, 2019

Faster ML over joins of tables

Data Engineering & Architecture
Location: 2008

Arun Kumar (University of California, San Diego)

Average rating:

(4.00, 2 ratings)

Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python.

2:40pm–3:20pm Thursday, March 28, 2019

Clusters in Kubernetes on a cluster: Building a multitenant environment for the field

Data Engineering & Architecture
Location: 2008

Paul Curtis (Weaveworks)

Average rating:

(4.50, 2 ratings)

What do you do when your technology doesn’t easily fit on a single laptop and consists of many components? Paul Curtis explains how MapR Technologies rolled out a containerized, scalable, globally available, and easily updatable environment using a combination of Kubernetes to orchestrate, a shared data fabric to store and persist, and AppLariat to provide the user interface.

2:40pm–3:20pm Thursday, March 28, 2019

Bullet: Querying streaming data in transit with sketches

Data Engineering & Architecture
Location: 2006

Akshai Sarma (Yahoo), Nathan Speidel (Yahoo)

Average rating:

(3.67, 3 ratings)

Akshai Sarma and Nathan Speidel offer an overview of Bullet, a scalable, pluggable, lightweight, multitenant system for querying any data flowing through a streaming system without storing it. Bullet uses sketch-based algorithms to efficiently support otherwise intractable operations like top K, count distinct, and windowing without any storage.
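
To show how a sketch can answer a count-distinct query in bounded memory, the example below implements a K-minimum-values (KMV) estimator. It is a stand-in chosen for brevity and is not claimed to be the sketch Bullet uses; the function names and the default k are assumptions.

```python
import hashlib
import heapq


def _hash01(item):
    """Map an item to a reproducible pseudo-random float in [0, 1)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(stream, k=256):
    """Estimate the number of distinct items while keeping only k hashes."""
    heap = []        # negated values, so heap[0] is the largest kept hash
    members = set()  # hashes currently kept, to skip repeated items
    for item in stream:
        h = _hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap it in.
            evicted = -heapq.heapreplace(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: count is exact
    # The k-th smallest of n uniform hashes sits near k / n.
    return (k - 1) / (-heap[0])


# Hypothetical stream: 10,000 events drawn from 100 distinct keys.
print(estimate_distinct(i % 100 for i in range(10_000)))
```

The relative error of a KMV estimate shrinks roughly as 1 / sqrt(k), so accuracy is tuned by memory rather than by stream size.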

3:50pm–4:30pm Thursday, March 28, 2019

The Paradise Papers and West Africa Leaks: Behind the scenes with the ICIJ

Business Analytics and Visualization, Strata Business Summit
Location: 2018

Pierre Romera (International Consortium of Investigative Journalists (ICIJ))

Average rating:

(4.67, 6 ratings)

The ICIJ was the team behind the Panama Papers and Paradise Papers. Pierre Romera offers a behind-the-scenes look into the ICIJ's process and explores the challenges of handling 1.4 TB of data (in many different formats) and making it securely available to journalists all over the world.

3:50pm–4:30pm Thursday, March 28, 2019

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric

Data Engineering & Architecture
Location: 2008

Yuan Zhou (Intel), Haodong Tang (Intel), Jian Zhang (Intel)

Average rating:

(3.33, 3 ratings)

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMoF and explain how it improves Spark analytics performance.

3:50pm–4:30pm Thursday, March 28, 2019

ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics

Data Engineering & Architecture
Location: 2002

Igor Canadi (Rockset), Dhruba Borthakur (Rockset)

Average rating:

(4.00, 1 rating)

Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called ROCKSET that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines.

4:40pm–5:20pm Thursday, March 28, 2019

Data processing at the speed of 100 Gbps using Apache Crail

Data Engineering & Architecture
Location: 2008

Patrick Stuedi (IBM Research)

Average rating:

(4.00, 1 rating)

Modern networking and storage technologies like RDMA or NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark.
