Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Schedule: Storage sessions

9:00am - 5:00pm Monday, March 25 & Tuesday, March 26
Jorge Lopez (Amazon Web Services), Roy Hasson (Amazon Web Services), Rajeev Chakrabarti (Amazon Web Services), Jesse Gebhardt (Amazon Web Services), Gautam Srinivasan (Amazon Web Services), Anthony Nguyen (Amazon Web Services)
Average rating: ****.
(4.50, 4 ratings)
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.
9:00am - 12:30pm Tuesday, March 26, 2019
Santosh Kumar (Cloudera), Andre Araujo (Cloudera), Wim Stoop (Cloudera)
Average rating: *****
(5.00, 1 rating)
Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar, Andre Araujo, and Wim Stoop offer an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with the skills to set up your own SDX.
1:30pm - 5:00pm Tuesday, March 26, 2019
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
Average rating: **...
(2.67, 12 ratings)
Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline (messaging, compute, and storage) for real-time data, along with algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.
11:00am - 11:40am Wednesday, March 27, 2019
Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)
Average rating: ****.
(4.50, 4 ratings)
Jian Chang and Sanjian Chen outline the design of the AI engine on Alibaba's TSDB service, which enables fast and complex analytics of large-scale retail data. They then share a successful case study of the Fresh Hema Supermarket, a major “new retail” platform operated by Alibaba Group, highlighting solutions to the major technical challenges in data cleaning, storage, and processing.
11:00am - 11:40am Wednesday, March 27, 2019
Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)
Average rating: ****.
(4.67, 3 ratings)
Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA), a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions because it runs on AWS Spot Instances and reaches 70% CPU utilization.
11:50am - 12:30pm Wednesday, March 27, 2019
Tobias Knaup (Mesosphere), Joerg Schad (ArangoDB)
Average rating: ****.
(4.50, 2 ratings)
There are many great tutorials for training your deep learning models, but training is only a small part of the overall deep learning pipeline. Tobias Knaup and Joerg Schad offer an introduction to building a complete automated deep learning pipeline, starting with exploratory analysis, overtraining, model storage, model serving, and monitoring.
2:40pm - 3:20pm Wednesday, March 27, 2019
Zhenxiao Luo (Twitter)
Average rating: ****.
(4.09, 11 ratings)
From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Zhenxiao Luo explains how Uber supports real-time analytics with deep learning on the fly, without any data copying.
4:20pm - 5:00pm Wednesday, March 27, 2019
Julien Le Dem (WeWork)
Average rating: ****.
(4.83, 6 ratings)
Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.
5:10pm - 5:50pm Wednesday, March 27, 2019
Skyler Thomas (MapR Technologies), Terry He (MapR Technologies)
Average rating: ****.
(4.75, 4 ratings)
KubeFlow separates compute and storage to provide the ability to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud. Skyler Thomas and Terry He explore the problems of state and storage and explain how distributed persistent storage can logically extend the compute flexibility provided by KubeFlow.
11:00am - 11:40am Thursday, March 28, 2019
Yue Li (MemVerge), Shouwei Chen (Rutgers University)
Average rating: *****
(5.00, 4 ratings)
JD.com recently designed a brand-new architecture to optimize Spark computing clusters. Yue Li and Shouwei Chen detail the problems the team faced when building it and explain how the company now benefits from the in-memory distributed filesystem.
11:00am - 11:40am Thursday, March 28, 2019
Alex Ingerman (Google)
Average rating: ****.
(4.67, 12 ratings)
Federated learning is an approach for training ML models across a fleet of participating devices without collecting their data in a central location. Alex Ingerman offers an overview of federated learning, compares traditional and federated ML workflows, and explores the current and upcoming use cases for decentralized machine learning, with examples from Google's deployment of this technology.
11:00am - 11:40am Thursday, March 28, 2019
Kamil Bajda-Pawlikowski (Starburst), Martin Traverso (Presto Software Foundation)
Average rating: ***..
(3.33, 3 ratings)
Kamil Bajda-Pawlikowski and Martin Traverso explore Presto's recently introduced cost-based optimizer, which must account for heterogeneous inputs with differing and often incomplete data statistics, and detail use cases for Presto across several industries. They also share recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward.
11:50am - 12:30pm Thursday, March 28, 2019
Jason Wang (Cloudera), Sushant Rao (Cloudera)
Average rating: ****.
(4.00, 2 ratings)
Jason Wang and Sushant Rao offer an overview of cloud architecture, then go into detail on core cloud paradigms like compute (virtual machines), cloud storage, authentication and authorization, and encryption and security. They conclude by bringing these concepts together through customer stories to demonstrate how real-world companies have leveraged the cloud for their big data platforms.
1:50pm - 2:30pm Thursday, March 28, 2019
Arun Kumar (University of California, San Diego)
Average rating: ****.
(4.00, 2 ratings)
Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python.
2:40pm - 3:20pm Thursday, March 28, 2019
Paul Curtis (MapR Technologies)
Average rating: ****.
(4.50, 2 ratings)
What do you do when your technology doesn’t easily fit on a single laptop and consists of many components? Paul Curtis explains how MapR Technologies rolled out a containerized, scalable, globally available, and easily updatable environment using a combination of Kubernetes to orchestrate, a shared data fabric to store and persist, and AppLariat to provide the user interface.
2:40pm - 3:20pm Thursday, March 28, 2019
Akshai Sarma (Yahoo), Nathan Speidel (Yahoo)
Average rating: ***..
(3.67, 3 ratings)
Akshai Sarma and Nathan Speidel offer an overview of Bullet, a scalable, pluggable, lightweight, multitenant query system that runs on any data flowing through a streaming system without storing it. Using sketch-based algorithms, Bullet efficiently supports otherwise intractable operations like top K, count distinct, and windowing without any storage.
3:50pm - 4:30pm Thursday, March 28, 2019
Pierre Romera (International Consortium of Investigative Journalists (ICIJ))
Average rating: ****.
(4.67, 6 ratings)
The ICIJ was the team behind the Panama Papers and Paradise Papers. Pierre Romera offers a behind-the-scenes look into the ICIJ's process and explores the challenges in handling 1.4 TB of data (in many different formats) and making it available securely to journalists all over the world.
3:50pm - 4:30pm Thursday, March 28, 2019
Yuan Zhou (Intel), Haodong Tang (Intel), Jian Zhang (Intel)
Average rating: ***..
(3.33, 3 ratings)
Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMOF and explain how it improves Spark analytics performance.
3:50pm - 4:30pm Thursday, March 28, 2019
Igor Canadi (Rockset), Dhruba Borthakur (Rockset)
Average rating: ****.
(4.00, 1 rating)
Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called Rockset that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines.
4:40pm - 5:20pm Thursday, March 28, 2019
Patrick Stuedi (IBM Research)
Average rating: ****.
(4.00, 1 rating)
Modern networking and storage technologies like RDMA or NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark.