Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Schedule: Streaming, realtime analytics, and IoT sessions

9:00am - 5:00pm Monday, March 25 & Tuesday, March 26
Jesse Anderson (Big Data Institute)
Average rating: ***..
(3.00, 1 rating)
Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem.
9:00am - 12:30pm Tuesday, March 26, 2019
Fabian Hueske (Ververica)
Average rating: *****
(5.00, 1 rating)
Fabian Hueske offers an overview of Apache Flink via the SQL interface, covering stream processing and Flink's various modes of use. Then you'll use Flink to run SQL queries on data streams and contrast this with the Flink DataStream API.
1:30pm - 5:00pm Tuesday, March 26, 2019
Matt Fuller (Starburst)
Average rating: ***..
(3.57, 7 ratings)
Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today.
1:30pm - 5:00pm Tuesday, March 26, 2019
Arun Kejariwal (Facebook), Karthik Ramasamy (Streamlio)
Average rating: **...
(2.67, 12 ratings)
Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline (messaging, compute, and storage) for real-time data, as well as algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.
11:00am - 11:40am Wednesday, March 27, 2019
Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)
Average rating: ****.
(4.67, 3 ratings)
Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA), a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions because it runs on AWS Spot Instances and achieves 70% CPU utilization.
11:50am - 12:30pm Wednesday, March 27, 2019
Average rating: ****.
(4.60, 5 ratings)
In a large global health services company, streaming data for processing and sharing comes with its own challenges. Data science and analytics platforms need data fast, from relevant sources, to act on this data quickly and share the insights with consumers with the same speed and urgency. Join Mohammad Quraishi to learn why streaming data architectures are a necessity and why Kafka and Hadoop are key.
11:50am - 12:30pm Wednesday, March 27, 2019
Lars Volker (Cloudera), Michael Ho (Cloudera)
Average rating: ****.
(4.50, 6 ratings)
In recent years, Apache Impala has been deployed to clusters that are large enough to hit architectural limitations in the stack. Lars Volker and Michael Ho cover the efforts to address the scalability limitations in the now-legacy Thrift RPC framework by using Apache Kudu's RPC, which was built from the ground up to support asynchronous communication, multiplexed connections, TLS, and Kerberos.
2:40pm - 3:20pm Wednesday, March 27, 2019
Zhenxiao Luo (Uber)
Average rating: ****.
(4.09, 11 ratings)
From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Zhenxiao Luo explains how Uber supports real-time analytics with deep learning on the fly, without any data copying.
2:40pm - 3:20pm Wednesday, March 27, 2019
Adem Efe Gencer (LinkedIn)
Average rating: ***..
(3.50, 2 ratings)
Adem Efe Gencer explains how LinkedIn alleviated the management overhead of large-scale Kafka clusters using Cruise Control.
4:20pm - 5:00pm Wednesday, March 27, 2019
Sean Glover (Lightbend)
Average rating: ****.
(4.00, 1 rating)
The best way to run stateful services with complex operational needs like Kafka is to use the operator pattern. Sean Glover offers an overview of the Strimzi Kafka Operator, a popular new open source Operator-based Apache Kafka implementation on Kubernetes.
4:20pm - 5:00pm Wednesday, March 27, 2019
Julien Le Dem (WeWork)
Average rating: ****.
(4.83, 6 ratings)
Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.
5:10pm - 5:50pm Wednesday, March 27, 2019
Zhenxiao Luo (Uber)
Average rating: ****.
(4.00, 4 ratings)
From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Inside Uber, analysts are using deep learning and big data to train models, make predictions, and run analytics in real time. Zhenxiao Luo explains how Uber runs real-time analytics with deep learning.
5:10pm - 5:50pm Wednesday, March 27, 2019
Rakesh Kumar (Lyft), Thomas Weise (Lyft)
Average rating: ****.
(4.00, 3 ratings)
Rakesh Kumar and Thomas Weise explore how Lyft dynamically prices its rides with a combination of various data sources, ML models, and streaming infrastructure for low latency, reliability, and scalability, allowing the pricing system to be more adaptable to real-world changes.
5:10pm - 5:50pm Wednesday, March 27, 2019
Average rating: ****.
(4.50, 2 ratings)
GE produces a third of the world's power and 60% of its airplane engines, a critical portion of the world's infrastructure that requires meticulous monitoring of the hundreds of sensors streaming data from each turbine. June Andrews and John Rutherford explain how GE's monitoring and diagnostics teams released the first real-time ML systems used to determine turbine health into production.
11:00am - 11:40am Thursday, March 28, 2019
Kamil Bajda-Pawlikowski (Starburst), Martin Traverso (Presto Software Foundation)
Average rating: ***..
(3.33, 3 ratings)
Kamil Bajda-Pawlikowski and Martin Traverso explore Presto's recently introduced cost-based optimizer, which must account for heterogeneous inputs with differing and often incomplete data statistics, and detail use cases for Presto across several industries. They also share recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward.
11:00am - 11:40am Thursday, March 28, 2019
Sijie Guo (ASF), Penghui Li (Zhaopin)
Average rating: ****.
(4.00, 1 rating)
Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a certain set of features. Sijie Guo and Penghui Li discuss the event bus requirements for Zhaopin.com, one of China's biggest online recruitment services providers, and explain why the company chose Apache Pulsar.
11:50am - 12:30pm Thursday, March 28, 2019
Fabian Hueske (Ververica)
Average rating: ****.
(4.30, 10 ratings)
Processing streaming data with SQL is becoming increasingly popular. Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. He then shares a selection of common use cases and demonstrates how easily they can be addressed with Flink SQL.
11:50am - 12:30pm Thursday, March 28, 2019
Vivek Pasari (Netflix), Jitender Aswani (Netflix)
Average rating: ***..
(3.14, 7 ratings)
Netflix has over 125 million members spread across 191 countries. Each day its members interact with its client applications on 250 million+ devices under highly variable network conditions. These interactions result in over 200 billion daily data points. Vivek Pasari dives into the data engineering and architecture that enables application performance measurement at this scale.
1:50pm - 2:30pm Thursday, March 28, 2019
Arun Kumar (University of California, San Diego)
Average rating: ****.
(4.00, 2 ratings)
Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python.
1:50pm - 2:30pm Thursday, March 28, 2019
Haifeng Chen (Intel)
Average rating: ****.
(4.00, 3 ratings)
Spark SQL is widely used, but it still suffers from stability and performance challenges in highly dynamic environments with large-scale data. Haifeng Chen shares a Spark adaptive execution engine built to address these challenges. It can handle task parallelism, join conversion, and data skew dynamically during runtime, guaranteeing the best plan is chosen using runtime statistics.
1:50pm - 2:30pm Thursday, March 28, 2019
Matvey Arye (TimescaleDB)
Average rating: ***..
(3.75, 4 ratings)
Matvey Arye offers an overview of two newly released features of TimescaleDB (automated adaptation of time-partitioning intervals and continuous aggregations in near real time) and discusses how these capabilities ease time series data management. Along the way, he also shares real-world use cases, including TimescaleDB's use with other technologies such as Kafka.
2:40pm - 3:20pm Thursday, March 28, 2019
Akshai Sarma (Yahoo), Nathan Speidel (Yahoo)
Average rating: ***..
(3.67, 3 ratings)
Akshai Sarma and Nathan Speidel offer an overview of Bullet, a scalable, pluggable, lightweight, multitenant query system that runs on any data flowing through a streaming system without storing it. Using sketch-based algorithms, Bullet efficiently supports otherwise intractable operations like top K, count distinct, and windowing without any storage.
3:50pm - 4:30pm Thursday, March 28, 2019
Dean Wampler (Lightbend)
Average rating: ****.
(4.33, 6 ratings)
Your team is building machine learning capabilities. Dean Wampler demonstrates how to integrate these capabilities in streaming data pipelines so you can leverage the results quickly and update them as needed. He also covers challenges such as how to build long-running services that are very reliable and scalable and how to combine a spectrum of very different tools, from data science to operations.
3:50pm - 4:30pm Thursday, March 28, 2019
Julien Delange (Twitter), Neng Lu (Twitter)
Average rating: **...
(2.67, 3 ratings)
Julien Delange and Neng Lu explain how Twitter uses the Heron stream processing engine to monitor and analyze its network infrastructure, implementing a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. Join in to learn the key technologies used, the architecture, and the challenges Twitter faced.
3:50pm - 4:30pm Thursday, March 28, 2019
Yuan Zhou (Intel), Haodong Tang (Intel), Jian Zhang (Intel)
Average rating: ***..
(3.33, 3 ratings)
Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMOF and explain how it improves Spark analytics performance.
3:50pm - 4:30pm Thursday, March 28, 2019
Igor Canadi (Rockset), Dhruba Borthakur (Rockset)
Average rating: ****.
(4.00, 1 rating)
Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called Rockset that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines.
4:40pm - 5:20pm Thursday, March 28, 2019
Patrick Stuedi (IBM Research)
Average rating: ****.
(4.00, 1 rating)
Modern networking and storage technologies like RDMA or NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark.
4:40pm - 5:20pm Thursday, March 28, 2019
Alex Gorbachev (Pythian), Paul Spiegelhalter (Pythian)
Average rating: ****.
(4.67, 3 ratings)
Alex Gorbachev and Paul Spiegelhalter use the example of a mining haul truck to explain how to map preventive maintenance needs to supervised machine learning problems, create labeled datasets, do feature engineering from sensor and alert data, evaluate models, and then convert it all to a complete AI solution on Google Cloud Platform that's integrated with existing on-premises systems.
4:40pm - 5:20pm Thursday, March 28, 2019
Case studies
Location: 2007
Nancy Rausch (SAS Institute)
Average rating: ****.
(4.80, 5 ratings)
For data to be meaningful, it needs to be presented in a way that people can relate to. Nancy Rausch explains how she combined streaming data from a solar array and machine learning techniques to create a live-action art piece, an approach that helped bring the data to life in a fun and compelling way.