Schedule: Streaming, realtime analytics, and IoT sessions: Big data conference & machine learning training

9:00am–5:00pm Monday, March 25 & Tuesday, March 26, 2019

Professional Kafka development

Data Engineering & Architecture
Location: 3016

Jesse Anderson (Big Data Institute)

Average rating:

(3.00, 1 rating)

Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem.

9:00am–12:30pm Tuesday, March 26, 2019

Introduction to Flink via Flink SQL

Data Engineering & Architecture, Streaming and IoT
Location: 2004

Fabian Hueske (Ververica)

Average rating:

(5.00, 1 rating)

Fabian Hueske offers an overview of Apache Flink via the SQL interface, covering stream processing and Flink's various modes of use. Then you'll use Flink to run SQL queries on data streams and contrast this with the Flink DataStream API.

1:30pm–5:00pm Tuesday, March 26, 2019

Learning Presto: SQL on anything

Data Engineering & Architecture
Location: 2004

Matt Fuller (Starburst)

Average rating:

(3.57, 7 ratings)

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today.

1:30pm–5:00pm Tuesday, March 26, 2019

Architecture and algorithms for end-to-end streaming data processing

Data Engineering & Architecture
Location: 2005

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Average rating:

(2.67, 12 ratings)

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.

11:00am–11:40am Wednesday, March 27, 2019

Live Aggregators: A scalable, cost-effective, and reliable way of aggregating billions of messages in real time

Data Engineering & Architecture
Location: 2006

Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)

Average rating:

(4.67, 3 ratings)

Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA)—a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions due to running over AWS Spot Instances and having 70% CPU utilization.

11:50am–12:30pm Wednesday, March 27, 2019

Enabling insights and analytics with data streaming architectures and pipelines using Kafka and Hadoop

Data Engineering & Architecture
Location: 2006

Mohammad Quraishi (Cigna)

Average rating:

(4.60, 5 ratings)

In a large global health services company, streaming data for processing and sharing comes with its own challenges. Data science and analytics platforms need data fast, from relevant sources, to act on this data quickly and share the insights with consumers with the same speed and urgency. Join Mohammad Quraishi to learn why streaming data architectures are a necessity—Kafka and Hadoop are key.

11:50am–12:30pm Wednesday, March 27, 2019

Accelerating analytical antelopes: Integrating Apache Kudu's RPC into Apache Impala

Data Engineering & Architecture
Location: 2004

Lars Volker (Cloudera), Michael Ho (Cloudera)

Average rating:

(4.50, 6 ratings)

In recent years, Apache Impala has been deployed to clusters that are large enough to hit architectural limitations in the stack. Lars Volker and Michael Ho cover the efforts to address the scalability limitations in the now legacy Thrift RPC framework by using Apache Kudu's RPC, which was built from the ground up to support asynchronous communication, multiplexed connections, TLS, and Kerberos.

2:40pm–3:20pm Wednesday, March 27, 2019

Real-time analytics at Uber: Bring SQL into everything

Data Engineering & Architecture
Location: 2004

Zhenxiao Luo (Twitter)

Average rating:

(4.09, 11 ratings)

From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Zhenxiao Luo explains how Uber supports real-time analytics with deep learning on the fly, without any data copying.

2:40pm–3:20pm Wednesday, March 27, 2019

Cruise Control: Effortless management of Kafka clusters

Data Engineering & Architecture
Location: 2006

Adem Efe Gencer (LinkedIn)

Average rating:

(3.50, 2 ratings)

Adem Efe Gencer explains how LinkedIn alleviated the management overhead of large-scale Kafka clusters using Cruise Control.

4:20pm–5:00pm Wednesday, March 27, 2019

Put Kafka in jail with Strimzi

Data Engineering & Architecture
Location: 2006

Sean Glover (Lightbend)

Average rating:

(4.00, 1 rating)

The best way to run stateful services with complex operational needs like Kafka is to use the operator pattern. Sean Glover offers an overview of the Strimzi Kafka Operator, a popular new open source Operator-based Apache Kafka implementation on Kubernetes.

4:20pm–5:00pm Wednesday, March 27, 2019

From flat files to deconstructed databases: The evolution and future of the big data ecosystem

Data Engineering & Architecture
Location: 2004

Julien Le Dem (WeWork)

Average rating:

(4.83, 6 ratings)

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.

5:10pm–5:50pm Wednesday, March 27, 2019

Real-time analytics on deep learning: When TensorFlow met Presto at Uber

Data Science, Machine Learning & AI
Location: 2016

Zhenxiao Luo (Twitter)

Average rating:

(4.00, 4 ratings)

From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Inside Uber, analysts are using deep learning and big data to train models, make predictions, and run analytics in real time. Zhenxiao Luo explains how Uber runs real-time analytics with deep learning.

5:10pm–5:50pm Wednesday, March 27, 2019

The magic behind your Lyft ride prices: A case study on machine learning and streaming

Data Science, Machine Learning & AI
Location: 2009

Rakesh Kumar (Lyft), Thomas Weise (Lyft)

Average rating:

(4.00, 3 ratings)

Rakesh Kumar and Thomas Weise explore how Lyft dynamically prices its rides with a combination of various data sources, ML models, and streaming infrastructure for low latency, reliability, and scalability—allowing the pricing system to be more adaptable to real-world changes.

5:10pm–5:50pm Wednesday, March 27, 2019

Critical turbine maintenance: Monitoring and diagnosing planes and power plants in real time

Data Engineering & Architecture, Streaming and IoT
Location: 2006

June Andrews (GE), John Rutherford (GE)

Average rating:

(4.50, 2 ratings)

GE produces a third of the world's power and 60% of its airplane engines—a critical portion of the world's infrastructure that requires meticulous monitoring of the hundreds of sensors streaming data from each turbine. June Andrews and John Rutherford explain how GE's monitoring and diagnostics teams released the first real-time ML systems used to determine turbine health into production.

11:00am–11:40am Thursday, March 28, 2019

Presto: Tuning performance of SQL-on-anything analytics

Data Engineering & Architecture
Location: 2004

Kamil Bajda-Pawlikowski (Starburst), Martin Traverso (Presto Software Foundation)

Average rating:

(3.33, 3 ratings)

Kamil Bajda-Pawlikowski and Martin Traverso explore Presto's recently introduced cost-based optimizer, which must account for heterogeneous inputs with differing and often incomplete data statistics, and detail use cases for Presto across several industries. They also share recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward.

11:00am–11:40am Thursday, March 28, 2019

How Zhaopin.com built its enterprise event bus using Apache Pulsar

Data Engineering & Architecture, Streaming and IoT
Location: 2006

Sijie Guo (StreamNative), Penghui Li (Zhaopin)

Average rating:

(4.00, 1 rating)

Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a certain set of features. Sijie Guo and Penghui Li discuss the event bus requirements for Zhaopin.com, one of China's biggest online recruitment services providers, and explain why the company chose Apache Pulsar.

11:50am–12:30pm Thursday, March 28, 2019

Flink SQL in action

Data Engineering & Architecture, Streaming and IoT
Location: 2004

Fabian Hueske (Ververica)

Average rating:

(4.30, 10 ratings)

Processing streaming data with SQL is becoming increasingly popular. Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. He then shares a selection of common use cases and demonstrates how easily they can be addressed with Flink SQL.

11:50am–12:30pm Thursday, March 28, 2019

How Netflix measures app performance on 250 million unique devices across 190 countries

Data Engineering & Architecture
Location: 2006

Vivek Pasari (Netflix), Jitender Aswani (Netflix)

Average rating:

(3.14, 7 ratings)

Netflix has over 125 million members spread across 191 countries. Each day its members interact with its client applications on 250 million+ devices under highly variable network conditions. These interactions result in over 200 billion daily data points. Vivek Pasari dives into the data engineering and architecture that enables application performance measurement at this scale.

1:50pm–2:30pm Thursday, March 28, 2019

Faster ML over joins of tables

Data Engineering & Architecture
Location: 2008

Arun Kumar (University of California, San Diego)

Average rating:

(4.00, 2 ratings)

Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python.

1:50pm–2:30pm Thursday, March 28, 2019

Spark adaptive execution: Unleash the power of Spark SQL

Data Engineering & Architecture
Location: 2004

Haifeng Chen (Intel)

Average rating:

(4.00, 3 ratings)

Spark SQL is widely used, but it still suffers from stability and performance challenges in highly dynamic environments with large-scale data. Haifeng Chen shares a Spark adaptive execution engine built to address these challenges. It can handle task parallelism, join conversion, and data skew dynamically during runtime, guaranteeing the best plan is chosen using runtime statistics.

1:50pm–2:30pm Thursday, March 28, 2019

Performant time series data management and analytics with Postgres

Data Engineering & Architecture
Location: 2006

Matvey Arye (TimescaleDB)

Average rating:

(3.75, 4 ratings)

Matvey Arye offers an overview of two newly released features of TimescaleDB—automated adaptation of time-partitioning intervals and continuous aggregations in near real time—and discusses how these capabilities ease time series data management. Along the way, he also shares real-world use cases, including TimescaleDB's use with other technologies such as Kafka.

2:40pm–3:20pm Thursday, March 28, 2019

Bullet: Querying streaming data in transit with sketches

Data Engineering & Architecture
Location: 2006

Akshai Sarma (Yahoo), Nathan Speidel (Yahoo)

Average rating:

(3.67, 3 ratings)

Akshai Sarma and Nathan Speidel offer an overview of Bullet, a scalable, pluggable, lightweight, multitenant query system that runs on any data flowing through a streaming system without storing it. Using sketch-based algorithms, Bullet efficiently supports otherwise intractable operations such as top K, count distinct, and windowing without any storage.

3:50pm–4:30pm Thursday, March 28, 2019

Executive Briefing: What it takes to use machine learning in fast data pipelines

Executive Briefing and best practices, Strata Business Summit
Location: 2020

Dean Wampler (Anyscale)

Average rating:

(4.33, 6 ratings)

Your team is building machine learning capabilities. Dean Wampler demonstrates how to integrate these capabilities into streaming data pipelines so you can leverage the results quickly and update them as needed. He also covers challenges such as how to build long-running services that are very reliable and scalable and how to combine a spectrum of very different tools, from data science to operations.

3:50pm–4:30pm Thursday, March 28, 2019

Real-time monitoring of Twitter's network infrastructure with Heron

Data Engineering & Architecture
Location: 2024

Julien Delange (Twitter), Neng Lu (Twitter)

Average rating:

(2.67, 3 ratings)

Julien Delange and Neng Lu explain how Twitter uses the Heron stream processing engine to monitor and analyze its network infrastructure—implementing a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. Join in to learn the key technologies used, the architecture, and the challenges Twitter faced.

3:50pm–4:30pm Thursday, March 28, 2019

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric

Data Engineering & Architecture
Location: 2008

Yuan Zhou (Intel), Haodong Tang (Intel), Jian Zhang (Intel)

Average rating:

(3.33, 3 ratings)

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMoF and explain how it improves Spark analytics performance.

3:50pm–4:30pm Thursday, March 28, 2019

ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics

Data Engineering & Architecture
Location: 2002

Igor Canadi (Rockset), Dhruba Borthakur (Rockset)

Average rating:

(4.00, 1 rating)

Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called ROCKSET that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines.

4:40pm–5:20pm Thursday, March 28, 2019

Data processing at the speed of 100 Gbps using Apache Crail

Data Engineering & Architecture
Location: 2008

Patrick Stuedi (IBM Research)

Average rating:

(4.00, 1 rating)

Modern networking and storage technologies like RDMA or NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark.

4:40pm–5:20pm Thursday, March 28, 2019

Machine learning for preventive maintenance of mining haul trucks

Data Science, Machine Learning & AI
Location: 2009

Alex Gorbachev (Pythian), Paul Spiegelhalter (Pythian)

Average rating:

(4.67, 3 ratings)

Alex Gorbachev and Paul Spiegelhalter use the example of a mining haul truck to explain how to map preventive maintenance needs to supervised machine learning problems, create labeled datasets, do feature engineering from sensors and alerts data, evaluate models—then convert it all to a complete AI solution on Google Cloud Platform that's integrated with existing on-premises systems.

4:40pm–5:20pm Thursday, March 28, 2019

Bringing data to life: Combining machine learning and art to tell a data story

Case studies
Location: 2007

Nancy Rausch (SAS)

Average rating:

(4.80, 5 ratings)

For data to be meaningful, it needs to be presented in a way that people can relate to. Nancy Rausch explains how she combined streaming data from a solar array and machine learning techniques to create a live-action art piece—an approach that helped bring the data to life in a fun and compelling way.
