Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Schedule: Streaming systems and real-time applications sessions

Data collected and generated by things—including the difficulties of storing, analyzing, and publishing such information; and the challenges of extracting understandable, meaningful insights from the resulting torrent.

9:00am–12:30pm Tuesday, March 6, 2018
Location: 210 B/F Level: Beginner
Secondary topics:  Graphs and Time-series
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (Streamlio), Arun Kejariwal (MZ)
Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.
9:00am–12:30pm Tuesday, March 6, 2018
Location: 210 C/G Level: Beginner
Tim Berglund (Confluent)
Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data.
9:00am–5:00pm Tuesday, March 6, 2018
Location: LL20 A
Madhav Madaboosi (BP), Meenakshisundaram Thandavarayan (Infosys), Matt Conners (Microsoft), Katie Malone (Civis Analytics), Mike Prorock (mesur.io), Thomas Miller (Northwestern University), Ann Nguyen (Whole Whale), Rajiv Synghal (Kaiser Permanente), Valentin Bercovici (Pencil Data Inc.), Wayde Fleener (General Mills), Joe Dumoulin (Next IT), Jules Malin (GoPro), Taylor Martin (O'Reilly Media), Divya Ramachandran (Captricity)
Hear practical insights from household brands and global companies: the challenges they tackled, the approaches they took, and the benefits (and drawbacks) of their solutions.
1:30pm–5:00pm Tuesday, March 6, 2018
Location: 210 C/G Level: Intermediate
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.
11:00am–11:40am Wednesday, March 7, 2018
Location: 230 A Level: Intermediate
Secondary topics:  Graphs and Time-series
Shivnath Babu (Duke University | Unravel Data Systems), Sumit Jindal (Unravel Data Systems)
Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Sumit Jindal explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka.
11:50am–12:30pm Wednesday, March 7, 2018
Location: LL21 E/F Level: Intermediate
Manu Mukerji (Criteo)
Criteo is a global leader in commerce marketing. Manu Mukerji walks you through Criteo's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, evaluated (automatically), and used; the issues that arise when applying ML at scale in production; lessons learned; and more.
11:50am–12:30pm Wednesday, March 7, 2018
Location: 230 A Level: Intermediate
Secondary topics:  Graphs and Time-series
William Chambers (Databricks), Michael Armbrust (Databricks)
William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud.
11:50am–12:30pm Wednesday, March 7, 2018
Location: LL20 A Level: Intermediate
Dan Crankshaw (UC Berkeley RISE Lab)
Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes.
11:50am–12:30pm Wednesday, March 7, 2018
Location: LL20 D Level: Intermediate
Diane Chang (Intuit)
When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for the end users. Diane Chang shares how Intuit's data science team preps, cleans, organizes, and augments training data along with best practices she's learned along the way.
11:50am–12:30pm Wednesday, March 7, 2018
Location: 212 A-B Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Eugene Kirpichov (Google)
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive SplittableDoFn.
1:50pm–2:30pm Wednesday, March 7, 2018
Location: 230 A Level: Intermediate
Debasish Ghosh (Lightbend)
Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and how they can be used to implement solutions for fast data and streaming architectures.
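The abstract does not name the structures the session covers, so the following is only an illustrative guess at the genre: a minimal Count-Min sketch, which estimates item frequencies over a stream using a fixed-size table instead of one counter per distinct item. The class name `CountMinSketch` and the `width` and `depth` defaults are hypothetical choices made for this sketch.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator (illustrative; names are hypothetical)."""

    def __init__(self, width=1024, depth=4):
        self.width = width    # counters per row
        self.depth = depth    # number of independent hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate and never falls below the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch()
for word in ["kafka", "flink", "kafka", "heron", "kafka"]:
    sketch.add(word)
print(sketch.estimate("kafka"))
```

Memory use depends only on `width` and `depth`, not on the number of distinct items seen; the trade-off is that estimates may overshoot when items collide.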
1:50pm–2:30pm Wednesday, March 7, 2018
Location: 230 C Level: Beginner
Siddharth Teotia (Dremio)
Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.
1:50pm–2:30pm Wednesday, March 7, 2018
Location: 212 A-B Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Jordan Hambleton (Cloudera), Guru Medasani (Cloudera)
When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.
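To make the idea of offset management concrete, here is a small simulation in plain Python. It does not use Kafka or any Kafka client API, and it is not taken from the session; the function `consume`, the `store` dictionary, and the `crash_at` parameter are all hypothetical names invented for this sketch. It shows one common pattern: record the offset only after a record has been processed, so that a restart resumes from the last recorded position.

```python
def consume(log, store, process, crash_at=None):
    """Process records starting from the offset saved in `store`.

    `log` stands in for a partition, `store` for durable offset storage,
    and `crash_at` simulates a failure before that offset is processed.
    """
    offset = store.get("offset", 0)  # resume where the last run stopped
    while offset < len(log):
        if offset == crash_at:
            raise RuntimeError("simulated crash")
        process(log[offset])
        offset += 1
        store["offset"] = offset  # save progress only after processing


log = ["a", "b", "c", "d"]
store = {}
seen = []

try:
    consume(log, store, seen.append, crash_at=2)
except RuntimeError:
    pass  # the application "restarts" below

consume(log, store, seen.append)
print(seen, store)
```

Because progress is saved after processing, a crash between the two steps would cause a record to be processed again on restart rather than skipped; saving the offset first would make the opposite trade-off.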
2:40pm–3:20pm Wednesday, March 7, 2018
Location: 230 A Level: Intermediate
Henry Cai (Pinterest), Yi Yin (Pinterest)
With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.
2:40pm–3:20pm Wednesday, March 7, 2018
Location: Expo Hall 1 Level: Advanced
Secondary topics:  Expo Hall, Graphs and Time-series
Yu Xu (TigerGraph)
Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups.
4:20pm–5:00pm Wednesday, March 7, 2018
Location: 230 A Level: Beginner
Secondary topics:  Graphs and Time-series
Sijie Guo (Streamlio)
Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage.
5:10pm–5:50pm Wednesday, March 7, 2018
Location: 230 A Level: Intermediate
Fabian Hueske (data Artisans), Shuyi Chen (Uber)
Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges.
5:10pm–5:50pm Wednesday, March 7, 2018
Location: LL20 A Level: Intermediate
Secondary topics:  Graphs and Time-series
Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)
Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data, focusing on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs.
11:00am–11:40am Thursday, March 8, 2018
Location: 230 A Level: Intermediate
Tyler Akidau (Google)
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general.
11:00am–11:40am Thursday, March 8, 2018
Location: Expo Hall 1 Level: Intermediate
Secondary topics:  Expo Hall
Dean Wampler (Lightbend)
Dean Wampler explores two microservice streaming applications based on Kafka to compare and contrast using Akka Streams and Kafka Streams for data processing. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead.
11:50am–12:30pm Thursday, March 8, 2018
Location: 230 A Level: Beginner
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio)
Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and discuss how applications can benefit.
1:50pm–2:30pm Thursday, March 8, 2018
Location: LL21 C/D Level: Intermediate
Emre Velipasaoglu (Lightbend)
Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications.
1:50pm–2:30pm Thursday, March 8, 2018
Location: LL20 D Level: Intermediate
Jennifer Prendki (Atlassian)
Jennifer Prendki explains how to develop machine learning models even if the data is protected by privacy and compliance laws and cannot be used without anonymizing, covering techniques ranging from contextual bandits to document vector representation.
1:50pm–2:30pm Thursday, March 8, 2018
Location: 230 C Level: Intermediate
Holden Karau (Google), Rachel Warren (Salesforce Einstein)
Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
1:50pm–2:30pm Thursday, March 8, 2018
Location: 210 D/H Level: Intermediate
Michael Lysaght (Weight Watchers), Steven Levine (Weight Watchers)
For organizations stuck in a myriad of legacy infrastructure, the path to AI and deep learning seems impossible. Michael Lysaght, Steven Levine, and Nicolas Chikhani discuss Weight Watchers's transition from a traditional BI organization to one that uses data effectively, covering the company's needs, the changes that were required, and the technologies and architecture used to achieve its goals.
1:50pm–2:30pm Thursday, March 8, 2018
Location: Expo Hall 1 Level: Intermediate
Secondary topics:  Expo Hall, Graphs and Time-series
Roy Ben-Alta (Amazon Web Services), Ira Cohen (Anodot)
Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data to metrics and in automatically analyzing the metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution.
2:40pm–3:20pm Thursday, March 8, 2018
Location: 230 A Level: Beginner
Secondary topics:  Graphs and Time-series
Fabian Hueske (data Artisans), Flavio Junqueira (Dell EMC)
Flavio Junqueira and Fabian Hueske detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The stack offers an unprecedented way of handling “everything as a stream,” including unbounded streaming storage and a unified batch and streaming abstraction, and it dynamically accommodates workload variations in a novel way.
2:40pm–3:20pm Thursday, March 8, 2018
Location: Expo Hall 1 Level: Intermediate
Secondary topics:  Expo Hall
Chris Fregly (PipelineAI)
Chris Fregly demonstrates how to extend existing Spark-based data pipelines to include TensorFlow model training and deployment and offers an overview of TensorFlow’s TFRecord format, including libraries for converting to and from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3.
4:20pm–5:00pm Thursday, March 8, 2018
Location: LL21 C/D Level: Advanced
Kimoon Kim (Pepperdata), Ilan Filonenko (Bloomberg LP)
There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support.
4:20pm–5:00pm Thursday, March 8, 2018
Location: 230 A Level: Beginner
Matteo Merli (Streamlio)
Traditionally, messaging systems have offered at-least-once delivery semantics, leaving the task of implementing idempotent processing to the application developers. Matteo Merli explains how to add effectively once semantics to Apache Pulsar using a message deduplication layer that can ensure those stricter semantics with guaranteed accuracy and no performance penalty.
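As a rough illustration of what a deduplication layer does, the sketch below drops redelivered messages by tracking the highest sequence number accepted from each producer. It is a generic toy in plain Python, not Pulsar's implementation or API; the class `DedupLog` and its `publish` method are hypothetical names, and it assumes each producer numbers its messages with increasing integers.

```python
class DedupLog:
    """Append-only log that ignores retried messages (illustrative only)."""

    def __init__(self):
        self.messages = []   # payloads accepted so far
        self.last_seq = {}   # producer name -> highest accepted sequence number

    def publish(self, producer, seq, payload):
        # A retry after a lost acknowledgement reuses an old sequence number,
        # so anything at or below the last accepted number is a duplicate.
        if seq <= self.last_seq.get(producer, -1):
            return False
        self.messages.append(payload)
        self.last_seq[producer] = seq
        return True


log = DedupLog()
log.publish("producer-1", 0, "order created")
log.publish("producer-1", 1, "order paid")
log.publish("producer-1", 1, "order paid")  # retry of the same message
print(log.messages)
```

With duplicates filtered on the way in, at-least-once delivery from the producer yields a log in which each message appears once, which is the effect the abstract calls effectively once.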
4:20pm–5:00pm Thursday, March 8, 2018
Location: 210 C/G Level: Beginner
Felix Gorodishter (GoDaddy)
GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email.