Presented By O’Reilly and Cloudera

Make Data Work

March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Schedule: Streaming systems and real-time applications sessions

Sessions on data collected and generated by things, covering the difficulties of storing, analyzing, and publishing such information and the challenges of extracting understandable, meaningful insights from the resulting torrent.

9:00am–12:30pm Tuesday, March 6, 2018

Modern real-time streaming architectures

Location: 210 B/F

Secondary topics: Graphs and Time-series

Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (StreamNative), Arun Kejariwal (Independent)

Average rating:

(5.00, 2 ratings)

Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

Stream processing with Kafka

Location: 210 C/G

Tim Berglund (Confluent)

Average rating:

(4.36, 11 ratings)

Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data. Read more.

9:00am–5:00pm Tuesday, March 6, 2018

Data Case Studies

Location: LL20 A

Madhav Madaboosi (BP), Meenakshisundaram Thandavarayan (Infosys), Matt Conners (Microsoft), Katie Malone (Civis Analytics), Mike Prorock (mesur.io), Thomas Miller (Northwestern University), Ann Nguyen (Whole Whale), Jennie Shin (Kaiser Permanente), Valentin Bercovici (PencilDATA), Wayde Fleener (General Mills), Joe Dumoulin (Next IT), Jules Malin (GoPro), Taylor Martin (O'Reilly Media), Divya Ramachandran (Captricity)

Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions. Read more.

1:30pm–5:00pm Tuesday, March 6, 2018

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams

Location: 210 C/G

Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)

Average rating:

(3.50, 2 ratings)

Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.

11:00am–11:40am Wednesday, March 7, 2018

Using machine learning to simplify Kafka operations

Location: 230 A

Secondary topics: Graphs and Time-series

Shivnath Babu (Duke University | Unravel Data Systems), Dhruv Goel (Microsoft)

Average rating:

(4.50, 2 ratings)

Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Dhruv Goel explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Machine learning versus machine learning in production

Location: LL21 E/F

Manu Mukerji (8x8)

Average rating:

(4.22, 9 ratings)

Acme Corporation is a global leader in commerce marketing. Manu Mukerji walks you through Acme Corporation's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; issues that arise when applying ML at scale in production; lessons learned; and more. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Streaming big data in the cloud: What to consider and why

Location: 230 A

Secondary topics: Graphs and Time-series

Bill Chambers (Databricks), Michael Armbrust (Databricks)

Average rating:

(4.60, 5 ratings)

William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Deploying and monitoring interactive machine learning applications with Clipper

Location: LL20 A

Dan Crankshaw (UC Berkeley RISELab)

Average rating:

(4.25, 4 ratings)

Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Want to build a better chatbot? Start with your data.

Location: LL20 D

Andrew Mattarella-Micke (Intuit)

Average rating:

(5.00, 1 rating)

When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for the end users. Andrew Mattarella-Micke shares how Intuit's data science team preps, cleans, organizes, and augments training data along with best practices he's learned along the way. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Radically modular data ingestion APIs in Apache Beam

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Eugene Kirpichov (Google)

Average rating:

(4.75, 4 ratings)

Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Approximation data structures in streaming data processing

Location: 230 A

Debasish Ghosh (Lightbend)

Average rating:

(3.33, 3 ratings)

Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically, streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and shows how they can be used to implement solutions for fast and streaming architectures. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Vectorized query processing using Apache Arrow

Location: 230 C

Siddharth Teotia (Dremio)

Average rating:

(5.00, 1 rating)

Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Jordan Hambleton (Cloudera), GuruDharmateja Medasani (Domino Data Lab)

Average rating:

(4.25, 4 ratings)

When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously

Location: 230 A

Henry Cai (Pinterest), Yi Yin (Pinterest)

Average rating:

(3.00, 1 rating)

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Real-time deep link analytics: The next stage of graph analytics

Location: Expo Hall 1

Secondary topics: Expo Hall, Graphs and Time-series

Yu Xu (TigerGraph)

Average rating:

(5.00, 2 ratings)

Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Stream storage with Apache BookKeeper

Location: 230 A

Secondary topics: Graphs and Time-series

Sijie Guo (StreamNative)

Average rating:

(3.67, 3 ratings)

Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Streaming SQL to unify batch and stream processing: Theory and practice with Apache Flink at Uber

Location: 230 A

Fabian Hueske (data Artisans), Shuyi Chen (Uber)

Average rating:

(5.00, 1 rating)

Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Continuous machine learning over streaming data

Location: LL20 A

Secondary topics: Graphs and Time-series

Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)

Average rating:

(5.00, 8 ratings)

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs. Read more.

11:00am–11:40am Thursday, March 8, 2018

Foundations of streaming SQL; or, How I learned to love stream and table theory

Location: 230 A

Tyler Akidau (Google)

Average rating:

(5.00, 4 ratings)

What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general. Read more.

11:00am–11:40am Thursday, March 8, 2018

Kafka streaming applications with Akka Streams and Kafka Streams

Location: Expo Hall 1

Secondary topics: Expo Hall

Dean Wampler (Anyscale)

Average rating:

(5.00, 1 rating)

Dean Wampler compares and contrasts data processing with Akka Streams and Kafka Streams for microservice streaming applications based on Kafka. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Effectively once, exactly once, and more in Heron

Location: 230 A

Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio)

Average rating:

(4.00, 1 rating)

Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and demonstrate how your applications will benefit from using them. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Machine-learned model quality monitoring in fast data and streaming applications

Location: LL21 C/D

Emre Velipasaoglu (Lightbend)

Average rating:

(4.00, 1 rating)

Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

The science of patchy data

Location: LL20 D

Jennifer Prendki (Figure Eight)

Average rating:

(3.00, 1 rating)

Jennifer Prendki explains how to develop machine learning models even if the data is protected by privacy and compliance laws and cannot be used without anonymizing, covering techniques ranging from contextual bandits to document vector representation. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Playing well together: Big data beyond the JVM with Spark and friends

Location: 230 C

Holden Karau (Independent), Rachel Warren (Salesforce Einstein)

Average rating:

(3.40, 5 ratings)

Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka). Read more.

1:50pm–2:30pm Thursday, March 8, 2018

The real-time journey from raw streaming data to AI-based analytics

Location: Expo Hall 1

Secondary topics: Expo Hall, Graphs and Time-series

Roy Ben-Alta (Amazon Web Services), Ira Cohen (Anodot)

Average rating:

(5.00, 1 rating)

Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data into metrics and in automatically analyzing the metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Unified and elastic batch and stream processing with Pravega and Apache Flink

Location: 230 A

Secondary topics: Graphs and Time-series

Fabian Hueske (data Artisans), Flavio Junqueira (Dell EMC)

Average rating:

(3.33, 3 ratings)

Flavio Junqueira and Fabian Hueske detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The stack handles “everything as a stream,” offering unbounded streaming storage and a unified batch and streaming abstraction while dynamically accommodating workload variations. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Building ML and AI pipelines with Spark and TensorFlow

Location: Expo Hall 1

Secondary topics: Expo Hall

Chris Fregly (Amazon Web Services)

Average rating:

(5.00, 1 rating)

Chris Fregly demonstrates how to extend existing Spark-based data pipelines to include TensorFlow model training and deployment and offers an overview of TensorFlow’s TFRecord format, including libraries for converting to and from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

HDFS on Kubernetes: Tech deep dive on locality and security

Location: LL21 C/D

Kimoon Kim (Pepperdata), Ilan Filonenko (Bloomberg LP)

Average rating:

(5.00, 1 rating)

There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Effectively once in Apache Pulsar, the next-generation messaging system

Location: 230 A

Matteo Merli (Streamlio)

Average rating:

(1.00, 1 rating)

Traditionally, messaging systems have offered at-least-once delivery semantics, leaving the task of implementing idempotent processing to the application developers. Matteo Merli explains how to add effectively once semantics to Apache Pulsar using a message deduplication layer that can ensure those stricter semantics with guaranteed accuracy and no performance penalty. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Big data insights equal big money: Stories from the trenches at GoDaddy

Location: 210 C/G

Felix Gorodishter (GoDaddy)

Average rating:

(3.00, 2 ratings)

GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email. Read more.

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com