Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Schedule: Stream processing and analytics sessions

9:00am - 5:00pm Monday, March 13 & Tuesday, March 14
Location: 213
Secondary topics:  Architecture, Cloud
Jesse Anderson (Big Data Institute)
Average rating: ****.
(4.00, 1 rating)
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data, and how will you process it? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.
9:00am - 12:30pm Tuesday, March 14, 2017
Location: 210 A/E Level: Beginner
Secondary topics:  Streaming
Frances Perry (Google), Tyler Akidau (Google)
Average rating: ***..
(3.00, 2 ratings)
Come learn the basics of stream processing via a guided walkthrough of the most sophisticated and portable stream processing model on the planet: Apache Beam (incubating). Tyler Akidau and Frances Perry cover the basics of robust stream processing with the option to execute exercises on top of the runner of your choice: Flink, Spark, or Google Cloud Dataflow.
9:00am - 12:30pm Tuesday, March 14, 2017
Location: LL20 C
Edd Wilder-James (Google), Ellen Friedman (MapR Technologies), Jim Scott (NVIDIA), Gabriela Queiroz (R-Ladies), Melanie Warrick (Google), Aneesh Karve (Quilt)
Data 101 introduces you to core principles of data architecture, teaches you how to build and manage successful data teams, and inspires you to do more with your data through real-world applications. Setting the foundation for deeper dives on the following days of Strata + Hadoop World, Data 101 reinforces data fundamentals and helps you focus on how data can solve your business problems.
1:30pm - 5:00pm Tuesday, March 14, 2017
Location: 210 A/E Level: Intermediate
Secondary topics:  Streaming
Ian Wrigley (StreamSets)
Average rating: ****.
(4.83, 6 ratings)
Ian Wrigley demonstrates how Kafka Connect and Kafka Streams can be used together to build real-world, real-time streaming data pipelines. Using Kafka Connect, you'll ingest data from a relational database into Kafka topics as the data is being generated and then process and enrich the data in real time using Kafka Streams before writing it out for further analysis.
11:00am - 11:40am Wednesday, March 15, 2017
Location: LL20 C Level: Beginner
Secondary topics:  Streaming
Jay Kreps (Confluent)
Average rating: ***..
(3.70, 10 ratings)
The move to streaming architectures from batch processing is a revolution in how companies use data. But what is the state of the union for stream processing, and what gaps remain in the technology we have? How will this technology impact the architectures and applications of the future? Jay Kreps explores the future of Apache Kafka and the stream processing ecosystem.
11:00am - 11:40am Wednesday, March 15, 2017
Location: LL20 D Level: Advanced
Secondary topics:  Streaming
Kenneth Knowles (Google)
Average rating: ****.
(4.80, 5 ratings)
Unbounded, out-of-order, global-scale data is now the norm. Even for the same computation, each use case entails its own balance between completeness, latency, and cost. Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner, including Apache Spark, Apache Flink, and Google Cloud Dataflow.
11:50am - 12:30pm Wednesday, March 15, 2017
Location: LL20 C
Secondary topics:  Cloud
Roger Barga (Amazon Web Services)
Average rating: ****.
(4.00, 2 ratings)
Roger Barga offers an overview of Kinesis, Amazon’s data streaming platform, which includes Kinesis Firehose, Kinesis Analytics, and Kinesis Streams, and explains how customers have architected their applications using Kinesis services for low latency and extreme scale.
2:40pm - 3:20pm Wednesday, March 15, 2017
Location: LL20 C Level: Intermediate
Secondary topics:  Media, Streaming
Michael Edwards shares experiences from operating several Kafka clusters in a real-time streaming event ingestion pathway. He'll discuss the lessons learned from working with hundreds of terabytes flowing through every day, petabytes of retention, and gigabytes of historical data streaming to and from storage.
4:20pm - 5:00pm Wednesday, March 15, 2017
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture, Media, Platform
Sridhar Alla (BlueWhale), Shekhar Agrawal (Comcast)
Average rating: *****
(5.00, 2 ratings)
Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.
5:10pm - 5:50pm Wednesday, March 15, 2017
Location: LL20 C
Secondary topics:  Media, Streaming
Sijie Guo (Apache Software Foundation)
Average rating: **...
(2.00, 2 ratings)
Apache DistributedLog (incubating) is a low-latency, high-throughput replicated log service. Sijie Guo shares how Twitter has used DistributedLog as the real-time data foundation in production for years, supporting services like distributed databases, pub-sub messaging, and real-time stream computing and delivering more than 1.5 trillion events (17 PB) per day.
11:00am - 11:40am Thursday, March 16, 2017
Location: LL20 C Level: Beginner
Secondary topics:  Streaming
Tyler Akidau (Google)
Average rating: ***..
(3.00, 2 ratings)
Join Tyler Akidau for a whirlwind tour of the conceptual building blocks of massive-scale data processing systems over the last decade, comparing and contrasting systems at Google with popular open source systems in use today.
11:00am - 11:40am Thursday, March 16, 2017
Location: LL20 D Level: Intermediate
Secondary topics:  Data Platform, Media, Streaming
Bill Graham (Twitter), Avrilia Floratau (Microsoft), Ashvin Agrawal (Microsoft)
Twitter processes billions of events per day the instant the data is generated using Heron, an open source streaming engine tailored for large-scale environments. Bill Graham, Avrilia Floratau, and Ashvin Agrawal explore the techniques Heron uses to elastically scale resources in order to handle highly varying loads without sacrificing real-time performance or user experience.
11:50am - 12:30pm Thursday, March 16, 2017
Location: LL20 C Level: Intermediate
Secondary topics:  Streaming
Slava Chernyak (Google)
Average rating: *****
(5.00, 2 ratings)
Watermarks are a system for measuring progress and completeness in out-of-order streaming systems and are utilized to emit correct results in a timely manner. Given the trend toward out-of-order processing in existing streaming systems, watermarks are an increasingly important tool when designing streaming pipelines. Slava Chernyak explains watermarks and explores real-world applications.
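As a rough illustration of the watermark idea described in this abstract, here is a minimal Python sketch. It is not taken from the session or from any particular streaming system. It assumes a simple heuristic watermark (the largest event time seen so far minus a fixed allowed delay) and tumbling event-time windows; the function name `tumbling_counts` and its parameters `window_size` and `max_delay` are hypothetical.

```python
from collections import defaultdict


def tumbling_counts(events, window_size, max_delay):
    """Count events per tumbling event-time window.

    events: event timestamps (ints) in arrival order, possibly out of order.
    A window [start, start + window_size) is emitted once the watermark
    reaches its end. The watermark here is an assumed heuristic:
    max event time seen so far minus max_delay.
    Returns (emitted, dropped): emitted is a list of (window_start, count)
    in emission order; dropped lists timestamps that arrived too late.
    """
    open_windows = defaultdict(int)
    emitted, dropped = [], []
    max_seen = None
    for ts in events:
        max_seen = ts if max_seen is None else max(max_seen, ts)
        watermark = max_seen - max_delay
        start = ts - ts % window_size
        if start + window_size <= watermark:
            # The window for this event was already closed by the watermark.
            dropped.append(ts)
        else:
            open_windows[start] += 1
        # Emit every open window whose end the watermark has reached.
        closed = sorted(s for s in open_windows if s + window_size <= watermark)
        for s in closed:
            emitted.append((s, open_windows.pop(s)))
    # End of input: flush whatever is still open.
    for s in sorted(open_windows):
        emitted.append((s, open_windows.pop(s)))
    return emitted, dropped


if __name__ == "__main__":
    emitted, dropped = tumbling_counts(
        [1, 4, 3, 12, 8, 15, 2, 27], window_size=10, max_delay=5
    )
    print(emitted)  # [(0, 4), (10, 2), (20, 1)]
    print(dropped)  # [2]
```

In this run the event with timestamp 8 arrives after 12 but is still counted, because the watermark has not yet reached the end of its window; the event with timestamp 2 arrives after the watermark has reached 10 and is dropped. A larger `max_delay` trades latency for completeness.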
11:50am - 12:30pm Thursday, March 16, 2017
Location: LL20 D Level: Intermediate
Secondary topics:  Media, Streaming
Arun Kejariwal (Independent), Karthik Ramasamy (Twitter)
Average rating: ***..
(3.00, 1 rating)
Anomaly detection plays a key role in the analysis of real-time streams. One example is detecting real-life incidents from tweet storms. Arun Kejariwal and Karthik Ramasamy walk you through how anomaly detection is supported in real-time data streams in Heron, the streaming system built in-house at Twitter (and open sourced) for real-time computation.
1:50pm - 2:30pm Thursday, March 16, 2017
Location: LL20 C Level: Intermediate
Secondary topics:  Streaming
Jamie Grier (data Artisans)
Average rating: *****
(5.00, 4 ratings)
Jamie Grier outlines the latest important features in Apache Flink and walks you through building a working demo to show these features off. Topics include queryable state, dynamic scaling, streaming SQL, very large state support, and whatever is the latest and greatest in March 2017.
2:40pm - 3:20pm Thursday, March 16, 2017
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture, Streaming
Gwen Shapira (Confluent)
Average rating: *****
(5.00, 3 ratings)
There are many good reasons to run more than one Kafka cluster... and a few bad reasons too. Great architectures are driven by use cases, and multicluster deployments are no exception. To help you choose the right architecture for your needs, Gwen Shapira offers an overview of several use cases, including real-time analytics and payment processing, that may require multicluster solutions.
2:40pm - 3:20pm Thursday, March 16, 2017
Location: LL21 E/F Level: Intermediate
Secondary topics:  Streaming
David Yan (DataTorrent, Inc.)
David Yan offers an overview of Apache Apex, a stream processing engine used in production by several large companies for real-time data analytics. With Apex, you can build applications that scalably and reliably process data with high throughput and low latency.