Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Schedule: Streaming sessions

9:00am - 5:00pm Monday, March 13 & Tuesday, March 14
Spark & beyond
Location: 212 C
Jacob Parr (Databricks)
The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Jacob Parr employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible. Read more.
9:00am–12:30pm Tuesday, March 14, 2017
Stream processing and analytics
Location: 210 A/E Level: Beginner
Frances Perry (Google), Tyler Akidau (Google)
Average rating: ***..
(3.00, 2 ratings)
Come learn the basics of stream processing via a guided walkthrough of the most sophisticated and portable stream processing model on the planet—Apache Beam (incubating). Tyler Akidau and Frances Perry cover the basics of robust stream processing with the option to execute exercises on top of the runner of your choice—Flink, Spark, or Google Cloud Dataflow. Read more.
9:00am–5:00pm Tuesday, March 14, 2017
Spark & beyond
Location: San Jose Ballroom, Marriott
Andy Konwinski (Databricks)
Average rating: ****.
(4.43, 7 ratings)
Andy Konwinski introduces you to Apache Spark 2.0 core concepts with a focus on Spark's machine-learning library, using text mining on real-world data as the primary end-to-end use case. Read more.
9:00am–12:30pm Tuesday, March 14, 2017
Location: LL20 C
Edd Wilder-James (Google), Ellen Friedman (MapR Technologies), Jim Scott (MapR Technologies), Gabriela Queiroz (R-Ladies), Melanie Warrick (Google), Aneesh Karve (Quilt)
Data 101 introduces you to core principles of data architecture, teaches you how to build and manage successful data teams, and inspires you to do more with your data through real-world applications. Setting the foundation for deeper dives on the following days of Strata + Hadoop World, Data 101 reinforces data fundamentals and helps you focus on how data can solve your business problems. Read more.
1:30pm–5:00pm Tuesday, March 14, 2017
Stream processing and analytics
Location: 210 A/E Level: Intermediate
Ian Wrigley (StreamSets)
Average rating: ****.
(4.83, 6 ratings)
Ian Wrigley demonstrates how Kafka Connect and Kafka Streams can be used together to build real-world, real-time streaming data pipelines. Using Kafka Connect, you'll ingest data from a relational database into Kafka topics as the data is being generated and then process and enrich the data in real time using Kafka Streams before writing it out for further analysis. Read more.
11:00am–11:40am Wednesday, March 15, 2017
Stream processing and analytics
Location: LL20 D Level: Advanced
Kenneth Knowles (Google)
Average rating: ****.
(4.80, 5 ratings)
Unbounded, out-of-order, global-scale data is now the norm. Even for the same computation, each use case entails its own balance between completeness, latency, and cost. Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner, including Apache Spark, Apache Flink, and Google Cloud Dataflow. Read more.
11:00am–11:40am Wednesday, March 15, 2017
Data engineering and architecture, Enterprise adoption
Location: 230 A Level: Beginner
Felix Gorodishter (GoDaddy)
Average rating: ****.
(4.25, 4 ratings)
GoDaddy ingests and analyzes logs, metrics, and events at a rate of 100,000 events per second (EPS). Felix Gorodishter shares GoDaddy's big data journey and explains how the company makes sense of data growth of more than 10 TB per day to gain operational insights into its cloud, leveraging Kafka, Hadoop, Spark, Pig, Hive, Cassandra, and Elasticsearch. Read more.
11:00am–11:40am Wednesday, March 15, 2017
Stream processing and analytics
Location: LL20 C Level: Beginner
Jay Kreps (Confluent)
Average rating: ***..
(3.70, 10 ratings)
The move to streaming architectures from batch processing is a revolution in how companies use data. But what is the state of the union for stream processing, and what gaps remain in the technology we have? How will this technology impact the architectures and applications of the future? Jay Kreps explores the future of Apache Kafka and the stream processing ecosystem. Read more.
11:50am–12:30pm Wednesday, March 15, 2017
Sensors, IOT & Industrial Internet
Location: LL20 D Level: Advanced
Tim Gasper (Janrain)
Average rating: *****
(5.00, 1 rating)
Food production and preparation have always been labor and capital intensive, but with the internet of things, low-cost sensors, cloud-computing ubiquity, and big data analysis, farmers and chefs are being replaced with connected, big data robots—not just in the field but also in your kitchen. Tim Gasper explores the tech stack, data science techniques, and use cases driving this revolution. Read more.
11:50am–12:30pm Wednesday, March 15, 2017
Spark & beyond
Location: LL21 C/D
Michael Armbrust (Databricks), Tathagata Das (Databricks)
Average rating: ****.
(4.29, 7 ratings)
Apache Spark 2.0 introduced the core APIs for Structured Streaming, a new streaming processing engine on Spark SQL. Since then, the Spark team has focused its efforts on making the engine ready for production use. Michael Armbrust and Tathagata Das outline the major features of Structured Streaming, recipes for using them in production, and plans for new features in future releases. Read more.
1:50pm–2:30pm Wednesday, March 15, 2017
Spark & beyond
Location: LL21 C/D Level: Intermediate
Holden Karau (Google), Seth Hendrickson (Cloudera)
Average rating: ****.
(4.00, 8 ratings)
Structured Streaming is new in Apache Spark 2.0, and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model. Read more.
1:50pm–2:30pm Wednesday, March 15, 2017
Data engineering and architecture
Location: LL20 C Level: Intermediate
Ryan Pridgeon (Confluent), Dustin Cote (Confluent)
Average rating: ****.
(4.67, 3 ratings)
Dustin Cote and Ryan Pridgeon share their experience troubleshooting Apache Kafka in production environments and discuss how to avoid pitfalls like message loss or performance degradation in your environment. Read more.
1:50pm–2:30pm Wednesday, March 15, 2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Intermediate
Chandan Joarder (Macys.com)
Average rating: ***..
(3.56, 9 ratings)
Chandan Joarder shares a guide to building real-time dashboards in-house using tools such as Kafka, web frameworks, and an in-memory database, utilizing JavaScript and Scala. Along the way, Chandan also discusses the architectural principles used in these dashboards to provide up-to-the-hour business performance metrics and alerts. Read more.
2:40pm–3:20pm Wednesday, March 15, 2017
Data engineering and architecture, Real-time applications
Location: LL20 A Level: Intermediate
Kartik Paramasivam (LinkedIn)
Average rating: *****
(5.00, 2 ratings)
LinkedIn has one of the largest Kafka installations in the world, ingesting more than a trillion messages per day. Apache Samza-based stream processing applications process this deluge of data. Kartik Paramasivam discusses key improvements and architectural patterns that LinkedIn has adopted in its data systems in order to process millions of requests per second while keeping costs under control. Read more.
2:40pm–3:20pm Wednesday, March 15, 2017
Platform Security and Cybersecurity
Location: LL21 B Level: Intermediate
Ajit Gaddam (VISA), Jiphun Satapathy (VISA)
Average rating: ***..
(3.83, 6 ratings)
Apache Kafka is used by over 35% of Fortune 500 companies to store and process some of their most sensitive datasets. Ajit Gaddam and Jiphun Satapathy provide a security reference architecture to secure your Kafka cluster while leveraging it to support your organization's cybersecurity requirements. Read more.
2:40pm–3:20pm Wednesday, March 15, 2017
Michael Edwards shares experiences from operating several Kafka clusters in a real-time streaming event ingestion pathway. He'll discuss the lessons learned from working with hundreds of terabytes flowing through every day, petabytes of retention, and gigabytes of historical data streaming to and from storage. Read more.
4:20pm–5:00pm Wednesday, March 15, 2017
Data engineering and architecture
Location: LL20 C Level: Intermediate
Kevin Mao (Capital One)
Average rating: ****.
(4.67, 3 ratings)
Kevin Mao explores the value of and challenges associated with collecting raw security event data from disparate corners of enterprise infrastructure and transforming them into high-quality intelligence that can be used to forecast, detect, and mitigate cybersecurity threats. Read more.
4:20pm–5:00pm Wednesday, March 15, 2017
Data science & advanced analytics, Real-time applications
Location: 210 C/G Level: Intermediate
Shivnath Babu (Duke University | Unravel Data Systems)
Average rating: ***..
(3.33, 3 ratings)
Shivnath Babu offers an introduction to using deep learning to solve complex problems in IT operations analytics. Shivnath focuses on how deep learning can derive operations insights automatically for the complex big data application stack composed of systems such as Hadoop, Spark, Cassandra, Elasticsearch, and Impala, using examples of open source tools for deep learning. Read more.
4:20pm–5:00pm Wednesday, March 15, 2017
Kishore R (GE)
Average rating: ***..
(3.00, 1 rating)
Kishore Reddipalli explores how to stream data at large scale from the edge to the cloud to the client, detect anomalies, analyze industrial machine data both in stream and at rest, and optimize industrial operations by providing real-time insights and recommendations using big data technologies. Read more.
5:10pm–5:50pm Wednesday, March 15, 2017
Data engineering and architecture
Location: LL20 A Level: Advanced
Monal Daxini (Netflix)
Average rating: ****.
(4.50, 2 ratings)
Netflix Keystone processes over a trillion events per day with at-least-once processing semantics in the cloud. Monal Daxini explores what it means to offer stream processing as a service (SPaaS), how Netflix implemented a scalable, fault-tolerant multitenant SPaaS internal offering, and how it evolved the system in flight with no downtime. Read more.
5:10pm–5:50pm Wednesday, March 15, 2017
Real-time applications
Location: LL20 D Level: Intermediate
Michael Freedman (TimescaleDB)
Average rating: *****
(5.00, 3 ratings)
IoT applications often need more-complex queries than those supported by traditional time series databases. Michael Freedman outlines a new distributed time series database for such workloads, supporting efficient queries, including complex predicates across many metrics, while scaling out to support IoT ingest rates. Read more.
5:10pm–5:50pm Wednesday, March 15, 2017
Sijie Guo (ASF)
Average rating: **...
(2.00, 2 ratings)
Apache DistributedLog (incubating) is a low-latency, high-throughput replicated log service. Sijie Guo shares how Twitter has used DistributedLog as the real-time data foundation in production for years, supporting services like distributed databases, pub-sub messaging, and real-time stream computing and delivering more than 1.5 trillion (17 PB) events per day. Read more.
11:00am–11:40am Thursday, March 16, 2017
Stream processing and analytics
Location: LL20 D Level: Intermediate
Bill Graham (Twitter), Avrilia Floratau (Microsoft), Ashvin Agrawal (Microsoft)
Twitter processes billions of events per day the instant the data is generated using Heron, an open source streaming engine tailored for large-scale environments. Bill Graham, Avrilia Floratau, and Ashvin Agrawal explore the techniques Heron uses to elastically scale resources in order to handle highly varying loads without sacrificing real-time performance or user experience. Read more.
11:00am–11:40am Thursday, March 16, 2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Todd Lipcon (Cloudera), Marcel Kornacker (Cloudera)
Average rating: ****.
(4.00, 1 rating)
Todd Lipcon and Marcel Kornacker offer an introduction to using Impala and Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting. Read more.
11:00am–11:40am Thursday, March 16, 2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Intermediate
Todd Mostak (MapD), Abdul Subhan (Verizon Wireless)
Average rating: ****.
(4.00, 2 ratings)
With more than 91M customers, Verizon produces oceans of data. The challenge this onslaught presents isn’t one of storage—it’s one of speed. The solution? Harnessing the power of GPUs to access insights in less than a millisecond. Todd Mostak and Abdul Subhan explain how Verizon solved its data challenge by implementing GPU-tuned analytics and visualization. Read more.
11:00am–11:40am Thursday, March 16, 2017
Stream processing and analytics
Location: LL20 C Level: Beginner
Tyler Akidau (Google)
Average rating: ***..
(3.00, 2 ratings)
Join Tyler Akidau for a whirlwind tour of the conceptual building blocks of massive-scale data processing systems over the last decade, as Tyler compares and contrasts systems at Google with popular open source systems in use today. Read more.
11:50am–12:30pm Thursday, March 16, 2017
Stream processing and analytics
Location: LL20 C Level: Intermediate
Slava Chernyak (Google)
Average rating: *****
(5.00, 2 ratings)
Watermarks are a system for measuring progress and completeness in out-of-order streaming systems and are utilized to emit correct results in a timely manner. Given the trend toward out-of-order processing in existing streaming systems, watermarks are an increasingly important tool when designing streaming pipelines. Slava Chernyak explains watermarks and explores real-world applications. Read more.
11:50am–12:30pm Thursday, March 16, 2017
Stream processing and analytics
Location: LL20 D Level: Intermediate
Arun Kejariwal (Independent), Karthik Ramasamy (Twitter)
Average rating: ***..
(3.00, 1 rating)
Anomaly detection plays a key role in the analysis of real-time streams, as exemplified by detecting real-life incidents from tweet storms. Arun Kejariwal and Karthik Ramasamy walk you through how anomaly detection is supported in real-time data streams in Heron—the streaming system built in-house at Twitter (and open sourced) for real-time computation. Read more.
1:50pm–2:30pm Thursday, March 16, 2017
Stream processing and analytics
Location: LL20 C Level: Intermediate
Jamie Grier (data Artisans)
Average rating: *****
(5.00, 4 ratings)
Jamie Grier outlines the latest important features in Apache Flink and walks you through building a working demo to show these features off. Topics include queryable state, dynamic scaling, streaming SQL, very large state support, and whatever is the latest and greatest in March 2017. Read more.
1:50pm–2:30pm Thursday, March 16, 2017
Spark & beyond
Location: LL20 A
Uber relies on making data-driven decisions at every level, and most of these decisions can benefit from faster data processing. Vinoth Chandar and Prasanna Rajaperumal introduce Hoodie, a newly open sourced system at Uber that adds new incremental processing primitives to existing Hadoop technologies to provide near-real-time data at 10x reduced cost. Read more.
2:40pm–3:20pm Thursday, March 16, 2017
Stream processing and analytics
Location: LL21 E/F Level: Intermediate
David Yan (DataTorrent, Inc.)
David Yan offers an overview of Apache Apex, a stream processing engine used in production by several large companies for real-time data analytics. With Apex, you can build applications that scalably and reliably process their data with high throughput and low latency. Read more.
2:40pm–3:20pm Thursday, March 16, 2017
Real-time applications
Location: 210 A/E Level: Advanced
Jeffrey Yau (Silicon Valley Data Science)
Average rating: ***..
(3.20, 5 ratings)
Thanks to frameworks such as Spark's GraphX and GraphFrames, graph-based techniques are increasingly applicable to anomaly, outlier, and event detection in time series. Jeffrey Yau offers an overview of applying graph-based techniques in fraud detection, IoT processing, and financial data and outlines the benefits of graphs relative to other techniques. Read more.
2:40pm–3:20pm Thursday, March 16, 2017
Gwen Shapira (Confluent)
Average rating: *****
(5.00, 3 ratings)
There are many good reasons to run more than one Kafka cluster...and a few bad reasons too. Great architectures are driven by use cases, and multicluster deployments are no exception. Gwen Shapira offers an overview of several use cases, including real-time analytics and payment processing, that may require multicluster solutions to help you better choose the right architecture for your needs. Read more.
4:20pm–5:00pm Thursday, March 16, 2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Intermediate
Mahesh Goud T (Ticketmaster)
Average rating: **...
(2.00, 1 rating)
Mahesh Goud shares success stories using Ticketmaster's large-scale contextual bandit platform for SEM, which determines the optimal keyword bids under evolving keyword contexts to meet different business requirements, and explores Ticketmaster's streaming pipeline, consisting of Storm, Kafka, HBase, the ELK Stack, and Spring Boot. Read more.