Presented By O'Reilly and Cloudera
December 5-6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Streaming conference sessions

12:05pm–12:45pm Wednesday, 12/07/2016
Ted Dunning explains how a stream-first approach simplifies and speeds development of applications, resulting in real-time applications that have significant impact. Along the way, Ted contrasts a stream-first approach with existing approaches that start with an application that dictates specialized data structures, ETL activities, data silos, and processing delays.
1:45pm–2:25pm Wednesday, 12/07/2016
Apache Beam (incubating) defines a new data processing programming model evolved from more than a decade of experience building big data infrastructure within Google. Beam pipelines are portable across open source and private cloud runtimes. Dan Halperin covers the basics of Apache Beam—its evolution, main concepts in the programming model, and how it compares to similar systems.
1:45pm–2:25pm Wednesday, 12/07/2016
Hybrid cloud architectures marry the flexibility to scale workloads on demand in the public cloud with the ability to control mission-critical applications on-premises. Publish-subscribe message streams offer a natural paradigm for hybrid cloud use cases. Mathieu Dumoulin describes how to architect a real-time, global IoT analytics hybrid cloud application with a Kafka-based message stream system.
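The publish-subscribe paradigm the abstract refers to can be sketched without any broker at all. The following is a minimal, library-free stand-in for a Kafka-style topic (the `Topic` class and its method names are illustrative, not part of any real API): producers append to a shared log, and each subscriber reads from its own offset, so on-premises and cloud consumers can process the same stream independently.

```python
from collections import defaultdict

class Topic:
    """Minimal in-memory stand-in for a Kafka-style topic: producers
    append records; each subscriber reads from its own offset."""
    def __init__(self):
        self.log = []                    # append-only record log
        self.offsets = defaultdict(int)  # subscriber name -> next offset

    def publish(self, record):
        self.log.append(record)

    def poll(self, subscriber):
        """Return all records this subscriber has not yet seen."""
        start = self.offsets[subscriber]
        records = self.log[start:]
        self.offsets[subscriber] = len(self.log)
        return records

# One IoT reading fans out to independent consumers, e.g. an
# on-premises alerting service and a cloud analytics service.
readings = Topic()
readings.publish({"sensor": "s1", "temp": 21.5})
readings.publish({"sensor": "s2", "temp": 98.3})

alerts = readings.poll("alerting")      # both records
analytics = readings.poll("analytics")  # both records, independently
```

Because each consumer tracks its own offset, a slow cloud-side consumer never blocks an on-premises one, which is the decoupling that makes pub-sub a natural fit for hybrid deployments.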
9:00am–12:30pm Tuesday, 12/06/2016
Mark Grover, Ted Malaska, and Jonathan Seidman explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world and discuss how to use components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.
11:15am–11:55am Thursday, 12/08/2016
The interconnected world presents unprecedented opportunities to gain new insights on behavior, both human and nonhuman alike. At the same time, it poses unprecedented challenges for organizations seeking to act on these moments of opportunity in time. Michael O'Connell and San Zaw share real-world case studies demonstrating how real-time analytics solves these challenges.
5:05pm–5:45pm Wednesday, 12/07/2016
Todd Lipcon and Marcel Kornacker provide an introduction to using Impala + Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, streaming market data), machine data analytics, and online reporting.
11:15am–11:55am Thursday, 12/08/2016
Translating streaming, real-time telecommunications data into actionable analytics products remains challenging. Boon Siew Seah explores SmartHub’s past successes and failures building telco analytics products for its customers and shares the big data technologies behind its two API-based telco analytics products: Grid360 (geolocation analytics) and C360 (consumer insights).
5:05pm–5:45pm Wednesday, 12/07/2016
Creating big data solutions that can process data at terabyte scale and produce spatial-temporal real-time insights at speed demands a well-thought-through system architecture. Chandra Sekhar Saripaka details the production architecture at DataSpark that works through terabytes of spatial-temporal telco data each day in PaaS mode and showcases how DataSpark operates in SaaS mode.
4:15pm–4:55pm Wednesday, 12/07/2016
Picking up where his talk at Strata + Hadoop World in London left off, Gopal GopalKrishnan shares lessons learned from using components of the big data ecosystem for insights from industrial sensor and time series data and explores use cases in predictive maintenance, energy optimization, process efficiency, production cost reduction, and quality improvement.
11:15am–11:55am Wednesday, 12/07/2016
IHI has developed a common platform for remote monitoring and maintenance and has started leveraging Spark MLlib to get up to speed developing applications for process improvement and product fault diagnosis. Yoshitaka Suzuki and Masaru Dobashi explain how IHI used PySpark and MLlib to improve its services and share best practices for application development and lessons for operating Spark on YARN.
1:30pm–5:00pm Tuesday, 12/06/2016
Tyler Akidau, Slava Chernyak, and Dan Halperin offer a guided walkthrough of Apache Beam (incubating)—the most sophisticated and portable stream processing model on the planet—covering the basics of robust stream processing (windowing, watermarks, and triggers) with the option to execute exercises on top of the runner of your choice (Flink, Spark, or Google Cloud Dataflow).
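One of the tutorial's basics, windowing, can be sketched in a few lines of plain Python (this is an illustrative stand-in, not Beam's API): tumbling windows partition event time into fixed-width, non-overlapping intervals, and each event is assigned to the window containing its timestamp.

```python
def tumbling_window(event_time, width):
    """Assign an event timestamp to its tumbling window [start, start + width)."""
    start = event_time - (event_time % width)
    return (start, start + width)

# Group events by 10-unit tumbling windows of their event time.
events = [(2, "a"), (7, "b"), (12, "c")]  # (event_time, payload)
windows = {}
for t, payload in events:
    windows.setdefault(tumbling_window(t, 10), []).append(payload)
```

Real systems like Beam layer watermarks (when is a window believed complete?) and triggers (when to emit results?) on top of this assignment step, which is what the exercises in the tutorial explore.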
12:05pm–12:45pm Thursday, 12/08/2016
Modern telecommunications are alphabet soups that produce massive amounts of diagnostic data. Ted Dunning offers an overview of a real-time, low-fidelity simulation of the edge protocols of such a system to help illustrate how modern big data tools can be used for telecom analytics. Ted demos the system and shows how several tools can produce useful analytical results and system understanding.
1:45pm–2:25pm Wednesday, 12/07/2016
Discuss Spark, Scala, and streaming data architectures with Dean.
4:15pm–4:55pm Thursday, 12/08/2016
Aljoscha Krettek offers a very short introduction to stream processing before diving into writing code and demonstrating the features in Apache Flink that make truly robust stream processing possible. All of this will be done in the context of a real-time analytics application that we'll be modifying on the fly based on the topics we're working through.
9:00am–5:00pm Monday, 12/05/2016
The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Andy Huang employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible.
12:05pm–12:45pm Thursday, 12/08/2016
Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark's new Structured Streaming and walk you through creating your own streaming model.
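The core idea behind streaming machine learning can be sketched without Spark at all (this is a library-free illustration of the micro-batch pattern, not Structured Streaming's API): model parameters live across batches, and each arriving micro-batch updates them with one gradient step.

```python
def sgd_step(w, b, batch, lr=0.01):
    """One gradient step for the model y = w*x + b on a micro-batch of (x, y) pairs,
    minimizing mean squared error."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return w - lr * grad_w, b - lr * grad_b

# Simulated stream of micro-batches drawn from y = 2x.
stream = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]

w, b = 0.0, 0.0
for _ in range(500):          # in a real stream, batches keep arriving
    for batch in stream:
        w, b = sgd_step(w, b, batch)
# w approaches 2.0 and b approaches 0.0 as batches accumulate
```

In Structured Streaming the batches arrive from a live source rather than a list, but the shape is the same: state carried forward, updated incrementally per micro-batch.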
4:15pm–4:55pm Wednesday, 12/07/2016
Jorge Pablo Fernandez and Nicolette Bullivant explore Santander Bank's Spendlytics app, which helps customers track their spending by offering a listing of transactions, transaction aggregations, and real-time enrichment based on the categorization of transactions depending on market and brands. Along the way, they share the challenges encountered and lessons learned while implementing the app.
9:00am–5:00pm Tuesday, 12/06/2016
The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Brian Clapper employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible.
12:05pm–12:45pm Wednesday, 12/07/2016
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Maosong Fu offers an overview of the end-to-end real-time stack Twitter designed in order to meet this challenge, consisting of DistributedLog (the distributed and replicated messaging system) and Heron (the streaming system for real-time computation).
2:35pm–3:15pm Wednesday, 12/07/2016
Watermarks are a system for measuring progress and completeness in out-of-order streaming systems and are utilized to emit correct results in a timely manner. Given the trend toward out-of-order processing in existing streaming systems, watermarks are an increasingly important tool when designing streaming pipelines. Slava Chernyak explains watermarks and explores real-world applications.
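The mechanism can be sketched in plain Python under a simple bounded-out-of-orderness assumption (an illustrative model, not any particular system's API): the watermark trails the maximum observed event time by an allowed lateness, and a window's result is emitted only once the watermark passes the window's end.

```python
def run(events, window_end, allowed_lateness):
    """Buffer out-of-order events for the window [0, window_end) and emit
    the window's count once the watermark passes window_end."""
    max_event_time = 0
    buffered = []
    for event_time, value in events:
        max_event_time = max(max_event_time, event_time)
        # Watermark: a claim that no event older than this will still arrive.
        watermark = max_event_time - allowed_lateness
        if event_time < window_end:
            buffered.append(value)
        if watermark >= window_end:
            return len(buffered)   # window believed complete: emit result
    return None                    # watermark never passed window_end

# Events arrive out of order; the window [0, 10) fires only after the
# event at time 13 advances the watermark past 10 (allowed lateness 2).
events = [(4, "a"), (9, "b"), (7, "c"), (13, "d")]
count = run(events, window_end=10, allowed_lateness=2)
```

The late-but-in-order-tolerant buffering is what lets the system emit a correct count for the window rather than firing early on wall-clock time, which is the trade-off this session examines.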