Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Schedule: Streaming systems and real-time applications sessions

9:00 - 17:00 Monday, 21 May & Tuesday, 22 May
Location: Capital Suite 16
Jesse Anderson (Big Data Institute)
Average rating: *****
(5.00, 1 rating)
To handle real-time big data, you need to solve two difficult problems: How do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company. Read more.
9:00 - 12:30 Tuesday, 22 May 2018
Location: Capital Suite 8 Level: Intermediate
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio), Ivan Kelly (Streamlio)
Average rating: ***..
(3.67, 3 ratings)
The need for instant data-driven insights has led to the proliferation of messaging and streaming frameworks. Karthik Ramasamy, Arun Kejariwal, and Ivan Kelly walk you through state-of-the-art streaming frameworks, algorithms, and architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. Read more.
13:30 - 17:00 Tuesday, 22 May 2018
Location: Capital Suite 8 Level: Intermediate
Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)
Average rating: ***..
(3.25, 4 ratings)
Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Along the way, Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.
11:15 - 11:55 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Gerard Maas (Lightbend)
Average rating: ****.
(4.00, 13 ratings)
Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. Gerard Maas offers a critical overview of their differences in key aspects of a streaming application, from the API user experience to dealing with time and with state and machine learning capabilities, and shares practical guidance on picking one or combining both to implement resilient streaming pipelines. Read more.
12:05 - 12:45 Wednesday, 23 May 2018
Location: S11A Level: Beginner
Jim Scott (NVIDIA)
Average rating: ****.
(4.00, 2 ratings)
Creating a business solution is a lot of work. Instead of building to run on a single cloud provider, it is far more cost-effective to leverage the cloud as infrastructure as a service (IaaS). Jim Scott explains why a global data fabric is a requirement for running on all cloud providers simultaneously. Read more.
12:05 - 12:45 Wednesday, 23 May 2018
Location: Capital Suite 15/16 Level: Intermediate
Secondary topics:  Telecom, Time Series and Graphs
Ira Cohen (Anodot)
The mobile world has so many moving parts that a simple change to one element can cause havoc somewhere else, resulting in issues that annoy users and cause revenue leaks. Ira Cohen outlines ways to use anomaly detection to track everything mobile, from the service and roaming to specific apps, to fully optimize your mobile offerings. Read more.
14:05 - 14:45 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Intermediate
Secondary topics:  Security and Privacy
Joshua Patterson (NVIDIA), Chau Dang (NVIDIA)
Joshua Patterson and Mike Wendt explain how NVIDIA used GPU-accelerated open source technologies to improve its cyberdefense platforms by leveraging software from the GPU Open Analytics Initiative (GOAI) and how the company accelerated anomaly detection with more efficient machine learning models, faster deployment, and more granular data exploration. Read more.
14:05 - 14:45 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Beginner
Michael Noll (Confluent)
Average rating: ****.
(4.67, 6 ratings)
Michael Noll offers an overview of KSQL, the open source streaming SQL engine for Apache Kafka, which makes it easy to get started with a wide range of real-time use cases, such as monitoring application behavior and infrastructure, detecting anomalies and fraudulent activities in data feeds, and real-time ETL. Read more.
14:05 - 14:45 Wednesday, 23 May 2018
Location: Capital Suite 12 Level: Intermediate
Secondary topics:  Telecom, Time Series and Graphs
Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)
Average rating: ***..
(3.00, 1 rating)
Heitor Murilo Gomes and Albert Bifet offer an overview of StreamDM, a real-time analytics open source software library built on top of Spark Streaming, developed at Huawei's Noah’s Ark Lab and Télécom ParisTech. Read more.
14:05 - 14:45 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Intermediate
Secondary topics:  Security and Privacy
Eran Avidan (Intel)
Average rating: ****.
(4.50, 2 ratings)
Deep learning is revolutionizing many domains within computer vision, but doing real-time analysis is challenging. Eran Avidan offers an overview of a novel architecture based on Redis, Docker, and TensorFlow that enables real-time analysis of high-resolution streaming video. Read more.
14:55 - 15:35 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Intermediate
Lee Blum (Verint Systems)
Lee Blum offers an overview of Verint's large-scale cyber-defense system built to serve its data scientists with versatile analytic operations on petabytes of data and trillions of records, covering the company's extremely challenging use case, decision considerations, major design challenges, tips and tricks, and the system’s overall results. Read more.
14:55 - 15:35 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Ivan Kelly (Streamlio)
Average rating: ***..
(3.00, 2 ratings)
Ivan Kelly offers an overview of Apache Pulsar, a durable, distributed messaging system, underpinned by Apache BookKeeper, that provides the enterprise features necessary to guarantee that your data is where it should be and only accessible by those who should have access. Read more.
14:55 - 15:35 Wednesday, 23 May 2018
Location: Capital Suite 17 Level: Beginner
Dean Wampler (Anyscale)
Average rating: ****.
(4.00, 2 ratings)
Streaming data systems, so-called fast data, promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just faster versions of big data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. Dean Wampler outlines what you need to know to exploit fast data successfully. Read more.
14:55 - 15:35 Wednesday, 23 May 2018
Location: Expo Hall Level: Intermediate
Secondary topics:  Managing and Deploying Machine Learning
Emre Velipasaoglu (Lightbend)
Average rating: ***..
(3.67, 3 ratings)
Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu reviews monitoring methods, focusing on their applicability in fast data and streaming applications. Read more.
16:35 - 17:15 Wednesday, 23 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Text and Language processing and analysis
Ran Taig (Dell), Omer Sagi (Dell)
Average rating: **...
(2.00, 1 rating)
DevOps and QA engineers spend a significant amount of time investigating reoccurring issues. These issues are often represented by large configuration and log files, so the process of investigating whether two issues are duplicates can be a very tedious task. Ran Taig and Omer Sagi outline a solution that leverages NLP and machine learning algorithms to automatically identify duplicate issues. Read more.
16:35 - 17:15 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Sean Glover (Lightbend)
Average rating: **...
(2.50, 2 ratings)
Kafka is best suited to run close to the metal on dedicated machines in static clusters, but these clusters are quickly becoming extinct. Companies want mixed-use clusters that take advantage of every resource available. Sean Glover offers an overview of leading Kafka implementations on DC/OS and Kubernetes to explore how reliably they run Kafka in container-orchestrated clusters. Read more.
17:25 - 18:05 Wednesday, 23 May 2018
Location: S11B Level: Intermediate
Holden Karau (Independent), Rachel Warren (Salesforce Einstein)
Average rating: ****.
(4.00, 2 ratings)
Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads. Read more.
17:25 - 18:05 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Aljoscha Krettek (Ververica)
Average rating: ****.
(4.67, 3 ratings)
Aljoscha Krettek offers an overview of the modern stream processing space, details the challenges posed by stateful and event-time-aware stream processing, and shares core archetypes ("application blueprints") for stream processing drawn from real-world use cases with Apache Flink. Read more.
11:15 - 11:55 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Intermediate
Secondary topics:  Data Platforms, Managing and Deploying Machine Learning, Media, Advertising, Entertainment
Kinnary Jangla (Pinterest)
Average rating: ***..
(3.00, 5 ratings)
Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of its ML teams while increasing uptime and ease of deployment. Read more.
11:15 - 11:55 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Beginner
Secondary topics:  Visualization, Design, and UX
Erin Recachinas (Zoomdata)
Average rating: ****.
(4.00, 2 ratings)
The value of real-time streaming analytics with historical data is immense. Big data application Zoomdata updates historical dashboards in real time without complex reaggregations, but streaming in the age of the IoT requires handling of data in volumes not seen in traditional feeds. Erin Recachinas explains how Zoomdata moved to a scalable microservice architecture for streaming sources. Read more.
12:05 - 12:45 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Transportation and Logistics
Mark Grover (Lyft), Ted Malaska (Capital One)
Average rating: *****
(5.00, 6 ratings)
Many details go into building a big data system for speed, from determining an acceptable latency before data becomes accessible and deciding where to store the data to solving multiregion problems, or even knowing just what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed. Read more.
12:05 - 12:45 Thursday, 24 May 2018
Location: Capital Suite 12 Level: Intermediate
Secondary topics:  Financial Services
Calum Murray (Intuit)
Average rating: *....
(1.50, 2 ratings)
Machine learning-based applications are becoming the new norm. Calum Murray shares five use cases at Intuit that use the data of over 60 million users to create delightful experiences for customers by solving repetitive tasks, freeing them up to spend time more productively or solving very complex tasks with simplicity and elegance. Read more.
12:05 - 12:45 Thursday, 24 May 2018
Location: Capital Suite 13 Level: Intermediate
Secondary topics:  Data Platforms, Managing and Deploying Machine Learning
Moty Fania (Intel)
Moty Fania explains how Intel implemented an AI inference platform to enable internal visual inspection use cases and shares lessons learned along the way. The platform is based on open source technologies and was designed for real-time streaming and online actuation. Read more.
12:05 - 12:45 Thursday, 24 May 2018
Location: Expo Hall Level: Intermediate
Secondary topics:  Time Series and Graphs
Erik Nordström (Timescale)
Erik Nordström explains how and why to use PostgreSQL as a Prometheus backend to support complex questions (and get a proper SQL interface), offers an overview of pg_prometheus, a custom Prometheus datatype, and prometheus-postgresql-adapter, a remote storage adaptor for PostgreSQL, and shares his experience with TimescaleDB, which enables PostgreSQL to scale for classic monitoring volumes. Read more.
14:05 - 14:45 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Data Platforms, Time Series and Graphs
Tony Xing (Microsoft), Bixiong Xu (Microsoft)
Average rating: **...
(2.00, 1 rating)
Tony Xing and Bixiong Xu offer an overview of Project Kensho, Microsoft's one-stop shop for business incident monitoring and automated insights. Tony and Bixiong cover the technology's evolution, the architecture, the algorithms, and the benefits and the trade-offs. Along the way, they share a case study on Bing ads key metrics monitoring and automated diagnostic insights. Read more.
14:05 - 14:45 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Kostas Kloudas (data Artisans)
Average rating: **...
(2.25, 4 ratings)
Complex event processing (CEP) helps detect patterns over continuous streams of data. DNA sequencing, fraud detection, shipment tracking with specific characteristics (e.g., contaminated goods), and user activity analysis fall into this category. Kostas Kloudas offers an overview of Flink's CEP library and explains the benefits of its integration with Flink. Read more.
14:55 - 15:35 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Advanced
Secondary topics:  Data Integration and Data Pipelines sessions
Eugene Kirpichov (Google)
Average rating: ****.
(4.50, 2 ratings)
Apache Beam offers users a novel programming model in which the classic batch-streaming dichotomy is erased and ships with a rich set of I/O connectors to popular storage systems. Eugene Kirpichov explains why Beam has made these connectors flexible and modular—a key component of which is Splittable DoFn, a novel programming model primitive that unifies data ingestion between batch and streaming. Read more.
16:35 - 17:15 Thursday, 24 May 2018
Location: S11A Level: Beginner
Jason Bell (Independent Speaker)
Jason Bell offers an overview of a self-learning knowledge system that uses Apache Kafka and Deeplearning4j to accept data, apply training to a neural network, and output predictions. Jason covers the system design, the rationale behind it, and the implications of using streaming data with deep learning and artificial intelligence. Read more.
16:35 - 17:15 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Data Platforms
Naghman Waheed (Bayer Crop Science), Brian Arnold (Bayer)
Average rating: ****.
(4.50, 2 ratings)
There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto's Data Historian platform, which can ingest, store, and access datasets without compromising ease of use, governance, or security. Read more.
16:35 - 17:15 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Intermediate
Giuseppe D'alessio (ING Group)
Average rating: ***..
(3.25, 4 ratings)
Giuseppe D'alessio details ING's DevOps journey, covering its impact on people, processes and tools, best practices, and pitfalls. Giuseppe concludes with a concrete example of using analytics and streaming technology on real-time applications. Read more.
16:35 - 17:15 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Flavio Junqueira (Dell EMC)
Stream processing is in the spotlight. Enabling low-latency insights and actions out of continuously generated data is compelling to a number of application domains, and the ability to adapt to workload variations is critical to many applications. Flavio Junqueira explores Pravega, a stream store that scales streams automatically and enables applications to scale downstream by signaling changes. Read more.