Schedule: Streaming systems and real-time applications sessions: Big data conference & machine learning training

9:00–17:00 Monday, 21 May & Tuesday, 22 May 2018

Real-time systems with Spark Streaming and Kafka

Location: Capital Suite 16

Jesse Anderson (Big Data Institute)

Average rating:

(5.00, 1 rating)

To handle real-time big data, you need to solve two difficult problems: How do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company. Read more.

9:00–12:30 Tuesday, 22 May 2018

Modern real-time streaming architectures

Location: Capital Suite 8 Level: Intermediate

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio), Ivan Kelly (Streamlio)

Average rating:

(3.67, 3 ratings)

The need for instant data-driven insights has led to the proliferation of messaging and streaming frameworks. Karthik Ramasamy, Arun Kejariwal, and Ivan Kelly walk you through state-of-the-art streaming frameworks, algorithms, and architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. Read more.

13:30–17:00 Tuesday, 22 May 2018

Kafka streaming microservices with Akka Streams and Kafka Streams

Location: Capital Suite 8 Level: Intermediate

Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)

Average rating:

(3.25, 4 ratings)

Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Along the way, Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.

11:15–11:55 Wednesday, 23 May 2018

Processing fast data with Apache Spark: A tale of two APIs

Location: Capital Suite 8/9 Level: Intermediate

Gerard Maas (Lightbend)

Average rating:

(4.00, 13 ratings)

Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. Gerard Maas offers a critical overview of their differences in key aspects of a streaming application, from the API user experience to dealing with time and with state and machine learning capabilities, and shares practical guidance on picking one or combining both to implement resilient streaming pipelines. Read more.

12:05–12:45 Wednesday, 23 May 2018

Using a global data fabric to run a mixed cloud deployment

Location: S11A Level: Beginner

Jim Scott (NVIDIA)

Average rating:

(4.00, 2 ratings)

Creating a business solution is a lot of work. Instead of building to run on a single cloud provider, it is far more cost-effective to leverage the cloud as infrastructure as a service (IaaS). Jim Scott explains why a global data fabric is a requirement for running on all cloud providers simultaneously. Read more.

12:05–12:45 Wednesday, 23 May 2018

The app trap: Why every mobile app and mobile operator needs anomaly detection

Location: Capital Suite 15/16 Level: Intermediate

Secondary topics: Telecom, Time Series and Graphs

Ira Cohen (Anodot)

The mobile world has so many moving parts that a simple change to one element can cause havoc somewhere else, resulting in issues that annoy users and cause revenue leaks. Ira Cohen outlines ways to use anomaly detection to track everything mobile, from the service and roaming to specific apps, to fully optimize your mobile offerings. Read more.

14:05–14:45 Wednesday, 23 May 2018

GPU-accelerated threat detection with GOAI

Location: Capital Suite 7 Level: Intermediate

Secondary topics: Security and Privacy

Joshua Patterson (NVIDIA), Chau Dang (NVIDIA)

Joshua Patterson and Mike Wendt explain how NVIDIA used GPU-accelerated open source technologies to improve its cyberdefense platforms by leveraging software from the GPU Open Analytics Initiative (GOAI) and how the company accelerated anomaly detection with more efficient machine learning models, faster deployment, and more granular data exploration. Read more.

14:05–14:45 Wednesday, 23 May 2018

Unlocking the world of stream processing with KSQL, the streaming SQL engine for Apache Kafka

Location: Capital Suite 8/9 Level: Beginner

Michael Noll (Confluent)

Average rating:

(4.67, 6 ratings)

Michael Noll offers an overview of KSQL, the open source streaming SQL engine for Apache Kafka, which makes it easy to get started with a wide range of real-time use cases, such as monitoring application behavior and infrastructure, detecting anomalies and fraudulent activities in data feeds, and real-time ETL. Read more.

14:05–14:45 Wednesday, 23 May 2018

StreamDM: Advanced data science with Spark Streaming

Location: Capital Suite 12 Level: Intermediate

Secondary topics: Telecom, Time Series and Graphs

Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)

Average rating:

(3.00, 1 rating)

Heitor Murilo Gomes and Albert Bifet offer an overview of StreamDM, a real-time analytics open source software library built on top of Spark Streaming, developed at Huawei's Noah’s Ark Lab and Télécom ParisTech. Read more.

14:05–14:45 Wednesday, 23 May 2018

Real-time deep learning on video streams

Location: Capital Suite 13 Level: Intermediate

Secondary topics: Security and Privacy

Eran Avidan (Intel)

Average rating:

(4.50, 2 ratings)

Deep learning is revolutionizing many domains within computer vision, but doing real-time analysis is challenging. Eran Avidan offers an overview of a novel architecture based on Redis, Docker, and TensorFlow that enables real-time analysis of high-resolution streaming video. Read more.

14:55–15:35 Wednesday, 23 May 2018

The ultimate data scientist's playground: Building a multipetabyte analytic infrastructure for cyber defense

Location: Capital Suite 7 Level: Intermediate

Lee Blum (Verint Systems)

Lee Blum offers an overview of Verint's large-scale cyber-defense system built to serve its data scientists with versatile analytic operations on petabytes of data and trillions of records, covering the company's extremely challenging use case, decision considerations, major design challenges, tips and tricks, and the system’s overall results. Read more.

14:55–15:35 Wednesday, 23 May 2018

Multi-data center and multitenant durable messaging with Apache Pulsar

Location: Capital Suite 8/9 Level: Intermediate

Ivan Kelly (Streamlio)

Average rating:

(3.00, 2 ratings)

Ivan Kelly offers an overview of Apache Pulsar, a durable, distributed messaging system, underpinned by Apache BookKeeper, that provides the enterprise features necessary to guarantee that your data is where it should be and only accessible by those who should have access. Read more.

14:55–15:35 Wednesday, 23 May 2018

Executive Briefing: What you need to know about fast data

Location: Capital Suite 17 Level: Beginner

Dean Wampler (Anyscale)

Average rating:

(4.00, 2 ratings)

Streaming data systems, so-called fast data, promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just faster versions of big data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. Dean Wampler outlines what you need to know to exploit fast data successfully. Read more.

14:55–15:35 Wednesday, 23 May 2018

Machine-learned model quality monitoring in fast data and streaming applications

Location: Expo Hall Level: Intermediate

Secondary topics: Managing and Deploying Machine Learning

Emre Velipasaoglu (Lightbend)

Average rating:

(3.67, 3 ratings)

Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu reviews monitoring methods, focusing on their applicability in fast data and streaming applications. Read more.

16:35–17:15 Wednesday, 23 May 2018

Improving DevOps and QA efficiency using machine learning and NLP methods

Location: S11B Level: Intermediate

Secondary topics: Text and Language processing and analysis

Ran Taig (Dell), Omer Sagi (Dell)

Average rating:

(2.00, 1 rating)

DevOps and QA engineers spend a significant amount of time investigating reoccurring issues. These issues are often represented by large configuration and log files, so the process of investigating whether two issues are duplicates can be a very tedious task. Ran Taig and Omer Sagi outline a solution that leverages NLP and machine learning algorithms to automatically identify duplicate issues. Read more.

16:35–17:15 Wednesday, 23 May 2018

Kafka in jail: Running Kafka in container-orchestrated clusters

Location: Capital Suite 8/9 Level: Intermediate

Sean Glover (Lightbend)

Average rating:

(2.50, 2 ratings)

Kafka is best suited to run close to the metal on dedicated machines in static clusters, but these clusters are quickly becoming extinct. Companies want mixed-use clusters that take advantage of every resource available. Sean Glover offers an overview of leading Kafka implementations on DC/OS and Kubernetes to explore how reliably they run Kafka in container-orchestrated clusters. Read more.

17:25–18:05 Wednesday, 23 May 2018

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am

Location: S11B Level: Intermediate

Holden Karau (Independent), Rachel Warren (Salesforce Einstein)

Average rating:

(4.00, 2 ratings)

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads. Read more.

17:25–18:05 Wednesday, 23 May 2018

Stream processing for the practitioner: Blueprints for common stream processing use cases with Apache Flink

Location: Capital Suite 8/9 Level: Intermediate

Aljoscha Krettek (Ververica)

Average rating:

(4.67, 3 ratings)

Aljoscha Krettek offers an overview of the modern stream processing space, details the challenges posed by stateful and event-time-aware stream processing, and shares core archetypes ("application blueprints") for stream processing drawn from real-world use cases with Apache Flink. Read more.

11:15–11:55 Thursday, 24 May 2018

Accelerating development velocity of production ML systems with Docker

Location: Capital Suite 7 Level: Intermediate

Secondary topics: Data Platforms, Managing and Deploying Machine Learning, Media, Advertising, Entertainment

Kinnary Jangla (Pinterest)

Average rating:

(3.00, 5 ratings)

Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of its ML teams while increasing uptime and ease of deployment. Read more.

11:15–11:55 Thursday, 24 May 2018

You’re doing it wrong: How Zoomdata rearchitected streaming

Location: Capital Suite 8/9 Level: Beginner

Secondary topics: Visualization, Design, and UX

Erin Recachinas (Zoomdata)

Average rating:

(4.00, 2 ratings)

The value of real-time streaming analytics with historical data is immense. Big data application Zoomdata updates historical dashboards in real time without complex reaggregations, but streaming in the age of the IoT requires handling of data in volumes not seen in traditional feeds. Erin Recachinas explains how Zoomdata moved to a scalable microservice architecture for streaming sources. Read more.

12:05–12:45 Thursday, 24 May 2018

Big data at speed

Location: S11B Level: Intermediate

Secondary topics: Transportation and Logistics

Mark Grover (Lyft), Ted Malaska (Capital One)

Average rating:

(5.00, 6 ratings)

Many details go into building a big data system for speed, from determining a respectable data-access latency and where to store the data to solving multiregion problems, or even just knowing what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed. Read more.

12:05–12:45 Thursday, 24 May 2018

Machine learning at Intuit: Five delightful use cases

Location: Capital Suite 12 Level: Intermediate

Secondary topics: Financial Services

Calum Murray (Intuit)

Average rating:

(1.50, 2 ratings)

Machine learning-based applications are becoming the new norm. Calum Murray shares five use cases at Intuit that use the data of over 60 million users to create delightful experiences for customers by solving repetitive tasks, freeing them up to spend time more productively or solving very complex tasks with simplicity and elegance. Read more.

12:05–12:45 Thursday, 24 May 2018

A high-performance system for deep learning inference and visual inspection

Location: Capital Suite 13 Level: Intermediate

Secondary topics: Data Platforms, Managing and Deploying Machine Learning

Moty Fania (Intel)

Moty Fania explains how Intel implemented an AI inference platform to enable internal visual inspection use cases and shares lessons learned along the way. The platform is based on open source technologies and was designed for real-time streaming and online actuation. Read more.

12:05–12:45 Thursday, 24 May 2018

A heretical monitoring view: Using PostgreSQL to store Prometheus metrics and visualizing them in Grafana

Location: Expo Hall Level: Intermediate

Secondary topics: Time Series and Graphs

Erik Nordström (Timescale)

Erik Nordström explains how and why to use PostgreSQL as a Prometheus backend to support complex questions (and get a proper SQL interface), offers an overview of pg_prometheus, a custom Prometheus datatype, and prometheus-postgresql-adapter, a remote storage adapter for PostgreSQL, and shares his experience with TimescaleDB, which enables PostgreSQL to scale for classic monitoring volumes. Read more.

14:05–14:45 Thursday, 24 May 2018

Bringing AI to BI: Microsoft's road to automated business incident monitoring and diagnostics with Project Kensho

Location: S11B Level: Intermediate

Secondary topics: Data Platforms, Time Series and Graphs

Tony Xing (Microsoft), Bixiong Xu (Microsoft)

Average rating:

(2.00, 1 rating)

Tony Xing and Bixiong Xu offer an overview of Project Kensho, Microsoft's one-stop shop for business incident monitoring and automated insights. Tony and Bixiong cover the technology's evolution, the architecture, the algorithms, and the benefits and the trade-offs. Along the way, they share a case study on Bing ads key metrics monitoring and automated diagnostic insights. Read more.

14:05–14:45 Thursday, 24 May 2018

Complex event processing with Apache Flink

Location: Capital Suite 8/9 Level: Intermediate

Kostas Kloudas (data Artisans)

Average rating:

(2.25, 4 ratings)

Complex event processing (CEP) helps detect patterns over continuous streams of data. DNA sequencing, fraud detection, shipment tracking with specific characteristics (e.g., contaminated goods), and user activity analysis fall into this category. Kostas Kloudas offers an overview of Flink's CEP library and explains the benefits of its integration with Flink. Read more.

14:55–15:35 Thursday, 24 May 2018

Radically modular data ingestion APIs in Apache Beam

Location: Capital Suite 8/9 Level: Advanced

Secondary topics: Data Integration and Data Pipelines

Eugene Kirpichov (Google)

Average rating:

(4.50, 2 ratings)

Apache Beam offers users a novel programming model in which the classic batch-streaming dichotomy is erased and ships with a rich set of I/O connectors to popular storage systems. Eugene Kirpichov explains why Beam has made these connectors flexible and modular—a key component of which is Splittable DoFn, a novel programming model primitive that unifies data ingestion between batch and streaming. Read more.

16:35–17:15 Thursday, 24 May 2018

Learning how to design automatically updating AI with Apache Kafka and Deeplearning4j

Location: S11A Level: Beginner

Jason Bell (Independent Speaker)

Jason Bell offers an overview of a self-learning knowledge system that uses Apache Kafka and Deeplearning4j to accept data, apply training to a neural network, and output predictions. Jason covers the system design, the rationale behind it, and the implications of using streaming data with deep learning and artificial intelligence. Read more.

16:35–17:15 Thursday, 24 May 2018

You call it data lake; we call it Data Historian.

Location: S11B Level: Intermediate

Secondary topics: Data Platforms

Naghman Waheed (Bayer Crop Science), Brian Arnold (Bayer)

Average rating:

(4.50, 2 ratings)

There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto's Data Historian platform, which can ingest, store, and access datasets without compromising ease of use, governance, or security. Read more.

16:35–17:15 Thursday, 24 May 2018

DevOps at ING Analytics: Combining data engineering with data operations

Location: Capital Suite 7 Level: Intermediate

Giuseppe D'alessio (ING Group)

Average rating:

(3.25, 4 ratings)

Giuseppe D'alessio details ING's DevOps journey, covering its impact on people, processes and tools, best practices, and pitfalls. Giuseppe concludes with a concrete example of using analytics and streaming technology on real-time applications. Read more.

16:35–17:15 Thursday, 24 May 2018

Stream scaling in Pravega

Location: Capital Suite 8/9 Level: Intermediate

Flavio Junqueira (Dell EMC)

Stream processing is in the spotlight. Enabling low-latency insights and actions out of continuously generated data is compelling to a number of application domains, and the ability to adapt to workload variations is critical to many applications. Flavio Junqueira explores Pravega, a stream store that scales streams automatically and enables applications to scale downstream by signaling changes. Read more.
