Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Large-scale stream processing and analytics sessions

1:15pm–1:55pm Wednesday, 09/30/2015
Roy Ben Alta (Amazon Web Services)
Amazon Kinesis is a fully managed service for real-time ingestion and processing of streaming big data. This talk explores Kinesis concepts in detail, including best practices for scaling your core streaming data ingestion pipeline. We then discuss building and deploying Kinesis processing applications using capabilities like the Kinesis Client Library, AWS Lambda, and Amazon EMR (via Spark).
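As a flavor of the scaling model behind that advice: Kinesis distributes records across shards by hashing each record's partition key into a 128-bit key space, with each shard owning a contiguous range of it. The sketch below is our own illustrative helper (not the AWS SDK), assuming evenly split shard ranges:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key into the 128-bit MD5 hash-key space
    and return the index of the shard whose range it falls in,
    assuming the space is split evenly across shards."""
    space = 2 ** 128
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    # Each shard owns an equal, contiguous slice of the key space.
    return h * num_shards // space

# Records with the same partition key always land on the same shard,
# which is what preserves per-key ordering within a shard.
assert shard_for_key("sensor-42", 4) == shard_for_key("sensor-42", 4)
```

Splitting or merging shards changes these ranges, which is why resharding is central to scaling an ingestion pipeline.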
5:25pm–6:05pm Wednesday, 09/30/2015
Ian Eslick (VitalLabs)
Capturing and integrating device-based and other health data for research is frustratingly difficult. We explain the open source technology framework we developed, working with O'Reilly Media and with support from the Robert Wood Johnson Foundation, for capturing and routing device-based health data for use by healthcare providers and, via a trusted analytic container, for access by researchers.
2:55pm–3:35pm Thursday, 10/01/2015
Fangjin Yang (Imply), Gian Merlino (Imply)
The maturation and development of open source technologies has made it easier than ever for companies to derive insights from vast quantities of data. In this session, we will cover how to build a real-time analytics stack using Kafka, Samza, and Druid. This combination of technologies can power a robust data pipeline that supports real-time ingestion and flexible, low-latency queries.
2:55pm–3:35pm Wednesday, 09/30/2015
Arvind Prabhakar (StreamSets)
Modern data infrastructures operate on vast volumes of continuously produced data generated by independent channels. Enterprises such as consumer banks that have many such channels are starting to implement a single view of customers that can power all customer touchpoints. In this session we present an architectural approach for implementing such a solution using a customer event hub.
1:15pm–1:55pm Thursday, 10/01/2015
Neha Narkhede (Confluent)
Often the hardest step in processing streams is being able to collect all your data in a structured way. We present Copycat, a framework for data ingestion that addresses some common impedance mismatches between data sources and stream processing systems. Copycat uses Kafka as an intermediary, making it easy to get streaming, fault-tolerant data ingestion across a variety of data sources.
4:35pm–5:15pm Wednesday, 09/30/2015
Martin Kleppmann (University of Cambridge)
Even the best data scientist can't do anything if they cannot easily get access to the necessary data. Simply making the data available is Step 1 toward becoming a data-driven organization. In this talk, we'll explore how Apache Kafka can replace slow, fragile ETL processes with real-time data pipelines, and discuss best practices for data formats and integration with existing systems.
11:20am–12:00pm Thursday, 10/01/2015
By 2020, researchers estimate there will be 100 million internet-connected devices. Processing this data in real time, whether it comes from mobile phones or jet engines, will be the new normal. How are companies today adapting to this new real-time stream of data?
2:05pm–2:45pm Wednesday, 09/30/2015
Eric Frenkiel (MemSQL), Noah Zucker (Novus Partners), Ian Hansen (Digital Ocean), Michael DePrizio (Akamai Technologies)
In-memory is no longer just a trend: it's an imperative for high-volume, real-time data workloads. With the relational, distributed MemSQL database, modern enterprises are unlocking value from gigabytes and terabytes of data. Learn about some of the latest applications and deployments of in-memory technology from Akamai Technologies, Novus, and Digital Ocean.
2:55pm–3:35pm Wednesday, 09/30/2015
Eric Schmidt (Google)
Big data processing is challenged by four conflicting desires: latency, accuracy, simplicity, and cost. Google Cloud Dataflow addresses these with a unified, open-sourced programming model backed by a fully managed cloud service. Dataflow enables developers to answer questions with the right level of latency and accuracy, with low operational overhead regardless of size or complexity.
2:05pm–2:45pm Thursday, 10/01/2015
Ankur Gupta (Bitwise Inc.)
Using an open source technology stack, we implemented a solution for real-time analysis of sensor data from mining equipment. We will share the technical architecture, the tools we implemented for real-time complex event processing, why we chose Spark over Storm, some of the challenges we faced, the benchmarks we achieved, and tips for easy integration.
2:05pm–2:45pm Wednesday, 09/30/2015
Haoyuan Li (Alluxio)
Tachyon is a memory-centric, fault-tolerant distributed storage system that enables reliable file sharing at memory speed. It is open source, is deployed at multiple companies, and has more than 80 contributors from over 30 institutions. In this talk, we present Tachyon's architecture, performance evaluation, and several use cases we have seen in the real world.
4:35pm–5:15pm Wednesday, 09/30/2015
Hari Shreedharan (Cloudera), Anand Iyer (Cloudera)
Over the past year, Spark Streaming has emerged as the leading platform to implement IoT and similar real-time use cases. This session includes a brief introduction to Spark Streaming’s micro-batch architecture for real-time stream processing, as well as a live demo of an example use case that includes processing and alerting on time-series data (such as sensor data).
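To make the micro-batch idea concrete, here is a small pure-Python sketch (not Spark code; the function names, batch size, and threshold are made up) that discretizes timestamped sensor readings into fixed-size batches and flags batches whose average exceeds a threshold, the same shape of computation the demo describes:

```python
from collections import defaultdict

def micro_batches(readings, batch_seconds):
    """Group (timestamp, value) readings into fixed-size time buckets,
    mimicking how a micro-batch engine discretizes a stream and then
    processes each batch as a unit."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts // batch_seconds].append(value)
    return dict(buckets)

def alerts(readings, batch_seconds, threshold):
    """Return the start time of every micro-batch whose average
    sensor value exceeds the threshold."""
    return [
        bucket * batch_seconds
        for bucket, values in sorted(micro_batches(readings, batch_seconds).items())
        if sum(values) / len(values) > threshold
    ]

# Readings in the 5-10s window average 92.5, so that batch fires an alert.
data = [(0, 20.0), (1, 21.0), (5, 90.0), (6, 95.0), (11, 22.0)]
print(alerts(data, batch_seconds=5, threshold=80.0))  # → [5]
```

In Spark Streaming the batching and per-batch aggregation are handled by the DStream abstraction rather than hand-rolled as above.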
11:20am–12:00pm Thursday, 10/01/2015
Haden Land (Lockheed Martin IS&GS), Jason Loveland (Lockheed Martin)
Lockheed Martin builds manned and unmanned space systems, which must be tested for all possible conditions, even unforeseen ones. We present a test system, a learning system built on big data technologies, that supports the testing of the Orion Multi-Purpose Crew Vehicle being designed for long-duration, human-rated deep space exploration.
9:00am–12:30pm Tuesday, 09/29/2015
Jesse Anderson (Big Data Institute), Ewen Cheslack-Postava (Confluent)
This is a hands-on workshop where you’ll learn how to leverage the capabilities of Kafka to collect, manage, and process stream data for big data projects and general purpose enterprise data integration needs alike. When your data is captured in real-time and available as real-time subscriptions, you can start to compute new datasets in real-time off these original feeds.
1:30pm–5:00pm Tuesday, 09/29/2015
Patrick McFadin (DataStax)
This tutorial is all about managing large volumes of data coming at your data center fast and continuously. If you don't have a strategy, then allow me to help. Amazing Apache Project software can make this problem a lot easier to deal with. Spend a few hours and learn about how each part works, and how they work together. Your users will thank you.
4:30pm–5:00pm Tuesday, 09/29/2015
Reynold Xin (Databricks)
In this talk, we introduce a recent effort in Spark to employ randomized algorithms for a number of common, expensive operations: membership testing, cardinality estimation, stratified sampling, frequent-item detection, and quantile estimation.
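As a flavor of the approach, membership testing is classically answered with a Bloom filter, which trades a small false-positive rate for constant space and never returns a false negative. The minimal sketch below is illustrative (our own toy, not Spark's implementation):

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: constant-space approximate set
    membership with no false negatives and a tunable
    false-positive rate."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive k independent-ish hash positions by seeding the hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter(num_bits=1024, num_hashes=3)
for word in ["spark", "kafka", "druid"]:
    bf.add(word)
print(bf.might_contain("kafka"))  # → True (added items are never missed)
print(bf.might_contain("storm"))  # almost certainly False; True would be a rare false positive
```

Sizing `num_bits` and `num_hashes` against the expected item count is what controls the false-positive rate.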
2:55pm–3:35pm Wednesday, 09/30/2015
Jim Scott (NVIDIA)
With the move to real-time data analytics and machine learning, streaming applications are becoming more relied upon than ever before. Discover how to build and deploy a globally scalable streaming system. This includes producing messages in one data center and consuming them in another data center, as well as how to make the guarantees that nothing is ever lost.
4:35pm–5:15pm Wednesday, 09/30/2015
Bruce Reading (VoltDB)
You have 10 milliseconds. Less than the blink of an eye, the beat of a heart – that’s how much time you have to ingest fast streams of data, perform analytics on the streams, and take action. Ten milliseconds to win a customer, 10 milliseconds to make a sale, 10 milliseconds to save a life – it’s not much time.
2:05pm–2:45pm Wednesday, 09/30/2015
Karthik Ramasamy (Streamlio)
This talk will present the design and implementation of a new system, called Heron, that is now the de facto stream data processing engine inside Twitter. We will share our experiences running Heron in production.
11:20am–12:00pm Thursday, 10/01/2015
Tathagata Das (Databricks)
As the adoption of Spark Streaming in the industry is increasing, so is the community's demand for more features. Since the beginning of this year, we have made significant improvements in performance, usability, and semantic guarantees. In this talk, I discuss these improvements, as well as the features we plan to add in the near future.
11:20am–12:00pm Wednesday, 09/30/2015
Gwen Shapira (Confluent), Jeff Holoman (Cloudera)
Kafka provides the low latency, high throughput, high availability, and scale that financial services firms require. But can it also provide complete reliability? In this session, we will go over everything that happens to a message, from producer to consumer, and pinpoint all the places where data can be lost if you are not careful.
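One of those loss points is a producer that never waits for an acknowledgment. The toy simulation below is not the Kafka API (the lossy "broker", `fail_rate`, and seed are invented for illustration); it contrasts fire-and-forget with ack-and-retry, at-least-once style delivery:

```python
import random

def send_stream(messages, broker, *, retries=0, fail_rate=0.3, rng=None):
    """Simulate a producer sending to a lossy broker. With retries=0
    (fire-and-forget) unacknowledged messages are silently dropped;
    waiting for an ack and retrying gives at-least-once delivery."""
    rng = rng or random.Random(7)  # fixed seed for a repeatable demo
    for msg in messages:
        for _attempt in range(retries + 1):
            if rng.random() > fail_rate:  # the broker acknowledged
                broker.append(msg)
                break
            # No ack: fire-and-forget gives up here; with retries
            # remaining, the loop sends the message again.

lossy, reliable = [], []
msgs = list(range(100))
send_stream(msgs, lossy, retries=0)
send_stream(msgs, reliable, retries=10)
print(len(lossy), len(reliable))  # the retrying producer delivers far more
```

Real Kafka producers express this with acknowledgment and retry settings (e.g. `acks` and `retries`), and true at-least-once delivery can also introduce duplicates, which this sketch does not model.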