Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Real-time conference sessions

11:00am–11:40am Wednesday, 03/30/2016
Chris Sanden (Netflix), Christopher Colburn (Netflix)
Chris Sanden and Christopher Colburn outline a shared infrastructure for doing anomaly detection. Chris and Christopher explain how their solution addresses both real-time and batch use cases and offer a framework for performance evaluation.
1:30pm–5:00pm Tuesday, 03/29/2016
Patrick McFadin (DataStax)
Patrick McFadin gives a comprehensive overview of the powerful Team Apache: Apache Kafka, Spark, and Cassandra. Patrick demonstrates data models, covers deployment considerations, and explains code for different requirements.
11:00am–11:40am Wednesday, 03/30/2016
Eric Tschetter (Yahoo)
Yahoo uses Druid to provide visibility into the actions of its billions of users and developed a new type of sketch called a Theta Sketch to enable this analysis. Eric Tschetter discusses how Yahoo leverages Druid and Theta Sketches together to enable user-level understanding of its billions of users.
2:40pm–3:20pm Thursday, 03/31/2016
Kostas Tzoumas (data Artisans)
Apache Flink is a full-featured streaming framework with high throughput, millisecond latency, strong consistency, support for out-of-order streams, and support for batch as a special case of streaming. Kostas Tzoumas gives an overview of Flink and its streaming-first philosophy, as well as the project roadmap and vision: fully unifying the worlds of “batch” and “streaming” analytics.
11:00am–11:40am Thursday, 03/31/2016
Michael Armbrust (Databricks)
Michael Armbrust explores real-time analytics with Spark from interactive queries to streaming.
2:40pm–3:20pm Thursday, 03/31/2016
Fangjin Yang (Imply)
Running distributed systems in production can be tremendously challenging. Fangjin Yang covers common problems and failures with distributed systems and discusses design patterns that can be used to maintain data integrity and availability when everything goes wrong. Fangjin uses Druid as a real-world case study of how these patterns are implemented in an open source technology.
5:10pm–5:50pm Wednesday, 03/30/2016
Jean-Marc Spaggiari (Cloudera), Kevin O'Dell (Rocana)
Most already know HBase, but many don't know that it can be coupled with other tools from the ecosystem to increase efficiency. Jean-Marc Spaggiari and Kevin O'Dell walk attendees through some real-life HBase use cases and demonstrate how they have been efficiently implemented.
2:40pm–3:20pm Thursday, 03/31/2016
Joseph Adler (Confluent), Ewen Cheslack-Postava (Confluent), Jun Rao (Confluent), Jesse Anderson (Big Data Institute), Neha Narkhede (Confluent)
Joseph Adler, Ewen Cheslack-Postava, Jun Rao, Jesse Anderson, and Neha Narkhede, the instructors of the Apache Kafka tutorials, field a wide range of detailed questions. Even if you don’t have a specific question, join in to hear what others are asking.
1:50pm–2:30pm Thursday, 03/31/2016
Reynold Xin (Databricks), Tathagata Das (Databricks), Michael Armbrust (Databricks)
Join the Spark team for an informal Q&A session. Apache Spark architects Reynold Xin, Tathagata Das, and Michael Armbrust will be on hand to field a wide range of detailed questions. Even if you don’t have a specific question, join in to hear what others are asking.
11:50am–12:30pm Wednesday, 03/30/2016
Leo Meyerovich (Graphistry), Joshua Patterson (NVIDIA), Mike Wendt (NVIDIA)
“Assuming breach” led to centralizing all logs (SIEMs), but incident response and forensics are still behind on the analytics side. Leo Meyerovich, Mike Wendt, and Joshua Patterson share how Graphistry and Accenture Technology Labs are rethinking data engineering and data analysis and modernizing end-to-end architectures.
1:50pm–2:30pm Thursday, 03/31/2016
Matt Olson (CenturyLink)
Software-defined networking (SDN) and network functions virtualization (NFV) hold tremendous potential to enable efficiency and flexibility in service delivery, but SDN/NFV environments are also highly complex and multilayered. Matt Olson explains why effective support for SDN/NFV services requires leveraging the tremendous amount of service and data streaming from the platform.
5:10pm–5:50pm Wednesday, 03/30/2016
Siva Raghupathy (Amazon Web Services), Manjeet Chayel (Amazon Web Services)
Analyzing real-time streams of data is becoming increasingly important to remain competitive. Siva Raghupathy and Manjeet Chayel guide attendees through some of the proven architectures for processing streaming data using a combination of cloud and open source tools such as Apache Spark. Watch a live demo and learn how you can easily scale your applications with Amazon Web Services.
1:30pm–5:00pm Tuesday, 03/29/2016
Joseph Adler (Confluent), Ewen Cheslack-Postava (Confluent), Ian Wrigley (StreamSets)
Joseph Adler, Ewen Cheslack-Postava, and Ian Wrigley demonstrate the features of Apache Kafka that make it easy to build fast, secure, and reliable data pipelines and explain how to use Copycat, Kafka Streams, and Kafka Security as they coach you through building a working enterprise data pipeline.
2:40pm–3:20pm Thursday, 03/31/2016
Sijie Guo (Twitter)
DistributedLog is a high-performance replicated log service built on top of Apache BookKeeper that is the foundation of publish-subscribe at Twitter, serving traffic from transactional databases to real-time data analytic pipelines. Sijie Guo offers an overview of DistributedLog, detailing the technical decisions and challenges behind its creation and how it is used at Twitter.
9:00am–5:00pm Monday, 03/28/2016
Tim Berglund (Confluent), Tanya Gallagher (DataStax)
O’Reilly Media and DataStax have partnered to create a 2-day developer certification course for Apache Cassandra. Get certified as a Cassandra developer at Strata + Hadoop World in San Jose and be recognized for your NoSQL expertise.
11:00am–11:40am Wednesday, 03/30/2016
Eric Frenkiel (MemSQL), JR Cahill (Kellogg)
To win in the on-demand economy, businesses must embrace real-time analytics. Eric Frenkiel demos an enterprise approach to data solutions for predictive analytics. Eric is joined by JR Cahill, who outlines Kellogg's approach to advanced analytics with MemSQL, including moving from overnight to intraday analytics and integrating directly with business intelligence tools like Tableau.
4:20pm–5:00pm Wednesday, 03/30/2016
Alex Silva (Pluralsight)
Alex Silva outlines the implementation of a real-time analytics platform using microservices and a Scala stack that includes Kafka, Spark Streaming, Spray, and Akka. This infrastructure can process vast amounts of streaming data, ranging from video events to clickstreams and logs. The result is a powerful real-time data pipeline capable of flexible data ingestion and fast analysis.
11:00am–11:40am Wednesday, 03/30/2016
Jay Kreps (Confluent)
The world is moving to real-time data, and much of that data flows through Apache Kafka. Jay Kreps explores how Kafka forms the basis for our modern stream-processing architecture. He covers some of the pros and cons of different frameworks and approaches and discusses the recent APIs Kafka has added to allow direct stream processing of Kafka data.
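To give a flavor of the direct stream-processing APIs mentioned above, here is a minimal sketch using the Kafka Streams API (shown as it later stabilized in the Java client); the topic names and filtering logic are hypothetical, not taken from the session:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class CheckoutViewFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "checkout-view-filter"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read a topic, keep only checkout page views, and write them to another topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> views = builder.stream("page-views");   // hypothetical topic
        views.filter((user, page) -> page.startsWith("/checkout"))
             .to("checkout-views");                                     // hypothetical topic

        new KafkaStreams(builder.build(), props).start();
    }
}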
9:10am–9:15am Wednesday, 03/30/2016
Eric Frenkiel (MemSQL)
The next evolution in the on-demand economy is in predictive analytics fueled by live streams of data—in effect knowing what customers want before they do. Eric Frenkiel explains how a real-time trinity of technologies—Kafka, Spark, and MemSQL—is enabling Uber and others to power their own revolutions with predictive apps and analytics.
11:00am–11:40am Thursday, 03/31/2016
Costin Leau (Elastic)
Costin Leau offers an overview of Elastic’s current efforts to enhance Elasticsearch's existing integration with Spark, going beyond Spark core and Spark SQL by focusing on text processing and machine learning to allow data processing and tokenizing to be combined with Spark's MLlib algorithms.
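For reference, the existing Spark integration the session builds on can be used roughly as follows; this is a hedged sketch with the elasticsearch-hadoop Java API, and the index name, type, and Elasticsearch address are hypothetical:

import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class EsSparkRead {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("es-spark-read")
            .set("es.nodes", "localhost:9200");           // Elasticsearch endpoint
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load documents from an index/type as (id, fields) pairs and keep the fields.
        JavaRDD<Map<String, Object>> docs =
            JavaEsSpark.esRDD(sc, "articles/post").values();   // hypothetical index/type
        System.out.println("documents: " + docs.count());

        sc.stop();
    }
}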
11:50am–12:30pm Thursday, 03/31/2016
Joey Echeverria (Rocana)
Real-time analysis starts with transforming raw data into structured records. Typically this is done with bespoke business logic custom written for each use case. Joey Echeverria presents a configuration-based, reusable library for data transformation that can be embedded in real-time stream-processing systems and demonstrates its real-world use cases with Apache Kafka and Apache Hadoop.
11:50am–12:30pm Wednesday, 03/30/2016
Bin Fan (Alluxio), Haojun Wang (Baidu)
Baidu runs Alluxio in production with hundreds of nodes managing petabytes of data. Bin Fan and Haojun Wang demonstrate how Alluxio improves big data analytics (ad hoc query)—Baidu experienced a 30x performance improvement—and explain how Baidu leverages Alluxio in its machine-learning architecture and how it uses Alluxio to manage heterogeneous storage resources.
11:00am–11:40am Thursday, 03/31/2016
Ted Malaska (Blizzard Entertainment), Jeff Holoman (Cloudera)
Ted Malaska and Jeff Holoman explain how to go from zero to full-on time series and mutable-profile systems in 40 minutes. Ted and Jeff cover code examples of ingestion from Kafka and Spark Streaming and access through SQL, Spark, and Spark SQL to explore the underlying theories and design patterns that will be common for most solutions with Kudu.
1:50pm–2:30pm Wednesday, 03/30/2016
Todd Lipcon (Cloudera)
Todd Lipcon explores the tradeoffs between real-time transactional access and fast analytic performance from the perspective of storage-engine internals. Todd also outlines Kudu, the new addition to the open source Hadoop ecosystem that complements HDFS and HBase to provide a new option for achieving fast scans and fast random access from a single API.
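As an illustration of the single-API point, below is a minimal sketch of a random-access write with the Kudu Java client (shown under its later Apache package name); the master address, table, and columns are hypothetical:

import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;

public class KuduRandomWrite {
    public static void main(String[] args) throws KuduException {
        KuduClient client =
            new KuduClient.KuduClientBuilder("kudu-master:7051").build();  // hypothetical master
        try {
            KuduTable table = client.openTable("metrics");                 // hypothetical table
            KuduSession session = client.newSession();

            // A single-row insert: Kudu serves this kind of random access
            // alongside fast scans over the same table.
            Insert insert = table.newInsert();
            insert.getRow().addString("host", "web01");
            insert.getRow().addLong("ts", System.currentTimeMillis());
            insert.getRow().addDouble("value", 0.42);
            session.apply(insert);
            session.close();
        } finally {
            client.close();
        }
    }
}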
11:00am–11:40am Thursday, 03/31/2016
Steve Wooledge (MapR Technologies)
In order to remain competitive, you need to be able to respond to changing conditions in the moment. New stream-based technologies allow you to build applications that incorporate low-latency processing so you can stream data immediately or whenever you’re ready. Steve Wooledge explores how new streaming technologies make this approach work and how they can be applied in many industries.
11:00am–11:40am Wednesday, 03/30/2016
Yvonne Quacken (Siemens), Allen Hoem (Teradata)
Yvonne Quacken and Allen Hoem explore the business and technical challenges that Siemens faced capturing continuous data from millions of sensors across different areas and explain how Teradata Listener helped Siemens simplify this data-capture process with a single, central service to ingest multiple real-time data streams simultaneously in a reliable fashion.
9:00am–12:30pm Tuesday, 03/29/2016
Jesse Anderson (Big Data Institute), Ewen Cheslack-Postava (Confluent), Joseph Adler (Confluent), Ian Wrigley (StreamSets)
Ewen Cheslack-Postava, Joseph Adler, Jesse Anderson, and Ian Wrigley show how to use Apache Kafka to collect, manage, and process stream data for big data projects and general purpose enterprise data-integration needs alike. Once your data is captured in real time and available as real-time subscriptions, you can start to compute new datasets in real-time from these original feeds.
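As a taste of the collection step, here is a minimal producer sketch with the Kafka Java client; the topic, key, and payload are hypothetical:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");  // wait until the write is fully replicated
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish one click event; downstream jobs subscribe to the topic in real time.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("clicks", "user-42", "{\"page\":\"/home\"}"));
        }
    }
}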
11:50am–12:30pm Thursday, 03/31/2016
Sumeet Singh (Yahoo), Mridul Jain (Yahoo)
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
9:25am–9:35am Wednesday, 03/30/2016
Jack Norris (MapR Technologies)
Big data is not limited to reporting and analysis; increasingly, companies are differentiating themselves by acting on data in real time. But what does "real time" really mean? Jack Norris discusses the challenges of coordinating data flows, analysis, and integration at scale to truly impact business as it happens.
4:20pm–5:00pm Wednesday, 03/30/2016
Yinglian Xie (DataVisor)
Yinglian Xie describes the anatomy of modern online services, where large armies of malicious accounts hide among legitimate users and conduct a variety of attacks. Yinglian demonstrates how the Spark framework can facilitate early detection of these types of attacks by analyzing billions of user actions.
11:50am–12:30pm Wednesday, 03/30/2016
Helena Edelson (Apple), Evan Chan (Tuplejump)
Developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to develop complex architectures such as Lambda. Helena Edelson and Evan Chan highlight a much simpler approach using the NoLambda stack (Apache Spark/Scala, Mesos, Akka, Cassandra, Kafka) plus FiloDB, a new entrant to the distributed-database world, which combines streaming and ad hoc analytics.
11:00am–11:40am Thursday, 03/31/2016
Dean Wampler (Lightbend)
If you’re using (or considering) Scala and the JVM as a big data platform, Dean can answer all your questions about Spark, Mesos, and fast data.
11:50am–12:30pm Wednesday, 03/30/2016
Jay Kreps (Confluent)
Working with distributed streaming data architectures? Jay is available to answer all your questions about Apache Kafka, stream processing and streaming data architectures, and Confluent and the Confluent platform.
2:40pm–3:20pm Wednesday, 03/30/2016
Jesse Anderson (Big Data Institute)
If you are a manager or CxO about to launch a big data project, come see Jesse. He’ll offer advice on how to create successful data projects, the patterns of successful data engineering teams and projects, and how to avoid the most common—and costly—pitfalls of a large Hadoop deployment.
11:50am–12:30pm Wednesday, 03/30/2016
Patrick McFadin (DataStax)
Have some questions about using Apache Cassandra on your next project? Patrick will be around to talk about the following: data modeling for specific use cases, using Apache Spark with Cassandra data, and deployment and operation topics.
4:20pm–5:00pm Thursday, 03/31/2016
Tony Ng (eBay, Inc.)
Enterprises are increasingly demanding real-time analytics and insights. Tony Ng offers an overview of Pulsar, an open source real-time streaming system used at eBay, which can scale to millions of events per second and supports a SQL-like 4GL. Tony explains how Pulsar integrates Kafka, Kylin, and Druid to provide flexibility and scalability in event and metrics consumption.
5:10pm–5:50pm Wednesday, 03/30/2016
Todd Palino (LinkedIn), Gwen Shapira (Confluent)
Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira explore how to monitor, optimize, and troubleshoot performance of your data pipelines—from producer to consumer, development to production.
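The sketch below shows the kind of producer tuning knobs a session like this typically examines; the settings and values are illustrative assumptions, not recommendations from the speakers:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Throughput vs. latency vs. durability knobs (illustrative values only):
        props.put("acks", "all");                // durability: wait for full replication
        props.put("compression.type", "snappy"); // smaller batches on the wire, more CPU
        props.put("linger.ms", "10");            // allow a short wait to fill larger batches
        props.put("batch.size", "65536");        // per-partition batch size in bytes

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("pipeline-topic", "hello"));  // hypothetical topic
        }
    }
}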
11:50am–12:30pm Wednesday, 03/30/2016
Bolke de Bruin (ING), Hylke Hendriksen (ING)
If you consider user click paths a process, you can apply process mining. Process mining models users based on their actual behavior, which allows us to compare new clicks with modeled behavior and report any inconsistencies. Bolke de Bruin and Hylke Hendriksen explain how ING implemented process mining on Spark Streaming, enabling real-time fraud detection.
11:50am–12:30pm Wednesday, 03/30/2016
Ted Dunning (MapR Technologies)
Application messaging isn’t new—solutions include IBM MQ, RabbitMQ, and ActiveMQ. Apache Kafka is a high-performance, high-scalability alternative that integrates well with Hadoop. Can modern distributed messaging systems like Kafka be considered a legacy replacement, or are they purely complementary? Ted Dunning outlines Kafka's architectural benefits and tradeoffs to find the answer.
4:20pm–5:00pm Wednesday, 03/30/2016
Alex Ingerman (Amazon Web Services)
Alex Ingerman explains how several AWS services, including Amazon Machine Learning, Amazon Kinesis, AWS Lambda, and Amazon Mechanical Turk, can be tied together to build a predictive application to power a real-time customer-service use case.
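For the ingestion side of such a pipeline, here is a minimal sketch that writes one event to an Amazon Kinesis stream using the AWS SDK for Java; the stream name, partition key, and payload are hypothetical, and the downstream Lambda and Amazon Machine Learning steps are only noted in comments:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

public class SupportEventIngest {
    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        // Push one customer-service event into the stream; a Lambda function and
        // an Amazon Machine Learning real-time endpoint would consume it downstream.
        PutRecordRequest request = new PutRecordRequest()
            .withStreamName("support-events")              // hypothetical stream
            .withPartitionKey("customer-1234")
            .withData(ByteBuffer.wrap(
                "{\"type\":\"ticket_opened\"}".getBytes(StandardCharsets.UTF_8)));
        kinesis.putRecord(request);
    }
}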
4:20pm–5:00pm Thursday, 03/31/2016
Jim Scott (MapR Technologies)
The Zeta Architecture is an enterprise architecture for moving beyond the data lake. The most logical way to scale applications across tiers is to put a messaging platform between the tiers, which makes it far simpler to scale communication among applications. Jim Scott covers the benefits of this model and offers an example of data-center monitoring.
11:50am–12:30pm Wednesday, 03/30/2016
Jun Rao (Confluent)
With Apache Kafka 0.9, the community has introduced a number of features to make data streams secure. Jun Rao explains the motivation for making these changes, discusses the design of Kafka security, and demonstrates how to secure a Kafka cluster. Jun also covers common pitfalls in securing Kafka and ongoing security work.
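A minimal sketch of a consumer configured for the security features introduced in Kafka 0.9 appears below; the broker address, group, topic, and truststore path are hypothetical, and Kerberos credentials would be supplied separately via a JAAS configuration:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SecureConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:9093");  // TLS listener
        props.put("group.id", "secure-readers");                     // hypothetical group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Security settings added in Kafka 0.9: encrypt traffic with TLS and
        // authenticate via Kerberos (SASL).
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.kerberos.service.name", "kafka");
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("secure-topic"));   // hypothetical topic
        System.out.println(consumer.poll(1000).count());
        consumer.close();
    }
}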
1:50pm–2:30pm Thursday, 03/31/2016
Timothy Potter (Lucidworks)
Solr has been adopted by all major Hadoop platform vendors as the de facto standard for big data search. Timothy Potter introduces an open source project that exposes Solr as a SparkSQL datasource. Timothy offers common use cases, access to open source code, and performance metrics to help you develop your own large-scale search and discovery solution.
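A minimal sketch of reading a Solr collection through the spark-solr datasource might look like the following (written against the later Spark 2.x SparkSession API); the ZooKeeper address and collection name are hypothetical:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SolrDataFrameRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("solr-read").getOrCreate();

        // Read a Solr collection as a DataFrame via the spark-solr datasource,
        // then query it with Spark SQL.
        Dataset<Row> products = spark.read()
            .format("solr")
            .option("zkhost", "localhost:9983")     // ZooKeeper used by SolrCloud
            .option("collection", "products")       // hypothetical collection
            .load();
        products.createOrReplaceTempView("products");
        spark.sql("SELECT COUNT(*) FROM products").show();

        spark.stop();
    }
}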
2:40pm–3:20pm Thursday, 03/31/2016
Yann Landrin (Autodesk), Charlie Crocker (Autodesk)
Autodesk's next-gen analytics pipeline, based on SDKs, Kafka, Spark, and containers, will solve the problems of platform and product fragmentation, instrumentation quality, and ease of access to analytics. Yann Landrin and Charlie Crocker explore the features that will enable teams to build reliable, high-quality usage analytics for Autodesk's products, autonomously and in mere minutes.
5:10pm–5:50pm Wednesday, 03/30/2016
Ted Dunning (MapR Technologies)
Until recently, batch processing has been the standard model for big data. Today, many have shifted to streaming architectures that offer large benefits in simplicity and robustness, but this isn't your father’s complex event processing. Ted Dunning explores the key design techniques used in modern systems, including percolators, replayable queues, state-point queuing, and microarchitectures.
11:50am–12:30pm Thursday, 03/31/2016
Tathagata Das (Databricks)
Tathagata Das introduces Streaming DataFrames, the next evolution of Spark Streaming. Streaming DataFrames brings an additional dimension into the unified model: interactive analysis. In addition, it provides enhanced support for out-of-order (delayed) data, zero-latency decision making, and integration with existing enterprise data warehouses.
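The sketch below is written against the API that later shipped in Spark as Structured Streaming, which grew out of this Streaming DataFrames work; the Kafka topic and broker address are hypothetical:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StreamingEventCounts {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("streaming-counts").getOrCreate();

        // Treat a Kafka topic as an unbounded DataFrame and keep a running count per event type.
        Dataset<Row> events = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "events")                     // hypothetical topic
            .load();

        Dataset<Row> counts = events
            .selectExpr("CAST(value AS STRING) AS event")
            .groupBy("event")
            .count();

        // The aggregation is updated incrementally as new records arrive.
        StreamingQuery query = counts.writeStream()
            .outputMode("complete")
            .format("console")
            .start();
        query.awaitTermination();
    }
}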
1:50pm–2:30pm Thursday, 03/31/2016
Ilya Ganelin (Capital One Data Innovation Lab)
What if we have reached the point where open source can handle massively difficult streaming problems with enterprise-grade durability? Ilya Ganelin presents Capital One’s novel solution for real-time decisioning on Apache Apex. Ilya shows how Apex provides unique capabilities that ensure less than 2 ms latency in an enterprise-grade solution on Hadoop.
1:50pm–2:30pm Wednesday, 03/30/2016
John Hugg (VoltDB)
In the race to pair streaming systems with stateful systems, the winners will be stateful systems that process streams natively. These systems remove the burden on application developers to be distributed systems experts and enable new applications to be both powerful and robust. John Hugg describes what’s possible when integrated systems apply a transactional approach to event processing.
2:40pm–3:20pm Thursday, 03/31/2016
Amit Satoor (SAP), Balaji Krishna (SAP)
Join the SAP team for a demonstration of how OLAP on Hadoop and real-time query federation help unify enterprise and big data, using SAP's new big data solution, SAP HANA Vora. Amit Satoor and Balaji Krishna explore real-world use cases where instant insights from a combination of operational and Hadoop data impact core business operations.
1:50pm–2:30pm Thursday, 03/31/2016
Karthik Ramasamy (Twitter)
Heron, Twitter's streaming system, has been in production for nearly two years and is widely used by several teams for diverse use cases. Karthik Ramasamy discusses Twitter's operating experiences and shares the challenges of running Heron at scale as well as the approaches that Twitter took to solve them.
11:50am–12:30pm Wednesday, 03/30/2016
Vinoth Chandar (Uber)
Vinoth Chandar explains how Uber revamped its foundational data infrastructure with Hadoop as the source-of-truth data lake, sharing lessons from the experience.
2:40pm–3:20pm Wednesday, 03/30/2016
Calvin Jia (Alluxio), Jiri Simsa (Alluxio)
Not all storage resources are equal. Alluxio has developed Alluxio tiered storage to achieve highly efficient utilization of memory, SSDs, and HDDs that is completely transparent to computation frameworks and user applications. Calvin Jia and Jiri Simsa outline the features and use cases of Alluxio tiered storage.
11:50am–12:30pm Thursday, 03/31/2016
Guozhang Wang (Confluent)
You may have successfully made the transition from single machines and one-off solutions to large distributed stream infrastructures in your data center. But what if one data center is not enough? Guozhang Wang offers an overview of best practices for multi-data-center deployments, architecture guidelines for data replication, and disaster scenarios.