Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Data Innovations conference sessions

Wednesday, March 30

11:00am–11:40am Wednesday, 03/30/2016
Location: 210 D/H
Eric Tschetter (Yahoo)
Average rating: ****.
(4.29, 7 ratings)
Yahoo uses Druid to provide visibility into the actions of its billions of users and developed a new type of sketch called a Theta Sketch to enable this analysis. Eric Tschetter discusses how Yahoo leverages Druid and Theta Sketches together to enable user-level understanding of its billions of users.
11:50am–12:30pm Wednesday, 03/30/2016
Location: 210 D/H
Tags: real-time
Helena Edelson (Apple), Evan Chan (Tuplejump)
Average rating: ***..
(3.85, 13 ratings)
Developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to develop complex architectures such as Lambda. Helena Edelson and Evan Chan highlight a much simpler approach using the NoLambda stack (Apache Spark/Scala, Mesos, Akka, Cassandra, Kafka) plus FiloDB, a new entrant to the distributed-database world, which combines streaming and ad hoc analytics.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: 210 D/H
Joe Hellerstein (UC Berkeley), Vikram Sreekanti (Berkeley AMP Lab)
Average rating: ****.
(4.00, 7 ratings)
Metadata services are a critical missing piece of the current open source ecosystem for big data. Joe Hellerstein and Vikram Sreekanti give an overview of their vendor-neutral metadata services layer, Ground, through two reference use cases at UC Berkeley: genomics research driven by Spark and courseware using Jupyter Notebooks.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: 210 D/H
Tags: real-time
Calvin Jia (Alluxio), Jiri Simsa (Alluxio)
Not all storage resources are equal. Alluxio has developed Alluxio tiered storage to achieve highly efficient utilization of memory, SSDs, and HDDs that is completely transparent to computation frameworks and user applications. Calvin Jia and Jiri Simsa outline the features and use cases of Alluxio tiered storage.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: 210 D/H
Moderated by:
Derrick Harris (Mesosphere)
Panelists:
Rob Peglar (Micron Technology, Inc), Milind Bhandarkar (Ampool, Inc.), Richard Probst (SAP), Todd Lipcon (Cloudera)
Average rating: ****.
(4.00, 5 ratings)
Years of research in nonvolatile memory systems are being productized and have started coming to market. These exciting new technologies promise lower power consumption and higher density for persistent storage. Will these hardware advances revolutionize the data ecosystem as we know it? This compelling panel of data-infrastructure thought leaders discusses the possibilities.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: 230 C
Ted Dunning (MapR Technologies)
Average rating: ****.
(4.12, 8 ratings)
SQL is normally a very static language that assumes a fixed and well-known schema. Apache Drill breaks these assumptions by restructuring the execution of queries so optimizations and type resolution can be done just in time. This has profound consequences for how applicable SQL is in the big data world. Ted Dunning walks attendees through Drill and explores its implications for big data.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: 210 C/G
Tags: real-time
Todd Palino (LinkedIn), Gwen Shapira (Confluent)
Average rating: ****.
(4.62, 13 ratings)
Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira explore how to monitor, optimize, and troubleshoot performance of your data pipelines, from producer to consumer and from development to production.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: 210 D/H
Tags: real-time
Ted Dunning (MapR Technologies)
Average rating: ****.
(4.11, 9 ratings)
Until recently, batch processing has been the standard model for big data. Today, many have shifted to streaming architectures that offer large benefits in simplicity and robustness, but this isn't your father’s complex event processing. Ted Dunning explores the key design techniques used in modern systems, including percolators, replayable queues, state-point queuing, and microarchitectures.

Thursday, March 31

11:00am–11:40am Thursday, 03/31/2016
Location: LL21 E/F
Tags: media
Daniel Weeks (Netflix)
Average rating: ****.
(4.56, 27 ratings)
Netflix is exploring new avenues for data processing where traditional approaches fail to scale. Daniel Weeks explains how Netflix has enhanced its 25+ petabyte warehouse by combining Parquet's features with Presto and Spark to boost both ETL and interactive queries. Daniel explores how these approaches offer new ways to look at the relationship between storage and compute.
11:00am–11:40am Thursday, 03/31/2016
Location: 210 D/H
Tags: real-time
Costin Leau (Elastic)
Average rating: ****.
(4.15, 13 ratings)
Costin Leau offers an overview of Elastic’s current efforts to enhance Elasticsearch's existing integration with Spark, going beyond Spark core and Spark SQL by focusing on text processing and machine learning to allow data processing and tokenizing to be combined with Spark's MLlib algorithms.
11:00am–11:40am Thursday, 03/31/2016
Location: 230 C
Abin Shahab (Altiscale)
Abin Shahab walks attendees through Altiscale's Docker deployment strategy, describes the design decisions behind it, and discusses the issues encountered and fixed along the way.
11:50am–12:30pm Thursday, 03/31/2016
Location: LL21 E/F
Tags: education
Roshan Sumbaly (Coursera Inc), Pierre Barthelemy (Coursera)
Average rating: ***..
(3.90, 10 ratings)
Coursera's platform allows 15 million learners to take courses from the best universities. Roshan Sumbaly and Pierre Barthelemy outline the pieces of Coursera's data infrastructure (streaming, data warehouse) that support its growing semi- and unstructured data requirements and explain how this ecosystem allows Coursera to build various instructor- and learner-side data products.
11:50am–12:30pm Thursday, 03/31/2016
Location: 210 C/G
Tags: real-time
Guozhang Wang (Confluent)
Average rating: ***..
(3.80, 5 ratings)
You may have successfully made the transition from single machines and one-off solutions to large distributed stream infrastructures in your data center. But what if one data center is not enough? Guozhang Wang offers an overview of best practices for multi-data-center deployments, architecture guidelines for data replication, and disaster scenarios.
11:50am–12:30pm Thursday, 03/31/2016
Location: 230 C
Sumeet Singh (Yahoo), Mridul Jain (Yahoo)
Average rating: ***..
(3.80, 5 ratings)
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
1:50pm–2:30pm Thursday, 03/31/2016
Location: LL21 E/F
Joe Hellerstein (UC Berkeley), Seshadri Mahalingam (Trifacta)
Average rating: ****.
(4.00, 3 ratings)
Seshadri Mahalingam and Joe Hellerstein discuss Photon, a high-performance data-transformation engine that provides immediacy to the data-wrangling experience, and demonstrate how to make the most of modern processors from both the browser and the desktop, with a focus on issues specific to the variety of big raw data, including heavy string manipulation and statistical data profiling.
1:50pm–2:30pm Thursday, 03/31/2016
Location: 210 D/H
Ilya Ganelin (Capital One Data Innovation Lab)
Average rating: ***..
(3.33, 6 ratings)
What if we have reached the point where open source can handle massively difficult streaming problems with enterprise-grade durability? Ilya Ganelin presents Capital One’s novel solution for real-time decisioning on Apache Apex. Ilya shows how Apex provides unique capabilities that ensure less than 2 ms latency in an enterprise-grade solution on Hadoop.
2:40pm–3:20pm Thursday, 03/31/2016
Location: LL21 E/F
Tags: real-time
Fangjin Yang (Imply)
Average rating: ***..
(3.25, 4 ratings)
Running distributed systems in production can be tremendously challenging. Fangjin Yang covers common problems and failures with distributed systems and discusses design patterns that can be used to maintain data integrity and availability when everything goes wrong. Fangjin uses Druid as a real-world case study of how these patterns are implemented in an open source technology.
2:40pm–3:20pm Thursday, 03/31/2016
Location: 210 C/G
Sijie Guo (Twitter)
Average rating: ***..
(3.50, 2 ratings)
DistributedLog is a high-performance replicated log service built on top of Apache BookKeeper that is the foundation of publish-subscribe at Twitter, serving traffic from transactional databases to real-time data analytic pipelines. Sijie Guo offers an overview of DistributedLog, detailing the technical decisions and challenges behind its creation and how it is used at Twitter.
2:40pm–3:20pm Thursday, 03/31/2016
Location: 230 A
Spencer Kimball (Cockroach Labs)
Average rating: ***..
(3.50, 2 ratings)
Often without realizing it, companies spend significant resources engineering new databases. The need to combine traditional relational datasets with new operational and historical data leads to sharded RDBMS or hybridized RDBMS and NoSQL systems, typically leaving few of the constituent database guarantees intact. Spencer Kimball introduces CockroachDB, an open source, scale-out SQL database.
4:20pm–5:00pm Thursday, 03/31/2016
Location: LL21 E/F
Tags: featured
Joseph Turian (Workday), Alex Nisnevich (Bayes Impact)
Average rating: ****.
(4.78, 9 ratings)
Next-gen UIs will allow people to use plain English to interact with software. However, current published research focuses on abstract understanding, not on translating English into concrete software actions. Joseph Turian and Alex Nisnevich outline UPSHOT's English-to-SQL semantic parser and demonstrate how to build your own English-to-“your software application” parser.