Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Data Innovations conference sessions

Tools beyond Hadoop—such as Cassandra, Storm, Accumulo, Kafka, and AWS—and how they fit into the data science toolkit.

Tuesday, September 29

9:00am–12:30pm Tuesday, 09/29/2015
Location: 3D 04/09 Level: Intermediate
Jesse Anderson (Big Data Institute), Ewen Cheslack-Postava (Confluent)
Average rating: ***..
(3.25, 12 ratings)
This is a hands-on workshop where you’ll learn how to leverage the capabilities of Kafka to collect, manage, and process stream data for big data projects and general-purpose enterprise data integration needs alike. When your data is captured in real time and available as real-time subscriptions, you can start to compute new datasets in real time off these original feeds.

Wednesday, September 30

11:20am–12:00pm Wednesday, 09/30/2015
Location: 1 E18 / 1 E19 Level: Intermediate
Joe Hellerstein (UC Berkeley)
Average rating: ****.
(4.22, 18 ratings)
As the Hadoop ecosystem grows more complex, there is widespread desire for open metadata solutions: common ground for collaboration across users, and interoperability across software solutions. We motivate a new class of open metadata services for big data, via science and enterprise use cases. We also set out challenges for a new class of "meta-on-use" approaches fit for agile analytics.
1:15pm–1:55pm Wednesday, 09/30/2015
Location: 1 E18 / 1 E19 Level: Advanced
Roy Ben Alta (Amazon Web Services)
Average rating: ***..
(3.77, 13 ratings)
Amazon Kinesis is a fully managed service for real-time streaming big data ingestion and processing. This talk explores Kinesis concepts in detail, including best practices for scaling your core streaming data ingestion pipeline. We then discuss building and deploying Kinesis processing applications using capabilities like Kinesis Client Libraries, AWS Lambda, and Amazon EMR (via Spark).
2:05pm–2:45pm Wednesday, 09/30/2015
Location: 1 E18 / 1 E19 Level: Intermediate
Haoyuan Li (Alluxio)
Average rating: ***..
(3.94, 17 ratings)
Tachyon is a memory-centric, fault-tolerant distributed storage system that enables reliable file sharing at memory speed. It is open source and is deployed at multiple companies. In addition, Tachyon has more than 80 contributors from over 30 institutions. In this talk, we present Tachyon's architecture, performance evaluation, and several use cases we have seen in the real world.
2:55pm–3:35pm Wednesday, 09/30/2015
Location: 1 E18 / 1 E19 Level: Advanced
Eric Schmidt (Google)
Average rating: ***..
(3.62, 16 ratings)
Big data processing is challenged by four conflicting desires: latency, accuracy, simplicity, and cost. Google Cloud Dataflow addresses them with a unified, open source programming model backed by a fully managed cloud service. Dataflow enables developers to answer questions with the right level of latency and accuracy, with low operational overhead regardless of size or complexity.
4:35pm–5:15pm Wednesday, 09/30/2015
Location: 1 E18 / 1 E19 Level: Intermediate
Martin Kleppmann (University of Cambridge)
Average rating: ****.
(4.14, 14 ratings)
Even the best data scientist can't do anything if they cannot easily get access to the necessary data. Simply making the data available is Step 1 toward becoming a data-driven organization. In this talk, we'll explore how Apache Kafka can replace slow, fragile ETL processes with real-time data pipelines, and discuss best practices for data formats and integration with existing systems.
5:25pm–6:05pm Wednesday, 09/30/2015
Location: 1 E18 / 1 E19 Level: Intermediate
Yonik Seeley (Cloudera)
Average rating: ****.
(4.27, 11 ratings)
This talk will cover how search and Solr have become a critical part of the Hadoop stack, and have also emerged as one of the highest performing solutions for analytics over big data. We'll also cover new analytics capabilities in Solr that marry full-text search, faceted search, statistics, grouping, and joining into a powerful engine for next-generation big data analytics applications.

Thursday, October 1

11:20am–12:00pm Thursday, 10/01/2015
Location: 1 E18 / 1 E19 Level: Intermediate
Tags: media, featured
Kurt Brown (Netflix)
Average rating: ****.
(4.87, 31 ratings)
The Netflix Data Platform is a constantly evolving, large-scale infrastructure running in the AWS cloud. We are especially focused on performance and ease of use, with initiatives including Presto integration, Spark, and our big data portal and API. This talk will dive into the various technologies we use, the motivations behind our approach, and the business benefits we get.
1:15pm–1:55pm Thursday, 10/01/2015
Location: 1 E18 / 1 E19 Level: Intermediate
Neha Narkhede (Confluent)
Average rating: ****.
(4.50, 14 ratings)
Often the hardest step in processing streams is being able to collect all your data in a structured way. We present Copycat, a framework for data ingestion that addresses some common impedance mismatches between data sources and stream processing systems. Copycat uses Kafka as an intermediary, making it easy to get streaming, fault-tolerant data ingestion across a variety of data sources.
2:05pm–2:45pm Thursday, 10/01/2015
Location: 1 E18 / 1 E19 Level: Non-technical
Carlos Guestrin (Apple | University of Washington)
Average rating: ***..
(3.78, 9 ratings)
As companies increase the number of deployments of machine learning-based applications, the number of models that need to be monitored grows at a tremendous pace. In this talk, we outline some of the key challenges in large-scale deployments of machine learning models, then describe a methodology to manage such models in production to mitigate the technical debt.
2:55pm–3:35pm Thursday, 10/01/2015
Location: 1 E18 / 1 E19 Level: Intermediate
Tags: geospatial
Ryan Smith (DigitalGlobe)
Average rating: ***..
(3.75, 4 ratings)
MrGeo is a geospatial toolkit designed to provide raster-based geospatial capabilities that can be performed at scale by leveraging the Hadoop ecosystem. This session will provide an overview of the MrGeo design for storing and processing large-scale raster datasets in the cloud, highlight core operations, and present performance benchmarks for some example operations on open data sets.
4:35pm–5:15pm Thursday, 10/01/2015
Location: 1 E18 / 1 E19 Level: Intermediate
Venky Ganti (Alation)
Average rating: ***..
(3.00, 1 rating)
Recommendation engines are cognitive computing applications. Their algorithms “learn” from experience. What if a recommendation engine could help analysts sort through big data? Building a query recommendation engine is complex. We’ll share some of the technical challenges and lessons from building a cognitive application in daily use today by analyst teams from eBay to Square.