Make Data Work
Oct 15–17, 2014 • New York, NY

Schedule: Hadoop & Beyond sessions

Tools beyond Hadoop—such as Cassandra, Storm, Accumulo, Kafka and Spark—and how they fit in the data science toolkit.

Track Hosts

Paco Nathan (Databricks)

Michael Armbrust (Databricks)

Wednesday, October 15

9:00am–5:00pm Wednesday, 10/15/2014
SOLD OUT
Location: Hall A 23/24
Paco Nathan (O'Reilly Media), Michael Armbrust (Databricks), Tathagata Das (Databricks), Matei Zaharia (Databricks), Reynold Xin (Databricks), Ameet Talwalkar (Determined AI), Holden Karau (IBM), Joseph Bradley (Databricks), Sameer Farooqui (Databricks), Patrick Wendell (Databricks)
Average rating: 3.75 (20 ratings)
Spark Camp, organized by the creators of the Apache Spark project at Databricks, will be a day-long, hands-on introduction to the Spark platform, including Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib, and more.
9:00am–12:30pm Wednesday, 10/15/2014
SOLD OUT
Location: 1 E05
Patrick McFadin (DataStax), Helena Edelson (Apple)
Average rating: 2.80 (5 ratings)
Apache Cassandra has proven to be one of the best solutions for storing and retrieving time series data. Add Apache Spark and Kafka, and you have an amazing time series solution. We will talk data models, then go through deployment and code to build a functional, real-time application. Languages used: Java, Scala.

Thursday, October 16

11:00am–11:40am Thursday, 10/16/2014
Location: 1 E20/1 E21
Michael Stonebraker (Tamr, Inc.)
Average rating: 3.67 (12 ratings)
The explosion of internal data sources, external public data sources and feeds from the Internet of Things is causing a tsunami of diverse data sources for enterprises. Top-down data-integration tools and data scientist tools won’t scale to meet the demands of the modern enterprise. Learn how a scalable data curation platform can help enterprises connect and enrich their data to leverage it all.
11:50am–12:30pm Thursday, 10/16/2014
Location: 1 E20/1 E21
Joe Hellerstein (UC Berkeley), Sean Kandel (Trifacta)
Average rating: 3.83 (12 ratings)
Data transformation, traditionally the domain of IT specialists, is emerging as a critical, widespread problem in data analytics. In this session we discuss the advantages of using a domain-specific language for data transformation tasks. We illustrate these advantages with Wrangle, a DSL designed for interactive data transformation.
1:45pm–2:25pm Thursday, 10/16/2014
Location: 1 E20/1 E21
Fangjin Yang (Imply), Xavier Léauté (Confluent)
Average rating: 3.17 (6 ratings)
Organizations often showcase the virtues of their data platforms, but rarely share the challenges and decisions faced along the way. Our session describes how we architected our analytics stack around Druid, an open source distributed data store, and how we overcame the challenges around scaling the system, balancing features with cost, and making performance consistent.
2:35pm–3:15pm Thursday, 10/16/2014
Location: 1 E20/1 E21
Lior Abraham (Interana Inc)
Average rating: 2.33 (6 ratings)
Leveraging our experience from working on some of the largest-scale, high-growth applications at Facebook and other companies, including building the most popular data analysis tool, Scuba, this talk outlines 10 lessons learned, along with best practices for extracting the most value out of data while avoiding common pitfalls.
4:15pm–4:55pm Thursday, 10/16/2014
Location: 1 E20/1 E21
Haoyuan Li (Alluxio)
Average rating: 4.36 (11 ratings)
An introduction to Tachyon, a memory-centric storage system started at UC Berkeley. It enables different frameworks to share data at memory speed, and it is a major component of the Berkeley Data Analytics Stack (BDAS). The project is open source and is deployed at multiple companies. It has more than 30 contributors from over 10 institutions, including Yahoo, Intel, Red Hat, and Alibaba.
5:05pm–5:45pm Thursday, 10/16/2014
Location: 1 E20/1 E21
Sean Owen (Cloudera)
Average rating: 4.73 (11 ratings)
Apache Spark is a popular new paradigm for computation on Hadoop. It's particularly effective for iterative algorithms relevant to data science like clustering, which can be used to detect anomalies in data. Curious? Get a taste of Spark MLlib, Scala and k-means clustering in this walkthrough of anomaly detection as applied to network intrusion, using the KDD Cup '99 data set.
5:05pm–5:45pm Thursday, 10/16/2014
Location: 1 E6/1 E7
Additional, informal work session with the Spark Team.

Friday, October 17

11:00am–11:40am Friday, 10/17/2014
Location: 1 E20/1 E21
Philip (Flip) Kromer (CSC), Q McCallum (@qethanm)
Average rating: 3.54 (13 ratings)
What is the lambda architecture, and how do you put it to use for your streaming data? Flip Kromer and Q Ethan McCallum will explain how this works, using a live-updating recommendation engine as the supporting example.
11:50am–12:30pm Friday, 10/17/2014
Location: 1 E20/1 E21
Michael Armbrust (Databricks)
Average rating: 4.53 (15 ratings)
In this talk, Michael will describe Spark SQL, the newest component of the Apache Spark stack. A key feature of Spark SQL is the ability to blur the lines between relational tables and RDDs, making it easy for developers to intermix SQL commands that query structured data with complex analytics in imperative or functional languages.
1:45pm–2:25pm Friday, 10/17/2014
Location: 1 E20/1 E21
Hossein Falaki (Databricks Inc.)
Average rating: 3.92 (24 ratings)
We will demonstrate how to combine visual tools with Spark to explore big data visually using three specific techniques: (a) summarize and visualize, (b) sample and visualize, and (c) model and visualize. We will use a real big dataset, such as Wikipedia traffic logs, to demonstrate these techniques in a live demo.
2:35pm–3:15pm Friday, 10/17/2014
Location: 1 E20/1 E21
Anil Madan (PayPal)
Average rating: 3.71 (14 ratings)
Open source real-time BI using Storm, Hadoop, Titan, Druid, and D3.
5:05pm–5:45pm Friday, 10/17/2014
Location: 1 E20/1 E21
David Jonker (Uncharted Software Inc.), Rob Harper (Uncharted)
Average rating: 5.00 (4 ratings)
The widespread adoption of web-based maps provides a familiar set of interactions for exploring large data spaces. Building on these techniques, tile-based visual analytics provides interactive visualization of billions of points of data or more. This session provides an overview of the technical challenges and promise of the approach, using applications created with the open source Aperture Tiles framework on GitHub.