Make Data Work
Oct 15–17, 2014 • New York, NY
 

Strata + Hadoop World 2014 Schedule

1D
8:45am Plenary
Room: 1D
Thursday Keynote Welcome Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
8:55am Plenary
Room: 1D
Open Standards and the Modern Data Center Mike Olson (Cloudera)
9:10am Plenary
Room: 1D
What Would Google Do? Understanding the Future of Big Data M. C. Srivas (Uber)
9:20am Plenary
Room: 1D
Keynote with Miriah Meyer Miriah Meyer (University of Utah)
9:30am Plenary
Room: 1D
Accelerating Parkinson’s Research with Big Data Technologies Ron Kasabian (Intel)
9:35am Plenary
Room: 1D
Data & The New Era of Interactive Storytelling Sharmila Mulligan (ClearStory Data)
9:40am Plenary
Room: 1D
Distributions More Interesting than Averages: Some Thoughts on Data Visualization Amanda Cox (The New York Times)
9:50am Plenary
Room: 1D
Spark Needs a Business Analyst Workflow Ben Werther (Platfora)
9:55am Plenary
Room: 1D
Statistics Without the Agonizing Pain John Rauser (Snapchat)
10:05am Plenary
Room: 1D
Crowdsourcing Humor: The New Yorker Caption Contest Bob Mankoff (The New Yorker Magazine)
11:00am All the Data and Still Not Enough! Claudia Perlich (Dstillery)
11:50am The Great Debate: If You Can't Code, You Can't Be a Data Scientist Joseph Adler (Confluent), Hilary Mason (Fast Forward Labs), Scott Nicholson (Poynt), Lucian Lita (Intuit), Roger Magoulas (O'Reilly Media)
1:45pm The Day Zach Galifianakis Saved Healthcare Chris Harland (Microsoft)
4:15pm Multi-language Data Science with IPython, IJulia, IR, and Friends Brian Granger (Cal Poly San Luis Obispo), Fernando Perez (UC Berkeley and Lawrence Berkeley National Laboratory)
1 E8/1 E9
11:00am Designing with Data Jeffrey Heer (Trifacta | University of Washington)
11:50am Crowdsourcing Humor: The New Yorker Caption Contest Bob Mankoff (The New Yorker Magazine)
1:45pm Data Science Bootcamp Laurie Skelly (Datascope Analytics)
2:35pm In the Data Lake Barry Devlin (9sight Consulting)
4:15pm Unseating the Giants Monte Zweben (Splice Machine Inc.)
5:05pm What’s Holding Up Your Hadoop? Eddie Garcia (Cloudera)
1 C03/1 C04
11:50am Customer Intelligence: Harnessing Elephants at Transamerica Stephen Lloyd (Transamerica), Vishal Bamba (Transamerica), David Beaudoin (Transamerica)
1:45pm Big Data: A Journey of Innovation Sastry Durvasula (American Express), Kevin Murray (American Express)
2:35pm Transitioning from Original Big Data to the New Big Data: L.L.Bean’s Journey Chris Wilson (L.L.Bean), Doug Bryan (RichRelevance)
4:15pm Unlocking Big Data at CERN Matthias Braeger (CERN), Manish Devgan (Software AG)
5:05pm Big Data Modeling: How FICO is Turning DBAs and into Data Engineers Lelanie Moll (FICO), Deb Brooks (FICO), Silaphet Mounkhaty (FICO)
Hall A 23/24
11:00am From Raw Data to Analytics with No ETL Marcel Kornacker (Cloudera), Lenni Kuff (Cloudera)
11:50am SQL on Everything, in Memory Julian Hyde (Hortonworks)
1:45pm From Oracle to Hadoop Guy Harrison (Dell Software), David Robson (Dell Software), Kathleen Ting (Cloudera)
2:35pm Hive on Apache Tez: Benchmarked at Yahoo! Scale Mithun Radhakrishnan (Yahoo! Inc.)
4:15pm Scaling Storm: Cluster Sizing and Performance Optimization P. Taylor Goetz (Hortonworks)
5:05pm Building Real-time Data Products at LinkedIn with Apache Samza Martin Kleppmann (University of Cambridge)
1 E20/1 E21
11:00am Three Approaches to Scalable Data Curation Michael Stonebraker (Tamr, Inc.)
11:50am Advantages of a Domain-Specific Language Approach to Data Transformation Joe Hellerstein (UC Berkeley), Sean Kandel (Trifacta)
1:45pm Stories from the Trenches: The Challenges of Building an Analytics Stack Fangjin Yang (Imply), Xavier Léauté (Confluent)
2:35pm Lessons from Fast Analytics and creating Scuba Lior Abraham (Interana Inc)
1 E10/1 E11
11:00am Better Accountability Through Open Data Merici Vinton (OI Engine @ IDEO), Micheál Keane (Civis Analytics)
11:50am Wonk, Meet Geek Jim Adler (Metanautix)
1:45pm You Have Zero Privacy, You Own Your Data, and Other Myths Gilad Rosner (Internet of Things Privacy Forum)
2:35pm Homelessness Prevention by the Numbers Stefan Heeke (SumAll.org), Adeen Flinker (SumAll.org)
4:15pm Why Big Data Needs Thick Data Tricia Wang (Constellate Data), Matt LeMay (Constellate Data)
5:05pm The Open Data 500: Building Businesses on Free Government Data Joel Gurin (Center for Open Data Enterprise), Laura Manley (The GovLab at NYU)
1 E12/1 E13
11:00am Generating Possible A/B Tests for Uber Via a City Simulation Framework Bradley Voytek (UC San Diego and Uber, Inc.)
11:50am Rats and Garbage Cans: The Dirty Data that Makes Cities Cleaner Brett Goldstein (University of Chicago)
1:45pm The State GeoSpatial BigData Mansour Raad (ESRI)
1 E14/1 E15
11:00am Solving the Right Problem Max Shron (Warby Parker), Sasha Laundy (Warby Parker)
11:50am Transforming to a Data Driven Operations Model Denise Asplund (Cisco Systems, Inc)
1:45pm From Experiments to Insights at Pinterest Andrea Burbank (Pinterest)
4:15pm Unboxing Data Startups – Gilt Groupe, Airbnb, and Lookout Michael Abbott (Kleiner Perkins Caufield & Byers), Will Moss (Airbnb), Geoff Guerdat (Gilt Groupe), Emil Ong (Lookout)
5:05pm Unboxing Data Startups II – Yelp and Box Michael Abbott (Kleiner Perkins Caufield & Byers), Michael Stoppelman (Yelp), Siva Subramanian (Box)
1 E6/1 E7
11:50am Disrupting the Traditional Analyst Workflow with Platfora and Spark Peter Schlampp (Platfora), Ed Smith (AutoTrader)
1:45pm Conquering the (Big) Data Conundrum: Harnessing Quantity WITH Quality Nenshad Bardoliwalla (Paxata), Uday Hegde (Useready Inc.), Julia Bardmesser (Citi), O'Reilly Speaker Management (O'Reilly Media)
1 D03/1 D04
11:50am An End-to-End Approach to Offloading the Data Warehouse with Hadoop Jorge A. Lopez (Amazon Web Services)
1:45pm Using Graph to Discover Unseen Relationships in Big Data Mike Hoskins (Actian Corporation)
2:35pm Unlocking Hadoop’s Potential with YARN Sanjay Radia (Hortonworks)
1 E16/1 E17
11:00am Got the T-shirt: Real Experiences from a Hadoop Veteran Jim Scott (MapR Technologies)
11:50am Big Data Architectural Patterns Todd Papaioannou (Splunk)
2:35pm Fast Data Meets Big Data - What's your Strategy? Michael O'Connell (TIBCO Software Inc.)
4:15pm NoSQL Solutions for Big Data Problems Don Pinto (Couchbase)
5:05pm Drive Data Quality at Your Company: Create a Data Lake George Corugedo (RedPoint Global)
1 E05
1:45pm Hadoop Effortlessly: A Data Inventory is Key to Data Self-service Alex Gorelik (Waterline Data), Suresh Srinivas (Hortonworks), Mike Sutten (Kaiser Permanente), John Mount (Win-Vector LLC), Clark Farrey (Capital One), Sunil Soares (Information Asset)
10:30am Morning Break sponsored by ClearStory Data
Room: Expo Hall (1C)
3:15pm Afternoon Break sponsored by Intel
Room: Expo Hall (1C)
5:45pm Plenary
Room: Expo Hall (1C)
Expo Hall Reception
8:00pm Plenary
Room: Off Site
Data After Dark: A Taste of Manhattan
12:30pm Lunch Sponsored by MapR Technologies
Room: North Hall and Hall 1A
Lunch / Thursday Birds of a Feather
7:15pm Dinner
Room: On Your Own
6:30am Plenary
Room: Central Park
Hadoop Hustle in Central Park
7:30am Coffee Break
Room: Hall E
8:45am-8:55am (10m)
Thursday Keynote Welcome
Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Strata Program Chairs, Roger Magoulas, Doug Cutting, and Alistair Croll, welcome you to the first day of keynotes.
8:55am-9:10am (15m)
Open Standards and the Modern Data Center
Mike Olson (Cloudera)
Mike Olson, CSO and Chairman, Cloudera
9:10am-9:20am (10m) Sponsored
What Would Google Do? Understanding the Future of Big Data
M. C. Srivas (Uber)
If you want to know what's coming next in big data, just ask yourself, "what would Google do?"
9:20am-9:30am (10m)
Keynote with Miriah Meyer
Miriah Meyer (University of Utah)
Miriah Meyer, Assistant Professor of Computer Science, University of Utah
9:30am-9:35am (5m) Sponsored
Accelerating Parkinson’s Research with Big Data Technologies
Ron Kasabian (Intel)
This talk introduces how Intel is working with scientists and physicians to help improve research, treatment, and drug development for Parkinson’s Disease using data science and enabling the Parkinson's research community to build upon an open platform for big data analytics.
9:35am-9:40am (5m) Sponsored
Data & The New Era of Interactive Storytelling
Sharmila Mulligan (ClearStory Data)
Data is an evolving story. It’s not a static snapshot of a point-in-time insight. With data from internal and external sources constantly updating, we are evolving from rear-view-mirror dashboard views into an era of interactive storytelling.
9:40am-9:50am (10m)
Distributions More Interesting than Averages: Some Thoughts on Data Visualization
Amanda Cox (The New York Times)
Amanda Cox, Graphics Operator, The New York Times
9:50am-9:55am (5m) Sponsored
Spark Needs a Business Analyst Workflow
Ben Werther (Platfora)
Spark represents the next-step function leap in what is possible with Hadoop, but what does that mean for business analysts that are swimming in multi-structured data? This presentation discusses the new workflow required so that business analysts can work with massive volumes of multi-structured data to find new insights today, instead of continually having to wait for IT to make big data small.
9:55am-10:05am (10m)
Statistics Without the Agonizing Pain
John Rauser (Snapchat)
There are two essential skills for the data scientist: engineering and statistics. A great many data scientists are very strong engineers but feel like impostors when it comes to statistics. In this talk John will argue that the ability to program a computer gives you special access to the deepest and most fundamental ideas in statistics.
10:05am-10:20am (15m)
Crowdsourcing Humor: The New Yorker Caption Contest
Bob Mankoff (The New Yorker Magazine)
Bob Mankoff, The New Yorker's cartoon editor, will analyze the lessons we learn from crowdsourced humor. Along the way, he'll explore how cartoons work (and sometimes don't); how he makes decisions about what cartoons to include; and what crowds can tell us about a good joke.
11:00am-11:40am (40m) Data Science
All the Data and Still Not Enough!
Claudia Perlich (Dstillery)
There is a symbiotic relationship between predictive modeling and Big Data. Performance gets better with more data, and predictive models demonstrate like few other techniques the value of Big Data. However, there is a surprising paradox: when you need models most, even all the data is not enough or just not suitable. So in this day and age of Big Data, there remains an art to predictive modeling.
11:50am-12:30pm (40m) Data Science
The Great Debate: If You Can't Code, You Can't Be a Data Scientist
Joseph Adler (Confluent), Hilary Mason (Fast Forward Labs), Scott Nicholson (Poynt), Lucian Lita (Intuit), Roger Magoulas (O'Reilly Media)
In this debate, two teams of the world's best data scientists will debate the following proposition: "If you can't code, you can't be a data scientist."
1:45pm-2:25pm (40m) Data Science
The Day Zach Galifianakis Saved Healthcare
Chris Harland (Microsoft)
An increasingly common task for data science is the measurement and attribution of experimental impact. Using examples from healthcare.gov, Microsoft advertising, and Bing experimentation, we will explore the strengths, weaknesses, and pitfalls of techniques for dealing with impact and attribution in scenarios/data in which controlled experiments were not possible or otherwise not performed.
2:35pm-3:15pm (40m) Data Science
Computing Professional Identity for the Economic Graph
Vitaly Gordon (LinkedIn)
A talk about how the largest professional social network in the world is digitally mapping the global economy to connect talent with opportunity at massive scale.
4:15pm-4:55pm (40m) Data Science
Multi-language Data Science with IPython, IJulia, IR, and Friends
Brian Granger (Cal Poly San Luis Obispo), Fernando Perez (UC Berkeley and Lawrence Berkeley National Laboratory)
The IPython Notebook is an open-source, web-based interactive computing environment. The Notebook enables users to author documents that combine live code, descriptive text, mathematical equations, images, videos, and arbitrary HTML. This talk will describe how IPython is evolving to support a wide range of programming languages relevant in data science, including Python, Julia, and R.
5:05pm-5:45pm (40m) Data Science
Using Data Science on Internet Search Behavior as a Proxy for Human Behavior
Juan Miguel Lavista (Microsoft)
In the US alone, we make over 40 billion queries every month. From the time we wake up, using search engines is one of the top things we do online. This talk will show some examples of how this data can be used, from funny things like determining which city wakes up earlier to more complex scenarios like finding adverse drug interactions.
11:00am-11:40am (40m) Design & Interfaces
Designing with Data
Jeffrey Heer (Trifacta | University of Washington)
Interaction and visual design are exacting exercises. Designing for data -- especially in messy and massive forms -- brings a new set of challenges. How can we help people of varying backgrounds effectively transform and understand data at scale?
11:50am-12:30pm (40m) Design & Interfaces
Crowdsourcing Humor: The New Yorker Caption Contest
Bob Mankoff (The New Yorker Magazine)
Bob Mankoff, The New Yorker's cartoon editor, will analyze the lessons we learn from crowdsourced humor. Along the way, he'll explore how cartoons work (and sometimes don't); how he makes decisions about what cartoons to include; and what crowds can tell us about a good joke.
1:45pm-2:25pm (40m) Data Science
Data Science Bootcamp
Laurie Skelly (Datascope Analytics)
Data scientists wear many hats -- how do you train a ready-for-prime-time data scientist in twelve weeks? We'll share some of the choices and models we used to create the Metis Data Science Bootcamp and select its first cohort of students.
2:35pm-3:15pm (40m) Enterprise Adoption
In the Data Lake
Barry Devlin (9sight Consulting)
“Leave the over-structured, complex Data Warehouse behind. Dive into the pure, sparkling waters of the Data Lake!” I suggest you enjoy the Instagram, but beware the hidden depths. The Data Lake is a misleading metaphor; it will become a watery grave for context, governance, and value. In reality, today's intricate information ecosystem demands a careful blend of architectures and technologies.
4:15pm-4:55pm (40m) Enterprise Adoption
Unseating the Giants
Monte Zweben (Splice Machine Inc.)
There is a wave of challengers in the database world focused on the scaling costs of traditional RDBMSs. These potential giant killers have capitalized on explosive data growth and disruptive technologies like distributed computing (e.g., Hadoop and NoSQL). We’ll discuss the new breed of database buyers, the redefinition of “enterprise,” and apply lessons from past database wars.
5:05pm-5:45pm (40m) Enterprise Adoption
What’s Holding Up Your Hadoop?
Eddie Garcia (Cloudera)
Recent studies show the vast majority of Hadoop projects are stuck in development, with very few ever reaching production status. And those programs that do convert from pilot to production often view Hadoop as little more than an ETL tool. This session looks at why Hadoop implementations often stall out in the development phase and what companies can do to make Hadoop “production ready.”
11:00am-11:40am (40m) Hadoop in Action
How Goldman Sachs is Using Knowledge to Create an Information Edge
Peter Ferns (Goldman Sachs & Co)
Goldman Sachs is a leading global investment banking, securities and investment management firm that provides a wide range of financial services. Goldman executes hundreds of millions of financial transactions per day, across nearly every market in the world. Learn how Goldman is harnessing knowledge, data and compute power to maintain and increase its competitive edge.
11:50am-12:30pm (40m) Hadoop in Action
Customer Intelligence: Harnessing Elephants at Transamerica
Stephen Lloyd (Transamerica), Vishal Bamba (Transamerica), David Beaudoin (Transamerica)
Transamerica is a financial services company moving to a more customer centric model using Big Data. Our approach to this effort spans our Insurance, Annuity, and Retirement divisions. We went from a simple proof of concept to establishing Hadoop as a viable element of our enterprise data strategy. We cover core components of our solution and focus on lessons learned from our experience.
1:45pm-2:25pm (40m) Hadoop in Action
Big Data: A Journey of Innovation
Sastry Durvasula (American Express), Kevin Murray (American Express)
American Express is transforming for the digital age! Learn how we unleashed Big Data into our ecosystem and built on the strength of our core capabilities to remain relevant in a rapidly changing environment. New commerce opportunities and innovative products are being delivered, and the chance to provide actionable insights, social analysis, and predictive modeling is growing exponentially.
2:35pm-3:15pm (40m) Hadoop in Action
Transitioning from Original Big Data to the New Big Data: L.L.Bean’s Journey
Chris Wilson (L.L.Bean), Doug Bryan (RichRelevance)
The accumulation, access and analysis of customer data (“the original Big Data”) are ingrained for L.L.Bean, which has been doing customer modeling since the 1960s. In line with today’s omnichannel imperative, however, the retailer has embraced a “new Big Data”-driven culture—democratizing data access and tools—in order to sustain its customer-centric philosophy.
4:15pm-4:55pm (40m) Hadoop in Action
Unlocking Big Data at CERN
Matthias Braeger (CERN), Manish Devgan (Software AG)
CERN, home to the Large Hadron Collider (LHC), is at the forefront of science and technology. Come to this session to learn how projects at CERN are leveraging in-memory data management and Hadoop to derive real-time insights from sensor data, helping to manage the technical infrastructure of the LHC.
5:05pm-5:45pm (40m) Hadoop in Action
Big Data Modeling: How FICO is Turning DBAs and into Data Engineers
Lelanie Moll (FICO), Deb Brooks (FICO), Silaphet Mounkhaty (FICO)
FICO has been delivering analytic solutions, such as their renowned credit scores, for nearly 60 years. Big data technologies like Hadoop promise FICO analysts the ability to build models much faster, and with greater accuracy than before, but this new generation of tools challenges them to think differently.
11:00am-11:40am (40m) Hadoop Platform
From Raw Data to Analytics with No ETL
Marcel Kornacker (Cloudera), Lenni Kuff (Cloudera)
Find out how to run real-time analytics over raw data without requiring a manual ETL process targeted at an RDBMS. This talk describes Impala’s approach to on-the-fly data transformation and its support for nested data; examples demonstrate how this can be used to query raw data feeds in formats such as text, JSON and XML, at a performance level commonly associated with specialized engines.
11:50am-12:30pm (40m) Hadoop Platform
SQL on Everything, in Memory
Julian Hyde (Hortonworks)
Hyde shows how to quickly build a SQL interface to a NoSQL system using Optiq. He shows how to add rules and operators to Optiq to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
1:45pm-2:25pm (40m) Hadoop Platform
From Oracle to Hadoop
Guy Harrison (Dell Software), David Robson (Dell Software), Kathleen Ting (Cloudera)
When people think of big data processing, they think of Apache Hadoop, but that doesn't mean traditional databases don't play a role. In most cases users will still draw from data stored in RDBMS systems. Apache Sqoop can be used to unlock that data and transfer it to Hadoop, enabling users with information stored in existing SQL tables to use new analytic tools.
2:35pm-3:15pm (40m) Hadoop Platform
Hive on Apache Tez: Benchmarked at Yahoo! Scale
Mithun Radhakrishnan (Yahoo! Inc.)
The past year has seen the advent of various "low latency" solutions for querying big data such as Shark, Impala, and Presto. The Hive team at Yahoo has spent the past several months benchmarking several versions of Hive (and Tez), with several permutations of file-formats, compression, and query engine features, at various data sizes. In this talk, we present our tests, the results, and findings.
4:15pm-4:55pm (40m) Hadoop Platform
Scaling Storm: Cluster Sizing and Performance Optimization
P. Taylor Goetz (Hortonworks)
We will discuss the basics of scaling, common mistakes and misconceptions, how different technology decisions affect performance, and how to identify and scale around the bottlenecks in a Storm deployment.
5:05pm-5:45pm (40m) Hadoop Platform
Building Real-time Data Products at LinkedIn with Apache Samza
Martin Kleppmann (University of Cambridge)
Apache Samza is a framework for processing high-volume real-time event streams. In this session we will walk through our experiences of putting Samza into production at LinkedIn, discuss how it compares to other stream processing tools, and share the lessons we learnt about dealing with real-time data at scale.
11:00am-11:40am (40m) Hadoop & Beyond
Three Approaches to Scalable Data Curation
Michael Stonebraker (Tamr, Inc.)
The explosion of internal data sources, external public data sources and feeds from the Internet of Things is causing a tsunami of diverse data sources for enterprises. Top-down data-integration tools and data scientist tools won’t scale to meet the demands of the modern enterprise. Learn how a scalable data curation platform can help enterprises connect and enrich their data to leverage it all.
11:50am-12:30pm (40m) Hadoop & Beyond
Advantages of a Domain-Specific Language Approach to Data Transformation
Joe Hellerstein (UC Berkeley), Sean Kandel (Trifacta)
Data transformation — traditionally the domain of IT specialists — is emerging as a critical, widespread problem in data analytics. In this session we discuss the advantages of using a domain-specific language for data transformation tasks. We illustrate these issues with Wrangle, a DSL designed for interactive data transformation.
1:45pm-2:25pm (40m) Hadoop & Beyond
Stories from the Trenches: The Challenges of Building an Analytics Stack
Fangjin Yang (Imply), Xavier Léauté (Confluent)
Organizations often showcase the virtues of their data platforms, but rarely share the challenges and decisions faced along the way. Our session describes how we architected our analytics stack around Druid, an open source distributed data store, and how we overcame the challenges around scaling the system, balancing features with cost, and making performance consistent.
2:35pm-3:15pm (40m) Hadoop & Beyond
Lessons from Fast Analytics and creating Scuba
Lior Abraham (Interana Inc)
Leveraging our experience from working on some of the largest-scale high-growth applications at Facebook and other companies, including building the most popular data analysis tool Scuba, this talk outlines 10 lessons learned, along with best practices towards extracting the most value out of data, while avoiding common pitfalls.
4:15pm-4:55pm (40m) Hadoop & Beyond
Tachyon: A Memory Centric Storage System for Big Data Computing
Haoyuan Li (Alluxio)
An introduction to Tachyon, a memory-centric storage system started at UC Berkeley. It enables different frameworks to share data at memory speed. It is also a major component of the Berkeley Data Analytics Stack (BDAS). The project is open source and is deployed at multiple companies. It has more than 30 contributors from over 10 institutions, including Yahoo, Intel, Red Hat, and Alibaba.
5:05pm-5:45pm (40m) Hadoop & Beyond
A Gentle Introduction to Apache Spark and Clustering for Anomaly Detection
Sean Owen (Cloudera)
Apache Spark is a popular new paradigm for computation on Hadoop. It's particularly effective for iterative algorithms relevant to data science like clustering, which can be used to detect anomalies in data. Curious? Get a taste of Spark MLlib, Scala and k-means clustering in this walkthrough of anomaly detection as applied to network intrusion, using the KDD Cup '99 data set.
11:00am-11:40am (40m) Law, Ethics & Open Data
Better Accountability Through Open Data
Merici Vinton (OI Engine @ IDEO), Micheál Keane (Civis Analytics)
An open data in government love story / case study - how a team of techies overcame political and procedural hurdles to change the financial marketplace.
11:50am-12:30pm (40m) Law, Ethics & Open Data
Wonk, Meet Geek
Jim Adler (Metanautix)
Bad press, FTC consent decrees, and White House reports have all put a spotlight on bad data practices. Data scientists and designers have become increasingly aware of how privacy principles should guide their work. So, the geeks have met the wonks. Now, it’s time for the wonks to meet the geeks and use data analytics to keep pace with burgeoning data volumes, velocities, and innovations.
1:45pm-2:25pm (40m) Law, Ethics & Open Data
You Have Zero Privacy, You Own Your Data, and Other Myths
Gilad Rosner (Internet of Things Privacy Forum)
While the inexorable march of technology does threaten historical notions of privacy, privacy IS very much alive – a shifting, vital conversation society has with itself and its machines. This talk explores the principles of transparency, unlinkability, and intervenability to build a foundation for a design ethos for technologists.
2:35pm-3:15pm (40m) Law, Ethics & Open Data
Homelessness Prevention by the Numbers
Stefan Heeke (SumAll.org), Adeen Flinker (SumAll.org)
The story of using predictive analytics for homelessness prevention in New York City. SumAll.org is currently piloting this approach with the city’s department of homeless services. Predicting at-risk families in a timely manner and micro-targeting social services is a game-changer. SumAll.org is a data analytics nonprofit, dedicated to leveraging the power of data for social innovation.
4:15pm-4:55pm (40m) Law, Ethics & Open Data
Why Big Data Needs Thick Data
Tricia Wang (Constellate Data), Matt LeMay (Constellate Data)
This session examines the risks of over-reliance on big data and the need to bring in Thick Data—qualitative methods used by ethnographers.
5:05pm-5:45pm (40m) Business & Industry
The Open Data 500: Building Businesses on Free Government Data
Joel Gurin (Center for Open Data Enterprise), Laura Manley (The GovLab at NYU)
Open government data on healthcare, finance, education, energy, and other areas has become a major business resource. Joel Gurin, author of Open Data Now and director of the Open Data 500 study, will show how both startups and established companies are putting open data to work. He'll cover Open Data and Big Data, business models for open-data companies, and lessons from a range of case studies.
11:00am-11:40am (40m) Connected World
Generating Possible A/B Tests for Uber Via a City Simulation Framework
Bradley Voytek (UC San Diego and Uber, Inc.)
Uber has created an AI city simulation framework to optimize its dispatching system, minimize user wait times, and maximize driver partner earnings. Based on agent-based and swarm intelligence models, this framework generates plausible optimizations across many interacting, dynamic, non-linear parameters on a city-by-city basis.
11:50am-12:30pm (40m) Connected World
Rats and Garbage Cans: The Dirty Data that Makes Cities Cleaner
Brett Goldstein (University of Chicago)
How far can we take open data--and where can it take us? Brett Goldstein, who helped pioneer Chicago’s cutting-edge efforts in open data and analytics as CIO and CDO, will speak on how these act as a force multiplier on government efforts and can lead to smarter and more inclusive policy-making, while enhancing the government’s ability to anticipate and react to the needs of the public.
1:45pm-2:25pm (40m) Connected World
The State GeoSpatial BigData
Mansour Raad (ESRI)
GeoSpatial BigData and types are special "animals" when it comes to storage, discovery and processing. This session will explore the various non-traditional ways to stream, extract, batch and visualize GeoSpatial Information for deeper geo-insight, such as "Where are the 3 nearest facilities to each of my customers based on current traffic conditions...nationwide?"
2:35pm-3:15pm (40m) Connected World
Architecting World's Largest Biometric Identity System - Aadhaar Experience
Pramod Varma (UIDAI)
Aadhaar, India's Unique Identity Project, is the largest biometric identity system in the world with more than 600 million people. Its strength lies in its design simplicity, sound strategy, and technology backbone issuing 1 million identity numbers and doing 600 trillion biometric matches every day! Pramod Varma, who is the Chief Architect of Aadhaar, shares his experience from this project.
4:15pm-4:55pm (40m) Connected World
Pairing EMR Data with an Open Commons to Engage Communities, Provide Work Force Development and Predict Community Health Futures
Brigitte Piniewski (nonaffiliated)
This session will help data scientists support healthcare leaders in harmonizing health data with Open Source community data commons approaches. This enhances the value of mandated EMR adoption beyond Meaningful Use requirements by creating evidence-based community health intelligence at the pace and point of change: the everyday lives and activities of community members.
5:05pm-5:45pm (40m) Business & Industry
Decided by Data: Case Studies from a Data Driven Product Culture
Nellwyn Thomas (Etsy)
At Etsy, we run dozens of experiments simultaneously and we have terabytes of data generated by the tens of millions of members of our community. We've worked hard to establish a product development process informed by -- and often driven by -- data. In this talk, Nell will discuss the tensions that arise in a data-driven product culture.
11:00am-11:40am (40m) Business & Industry
Solving the Right Problem
Max Shron (Warby Parker), Sasha Laundy (Warby Parker)
Business problems don’t reveal themselves neatly as data problems. The data community is obsessed with tools and techniques, but the real challenge is understanding how to solve problems with data. How do we bridge the gap? In this talk, we will teach you a methodology for figuring out the right problems to solve and making sure that the work stays smart.
11:50am-12:30pm (40m) Business & Industry
Transforming to a Data Driven Operations Model
Denise Asplund (Cisco Systems, Inc)
This talk highlights William's success, challenges, and experiences bringing a data-driven operations model into Cisco’s engineering services organization. William highlights the role of data, the need for scale and security, the opportunity for new technology to accelerate business, the role of IT to help guide/partner, and the mind shift and cultural changes along the journey.
1:45pm-2:25pm (40m) Business & Industry
From Experiments to Insights at Pinterest
Andrea Burbank (Pinterest)
Over two years of running A/B testing at Pinterest on millions of users each day, Andrea learned about the nuances that can make or break an experimentation platform. Andrea will discuss how her approach to testing has adjusted over time to avoid critical errors at all levels, from organizational to analytical.
2:35pm-3:15pm (40m) Business & Industry
Case Study: A Forensic Look at Success and Failure of Predictive Analytics in Healthcare
Eugene Kolker (Seattle Children's)
This discussion touches on the human response to analysis results, especially when they do not support long-held beliefs, and how this affects organizational change. This discussion also focuses on Predictive Analytics best practices, team skills, and a review of what it takes to build a sustainable Predictive Analytics program.
4:15pm-4:55pm (40m) Business & Industry
Unboxing Data Startups – Gilt Groupe, Airbnb, and Lookout
Michael Abbott (Kleiner Perkins Caufield & Byers), Will Moss (Airbnb), Geoff Guerdat (Gilt Groupe), Emil Ong (Lookout)
In this session, Kleiner Perkins Caufield & Byers General Partner Michael Abbott speaks with Geoff Guerdat of the Gilt Groupe, Will Moss of Airbnb, and Emil Ong of Lookout, to unbox their respective companies and examine the technology, architecture, and innovations they’ve harnessed to deliver superior products and services.
5:05pm-5:45pm (40m) Business & Industry
Unboxing Data Startups II – Yelp and Box
Michael Abbott (Kleiner Perkins Caufield & Byers), Michael Stoppelman (Yelp), Siva Subramanian (Box)
In this session, Kleiner Perkins Caufield & Byers General Partner Michael Abbott speaks with Michael Stoppelman of Yelp and Siva Subramanian of Box to unbox their respective companies and examine the technology, architecture, and innovations they’ve harnessed to deliver superior products and services.
11:00am-11:40am (40m) Sponsored
Building an E2E Data Analytics Architecture for IOT
Vin Sharma (Intel)
This session will outline Intel’s vision of an E2E Data Analytics Architecture for IoT as well as how we are enabling companies to elevate and transform the way they interact with their customers.
11:50am-12:30pm (40m) Sponsored
Disrupting the Traditional Analyst Workflow with Platfora and Spark
Peter Schlampp (Platfora), Ed Smith (AutoTrader)
Up to 90% of your data is coming in new forms, in greater size, and at increasing speed. This multi-structured data requires a new workflow, putting the power of Hadoop and Spark into the hands of business analysts. In this session, we will share how Fortune 500 analysts have transformed their workflow by gaining insights into their business that were never before possible.
1:45pm-2:25pm (40m) Sponsored
Conquering the (Big) Data Conundrum: Harnessing Quantity WITH Quality
Nenshad Bardoliwalla (Paxata), Uday Hegde (Useready Inc.), Julia Bardmesser (Citi), O'Reilly Speaker Management (O'Reilly Media)
Today’s unstructured data is raw and complex, but everyone agrees it can provide context and hidden insights when it is easily accessed during the business intelligence lifecycle. . .
2:35pm-3:15pm (40m) Sponsored
Building Real-Time Platforms with MemSQL and Apache Spark
Eric Frenkiel (MemSQL)
This session will cover how MemSQL’s hybrid transactional and analytic data processing capabilities and Apache Spark integration enable businesses to build real-time platforms for applications like operational analytics, position monitoring, and anomaly detection.
4:15pm-4:55pm (40m) Sponsored
Real-time streaming and analytics with Amazon Elastic MapReduce and Amazon Kinesis
Steve McPherson (Amazon Web Services)
Learn how you can architect Amazon Kinesis and Amazon Elastic MapReduce together to create a highly scalable real-time analytics solution which can ingest and process terabytes of data per hour from hundreds of thousands of different concurrent sources.
5:05pm-5:45pm (40m) Hadoop & Beyond
Spark Camp / Additional Work Space
Additional, informal work session with the Spark Team.
11:00am-11:40am (40m) Sponsored
See the Fastest Spark-Powered Disparate Data Blending & Analysis Solution
Vaibhav Nivargi (ClearStory Data)
In this session, you will learn why it’s powered by Spark, hear key business use cases from customers across various industries using it and gain understanding of the five fundamentals of speeding disparate data analysis.
11:50am-12:30pm (40m) Sponsored
An End-to-End Approach to Offloading the Data Warehouse with Hadoop
Jorge A. Lopez (Amazon Web Services)
Shifting workloads from the enterprise data warehouse (EDW) to Hadoop reduces costs, enables you to keep that data longer, and frees up EDW capacity for fast analytics. Check out our live demo and learn a proven framework for offloading workloads from the EDW to Hadoop: Identify & prioritize what to offload; Shift workloads to Hadoop; Optimize & secure your environment; and Visualize new insights.
1:45pm-2:25pm (40m) Sponsored
Using Graph to Discover Unseen Relationships in Big Data
Mike Hoskins (Actian Corporation)
Big Data and Analytics is still a young space, but new methods are on the way. Prominent among them is graph analytics. Actian will show radical and innovative graph analytic capabilities from its investment in SPARQL City. SPARQL City, founded by database legend Barry Zane, and Actian are committed to delivering the industry’s highest performing in-memory graph analysis engine.
2:35pm-3:15pm (40m) Sponsored
Unlocking Hadoop’s Potential with YARN
Sanjay Radia (Hortonworks)
In this talk Arun Murthy will share the very latest innovation from the community aimed at accelerating the interactive and realtime capabilities of enterprise Hadoop.
4:15pm-4:55pm (40m) Sponsored
Big Data SQL and Query Franchising: An Architecture for SQL Beyond Hadoop
Dan McClary (Oracle)
SQL is the natural language for querying data, but data lives in many places. We discuss the importance of SQL not only on Hadoop, but also on relational databases and NoSQL stores. Additionally, we dive deep into the architecture of Big Data SQL, which can access all of these sources in a single query.
5:05pm-5:45pm (40m) Sponsored
Important Advances in Hadoop: A Panel Discussion
Joey Jablonski (Dell)
Join us for a panel discussion that includes customers, industry experts and partners who are ready to explore the latest advances in Hadoop, from affordability and appliances, to Apache Spark, simplification and security.
11:00am-11:40am (40m) Sponsored
Got the T-shirt: Real Experiences from a Hadoop Veteran
Jim Scott (MapR Technologies)
Learn the critical success factors for organizational success with Hadoop, and how to build the right team and skill sets for high-performance Hadoop success, from a veteran of three successful Hadoop projects.
11:50am-12:30pm (40m) Sponsored
Big Data Architectural Patterns
Todd Papaioannou (Splunk)
In this session you will hear from big data experts with real world experience on the architectural patterns and platform integrations used to solve real business problems with data.
1:45pm-2:25pm (40m) Sponsored
Global Hadoop: Storage and Compute Challenges in Multi-Data Center Deployments
Jagane Sundar (WANdisco)
This session will examine the distribution and storage of data in HDFS across multiple datacenters in a single coordinated, Paxos-based file system over a WAN. Efficient use of compute resources in a globally distributed HDFS cluster is also discussed.
2:35pm-3:15pm (40m) Sponsored
Fast Data Meets Big Data - What's your Strategy?
Michael O'Connell (TIBCO Software Inc.)
Join TIBCO Software, an industry leader in infrastructure and analytics software, for a thought leadership discussion to learn how your organization can redefine its data strategy. Transition from a company of Big Data to Fast Data and convert your customers into fans while achieving a competitive advantage. 
4:15pm-4:55pm (40m) Sponsored
NoSQL Solutions for Big Data Problems
Don Pinto (Couchbase)
This session provides a brief overview of Couchbase Server, a document database, and its underlying distributed architecture.
5:05pm-5:45pm (40m) Sponsored
Drive Data Quality at Your Company: Create a Data Lake
George Corugedo (RedPoint Global)
Deriving value from data depends on how well companies capture and manage that data. Learn how to create a centralized processing pool where data can be captured, cleansed, linked and structured in a consistent way. Use the scalability and flexibility of Hadoop to create a powerful processing and refinement engine to drive usable information across enterprise databases and data marts.
1:45pm-2:25pm (40m) Sponsored
Hadoop Effortlessly: A Data Inventory is Key to Data Self-service
Alex Gorelik (Waterline Data), Suresh Srinivas (Hortonworks), Mike Sutten (Kaiser Permanente), John Mount (Win-Vector LLC), Clark Farrey (Capital One), Sunil Soares (Information Asset)
Companies are deploying Hadoop “data lakes” to provide unprecedented access to data for data science and analytics. However, the advantages of frictionless ingest, flexible schema on read, and lack of data governance turn into increasingly insurmountable challenges to enabling true data self-service, and create a barrier to the enterprise adoption of Hadoop.
10:30am-11:00am (30m)
Break: Morning Break sponsored by ClearStory Data
3:15pm-4:15pm (1h)
Break: Afternoon Break sponsored by Intel
5:45pm-7:15pm (1h 30m) Events
Expo Hall Reception
Join your fellow big data enthusiasts at the Strata Conference & Hadoop World Expo Hall Reception on Thursday, October 16.
8:00pm-11:00pm (3h) Events
Data After Dark: A Taste of Manhattan
Come join us for an eclectic taste of Hell’s Kitchen cuisine and entertainment. Mix and mingle with fellow attendees at six distinctly different places within a few blocks of each other, including a piano bar, swing dancing, Memphis BBQ, Cajun Creole, Southeast Asian, and rock & roll lounge.
12:30pm-1:45pm (1h 15m) Events
Lunch / Thursday Birds of a Feather
Birds of a Feather (BoF) discussions are a great way to informally network with people in similar industries or interested in the same topics. NOTE: BoFs are happening during lunch, which is not accessible to Expo Plus and Expo Only pass holders.
7:15pm-8:00pm (45m)
Break: Dinner
6:30am-7:30am (1h) Events
Hadoop Hustle in Central Park
Cloudera invites you to join our 1st annual Hadoop Hustle during Strata + Hadoop World 2014. This event is part of NYC Data Week.
7:30am-8:45am (1h 15m)
Break: Coffee Break