Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY
 
1 E8 / 1 E9
11:20am Scaling Python analytics on Impala Wes McKinney (Two Sigma Investments)
1:15pm Mapping big data: A data driven market report Russell Jurney (Data Syndrome)
2:05pm Preserving signal in customer journeys Joy Thomas (Apigee), Jagdish Chand (Apigee)
2:25pm Advanced data science with Spark Streaming Albert Bifet (Télécom ParisTech), Silviu Maniu (Huawei)
2:55pm Data Science in the Wall Street Journal Juan Huerta (Dow Jones)
4:35pm Data modeling for data science: Simplify your workload with complex types Marcel Kornacker (Cloudera), Josh Wills (Cloudera), Alexander Behm (Cloudera)
5:25pm Running experiments with logged-out users: Solving the mixed group problem Raphael Lee (Airbnb), Victor Vazquez (Airbnb)
1 E10 / 1 E11
11:20am Goldman Sachs data lake Billy Newport (Goldman Sachs)
2:05pm How a global entertainment company successfully built a data lake for continued digital dominance Joe Caserta (Caserta Concepts), Elliott Cordo (Caserta Concepts, LLC)
2:55pm Where’s the puck headed? Michael Dauber (Amplify Partners), Shivon Zilis (Bloomberg Beta), Matthew Ocko (Data Collective), Roger Chen (Computable Labs), Jerry Chen (Greylock)
5:25pm Unboxing data startups Michael Abbott (Stanford University), Jooseong Kim (Pinterest), Sven Junkergård (Zephyr Health), Calvin French-Owen (Segment), Peter Reinhardt (Segment), Andrew First (Lean Plum), Shiva Shivakumar (Urban Engines)
1 E12 / 1 E13
1:15pm The data-driven future of biotechnology Aaron Kimball (Zymergen, Inc.)
2:55pm Continuous curation of event data for a customer event hub Arvind Prabhakar (StreamSets)
4:35pm A hierarchical data warehouse in Hadoop Amar Arsikere (infoworks.io)
1 E16 / 1 E17
11:20am The Jedi Masters Guide to Wrangling JSON Greg Rahn (Cloudera)
1:15pm Simplifying Hadoop: RecordService, a secure and unified data access path for compute frameworks Lenni Kuff (Facebook), Nong Li (Cloudera), Stephen Romanoff (Capital One)
2:55pm Native erasure coding support inside HDFS Zhe Zhang (LinkedIn), Weihua Jiang (Intel)
5:25pm OLTP on Hadoop: Reviewing the first Hadoop-based TPC-C benchmarks Monte Zweben (Splice Machine Inc.), John Leach (Splice Machine)
1 E18 / 1 E19
11:20am Big data at a crossroads: Time to go meta (on use) Joe Hellerstein (UC Berkeley)
4:35pm Data liberation and data integration with Kafka Martin Kleppmann (University of Cambridge)
5:25pm Real-time analytics with Solr Yonik Seeley (Cloudera)
1 E20 / 1 E21
11:20am What's coming for the Spark community Patrick Wendell (Databricks)
1:15pm Supercharging R with Spark for end-to-end data science Hossein Falaki (Databricks Inc.)
2:05pm Next-generation genomics analysis with Apache Spark Timothy Danford (Tamr, Inc.)
2:55pm Lifelogging for insights Håkan Jonsson (Sony Mobile Communications)
4:35pm Effective testing of Spark programs and jobs Holden Karau (Google)
5:25pm Estimating financial risk with Apache Spark Sandy Ryza (Clover Health)
3D 02/11
11:20am When it absolutely, positively, has to be there: Reliability guarantees in Kafka Gwen Shapira (Confluent), Jeff Holoman (Cloudera)
1:15pm What does your smart device know about you? Charles Givre (Deutsche Bank)
2:05pm Twitter Heron: Stream processing at scale Karthik Ramasamy (Streamlio)
2:55pm Streaming in the extreme Jim Scott (MapR Technologies)
4:35pm IoT with Spark Streaming: Practical lessons from real-world use cases Hari Shreedharan (Cloudera), Anand Iyer (Cloudera)
3D 03/10
11:20am Value in the details - understanding data through visual exploration Richard Brath (Uncharted Software), Rob Harper (Uncharted)
1:15pm Data inclusion for all Alex Kelly (General Motors), Kim Le (General Motors)
2:05pm Visualising Music Services Alan Hannaway (7digital)
2:55pm Knowledge and the geospatial mixing pot Andrew Hill (Textile)
4:35pm Music science: How data is changing what we listen to Sean Power (Watching Websites), Joy Johnson (AudioCommon), Mike Rosenthal (Mick Management), Rishi Malhotra (Saavn)
3D 04/09
11:20am Personal information out of context: Building a consumer subject review board Evan Selinger (Rochester Institute of Technology), Jules Polonetsky (Future of Privacy Forum)
1:15pm Protecting the humanity in data I: Ethics of algorithms/ethics of data activism/targeting services without excluding the needy Jake Porway (DataKind), Cathy O'Neil (Weapons of Math Destruction), Vladimir Dubovskiy (DonorsChoose.org), Kamalesh Rao (DataKind)
2:05pm Protecting the humanity in data II: Personalized crisis counseling/messiness of interpretation Jake Porway (DataKind), Bob Filbin (Crisis Text Line), danah boyd (Microsoft Research | Data & Society)
2:55pm How we amplify privilege with supervised machine learning Mike Lee Williams (Cloudera Fast Forward Labs)
4:35pm Fixing Chicago’s crime data Jay Margalus (MapR), Mike Emerick (MapR)
5:25pm Ethical big data - what's legal and what's right Steven Totman (Cloudera), Sam Heywood (Cloudera), Nick Curcuru (Mastercard)
3D 05/08
11:20am Hadoop in the cloud: An architectural how-to Jairam Ranganathan (Cloudera)
1:15pm Multi-tenant, multi-cluster, and multi-container Apache HBase deployment Jonathan Hsieh (Cloudera, Inc), Dima Spivak (StreamSets)
2:05pm The glue: Building the connectors and tools to manage big data warehouses Siwei Zhu (Scribd), Kevin Perko (Scribd)
2:55pm Failing fast and falling often is no way to run a cluster! Michael Segel (Segel & Associates.)
5:25pm Real-world NoSQL schema design Ted Dunning (MapR)
Hall B
2:05pm Data and Ethics DJ Patil (White House Office of Science and Technology Policy)
2:55pm Data and Ethics II DJ Patil (White House Office of Science and Technology Policy)
3D 06/07
11:20am Putting Modern BI to Work: Innovative Use Cases Ali Tore (ClearStory Data)
1:15pm Expand your mind to fit the big data Data Center: the scale and cost of information management architectures Robert Eve (Cisco), Robert Novak (Cisco), Nenshad Bardoliwalla (Paxata)
2:55pm Design patterns for real-time data analytics Sheetal Dolas (Hortonworks)
1 E6 / 1 E7
2:05pm End User Panel on Real-Time Data Analytics Eric Frenkiel (MemSQL), Noah Zucker (Novus Partners), Ian Hansen (Digital Ocean), Michael DePrizio (Akamai Technologies)
4:35pm How Pepsi wrangles the diverse data of consumer packaged goods Matthew Derda (Pepsi), Douglas Stradley (Trifacta)
1 E14
11:20am Big data analytics in the cloud Matt Winkler (Microsoft)
1:15pm Real data, real implementations: What actual customers are doing Andrew Brust (Datameer), Jeff Jarrell (American Airlines), Ryan Wright (Kelley Blue Book), Kendell Timmers
2:05pm Delivering trusted data for analyst autonomy and operational agility with a unified big data fabric Vishal Bamba (Transamerica), Murthy Mathiprakasam (Informatica)
4:35pm Enter the snake pit for fast and easy Spark and Cassandra Jon Haddad (The Last Pickle)
5:25pm Think like a data scientist: Build your big data blueprint Bill Schmarzo (EMC Consulting)
1 E15
1:15pm The forces that will disrupt big data Anthony Dina (Dell)
2:05pm How Riot Games uses Platfora to improve League of Legends' performance Peter Schlampp (Platfora), Chris Kudelka (Riot Games)
2:55pm Hydrate a data lake in days with CDAP Jonathan Gray (Cask)
4:35pm Catalog, secure, and govern your Hadoop data lake Alex Gorelik (Waterline Data), Jim Kaskade (Janrain), David Tabacco (Merck & Co., Inc.), David Paige (Cox Automotive)
5:25pm Fast fish eat slow fish: How to move faster Samuel Cozannet (Canonical)
3D 01/12
9:00am Spark Development Bootcamp (Day 2) Laurent Weichberger (OmPoint Innovations, LLC)
1B 03
9:00am Practical data science on Hadoop (Day 2) Brandon MacKenzie (IBM), John Rollins (IBM), Jacques Roy (IBM), Chris Fregly (PipelineAI), Mokhtar Kandil (IBM)
1B 04
7:30am (Coffee Break - 7:00am - 8:45am)
Room: Javits North
8:45am Plenary
Room: Javits North
Wednesday keynote welcome Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
8:50am Plenary
Room: Javits North
The next generation Mike Olson (Cloudera)
9:05am Plenary
Room: Javits North
Playing with, and for, data AnnMarie Thomas (School of Engineering and Schulze School of Entrepreneurship, University of St. Thomas)
9:15am Plenary
Room: Javits North
What 0-50 million users in 7 days can teach us about big data Joseph Sirosh (Microsoft)
9:25am Plenary
Room: Javits North
Improving Medical Decision Making with Predictive Analytics on Big Data Ron Kasabian (Intel), Michael Draugelis (Penn Medicine)
9:30am Plenary
Room: Javits North
The race to modernize BI: What it is and why so urgent? Tim Howes (ClearStory Data)
9:35am Plenary
Room: Javits North
Unleashing the power of big data today Jim McHugh (Cisco)
9:40am Plenary
Room: Javits North
A Transition to Interactive Music Consumption + Data Joy Johnson (AudioCommon)
9:50am Plenary
Room: Javits North
Data vs creativity: The last battleground? David Boyle (MasterClass)
10:00am Plenary
Room: Javits North
On reflection: What the White House needs from you DJ Patil (White House Office of Science and Technology Policy)
10:10am Plenary
Room: Javits North
Improving decisions Katherine Milkman (Wharton School at the University of Pennsylvania)
10:25am Plenary
Room: Javits North
O'Reilly Announcements Ben Lorica (O'Reilly Media)
10:30am Plenary
Room: Javits North
Context Computing Jeff Jonas (IBM)
10:50am Morning Break sponsored by ClearStory Data
Room: 3E
3:35pm Afternoon Break sponsored by Bloomberg
Room: 3E
6:05pm Plenary
Room: 3E
Booth Crawl
12:00pm Lunch sponsored by Microsoft
Room: 3A & 3B
Lunch / Wednesday BoF Tables
6:30am Plenary
Room: Hudson River Park
Data Dash
8:00pm Plenary
Room: The High Line/Meatpacking District
Data After Dark: High Line Hop
7:05pm Dinner
Room: On Your Own
11:20am-12:00pm (40m) Data Science & Advanced Analytics
Scaling Python analytics on Impala
Wes McKinney (Two Sigma Investments)
Many data science and data analytics applications are written in Python or R, but developing and deploying these applications at scale or in production is a pain point for many users. We will discuss our new efforts to bridge the gap between familiar in-memory data tools and distributed data management systems using Python and Impala.
1:15pm-1:35pm (20m) Data Science & Advanced Analytics
Mapping big data: A data driven market report
Russell Jurney (Data Syndrome)
The talk covers the development of the O'Reilly Media Report, "Mapping big data: A data driven market report."
1:35pm-1:55pm (20m) Data Science & Advanced Analytics
Queering quant: How having all the data isn’t enough to represent a complex social phenomena
Lauralea Banks Edwards (Washington State University)
This presentation identifies some of the areas in data creation and analytics where we perpetuate the simplistic representation of the world. It uses queer theory to demonstrate alternative ways of creating and analyzing data to take non-normative cases into consideration.
2:05pm-2:25pm (20m) Data Science & Advanced Analytics
Preserving signal in customer journeys
Joy Thomas (Apigee), Jagdish Chand (Apigee)
Customer journey analytics systems of large corporations must handle a great volume of events on a daily basis. A priori aggregation used by early systems often caused signal loss due to ever-changing customer activity rates. We will present a new method that identifies paths inherent in raw cross-channel data, and that captures traffic patterns via nodes of interest across all channels of data.
2:25pm-2:45pm (20m) Data Science & Advanced Analytics
Advanced data science with Spark Streaming
Albert Bifet (Télécom ParisTech), Silviu Maniu (Huawei)
Real-time analytics are becoming increasingly important due to the large amount of data that is being created continuously. Drawing from our experiences at Huawei Noah's Ark Lab, we present StreamDM, a new open source data mining and machine learning library designed on top of Spark Streaming. We will show its advanced methods, and how easily it can be used and extended.
2:55pm-3:35pm (40m) Data Science & Advanced Analytics
Data Science in the Wall Street Journal
Juan Huerta (Dow Jones)
In this presentation I will describe the way in which Data Science is helping the Wall Street Journal produce better journalism strategies, personalize our subscribers’ experience, and optimize revenue and overall customer engagement.
4:35pm-5:15pm (40m) Data Science & Advanced Analytics
Data modeling for data science: Simplify your workload with complex types
Marcel Kornacker (Cloudera), Josh Wills (Cloudera), Alexander Behm (Cloudera)
In this talk, we will explain how data scientists use nested data structures to increase analytic productivity. We will use two well-known relational schemas - TPC-H and Twitter - to demonstrate how to simplify data science workloads with nested schemas. Also, we will outline best practices for converting flat relational schemas into nested ones, and give examples of data science-style analysis.
5:25pm-6:05pm (40m) Data Science & Advanced Analytics
Running experiments with logged-out users: Solving the mixed group problem
Raphael Lee (Airbnb), Victor Vazquez (Airbnb)
More users than ever are accessing web applications from multiple devices. When logged-out users receive mixed experiment treatments, weird and wacky results can start appearing in your experiment analyses. Find out what we've learned about this problem at Airbnb and how our data scientists and engineers teamed up to solve it.
11:20am-12:00pm (40m) Data-driven Business
Goldman Sachs data lake
Billy Newport (Goldman Sachs)
The combination of data, technology, and analytics creates previously impossible business intelligence opportunities. How well companies can capture and manage their data so that it can be easily and consistently queried will be a key differentiator in deriving commercial value from data. Learn how Goldman is developing an enterprise platform to unify and manage data across the firm.
1:15pm-1:55pm (40m) Data-driven Business
Death of the click: How big data is killing your favorite metrics
Claudia Perlich (Dstillery)
This talk takes a provocative stand: many metrics we cherish lose their value because the granularity of modern data collection enables us to identify and optimize toward hidden signals that used to be noise, and now come to the forefront. One such metric is the click-through rate in advertising, but the mechanism is ubiquitous and we should pay close attention to the mechanism at work.
2:05pm-2:45pm (40m) Data-driven Business
How a global entertainment company successfully built a data lake for continued digital dominance
Joe Caserta (Caserta Concepts), Elliott Cordo (Caserta Concepts, LLC)
A global record company and a force in the music business partnered with award-winning data innovation consulting firm Caserta Concepts to re-architect its core data platform, with a data framework based on AWS, EMR, Redshift, and other big data technologies. This session presents the architecture, technologies, and techniques used to achieve an agile data ingestion and analytics platform.
2:55pm-3:35pm (40m) Data-driven Business
Where’s the puck headed?
Michael Dauber (Amplify Partners), Shivon Zilis (Bloomberg Beta), Matthew Ocko (Data Collective), Roger Chen (Computable Labs), Jerry Chen (Greylock)
To anticipate who will succeed and invest wisely, investors spend a lot of time trying to understand the longer-term trends within an industry. In this panel discussion, we’ll consider the big trends in big data, asking top-tier VCs to look over the horizon and discuss the visions they have two or more years in the future.
4:35pm-5:15pm (40m) Data-driven Business
The science behind #TheDress: Measuring virality at BuzzFeed
Adam Kelleher (Buzzfeed)
At BuzzFeed, a technology and media company, the question of “virality of content via sharing” dominates. Now, for the first time since the company was founded in 2006, data scientists can identify how pieces of content spread across multiple social networks. In this talk, we take a close look at the way BuzzFeed defines and analyzes the virality of content.
5:25pm-6:05pm (40m) Data-driven Business
Unboxing data startups
Michael Abbott (Stanford University), Jooseong Kim (Pinterest), Sven Junkergård (Zephyr Health), Calvin French-Owen (Segment), Peter Reinhardt (Segment), Andrew First (Lean Plum), Shiva Shivakumar (Urban Engines)
Most people are familiar with the basic principles driving today’s hottest big data and enterprise companies. But what’s really going on under the hood? In this session, Kleiner Perkins Caufield & Byers General Partner Michael Abbott unboxes a variety of startups in the space to examine the technology, architecture, and innovations they’ve harnessed to deliver superior products and services.
11:20am-12:00pm (40m) Hadoop Use Cases
Transitioning from reactive to proactive: Etsy's data platform team
Melissa Santos (Big Cartel)
Over the last year, my team has gone from being a Hadoop Infrastructure team that was constantly fixing problems and cleaning up messes, to declaring ourselves to be a Data Platform team, expanding into investigating new tools, teaching coworkers about big data, and consulting with other teams about how to meet their data needs.
1:15pm-1:55pm (40m) Hadoop Use Cases
The data-driven future of biotechnology
Aaron Kimball (Zymergen, Inc.)
Zymergen has industrialized the process of genome engineering to build microbes that produce chemicals at scale. High-throughput microbe development is driven by integrating machine learning and open source software for complex data storage, search, and bioinformatics. See how we built this futuristic vision for synthetic biology, and learn how NoSQL can power massive scale experimentation.
2:05pm-2:45pm (40m) Hadoop Use Cases
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud, a real-world case study
Jaipaul Agonus (FINRA)
This presentation is a real-world case study about moving a large portfolio of batch analytical programs that process 30 billion or more transactions every day, from a proprietary MPP database appliance architecture to the Hadoop ecosystem in the cloud, leveraging Hive, Amazon EMR, and S3.
2:55pm-3:35pm (40m) Hadoop Use Cases
Continuous curation of event data for a customer event hub
Arvind Prabhakar (StreamSets)
Modern data infrastructures operate on vast volumes of continuously produced data generated by independent channels. Enterprises such as consumer banks that have many such channels are starting to implement a single view of customers that can power all customer touchpoints. In this session we present an architectural approach for implementing such a solution using a customer event hub.
4:35pm-5:15pm (40m) Hadoop Use Cases
A hierarchical data warehouse in Hadoop
Amar Arsikere (infoworks.io)
Enterprise data warehouses have become a large cost center. As their data volumes grow, enterprises want to move their warehouses onto Hadoop. But it is not an easy task. How do you solve this problem? The speakers have designed and deployed large-scale data warehouses on Hadoop. In this talk, they will examine the technical underpinnings of their solution with a real-world example.
5:25pm-6:05pm (40m) Hadoop Use Cases
Migrating workloads from data warehouses to Hadoop
Alan Choi (Cloudera)
Many workloads are being migrated from data warehouses to Hadoop, but without a good methodology, the migration process can be challenging. In this talk, we’ll discuss such a methodology in detail: from cluster sizing, to query tuning, to production readiness.
11:20am-12:00pm (40m) Hadoop Use Cases
The Jedi Masters Guide to Wrangling JSON
Greg Rahn (Cloudera)
The flexibility and simplicity of JSON have made it one of the most common formats for data. Data engines need to be able to load, process, and query JSON and nested data types quickly and efficiently. There are multiple approaches to processing JSON data, each with trade-offs. In this session we’ll compare and contrast the approaches taken by systems such as Hive, Drill, BigQuery, and others.
1:15pm-1:55pm (40m) Hadoop Internals & Development
Simplifying Hadoop: RecordService, a secure and unified data access path for compute frameworks
Lenni Kuff (Facebook), Nong Li (Cloudera), Stephen Romanoff (Capital One)
Hadoop is supremely flexible, but with that flexibility come integration challenges. In this talk, we introduce a new service that eliminates the need for components to support individual file formats, handle security, perform auditing, and implement sophisticated IO scheduling and other common processing that is at the bottom of any computation.
2:05pm-2:45pm (40m) Hadoop Internals & Development
Hadoop's storage gap: Resolving transactional access/analytic performance trade-offs with Kudu
Todd Lipcon (Cloudera)
This session will investigate the trade-offs between real-time transactional access and fast analytic performance in Hadoop, from the perspective of storage engine internals. We will discuss recent advances, evaluate benchmark results from current generation Hadoop technologies, and propose potential ways ahead for the Hadoop ecosystem to conquer its newest set of challenges.
2:55pm-3:35pm (40m) Hadoop Internals & Development
Native erasure coding support inside HDFS
Zhe Zhang (LinkedIn), Weihua Jiang (Intel)
In this session, attendees will learn how erasure coding (HDFS-7285) can greatly reduce the storage overhead of HDFS without sacrificing data reliability.
4:35pm-5:15pm (40m) Hadoop Internals & Development
Transaction processing with Apache Hive, HBase, and Phoenix
Alan Gates (Hortonworks)
Hadoop gives the ability to keep all data together for shared use and analysis. People use Apache HBase for fast updates and low latency data access and Apache Hive for analytics. To improve sharing of this data, users need to be able to access their transactional and analytic data through one tool. This talk will cover work in the Hive, HBase, and Phoenix communities to deliver on this promise.
5:25pm-6:05pm (40m) Hadoop Internals & Development
OLTP on Hadoop: Reviewing the first Hadoop-based TPC-C benchmarks
Monte Zweben (Splice Machine Inc.), John Leach (Splice Machine)
Even after 25 years, the TPC-C benchmark still sets the standard for online transaction processing (OLTP) database benchmarking. It has traditionally been the arena for RDBMSs like Oracle Database, IBM DB2, and Microsoft SQL Server to do battle. Now, for the first time, a Hadoop database has successfully completed TPC-C benchmarks. Can it change the equation for OLTP workload price/performance?
11:20am-12:00pm (40m) Data Innovations
Big data at a crossroads: Time to go meta (on use)
Joe Hellerstein (UC Berkeley)
As the Hadoop ecosystem grows more complex, there is widespread desire for open metadata solutions: common ground for collaboration across users, and interoperability across software solutions. We motivate a new class of open metadata services for big data, via science and enterprise use cases. We also set out challenges for a new class of "meta-on-use" approaches fit for agile analytics.
1:15pm-1:55pm (40m) Data Innovations
Amazon Kinesis deep dive: Real-time streaming on Amazon Web Services
Roy Ben-Alta (Amazon Web Services)
Amazon Kinesis is a fully managed service for real-time streaming big data ingestion and processing. This talk explores Kinesis concepts in detail, including best practices for scaling your core streaming data ingestion pipeline. We then discuss building and deploying Kinesis processing applications using capabilities like Kinesis Client Libraries, AWS Lambda, and Amazon EMR (via Spark).
2:05pm-2:45pm (40m) Data Innovations
How companies are using Tachyon, a memory-centric distributed storage
Haoyuan Li (Alluxio)
Tachyon is a memory-centric fault-tolerant distributed storage system, which enables reliable file sharing at memory-speed. It is open source and is deployed at multiple companies. In addition, Tachyon has more than 80 contributors from over 30 institutions. In this talk, we present Tachyon's architecture, performance evaluation, and several use cases we have seen in the real world.
2:55pm-3:35pm (40m) Data Innovations
Google Cloud Dataflow - two worlds become a much better one
Eric Schmidt (Google)
Big data processing is challenged by four conflicting desires: latency, accuracy, simplicity, and cost. Google Cloud Dataflow combines a unified, open source programming model with a fully managed cloud service. Dataflow enables developers to answer questions with the right level of latency and accuracy, with low operational overhead regardless of size or complexity.
4:35pm-5:15pm (40m) Data Innovations
Data liberation and data integration with Kafka
Martin Kleppmann (University of Cambridge)
Even the best data scientist can't do anything if they cannot easily get access to the necessary data. Simply making the data available is Step 1 toward becoming a data-driven organization. In this talk, we'll explore how Apache Kafka can replace slow, fragile ETL processes with real-time data pipelines, and discuss best practices for data formats and integration with existing systems.
5:25pm-6:05pm (40m) Data Innovations
Real-time analytics with Solr
Yonik Seeley (Cloudera)
This talk will cover how search and Solr have become a critical part of the Hadoop stack, and have also emerged as one of the highest performing solutions for analytics over big data. We'll also cover new analytics capabilities in Solr that marry full-text search, faceted search, statistics, grouping, and joining into a powerful engine for next-generation big data analytics applications.
11:20am-12:00pm (40m) Spark & Beyond
What's coming for the Spark community
Patrick Wendell (Databricks)
In the last year Spark has seen substantial growth in adoption as well as the pace and scope of development. This talk will look forward and discuss both technical initiatives and the evolution of the Spark community.
1:15pm-1:55pm (40m) Spark & Beyond
Supercharging R with Spark for end-to-end data science
Hossein Falaki (Databricks Inc.)
R is the favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases from statistical inference to data visualization. However, handling large or distributed data with R is challenging. Hence R is used along with other frameworks and languages by most data scientists.
2:05pm-2:45pm (40m) Spark & Beyond
Next-generation genomics analysis with Apache Spark
Timothy Danford (Tamr, Inc.)
A revolution in DNA sequencing technology has led to exponential growth in the genomics data available to discover new drugs, diagnose patients, and understand the fundamental biology of human disease. Existing bioinformatics tools will have difficulty scaling to meet the challenges posed by this growth. Learn about next-generation tools for bioinformatics and genomics using Spark and Parquet.
2:55pm-3:35pm (40m) Spark & Beyond
Lifelogging for insights
Håkan Jonsson (Sony Mobile Communications)
In this talk we will show how Sony Mobile uses large-scale analytics on Spark to generate insights for Lifelog users about themselves and the population, and how we use analytics to build a user lifecycle model that allows us to take actions toward increased user engagement and retention.
4:35pm-5:15pm (40m) Spark & Beyond
Effective testing of Spark programs and jobs
Holden Karau (Google)
This session explores best practices for creating both unit and integration tests for Spark programs as well as acceptance tests for the data produced by our Spark jobs. We will explore the difficulties with testing streaming programs, options for setting up integration testing with Spark, and also examine best practices for acceptance tests.
5:25pm-6:05pm (40m) Spark & Beyond
Estimating financial risk with Apache Spark
Sandy Ryza (Clover Health)
How much can you expect to lose? The financial statistic Value at Risk seeks to answer this question, but is computationally intensive to estimate. At Cloudera, we’ve assisted several organizations in using Spark to compute VaR and other financial statistics. The talk, which walks through a basic VaR calculation, aims to give a feel for what it is like to approach financial modeling with Spark.
11:20am-12:00pm (40m) IoT & Real-time
When it absolutely, positively, has to be there: Reliability guarantees in Kafka
Gwen Shapira (Confluent), Jeff Holoman (Cloudera)
Kafka provides the low latency, high throughput, high availability, and scale that financial services firms require. But can it also provide complete reliability? In this session, we will go over everything that happens to a message - from producer to consumer, and pinpoint all the places where data can be lost - if you are not careful.
1:15pm-1:55pm (40m) IoT & Real-time
What does your smart device know about you?
Charles Givre (Deutsche Bank)
Many people are acquiring smart devices, and yet do not have an understanding of the data these devices gather about them and what can be done with this data if it is aggregated over time. The talk will demonstrate what data several popular devices—including the Nest Thermostat and a few others—gather and show what can be learned about an individual from this data.
2:05pm-2:45pm (40m) IoT & Real-time
Twitter Heron: Stream processing at scale
Karthik Ramasamy (Streamlio)
This talk will present the design and implementation of a new system, called Heron, that is now the de facto stream data processing engine inside Twitter. We will also share our experiences running Heron in production.
2:55pm-3:35pm (40m) IoT & Real-time
Streaming in the extreme
Jim Scott (MapR Technologies)
With the move to real-time data analytics and machine learning, streaming applications are becoming more relied upon than ever before. Discover how to build and deploy a globally scalable streaming system. This includes producing messages in one data center and consuming them in another data center, as well as how to guarantee that nothing is ever lost.
4:35pm-5:15pm (40m) IoT & Real-time
IoT with Spark Streaming: Practical lessons from real-world use cases
Hari Shreedharan (Cloudera), Anand Iyer (Cloudera)
Over the past year, Spark Streaming has emerged as the leading platform to implement IoT and similar real-time use cases. This session includes a brief introduction to Spark Streaming’s micro-batch architecture for real-time stream processing, as well as a live demo of an example use case that includes processing and alerting on time-series data (such as sensor data).
5:25pm-6:05pm (40m) IoT & Real-time
An open source approach to gathering and analyzing device-sourced health data
Ian Eslick (VitalLabs)
Capturing and integrating device-based and other health data for research is frustratingly difficult. We explain the open source technology framework we developed, working with O’Reilly Media and with support from the Robert Wood Johnson Foundation, for capturing and routing device-based health data for use by healthcare providers and, via a trusted analytic container, for access by researchers.
11:20am-12:00pm (40m) Design, User Experience, & Visualization
Value in the details - understanding data through visual exploration
Richard Brath (Uncharted Software), Rob Harper (Uncharted)
Direct visual exploratory analysis of big data yields insights that are otherwise overlooked. By plotting all the data, patterns that can be obscured by traditional visualization methods are preserved. This presentation highlights the power of visualizing whole data sets through examining a market order book and identifying pricing strategies.
1:15pm-1:55pm (40m) Design, User Experience, & Visualization
Data inclusion for all
Alex Kelly (General Motors), Kim Le (General Motors)
This session will demonstrate how data enables people to overcome their disabilities and live to their fullest. We will also point out critical underlying flaws of data interpretation (due to human bias), and offer action items for us to make the data world more inclusive, efficient, and connected.
2:05pm-2:45pm (40m) Design, User Experience, & Visualization
Visualising Music Services
Alan Hannaway (7digital)
7digital power a variety of music services with a diverse range of territories, devices and access models. They have been helping services transform the listening experience through visualising their data. Paul will demonstrate visualisations on listening bounce rate and content classification, giving examples of how these creative solutions to conveying information have helped engage people...
2:55pm-3:35pm (40m) Design, User Experience, & Visualization
Knowledge and the geospatial mixing pot
Andrew Hill (Textile)
You no longer need to be a remote sensing specialist to leverage real-time geospatial data from space. You don't need to be an expert to harvest social media on the cheap. Geospatial data analysis is a mixing pot that brings together your private data and streams of data from all over. We will talk about how we are bringing this mixing pot together for the future of understanding data.
4:35pm-5:15pm (40m) Design, User Experience, & Visualization
Music science: How data is changing what we listen to
Sean Power (Watching Websites), Joy Johnson (AudioCommon), Mike Rosenthal (Mick Management), Rishi Malhotra (Saavn)
This panel brings together founders and technologists who live on the cutting edge of music science. We’ll look at the “Turing problems” of digital entertainment, as well as how providers strike a balance between human curation and machine optimization.
5:25pm-6:05pm (40m) Design, User Experience, & Visualization
Virtual reality: From immersive visualization to data-driven narrative
Hugh McGrory (datavized)
Data is all science, no art. Think of a film that inspired or moved you. Now imagine the filmmaker decided that instead of making the film, they would present the material to you in the form of a graph or a chart. That’s where we are with data.
11:20am-12:00pm (40m) Law, Ethics, & Open Data
Personal information out of context: Building a consumer subject review board
Evan Selinger (Rochester Institute of Technology), Jules Polonetsky (Future of Privacy Forum)
Ethical concerns about the use of personal information in new ways have led to calls for the creation of consumer subject review boards, which could evaluate, approve, or monitor out-of-context uses of information absent user consent. This conversation between a philosopher and lawyer will address how organizations can use existing ethical frameworks to create practical accountability mechanisms.
1:15pm-1:55pm (40m) Law, Ethics, & Open Data
Protecting the humanity in data I: Ethics of algorithms/ethics of data activism/targeting services without excluding the needy
Jake Porway (DataKind), Cathy O'Neil (Weapons of Math Destruction), Vladimir Dubovskiy (DonorsChoose.org), Kamalesh Rao (DataKind)
No matter how good the intentions, ethical questions are inherent in the work of using data for social good. How are organizations navigating ethical pitfalls in order to make an impact? The key is protecting the humanity behind the numbers. In this series of talks, we'll learn how organizations are dealing with ethical considerations inherent in projects that aim to use data for good.
2:05pm-2:45pm (40m) Law, Ethics, & Open Data
Protecting the humanity in data II: Personalized crisis counseling/messiness of interpretation
Jake Porway (DataKind), Bob Filbin (Crisis Text Line), danah boyd (Microsoft Research | Data & Society)
No matter how good the intentions, ethical questions are inherent in the work of using data for social good. How are organizations navigating ethical pitfalls in order to make an impact? The key is protecting the humanity behind the numbers. In this series of talks, we'll hear from four speakers on how they are dealing with ethical considerations inherent in projects that aim to use data for good.
2:55pm-3:35pm (40m) Law, Ethics, & Open Data
How we amplify privilege with supervised machine learning
Mike Lee Williams (Cloudera Fast Forward Labs)
Because of the way sentiment analysis algorithms are trained, they systematically amplify the voices of those who express themselves unsubtly and aggressively. I will extrapolate from this observation to show the ways in which supervised machine learning has the potential to amplify social and economic privilege.
4:35pm-5:15pm (40m) Law, Ethics, & Open Data
Fixing Chicago’s crime data
Jay Margalus (MapR), Mike Emerick (MapR)
Who will watch the watchmen? This session will cover data integrity problems in open government introduced by the human element. We’ll then explore possible methodologies that will allow us to derive value from open government data, while still keeping a skeptical eye on the validity of the data itself.
5:25pm-6:05pm (40m) Law, Ethics, & Open Data
Ethical big data - what's legal and what's right
Steven Totman (Cloudera), Sam Heywood (Cloudera), Nick Curcuru (Mastercard)
Technology offers amazing big data use cases, but according to Gartner it's important to avoid "crossing the creepy line." Governance and security experts from Cloudera and MasterCard discuss the legal and ethical usage of big data. Ethical behavior drives trust - they are inseparably linked. For customers to trust and continue to do business with us requires an ethical data usage framework.
11:20am-12:00pm (40m) Production Ready Hadoop
Hadoop in the cloud: An architectural how-to
Jairam Ranganathan (Cloudera)
Apache Hadoop was designed when cloud models were in their infancy. Despite this fact, Hadoop has proven remarkably adept at migrating its architecture to work well in the context of the cloud, as production workloads migrate to a cloud environment. This talk will cover several topics on adapting Hadoop to the cloud.
1:15pm-1:55pm (40m) Production Ready Hadoop
Multi-tenant, multi-cluster, and multi-container Apache HBase deployment
Jonathan Hsieh (Cloudera, Inc), Dima Spivak (StreamSets)
With the number of production Apache HBase clusters increasing, there is greater demand for running multiple applications on single clusters, for data reliability and availability, and for developers to better test their applications. We’ll lay out how these new demands can be addressed using multi-tenant, multi-cluster, or multi-container deployments, including the use of Docker.
2:05pm-2:45pm (40m) Production Ready Hadoop
The glue: Building the connectors and tools to manage big data warehouses
Siwei Zhu (Scribd), Kevin Perko (Scribd)
With the explosion of big data open source technologies, companies can now build a powerful data warehouse. But as they reach scale, they’ll find that patching together numerous projects requires building their own tools to manage the data pipeline. In this presentation we will talk about the tools you’ll likely need to build in-house to make your data infrastructure manageable.
2:55pm-3:35pm (40m) Production Ready Hadoop
Failing fast and falling often is no way to run a cluster!
Michael Segel (Segel & Associates)
Today's Hadoop cluster has multiple single points of failure. This talk focuses on identifying these failings and how to mitigate them.
4:35pm-5:15pm (40m) Production Ready Hadoop
Building a production-ready data lake in the cloud
Prat Moghe (Cazena)
Hadoop’s ability to handle large amounts of varied data has been a driving force behind the explosion of big data. Many organizations’ ambitions to become more data-driven, however, are held back by a shortage of resources as well as the time and expense needed to purchase and set up hardware and software infrastructure. The cloud offers a natural alternative to overcome these barriers.
5:25pm-6:05pm (40m) Production Ready Hadoop
Real-world NoSQL schema design
Ted Dunning (MapR)
I will deconstruct a real-world database schema into the corresponding NoSQL design. Along the way, we will see how the number of tables drops by nearly 5x and the ease of understanding the design increases by a similar degree. In spite of radical changes, the resulting denormalized and nested data can still be queried with SQL by using Apache Drill. These methods are practical and easy to apply.
2:05pm-2:45pm (40m) Data Science & Advanced Analytics
Data and Ethics
DJ Patil (White House Office of Science and Technology Policy)
DJ Patil, U.S. Chief Data Scientist at White House Office of Science and Technology Policy
2:55pm-3:35pm (40m) Data Science & Advanced Analytics
Data and Ethics II
DJ Patil (White House Office of Science and Technology Policy)
DJ Patil, U.S. Chief Data Scientist at White House Office of Science and Technology Policy
4:35pm-5:15pm (40m) Data-driven Business
Data in a creative business: How to earn friends and influence people
David Boyle (MasterClass)
Drawing lessons from successes and failures in the music industry, book publishing and TV, David Boyle will share five lessons that are essential if you’re to use data to make a difference in creative businesses.
11:20am-12:00pm (40m) Sponsored
Putting Modern BI to Work: Innovative Use Cases
Ali Tore (ClearStory Data)
In this session, you will learn why organizations are embarking on a mission to understand the “now” of their businesses, what they are doing with their internal and external data to drive continuous insights, and how their businesses benefit from these insights.
1:15pm-1:55pm (40m) Sponsored
Expand your mind to fit the big data Data Center: the scale and cost of information management architectures
Robert Eve (Cisco), Robert Novak (Cisco), Nenshad Bardoliwalla (Paxata)
As big data becomes a pervasive force in the enterprise, many of our fundamental ideas around how to optimize compute, storage, network, and resource management are being stretched.
2:05pm-2:45pm (40m) Sponsored
Oozie or Easy: Managing Hadoop workflows the EASY way
Robby Dick (BMC Software)
This session describes how organizations are managing Hadoop and big data workflows with an enterprise workflow solution that provides a graphical user interface for managing all the complex components of the enterprise application fabric. They gain SLA management, forecasting and change impact analysis, auditing, reporting, and self-service via mobile devices.
2:55pm-3:35pm (40m) Sponsored
Design patterns for real-time data analytics
Sheetal Dolas (Hortonworks)
Businesses are moving from large-scale batch data analysis to large-scale real-time data analysis. Apache Storm has emerged as one of the most popular platforms for this purpose. This talk covers proven design patterns for real-time stream processing. They have been vetted in large-scale production deployments that process tens of billions of events/day and tens of terabytes of data/day.
4:35pm-5:15pm (40m) Sponsored
The 10 millisecond rule: Getting to 'Yes' with fast data and Hadoop
Bruce Reading (VoltDB)
You have 10 milliseconds. Less than the blink of an eye, the beat of a heart – that’s how much time you have to ingest fast streams of data, perform analytics on the streams, and take action. Ten milliseconds to win a customer, 10 milliseconds to make a sale, 10 milliseconds to save a life – it’s not much time.
5:25pm-6:05pm (40m) Sponsored
Combining open source software and cloud-native data processing services on Google Cloud Platform
Eric Brewer (Google)
In this talk, we will describe a Cloud-optimized deployment model for Spark and Hadoop, and explore how these tools and Cloud-native services complement each other to form the most productive and efficient data processing platform.
11:20am-12:00pm (40m) Sponsored
Where do we go from here? Lessons and landmarks from real-world Cisco UCS big data deployments
Robert Novak (Cisco)
Big data has moved beyond the bleeding-edge, early-adopter stage. If you're not using it now, you will be soon. But big data deployments are not a cookie-cutter, one-size-fits-all effort. Cisco Big Data Consulting Systems Engineer Robert Novak will present real-world deployment stories and use cases for big data on Cisco UCS, especially (but not exclusively) around Hadoop environments.
1:15pm-1:55pm (40m) Sponsored
Business impact from IoT? Just add data science
Sarah Aerni (Pivotal)
The promise of IoT is that it will forever change the way people and businesses interact with the world. Using illustrative use cases, Pivotal will demonstrate the fundamental concepts required to drive true impact from these connected devices. We will cover which models are most appropriate, what considerations around data access and processing are critical, and which tools are available.
2:05pm-2:45pm (40m) Sponsored
End User Panel on Real-Time Data Analytics
Eric Frenkiel (MemSQL), Noah Zucker (Novus Partners), Ian Hansen (Digital Ocean), Michael DePrizio (Akamai Technologies)
In-memory is no longer just a trend: it’s an imperative for high-volume, real-time data workloads. With the relational, distributed MemSQL database, modern enterprises are unlocking value from gigabytes and terabytes of data. Learn about some of the latest applications and deployments of in-memory technology from Akamai Technologies, Novus, and Digital Ocean.
2:55pm-3:35pm (40m) Sponsored
Meeting the needs of the business: Real-time and historical big data for comprehensive security analytics and operational insights
Alex Loffler (TELUS)
Security teams study many months and years of data for baselining and incident forensics, but IT operations may only want to store weeks or months of data to analyze for operational insights. And the two different needs can be difficult to reconcile. Learn how TELUS's security analysts provide value to both teams.
4:35pm-5:15pm (40m) Sponsored
How Pepsi wrangles the diverse data of consumer packaged goods
Matthew Derda (Pepsi), Douglas Stradley (Trifacta)
Pepsi analyst Matthew Derda and Trifacta Director of Customer Success Doug Stradley discuss why data wrangling is critical to empowering analysts to efficiently access and incorporate diverse big data sources for organizational analysis. Get first-hand examples of where traditional ETL and scripting approaches fall short, and why “self-service” approaches are critical to big data initiatives.
5:25pm-6:05pm (40m) Sponsored
Requirements for secure, multi-tenant Hadoop: It’s much more than YARN
Anant Chintamaneni (BlueData)
Hadoop multi-tenancy is becoming a must-have – in order to accommodate multiple lines of business, multiple concurrent Hadoop jobs, multiple versions of Hadoop, multiple applications, security isolation, and more. This session will discuss these requirements and share recommendations on how to deploy a secure multi-tenant Hadoop environment with simplicity, agility, and low management overhead.
11:20am-12:00pm (40m) Sponsored
Big data analytics in the cloud
Matt Winkler (Microsoft)
At Microsoft, we process exabytes of data to run our own businesses. Learn how you can process big data in the cloud at massive scale with no hardware to deploy, no software to tune or configure, and no infrastructure to manage. We’ll also talk about overcoming common obstacles in big data adoption such as a steep learning curve, cost of implementation, tuning infrastructure, and providing security.
1:15pm-1:55pm (40m) Sponsored
Real data, real implementations: What actual customers are doing
Andrew Brust (Datameer), Jeff Jarrell (American Airlines), Ryan Wright (Kelley Blue Book), Kendell Timmers
Beyond the euphoria of what big data can do, and the stress that comes from feeling that you’re not doing enough, how can you really get started? What are some concrete things you can do and some reasonable results you can expect? This panel, featuring real customers who are technology implementation leaders, will help you answer these questions.
2:05pm-2:45pm (40m) Sponsored
Delivering trusted data for analyst autonomy and operational agility with a unified big data fabric
Vishal Bamba (Transamerica), Murthy Mathiprakasam (Informatica)
In this session, learn how leading customers have built a unified big data fabric on top of Hadoop, using technologies like Informatica to repeatably deliver trusted data assets to a large community of data consumers, for a multi-dimensional view of customers.
2:55pm-3:35pm (40m) Sponsored
Machine learning in big data – look forward or be left behind
Bill Porto (RedPoint Global)
This session covers why continual, adaptive optimization is a key to success with real world machine learning models. Bill will detail the applicability of machine learning tools with the pros/cons of each. Learn how to optimize processes to drive more predictable outcomes from business decisions. Tools for automating access to changing data and removal of noise and error will also be reviewed.
4:35pm-5:15pm (40m) Sponsored
Enter the snake pit for fast and easy Spark and Cassandra
Jon Haddad (The Last Pickle)
Everyone knows that Python isn’t suitable for massive scale analytics, right? Wrong. Spark 1.3 introduced data frames, which allow for high performance Spark batch jobs, streaming, and machine learning over massive datasets. In this talk you’ll learn how to combine Cassandra, a highly scalable, always-on OLTP data store, with PySpark, a framework for distributed computation.
5:25pm-6:05pm (40m) Sponsored
Think like a data scientist: Build your big data blueprint
Bill Schmarzo (EMC Consulting)
Bill Schmarzo, EMC CTO of Global Services, and author of “Big Data: Understanding How Data Powers Big Business,” will utilize a workshop approach to help you identify where and how to integrate data and analytics into your business strategies.
11:20am-12:00pm (40m) Sponsored
Roll your own big data analytics in the cloud without reinventing the wheel
Vin Sharma (Intel)
To accelerate enterprise deployment of big data analytics, Intel and partners introduced an open source trusted analytic platform-as-a-service for data scientists and app developers to build and deploy advanced analytics applications at cloud scale. Join us and discover how you can customize and develop your own big data solutions with this platform.
1:15pm-1:55pm (40m) Sponsored
The forces that will disrupt big data
Anthony Dina (Dell)
The only guarantee in life is change. That’s exactly what makes the world interesting and innovative, and that’s exactly what the large internet properties are counting on: to disrupt traditional businesses with an always-on, data-centric business model.
2:05pm-2:45pm (40m) Sponsored
How Riot Games uses Platfora to improve League of Legends' performance
Peter Schlampp (Platfora), Chris Kudelka (Riot Games)
League of Legends has more than 67 million players per month. Riot Games needed an analytics solution that would work well with its push-model data pipeline. In this session, data engineer Chris Kudelka will discuss how Riot's game designers use the company's data pipeline and Platfora to measure and validate player-focused changes like improvements to game servers and client performance.
2:55pm-3:35pm (40m) Sponsored
Hydrate a data lake in days with CDAP
Jonathan Gray (Cask)
Data lakes represent a new data architecture that provides enterprises with the scale and flexibility required for big data: unbounded storage for unbounded questions. While Hadoop is the de facto standard for implementing data lakes today, significant time and effort are still required. This talk introduces Cask Hydrator, a new open source data lake framework and drag-and-drop UI built on CDAP.
4:35pm-5:15pm (40m) Sponsored
Catalog, secure, and govern your Hadoop data lake
Alex Gorelik (Waterline Data), Jim Kaskade (Janrain), David Tabacco (Merck & Co., Inc.), David Paige (Cox Automotive)
This talk is about the best practices approach to accelerate data discovery while complying with security and data governance needs. Learn how to implement an automated and governed inventory of your data assets. Open up your data lake with secure self-service to find and understand data quickly.
5:25pm-6:05pm (40m) Sponsored
Fast fish eat slow fish: How to move faster
Samuel Cozannet (Canonical)
Whether you’re a large enterprise or a startup, successfully competing with modern, nimble, fast-moving companies like Uber or Airbnb can only be done with modern, model-driven development environments and big data solutions. Infrastructure shouldn’t restrict the interactions between relational data and big data. Development shouldn’t slow analytics.
9:00am-5:00pm (8h) Training
Spark Development Bootcamp (Day 2)
Laurent Weichberger (OmPoint Innovations, LLC)
This three-day curriculum features advanced lectures and hands-on technical exercises for Spark usage in data exploration, analysis, and building big data applications.
9:00am-5:00pm (8h) Training
Practical data science on Hadoop (Day 2)
Brandon MacKenzie (IBM), John Rollins (IBM), Jacques Roy (IBM), Chris Fregly (PipelineAI), Mokhtar Kandil (IBM)
In this three-day course, you will learn how to use machine learning, text analysis, and real-time analytics to solve frequently encountered, high-value business problems; understand data science methodology and the end-to-end workflow of a problem solution, including data preparation, model building and validation, and model deployment; and use Apache Spark and other tools for analytics.
9:00am-5:00pm (8h) Training
Designing and building big data applications (Day 2)
Nathan Neff (Cloudera)
Cloudera University’s three-day course for designing and building big data applications prepares you to analyze and solve real-world problems using Apache Hadoop and associated tools in the enterprise data hub (EDH).
7:30am-8:45am (1h 15m)
Break: (Coffee Break - 7:00am - 8:45am)
8:45am-8:50am (5m)
Wednesday keynote welcome
Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program Chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.
8:50am-9:05am (15m)
The next generation
Mike Olson (Cloudera)
Mike Olson, CSO and Chairman, Cloudera
9:05am-9:15am (10m)
Playing with, and for, data
AnnMarie Thomas (School of Engineering and Schulze School of Entrepreneurship, University of St. Thomas)
Unusual collaborations can often lead to new ways of taking and analyzing data. This talk looks at lessons learned from working with chefs, circus performers, and preschoolers.
9:15am-9:25am (10m) Sponsored
What 0-50 million users in 7 days can teach us about big data
Joseph Sirosh (Microsoft)
Join Microsoft’s Joseph Sirosh for a behind-the-scenes sneak peek into the creation of the viral phenomenon How-Old.net. He'll cover how it got to 50 million users in 7 days, the unexpected big data challenges that came with it, and the surprising learnings they had about people and systems.
9:25am-9:30am (5m) Sponsored
Improving Medical Decision Making with Predictive Analytics on Big Data
Ron Kasabian (Intel), Michael Draugelis (Penn Medicine)
Even in this era of intense medical breakthroughs, many illnesses still evade accurate and timely diagnosis. Clinicians must often rely on static diagnostic guidelines that result in late care and too many false alarms. Half of all heart failure patients can go undiagnosed.
9:30am-9:35am (5m) Sponsored
The race to modernize BI: What it is and why so urgent?
Tim Howes (ClearStory Data)
This keynote unveils why rapid modernization of BI is taking place, the business use cases driving it, and what’s essential in next-generation solutions.
9:35am-9:40am (5m) Sponsored
Unleashing the power of big data today
Jim McHugh (Cisco)
IoE, IoT, and big data – three topics you hear and read about often in our various industries. Let’s quickly look at these market and technology dynamics, and see how they are each in their own way ‘democratizing’ data access and analysis, resulting in new businesses, technologies, and improved community solutions throughout the world.
9:40am-9:50am (10m)
A Transition to Interactive Music Consumption + Data
Joy Johnson (AudioCommon)
Joy Johnson, VP, Mobile, AudioCommon
9:50am-10:00am (10m)
Data vs creativity: The last battleground?
David Boyle (MasterClass)
Are creative businesses the last battleground for data-driven decision making? Drawing lessons from successes and failures in the music industry, book publishing, and TV, David Boyle will argue for a negotiated settlement in the war between data and creative, and show how long-term and mutually beneficial peace can work.
10:00am-10:10am (10m)
On reflection: What the White House needs from you
DJ Patil (White House Office of Science and Technology Policy)
DJ Patil, U.S. Chief Data Scientist at White House Office of Science and Technology Policy
10:10am-10:25am (15m)
Improving decisions
Katherine Milkman (Wharton School at the University of Pennsylvania)
Katherine will discuss recent behavioral science research suggesting how a number of simple, inexpensive tools can be used to encourage improved decisions.
10:25am-10:30am (5m)
O'Reilly Announcements
Ben Lorica (O'Reilly Media)
Ben Lorica, Program Director, O'Reilly Media.
10:30am-10:45am (15m)
Context Computing
Jeff Jonas (IBM)
Jeff Jonas, IBM Fellow; Chief Scientist, Context Computing
10:50am-11:20am (30m)
Break: Morning Break sponsored by ClearStory Data
3:35pm-4:35pm (1h)
Break: Afternoon Break sponsored by Bloomberg
6:05pm-7:05pm (1h) Events
Booth Crawl
Quench your thirst with vendor-hosted libations and snacks while you check out all the exhibitors in the Expo Hall.
12:00pm-1:15pm (1h 15m) Events
Lunch / Wednesday BoF Tables
Birds of a Feather (BoF) discussions are a great way to informally network with people in similar industries or interested in the same topics.
6:30am-7:30am (1h) Events
Data Dash
Please join Cloudera and O'Reilly Media for the Data Dash run / walk, held in conjunction with Strata + Hadoop World in New York 2015.
8:00pm-10:30pm (2h 30m) Events
Data After Dark: High Line Hop
LOCATIONS: Tao Downtown Nightclub: 389 W. 16th St. • Avenue: 116 10th Ave. • Gaslight: 400 W. 14th St. • Catch NYC, 3rd Floor: 21 9th Avenue • The Penthouse, Red Room, & The Garden at The Park NYC: 118 10th Avenue
7:05pm-8:00pm (55m)
Break: Dinner