Speaker Slides and Video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Amir Halfon (ScalingData)
Slides:   1-PDF 
The flexibility of Apache Hadoop is one of its biggest assets, letting organizations generate value from data that was previously considered too expensive to be stored and processed in traditional databases. But organizations still struggle to get the greatest business value out of their Hadoop deployments. One key concern is how to avoid ...
Ravi Hubbly (Leidos)
Slides:   1-PPTX 
Enterprises continue to rely on legacy mainframe-based systems even though utilizing these legacy systems is prone to risks. This is mainly because prior efforts at modernization of these legacy systems have been difficult. In this topic we will discuss usage scenarios where utilizing Hadoop has assisted in modernizing legacy systems and position businesses for big data benefits.
Robert Grossman (Open Data Group)
Slides:   1-PDF 
Many analytic and data science problems are about trying to understand the data well enough to build a model with good predictive power, but there are also some analytic problems that are best understood as involving an adversary. In this talk, we give an introduction to adversarial analytics, giving examples from fraud, real time bidding systems, high momentum trading, and cybersecurity.
Bahman Bahmani (Rakuten)
Slides:   1-PPTX 
We will show how scalable algorithm design can enable big data applications that would otherwise be simply infeasible even using the most modern big data architectures. Then, we provide effective techniques for designing such algorithms and explain the tradeoffs governing them. We will crystallize these techniques using concrete examples from machine learning to social network and text analytics.
Patricia Gorla (The Last Pickle)
Slides:   1-PDF 
Before you analyze your big data, you need a way to store and access it. Here we examine the benefits of using a highly-available, eventually consistent storage system, and what impact this has on real-time analytics. This session will prepare you to set up a multi-node working Cassandra and Hadoop cluster.
Tathagata Das (Databricks), Haoyuan Li (Alluxio), Ion Stoica (UC Berkeley), Reynold Xin (Databricks), Sameer Agarwal (UC Berkeley)
Slides:   1-PDF    2-PPTX    3-PPTX    4-PPTX    5-PPT    6-PDF 
An introduction to the open-source Berkeley Data Analytics Stack (BDAS). Spark is a high-speed cluster computing engine that supports rich analytics (e.g. machine learning) and lower-latency processing (e.g. streaming). Tachyon provides in-memory storage, letting Spark and Hadoop jobs share data efficiently. Shark and GraphX provide high-speed Hive SQL queries and graph processing on top of Spark.
Scott Sorensen
Slides:   1-PDF 
New, affordable DNA sequencing will generate massive new flows of data. The company currently manages 4 petabytes of searchable data and is on track to increase this figure exponentially with its new DNA product. CTO Scott Sorensen explains how the company manages tremendous amounts of new data through two categories of Hadoop use cases: 1) analytics and 2) product features.
Nirmal Ranganathan (Rackspace), David Dobbins (Rackspace Hosting)
Slides:   1-PPTX 
We'll discuss some of the use cases for when a virtual Hadoop cluster makes sense and share some of our experiences and some of the decisions that drove the product design of Rackspace Cloud Big Data, an upcoming HDP-as-a-service offering from Rackspace Hosting.
Nick Dimiduk (Hortonworks, Inc)
Slides:   external link,   2-PDF 
Your application is outgrowing its database, and you've started shopping for NoSQL options. Maybe you've adopted Hadoop into your data warehouse. You've heard HBase might be an appropriate technology, but you need to know more. This talk is for you. To understand its use, first understand how it works. This talk explores the design of HBase and its critical paths to ground an understanding of its use.
Arun Murthy (Cloudera), Alan Gates (Hortonworks), Owen O'Malley (Cloudera)
Slides:   1-PPTX 
Apache Hive is the de facto standard for SQL-in-Hadoop today with more enterprises relying on this open source project than any alternative. New enterprise requirements for Hive to become more real time or interactive have evolved… and the Hive community has responded. Please join Arun Murthy, Owen O'Malley and Alan Gates to learn more about Stinger and improvements to Apache Hive.
Ari Gesher (Kairos Aerospace), Danielle Kramer (Palantir Technologies)
Slides:   1-PDF 
AtlasDB is a bolt-on layer for key-value stores (distributed or otherwise) that implements MVCC and guarantees ACID properties for eventually-consistent data stores. In this talk, we'll take a look at the protocol used to implement the transactions, talk about the performance tradeoffs from using transactions, and look at the transactions API it offers.
Douglas Merrill (ZestFinance)
Most people think success in big data analysis is about the right mix of vast amounts of data, mathematics and Ph.D.’s (oh my!). Those people are wrong. You need artistry too. This talk will provide some examples of "pure" ML failures and give suggestions on how to build an appropriately artistic team.
Eddie Satterly (Splunk)
Slides:   1-PDF 
In this session you will hear from big data experts with real world experience on the architectural patterns and tools integrations used to solve real business problems with data.
Ken Rudin (Facebook)
In this talk, Ken will discuss several best practices focused on getting the biggest impact from big data and driving a proactive, data-driven culture.
John Akred (Silicon Valley Data Science), Richard Williamson (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers)
Slides:   external link
What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop and big data ecosystem fit together in production to create a data platform supporting batch, interactive and realtime analytical workloads.
Quentin Clark (Microsoft)
The idea that big data will transform businesses and the world is indisputable, but are there enough resources to fully embrace this opportunity? Join Quentin Clark, Microsoft Corporate Vice President, who will share Microsoft’s bold goal to consumerize big data - simplifying the data science process and providing easy access to data with everyday tools.
Jim Kaskade (Infochimps)
Data and analytics is a means to an end. Jim highlights a new revolution of analytic applications with some touching examples in the healthcare industry with cancer research and medication therapy management.
Peta Clarke (Google), Donna Knutt (Black Girls Code)
Details to come.
Justin Makeig (MarkLogic)
Slides:   1-PDF 
Securely and cost-effectively managing petabytes of data from siloed systems is both a threat and opportunity for banking, healthcare, and other organizations in highly regulated industries. Drawn from production projects, this session will examine best practices around the use of Hadoop as part of a regulated data environment, including retention, provenance, privacy, and security.
Vaclav Petricek (eHarmony)
Slides:   1-PDF 
Humans have a mixed record in choosing romantic partners. Are looks or brains more important for a happy marriage? This session will show you how big data and large scale machine learning can help us model such complex behavior and tell us which traits in a partner actually matter. Who knows - maybe Hadoop will help you find Love ;-)
Matt Schumpert (Datameer)
Slides:   1-PDF 
With data scientists in short supply, it's surprising that much of their precious time is spent doing "data plumbing"—preparing data or servicing business users rather than doing actual data science. In this session, we'll look at the gradual evolution of tools that's moving us towards self-service data science.
Brandon Ballinger (Cardiogram)
Slides:   1-PDF 
Deep learning has upset the best results in speech recognition, computer vision, and other fields. How do deep neural nets work? What makes them different than the classical neural nets of the 70's? How is deep learning getting us closer to the original dream of AI -- machines that can think?
Samuel Kommu (Cisco Systems)
Slides:   1-PDF 
Is it possible to use a BigData cluster for other applications? Should the cluster be virtualized or on bare metal? Local storage or Shared? Which Hadoop version? Cisco will examine and discuss some of these concepts, to help plan and optimize a Big Data cluster running multiple applications without impacting performance.
Matt Asay (AWS)
Slides:   external link,   2-PDF 
For some, Hadoop is synonymous with "Big Data." But Hadoop is just one component of a successful Big Data architecture. NoSQL solutions like MongoDB also play a dominant role for storage and real-time data processing, and RDBMS has a place, too. This session will drill down on the different types of NoSQL databases and how they fill out Hadoop and RDBMS in a modern Big Data architecture.
Josh Klahr (Pivotal)
Data is coming at us from everywhere – in small quantities, large magnitudes, and in almost every format. As Pivotal’s Vice President of Data Platform Product Management, Josh Klahr has the know-how to provide insights on how to build an organization that strategically manages this data in today’s modern and complex enterprise environments.
Ravi Devireddy (Visa Inc), Annika Jimenez (Pivotal)
In this talk Annika Jimenez will paint a picture of requirements for data science success, and Ravi Devireddy will discuss the challenges in cyber security, opportunities with Hadoop and big data, and present some use cases and applications. Both will share lessons learned on the bleeding edge of data science.
Vijay Agneeswaran (Walmart Labs), Pranay Tonpay (Impetus)
Slides:   1-PPTX 
We will talk about how we have implemented machine learning algorithms over Spark Streaming to allow real-time processing, namely “Naïve Bayes” and “Logistic Regression” for classification and “k-means” for clustering. We have also implemented PMML support for the ML algorithms in Spark, to provide a very flexible means to import models and evaluate their performance.
David Parker (SAP)
Big Data is impacting society in ways never possible before – enabling us all to gain insights that can transform the way we do business, work with others, and live our lives. SAP recognizes that this transformation needs grassroots support...
Srini Srinivasan (Aerospike Inc.)
Slides:   1-PPTX 
Internet environments for consumer-facing applications routinely demand high throughput while SLAs require 100% uptime. This session reviews 10 practices for ensuring high performance and availability based on the real-world lessons of large-scale ad sector deployments where speed means 5 milliseconds, scale is 200,000 to 2 million TPS against terabytes of data, and downtime is not an option.
Amy Gaskins (Panopticon)
Slides:   1-PPTX 
The Army's Every Soldier is a Sensor (ES2) concept is entrenched in the belief that all soldiers, no matter their rank or specialty, can provide useful information on the battlefield. While deployed to Kandahar, Afghanistan, the 43d Sustainment Brigade put ES2 to the test: training soldiers to obtain critical information about corruption and using it to figure out where our money actually goes.
Amie Elcan (CenturyLink)
Slides:   1-PDF 
As use of the Internet evolves, the data collected about Internet traffic must evolve in parallel to ensure the performance of applications and to keep access affordable. The ability to characterize how the Internet is being used is essential to the telecom industry. Case studies using R and Python Pandas will be presented to demonstrate the power of analytics to answer strategic questions.
Tony Salvador (Intel Corporation)
This talk will cover five major mobile trajectories for the next 10 years creating a brand new world: Seven Billion Futures, Hyper Digitization, Hyper Individualism, Hyper Collectivity, and Hyper Differentiation.
Daniel Abadi (Yale University), Matthew Grace (Objective Logistics)
Slides:   1-PPTX 
Although there are several SQL-on-Hadoop tools (a concept that Hadapt pioneered in 2009), these tools still rely on ETL (or MapReduce jobs) to structure raw data into a SQL-queryable format. Hear how Hadapt continues to lead the innovation curve with the Data-Driven Schema and Multi-Structured Tables, dramatically improving time-to-insight and depth of analytic possibility.
Ben Werther (Platfora)
During the session attendees will learn how Big Data Analytics is the difference between fact-based enterprises and those focused on the shallow BI beauty contest.
Henry Robinson (Cloudera)
Slides:   1-PDF 
The increasing diversity of frameworks and workloads that run atop a Hadoop cluster gives more flexibility and power to users, but makes it very difficult for an administrator to ensure that SLAs are met while allowing exploratory, ad-hoc usage to continue to use all spare capacity. We present our vision and implementation for generalised resource management on Hadoop, suitable for all uses.
Milan Vaclavik (CenturyLink Technology Solutions)
Slides:   1-PPTX 
Depending on who you talk to, Hadoop is either a massive disruption in IT, or a logical progression of existing technology trends. In this session, Savvis executives will provide a straightforward view of how Hadoop and related big data market dynamics fit into the broader IT market landscape. They will discuss why Hadoop alone is not a panacea for achieving information insight success...
Carlos Guestrin (Apple | University of Washington), Joseph Gonzalez (UC Berkeley)
Slides:   1-PDF 
GraphLab is like Hadoop for graphs. Users express graph processing algorithms using a simple API and the GraphLab runtime efficiently executes that computation on multicore and distributed architectures. By leveraging advances in graph representation, asynchronous communication, and scheduling, GraphLab is able to achieve orders-of-magnitude performance gains over existing systems like Hadoop.
Mark Slusar (Allstate)
Slides:   1-PPTX 
After a successful round of Hadoop Data Science projects, a company will make a sizable Hadoop commitment. People, process, and technology stand at the tipping point for an exciting adventure in innovation and evolution that creates new possibilities. This presentation educates attendees on the changes from the traditional methods to the new methods and paints a vision of the future.
Adam Kawa (GetInData)
Slides:   1-PDF 
A trip into the Hadoop jungle to show the most interesting, exciting and surprising places we have been while growing fast from a 60-node to a 690-node Hadoop cluster. We will expose our JIRA tickets, real graphs, statistics, even excerpts from our dialogues. We will share the mistakes that we made and describe the fixes that finally domesticated this love-demanding yellow elephant and its friends.
Stephen Brobst (Teradata Corporation), Ari Zilka (Hortonworks)
Slides:   1-PDF 
Hortonworks Chief Product Officer, Ari Zilka, and Teradata CTO, Stephen Brobst, show you when to use Hadoop and when to use an MPP relational data warehouse. At a conceptual level, it is easy to think they are interchangeable, but the differences overwhelm the similarities. Two of the most trusted experts in their fields examine how big data technologies are being used in practical deployments.
Dan McClary (Oracle)
Slides:   1-PDF 
Organizations are experimenting with Hadoop, but spending too much time in configuration and maintenance. In this session, we'll consider the benefits of an appliance model and the future functionality of pre-integrated Hadoop clusters. Learn about the requirements for an enterprise Hadoop cluster and how a pre-integrated appliance can most efficiently deliver enterprise Hadoop needs.
Tanel Poder (gluent.)
Slides:   1-PDF 
If you are a developer or DBA with an Oracle background and want to learn how Hadoop works, this session is for you. We will go through the Hadoop HDFS and MapReduce data processing flow and compare it to the already familiar Oracle database parallel processing, which should make understanding the internals of this new technology a breeze.
Mike Olson (Cloudera)
As Hadoop and the surrounding projects & vendors mature, their impact on the data management sector is growing. Mike will talk about his views on how that impact will change over the next five years. How central will Hadoop be to the data center of 2020? What industries will benefit most? Which technologies are at risk of displacement or encroachment?
Paul Kent (SAS)
Slides:   1-PPTX 
How does IT balance the tension between “one glorious cluster that serves them all” and “one cluster, one purpose – dedicated for the particular task and not to be interfered with by anything”? Kerberos, cgroups, and YARN to the rescue! This talk describes the current practices and speculates how things get better under YARN.
Jing Zhao (Hortonworks, Inc.), Tsz-Wo Sze (Hortonworks Inc.)
Slides:   1-PDF 
In this talk, attendees will understand the high level design of HDFS snapshots, along with how snapshots can be used for data protection and disaster recovery. We will also talk about details of snapshot development and testing. In the end, we will explore how to build and improve other features on top of HDFS snapshots, including Distcp, HBase snapshots, and Hive table snapshots.
Slides:   1-PPTX 
Using social security data, statistician Hilary Parker looks into the popularity of certain names, and finds that the names “Hilary” and “Hillary” took a particularly steep tumble in 1992, the year that Hillary Clinton came onto the national radar. Compared to other names that dropped in popularity, the trajectories of “Hilary” and “Hillary” are quite distinct.
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers)
Slides:   external link
A modern CIO rationalizing a company’s data architecture must consider a mix of deployment options like a utility executive has to invest in a good generation mix. We articulate a framework for applying the deployment levers available to architects as they plot a course forward in this era of big data technologies, born of our deep experience implementing the world's largest data platforms.
Erin Shellman (Nordstrom), David Von Lehman (Nordstrom)
Slides:   external link
Nordstrom started modestly in 1901 as a small shoe store in Seattle, and has since expanded to 117 full-line department stores and 138 Rack stores across the country. The art of retailing has changed dramatically over the last century and retailers today are concerned with understanding customer behavior and preferences both in the physical world and online.
Slides:   1-PDF 
Readers and preparers of graphs: Learn to recognize and avoid some common graphical mistakes to understand your data better and make better decisions from data. Examples and mistakes will be different from those used in a similar presentation at the 2011 conference.
Ulrich Rueckert (Datameer)
Slides:   1-PDF 
Even if one has big data, sometimes there is a lack of key data. This is a problem for predictive analytics: if there is only a limited amount of training material (e.g. user ratings, categorized documents), then it is hard to generate accurate models. The talk introduces new semi-supervised learning methods to overcome this problem by utilizing the vast amount of unlabeled data.
Zack Exley (Brand New Congress), Sahar Massachi (Independent)
Slides:   1-PPTX    2-PDF 
There's something about AB testing that invites statistical malpractice, and that makes communication between academics and practitioners very difficult. Wikipedia's revenue depends on doing testing right. We'd like to present simple methods that we believe accurately predict future performance from AB test results, while minimizing sample size, along with proofs from four years of test data.
Jorge Lopez (Amazon Web Services), Matthew Brandwein (Cloudera)
Slides:   1-PDF    external link
Mainframe is Big Data too! Leveraging it in Hadoop creates a remarkable competitive advantage, but exploiting it without the right tools is nearly impossible, requiring you to wrestle with thousands of lines of Java, Pig, Hive, COBOL and more. This session presents a smarter way to ingest and process mainframe data in Hadoop, and how to bridge the technical, skill and cost gaps between the two.
Mona Vernon (Thomson Reuters Labs)
Slides:   1-PDF 
This session will include some evidence for hypercompetition and the implications of data-driven decision making in this context. It will also include a set of strategic frameworks and innovation processes better suited to survive in a hypercompetitive environment.
Matthew Grace (Objective Logistics)
Slides:   1-PPTX 
Join Matt Grace, Co-Founder and CTO of Objective Logistics, as he highlights the challenges surrounding interactive analytics and the power of Hadapt’s Multi-Structured Tables and Flexible Schema. Leveraging these tools has enabled Objective Logistics to improve time-to-insight and depth of analytic possibilities for restaurateurs and retailers throughout the US.
James Stewart, James Abley (Government Digital Service)
Slides:   1-PDF 
The UK Government team behind the GOV.UK website talk about their work on the Performance Platform, a suite of services and a cultural shift taking people away from immensely detailed value stream maps about a call-centre and paper process (which might be an inherently 5-day long journey), to something that's digital, lightweight, fast and pleasant to use.
Micheline Casey (Federal Reserve Board)
Slides:   1-PPT 
Traditionally, security has tended to mean "lock down and protect". But there is a balance between securing data while still supporting information sharing and reuse. This presentation is meant to educate data management professionals at all levels how to manage this balance.
David Parker (SAP)
Slides:   1-PPTX 
Learn how solutions from SAP and our Hadoop partners can help your organization gain unprecedented insight from Big Data.
Stephanie McReynolds (ClearStory Data), Vaibhav Nivargi (ClearStory Data), Brian Zotter (ClearStory Data), Stephen McDaniel (Freakalytics)
Slides:   1-PPTX 
See a whole new way to speed the data processing cycle, converge and analyze diverse data, and interact with insights. The old approach limits how much data you can access and slows down decision-making. Join us to see a whole new data architecture and data application that converges more data, faster, from diverse sources, and allows a new level of interactive insights.
Foster Provost (NYU | Stern)
Predictive analytics is one of the most mature areas of data science and an area where "big data" often is associated with competitive advantage. However, concrete results supporting the advantage conferred by big data are few and far between.
Slides:   1-PDF 
Big data is transforming the cloud as it moves from web giants into the enterprise. To run today’s multiple workload types, infrastructure must be architected as a common software-defined platform that supports the key workload components for today’s and tomorrow’s big data systems. We must plan now to accommodate explosive growth and the need for robust storage, networking and security.
Adam Fuchs (Sqrrl)
Slides:   1-PDF 
The National Security Agency works with some of the world’s largest, most complex, and most sensitive datasets. In order to analyze this data, NSA has developed some powerful tools, such as Apache Accumulo. Come learn about NSA’s key lessons learned about building a Big Data platform from the former Technical Director of the Accumulo project at the NSA.
Ted Dunning (MapR, now part of HPE)
Machine learning constructs such as Recommendation engines take a simplistic approach to data modeling: a single kind of user interaction with a single kind of item is used to suggest the same kind of interaction with the same kind of item. We will cover why this approach is flawed and present an easily implemented recommendation architecture and implementation style that addresses these flaws.
Baron Schwartz (VividCortex)
Slides:   external link
What if data doesn't need to be big? Many use cases can be served well by a Small Data mindset, trading off accuracy in return for decreased cost. Examples include Bloom Filters, moving averages, and downsampling. This talk presents ideas and options you might not have considered for reducing big problems to comparatively small and cheap ones.
Baron Schwartz (VividCortex)
Slides:   external link
What if data doesn't need to be big? Many use cases are served as well, or nearly as well, by a Small Data mindset, storage, processing, and algorithms. This talk presents ideas and options you might not have considered for reducing big problems to comparatively small and cheap ones.
Feng Peng (LinkTime Cloud)
Slides:   1-PDF 
At Twitter our Hadoop-centric data analytics pipeline has been rapidly growing in terms of both size and complexity. With thousands of evolving data sources and analytics programs, orchestrating the analytics production becomes extremely difficult without a systematic solution. We will describe our production challenges and illustrate how the service we built helps us address them.
Jun Fang (Facebook)
Slides:   1-ZIP 
Morse is a new system developed at Facebook to transform its ETL pipeline from daily batch to realtime. It continuously moves, transforms and loads data from distributed logs and sharded MySQL databases into the Hive data warehouse. HBase is used as the underlying storage for incrementally updated tables, while the data is exposed as external tables in Hive for read processing.
Dave Stokes (MySQL Community Team)
Slides:   1-PDF 
MySQL 5.6 includes a NoSQL interface, using an integrated memcached daemon that can automatically store data and retrieve it from InnoDB tables, turning the MySQL server into a fast “key-value store” for single-row insert, update, or delete operations. This session explores using this interface and other 'simple' options for those with MySQL database instances seeking to explore big data access.
Fangjin Yang (Imply), Nelson Ray (Metamarkets)
Slides:   1-ZIP 
Many exact queries require computation and storage that scale linearly or superlinearly in the data. However, many classes of problems exist for which exact query results are not necessary. We describe the roles of various approximation algorithms that allow Druid, a distributed datastore, to increase query speeds and minimize data volume while maintaining rigorous error bounds on the results.
Julien Le Dem (WeWork), Nong Li (Cloudera)
Slides:   1-PDF 
Parquet is a columnar file format for Hadoop that brings performance and storage benefits. It supports deeply nested data structures and is easy to extend and integrate with existing type systems.
Moderated by:
Roger Magoulas (O'Reilly Media)
David Winters (Teradata), Chris Selland (HP Vertica), Milan Vaclavik (CenturyLink Technology Solutions)
Slides:   1-PPTX 
In this panel session, Teradata's David Winters, Vertica's Chris Selland, and Savvis' Milan Vaclavik, join O'Reilly Media's Roger Magoulas to look at the platforms that make big data an enterprise reality.
Greg Rahn (Cloudera)
Slides:   external link
Impala brings SQL to Hadoop, but it also brings SQL performance tuning to those using the platform. This technical session will cover several topics in Impala performance analysis to aid in answering the question “why is my query slow?” as well as practical tips and techniques to get the best performance from Impala.
Slides:   1-PDF 
There is increasing demand to discover and explore data iteratively, interactively, and for real-time insights, which we lump together under the term Real-Time Analytical Processing (RTAP). This talk presents our efforts and experience on building the real-time analytical processing framework for several large websites, leveraging Spark and Shark research from UC Berkeley.
Jonathan Natkins (WibiData), Juliet Hougland (Cloudera)
Slides:   1-PPTX 
Consumer expectations have dramatically increased and retailers must present relevant content to maintain a competitive advantage. This presentation will demo an e-commerce application with real-time, personalized recommendations and discuss combining open-source system architecture, based on HBase and Kiji, with good predictive model design to build a scalable, real-time recommendation system.
Chris Lintz (Comcast), Gabriel Commeau (Comcast)
Slides:   1-PPTX 
Real-time analytics produced by IP video players help ensure that Comcast delivers the highest quality experience to customers. While ingesting as many messages as Tweets produced every day, these real-time insights are achieved through an in-house architecture leveraging Flume NG and Storm.
Sharmila Mulligan (ClearStory Data)
Is your big data analysis constrained by slow cycles, specialist-only access, and a process of one-shot, big data analysis? Traditional approaches are painful, costly and tedious. See a whole new way to speed the cycle, converge and analyze diverse data, and interact on insights.
Siddharth Seth (Hortonworks Inc), Hitesh Shah (Adobe)
Slides:   1-PPTX    external link
Apache Hadoop has become popular from its specialization in the execution of MapReduce programs. However, it has been hard to leverage existing Hadoop infrastructure for various other processing paradigms such as real-time streaming, graph processing and message-passing. Learn how this barrier was removed and how new applications are being built and run on Apache Hadoop.
Paul Kent (SAS)
Slides:   1-PPTX 
Analytically focused organizations are building general purpose Hadoop Clusters and want to deploy a wide range of Analytic Software. As the level of data sharing goes up and the variety of tools used to access data increases, you’ll be faced with choices: what format to store your data in; what catalog to describe the data and its layouts; and how/when/where to decide between tools.
Ahmed Radwan (Google's Motorola Mobility)
Slides:   1-ZIP 
Multi-tenancy is a reality for large-scale data systems, but it poses concerns about exposure of sensitive data. Using anonymization techniques, sensitive data can be protected in ways that maintain user privacy while preserving the ability to use the data effectively for operational needs. In this talk, we explore the challenges and lessons learned in building solutions for data anonymization.
Aaron Myers (Cloudera, Inc.), Shreepadma Venugopalan (Cloudera)
Slides:   1-PPTX 
When Hadoop is used for sensitive data, security requirements arise that call for strong authentication, authorization of data/resources, and data confidentiality. This session covers how various parts of the Hadoop ecosystem can interact in a secure way to address these requirements. We will focus on the advanced Apache Hive authorization features enabled by the Apache Sentry (incubating) project.
Jack Norris (MapR Technologies)
According to Gartner, Hadoop is near the top of the Hype Cycle. While some customers have questions about the enterprise capabilities of Hadoop, the answers are clear as production deployments continue to expand. This session will use successful customer experiences to highlight the power of Hadoop and separate the myths from reality.
M. C. Srivas (Bridgewater Associates)
Slides:   1-PDF 
This session steps through how to double performance for MapReduce jobs, achieve high-speed data ingestion, and execute HBase apps 10X faster with consistent low latency.
Paul Groom (Kognitio)
Slides:   1-PPTX 
Is Hadoop ready for high-concurrency complex BI? Even with Hadoop 2.0 on the way? Advanced analytics requires rip-roaring performance and fast, low-latency execution. Disk is not the solution; in-memory is where the hot BI data needs to reside. This informative session will offer expert advice, opinions from the "bleeding edge," and some hidden secrets from 25 years of work with big data.
Sean Murphy (PingThings), Benjamin Bengfort (PingThings, Inc)
Slides:   external link
Much of the world’s data (and your own) is text. The key to unlocking its value is in a series of Natural Language Processing transformations that turn raw strings into a machine usable form. We will use Hadoop alongside Python’s NLTK to do these steps and discuss why each is necessary in your application.
Jim Englert (Gilt)
Slides:   1-PDF 
In July 2013 a team from Basho joined up with a team of Gilt engineers at Gilt's Dublin office to spend a few days testing how Riak would handle Gilt's production traffic on the company's main user store. In this talk Jim will discuss this process, the results of this stress test, and how Gilt--one of the top eCommerce companies in the U.S...
Slides:   1-ZIP    external link
Voice of the customer (VOC) data is a rapidly growing, unstructured, untapped data source – for your web site and across social media sites. Topic discovery through clustering of user verbatims, integrated with decision support data, can unleash valuable, actionable insights from millions of customers.
John Choi (IBM)
What is Big Data? What will it mean for my organization? What technologies do I need? In this session, we will provide a view of what Big Data really means for organizations and how people, processes, and technologies, when brought together, can catalyze a transformational journey.
Michael Chui (McKinsey Global Institute)
Michael Chui, Senior Fellow, McKinsey Global Institute
Colin Marc (Stripe)
Slides:   1-ZIP 
Most startups don't start to think about having a real analytics platform until it's too late, and Stripe is certainly no exception. In this session, I'll describe how we approached building such a platform, and walk through the steps (and missteps) we took in making our production data available in Hadoop - in real time - for processing and querying.
Jim Stogdill (O'Reilly Media, Inc.)
During the first 50 years of the Information Age, information technologists wired up your company's lizard brain. Now networks, sensors, data, and algorithms are the base on which corporations will evolve into much more intelligent entities, and the goal of technologists is shifting from automation to intelligentization.
Doug Cutting (Cloudera)
Doug will talk broadly about the future capability of Hadoop in the context of the road traveled so far. What are the limits of Hadoop? How should you think about workloads like SQL and Search? What's next?
Moderated by:
Jim Stogdill (O'Reilly Media, Inc.)
Mona Vernon (Thomson Reuters Labs), Trevor Hughes (International Association of Privacy Professionals), Randy Smerik (Osunatech, Inc.), Lisa Green (Domino Data Lab)
The Strata Great Debates return to New York with a discussion of the merits and drawbacks of what are rapidly becoming our prosthetic brains. In a vigorous Oxford-style debate, two teams try to convince the audience that they're right. We take score before and after their arguments, and declare a winner. Join us and help us decide whether a connected world is indeed a better one.
Shawndra Hill (University of Pennsylvania)
In this keynote I will discuss how TV networks and advertisers can derive value from all of the online social activity about TV.
Roger Magoulas (O'Reilly Media)
Roger Magoulas, incoming Strata chair and Director of Research at O'Reilly, will share insights into the state of data science as a profession and preview Strata in 2014.
Brian Dalessandro (Capital One)
Slides:   1-PPTX 
A common data science problem is that we have access to a lot of data but not enough of the right data. In many applications the right data is either impossible to collect or prohibitively expensive to obtain. This talk will cover the basic strategies of Transfer Learning and will show how they can be leveraged to get the most out of the data you have rather than the data you want.
Philip Zeyliger (Cloudera)
Slides:   1-PDF 
All is quiet on the log file front, yet the system is down. What next? This talk will cover the tricks of the trade for debugging distributed systems. Motivated by experience gained diagnosing Hadoop, we’ll dig into the JVM, Linux esoterica, and outlier visualization.
Lyndon Estes (Princeton University)
Slides:   1-FILE 
Knowing where farming occurs and where it will expand is crucial for understanding food security and our changing environment. However, the satellite-based maps we currently rely on are often inaccurate, particularly in Africa. Our project is harnessing open source software, big data, and crowdsourcing to create better crop field maps for Africa.
Will Marshall (Planet Labs)
Planet Labs is launching the largest-ever fleet of Earth-imaging satellites in December. These will enable high-resolution imagery of the entire planet to be taken on a more frequent basis. The data is of large potential value: humanitarian applications range from monitoring deforestation and the ice caps to disaster relief and improving agricultural yields in developing nations.
Jayant Shekhar (Sparkflows Inc.)
Slides:   1-PDF 
Hadoop has evolved significantly in recent years, today serving as a unified platform for near-real-time (NRT) and batch workflows, such as querying, analysis and alerting for logs and machine data. In this session, we'll dive into the details of using SolrCloud and Cloudera Impala together to serve search queries, by integrating Flume to stream events into Solr, Impala and HBase.
Antonio (Per data LLC), Joseph Rickert (Revolution Analytics)
Slides:   1-HTM    2-PDF 
This tutorial is aimed at R users who want to use Hadoop to work on big data and Hadoop users who want to do sophisticated analytics. We will introduce R, Hadoop, and the RHadoop project. We will then cover three R packages for Hadoop and the MapReduce model. We will present numerous examples of increasing complexity, including the combination of rmr and RevoScaleR to solve modeling problems.
Richard Brath (Uncharted Software), David Jonker (Uncharted Software Inc.)
Slides:   1-PDF 
Visualizations of big graphs often look like spaghetti and can be difficult to use. Working backwards from the analytic questions, we will show some very different 2D and 3D visualizations for social networks. We'll also cover some of the challenges and discuss some open source tools.
Chris Perry (International Peace Institute), Marie O'Reilly (International Peace Institute)
Slides:   1-PDF 
If big data can be used to predict changes in consumer behavior, can it be used to predict whether rival factions will go to war?
Claudia Perlich (Dstillery)
Coverage of online advertising fraud finally hit the newsstand a few months ago. But the story really started much earlier. Somewhat surprisingly, it was predictive modeling on large data streams from real-time bidding environments that first picked up symptoms of the largest online advertising scam yet. We tell the tale where models “too good to be true” lead to quite a sinister discovery.
Jonathan Hsieh (Cloudera, Inc)
Slides:   1-PDF 
Apache HBase is a robust random-access distributed datastore built upon Apache Hadoop’s HDFS and Apache ZooKeeper. This talk will describe themes emerging from recent features slated for the upcoming post-0.96 release. These include improvements for multi-tenant deployments; a focus on predictable latencies; and the proliferation of new extensions for features traditionally found in databases.
Erich Hochmuth (Monsanto), Amandeep Khurana (Cloudera)
Slides:   1-PDF 
Monsanto is building new technology-driven products for its customers that will leverage big data. This talk describes how Monsanto is building these scalable applications with geospatial data, using Hadoop and HBase as the backend systems.


View a complete list of Strata + Hadoop World 2013 contacts