Strata 2012 Speaker Slides & Video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later. Please note some speakers choose not to share their presentations.

Data Science

Ana Martinez (CityGrid Media), Kin Lane (API Evangelist)
Learn how CityGrid built a world-class platform to aggregate the data powering its publicly available local places, content and ads APIs using Hadoop, Solr and MongoDB.
How do you architect big data systems that leverage virtualization and platform as a service? We will walk through a layered approach to building a unified analytics platform using virtualization, provisioning tools and platform as a service.
Tony Middleton (HPCC Systems from LexisNexis Risk Solutions)
How to simplify the data integration process and save a significant amount of development time by automatically generating code for processes (data profiling, data cleansing, and record linkage). A case study will show a complex, Big Data linking application, where insurance data was converted to HPCC using the SALT tool and reduced 20,000+ lines of source code to a 48-line SALT specification.
Q McCallum (@qethanm)
The biggest problem in data science is ... the data itself.
Sarah Sproehnle (Cloudera, Inc.)
Learn how to use a Hadoop cluster for data analysis using Java MapReduce, Apache Hive and Apache Pig, and get an overview of using the HBase Hadoop database. Some programming experience is strongly recommended for this session.
Dean Wampler (Anyscale), Jason Rutherglen (Datastax)
This hands-on tutorial teaches you how to set up and use Hive, a high-level data warehouse tool for Hadoop. Hive provides a SQL-like query language, HiveQL, that is easy to learn for people with prior SQL experience, making Hive attractive for data warehousing teams. Hive leverages the power of Hadoop for working with massive data sets without requiring expertise in MapReduce programming.
Sarah Sproehnle (Cloudera, Inc.)
This tutorial provides a solid foundation for those seeking to understand large scale data processing with MapReduce and Hadoop, plus its associated ecosystem. This session is intended for those who are new to Hadoop and are seeking to understand where Hadoop is appropriate and how it fits with existing systems. No programming experience is required.
Joseph Rickert (Revolution Analytics)
This tutorial will enable anyone with some programming experience to begin analyzing data with the R programming language.
Theo Schlossnagle (Circonus)
In today's environments, we're often forced to collect data before we know if it will be useful. This tendency leads to seas of data, flowing in real time with very little structure or understanding of what the data means. Given that, how can you tell when data "is normal?" Let's find out.
Ken Krugler (Scale Unlimited)
Want to extract and process Big Data from the web? This tutorial will show you how to use key open source technologies such as Hadoop, Cascading, Bixo, Tika, Mahout and Solr to create scalable, reliable web mining solutions.
Alyona Medelyan (Pingar), Anna Divoli (Pingar)
In this session we discuss approaches to mining unstructured data that gradually find their way into the real world. Text mining and analytics algorithms strive to identify documents’ categories, main topics, mentioned names and other entities; they summarize and detect sentiment. We describe case studies that take advantage of such algorithms in the legal, forensics and healthcare sectors.
Simon Rogers (Guardian), Michael Brunton-Spall (Guardian News and Media)
Presentation: external link
Learn firsthand from award-winning Guardian journalists how they mix data, journalism and visualization to break and tell compelling stories, all at newsroom speeds.

Business & Industry

Piyush Lumba (Microsoft), Francis Irving (ScraperWiki Ltd.)
One of the most significant challenges faced by individuals and organizations is how to discover and collaborate with data within and across their organizations, which often stays trapped in application and organizational silos.
Christopher Berry (Syncapse)
Moneyball is to marketing science as CSI is to forensic science. The expectations are high and marketers are shouting "where's the insight?" and "ENHANCE!". Data is long and marketing scientists are short. We can only scale through technology. This is the story of how a developer and two marketing scientists became data scientists in crossing that gap.
Kirkland Barrett (Microsoft)
A high-level overview of Microsoft IT's BI strategy and its various applications, focusing on Self Service BI, Scorecards and Dashboards, Data Visualizations, and Leadership Decision making through robust BI tools.
Jacomo Corbo (QuantumBlack)
Measuring productivity remains a notoriously difficult problem. We will show how real-time collaboration data are being leveraged to measure, model and forecast organizational productivity and performance in the innovation teams at Boeing and in 3 Formula One teams. On the back of these forecasts, we will show how investment yields were improved by 15% and productivity raised by nearly 20%.
J. C. Herz (Ion Channel)
This talk uses the OODA Loop concept (Observe, Orient, Decide, Act) as a framework to categorize Big Data use cases and data-driven services and the front-ends to those services. Rather than starting with the underlying technology or the data sources, the OODA loop starts with WHY the user needs information. It answers the question of when a black box beats an analytic tool, and vice versa.

Visualization & Interface

Noah Iliinsky (Amazon Web Services)
This workshop is a jumpstart lesson on how to get from a blank page and a pile of data to a useful data visualization. We'll focus on the design process, not specific tools. Bring your sample data and paper or a laptop; leave with new visualization ideas.
Max Gadney (After The Flood)
The use of video to communicate data is on the rise, but what is the most effective way to do this? Highlighting our current work with the BBC in this field, we will look at best practice, from storytelling principles to choosing the right visual treatment.
Jason Sundram (Facebook)
Presentation: Visualizing Geo Data Presentation [PDF]
With the explosion of mobile devices, there is a plethora of geo-tagged data available for mining and visualization. To make compelling visualizations, it is often necessary to build tools that allow users to easily explore, mine, map, and market this data. This talk will focus on how to use several open-source frameworks to build such visualizations.


Richard Merkin (Heritage Provider Network)
Dr. Richard Merkin, President and CEO of Heritage Provider Network, which was recently named one of Fast Company’s 10 most innovative healthcare companies for 2012, will announce the winner of the second progress prize in the $3 million Heritage Health Prize competition.
Luke Lonergan (Greenplum, a division of EMC)
How are businesses using big data to connect with their customers, deliver new products or services faster and create a competitive advantage? Learn about the changing nature of customer intimacy and how the technologies and techniques around big data analysis provide business advantage in today's social, mobile environment – and why it is imperative to adopt a big data analytics strategy.
Gary Lang (MarkLogic)
Big Data is about extracting value from fast, huge, varied, complex data sets. But simply crunching data is only the first step. As adoption of MapReduce and data analytic technologies increases, forward thinking companies are starting to build applications on their core data assets.
Abhishek Mehta (Tresata)
How big data tools and technologies give us back our individual identity ... because if you didn't know you were unique and special, well, you are. Big data can be applied to solving socio-economic problems that rival the scale and importance of building ad optimization models.
Jonathan Gosier (AuDigent)
Big data isn't just an abstract problem for corporations, financial firms, and tech companies. To your mother, a 'big data' problem might simply be too much email, or a lost file on her computer. We need to democratize access to the tools used for understanding information by taking the hard work out of drawing insight from excessive quantities of information.
Dave Campbell (Microsoft)
The explosion of data is both a challenge and opportunity for businesses. In order to thrive in this new world, organizations will need a technical strategy for sifting through all of this data and driving insights.
Pete Warden (TensorFlow)
Presentation: Embrace the Chaos Presentation [PDF]
Why unstructured data beats structured.
Mike Olson (Cloudera)
Tools for attacking big data problems originated at consumer internet companies, but the number and variety of big data problems have spread across industries and around the world. I'll present a brief summary of some of the critical social and business problems that we're attacking with the open source Apache Hadoop platform.
The increasing use of online software and digital devices in the classroom provides a source of high-frequency data streams that can be analyzed to better understand student progress, identify individual needs, and develop personal recommendations.
Flavio Villanustre (LexisNexis Risk Solutions and HPCC Systems)
Back in the late 80s, artificial intelligence was set to take over the world; it didn’t happen. In 2012, AI has been stripped down, dressed up and reborn as machine learning. Will it take over the world this time? What makes a Big Data - Machine Learning solution ‘better’?
Usman Haque
The expected massive growth of connected device, appliance and sensor markets in the coming years - often called 'The Internet of Things' - will need a richer concept of 'open data' than is currently common.
Doug Cutting (Cloudera)
Apache Hadoop forms the kernel of an operating system for Big Data. This ecosystem of interdependent projects enables institutions to affordably explore ever vaster quantities of data. The platform is young, but it is strong and vibrant, built to evolve.
Ben Goldacre (Bad Science)
Negative results from clinical trials go missing far too often, leading us to overestimate the benefits of treatments. Attempts to remedy this problem haven't worked well. Ben Goldacre, both a doctor and data geek, will talk about how to fix this, and other, problems in medicine.
Coco Krumme (Haven | UC Berkeley)
Why data can tell us only so much about food, flavor, and our preferences.
Hal Varian (Google)
Google Insights for Search provides an index of search activity for millions of queries. These queries can sometimes help understand consumer behavior. Hal describes some of the issues that arise in trying to use this data for short-term economic forecasts and provides examples.

Hadoop & Big Data: Applied

Asad Khan (Microsoft)
As more companies adopt Hadoop to perform data intensive tasks for large data sets, there is a burning need to make Hadoop available to a broader set of developers. This talk covers two approaches Microsoft is exploring for this purpose: 1. JavaScript interfaces to run Hadoop jobs and 2. web interfaces for Hadoop that let developers write and run MapReduce jobs from any platform.

Policy & Privacy

Virginia Carlson (Urban Rubrics), Jake Porway (DataKind)
The “common good” challenge for Big Data is to deliver actionable information that can be used by nonprofits and civic orgs. But that challenge isn’t new. Existing data intermediaries for NGOs have a rich history of working in common-good territory. Let’s discuss. What is this history? What can we take away from it to inform new, perhaps disruptive, approaches to meet this challenge?

Domain Data

Ian White (Urban Mapping, Inc)
Federal transparency initiatives have spawned millions of rows of data; state and local programs engage developers and wonks with APIs, contests and data galore. Private industry offers attribute-laden device exhaust, forming a geo-footprint of who is going where, when, how and (maybe) for what. Who decides data provenance? Does curated data get treated the same as heterogeneous data?
Jen Zeralli (S&P Capital IQ), Jeff Sternberg (S&P Capital IQ)
Topics will span the data flow lifecycle from data collection, curation and quality, to aggregation and standardization of a multitude of complex data sources, to the creation of valuable analytics, including recommendations that connect users to the data.
John Mulholland (Fannie Mae)
Pascal Boillat, Fannie Mae’s Chief Information Officer, will address how changing data standards and implementation strategies are having a profound effect on the financial services industry.
Chris Moody (Gnip)
With billions of social activities passing through the ever-growing realtime social web each day, companies are beginning to harness the power of social data. In this session, participants will learn from real-world case studies in Financial Services, Emergency Response, Brand Analytics and other industries about how businesses are applying social data to their operations to drive value.
Leigh Dodds (Kasabi)
Facebook's Open Graph, and a recent scramble towards a "Rosetta Stone" for geodata, are all examples of a trend towards linking data across the web. Weaving data into the web simplifies integration. Big Data offers ways to mine huge datasets for insight. Linked Data turns the web into a dataset.
Marcel Salathé (Penn State University)
Who influences whom? Data science can help answer this question, which is of fundamental importance to business, politics, public health and many other fields.


Don't miss Startup Showcase, Strata's live demo program and competition for startups and early-stage companies. With a panel of industry experts providing real-time feedback, Startup Showcase happens during Strata Conference on Wednesday, February 29, 2012.

Deep Data

Jacob Perkins (Weotta)
Learn various ways to bootstrap a custom corpus for training highly accurate natural language processing models. Real world examples will be presented with Python code samples using NLTK. Each example will show you how, starting from scratch, you can rapidly produce a highly accurate custom corpus for training the kinds of natural language processing models you need.
Claudia Perlich (Dstillery)
With the collection of almost every piece of information about your customers comes the ability to start asking your data the right question: Why do they do what they do? And even more: what would they do if I could interact with them? We show, for the case of online display advertising, how causal analysis gives interesting new answers about the right (and wrong) ways of spending your money.
Michael Rys (Microsoft Corp.)
Contrary to popular belief, SQL and NoSQL are not at odds with each other; they are duals. In fact, NoSQL should really be called coSQL. Recognizing this duality can change the way we think about which technology to use when, and what we need to invest in next.
Robert Lancaster (Orbitz Worldwide)
We examine the effectiveness of a statistical technique known as survival analysis to optimize the cache time-to-live for hotel rates in a hotel rate cache. We describe how we collect and prepare nearly a billion records per day utilizing MongoDB and Hadoop. Finally, we show how this analysis is improving the operation of our hotel rate cache.
Monica Rogati (Data Natives)
Getting training data for a recommender system is easy: if users clicked it, it’s a positive - if they didn’t, it’s a negative. … Or is it? In this talk, we use examples from production recommender systems to bring training data to the forefront: from overcoming presentation bias to the art of crowdsourcing subjective judgments to creative data exhaust exploitation and feature creation.


Diego Saenz (Accenture)
What are the fundamental skills that a CEO needs to become “Data Driven”? In this session we will discuss the 3 essential skills that will enable CEOs to effectively lead their organizations into the Data Revolution. These organizations will harness the power of data to innovate, grow profits and beat the competition.
J. C. Herz (Ion Channel)
This presentation lays out some clear, concrete gating conditions for when it makes sense to pull the trigger on big data initiatives, and how they should be procured, depending on the use case, the data assets, and the resources available.
Moderated by:
Terence Craig (PatternBuilders)
Lora Cecere (Supply Chain Insights), Pervinder Johar (CCC Information Services), Marilyn Craig (Logitech)
The effect of big data on all business models cannot be denied. This panel of SCM experts looks at how businesses are using, or should be using, big data to drive supply chain management, focusing on the broader manufacturing issues that must be addressed as well as practical tips that can be applied in dealing with supply chains that now span the globe.
Michael Hugos (Center for Systems Innovation [c4si])
In this session, business agility expert Michael Hugos will present examples from his work in applying immersive animation techniques and gaming dynamics, and discuss how they can address the challenges of consuming - and responding to - the data deluge, turning information overload into business advantage.
Felix Hamilton (e22 Alloy), Josh Gold (e22 Alloy)
There are many rapidly evolving technologies that provide objective metrics and analytics for most outward facing business interactions. The evolution of similar inward facing tools has not kept pace. In this presentation we discuss which sources of internal organizational data are frequently neglected, approaches for automating data collection, and what valuable insights can result from analysis.
Bill Schmarzo (EMC Consulting)
"Big data" provides the opportunity to combine new, rich data sources in novel ways to discover business insights. How do you use analytics to exploit this data so that it will yield real business value? Learn a proven technique that ensures you identify where and how big data analytics can be successfully deployed within your organization. Case study examples will demonstrate its use.
Marcia Tal (Tal Solutions, LLC)
In this session, Marcia Tal will demonstrate how significant business value is being realized through sophisticated understanding of intent and interconnectedness, at scale.
Alistair Croll (Solve For Interesting)
Presentation: Jumpstart Welcome Presentation [PDF]
Opening remarks by Program Chair, Alistair Croll, Founder, Bitcurrent
Mark Madsen (Teradata)
Mark Madsen talks about how regular businesses will eventually embrace a data-driven mindset, with some trademark 'Madsen' history background to put it in context. People throw around 'industrial revolution of data' and 'new oil' a lot without really thinking about what things like the scientific method, or steam power, or petrochemicals did as a result.

Hadoop & Big Data: Tech

Sean Byrnes (Flurry, Inc.)
Flurry provides an analytics and advertising platform for smartphone applications. Every month we track over 20 billion sessions across over 330 million devices. This talk will go over the Hadoop and HBase architecture we run and the challenges we face managing a massively growing data set.
R and Hadoop, the two hottest stars on the Analytics stage, were meant to be together. The open source RHadoop project was established to make it happen. We'll go over what RHadoop does for you, how to use it, and why you should add it to your toolset.
Nathan Marz (Twitter)
Storm is an open-source realtime computation system relied upon by Twitter for much of its analytics. Storm does for realtime computation what Hadoop did for batch computation. It has a huge range of applications and combines ease of use with a robust foundation.

Sponsored Session

Tim Estes (Digital Reasoning)
Data Scientists must deal with many Big Data challenges including volume, velocity and variety of data. These challenges require a new solution - Automated Understanding - a new evolution in software. In this session Tim Estes will show the power of this new capability on a large and valuable dataset that has never been deeply understood by software before.
Gary Lang (MarkLogic)
Gary Lang, Senior VP Engineering, MarkLogic, will discuss the concept of Big Data Applications and walk through three in-production implementations of Big Data Applications in action.
Carter Shanklin (VMware), Jags Ramnarayan (VMware)
Today's users won't tolerate slow applications. More often than not, the database is the bottleneck in the application. Learn how VMware vFabric SQLFire can give you the speed and scale you need in a substantially simpler way. SQLFire is a memory-optimized and horizontally-scalable distributed SQL database. Attend this session to learn how SQLFire gives high performance without the complexity.
Moderated by:
Jim Tommaney (InfiniDB)
Fernanda Foertter (Genus plc)
Advances in columnar databases are creating bio-science opportunities that were previously not possible. Fernanda Foertter and the team at Genus discovered an innovative way to store and access the huge volumes of data being generated modeling genotypes. She and Jim Tommaney discuss the benefits of column storage and how InfiniDB’s Map Reduce empowers high performance Big Data analytics.


  • EMC
  • Microsoft
  • HPCC Systems™ from LexisNexis® Risk Solutions
  • MarkLogic
  • Shared Learning Collaborative
  • Cloudera
  • Digital Reasoning Systems
  • Pentaho
  • Rackspace Hosting
  • Teradata Aster
  • VMware
  • IBM
  • NetApp
  • Oracle
  • 1010data
  • 10gen
  • Acxiom
  • Amazon Web Services
  • Calpont
  • Cisco
  • Couchbase
  • Cray
  • Datameer
  • DataSift
  • DataStax
  • Esri
  • Facebook
  • Feedzai
  • Hadapt
  • Hortonworks
  • Impetus
  • Jaspersoft
  • Karmasphere
  • Lucid Imagination
  • MapR Technologies
  • Pervasive
  • Platform Computing
  • Revolution Analytics
  • Scaleout Software
  • Skytree, Inc.
  • Splunk
  • Tableau Software
  • Talend

For information on exhibition and sponsorship opportunities at the conference, contact Susan Stewart at

For information on trade opportunities with O'Reilly conferences contact Kathy Yu at mediapartners

For media-related inquiries, contact Maureen Jennings at
