Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Speaker Slides & Video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Joy Johnson (AudioCommon)
Joy Johnson, VP, Mobile, AudioCommon
Albert Bifet (Télécom ParisTech), Silviu Maniu (Huawei)
Slides:   1-PDF 
Real-time analytics are becoming increasingly important due to the large amount of data that is being created continuously. Drawing from our experiences in Huawei Noah's Ark Lab, we present StreamDM, a new open source data mining and machine learning library designed on top of Spark Streaming. We will show its advanced methods, and how easily it can be used and extended.
John Akred (Silicon Valley Data Science), Julie Steele (Manifold), Scott Kurth (Silicon Valley Data Science)
Slides:   external link
Join the team behind the tutorial “Developing a modern enterprise data strategy,” as they field a wide range of detailed questions. Even if you don’t have a specific question, join in to hear what others are asking.
Mark Madsen (Teradata)
Slides:   1-PDF 
The story of the correlation between beer and diaper sales is commonly used to explain product affinities in introductory data mining courses. Rarely does anyone ask about the origin of this story. Is it true? Why is it true? What does true mean anyway?
Martin Fowler (ThoughtWorks), Rachel Laycock (ThoughtWorks)
Slides:   1-FILE 
ThoughtWorks has developed into a 3000-person worldwide company, yet still has much of the pragmatic, informal culture from a decade ago when it was a tenth of the size. We’ll explore how we’ve managed to do this, and how we try to influence our clients - who are usually much more corporate and hierarchical.
Matt Winkler (C+E) (Microsoft)
Slides:   1-PDF 
At Microsoft, we process exabytes of data to run our own businesses. Learn how you can process big data in the cloud at massive scale with no hardware to deploy, no software to tune or configure, and no infrastructure to manage. We’ll also talk about overcoming common obstacles in big data adoption such as a steep learning curve, cost of implementation, tuning infrastructure, and providing security.
Joe Hellerstein (UC Berkeley)
Slides:   external link
As the Hadoop ecosystem grows more complex, there is widespread desire for open metadata solutions: common ground for collaboration across users, and interoperability across software solutions. We motivate a new class of open metadata services for big data, via science and enterprise use cases. We also set out challenges for a new class of "meta-on-use" approaches fit for agile analytics.
Kurt Brown (Netflix)
Slides:   external link
The Netflix Data Platform is a constantly evolving, large scale infrastructure running in the (AWS) cloud. We are especially focused on performance and ease of use, with initiatives including Presto integration, Spark, and our big data portal and API. This talk will dive into the various technologies we use, the motivations behind our approach, and the business benefits we get.
Ron Bodkin (Google)
Slides:   1-PDF 
While schema on read is powerful, it’s just a first step on the journey to understanding effective ways of working with data in new big data systems. In this talk we highlight new patterns of working with data.
Tom White (Cloudera), Ryan Blue (Cloudera)
Slides:   1-PPTX 
In the second (afternoon) half of the Architecture Day tutorial, attendees will build a data application from the ground up. As a part of the tutorial, we will demonstrate how Kite codifies the best practices from the Hadoop Architecture Day morning session.
Prat Moghe (Cazena)
Slides:   1-PDF 
Hadoop’s ability to handle large amounts of varied data has been a driving force behind the explosion of big data. Many organizations’ ambitions to become more data-driven, however, are held back by a shortage of resources as well as the time and expense needed to purchase and set up hardware and software infrastructure. The cloud offers a natural alternative to overcome these barriers.
Jeff Jonas (IBM)
Slides:   1-PPTX 
Jeff Jonas, IBM Fellow; Chief Scientist, Context Computing
Mary Yoko Brannen (CLIA Consulting)
Slides:   1-PPTX 
In this talk, Mary Yoko Brannen will propose “ethnographic thinking” as a new way of diagnosing culture that can open up new avenues for innovation and ongoing strategic renewal.
kim rees (Periscopic)
Slides:   1-PDF 
If your employees touch your data, then from the moment they arrive on their first day and throughout their careers, they should be immersed in data-design thinking. From IT to analysts to programmers to designers, each is responsible for bringing your data to life.
Martin Kleppmann (University of Cambridge)
Slides:   1-PDF 
Even the best data scientist can't do anything if they cannot easily get access to the necessary data. Simply making the data available is Step 1 toward becoming a data-driven organization. In this talk, we'll explore how Apache Kafka can replace slow, fragile ETL processes with real-time data pipelines, and discuss best practices for data formats and integration with existing systems.
Marcel Kornacker (Cloudera), Josh Wills (Cloudera), Alexander Behm (Cloudera)
Slides:   1-PPTX 
In this talk, we will explain how data scientists use nested data structures to increase analytic productivity. We will use two well-known relational schemas - TPC-H and Twitter - to demonstrate how to simplify data science workloads with nested schemas. Also, we will outline best practices for converting flat relational schemas into nested ones, and give examples of data science-style analysis.
Doug Wolfe (CIA)
In his ten-minute keynote, CIA Chief Information Officer Douglas Wolfe discusses how data science is a true team sport, and how the rapid evolution of this field continually improves the impact of the CIA mission.
Juan Huerta (Dow Jones)
Slides:   1-PDF 
In this presentation I will describe the way in which Data Science is helping the Wall Street Journal produce better journalism strategies, personalize our subscribers’ experience, and optimize revenue and overall customer engagement.
David Boyle (Audience Strategies)
Slides:   1-PDF 
Are creative businesses the last battleground for data-driven decision making? Drawing lessons from successes and failures in the music industry, book publishing, and TV, David Boyle will argue for a negotiated settlement in the war between data and creative, and show how long-term and mutually beneficial peace can work.
Claudia Perlich (Dstillery)
Slides:   1-PDF 
This talk takes a provocative stand: many metrics we cherish lose their value because the granularity of modern data collection enables us to identify and optimize toward hidden signals that used to be noise, and now come to the forefront. One such metric is the click-through rate in advertising, but the mechanism is ubiquitous and we should pay close attention to the mechanism at work.
John O'Duinn (Release Mechanix)
Slides:   1-PDF 
As our society evolves from "job for life" to project economy/gig economy/free-agent-nation, people are less willing to relocate for every new job. Building and cultivating a geo-distributed team requires a very crisp organization, which also helps people who are all "in the office."
Farrah Bostic (The Difference Engine)
Farrah Bostic, Founder, The Difference Engine
Holden Karau (Independent)
Slides:   external link
This session explores best practices of creating both unit and integration tests for Spark programs as well as acceptance tests for the data produced by our Spark jobs. We will explore the difficulties with testing streaming programs, options for setting up integration testing with Spark, and also examine best practices for acceptance tests.
Michael Segel (Segel & Associates.)
Slides:   1-BIN 
Today's Hadoop cluster now has multiple single points of failure. This talk focuses on identifying these failings and how to mitigate them.
Jay Margalus (MapR), Mike Emerick (MapR)
Slides:   external link
Who will watch the watchmen? This session will cover data integrity problems in open government introduced by the human element. We’ll then explore possible methodologies that will allow us to derive value from open government data, while still keeping a skeptical eye on the validity of the data itself.
Robert Grossman (University of Chicago)
Slides:   1-PDF 
Large datasets have large numbers of anomalies, and the challenge is not just identifying anomalies but rank ordering them to create alerts, so that data scientists can examine the most interesting ones. We discuss three case studies that integrate machine learning and data engineering, and extract six techniques for identifying anomalies and rank ordering them by their potential significance.
Ravi Prakash (Altiscale)
Slides:   1-PDF 
The HDFS File Browser now has improved accessibility and is easier to use! Hadoop 2.4.0 introduced a new UI for file browsing with WebHDFS. This feature set has been expanded to include write operations and file uploads. Authentication issues have been addressed and the file browser is now configured with HttpFS. We'll present a demonstration and overview of possible configuration requirements.
Jaipaul Agonus (FINRA)
Slides:   1-PDF    external link
This presentation is a real-world case study about moving a large portfolio of batch analytical programs that process 30 billion or more transactions every day, from a proprietary MPP database appliance architecture to the Hadoop ecosystem in the cloud, leveraging Hive, Amazon EMR, and S3.
Joe Caserta (Caserta Concepts), Elliott Cordo (Caserta Concepts, LLC)
Slides:   1-PPTX 
A global record company and a force in the music business partnered with award-winning data innovation consulting firm Caserta Concepts to re-architect its core data platform, with a data framework based on AWS, EMR, Redshift, and other big data technologies. This session presents the architecture, technologies, and techniques used to achieve an agile data ingestion and analytics platform.
Slides:   1-PPTX 
Reaching 100,000,000 antivirus users was a big challenge for Avira, but we managed to achieve the goal. The challenge that arises now is to convince our users to stay with us, by offering the best possible experience to each one of them. In this presentation we will share the entire flow of the user churn prevention, from building custom surveys to using machine learning algorithms.
Yael Garten (LinkedIn)
Slides:   1-PDF 
You’ve decided you need data scientists. You know who to hire. Now, what do you do with them? We’ll discuss examples of how companies like LinkedIn make business decisions from data. We’ll review the spectrum of data science, data quality and platforms, and how data scientists drive the art, science, and politics of defining KPIs to transform into a data-driven org.
Karen Rubin (Quantopian)
Slides:   external link,   2-PDF 
Karen Rubin has spent the last nine months exploring “What would happen if you invested in women CEOs?” In doing so, she has developed an investment algorithm that invests in the women-led companies of the Fortune 1000. Based on a simulation run from 2002-2014, this algorithm would have outperformed the S&P 500 by more than 200%. In this talk she will share her algorithm and results.
Ron Kasabian (Intel), Michael Draugelis (Penn Medicine)
Even in this era of intense medical breakthroughs, many illnesses still evade accurate and timely diagnosis. Clinicians must often rely on static diagnostic guidelines that result in late care and too many false alarms. Half of all heart failure patients can go undiagnosed.
Maria Konnikova (The New Yorker | Mastermind)
What do you do when you find a momentary break in your otherwise endless barrage of tasks? In this talk, Maria argues for the vital importance of recapturing the seeming nothingness of boredom, of harnessing the pauses of life for their creative potential. It is in boredom that the truly deep questions and discoveries lie.
Allen Downey (Olin College of Engineering)
Slides:   external link
Bayesian methods are well-suited for business applications because they provide concrete guidance for decision-making under uncertainty.  But many data science teams lack the background to take advantage of these methods.  In this presentation I will explain the advantages and suggest ways for teams to develop skills and add Bayesian methods to their toolkit.
Haden Land (Lockheed Martin IS&GS), Jason Loveland (Lockheed Martin)
Slides:   1-PPTX 
Lockheed Martin builds unmanned and manned human space systems, which require systems that are tested for all possible conditions – even for unforeseen situations. We present a test system that is a learning system built on big data technologies and that supports the testing of the Orion Multi-Purpose Crew Vehicle being designed for long-duration, human-rated deep space exploration.
Jenelle Bray (LinkedIn)
Slides:   1-PPTX 
LinkedIn’s Security Data Science group uses various reputation systems as input to models designed to stop fraud and abuse. This session will discuss how we build these reputation systems and compare instantaneous online reputation scores to more complex offline systems.
Astrid Atkinson (Google)
Slides:   1-PPTX 
In a rapidly evolving industry, the only thing that's permanent is the necessity for change. Requirements change, business priorities change, whole industries change - it's a manager's job to help teams navigate this uncertainty.
Jesse Anderson (Big Data Institute), Ewen Cheslack-Postava (Confluent)
Slides:   1-PDF 
This is a hands-on workshop where you’ll learn how to leverage the capabilities of Kafka to collect, manage, and process stream data for big data projects and general purpose enterprise data integration needs alike. When your data is captured in real-time and available as real-time subscriptions, you can start to compute new datasets in real-time off these original feeds.
Henry Robinson (Cloudera), Zuo Wang (Wanda), Arthur Peng (Intel)
Slides:   1-PDF 
Columnar data formats such as Apache Parquet promise much in terms of performance, but need help from modern CPUs to fully realize all the benefits. In this talk we'll show how the combination of the newest SIMD instruction sets, and an open-source columnar file format, can provide an enormous performance advantage. Our example system will be Impala, Parquet, and Intel's AVX2 instruction set.
Zhe Zhang (LinkedIn), Weihua Jiang (Intel)
Slides:   1-PDF 
In this session, attendees will learn how erasure coding (HDFS-7285) can greatly reduce the storage overhead of HDFS without sacrificing data reliability.
Daniel Weeks (Netflix)
Slides:   1-PDF 
The Big Data Platform team at Netflix continues to push big data processing in the cloud with the addition of Spark to our platform. Recent enhancements to Spark allow us to effectively leverage it for processing against a 10+ petabyte warehouse backed by S3. We will share our experiences and performance of production jobs along with the pains and gains of deploying Spark at scale on YARN.
DJ Patil (White House Office of Science and Technology Policy)
DJ Patil, U.S. Chief Data Scientist at White House Office of Science and Technology Policy
Paul Kent (SAS)
Slides:   external link
Imagine the possibilities of having all of your data in one place – at a reasonable cost – with the computing potential to learn from relationships between data in all domains. Advanced analytics and Hadoop are changing the way organizations approach big data. Hear tips from the future and learn about key patterns emerging from a wide cross section of Hadoop journeys. Perhaps they’ll inspire yours.
Paul Kent (SAS)
Slides:   external link
Imagine the possibilities of having all of your data in one place – at a reasonable cost – with the computing potential to learn from relationships between data in all domains. Advanced analytics and Hadoop are changing the way organizations approach big data. Hear tips from the future and learn about key patterns emerging from a wide cross section of Hadoop journeys.
AnnMarie Thomas (School of Engineering and Schulze School of Entrepreneurship, University of St. Thomas)
Unusual collaborations can often lead to new ways of taking and analyzing data. This talk looks at lessons learned from working with chefs, circus performers, and preschoolers.
Joy Thomas (Apigee), Jagdish Chand (Apigee)
Slides:   1-PPTX 
Customer journey analytics systems of large corporations must handle a great volume of events on a daily basis. A priori aggregation used by early systems often caused signal loss due to ever-changing customer activity rates. We will present a new method that identifies paths inherent in raw cross-channel data, and that captures traffic patterns via nodes of interest across all channels of data.
Sam Heywood (Cloudera), Nick Curcuru (Mastercard), Ritu Kama (Intel)
Slides:   1-PPTX 
Hadoop is widely used thanks to its ability to handle volume, velocity, and variety of data. However, this flexibility and scale present challenges for securing and governing this data. To keep your company from making the front pages over a data breach, experts from MasterCard, Intel, and Cloudera share the Hadoop Security Maturity Model phases 0-4 and steps to get your cluster ready for a PCI audit.
Daniel Goroff (Alfred P. Sloan Foundation)
It is easy to make "false discoveries" when analyzing big data. It is harder to draw causal conclusions that are reliable and reproducible, especially when private or proprietary information is involved. Recent mathematical ideas, like differential privacy, offer new ways of reaching robust conclusions while provably protecting personal information.
Jake Porway (DataKind), Bob Filbin (Crisis Text Line), danah boyd (Microsoft Research | Data & Society)
Slides:   1-PDF 
No matter how good the intentions, ethical questions are inherent in the work of using data for social good. How are organizations navigating ethical pitfalls in order to make an impact? The key is protecting the humanity behind the numbers. In this series of talks, we'll hear from four speakers on how they are dealing with ethical considerations inherent in projects that aim to use data for good.
Travis Oliphant (Anaconda), Peter Wang (Anaconda), Kyle Kelley (Netflix), Andrew Odewahn (O'Reilly Media), Paige Bailey (Microsoft), Jeff Reback (Continuum Analytics), Andy Terrel (NumFOCUS), Bryan Van de Ven (Continuum Analytics), Sarah Bird (Aptivate), James Powell (NumFOCUS), Phil Cloud (Continuum), Jason Grout (Bloomberg LP), Chris Colbert (Anaconda Powered by Continuum Analytics), Owen Zhang (DataRobot), Peter Prettenhofer (DataRobot), Damon McDougall (UT Austin), Michael Droettboom (Space Telescope Science Institute), Jim Crist (Continuum Analytics), Benjamin Zaitlen (Anaconda), Andreas Mueller (NYU, scikit-learn)
Slides:   1-PDF 
Python has become an increasingly important part of the data engineering and analytics tool landscape. PyData at Strata provides in-depth coverage of the tools and techniques gaining traction with the data audience, including IPython Notebook, NumPy/matplotlib for visualization, SciPy, scikit-learn, and how to scale Python performance, including how to handle large, distributed data sets.
Lauralea Banks Edwards (Washington State University)
Slides:   1-PPTX 
This presentation identifies some of the areas in data creation and analytics where we perpetuate the simplistic representation of the world. It uses queer theory to demonstrate alternative ways of creating and analyzing data to take non-normative cases into consideration.
Garrett Grolemund (RStudio), Yihui Xie (RStudio, Inc.), Nathan Stephens (RStudio, Inc.), Randall Pruim (Calvin College)
Slides:   external link
From advanced visualization, collaboration, and reproducibility to data manipulation, R Day at Strata covers a raft of current topics that analysts and R users need to pay attention to. The R Day tutorials come from leading luminaries and R committers, the folks keeping the R ecosystem apace of the challenges facing analysts and others who work with data.
Rosaria Silipo ( AG)
Slides:   1-PDF 
In this project, we re-engineered a few barely-usable legacy solutions from the past, and made them viable again by exploiting the speed and performance of Hadoop platform-based execution.
Anant Chintamaneni (BlueData)
Slides:   1-PPTX 
Hadoop multi-tenancy is becoming a must-have – in order to accommodate multiple lines of business, multiple concurrent Hadoop jobs, multiple versions of Hadoop, multiple applications, security isolation, and more. This session will discuss these requirements and share recommendations on how to deploy a secure multi-tenant Hadoop environment with simplicity, agility, and low management overhead.
Lenni Kuff (Facebook), Nong Li (Cloudera), Stephen Romanoff (Capital One)
Slides:   1-PPTX 
Hadoop is supremely flexible, but with that flexibility comes integration challenges. In this talk, we introduce a new service that eliminates the need for components to support individual file formats, handle security, perform auditing, and implement sophisticated IO scheduling and other common processing that is at the bottom of any computation.
Laurent Weichberger (OmPoint Innovations, LLC)
Slides:   1-PDF    2-PDF    3-PDF    4-PDF 
This three-day curriculum features advanced lectures and hands-on technical exercises for Spark usage in data exploration, analysis, and building big data applications.
Laurent Weichberger (OmPoint Innovations, LLC)
Slides:   1-PDF    2-PDF    3-PDF    4-PDF    5-PDF 
This three-day curriculum features advanced lectures and hands-on technical exercises for Spark usage in data exploration, analysis, and building big data applications.
Laurent Weichberger (OmPoint Innovations, LLC)
Slides:   1-PDF    2-PDF    3-PDF    4-PDF    5-PDF 
This three-day curriculum features advanced lectures and hands-on technical exercises for Spark usage in data exploration, analysis, and building big data applications.
Dean Wampler (Anyscale)
Slides:   external link
Apache Spark is often seen as a replacement for MapReduce in Hadoop systems, but Spark clusters can also be deployed and managed by Mesos. This talk explains how to use Mesos for Spark applications. We'll examine the pros and cons of using Mesos vs. Hadoop YARN as a data platform, and discuss practical issues when running Spark on Mesos. We'll even discuss how to combine the two with Myriad.
Jim Scott (NVIDIA)
Slides:   1-PDF 
With the move to real-time data analytics and machine learning, streaming applications are becoming more relied upon than ever before. Discover how to build and deploy a globally scalable streaming system. This includes producing messages in one data center and consuming them in another data center, as well as how to make the guarantees that nothing is ever lost.
Bruce Reading (VoltDB)
Slides:   1-PDF 
You have 10 milliseconds. Less than the blink of an eye, the beat of a heart – that’s how much time you have to ingest fast streams of data, perform analytics on the streams, and take action. Ten milliseconds to win a customer, 10 milliseconds to make a sale, 10 milliseconds to save a life – it’s not much time.
Edd Wilder-James (Google)
Slides:   1-PDF 
Spark is white-hot, but why does it matter? Some technologies cause more excitement than others, and at first the only people who understand why are the developers who use them. This talk provides a tour through the hottest emerging data technologies of 2015 and explains why they’re exciting, in the context of the new capabilities and economies they bring.
Aaron Kimball (Zymergen, Inc.)
Slides:   1-PDF 
Zymergen has industrialized the process of genome engineering to build microbes that produce chemicals at scale. High-throughput microbe development is driven by integrating machine learning and open source software for complex data storage, search, and bioinformatics. See how we built this futuristic vision for synthetic biology, and learn how NoSQL can power massive scale experimentation.
Greg Rahn (Cloudera)
Slides:   external link
The flexibility and simplicity of JSON have made it one of the most common formats for data. Data engines need to be able to load, process, and query JSON and nested data types quickly and efficiently. There are multiple approaches to processing JSON data, each with trade-offs. In this session we’ll compare and contrast the approaches taken by systems such as Hive, Drill, BigQuery, and others.
Mike Olson (Cloudera)
Mike Olson, CSO and Chairman, Cloudera
Tim Howes (ClearStory Data)
This keynote unveils why rapid modernization of BI is taking place, the business use cases driving it, and what’s essential in next-generation solutions.
Jim McHugh (Cisco)
IoE, IoT, and big data – three topics you hear and read about often in our various industries. Let’s quickly look at these market and technology dynamics, and see how they are each in their own way ‘democratizing’ data access and analysis, resulting in new businesses, technologies, and improved community solutions throughout the world.
Richard Brath (Uncharted Software), Rob Harper (Uncharted)
Slides:   1-PDF 
Direct visual exploratory analysis of big data yields insights that are otherwise overlooked. By plotting all the data, patterns that can be obscured by traditional visualization methods are preserved. This presentation highlights the power of visualizing whole data sets through examining a market order book and identifying pricing strategies.
Joseph Sirosh (Compass)
Join Microsoft’s Joseph Sirosh for a behind-the-scenes sneak peek into the creation of the viral phenomenon. He'll cover how it got to 50 million users in 7 days, the unexpected big data challenges that came with it, and the surprising learnings they had about people and systems.
Thomas Phelan (HPE BlueData)
Slides:   1-PPT 
This session will delve into the multiple meanings of "virtualized HDFS." It will lead an investigation into the abstraction of the HDFS protocol in order to permit any storage device to deliver data to a Hadoop application in a performance-critical environment. It will include a discussion and assessment of the work in this area done by projects such as Tachyon and MemHDFS.
Jake Porway (DataKind)
Jake Porway, founder and executive director of DataKind, unveils five keys for successful data science for good projects, based on the organization's three years of work rallying thousands of volunteers worldwide to give back.
Charles Givre (Deutsche Bank)
Slides:   1-PDF 
Many people are acquiring smart devices, and yet do not have an understanding of the data these devices gather about them and what can be done with this data if it is aggregated over time. The talk will demonstrate what data several popular devices—including the Nest Thermostat and a few others—gather and show what can be learned about an individual from this data.
Michael Freeman (University of Washington)
Slides:   1-PDF 
Data-driven decision-making can only be properly executed when the decision makers understand both the underlying data, and the types of manipulations that have been applied to it. In this session, we’ll explore what exactly we "do" to data (aggregation, "cleaning," statistical modeling, machine learning), and how to visually communicate about the processes and implications of our work.
Eric McNulty (Richer Earth)
Slides:   1-PPTX 
There are three big shifts in leadership development: from linear thinking to complex systems thinking where relationships are paramount; from focus as a noun to a verb -- continually recalibrating to ensure clarity of purpose, values, and performance; and from "they" to "you" -- the onus on leadership development now falls on each individual so take charge of your leadership future.