Make Data Work
Oct 15–17, 2014 • New York, NY

Speaker Slides and Video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Joseph Sirosh (Microsoft)
Software and cloud services have given rise to revolutionary new economies – creating new markets for everything from self-published books, music and videos to mobile apps. Only a few years ago, it would have been hard to imagine developers authoring a million apps for smartphones. But that’s history.
Paul Zikopoulos (IBM CANADA)
In this session you'll see an application that builds on existing, in-place technologies like Hadoop to deliver understandable results. You'll hear a story where analytics at rest was applied to unstructured data using a simple SQL-like development environment, and findings were promoted to the frontier of the business to score, in real time, monetizable intent, assess reputations, and more.
Ron Kasabian (Intel)
This talk introduces how Intel is working with scientists and physicians to help improve research, treatment, and drug development for Parkinson’s Disease using data science and enabling the Parkinson's research community to build upon an open platform for big data analytics.
Greg Rahn (Cloudera)
Slides:   external link
In the last two years we've seen the introduction of several open-source SQL engines for Hadoop. There have been numerous marketing claims around SQL-on-Hadoop performance, but what should you believe? How do these different engines compare on functionality? This talk will compare and contrast Hive, Impala, and Presto, all from a non-vendor, unsponsored, independent point of view.
Donald Farmer (Qlik)
Slides:   1-PPTX 
Data continues to drive innovation, yet it’s how we interpret and use that data that becomes imperative to success. Using a design approach called Natural Analytics, technology can leverage the way our human curiosity searches and processes information.
Pramod Varma (UIDAI)
Slides:   1-PDF 
Aadhaar, India's Unique Identity Project, is the largest biometric identity system in the world with more than 600 million people. Its strength lies in its design simplicity, sound strategy, and technology backbone issuing 1 million identity numbers and doing 600 trillion biometric matches every day! Pramod Varma, who is the Chief Architect of Aadhaar, shares his experience from this project.
Merici Vinton (OI Engine @ IDEO)
Slides:   1-PPT 
While building the Consumer Financial Protection Bureau, we set out to make the consumer financial marketplace more accountable – and to use data to prevent the next financial crisis. The CFPB is now releasing the complaints received from people about their financial products (credit cards, bank accounts, mortgages, student loans, credit ratings, etc).
The world is a rapidly changing place, where time flies and technological innovations batter us fast and furiously. Hadoop is just nine years old, and just five years ago it had nowhere near the audience, ecosystem, or impact it has now . . .
Douglas Moore (Think Big Analytics)
Slides:   external link
We debunk some popular approaches and attitudes we have encountered over the course of more than 50 real-world Big Data implementations. We will describe each anti-pattern and its appeal, but also why it fails and how to do it right.
Dan McClary (Oracle)
Slides:   1-PDF 
SQL is the natural language for querying data, but data lives in many places. We discuss the importance of SQL not only on Hadoop, but also on relational databases and NoSQL stores. Additionally, we dive deep into the architecture of Big Data SQL, which can access all of these sources in a single query.
Ari Gesher (Palantir Technologies), John Grant (Palantir Technologies), Courtney Bowman (Palantir Technologies)
Slides:   1-PPTX 
Technologists focused on privacy and civil liberties will run through the material in their book. The workshop will cover how to think about privacy, privacy protection properties that a system can have and the architectures that implement them, related issues in information security, and privacy issues in data collection.
Martin Kleppmann (University of Cambridge)
Slides:   1-PDF 
Apache Samza is a framework for processing high-volume real-time event streams. In this session we will walk through our experiences of putting Samza into production at LinkedIn, discuss how it compares to other stream processing tools, and share the lessons we learnt about dealing with real-time data at scale.
Slides:   1-PDF 
In this talk, I will describe how we use data at Etsy to inform product development of the website and apps, from ideation to experimentation to launch, and everything in between. This talk will include specific examples of how we make decisions across the company using analytics as a critical input.
Jean-Daniel Cryans (Cloudera)
Slides:   1-PDF 
This presentation will show you how to get your Big Data into Apache HBase as fast as possible. Those 40 minutes will save you hours of debugging and tuning, with the added bonus of having a better understanding of how HBase works. You will learn things like the write path, bulk loading, HFiles, and more.
Farrah Bostic (The Difference Engine)
Slides:   1-PDF 
Amidst the fervor of big data, we forget that numbers don't speak for themselves—people speak for them, and people speak through them. The interpretive layer of research, design, and business decision making matters, in some ways more than ever. We have more data, more frameworks, more hypotheses, more beliefs than ever before competing for the attentions of CEOs and product teams alike.
Jon Kleinberg (Cornell University)
Slides:   1-PDF 
On-line social media systems are not simply venues for people to come together; they are also explicitly designed environments whose architectures serve to shape behavior. Here we consider several computational challenges for on-line social systems that illustrate this tension between organic interaction and algorithmic design.
Bob Mankoff (The New Yorker Magazine)
Bob Mankoff, The New Yorker's cartoon editor, will analyze the lessons we learn from crowdsourced humor. Along the way, he'll explore how cartoons work (and sometimes don't); how he makes decisions about what cartoons to include; and what crowds can tell us about a good joke.
Stephen Lloyd (Transamerica), Vishal Bamba (Transamerica), David Beaudoin (Transamerica)
Slides:   1-PPTX 
Transamerica is a financial services company moving to a more customer-centric model using Big Data. Our approach to this effort spans our Insurance, Annuity, and Retirement divisions. We went from a simple proof of concept to establishing Hadoop as a viable element of our enterprise data strategy. We cover core components of our solution and focus on lessons learned from our experience.
Sharmila Mulligan (ClearStory Data)
Data is an evolving story. It’s not a static snapshot of a point-in-time insight. With data from internal and external sources constantly updating, we are evolving from rear-view mirror dashboard views into an era of interactive storytelling.
Laurie Skelly (Datascope Analytics)
Slides:   1-PDF 
Data scientists wear many hats -- how do you train a ready-for-prime-time data scientist in twelve weeks? We'll share some of the choices and models we used to create the Metis Data Science Bootcamp and select its first cohort of students.
Josh Levy (Vast)
Slides:   1-PDF 
By reducing friction from deploying models and comparing competing models, data scientists can focus on high-value efforts. At Vast we've experimented with tools and strategies for this while shipping a suite of data products for consumers and agents in the midst of some of life’s biggest purchases. I'll share best practices and lessons learned, and help you free up time for the fun stuff.
Jeffrey Heer (Trifacta | University of Washington)
Slides:   1-PDF 
Interaction and visual design are exacting exercises. Designing for data -- especially in messy and massive forms -- brings a new set of challenges. How can we help people of varying backgrounds effectively transform and understand data at scale?
Slides:   1-PPTX 
We generally accept that more data is better, but at what cost? There are few published studies that describe how a firm should value data in economic terms. Without such tools, a firm can't price data or calculate the ROI from investing in it. This talk will be an overview of how managers and data scientists can measure the value of their data with an eye towards making better data investment.
George Corugedo (RedPoint Global)
Slides:   1-PPTX 
Deriving value from data depends on how well companies capture and manage that data. Learn how to create a centralized processing pool where data can be captured, cleansed, linked and structured in a consistent way. Use the scalability and flexibility of Hadoop to create a powerful processing and refinement engine to drive usable information across enterprise databases and data marts.
Slides:   1-PPT 
It's been twenty years since Red Hat first launched Linux. Since then Red Hat has fueled the rapid adoption of open source technologies. As Big Data transitions into enterprise mode, Red Hat is again poised to facilitate the innovation and communities needed to empower multiple data stakeholders across your organization so you can truly open the possibilities of your data.
Guy Harrison (Dell Software), David Robson (Dell Software), Kathleen Ting (Cloudera)
Slides:   1-PPTX 
When people think of big data processing, they think of Apache Hadoop, but that doesn't mean traditional databases don't play a role. In most cases users will still draw from data stored in RDBMS systems. Apache Sqoop can be used to unlock that data and transfer it to Hadoop, enabling users with information stored in existing SQL tables to use new analytic tools.
Marcel Kornacker (Cloudera), Lenni Kuff (Cloudera)
Slides:   1-PDF 
Find out how to run real-time analytics over raw data without requiring a manual ETL process targeted at an RDBMS. This talk describes Impala’s approach to on-the-fly data transformation and its support for nested data; examples demonstrate how this can be used to query raw data feeds in formats such as text, JSON and XML, at a performance level commonly associated with specialized engines.
Sridhar Reddy (MapR Technologies), Carol McDonald (MapR Technologies)
Slides:   1-PDF 
This tutorial will help you get a jump start on HBase development. We’ll start with a quick overview of HBase, the HBase data model, and architecture, and then we’ll dive directly into code to help you understand how to build HBase applications. We will also offer guidelines for good schema design, and will cover a few advanced concepts such as using HBase for transactions.
Jim Scott (MapR Technologies)
Slides:   1-PDF 
Learn the critical factors for organizational success with Hadoop, and how to build the right team and skill sets for a high-performance deployment, from a veteran of three successful Hadoop projects.
Leo Meyerovich (Graphistry)
Slides:   1-PDF 
Shoving 1MM rows of query results into a chart or graph returns illegible results and kills interactivity. Smarter designs, however, will achieve data visibility. Furthermore, running on GPUs turns static designs into interactive tools. We will show how Graphistry does this in production with (a) new client/cloud GPU infrastructure and (b) GPU-accelerated languages like Superconductor.
Chris Nauroth (Hortonworks), Suresh Srinivas (Hortonworks)
Slides:   1-PPTX 
Are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? Inspired by real-world support cases, this talk discusses best practices and new features to help improve incident response and daily operations. Chances are that you’ll walk away from this talk with some new ideas to implement in your own clusters.
David Jonker (Uncharted Software Inc.), Rob Harper (Uncharted)
Slides:   1-PDF 
The widespread adoption of web-based maps provides a familiar set of interactions for exploring large data spaces. Building on these techniques, tile-based visual analytics provides interactive visualization of billions of points of data or more. This session provides an overview of the technical challenges and promise of this approach, using applications created with the open source Aperture Tiles framework on GitHub.
Peter Ferns (Goldman Sachs & Co)
Slides:   1-PPTX 
In the presentation, we will highlight one significant business use-case where we are using Big Graph and graph analytics in the Compliance business for surveillance and forensics.
Majken Sander (TimeXtender)
Slides:   1-PDF 
A look at how we use public governmental data to answer questions about our customers and their behavior; data used by marketing, space management, and product managers. Other government data is used to support the company's sales forecast, decide which items to purchase, predict the amount to be purchased, and determine which items to phase out. All data-driven management, made even easier with Hadoop.
Altan Khendup @madmongol (Teradata Corporation), Ron Bodkin (Teradata)
Slides:   1-PPTX 
Developing Big Data applications for real-world business processes can be complex: method of processing, variety of systems, number of data sources. Large web companies have implemented a generic, scalable, fault-tolerant data processing architecture: Lambda. We’ll explore this evolving architecture, its design principles, layers and components, and use cases and lessons learned from real-world implementations.
Cameron Turner (The Data Guild)
Slides:   1-PDF 
The Data Guild's Cameron Turner will share his experience in building an IIoT machine learning system to control one of the largest medical factories in the world. We will walk through the story from problem statement to solution and discuss the wins and pitfalls of data automation in an industrial setting.
Ailey Crow (Pivotal)
Slides:   1-PDF 
Automated image processing improves efficiency for a diverse range of applications from defect detection in manufacturing to tumor detection in medical images. We’ll go beyond traditional approaches to image processing, which fail for large image datasets, by leveraging Hadoop for processing a vast number of arbitrarily large images.
Brigitte Piniewski (nonaffiliated)
Slides:   1-PPTX 
This case study demonstrates the value of data partnerships and the use of high-yield laboratory data to solve dynamic demand management problems. Provocative suggestions for the use of public data and crowdsourced approaches to minimize future demand for healthcare are also offered.
Barry Devlin (9sight Consulting)
Slides:   1-PPTX 
“Leave the over-structured, complex Data Warehouse behind. Dive into the pure, sparkling waters of the Data Lake!” I suggest you enjoy the Instagram, but beware the hidden depths. The Data Lake is a misleading metaphor; it will become a watery grave for context, governance, and value. In reality, today's intricate information ecosystem demands a careful blend of architectures and technologies.
Julia Angwin (ProPublica)
Julia Angwin discusses how much she has spent trying to protect her privacy, and raises the question of whether we want to live in a society where only the rich can buy their way out of ubiquitous surveillance.
Miriah Meyer (University of Utah)
Miriah Meyer, Assistant Professor of Computer Science, University of Utah
Joy Beatty (Seilevel)
Slides:   1-PPTX 
Normal people don't look at data just for fun; they analyze it to make business decisions. Implementing analytics in your organization is not as simple as dropping a packaged solution in place and hoping for the best. You still need requirements, just like any other project.
Mary Ann Wayer (Premier Inc)
Slides:   1-PPT 
In this case study, Premier's Mary Ann Wayer looks at how the company creates real-time dashboards atop big data infrastructure, letting hospitals explore a nation's healthcare in a few clicks.
Mike Olson (Cloudera)
Mike Olson, CSO and Chairman, Cloudera
George Legendre (IJP Architects London)
In this presentation, George L. Legendre, principal of IJP Architects and faculty at Harvard Graduate School of Design, will show how the mathematical equations of pasta define the ultimate taxonomy of the genre.
Eli Collins (Cloudera)
In this presentation, Eli Collins, Cloudera’s Chief Technologist, will discuss how we might reap the benefits of data while avoiding its perils.
Igor Elbert (
Slides:   1-PDF 
For a long time Internet retailers have been trying to move items they sell closer to customers. Flash sale site takes it to the extreme: we apply machine learning to predict customers' cravings for fashion products in different geographic regions without purchase history to draw from.
Michael Rosenbaum (Pegged Software)
Slides:   external link
This session will cover the state of the rapidly growing but still small industry of applying emerging data techniques to hiring and team assembly. It will cover the methods used and results achieved in several verticals—healthcare and long term care, application development, and business process outsourcing.
Mike Armstrong (ZestFinance)
Slides:   1-PPTX 
Last year, Douglas Merrill, CEO of ZestFinance and former Google CIO, discussed how success in big data analysis requires not just machines and algorithms, but also human analysis, or “data artists.” Building on this notion, Mike Armstrong, CMO of ZestFinance, will discuss how companies can find, identify, and correct data inaccuracies.
Anubhav Dhoot (Cloudera)
Slides:   1-PPTX 
This talk will cover resource management using YARN, the new resource management platform introduced in Hadoop 2.0. It will cover how YARN achieves effective cluster utilization and fair sharing of resources, and allows different types of applications to utilize the cluster. We will go over the architecture, recent improvements, and things coming down the pipeline.
Ben Werther (Platfora)
Spark represents the next step-function leap in what is possible with Hadoop, but what does that mean for business analysts who are swimming in multi-structured data? This presentation discusses the new workflow required so that business analysts can work with massive volumes of multi-structured data to find new insights today, instead of continually having to wait for IT to make big data small.
John Rauser (Snapchat)
There are two essential skills for the data scientist: engineering and statistics. A great many data scientists are very strong engineers but feel like impostors when it comes to statistics. In this talk John will argue that the ability to program a computer gives you special access to the deepest and most fundamental ideas in statistics.
Fangjin Yang (Imply), Xavier Léauté (Confluent)
Slides:   1-PDF 
Organizations often showcase the virtues of their data platforms, but rarely share the challenges and decisions faced along the way. Our session describes how we architected our analytics stack around Druid, an open source distributed data store, and how we overcame the challenges around scaling the system, balancing features with cost, and making performance consistent.
Karen Moon (Trendalytics)
Karen Moon will discuss the characteristics of unstructured data that make identifying and synthesizing fashion trends particularly challenging, and how getting it right can be a competitive advantage.
Dafna Shahaf (The Hebrew University of Jerusalem)
Slides:   1-PPTX 
The amount of data in the world is increasing at incredible rates. Large-scale data has the potential to transform almost every aspect of our world, from science to business; for this potential to be realized, we must turn data into insight. In this talk, Dafna will describe two of her efforts to address this problem computationally.
Rohit Jain (Esgyn)
Slides:   1-PPTX 
We had a big, and growing, global platform for storing and sharing people's most precious memories. But we had a problem. Size and scale meant traditional RDBMS wouldn't work. It was slow, error prone, and gave users a worse experience the more successful we became...
Edd Wilder-James (Silicon Valley Data Science)
Slides:   1-PPTX 
The “data lake” vision has been adopted by industry vendors, but what does it mean, and where is it headed? This talk will cover the change in data architectures and its advantages, and help you understand where you and your vendors fit along the four stages of data lake maturity.
Kim Rees (Periscopic)
Slides:   1-PPTX 
We have the unfortunate tendency to fit our problems to the technology at hand. We should be looking for ways to bend technology to our problems...our big problems. Kim will take a long look into the future of data covering the controversial and hopeful areas of privacy, open data, hacking, ETL relief, latent machines, M2M, and mass crowdsourcing.
Andrew Hill (Set)
Slides:   external link
An important skill of today's data scientists is data communication. Mapping and other types of data visualization have been sufficient to try and demonstrate the trends and patterns these professionals find in data. However, there is an important shift happening in the way we consume data, which means that as a community we need to think about our ability to turn data into stories.
Shankar Vedantam, NPR Science Desk, NPR
Joel Gurin (Center for Open Data Enterprise), Laura Manley (The GovLab at NYU)
Slides:   1-PDF 
Open government data on healthcare, finance, education, energy, and other areas has become a major business resource. Joel Gurin, author of Open Data Now and director of the Open Data 500 study, will show how both startups and established companies are putting open data to work. He'll cover Open Data and Big Data, business models for open-data companies, and lessons from a range of case studies.
Rana el Kaliouby (Affectiva)
This keynote will share insights from the world’s largest repository of consumer emotions and present the challenges and opportunities that this data presents for machine learning as well as data mining and visualization.
Jana Eggers (Nara Logics)
Slides:   1-PPTX 
If you can only hear the screaming voices in your data, you're likely only acting on what every other rational expert would see. What separates innovation from incremental improvement is the ability to listen to the weak signals from your data -- and customers, advisers, and partners.
Amy Gaskins (Panopticon)
Slides:   1-PDF 
The military's most elite teams are designed to solve complex problems and operate in countless environments. Does that sound like the expectation for your data science team? If their members are selected for skill diversity and trained for agility, then yours should be, too. Here's how to use special operations selection methodology to choose the right mix of people for your data science team.
Michael Stonebraker (Tamr, Inc.)
Slides:   1-PPTX 
The explosion of internal data sources, external public data sources and feeds from the Internet of Things is causing a tsunami of diverse data sources for enterprises. Top-down data-integration tools and data scientist tools won’t scale to meet the demands of the modern enterprise. Learn how a scalable data curation platform can help enterprises connect and enrich their data to leverage it all.
Denise Asplund (Cisco Systems, Inc)
Slides:   1-PPTX 
This talk highlights William's successes, challenges, and experiences creating a data-driven operations model in Cisco’s engineering services organization. William highlights the role of data, the need for scale and security, the opportunity for new technology to accelerate business, the role of IT as a guide and partner, and the mind shift and cultural changes along the journey.
Slides:   1-PPTX 
Becoming an organization that can make agile decisions from agile data requires agile analytics.
Matthias Braeger (CERN), Manish Devgan (Software AG)
Slides:   1-PDF 
CERN, home to the Large Hadron Collider (LHC), is at the forefront of science and technology. Come to this session to learn how projects at CERN are leveraging in-memory data management and Hadoop to derive real-time insights from sensor data, helping to manage the technical infrastructure of the LHC.
Monte Zweben (Splice Machine Inc.)
Slides:   1-PPTX 
There is a wave of challengers in the database world focused on the scaling costs of traditional RDBMSs. These potential giant killers have capitalized on explosive data growth and disruptive technologies like distributed computing (e.g., Hadoop and NoSQL). We’ll discuss the new breed of database buyers, the redefinition of “enterprise,” and apply lessons from past database wars.
Juan Miguel Lavista (Microsoft)
Slides:   1-PPTX 
In the US alone, we make over 40 billion search queries every month. From the time we wake up, searching is one of the top activities we do online. This talk will show some examples of how this data can be used, from funny things like determining which city wakes up earliest to more complex scenarios like finding adverse drug interactions.
M. C. Srivas (Uber)
If you want to know what's coming next in big data, just ask yourself, "what would Google do?"
Eddie Garcia (Cloudera)
Slides:   1-PDF 
Recent studies show the vast majority of Hadoop projects are stuck in development, with very few ever reaching production status. And those programs that do convert from pilot to production often view Hadoop as little more than an ETL tool. This session looks at why Hadoop implementations often stall out in the development phase and what companies can do to make Hadoop “production ready.”
Jennifer Zeszut (Beckon)
Slides:   1-PDF 
As the number of ways to reach consumers proliferates, marketers are swimming in data. But then why does the latest American Marketing Association survey show that the percentage of marketing leaders actually using data to make decisions has dropped from an embarrassing 37% two years ago to a mere 29% at last check?
Jim Adler (Metanautix)
Slides:   1-PPTX 
Bad press, FTC consent decrees, and White House reports have all put a spotlight on bad data practices. Data scientists and designers have become increasingly aware of how privacy principles should guide their work. So, the geeks have met the wonks. Now, it’s time for the wonks to meet the geeks and use data analytics to keep pace with burgeoning data volumes, velocities, and innovations.
Gilad Rosner (Internet of Things Privacy Forum)
Slides:   1-PPT 
While the inexorable march of technology does threaten historical notions of privacy, privacy IS very much alive – a shifting, vital conversation society has with itself and its machines. This talk explores the principles of transparency, unlinkability, and intervenability to build a foundation for a design ethos for technologists.