Presented By O'Reilly and Cloudera
Make Data Work
5–7 May, 2015 • London, UK

Speaker Slides & Video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Slides:   1-PDF 
HP has integrated Hadoop into the core of our Big Data platform, solutions, and services. We will introduce the HP Big Data Reference Architecture (BDRA) and a series of services that will accelerate your adoption of Hadoop. HP's new BDRA for Hadoop offers an extremely flexible and powerful platform, while HP Haven Big Data Software solutions can be used to augment Hadoop and build a smarter Data Lake.
Luke Han (Kyligence Inc), Yang Li (eBay)
Slides:   1-PPTX 
Apache Kylin is an open source distributed analytics engine contributed by eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets. It was accepted as an Apache Incubator Project on Nov 25, 2014. Website:
Patrick Wendell (Databricks)
Slides:   1-PDF 
Apache Spark is a popular engine for fast and efficient data processing. This talk will cover recent feature additions to Spark, such as the elastic scaling support, new algorithms in MLlib, and the Spark SQL datasources API. It will also outline the Spark roadmap for upcoming months. Since this talk is not until May, specific roadmap details will be determined close to the talk itself.
Lars Trieloff (Blue Yonder)
Slides:   1-PDF 
While many companies are struggling to adopt big data and unlock its potential, facing challenges of visualization and democratization of insight, a number of industry leaders are leapfrogging big data adoption to circumvent the analyst bottleneck by going straight to automation of core business processes. This requires overcoming a set of tough cultural, technical, and scientific challenges.
Cory O'Connor (Google)
At Google, few things are so pervasive as Bigtable, the famous wide-column NoSQL database. It lies behind nearly every major Google product (Gmail, YouTube, Google Analytics), with its own class of internal memes, and a resource footprint unmatched anywhere else in the world.
Mike Haley (Autodesk, Inc.)
Jet engines, lifelike movie monsters, cancer-fighting nanorobots, and bespoke products. We live in a world where everything around us is designed by someone. The pace of innovation is escalating and with new methods of manufacturing, such as 3D printing, the demands placed on designers and design technology are increasing.
In this session, Phill Radley (Chief Data Architect at British Telecom) gives an overview of BT's internal multi-tenant Hadoop platform. He explains their first production use case (master data management of BT UK Business Customer data) and gives a flavour of their use case pipeline.
Gareth Martin (HP Enterprise Services)
As the internet of things and connected car programs across the globe gain momentum and broaden in scope, check out this world record attempt: racing from North Cape, Norway to Cape Agulhas, South Africa. . .
Benedikt Koehler (DataLion)
Slides:   1-PDF 
This talk uses the example of smartphone tracking data to demonstrate how data science can produce something that almost cannot be distinguished from art. It shows the whole workflow from data-munging to contextualizing and integrating the data points, until the data scientist can use the resulting data science for interactive visual storytelling.
Scott Kurth (Silicon Valley Data Science), Julie Steele (Manifold)
Slides:   1-PDF 
As the necessity of having a data strategy is sinking in, the chief data officer (CDO) has emerged as a new member of the executive team focused on creating and implementing that strategy. This talk describes what that looks like across a variety of industries and organizations, and shares some best practices for getting the most out of your business data.
Alice Zheng (Amazon)
Slides:   1-PDF 
Building and deploying predictive applications requires knowing how to evaluate, test, and track the performance of machine learning models over time. Using available off-the-shelf tools, this talk engages potential application builders on topics such as common evaluation metrics, A/B testing setup, tracking model performance, tracking usage via real-time feedback, and updating models.
Kevin Schmidt (Mind Candy Ltd), Luis Angel Vicente Sanchez (Mind Candy Ltd.)
Slides:   1-PDF 
Mobile gaming is a fast-moving field and needs metrics like daily active users or revenue in real-time to be able to fine-tune quickly. Approximation is needed to count those metrics, as the data volume would be too large to process exactly in real-time. We will demonstrate how to use Spark Streaming and probabilistic data structures to achieve a low error rate, even for many millions of users.
Marcel Kornacker (Cloudera)
Slides:   1-PPTX 
In this talk, attendees will learn about Impala’s approach to on-the-fly, automatic data transformation, which in conjunction with the ability to handle nested structures such as JSON and XML documents, addresses the needs of at-source analytics — including direct querying of your input schema, immediate querying of data as it lands in HDFS, and high performance on par with specialized engines.
Rick Farnell (Think Big, A Teradata Company)
After five years of enterprise adoption, Hadoop is now a critical data asset in your analytic and data platform strategy. Some companies, however, are struggling with making Hadoop work for their enterprise needs. . .
David Richards (WANdisco)
Healthcare is in the early stages of a revolution, as almost everything that determines our health is now becoming knowable. Data-driven healthcare represents an unheralded opportunity to make a huge leap forward. At this pivotal moment in medical history, we need to overcome an attitudinal aversion to utilizing the promise of data analysis to provide medical insight and save lives.
Tamara Dull (Amazon Web Services)
Join SAS’s Tamara Dull as she compares bike riding to current trends in big data adoption and explains why newer technologies like Hadoop aren’t always to blame.
Carme Artigas (Synergic Partners)
Slides:   1-PDF 
The presentation will show examples from different business sectors (industry, services, and public administrations) where data can become the business model. Data monetization strategies will be explained, linking this with the need to treat data as an economic asset, making data accessible inside and outside the company boundaries...
David Talby (Pacific AI), Claudiu Branzan (Accenture)
Slides:   1-PPTX 
Live demo using Python open-source libraries to build a hybrid machine-learning model for fraud detection, combining features from natural language processing, topic modeling, time series analysis, link analysis, heuristic rules, and anomaly detection. We’ll then show how we scaled to billions of events using Spark, and what it took to make the system perform and ready for production.
Tim Harford (The Financial Times)
We're always talking about "innovation", but - says Tim Harford - there are really two very different kinds of innovation. Using stories from sports, science, music, and military history, Tim will make you think differently about where good ideas come from and how they should be encouraged.
Ben Lever (Ambiata)
Slides:   external link
Ivory is a new open-source, Hadoop-based data store that focuses on changing the way we approach the critical and time-consuming activity of scalable feature engineering. It both simplifies and adds rigour to data science pipelines, aiding in their transition from the lab to production environments.
Mark Samson (Cloudera)
Slides:   1-PPTX 
The Hadoop ecosystem makes it possible to build an enterprise data hub capable of storing and analysing a wide variety of data. However, a platform with such broad capability triggers a question: how to organise the myriad data sets in a way that allows users to explore and access the data they need? This session will propose an information architecture for Hadoop that enables this.
Julia Angwin (ProPublica)
We are being watched – by companies, by the government, by our neighbors. Technology has made powerful surveillance tools available to everyone. And now some of us are investing in counter-surveillance techniques and tactics.
Edd Wilder-James (Google)
Slides:   1-PPTX 
Creating value from data needs a new mindset. To fully exploit new big data tools and architectures, we need a new way of thinking: data as the raw material of growth. How do you share this understanding in your company, and how do you plan for success?
Cait O'Riordan (Financial Times)
Cait O'Riordan, VP of product, music, and platforms, Shazam
Julie Meyer (Ariadne Capital)
Julie Meyer, chairman, CEO, and founder, Ariadne Capital
Mark Torr (SAS)
Slides:   1-PPTX 
This presentation shares how SAS can help spread the use of Hadoop to less technical audiences, showcasing some of the end-user technologies already implemented at SAS customers that can help across the spectrum of data ingestion and management, visualization, and analytics.
Slides:   1-PDF 
Offering benefits is a classic and important strategy for acquisition of new customers and churn management. For measuring benefits with data, this model combines multivariate testing like A/B testing and Bayesian time series prediction modeling. The model is implemented in R using the CausalImpact package. This presentation will demonstrate the model structure and provide a case study.
Max Neunhöffer (ArangoDB)
Slides:   1-PDF 
We present a concrete case study of a situation where it was necessary to have different data models (documents and graphs) in the same database engine using a common query language. A single aircraft already contains some 6,000,000 parts, not counting components. Any single data model inevitably leads to inefficient queries, though queries are nevertheless crucial for the application.
Rod Smith (IBM Emerging Internet Technologies)
Big data and analytics continue to be a disruptive business force. Are we entering another phase – real-time digital business transformation, where businesses are realizing that the time to adjust to market and customer opportunities and threats is shrinking quickly?
Tyler Akidau (Google)
Slides:   external link
Learn what it takes to ditch your Big Data batch pipelines and go all-streaming-all-the-time, without compromising latency, correctness, or the flexibility to deal with changes in upstream data.
Mikio Braun (Zalando)
Slides:   1-PDF 
While the data management side of Big Data has seen tremendous progress in the past few years, bringing technologies like Hadoop or Spark together with advanced machine learning and data analysis methods is still a major challenge. In this talk, I will discuss recent advances, approaches, and patterns which are used to build truly scalable machine learning solutions.
Slides:   1-PPTX 
Cloudera Impala can be considered as an alternative solution to a relational database for data warehouse-like workloads. The CERN database community did a close evaluation of the Impala engine with respect to CERN's needs. In this presentation we will discuss our experience with the technology, and will report on query performance in comparison to data access using an Oracle RDBMS.
Yanpei Chen (Cloudera), Dileep Kumar (Cloudera Inc)
Slides:   1-PDF 
SQL-on-Hadoop systems that support business intelligence (BI) use cases must handle hundreds or even thousands of concurrent users. We will talk about how to scale your SQL-on-Hadoop system to a large number of concurrent users, and how to verify that your system can support BI.
Francis Irving (ScraperWiki Ltd.)
Slides:   1-PPT 
Better data collaboration is vital for every organization. For the UN's Humanitarian division it is particularly hard: they work in hundreds of countries, in emergencies and natural disasters. This talk describes the Humanitarian Data Exchange, answering such questions as: what motivates busy, front-line staff to share data? How do you measure the success of a data collaboration platform?
Simon Wardley (Leading Edge Forum)
Slides:   1-PDF 
Simon Wardley, researcher, Leading Edge Forum (CSC)
Dean Wampler (Anyscale)
Slides:   1-PDF 
Spark is often seen as a replacement for MapReduce in Hadoop systems, but Spark clusters can also be deployed and managed by Mesos. This talk explains how to use Mesos for Spark applications. Using example applications, we'll examine the pros and cons of using Mesos vs. Hadoop YARN as a data platform and discuss practical issues when running Spark on Mesos.
Martin Kleppmann (University of Cambridge)
Slides:   1-PDF 
Data is only useful if you can process it, analyse it, and create valuable products from it. If you have an idea for a new data-driven product, how long does it take you to get it into production? In this talk, we'll discuss Apache Kafka and Samza, open source tools created at LinkedIn with the goal of helping teams implement data products and ship them to production rapidly.
Frank Saeuberlich (Teradata)
Slides:   1-PDF 
Most organizations nowadays see the massive value potential in (big) data analytics. What most of them still fear is that starting an analytics initiative will result in a massive IT project that will take 12-18 months before first analytical results are achieved – and deploying the results to generate business value will take another 12-18 months. . .
Oana Calugar (AliveShoes)
Slides:   1-PDF 
Curiosity is one of the most valued skills for people working in Data Science. But how can we train it? Einstein said that "Curiosity is an important trait of a genius". Let’s explore how we can develop our curiosity with three exercises in the session: how to find pleasure in uncertainty; question the question we’re asking; and find a beginner's mind, all with direct application to data science.
Joey Echeverria (Rocana)
Slides:   external link
As the volume of data and number of applications moving to Apache Hadoop has increased, so has the need to secure that data and those applications. In this presentation, we'll take a brief look at where Hadoop security is today and then peer into the future.
Shivon Zilis (Bloomberg Beta)
Slides:   1-PDF 
Shivon Zilis, venture capitalist and founding member of Bloomberg Beta
Charles Lamb (Cloudera), Andrew Wang (Cloudera)
Slides:   1-PPTX 
Encryption is a requirement for many business sectors dealing with confidential information. To meet these requirements, transparent, end-to-end encryption was added to HDFS. This protects data both in flight and at rest, and is compatible with existing Hadoop apps. We will cover the design and implementation of transparent encryption in HDFS, as well as performance results.
Anand Subramanian (Gramener)
Slides:   external link
The election results page for the 2014 Indian general elections was hosted on CNN-IBN and The focus was on real-time analysis of results for users and TV anchors. With over 540 million voters and 100 million viewers, the volume and complexity of data both provide a design challenge. This talk focuses on the techniques behind this design.
Christine Foster (ShopKeep)
Slides:   external link
How to make data and analytics valuable to a business. How to improve a business with data and analytics. I'm a business person first, and an analyst second. I have seen many excellent data scientists fail to implement their ideas. I have also seen many excellent business people fail to generate value from data.