Presented By O'Reilly and Cloudera
Make Data Work
5–7 May, 2015 • London, UK

Data Science conference sessions

Inside the world of data practitioners, from the hard science of the latest algorithms and advances in machine learning to the thorny issues of cultural change and team-building.

Tuesday, 05 May

9:00–12:30 Tuesday, 5/05/2015
Location: St. James / Regents
Olivier Grisel (Inria & scikit-learn)
Average rating: ****.
(4.00, 5 ratings)
Three-hour hands-on introductory workshop on predictive modeling and machine learning with open source tools from the Python community such as scikit-learn and IPython.
13:30–17:00 Tuesday, 5/05/2015
Location: St. James / Regents
Garrett Grolemund (RStudio), Colin Gillespie (Jumping Rivers | Newcastle University)
Average rating: ***..
(3.75, 4 ratings)
Learn how to combine the best ideas of reproducible research into a simple, easy-to-use workflow with R. The Packrat, R Markdown, and Shiny packages let you (a) embed your code into reports to create a reproducible record of your work, (b) rerun the code to generate a new report as data and ideas change, and (c) export your reports into multiple formats, including PDFs and interactive web apps.

Wednesday, 06 May

10:55–11:35 Wednesday, 6/05/2015
Location: King's Suite - Balmoral
Tim Harford (The Financial Times)
Average rating: ****.
(4.89, 9 ratings)
We're always talking about "innovation", but, says Tim Harford, there are really two very different kinds of innovation. Using stories from sport, science, music, and military history, Tim will make you think differently about where good ideas come from and how they should be encouraged.
11:45–12:25 Wednesday, 6/05/2015
Location: King's Suite - Balmoral
Ben Lever (Ambiata)
Average rating: ***..
(3.00, 1 rating)
Ivory is a new open-source, Hadoop-based data store that focuses on changing the way we approach the critical and time-consuming activity of scalable feature engineering. It both simplifies and adds rigour to data science pipelines, aiding in their transition from the lab to production environments.
13:45–14:05 Wednesday, 6/05/2015
Location: King's Suite - Balmoral
Alice Zheng (Amazon)
Average rating: ***..
(3.75, 12 ratings)
Building and deploying predictive applications requires knowing how to evaluate, test, and track the performance of machine learning models over time. Using available off-the-shelf tools, this talk engages potential application builders on topics such as common evaluation metrics, A/B testing setup, tracking model performance, tracking usage via real-time feedback, and updating models.
14:05–14:25 Wednesday, 6/05/2015
Location: King's Suite - Balmoral
Felipe Hoffa (Google)
Average rating: ***..
(3.44, 9 ratings)
How big is the human genome? What tools can we use to manage and understand it? It turns out that the same tools used for traditional purposes (Hadoop, Spark, BigQuery, Dataflow, and SQL) can be applied to genomics. In this session we'll introduce the basics of managing genomes with our favorite big data tools, and draw parallels with more traditional use cases like analyzing view logs.
14:35–14:55 Wednesday, 6/05/2015
Location: King's Suite - Balmoral
Noel Welsh (Underscore Consulting)
Average rating: ***..
(3.78, 9 ratings)
A/B testing is easy; it's just an application of hypothesis testing, taught in every first-year stats course. My goal in this talk is to convince you that this view is wrong. There is a lot of subtlety in creating a meaningful test, and this subtlety is important in practice. I'll cover issues from methodology to epistemology, giving insights and tools directly applicable to practice.
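For reference, the textbook view that this abstract sets out to challenge can be sketched as a two-proportion z-test. This is an illustrative sketch only, not material from the talk: the function name `two_proportion_z_test` and the sample counts are hypothetical, and the test assumes independent visitors and samples large enough for the normal approximation.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants convert at the same rate.

    conv_a / conv_b are conversion counts; n_a / n_b are visitors per variant.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 100 of 1000 visitors convert on A, 150 of 1000 on B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The sketch deliberately ignores the issues the abstract alludes to, such as when to stop the test and whether the samples are truly comparable.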
14:55–15:15 Wednesday, 6/05/2015
Location: King's Suite - Balmoral
Average rating: **...
(2.67, 9 ratings)
Offering benefits is a classic and important strategy for acquiring new customers and managing churn. To measure the effect of benefits with data, the model presented here combines multivariate testing, such as A/B testing, with Bayesian time series prediction modeling. The model is implemented in R using the CausalImpact package. This presentation will demonstrate the model structure and provide a case study.
16:15–16:55 Wednesday, 6/05/2015
Location: King's Suite - Balmoral
Jeremy Heffner (Azavea)
Average rating: ****.
(4.94, 16 ratings)
We often face the need to analyze the count of discrete events which occur at a specific time and place, whether they are crime events, taxi requests, or phone calls. Forecasting these space-time events brings particular challenges: finding suitable tools for geographic processing, and techniques for modeling the data. The session will cover the lessons learned in building such a system.
17:05–17:45 Wednesday, 6/05/2015
Location: King's Suite - Balmoral
Carlos Guestrin (Apple | University of Washington)
Average rating: ****.
(4.59, 17 ratings)
Deep learning is a promising machine learning technique with a high barrier to entry. In this talk, we provide an easy entry into this field via "deep features" from pre-trained models. These features can be trained on one data set for one task and used to obtain good predictions on a different task, on a different data set. No prior experience is necessary.

Thursday, 07 May

10:55–11:35 Thursday, 7/05/2015
Location: King's Suite - Balmoral
Sean Owen (Cloudera)
Average rating: ****.
(4.94, 17 ratings)
Apache Spark has a lot to like for the data scientist: natively distributed, REPL, Scala and Python APIs, and a machine learning library, MLlib. Spark 1.2 includes an implementation of random decision forests, an important classifier/regressor algorithm. This talk will introduce Spark, Scala, and random decision forests, and demonstrate the process of analyzing a real-world data set with them.
11:45–12:25 Thursday, 7/05/2015
Location: King's Suite - Balmoral
Mikio Braun (Zalando SE)
Average rating: ****.
(4.40, 5 ratings)
While the data management side of Big Data has seen tremendous progress in the past few years, bringing technologies like Hadoop or Spark together with advanced machine learning and data analysis methods is still a major challenge. In this talk, I will discuss recent advances, approaches, and patterns which are used to build truly scalable machine learning solutions.
13:45–14:05 Thursday, 7/05/2015
Location: King's Suite - Balmoral
David Talby (Pacific AI), Claudiu Branzan (G2 Web Services)
Average rating: ***..
(3.71, 7 ratings)
Live demo using Python open-source libraries to build a hybrid machine-learning model for fraud detection, combining features from natural language processing, topic modeling, time series analysis, link analysis, heuristic rules, and anomaly detection. We’ll then show how we scaled to billions of events using Spark, and what it took to make the system performant and ready for production.
14:05–14:25 Thursday, 7/05/2015
Location: King's Suite - Balmoral
Divanny Lamas (Context Relevant)
Average rating: **...
(2.00, 3 ratings)
Context Relevant has defined the next generation of financial information capabilities by applying rapid automated predictive analytics software to solve Wall Street’s toughest problems. The big data 2.0 era of automated, intelligent, and scalable systems allows Wall Street banks to finally take advantage of the massive value of the data they hold and better serve and protect their customers.
14:35–14:55 Thursday, 7/05/2015
Location: King's Suite - Balmoral
Jeroen Janssens (Data Science Workshops)
Average rating: ***..
(3.71, 7 ratings)
Hadoop, Storm, and Spark are fantastic frameworks for processing massive amounts of data in parallel. Every now and then, there is a one-off data science task that could really use some speeding up. For those kinds of tasks, it's probably not worthwhile to set up large frameworks. This presentation demonstrates GNU Parallel, which allows you to easily parallelize and distribute such tasks.
14:55–15:15 Thursday, 7/05/2015
Location: King's Suite - Balmoral
Richard Shaw (MapR)
Average rating: ****.
(4.33, 3 ratings)
Apache Spark is a powerful, unified data processing engine offering a number of APIs, from batch and SQL through streaming to graph manipulation. The core architecture of Spark has not necessarily been designed with a multi-user environment in mind. We will review existing and emerging approaches to using Spark in multi-user environments, such as the Tachyon project.
16:15–16:55 Thursday, 7/05/2015
Location: King's Suite - Balmoral
Kevin Schmidt (Mind Candy Ltd), Luis Angel Vicente Sanchez (Mind Candy Ltd)
Average rating: ****.
(4.50, 2 ratings)
Mobile gaming is a fast-moving field and needs metrics like daily active users or revenue in real time so that games can be fine-tuned quickly. Approximation is needed to count those metrics, as the data volume would be too large to process exactly in real time. We will demonstrate how to use Spark Streaming and probabilistic data structures to achieve a low error rate, even for many millions of users.
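To illustrate the kind of approximation this abstract describes, here is a minimal sketch of one probabilistic data structure for counting distinct users: a K-Minimum-Values sketch. The abstract does not say which structures the speakers use, so the choice of technique, the class name `KMVSketch`, and the user IDs are all assumptions made for illustration. The sketch is plain Python rather than Spark Streaming.

```python
import hashlib
import heapq


class KMVSketch:
    """Approximate distinct counter that stores only the k smallest hash values."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hashes currently kept, used to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct hashes seen: count is exact
        kth_smallest = -self._heap[0]
        return int((self.k - 1) / kth_smallest)


# Hypothetical event stream in which every user appears twice.
sketch = KMVSketch(k=1024)
for i in range(50000):
    sketch.add(f"user-{i}")
    sketch.add(f"user-{i}")
print(sketch.estimate())
```

Memory use is bounded by k regardless of the number of users, and the relative error shrinks roughly as 1/sqrt(k), which is the trade-off that makes such structures attractive for real-time metrics.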
17:05–17:45 Thursday, 7/05/2015
Location: King's Suite - Balmoral
David Jonker (Uncharted Software Inc.), Scott Langevin (Uncharted Software Inc.)
Average rating: ****.
(4.00, 2 ratings)
This session demonstrates using open source tools and techniques for visually exploring massive node-link graphs in a web browser by visualizing all the data. Seeing all the data reveals informative patterns and provides important context to understanding insights. Examples will highlight large scale graph analysis of social networks, customer purchase history, and health care industry data.