Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Data Science & Machine Learning

If you're in data, you need to understand machine learning

Machine learning lets you discover hidden insight from your data. It's a simple idea with phenomenal impact and sophisticated use cases like recommenders, text mining, real-time analytics, large-scale anomaly detection, and business forecasting.

At Strata, you’ll get a deeper and broader understanding of machine and deep learning—take a look at the sessions below.

Monday-Tuesday 21-22 May: 2-Day Training (Platinum & Training passes)
Tuesday 22 May: Tutorials (Gold & Silver passes)
Wednesday 23 May: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 | Location: Auditorium
Strata Data Conference Keynotes
10:45am
Morning break
Thursday 24 May: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 | Location: Auditorium
Strata Data Conference Keynotes
10:45am
Morning break
9:00–12:30 Tuesday, 22 May 2018
Location: Capital Suite 10 Level: Beginner
Secondary topics:  Text and Language processing and analysis
Barbara Fusinska (Google)
Natural language processing techniques help address tasks like text classification, information extraction, and content generation. Barbara Fusinska offers an overview of natural language processing and walks you through building a bag-of-words representation, using Python and its machine learning libraries, and then using it for text classification.
9:00–12:30 Tuesday, 22 May 2018
Location: Capital Suite 15 Level: Intermediate
Vartika Singh (Cloudera), Juan Yu (Cloudera)
Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.
9:00–17:00 Tuesday, 22 May 2018
Location: Capital Suite 2/3
Dan Jeavons (Shell), Hollie Lubbock (Fjord), Jivan Virdee (Fjord), Fausto Morales (Arundo), Marty Cochrane (Arundo), Jane McConnell (Teradata), Paul Ibberson (Teradata), Kevin Parent (Conduce), Javier Esplugas (DHL Supply Chain), Viola Melis (Typeform), Dave Fitch (The Data Lab), Federica Mutti (Data Reply), Maria Assunta Palmieri (Data Reply), Niranjan Thomas (Dow Jones), Erik Elgersma (FrieslandCampina)
Hear practical insights from household brands and global companies: the challenges they tackled, the approaches they took, and the benefits (and drawbacks) of their solutions.
13:30–17:00 Tuesday, 22 May 2018
Location: Capital Suite 10 Level: Beginner
Secondary topics:  E-commerce and Retail, Media, Advertising, Entertainment
Neejole Patel (Virginia Tech)
Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Neejole Patel walks you through using PyTorch to build a content recommendation model.
13:30–17:00 Tuesday, 22 May 2018
Location: Capital Suite 12 Level: Intermediate
Secondary topics:  Text and Language processing and analysis
David Talby (Pacific AI), Claudiu Branzan (G2 Web Services)
Natural language processing is a key component in many data science systems. David Talby and Claudiu Branzan lead a hands-on tutorial on scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.
11:15–11:55 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines sessions
Ihab Ilyas (University of Waterloo | Tamr)
Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas provides insight into various techniques and discusses how machine learning, human expertise, and problem semantics collectively can deliver a scalable, high-accuracy solution.
11:15–11:55 Wednesday, 23 May 2018
Location: Capital Suite 12 Level: Non-technical
Secondary topics:  Security and Privacy
Andrew Burt (Immuta)
The Strata Data conference in London takes place during one of the most important weeks in the history of data regulation, as GDPR begins to be enforced. Andrew Burt explores the effects of the GDPR on deploying machine learning models in the EU.
11:15–11:55 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Advanced
Mathew Salvaris (Microsoft), Miguel Gonzalez-Fierro (Microsoft), Ilia Karmanov (Microsoft)
Mathew Salvaris, Miguel Gonzalez-Fierro, and Ilia Karmanov offer a comparison of two platforms for running distributed deep learning training in the cloud, using a ResNet network trained on the ImageNet dataset as an example. You'll examine the performance of each as the number of nodes scales and learn some tips and tricks as well as some pitfalls to watch out for.
11:15–11:55 Wednesday, 23 May 2018
Location: Capital Suite 14 Level: Intermediate
Secondary topics:  Visualization, Design, and UX
Jeff Fletcher (Cloudera)
As big data adoption grows, Apache Hadoop, Apache Spark, and machine learning technologies are increasingly being used to analyze ever-larger datasets, but we still have to keep telling stories about the data and making sure the message is clear. Jeff Fletcher details the tools and techniques that are relevant to data visualization practitioners working with large datasets and predictive models.
11:15–11:55 Wednesday, 23 May 2018
Location: Expo Hall Level: Beginner
Secondary topics:  Media, Advertising, Entertainment
Daniel Gilbert (News UK), Jonathan Leslie (Pivigo)
In the era of 24-hour news and online newspapers, editors in the newsroom must quickly and efficiently make sense of the enormous amounts of data that they encounter and make decisions about their content. Daniel Gilbert and Jonathan Leslie discuss an ongoing partnership between News UK and Pivigo in which a team of data science trainees helped develop an AI platform to help in this task.
12:05–12:45 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Secondary topics:  Financial Services
Baiju Devani (Aviva Canada), Étienne Chassé St-Laurent (Aviva Canada)
Risk-sharing pools allow insurers to get rid of risks they are forced to insure in highly regulated markets. Insurers thus cede both the risk and its premium. But are they ceding the right risk or simply giving up premium? Baiju Devani and Étienne Chassé St-Laurent share an applied machine learning approach that leverages an ensemble of models to gain a distinctive market advantage.
12:05–12:45 Wednesday, 23 May 2018
Location: Capital Suite 12
Secondary topics:  Media, Advertising, Entertainment, Security and Privacy
Elisa Celis (EPFL)
There is a pressing need to design new algorithms that are socially responsible in how they learn and socially optimal in the manner in which they use information. Elisa Celis explores the emergence of bias in algorithmic decision making and presents first steps toward developing a systematic framework to control biases in classical problems, such as data summarization and personalization.
12:05–12:45 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Intermediate
Secondary topics:  E-commerce and Retail, Media, Advertising, Entertainment
In the last few years, deep learning has achieved significant success in a wide range of domains, including computer vision, artificial intelligence, speech, NLP, and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. Nick Pentreath explores recent advances in this area in both research and practice.
12:05–12:45 Wednesday, 23 May 2018
Location: Capital Suite 14
Secondary topics:  Media, Advertising, Entertainment, Security and Privacy
Guillaume Chaslot (AlgoTransparency)
An increasing number of ex-Google and ex-Facebook employees state that social media is starting to control us rather than the other way around. How can we determine if social media is a pure reflection of people's interests or if it pushes us toward specific narratives? Guillaume Chaslot explores methodologies to find out which narratives are favored by social media recommendation engines.
12:05–12:45 Wednesday, 23 May 2018
Location: Expo Hall Level: Intermediate
Konstantinos Georgatzis (QuantumBlack), Martha Imprialou (QuantumBlack)
Konstantinos Georgatzis and Martha Imprialou explain how to interpret the predictions given by your black-box model and how machine learning is helping to drive decision making today.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Manas Ranjan Kar (Episource)
Episource is building a scalable NLP engine to help summarize medical charts and extract medical coding opportunities and their dependencies in order to recommend the best possible ICD-10 codes. Manas Ranjan Kar offers an overview of the wide variety of deep learning algorithms involved and the complex in-house training-data creation exercises that were required to make it work.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 12 Level: Intermediate
Secondary topics:  Telecom, Time Series and Graphs
Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Huawei)
Heitor Murilo Gomes and Albert Bifet offer an overview of StreamDM, a real-time analytics open source software library built on top of Spark Streaming, developed at Huawei Noah’s Ark Lab and Telecom ParisTech.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Intermediate
Secondary topics:  Security and Privacy
Eran Avidan (Intel)
Deep learning is revolutionizing many domains within computer vision, but doing real-time analysis is challenging. Eran Avidan offers an overview of a novel architecture based on Redis, Docker, and TensorFlow that enables real-time analysis of high-resolution streaming video.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 14 Level: Non-technical
Jivan Virdee (Fjord), Hollie Lubbock (Fjord)
Artificial intelligence systems are powerful agents of change in our society, but as this technology becomes increasingly prevalent, transforming our understanding of ourselves and our society, issues around ethics and regulation will arise. Jivan Virdee and Hollie Lubbock explore how to address fairness, accountability, and the long-term effects on our society when designing with data.
14:05–14:45 Wednesday, 23 May 2018
Location: Expo Hall Level: Intermediate
Secondary topics:  Managing and Deploying Machine Learning
Emre Velipasaoglu (Lightbend)
Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu reviews monitoring methods, focusing on their applicability in fast data and streaming applications.
14:55–15:35 Wednesday, 23 May 2018
Location: Capital Suite 12 Level: Intermediate
Secondary topics:  E-commerce and Retail, Financial Services, Time Series and Graphs
Mikio Braun (Zalando SE)
Time series data has many applications in industry, in particular predicting the future based on historical data. Mikio Braun offers an overview of time series analysis with a focus on modern machine learning approaches and practical considerations, including recommendations for what works and what doesn't.
14:55–15:35 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Beginner
Aurélien Géron (Kiwisoft)
Convolutional neural networks (CNNs) can now complete many computer vision tasks with superhuman ability. This will have a large impact in manufacturing by improving anomaly detection, product classification, analytics, and more. Aurélien Géron details common CNN architectures, explains how they can be applied to manufacturing, and covers potential challenges along the way.
14:55–15:35 Wednesday, 23 May 2018
Location: Capital Suite 14 Level: Beginner
Secondary topics:  Visualization, Design, and UX
Brian O'Neill (Designing for Analytics)
Gartner says 85%+ of big data projects will fail. Your own company may have even spent millions on a recent project that isn’t really delivering the value or UX everyone hoped for. Brian O'Neill explains why CDOs, PMs, and business leaders who leverage design to prioritize utility, usability, and customer value will realize the best ROIs and demonstrates how to start evaluating your UX.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Non-technical
Secondary topics:  Text and Language processing and analysis
Naveed Ghaffar (Narrative Economics), Rashed Iqbal (UCLA)
Narratives are significant vectors of rapid change in culture, economic behavior, and the Zeitgeist of a society. Narrative economics studies the impact of popular human-interest stories on economic fluctuations. Naveed Ghaffar and Rashed Iqbal outline a framework that uses natural language understanding to extract and analyze narratives in human communication.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 12 Level: Intermediate
Secondary topics:  Time Series and Graphs
The rate of growth of data volume and velocity has been accelerating along with increases in the variety of data sources. This poses a significant challenge to extracting actionable insights in a timely fashion. Arun Kejariwal and Francois Orsini explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines sessions, Media, Advertising, Entertainment
Olga Ermolin (MLS Listings)
Aggregation of geospecific real estate databases results in duplicate entries for properties located near geographical boundaries. Olga Ermolin details an approach for identifying duplicate entries by analyzing the images that accompany real estate listings, using transfer learning with a Siamese architecture based on the VGG-16 CNN topology.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 14 Level: Beginner
Secondary topics:  Visualization, Design, and UX
Bargava Subramanian (Impel Labs), Amit Kapoor (narrativeVIZ Consulting)
Creating visualizations for data science requires an interactive setup that works at scale. Bargava Subramanian and Amit Kapoor explore the key architectural design considerations for such a system and discuss the four key trade-offs in this design space: rendering for data scale, computation for interaction speed, adapting to data complexity, and being responsive to data velocity.
17:25–18:05 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Jorie Koster-Hale (Dataiku)
Because it's affected by a number of geospatial and temporal features, predicting crime poses a unique technical challenge. Jorie Koster-Hale shares an approach using a combination of open source data, machine learning, time series modeling, and geostatistics to determine where crime will occur, what predicts it, and what we can do to prevent it in the future.
17:25–18:05 Wednesday, 23 May 2018
Location: Capital Suite 12 Level: Intermediate
Secondary topics:  Security and Privacy, Time Series and Graphs
Fabian Yamaguchi (ShiftLeft)
Fabian Yamaguchi offers an overview of Code Property Graph (CPG), a unique approach that represents the functional elements of code in an interconnected graph of data and control flows. This enables semantic information about code to be stored scalably on distributed graph databases over the web while still being rapidly accessible.
17:25–18:05 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Intermediate
Secondary topics:  Text and Language processing and analysis
Darren Cook (QQ Trend)
Darren Cook demonstrates how to use LSTMs, state-of-the-art tokenizers, dictionaries, and other data sources to tackle translation, focusing on one of the most difficult language pairs: Japanese to English.
17:25–18:05 Wednesday, 23 May 2018
Location: Capital Suite 14 Level: Intermediate
Mark Grover (Lyft), Deepak Tiwari (Lyft)
Sure, you’ve got the best and fastest running SQL engine, but you’ve still got some problems: users don’t know which tables exist or what they contain, and sometimes bad things happen to your data and you need to regenerate partitions, but there is no tool to do so. Mark Grover explains how to make your team and your larger organization more productive when it comes to consuming data.
11:15–11:55 Thursday, 24 May 2018
Location: Capital Suite 10/11 Level: Non-technical
Secondary topics:  Text and Language processing and analysis
Radim Řehůřek (RARE Technologies Ltd.)
Radim Řehůřek shares lessons learned and tips for successful R&D in applied data science. You'll learn the primary gaps between the academic and industry skill sets, what businesses should look out for when applying cutting-edge research in practice, what researchers can do to increase the impact of their research, and what companies can do to promote, reward, and nurture good quality ML research.
11:15–11:55 Thursday, 24 May 2018
Location: Capital Suite 12 Level: Beginner
Jeroen Janssens (Data Science Workshops)
"Anyone who does not have the command line at their beck and call is really missing something," tweeted Tim O'Reilly when Jeroen Janssens's Data Science at the Command Line was recently made available online for free. Join Jeroen to learn what you're missing out on if you're not applying the command line and many of its power tools to typical data science problems.
11:15–11:55 Thursday, 24 May 2018
Location: Capital Suite 13 Level: Beginner
Secondary topics:  Managing and Deploying Machine Learning
Ramesh Sridharan (Captricity)
Most uses of deep learning involve models trained with large datasets. Ramesh Sridharan explains how Captricity uses deep learning with tiny datasets at scale, training thousands of models using tens to hundreds of examples each. These models are dynamically trained using an automatic deployment framework, and carefully chosen metrics further exploit error properties of the resulting models.
11:15–11:55 Thursday, 24 May 2018
Location: Expo Hall Level: Beginner
Secondary topics:  Time Series and Graphs
Jared Lander (Lander Analytics)
Temporal data is being produced in ever greater quantity, but fortunately our time series capabilities are keeping pace. Jared Lander explores techniques for modeling time series, from traditional methods such as ARMA to more modern tools such as Prophet and machine learning models like XGBoost and neural nets. Along the way, Jared shares theory and code for training these models.
11:15–11:55 Thursday, 24 May 2018
Location: Capital Suite 14 Level: Intermediate
Secondary topics:  Managing and Deploying Machine Learning
Ted Dunning (MapR Technologies)
Ted Dunning offers an overview of the rendezvous architecture, geared to deal with much of the complexity involved in deploying models to production, thus allowing more time to be spent thinking and doing real data science. Ted covers the ideas behind the architecture, practical scenarios, and advantages and disadvantages of the architecture.
12:05–12:45 Thursday, 24 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Secondary topics:  Financial Services
Mike Lee Williams (Cloudera Fast Forward Labs)
Interpretable models result in more accurate, safer and more profitable machine learning products, but interpretability can be hard to ensure. Michael Lee Williams examines the growing business case for interpretability, explores concrete applications including churn, finance and healthcare, and demonstrates the use of LIME, an open source, model-agnostic tool you can apply to your models today.
12:05–12:45 Thursday, 24 May 2018
Location: Capital Suite 12 Level: Intermediate
Secondary topics:  Financial Services
Calum Murray (Intuit)
Machine learning-based applications are becoming the new norm. Calum Murray shares five use cases at Intuit that use the data of over 60 million users to create delightful experiences for customers by solving repetitive tasks, freeing them up to spend time more productively or solving very complex tasks with simplicity and elegance.
14:05–14:45 Thursday, 24 May 2018
Location: Capital Suite 10/11 Level: Beginner
Paco Nathan (O'Reilly Media)
Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media.
14:05–14:45 Thursday, 24 May 2018
Location: Capital Suite 12 Level: Beginner
Kaylea Haynes (Peak)
Deciding how much stock to hold is a challenge for hire businesses. There is a fine balance between holding enough stock to fulfill hires and holding so much that overall utilization is too low to achieve the return on investment. Kaylea Haynes shares a case study on forecasting the demand for thousands of assets across multiple locations.
14:05–14:45 Thursday, 24 May 2018
Location: Capital Suite 13 Level: Intermediate
Secondary topics:  Data Platforms, Managing and Deploying Machine Learning
Moty Fania (Intel)
Moty Fania explains how Intel implemented an AI inference platform to enable internal visual inspection use cases and shares lessons learned along the way. The platform is based on open source technologies and was designed for real-time streaming and online actuation.
14:05–14:45 Thursday, 24 May 2018
Location: Expo Hall Level: Intermediate
Secondary topics:  Financial Services, Text and Language processing and analysis
David Talby (Pacific AI), Saif Addin Ellafi (John Snow Labs), Paul Parau (UiPath)
Spark NLP is an open source library that natively extends Spark ML to provide natural language understanding capabilities with performance and scale that were not previously possible. David Talby explains how Spark NLP was used to augment the Recognos smart data extraction platform in order to automatically infer fuzzy, implied, and complex facts from long financial documents.
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Elena Terenzi (Microsoft), Michael Lanzetta (Microsoft)
Olivia Klose and Elena Terenzi offer an overview of a collaboration between Microsoft and the Royal Holloway University that applied deep learning to locate illegal small-scale mines in Ghana using satellite imagery, scaled training using Kubernetes, and investigated the mines' impact on surrounding populations and environment.
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 12 Level: Non-technical
David Asboth (Cox Automotive Data Solutions), Shaun McGirr (Cox Automotive Data Solutions)
Cox Automotive is the world’s largest automotive service organization, which means it can combine data from across the entire vehicle lifecycle. Cox is on a journey to turn this data into insights. David Asboth and Shaun McGirr share their experience building up a data science team at Cox and scaling the company's data science process from laptop to Hadoop cluster.
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 13 Level: Intermediate
Secondary topics:  Financial Services, Time Series and Graphs
Francesca Lazzeri (Microsoft), Jaya Mathew (Microsoft)
Advancements in computing technologies and ecommerce platforms have amplified the risk of online fraud, which results in billions of dollars of loss for the financial industry. This trend has pushed companies to consider AI techniques, including deep learning, for fraud detection. Francesca Lazzeri and Jaya Mathew explain how to operationalize deep learning models with Azure ML to prevent fraud.
16:35–17:15 Thursday, 24 May 2018
Location: Capital Suite 10/11 Level: Beginner
Secondary topics:  Financial Services
Jonathan Leslie (Pivigo), Tom Harrison (Hackney Council), Maryam Qurashi (Pivigo)
One major challenge to social housing is determining how best to target interventions when tenants fall behind on rent payments. Jonathan Leslie and Tom Harrison discuss a recent project in which a team of data scientist trainees helped Hackney Council devise a more efficient, targeted strategy to detect and prioritize such situations.
16:35–17:15 Thursday, 24 May 2018
Location: Capital Suite 12 Level: Intermediate
Secondary topics:  Telecom
Sven Löffler (T-Systems)
Sven Löffler offers an overview of the Data Intelligence Hub, T-Systems's implementation of the Fraunhofer Industrial Data Space: a reference architecture for the standardized and secure data exchange between industries in the context of the internet of things.
16:35–17:15 Thursday, 24 May 2018
Location: Capital Suite 13 Level: Intermediate
Amit Kapoor (narrativeVIZ Consulting), Bargava Subramanian (Impel Labs)
Amit Kapoor and Bargava Subramanian lead three live demos of deep learning (DL) done in the browser: building explorable explanations to aid insight, building model inference applications, and rapid prototyping and training an ML model, all using the emerging client-side JavaScript libraries for DL.