Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Data Science & Machine Learning

If you're in data, you need to understand machine learning

Machine learning lets you discover hidden insight from your data. It's a simple idea with phenomenal impact and sophisticated use cases like recommenders, text mining, real-time analytics, large-scale anomaly detection, and business forecasting.

At Strata, you’ll get a deeper and broader understanding of machine and deep learning—take a look at the sessions below.

9:00–17:00 Monday, 21 May & Tuesday, 22 May
Location: S11C
The instructor walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.
9:00–12:30 Tuesday, 22 May 2018
Location: Capital Suite 10 Level: Beginner
Barbara Fusinska (Katacoda)
Natural language processing techniques address tasks such as text classification, information extraction, and content generation. In this session, Barbara walks the audience through building a bag-of-words representation and using it for text classification. The goal of this tutorial is to build intuition on a simple natural language processing task.
9:00–12:30 Tuesday, 22 May 2018
Location: Capital Suite 12 Level: Intermediate
Vartika Singh (Cloudera), Jeffrey Shmain (Cloudera)
We go through approaches for preprocessing, training, inference, and deployment across data sets (time series, audio, video, and text), leveraging Spark, its extended ecosystem of libraries, and deep learning frameworks. We use sample data and code for each to understand implementation nuances, then highlight the bottlenecks and solutions for data and models at scale.
13:30–17:00 Tuesday, 22 May 2018
Location: Capital Suite 10 Level: Beginner
Neejole Patel (Virginia Tech)
Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Neejole Patel walks you through using PyTorch to build a content recommendation model.
13:30–17:00 Tuesday, 22 May 2018
Location: Capital Suite 12 Level: Intermediate
David Talby (Pacific AI), Claudiu Branzan (G2 Web Services)
Natural language processing is a key component in many data science systems that must understand or reason about text. This is a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.
11:15–11:55 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Ihab Ilyas (University of Waterloo | Tamr)
Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas provides insight into various techniques and discusses how machine learning, human expertise, and problem semantics collectively can deliver a scalable, high-accuracy solution.
11:15–11:55 Wednesday, 23 May 2018
Location: Capital Suite 12 Level: Non-technical
Secondary topics: Security and Privacy
Andrew Burt (Immuta)
Strata Data London 2018 takes place during one of the most important weeks in the history of data regulation, when the GDPR begins to be enforced. This talk explores the effects of the GDPR on deploying machine learning models in the EU.
11:15–11:55 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Advanced
Mathew Salvaris (Microsoft), Miguel Gonzalez-Fierro (Microsoft), Ilia Karmanov (Microsoft)
In this talk we present two platforms for running distributed deep learning training in the cloud. We train a ResNet network on the ImageNet dataset using some of the most popular deep learning frameworks. We then compare and contrast the performance improvement as we scale the number of nodes, provide tips, and detail the pitfalls of each framework and platform.
11:15–11:55 Wednesday, 23 May 2018
Location: Capital Suite 14 Level: Intermediate
Jeff Fletcher (Cloudera)
As big data adoption grows, Apache Hadoop, Apache Spark and machine learning technologies are increasingly being used to analyse ever larger datasets. But we still have to keep telling stories about the data and making sure the message is clear. This talk will cover the tools and techniques that are relevant to data visualisation practitioners working with large data sets and predictive models.
12:05–12:45 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Baiju Devani (Aviva Canada)
Risk sharing pools allow insurers to get rid of risks they are forced to insure in highly regulated markets. Insurers thus cede both the risk and its premium. But are we ceding the right risk or simply giving up premium? We present an applied machine learning talk that leverages an ensemble of models and allows us to gain a distinctive market advantage and win through machine learning.
12:05–12:45 Wednesday, 23 May 2018
Location: Capital Suite 12 Level: Intermediate
Ted Dunning (MapR Technologies)
No matter how clever your learning algorithms, two things will still be true: data and deployment logistics will dominate the effort, and you will need more than two versions of your model, even in full production. I will describe the rendezvous architecture and show how it addresses these issues and many more, thus allowing more time to be spent thinking and doing real data science.
12:05–12:45 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Intermediate
In the last few years, deep learning has achieved significant success in a wide range of domains, including computer vision, artificial intelligence, speech, NLP, and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. Nick Pentreath explores recent advances in this area in both research and practice.
12:05–12:45 Wednesday, 23 May 2018
Location: Capital Suite 14 Level: Non-technical
Michael Freeman (University of Washington)
Statistical and machine learning techniques are only useful when they're understood by decision makers. While implementing these techniques is easier than ever, communicating about their assumptions and mechanics is not. In this session, participants will learn a design process for crafting visual explanations of analytical techniques and communicating them to stakeholders.
12:05–12:45 Wednesday, 23 May 2018
Location: Capital Suite 15/16 Level: Intermediate
Ira Cohen (Anodot)
The mobile world has so many moving parts that a simple change to one element can cause havoc somewhere else. The resulting issues can annoy users and cause revenue leaks. This presentation discusses ways to use anomaly detection to track everything mobile, from the service and roaming to specific apps, to fully optimize your mobile offerings.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Intermediate
Secondary topics: Security and Privacy
Joshua Patterson (NVIDIA), Mike Wendt (NVIDIA)
Learn how we used GPU-accelerated open source technologies to improve our cyber defense platforms at NVIDIA. Leveraging software from the GPU Open Analytics Initiative (GOAI), we improved the performance and scale of our threat hunting activities. We will discuss how we accelerated anomaly detection with more efficient machine learning models, faster deployment, and more granular data exploration.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Manas Ranjan Kar (Episource)
At Episource, we build deep learning frameworks and architectures that help summarize a medical chart and extract medical coding opportunities and their dependencies in order to recommend the best possible ICD-10 codes. This required not only building a wide variety of deep learning algorithms to account for natural language variations but also fairly complex in-house exercises to create training data.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 12 Level: Intermediate
The growth of data volume and velocity has been accelerating, and the variety of data sources has been growing as well. This poses a significant challenge in extracting actionable insights in a timely fashion. The talk focuses on how marrying correlation analysis with anomaly detection can help to this end, and also discusses robust techniques to guide effective decision making.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 15/16 Level: Beginner
In-house data science teams often work with a range of business functions, apply diverse techniques and face unpredictable hurdles related to requirements, data, infrastructure and deployment. Traditional data science processes are too abstract to cope with the complexity of these environments. This session will use recent project examples at easyJet to highlight how we overcame these challenges.
14:55–15:35 Wednesday, 23 May 2018
Location: S11A Level: Advanced
Jacques Nadeau (Dremio)
This talk takes a deep dive into a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. We'll start with an overview of the system design and deployment architecture, including coverage of cache lifecycle, update patterns, cache cohesion and appropriate use cases.
14:55–15:35 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Intermediate
Lee Blum (Verint Systems)
Using an actual complex case study, Lee Blum shares how we built our large-scale cyber defense system to serve our data scientists with versatile analytic operations on petabytes of data and trillions of records. He discusses our extremely challenging use case, decision considerations, major design challenges, tips and tricks, and the system’s overall results.
14:55–15:35 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Beginner
Lucio Tolentino (Mashable)
When a particular story gets widespread attention we often say that it has gone viral, but the underlying process of how this happens is complex, with a multitude of different contributing factors. We present a series of experiments with random networks and viral content produced by Mashable that informs our understanding of viral news stories.
14:55–15:35 Wednesday, 23 May 2018
Location: Capital Suite 12 Level: Intermediate
Mikio Braun (Zalando SE)
Time series data has many applications in industry, in particular predicting the future based on historical data. In this talk we review time series analysis with a focus on modern machine learning approaches and practical considerations, including recommendations on what works, what does not, and in which context.
14:55–15:35 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Beginner
Aurélien Géron (Kiwisoft)
Convolutional neural networks (CNNs) can now complete many computer vision tasks with superhuman ability. This will have a large impact on manufacturing by improving anomaly detection, product classification, analytics, and more. Aurélien Géron details common CNN architectures, explains how they can be applied to manufacturing, and covers potential challenges along the way.
16:35–17:15 Wednesday, 23 May 2018
Location: S11B Level: Intermediate
Ran Taig (Dell), Omer Sagi (Dell)
DevOps and QA engineers devote a significant amount of time to investigating recurring issues. These issues are often represented by large configuration and log files, so investigating whether two issues are duplicates can be a very tedious task. This session presents a solution that leverages NLP and machine learning algorithms to automatically identify duplicate issues at Dell.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Non-technical
Naveed Ghaffar (Narrative Economics), Rashed Iqbal (UCLA)
Narratives are significant vectors of rapid change in culture, in zeitgeist, in economic behaviour. Introduced formally by Professor Robert Shiller in 2017, Narrative Economics studies the impact of popular human-interest stories on economic fluctuations. We present a framework that uses Natural Language Understanding for extracting and analysing narratives in human communication.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 12 Level: Intermediate
Heitor Murilo Gomes (Télécom-ParisTech), Albert Bifet (Huawei)
We present StreamDM, a real-time analytics open source software library built on top of Spark Streaming, developed at Huawei Noah’s Ark Lab and Telecom ParisTech.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Intermediate
Olga Ermolin (MLS Listings)
Aggregation of geospecific real estate databases results in duplicate entries for properties located near geographical boundaries. Olga Ermolin details an approach that identifies duplicate entries by analyzing the images that accompany real estate listings, leveraging transfer learning with a Siamese architecture based on the VGG-16 CNN topology.
17:25–18:05 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Jorie Koster-Hale (Dataiku)
Predicting crime poses a unique technical challenge: it is affected by many different geospatial and temporal features (weather, infrastructure, demographics, public events, and government policy). Here, I use a combination of open source data, machine learning, time series modeling, and geostatistics to ask where crime will occur, what predicts it, and what we can do to prevent it in the future.
17:25–18:05 Wednesday, 23 May 2018
Location: Capital Suite 12 Level: Intermediate
Secondary topics: Security and Privacy
Fabian Yamaguchi (ShiftLeft)
In the earlier days, code would generate data; with CPG, we now generate data for the code so that we can understand it better.
17:25–18:05 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Intermediate
Darren Cook (QQ Trend Ltd.)
This talk uses LSTMs, state-of-the-art tokenizers, dictionaries, and other data sources to tackle machine translation, focusing on one of the most difficult language pairs: Japanese to English.
17:25–18:05 Wednesday, 23 May 2018
Location: Capital Suite 17 Level: Intermediate
David Talby (Pacific AI)
Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.
11:15–11:55 Thursday, 24 May 2018
Location: S11A Level: Intermediate
Holden Karau (Google), Rachel Warren (Independent), Anya Bida (Alpine Data)
Apache Spark is an amazing distributed system, but part of the bargain we've all made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. This talk looks at auto-tuning jobs using both historical and live job information, with systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.
11:15–11:55 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Intermediate
Kinnary Jangla (Pinterest)
Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems debugging? Kinnary Jangla explains how Pinterest Dockerized the services powering its Home Feed and how it impacted the engineering productivity of its ML teams while increasing uptime and ease of deployment.
11:15–11:55 Thursday, 24 May 2018
Location: Capital Suite 10/11 Level: Non-technical
Radim Řehůřek (RARE Technologies Ltd.)
Radim Řehůřek shares lessons learned and tips for successful R&D in applied data science. You'll learn the primary gaps between the academic and industry skill sets, what businesses should look out for when applying cutting-edge research in practice, what researchers can do to increase the impact of their research, and what companies can do to promote, reward, and nurture good quality ML research.
11:15–11:55 Thursday, 24 May 2018
Location: Capital Suite 13 Level: Beginner
Ramesh Sridharan (Captricity)
Most uses of deep learning involve very deep models trained with large datasets. At Captricity, we're also using deep learning with tiny datasets at scale, training thousands of models using tens to hundreds of examples each. These models are dynamically trained using our automatic deployment framework, and carefully chosen metrics further exploit error properties of the resulting models.
12:05–12:45 Thursday, 24 May 2018
Location: S11B Level: Beginner
Nanda Vijaydev (BlueData), Thomas Phelan (BlueData)
In the past, advanced machine learning techniques were only possible with a high-end proprietary stack. Today, you can use open source machine learning and deep learning algorithms available with distributed computing technologies like Apache Spark and GPUs. This session focuses on how to deploy TensorFlow and Spark, with the NVIDIA CUDA stack, on Docker containers in a multi-tenant environment.
12:05–12:45 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Michael Lee Williams (Fast Forward Labs)
Interpretable models result in more accurate, safer and more profitable machine learning products. But interpretability can be hard to ensure. In this talk, we'll look closely at the growing business case for interpretability and at concrete applications including churn, finance and healthcare, and demonstrate the use of LIME, an open source, model-agnostic tool you can apply to your models today.
12:05–12:45 Thursday, 24 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Calum Murray (Intuit)
Machine learning-based applications are becoming the new norm. Intuit is using the data of over 60 million users to create delightful experiences for customers by solving repetitive tasks, freeing them up to spend time more productively, or solving very complex tasks with simplicity and elegance. This talk looks at five use cases at Intuit.
14:05–14:45 Thursday, 24 May 2018
Location: S11A Level: Non-technical
Thomas Dinsmore (Cloudera)
Data science transforms the organization. But executives struggle to build a culture of open data science, and transition from legacy commercial analytic tools. There are clear best practices organizations can use to accelerate adoption and success with open data science. We propose a model that helps organizations begin the journey, build momentum, and reduce reliance on legacy software.
14:05–14:45 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Tony Xing (Microsoft), Bixiong Xu (Microsoft)
This talk introduces Project Kensho, the one-stop shop for business incident monitoring and auto insights within Microsoft, and our path to infusing AI into BI to serve different Microsoft teams. We cover the lessons learnt, the technology evolution, the good and the bad, the architecture, and the algorithms, and show how engineering and data science together solved a common need that applies not only to Microsoft but to the industry.
14:05–14:45 Thursday, 24 May 2018
Location: Capital Suite 10/11 Level: Beginner
Paco Nathan (O'Reilly Media)
Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media.
14:05–14:45 Thursday, 24 May 2018
Location: Capital Suite 12 Level: Beginner
Kaylea Haynes (Peak)
Deciding how much stock to hold is a challenge for hire businesses. There is a fine balance between holding enough stock to fulfil hires and not holding too much stock so that overall utilisation is too low to achieve the return on investment. In this talk, we will describe a case study that we worked on involving forecasting the demand for thousands of assets across multiple locations.
14:05–14:45 Thursday, 24 May 2018
Location: Capital Suite 13 Level: Intermediate
Moty Fania (Intel)
In this session, Moty Fania shares Intel’s IT experience from implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time, streaming and online actuation. This session highlights the key learnings from this work with a thorough review of the platform’s architecture.
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Intermediate
Hope Wang (Intuit)
There is increasing demand for developing and scaling machine learning capabilities. A machine learning platform includes multiple phases that are iterative and overlap with each other. Hope Wang explains how to manage various artifacts and their associations and how to automate deployment in order to support the life cycle of a model and build a cohesive machine learning platform.
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Olivia Klose (Microsoft), Elena Terenzi (Microsoft)
Computer vision is becoming one of the focus areas in artificial intelligence (AI) to enable computers to see and perceive like humans. In a collaboration with Royal Holloway University, we applied deep learning to locate small-scale mines in Ghana using satellite imagery, scaled training using Kubernetes, and investigated the mines' impact on surrounding populations and the environment.
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 12 Level: Non-technical
David Asboth (Cox Automotive Data Solutions), Shaun McGirr (Cox Automotive Data Solutions)
Cox Automotive is the world’s largest automotive service organisation, and that means we can combine data from across the entire vehicle lifecycle. We are on a journey to turn this data into insights and want to share some of our experiences both in building up a data science team and scaling the data science process (from laptop to Hadoop cluster).
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 13 Level: Intermediate
Francesca Lazzeri (Microsoft), Jaya Mathew (Microsoft)
Advancements in computing technologies and e-commerce platforms have amplified the risk of online fraud. Failing to prevent fraud results in billions of dollars of loss for the financial industry. This trend has urged companies to consider AI techniques, including deep learning, for fraud detection. In this talk we show how to operationalize deep learning models with AzureML to prevent fraud.
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 15/16 Level: Non-technical
Simon Chan (Salesforce)
The promises of AI are great, but taking the steps to implement AI within an enterprise is challenging. The secret behind enterprise AI success often traces back to the underlying platform that accelerates AI development at scale. Based on years of experience helping executives establish AI product strategies, Dr. Simon Chan walks through the AI platform journey that is right for your business.
16:35–17:15 Thursday, 24 May 2018
Location: S11A Level: Beginner
Jason Bell (MastodonC)
Using Apache Kafka and DeepLearning4J, Jason Bell presents the design and implementation of a self-learning knowledge system, the design rationale behind it, and the implications of using streaming data with deep learning and artificial intelligence.
16:35–17:15 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Intermediate
Emre Velipasaoglu (Lightbend)
Most machine learning algorithms are designed to work on stationary data. Yet, real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Here, we review the monitoring methods and evaluate them for applicability in modern fast data and streaming applications.
16:35–17:15 Thursday, 24 May 2018
Location: Capital Suite 13 Level: Intermediate
Amit Kapoor (narrativeVIZ Consulting), Bargava Subramanian (Independent)
Amit Kapoor and Bargava Subramanian lead three live demos of deep learning (DL) done in the browser: building explorable explanations to aid insight, building model inference applications, and rapid prototyping and training an ML model, all using the emerging client-side JavaScript libraries for DL.
16:35–17:15 Thursday, 24 May 2018
Location: Capital Suite 15/16 Level: Beginner
Daniel Gilbert (News UK), Jonathan Leslie (Pivigo)
In the era of 24-hour news and online newspapers, editors in the newsroom must be able to make fast decisions about their content and must quickly and efficiently make sense of the enormous amounts of data that they encounter. We will discuss an ongoing partnership between News UK and Pivigo in which a team of data science trainees helped develop an AI platform to help in this task.
16:35–17:15 Thursday, 24 May 2018
Location: Capital Suite 17 Level: Intermediate
Kevin Sigliano (IE Business School)
Financial and consumer ROI demands that business leaders understand the drivers and dynamics of digital transformation and big data. Disrupting value propositions and continuous innovation are critical if you wish to dramatically improve the way your company engages, creates value, and maximizes financial results.