The dangers of data leakage in production machine learning systems
Who is this presentation for?
- Data scientists, engineers, and product managers
According to published research, data leakage is frequently found in public datasets, and it is likely to be at least as widespread in the private sector, where there’s less transparency.
Data leakage occurs when a model is trained on information it would not have access to at prediction time. AI systems that are not protected against leakage can fail catastrophically in production. Martin Goodson details the four main manifestations of data leakage and explains how to recognize the warning signs. By mastering a few key scientific principles, you can mitigate the risk of failure.
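As a minimal illustration (not taken from the talk itself), one common manifestation of leakage is computing preprocessing statistics, such as the mean used for feature scaling, on the full dataset before splitting it into train and test sets. The sketch below uses only synthetic data and standard-library Python; the variable names are illustrative, not part of any real pipeline.

```python
import random

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100)]
train, test = data[:80], data[80:]

# Leaky: the scaling statistic is computed on ALL data,
# so information about the test set flows into training.
full_mean = sum(data) / len(data)

# Correct: the statistic is computed from the training split only.
train_mean = sum(train) / len(train)

# Whenever the two splits differ, the leaky statistic differs from
# the legitimate one -- the model has "seen" the test set indirectly.
leak = abs(full_mean - train_mean)
print(f"shift in scaling statistic caused by leakage: {leak:.4f}")
```

The same principle applies to any fitted preprocessing step (scalers, imputers, feature selectors): fit on the training split only, then apply to the test split.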
Prerequisite knowledge
- Familiarity with supervised learning, classification, precision, recall, accuracy, cross-validation, and train/test splits
What you'll learn
- The errors that data leakage causes and how to build systems that protect against its common manifestations
Martin Goodson is the chief scientist and CEO of Evolution AI, where he specializes in large-scale natural language processing. Martin has designed data science products that are in use at companies like Dun & Bradstreet, Time Inc., John Lewis, and Condé Nast. Previously, Martin was a statistician at the University of Oxford, where he conducted research on statistical matching problems for DNA sequences. He runs the largest community of machine learning practitioners in Europe, Machine Learning London, and convenes the CBI/Royal Statistical Society roundtable, AI in Financial Services. Martin’s work has been covered by publications such as the Economist, Quartz, Business Insider, TechCrunch, and others.