Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Solving data cleaning and unification using human-guided machine learning

Ihab Ilyas (University of Waterloo)
14:55–15:35 Wednesday, 1 May 2019
Data Science, Machine Learning & AI
Location: Capital Suite 14
Average rating: 4.71 (7 ratings)

Who is this presentation for?

  • CDOs, CAOs, CIOs, digital transformation leaders, and data architects



Prerequisite knowledge

  • Familiarity with basic notions of data clustering and record similarity metrics

What you'll learn

  • Understand how human-in-the-loop machine learning systems can improve the accuracy and scale at which data can be integrated


Last year, Ihab Ilyas covered two primary challenges in applying machine learning to data curation: entity consolidation and using probabilistic inference to suggest data repairs for identified errors and anomalies. This year, he explores these challenges in greater detail and explains why the data unification projects common to modern enterprises quickly come to require human-guided machine learning and a probabilistic model.

Machine learning is being used to address a host of data curation challenges but must be applied in the proper context to meet the scale of the problems at hand. Using machine learning to replicate traditional techniques will fail as data sources and volume expand, regardless of the speed ML affords. Data semantics and domain-specific knowledge must be integral to the solution.

Join in to learn how to achieve sufficient accuracy and scalability when building data curation systems, and gain deeper insight into why entity consolidation and data repair problems require machine learning, human expertise, and problem semantics to deliver a scalable, high-accuracy solution.

Ihab Ilyas

University of Waterloo

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He’s a recipient of the Ontario Early Researcher Award (2009), a Cheriton faculty fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he’s an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of ACM Transactions on Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.