Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Scalable machine learning for data cleaning

Ihab Ilyas (University of Waterloo | Tamr)
1:10pm–1:50pm Thursday, 09/13/2018
Data science and machine learning
Location: 1A 08
Level: Non-technical
Secondary topics: Data preparation, governance and privacy

Who is this presentation for?

  • CIOs, CDOs, VPs of data management, and any other senior IT leaders

Prerequisite knowledge

  • A basic understanding of data management and data management technologies

What you'll learn

  • Learn how to curate data at scale to enable transformational analytics and business outcomes


Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details of configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key to delivering the optimizations necessary for truly scalable ML curation solutions.

Ihab focuses on two main problems: entity consolidation, which is arguably the most difficult data curation challenge because it is notoriously complex and hard to scale, and the use of probabilistic inference to enrich data and suggest repairs for identified errors and anomalies. The problem statement in both cases sounds deceptively simple: find all the records from a collection of multiple data sources that refer to the same real-world entity, or use trusted data sources to suggest how to correct errors. However, both problems have challenged researchers and practitioners for decades because of the combinatorial explosion in the space of solutions and the lack of ground truth.

There’s a large body of work on these problems in both academia and industry. Techniques have included human curation, rules-based systems, and automatic discovery of clusters using predefined thresholds on record similarity. Unfortunately, none of these techniques alone has been able to provide sufficient accuracy and scalability. Ihab provides deeper insight into the entity consolidation and data repair problems and discusses how machine learning, human expertise, and problem semantics collectively can deliver a scalable, high-accuracy solution.
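As a rough illustration of the threshold-based clustering mentioned above, the following Python sketch groups records whose string similarity meets a fixed cutoff. It is not the approach presented in the session: the sample records, the 0.9 threshold, the `similarity` and `cluster_records` helpers, and the choice of `difflib` as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_records(records, threshold=0.9):
    """Group records whose pairwise similarity meets the threshold.

    Matches are chained with union-find, so two records land in the
    same cluster if any path of above-threshold pairs connects them.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every pair is compared: the number of comparisons grows
    # quadratically with the number of records.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return list(clusters.values())


# Hypothetical sample data, invented for this sketch.
records = [
    "Acme Corporation",
    "ACME Corporation",
    "Globex Inc",
    "Globex Inc.",
    "Initech",
]
clusters = cluster_records(records, threshold=0.9)
```

Even this toy version hints at the limits the abstract describes: comparing every pair of records does not scale to large collections, and a single hand-picked threshold either merges distinct entities or misses true duplicates.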


Ihab Ilyas

University of Waterloo | Tamr

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his research focuses on big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton faculty fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of ACM Transactions on Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.
