Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Let the machines learn to improve data quality

Archana Anandakrishnan (American Express)
2:00pm–2:40pm Thursday, 09/13/2018
Data science and machine learning
Location: 1A 08
Level: Intermediate
Secondary topics: Data preparation, governance and privacy; Financial Services
Average rating: 3.20 out of 5 (5 ratings)

Who is this presentation for?

  • Data scientists and data science managers

Prerequisite knowledge

  • Familiarity with statistics and the ML modeling process

What you'll learn

  • Explore DataQC Studio, American Express's automated, scalable system for measurement and management of data quality
  • Learn how to take advantage of available open source packages to build your own data quality pipeline


Building a successful machine learning (ML) application hinges on the data scientist's ability to develop a detailed understanding of the data and its attributes, such as variable type, range, outliers, and relationship with the dependent variable, and thereby to ensure data quality. Once the model is in use, its success depends on obtaining data in the same format and structure again and again, without significant changes in the statistical distribution of any attribute. Changes in the data require frequent model refreshes or result in subpar model performance. For applications that rely on artificial intelligence (AI) to make decisions, these requirements make it a significant challenge for a single data scientist, or even a team of them, to ensure repeatable, high levels of data quality. A fully automated system that can detect a wide range of data issues is key to ensuring accurate and reliable models.
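
As an illustration of what checking for "significant changes in the statistical distribution" of an attribute can look like, here is a minimal sketch using the Population Stability Index (PSI). This is not taken from DataQC Studio. The choice of PSI, the function name psi, the bin count, and the thresholds mentioned in the comments are all assumptions made for this example.

```python
import bisect
import math
import random


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index for one numeric attribute.

    Illustrative helper (hypothetical, not part of any library):
    bins are cut at quantiles of the reference sample, and the
    share of values per bin is compared between the two samples.
    """
    ref_sorted = sorted(reference)
    # Interior bin edges taken at reference quantiles.
    edges = [ref_sorted[len(ref_sorted) * i // bins] for i in range(1, bins)]

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # bisect_right returns how many edges are <= v, i.e. the bin index.
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    p_ref = proportions(reference)
    p_cur = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))


random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(5000)]  # data the model was built on
same = [random.gauss(0.0, 1.0) for _ in range(5000)]       # new batch, same distribution
shifted = [random.gauss(1.0, 1.0) for _ in range(5000)]    # new batch, mean moved by 1

# A commonly cited rule of thumb (an assumption here, not a standard):
# below 0.1 is treated as stable, above 0.25 as a significant shift.
print("same distribution:", round(psi(reference, same), 4))
print("shifted distribution:", round(psi(reference, shifted), 4))
```

Run once per attribute on each incoming batch, a check of this kind flags drift without a person inspecting every variable by hand.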

Archana Anandakrishnan offers an overview of DataQC Studio, American Express’s automated system built to identify data issues and anomalies and to create an exhaustive snapshot of the data. The tool is built with Python and Spark so that it scales to large datasets, and it relies entirely on open source tools such as the MLlib random forest classifier and the t-SNE implementation in scikit-learn.
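
The abstract does not say how the random forest classifier is used inside DataQC Studio. One common pattern, shown here purely as an assumption, is to train a classifier to tell a reference dataset apart from a new one: if it cannot (cross-validated AUC near 0.5), the two datasets look alike. The sketch uses scikit-learn's RandomForestClassifier as a small-scale stand-in for the MLlib classifier, and the function name drift_auc and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def drift_auc(reference, current, seed=0):
    """Cross-validated AUC of a classifier separating two datasets.

    Values near 0.5 suggest the datasets are hard to tell apart;
    values near 1.0 suggest the new data differs from the reference.
    """
    X = np.vstack([reference, current])
    # Label 0 marks reference rows, label 1 marks rows from the new batch.
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])
    clf = RandomForestClassifier(
        n_estimators=100, min_samples_leaf=5, random_state=seed
    )
    # Out-of-fold probabilities, so the score is not inflated by overfitting.
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)


rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 3))   # synthetic "training-time" data
same = rng.normal(size=(500, 3))        # new batch from the same distribution
shifted = rng.normal(size=(500, 3))
shifted[:, 0] += 2.0                    # new batch with one attribute shifted

auc_same = drift_auc(reference, same)
auc_shifted = drift_auc(reference, shifted)
print("AUC, same distribution:", round(auc_same, 3))
print("AUC, shifted attribute:", round(auc_shifted, 3))
```

After fitting the classifier on all rows, its feature importances can also point to the attributes that changed.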

By combining a variety of methods, DataQC Studio learns a confident quality score for any dataset, which can be used to assess the dataset's integrity before it is used in a model. The tool solves a fundamental problem of data quality management, and its potential far exceeds that of any manual data quality management process. Archana demonstrates the power of the ML-powered DataQC pipeline built with open source software by showcasing its extensive use at American Express. Since ML models power many critical decisions at American Express, the accuracy of these models is important for managing risk and delivering a superior customer experience.

The methods described here are modular and adaptable to any domain where accurate decisions from ML models are critical.

Archana Anandakrishnan

American Express

Archana Anandakrishnan is a senior data scientist in the Decision Science Organization at American Express, where she develops data products that accelerate the modeling lifecycle and the adoption of new methods across the company. She is currently a lead developer of and contributor to DataQC Studio. Previously, she was a postdoctoral researcher in particle physics at Cornell University. She is passionate about mentoring and is currently a workplace mentor with Big Brothers Big Sisters of NYC. Archana holds a PhD in physics from the Ohio State University.