Building a successful machine learning (ML) application hinges on a data scientist's detailed understanding of the data and its attributes, such as variable type, range, outliers, and relationship with the dependent variable, which is what ensures data quality. Successfully applying the resulting model depends on obtaining data in the same format and structure over and over again, without significant changes in the statistical distribution of any attribute. When the data does change, the model needs frequent refreshes or its performance degrades. For applications that rely on artificial intelligence (AI) to make decisions, these requirements make it a significant challenge for a single data scientist, or even a group of them, to ensure repeatable, high levels of data quality. A fully automated system capable of detecting a wide range of data issues is key to ensuring accurate and reliable models.
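As a rough illustration of what checking for "significant changes in the statistical distribution" of an attribute can look like, the minimal sketch below compares a reference sample with a new batch using the two-sample Kolmogorov-Smirnov statistic. This technique is an assumption chosen for illustration, not a description of how DataQC Studio works; the function names `ks_statistic` and `has_drifted` and the 0.1 threshold are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of the
    two samples: 0.0 for identical samples, 1.0 for fully separated ones.
    """
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")

    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap is
    # always reached at one of the observed values.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def has_drifted(reference, current, threshold=0.1):
    """Flag an attribute whose distribution moved more than `threshold`.

    The threshold is a hypothetical placeholder; in practice it would be
    tuned per attribute and sample size.
    """
    return ks_statistic(reference, current) > threshold


# Toy data: the new batch is the reference shifted by half its range.
reference = list(range(100))
shifted = [x + 50 for x in reference]

print(ks_statistic(reference, shifted))   # 0.5
print(has_drifted(reference, reference))  # False
print(has_drifted(reference, shifted))    # True
```

A check like this covers only one numeric attribute at a time; an automated system would presumably run many such checks, alongside others for format and structure, across every attribute of a dataset.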
Archana Anandakrishnan offers an overview of DataQC Studio, American Express's automated system built to identify data issues and anomalies and create an exhaustive snapshot of the data. The tool is built with Python and Spark so that it scales to large datasets, and it relies entirely on open source tools such as the MLlib random forest classifier and the t-SNE implementation in scikit-learn.
By combining a variety of methods, DataQC Studio learns a confident quality score for any dataset, which can be used to assess the dataset's integrity before it is used in a model. The tool solves a fundamental problem of data quality management, and its potential far exceeds that of any manual data quality management process. Archana demonstrates the power of the ML-powered DataQC pipeline built with open source software by showcasing its extensive use at American Express. Since ML models power many critical decisions at American Express, the accuracy of these models is important for managing risk and delivering a superior customer experience.
The methods described here are modular and adaptable to any domain where accurate decisions from ML models are critical.
Archana Anandakrishnan is a senior data scientist in the Decision Science Organization at American Express, where she works on developing data products that accelerate the modeling lifecycle and the adoption of new methods across the company. She is currently a lead developer and contributor to DataQC Studio. Previously, she was a postdoctoral researcher in particle physics at Cornell University. She is passionate about mentoring and is currently a workplace mentor with Big Brothers Big Sisters, NYC. Archana holds a PhD in physics from the Ohio State University.
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com