A novel solution for a data augmentation and bias problem in NLP using TensorFlow

KC Tung (Microsoft)

11:50am–12:30pm Thursday, October 31, 2019

Location: Grand Ballroom C/D

Text, language, speech

Who is this presentation for?

Machine learning engineers, data scientists, architects, innovators, solution specialists, and technical account execs

Level

Intermediate

Description

The TensorFlow ecosystem contains many valuable assets, one of which is the highly acclaimed TensorFlow high-level API. It offers a fast, lightweight way to reduce lead time in deep learning model development and hypothesis testing. With it, you can quickly develop a novel deep learning solution to an important practical need: data bias and data augmentation in NLP. Solving this problem would have a far-reaching impact on model bias, offensive-language detection, language personalization, and classification.

KC Tung details his work to satisfy a need of an enterprise customer (one of the largest airlines in the world) for a model that can accurately review, classify, and store texts from aircraft maintenance logs to comply with FAA regulations on aviation safety. The customer’s data is imbalanced and biased toward certain categories.

Training machine learning models with imbalanced data inevitably leads to model bias, and text generation is a novel and important approach to data augmentation. In NLP, many current approaches to augmenting minority data are unsupervised and limited to synonym swap, insertion, deletion, or oversampling. These generalized approaches often lead to a trade-off between precision and recall. They also don’t work well in practice, as enterprise data is almost always domain specific. A better framework is needed: one that generates new text by learning from any domain-specific, underrepresented text.
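For readers unfamiliar with the generic baselines mentioned above, here is a minimal sketch of random deletion, random swap, and oversampling in plain Python. It is an illustration only, not the method presented in the session; the function names and the two maintenance-log sentences are made up for the example.

```python
import random

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, rng):
    """Swap two randomly chosen positions (no-op for fewer than two tokens)."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    i, j = rng.sample(range(len(tokens)), 2)
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def oversample(texts, target_size, rng):
    """Duplicate randomly chosen minority examples until target_size is reached."""
    extra = [rng.choice(texts) for _ in range(max(0, target_size - len(texts)))]
    return list(texts) + extra

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical minority-class texts, invented for illustration.
minority = [
    "replaced left main gear tire",
    "inspected hydraulic line for leaks",
]

augmented = oversample(minority, 6, rng)
swapped = random_swap(minority[0].split(), rng)
deleted = random_deletion(minority[1].split(), 0.2, rng)
```

Note that none of these operations uses any knowledge of the domain: oversampling only repeats existing text, while swap and deletion perturb word order or content blindly, which is the limitation the paragraph above points to.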

KC presents a novel deep learning framework built with TensorFlow to quickly achieve this goal. A benchmark model is first trained on a balanced dataset. One class in this dataset is then undersampled to serve as the underrepresented, minority-class text. A gated recurrent unit (GRU) model learns to generate more of the underrepresented text, which helps train a long short-term memory (LSTM) model that classifies text. The result on holdout data shows that the model trained with generated text is surprisingly effective: classification accuracy, precision, and recall for each class are all on par with the benchmark model. In short, the work demonstrates how the enterprise customer successfully adopted TensorFlow: quickly applying the TensorFlow high-level API to build a novel production-grade solution for deployment, demonstrating the effectiveness of a novel data-augmentation framework, identifying a “killer app” or new core value for text generation, and establishing best practices and guidance for navigating machine learning model bias and its business impact.
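The experimental setup described above (undersample one class of a balanced dataset, then compare per-class precision and recall on holdout data) can be sketched in plain Python. This is a rough illustration under stated assumptions: the helper names, the class labels "A", "B", and "C", the placeholder texts, and the toy predictions are all hypothetical, and the GRU generator and LSTM classifier themselves are not shown.

```python
import random
from collections import Counter

def undersample_class(examples, label, keep, rng):
    """Keep only `keep` randomly chosen examples of `label`; keep all others."""
    target = [ex for ex in examples if ex[1] == label]
    others = [ex for ex in examples if ex[1] != label]
    return others + rng.sample(target, keep)

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} from paired true/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was not p
            fn[t] += 1  # missed a true t
    scores = {}
    for label in set(y_true) | set(y_pred):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores

rng = random.Random(0)

# Hypothetical balanced dataset: 100 placeholder texts per class.
balanced = [(f"text {i}", label) for label in ("A", "B", "C") for i in range(100)]

# Turn class "C" into the underrepresented minority class.
imbalanced = undersample_class(balanced, "C", 10, rng)
counts = Counter(label for _, label in imbalanced)

# Toy holdout labels and predictions, invented to exercise the metric code.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
scores = per_class_precision_recall(y_true, y_pred)
```

In the workflow the session describes, the same per-class comparison would be made between the benchmark model trained on the balanced data and the model trained on the minority class augmented with generated text.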

KC also details how to containerize the TensorFlow application and serve it in a Kubernetes cluster in the cloud, all with open source Python libraries. The TensorFlow high-level API proves to be indispensable for a fast and high-quality deep learning model development experience. Most importantly, this TensorFlow model may be deployed as a container in the cloud, on-premises, or at the edge, providing great flexibility to meet various solution architecture or business needs.

Prerequisite knowledge

Experience with TensorFlow, Keras, or other machine learning frameworks (useful but not required)
Familiarity with NLP, deep learning, text classification, and text generation (useful but not required)

What you'll learn

Discover the TensorFlow high-level API for production-grade models, quick starts for deep learning model development with hidden gems in tf.data examples, a new "killer app" for machine text generation using TensorFlow, and a reference architecture for TensorFlow model deployment in the cloud or at the edge

KC Tung

Microsoft

KC Tung is an AI architect at Microsoft. Previously, he was a cloud architect, ML engineer, and data scientist with hands-on experience and success in developing and serving AI, deep learning, computer vision, and natural language processing (NLP) models in many enterprise use-case-driven architectures, using open source machine learning libraries such as TensorFlow, Keras, PyTorch, and H2O. His specialties are end-to-end AI and ML model and data structure design, testing, and serving in the cloud or on-premises, along with the technical core of cloud-centric AI and ML implementation: design of experiments, hypothesis development, and reference architecture. KC holds a PhD in molecular biophysics from the University of Texas Southwestern Medical Center in Dallas, Texas.