Pensieve—a natural language processing (NLP) project that classifies reviews by their sentiment, the reason for that sentiment, and their high- and low-level content—is used in production to handle thousands of reviews daily across multiple domains. Megan Yetman offers an overview of Pensieve along with approaches for improving model reporting and enabling continuous model learning and improvement.
Raw text is tokenized against a custom vocabulary, then passed through an embedding layer, a convolutional neural network (CNN), and a bidirectional long short-term memory network (bi-LSTM) to produce softmax outputs over the classification options. Monte Carlo simulations are then run, generating multiple softmax outputs per classification per review, and nonparametric tests determine which outputs to report. This makes it possible to trade accuracy against model coverage: only classifications whose uncertainty is acceptably low are reported.
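The sample-then-decide step can be sketched roughly as follows. This is a minimal, pure-Python illustration, not Pensieve's actual code: the noisy toy "model," the thresholds, and the spread test are all assumptions standing in for the real Monte Carlo runs and nonparametric tests.

```python
import math
import random

random.seed(0)

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def mc_samples(logits, n_samples=50, noise=0.3):
    """Stand-in for Monte Carlo sampling: run the stochastic model
    n_samples times and collect one softmax vector per run."""
    out = []
    for _ in range(n_samples):
        noisy = [z + random.gauss(0.0, noise) for z in logits]
        out.append(softmax(noisy))
    return out

def report_or_abstain(samples, conf_threshold=0.7, spread_threshold=0.1):
    """Report the top class only when the mean softmax is confident and
    that class's probability is stable across samples; otherwise abstain.
    Tightening the thresholds raises accuracy but lowers coverage."""
    n_classes = len(samples[0])
    means = [sum(s[c] for s in samples) / len(samples) for c in range(n_classes)]
    top = max(range(n_classes), key=lambda c: means[c])
    col = [s[top] for s in samples]
    mu = means[top]
    spread = math.sqrt(sum((p - mu) ** 2 for p in col) / len(col))
    if mu >= conf_threshold and spread <= spread_threshold:
        return top, mu
    return None, mu  # abstain: the review is not reported on

label, conf = report_or_abstain(mc_samples([2.0, 0.2, -1.0]))
```

Here the clearly separated logits yield a confident, stable class 0, so the classification is reported; flatter or noisier logits would trip the abstention branch instead.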
Additionally, Pensieve has self-training capabilities. When review classifications are validated by a human, they are used to further train the model, and if the new model weights pass an added layer of tests, the model is updated, increasing the scope and accuracy of the classifications. Failure scenarios are also in place to handle poor-quality data and cases where the model stops performing as expected.
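The promotion gate described above might look something like this sketch. The function names, the accuracy floor, and the no-regression rule are hypothetical placeholders for Pensieve's "added layer of tests," not its real API.

```python
def should_promote(current_acc, candidate_acc, floor=0.75, min_gain=0.0):
    """Gate a retrained model: reject candidates that fail an absolute
    accuracy floor (e.g. weights trained on poor data) or that regress
    the currently deployed model on a held-out set."""
    if candidate_acc < floor:
        return False
    return candidate_acc >= current_acc + min_gain

def next_model(current, candidate):
    """Fail-safe update: keep the current (name, accuracy) pair unless
    the candidate passes the promotion gate."""
    _, cur_acc = current
    _, cand_acc = candidate
    return candidate if should_promote(cur_acc, cand_acc) else current
```

A candidate that underperforms is simply discarded, so a bad batch of validated data can never silently replace a working model.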
Megan Yetman is a machine learning engineer at the Center for Machine Learning at Capital One. Megan has production experience with natural language processing and neural networks as well as data migration and data science. She holds a BA and MS in statistics from the University of Virginia.
©2018, O'Reilly Media, Inc.