Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data, focusing on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs.
Roger, Nina, Sudipto, and Ryan begin by discussing unsupervised machine learning algorithms that they have extended to operate on streams of data, which requires the machine learning model to continuously “evolve” as data streams through the system. The first example is Robust Random Cut Forest (RRCF), an anomaly detection algorithm that learns continuously from each new data record and emits a high anomaly score when it detects an outlier. The algorithm learns what “normal” looks like and evolves this model as new data streams in. They also discuss stream clustering, which reveals the internal structure of a data stream by performing fast incremental clustering of records and constantly adapting to changes in the underlying data, and share a new method to identify anomalies in directed graphs streaming in at high rates. Practical applications include detecting anomalies in flow logs, such as denial of service attacks, port scans, and inter-VPC attacks. They conclude with techniques that are common to all of these machine learning algorithms.
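As a rough illustration of incremental clustering that adapts as records arrive, the following is a minimal sketch of a generic one-pass, leader-style clusterer. It is not the speakers' algorithm; the class name OnlineClusterer and the radius threshold are assumptions introduced here for illustration only.

```python
import math


class OnlineClusterer:
    """One-pass clustering sketch: each record is seen once and never stored."""

    def __init__(self, radius):
        self.radius = radius   # hypothetical threshold for opening a new cluster
        self.centroids = []    # running mean of each cluster
        self.counts = []       # number of records absorbed by each cluster

    def update(self, point):
        """Assign point to the nearest cluster, or open a new one; return its index."""
        best, best_dist = None, math.inf
        for i, centroid in enumerate(self.centroids):
            d = math.dist(point, centroid)
            if d < best_dist:
                best, best_dist = i, d

        # No cluster yet, or the record is too far from all of them.
        if best is None or best_dist > self.radius:
            self.centroids.append(list(point))
            self.counts.append(1)
            return len(self.centroids) - 1

        # Shift the centroid toward the new record (incremental mean),
        # so the cluster follows gradual changes in the stream.
        self.counts[best] += 1
        n = self.counts[best]
        centroid = self.centroids[best]
        for j, x in enumerate(point):
            centroid[j] += (x - centroid[j]) / n
        return best


clusterer = OnlineClusterer(radius=1.0)
stream = [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9), (0.1, -0.1)]
labels = [clusterer.update(p) for p in stream]
```

Each update costs time proportional to the number of clusters, not the number of records seen, which is what makes this style of clustering usable on an unbounded stream.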
Along the way, they also explore functions powered by machine learning that give customers insights into their data. Explainable machine learning has been a common customer request. Roger, Nina, Sudipto, and Ryan describe an enhanced anomaly detection function that returns an anomaly score for every data record and identifies exactly which fields in the record contributed to that score, each field's contribution factor (1–100), and how each value changed. They also explain how the function lets a customer flag false alarms or specify when they want to be alerted. The function then takes this user feedback as training data and learns to eliminate false positives or to automatically classify anomalies. Roger, Nina, Sudipto, and Ryan conclude with a discussion of how these algorithms are implemented and provided to customers in Kinesis Analytics, actual customer applications and success stories, and a live demo.
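To make the idea of anomaly scores with per-field attribution concrete, here is a minimal sketch that substitutes a much simpler technique (per-field running z-scores) for the function described above. It is not the Kinesis Analytics implementation. The class names, the field names, and the choice to express each field's contribution as a percentage share of the squared deviation are all assumptions made for illustration.

```python
import math


class FieldStats:
    """Running mean and variance for one field (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


class AttributingDetector:
    """Scores each record against what it has seen so far, then learns from it."""

    def __init__(self, fields):
        self.stats = {f: FieldStats() for f in fields}

    def score(self, record):
        """Return (anomaly_score, {field: (percent_share, direction)})."""
        z = {}
        for f, s in self.stats.items():
            sd = s.std()
            z[f] = (record[f] - s.mean) / sd if sd > 0 else 0.0

        total = sum(v * v for v in z.values())
        attribution = {}
        for f, v in z.items():
            # Assumed convention: shares of the squared deviation, summing to 100.
            share = 100.0 * v * v / total if total > 0 else 0.0
            direction = "higher" if v > 0 else "lower" if v < 0 else "unchanged"
            attribution[f] = (round(share, 1), direction)

        # Update the model only after scoring, so a record cannot mask itself.
        for f, s in self.stats.items():
            s.update(record[f])
        return math.sqrt(total), attribution


detector = AttributingDetector(["bytes", "packets"])
normal = [(100, 10), (102, 11), (98, 9), (101, 10), (99, 10)]
for b, p in normal:
    detector.score({"bytes": b, "packets": p})

# A record whose byte count jumps while the packet count stays typical.
spike_score, spike_attr = detector.score({"bytes": 500, "packets": 10})
```

In this sketch, the spike record should be attributed almost entirely to the bytes field, with direction "higher". Learning from user feedback, as described in the session, is not modeled here.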
Roger Barga is general manager and director of development at Amazon Web Services, where he is responsible for Kinesis data streaming services. Previously, Roger was in the Cloud Machine Learning Group at Microsoft, where he was responsible for product management of the Azure Machine Learning service. Roger is also an affiliate professor at the University of Washington, where he is a lecturer in the Data Science and Machine Learning programs. Roger holds a PhD in computer science, has been granted over 30 patents, has published over 100 peer-reviewed technical papers and book chapters, and has authored a book on predictive analytics.
Nina Mishra is principal scientist at Amazon Web Services, where she focuses on data science, data mining, web search, machine learning, and privacy. Nina has many years of experience leading projects at Amazon, Microsoft Research, and HP Labs. She was also an associate professor at the University of Virginia and an acting faculty member at Stanford University. Nina’s research encompasses the design and evaluation of new data mining algorithms on real, colossal-sized datasets. She has authored almost 50 publications in top venues, including WWW, WSDM, SIGIR, ICML, NIPS, AAAI, COLT, VLDB, PODS, CRYPTO, EUROCRYPT, FOCS, and SODA, which have been recognized with best paper award nominations. Nina’s research was central to the Bing search engine and has been widely featured in external press coverage. Nina holds 14 patents with a dozen more still in the application stage. She has had the distinct privilege of helping others advance in their careers, including 15 summer interns and many full-time researchers. Nina’s service to the community includes serving on the editorial boards of Machine Learning, the Journal of Privacy and Confidentiality, IEEE Transactions on Knowledge and Data Engineering, and IEEE Intelligent Systems; chairing the premier machine learning conference ICML in 2003; and serving on numerous program committees for web search, data mining, and machine learning conferences. She was awarded an NSF grant as a principal investigator and has served on eight PhD dissertation committees.
Sudipto Guha is principal scientist at Amazon Web Services, where he studies the design and implementation of a wide range of computational systems, from resource-constrained devices, such as sensors, to massively parallel and distributed systems. Using an algorithmic framework, Sudipto seeks to design systems that are correct, efficient, and optimized despite their bidirectional asymptotic scale and seeming lack of similarity to human information processes. His recent work focuses on clustering and location theory, statistics and learning theory, database query optimization and mining, approximation algorithms for stochastic control, communication complexity, and data stream algorithms.
Ryan Nienhuis is a senior technical product manager on the Amazon Kinesis team, where he defines products and features that make it easier for customers to work with real-time, streaming data in the cloud. Previously, Ryan worked at Deloitte Consulting, helping customers in banking and insurance solve their data architecture and real-time processing problems. Ryan holds a BE from Virginia Tech.
©2018, O'Reilly Media, Inc.