Machine learning for streaming data: practical insights
Who is this presentation for?
This presentation is for machine learning engineers, data scientists, and software developers interested in applying machine learning to continuous flows of data.
Prerequisite knowledge
An intermediate understanding of machine learning for batch data. An intermediate understanding of Spark and Scala is also desirable.
What you'll learn
In many domains, data is generated at a fast pace. A clear example is Internet of Things (IoT) applications, where connected sensors yield large amounts of data in short periods.
To build predictive models from this data, we need to either settle for traditional offline learning or attempt to learn from the data incrementally.
A significant drawback of the offline learning approach is that it is slow to react to changes in the domain. These changes can have a catastrophic impact on the model's predictive performance, since the patterns on which the model was trained are no longer valid.
An online approach, where the model is trained incrementally, can potentially mitigate this issue. However, the untold story is that the existing challenges for offline learning are still present (and are even amplified) when processing the data online.
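To make the idea of incremental training concrete, here is a minimal sketch in plain Python. It does not use StreamDM or its API; the class `OnlineLogisticRegression`, the generator `synthetic_stream`, and the helper `prequential_accuracy` are hypothetical names introduced only for illustration. The model is updated one example at a time, and each example is used for testing before it is used for training (a test-then-train evaluation), so no separate offline training phase is needed.

```python
import math
import random


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time with SGD (illustrative)."""

    def __init__(self, n_features, learning_rate=0.5):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def learn_one(self, x, y):
        # Single gradient step on the log loss for one labeled example.
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


def synthetic_stream(n, seed=0):
    """Hypothetical data stream: the label is 1 when x[0] > x[1]."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.random(), rng.random()]
        yield x, (1 if x[0] > x[1] else 0)


def prequential_accuracy(model, stream):
    """Test-then-train: predict on each example first, then learn from it."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)
        model.learn_one(x, y)
        total += 1
    return correct / total


accuracy = prequential_accuracy(OnlineLogisticRegression(2), synthetic_stream(5000))
print(f"prequential accuracy: {accuracy:.3f}")
```

Because the model never stores the stream, memory use stays constant regardless of how many examples arrive, which is the property that makes this style of learning suitable for continuous data.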
These challenges include, but are not limited to: (i) raw data preprocessing; (ii) efficient incremental updates to models; (iii) algorithms to detect changes and react to them; and (iv) dealing with large amounts of unlabeled data and data whose labels arrive with a delay.
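As an illustration of challenge (iii), the sketch below flags a change when the error rate over a recent window exceeds the error rate of a reference window by a fixed margin. This is a deliberately simplified heuristic written in plain Python, not one of the established drift detection algorithms and not part of StreamDM; the class name `WindowedDriftDetector`, the helper `first_drift_index`, and the default `window_size` and `threshold` values are assumptions chosen for the example.

```python
from collections import deque


class WindowedDriftDetector:
    """Flags drift when the recent error rate exceeds the reference rate by `threshold`."""

    def __init__(self, window_size=100, threshold=0.2):
        self.window_size = window_size
        self.threshold = threshold
        self.reference = deque(maxlen=window_size)  # baseline errors, filled once
        self.recent = deque(maxlen=window_size)     # sliding window of latest errors

    def update(self, error):
        """Add one 0/1 prediction error; return True if drift is flagged."""
        if len(self.reference) < self.window_size:
            self.reference.append(error)
            return False
        self.recent.append(error)
        if len(self.recent) < self.window_size:
            return False
        reference_rate = sum(self.reference) / self.window_size
        recent_rate = sum(self.recent) / self.window_size
        if recent_rate - reference_rate > self.threshold:
            # Start over so that a new baseline is collected after the change.
            self.reference.clear()
            self.recent.clear()
            return True
        return False


def first_drift_index(errors, detector):
    """Return the position in `errors` where drift is first flagged, or None."""
    for i, error in enumerate(errors):
        if detector.update(error):
            return i
    return None


# Hypothetical error sequence: 10% errors for 500 steps, then 50% errors.
errors = [1 if i % 10 == 0 else 0 for i in range(500)] + [i % 2 for i in range(500)]
print(first_drift_index(errors, WindowedDriftDetector()))
```

A typical reaction to a flagged change is to reset or retrain the model on recent data. The window size trades detection delay against sensitivity to noise, so both parameters would need tuning for a real stream.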
In this talk, we are going to show how a machine learning pipeline for streaming data can be developed in the StreamDM framework (https://github.com/huawei-noah/streamDM).
We are not going to present how we applied a specific algorithm to proprietary data or give a lecture on theoretical problems related to machine learning for data streams.
Our focus is to show developers how to apply StreamDM to their data streams and how to extend the framework to accommodate their needs.
Heitor Murilo Gomes
Heitor Murilo Gomes is a researcher at Télécom ParisTech focusing on machine learning—particularly evolving data streams, concept drift, ensemble methods, and big data streams. He co-leads the StreamDM open data stream mining project.
Albert Bifet is a professor at LTCI and head of the Data, Intelligence, and Graphs (DIG) Group at Télécom ParisTech, and a scientific collaborator at École Polytechnique. A big data scientist with 10+ years of international experience in research, Albert has led new open source software projects for business analytics, data mining, and machine learning at Huawei, Yahoo, the University of Waikato, and UPC. At Yahoo Labs, he cofounded Apache SAMOA (Scalable Advanced Massive Online Analysis), a distributed streaming machine learning framework that contains a programming abstraction for distributed streaming ML algorithms. At the WEKA Machine Learning Group, he co-led MOA (Massive Online Analysis), the most popular open source framework for data stream mining, with more than 20,000 downloads each year. Albert is the author of Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams and the editor of the Big Data Mining special issue of SIGKDD Explorations in 2012. He was cochair of the industrial track at ECML PKDD 2015, BigMine (2014, 2013, 2012), and the data streams track at ACM SAC (2015, 2014, 2013, 2012). He holds a PhD from BarcelonaTech.
View a complete list of Strata Data Conference contacts