Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Real-time image classification: Using convolutional neural networks on real-time streaming data

Josh Patterson (Skymind), Kirit Basu (StreamSets)
2:55pm–3:35pm Thursday, September 28, 2017
Data science & advanced analytics, Machine Learning & Data Science
Location: 1A 12/14 Level: Intermediate
Secondary topics: Deep learning, Streaming

Who is this presentation for?

  • Data scientists and architects

Prerequisite knowledge

  • A basic understanding of deep learning and how to configure an enterprise tool such as StreamSets

What you'll learn

  • How to use prebuilt convolutional models in streaming systems to classify images in real time


Building a data lake with unstructured data such as audio and images is challenging. You have to bring in useful data while keeping out data that doesn’t serve your goals. For example, image data captured for research purposes over the years is useful and needs further analysis, but corporate vacation photos on the same servers don’t need to make it into the analytics cluster. Unfortunately, no one thought to classify those images as they accumulated, and now you have several petabytes of data to sort through.

Convolutional neural nets are a deep learning technology used to automatically classify the content of an image based on previously trained models. Josh Patterson and Kirit Basu explain how some of the most sophisticated big data deployments are using convolutional neural nets to automatically classify images and add rich context about their content, in real time, while ingesting data at scale. Specifically, they demonstrate how to use Skymind’s DL4J with the pretrained VGG16 model to classify images and how StreamSets Data Collector can execute these machine learning models while ingesting image data at scale to populate the data lake. You’ll learn how to do all this by designing a dataflow pipeline using a drag-and-drop UI and writing a few lines of code.
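
The classify-while-ingesting pattern described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the speakers' pipeline: ImageRecord, enrich_and_route, and stub_classifier are hypothetical names, the labels are made up, and the stub stands in for a real pretrained model such as VGG16. No DL4J or StreamSets APIs are used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, Set, Tuple


@dataclass
class ImageRecord:
    """One image flowing through the pipeline (hypothetical record shape)."""
    path: str
    pixels: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


# A classifier maps raw image bytes to (label, confidence).
Classifier = Callable[[bytes], Tuple[str, float]]


def enrich_and_route(
    records: Iterable[ImageRecord],
    classify: Classifier,
    wanted_labels: Set[str],
    min_confidence: float = 0.5,
) -> Iterator[ImageRecord]:
    """Tag every record with its predicted label, then pass along only
    the records worth loading into the data lake."""
    for record in records:
        label, confidence = classify(record.pixels)
        # Enrichment: attach the prediction as record metadata.
        record.attributes["label"] = label
        record.attributes["confidence"] = f"{confidence:.2f}"
        # Routing: keep wanted, confidently classified images only.
        if label in wanted_labels and confidence >= min_confidence:
            yield record


def stub_classifier(pixels: bytes) -> Tuple[str, float]:
    """Assumption: placeholder for a pretrained CNN. It inspects a fake
    byte prefix instead of running real inference."""
    if pixels.startswith(b"LAB"):
        return ("microscope_slide", 0.91)
    return ("beach", 0.88)


if __name__ == "__main__":
    incoming = [
        ImageRecord("research/slide_001.png", b"LAB001"),
        ImageRecord("offsite/vacation_17.jpg", b"VAC017"),
    ]
    for kept in enrich_and_route(incoming, stub_classifier, {"microscope_slide"}):
        print(kept.path, kept.attributes)
```

Because the function is a generator, records are classified one at a time as they arrive rather than in a batch, which is the property a streaming ingest needs. Swapping stub_classifier for a call into a real model leaves the routing logic unchanged.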

Josh Patterson


Josh Patterson is the director of field engineering for Skymind. Previously, Josh ran a big data consultancy, worked as a principal solutions architect at Cloudera, and was an engineer at the Tennessee Valley Authority, where he was responsible for bringing Hadoop into the smart grid during his involvement in the openPDC project. Josh is a cofounder of the DL4J open source deep learning project and is a coauthor of Deep Learning: A Practitioner’s Approach. Josh has over 15 years’ experience in software development and continues to contribute to projects such as DL4J, Canova, Apache Mahout, Metronome, IterativeReduce, openPDC, and JMotif. Josh holds a master’s degree in computer science from the University of Tennessee at Chattanooga, where he did research in mesh networks and social insect swarm algorithms.

Kirit Basu


Kirit Basu is director of product management at StreamSets.