Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Real-time image classification: Using convolutional neural networks on real-time streaming data

Josh Patterson (Patterson Consulting), Kirit Basu (StreamSets)
2:55pm–3:35pm Thursday, September 28, 2017
Data science & advanced analytics, Machine Learning & Data Science
Location: 1A 12/14
Level: Intermediate
Secondary topics: Deep learning, Streaming

Who is this presentation for?

  • Data scientists and architects

Prerequisite knowledge

  • A basic understanding of deep learning and how to configure an enterprise tool such as StreamSets

What you'll learn

  • How to use prebuilt convolutional models in streaming systems to make real-time classifications


Building a data lake with unstructured data such as audio and images is always challenging. You have to be able to bring in useful data and limit data that doesn’t serve your goals. For example, image data that was captured for research purposes over the years is useful and needs further analysis, but corporate vacation photos on the same servers don’t need to make it into the analytics cluster. Unfortunately, no one thought to classify those images over time and now you have several petabytes of data to sort through.

Convolutional neural networks are a deep learning technique for automatically classifying the content of an image using previously trained models. Josh Patterson and Kirit Basu explain how some of the most sophisticated big data deployments use convolutional neural networks to classify images and add rich context about their content, in real time, while ingesting data at scale. Specifically, they demonstrate how to use Skymind’s DL4J with the VGG16 model to classify images and how StreamSets Data Collector can execute these machine learning models while ingesting image data at scale to populate the data lake. You’ll learn how to do all this by designing a dataflow pipeline using a drag-and-drop UI and writing a few lines of code.
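
As a rough illustration of the idea described above (classify each image as it is ingested, attach the result as context, and keep only what belongs in the data lake), here is a minimal Python sketch. It is not the speakers' pipeline and uses neither the DL4J nor the StreamSets API: `ImageRecord`, `stub_classifier`, and `classify_stream` are hypothetical names, and the classifier is a placeholder standing in for a pretrained model such as VGG16.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, Set, Tuple


@dataclass
class ImageRecord:
    """One image moving through the pipeline (hypothetical record shape)."""
    path: str
    pixels: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


# A classifier maps raw image bytes to a (label, confidence) pair.
Classifier = Callable[[bytes], Tuple[str, float]]


def stub_classifier(pixels: bytes) -> Tuple[str, float]:
    """Placeholder for a pretrained CNN; NOT a real model.

    It only looks at the first byte so the sketch runs without any
    deep learning dependencies.
    """
    if pixels[:1] == b"R":
        return "research", 0.9
    return "vacation", 0.8


def classify_stream(
    records: Iterable[ImageRecord],
    classifier: Classifier,
    keep_labels: Set[str],
    min_confidence: float = 0.5,
) -> Iterator[ImageRecord]:
    """Tag each record with its predicted label and yield only wanted ones."""
    for record in records:
        label, confidence = classifier(record.pixels)
        # Attach the classification as context on the record itself.
        record.attributes["label"] = label
        record.attributes["confidence"] = f"{confidence:.2f}"
        # Only records with a wanted label continue on to the data lake.
        if label in keep_labels and confidence >= min_confidence:
            yield record


incoming = [
    ImageRecord("scan_001.png", b"R-fake-bytes"),
    ImageRecord("beach.png", b"V-fake-bytes"),
]
kept = list(classify_stream(incoming, stub_classifier, {"research"}))
print([r.path for r in kept])  # prints ['scan_001.png']
```

Because `classify_stream` is a generator, records are processed one at a time as they arrive rather than in a batch. In a real deployment the placeholder would be replaced by a call into an actual trained model.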

Josh Patterson

Patterson Consulting

Josh Patterson is CEO of Patterson Consulting, a solution integrator at the intersection of big data and applied machine learning. In this role, he brings his unique perspective blending a decade of big data experience and wide-ranging deep learning experience to Fortune 500 projects. At the Tennessee Valley Authority (TVA), Josh drove the integration of Apache Hadoop for large-scale data storage and processing of smart grid phasor measurement unit (PMU) data. Post-TVA, Josh was a principal solutions architect for a young Hadoop startup named Cloudera (CLDR), as employee 34. After leaving Cloudera, Josh co-founded the Deeplearning4j project and co-wrote Deep Learning: A Practitioner’s Approach (O’Reilly Media). Josh was also the VP of field engineering at Skymind and co-wrote the upcoming O’Reilly book “Kubeflow Operations.”

Kirit Basu

StreamSets

Kirit Basu is director of product management at StreamSets.