Building a data lake from unstructured data such as audio and images is challenging. You have to bring in the useful data while keeping out data that doesn’t serve your goals. For example, image data captured for research purposes over the years is useful and needs further analysis, but corporate vacation photos on the same servers don’t need to make it into the analytics cluster. Unfortunately, no one thought to classify those images as they accumulated, and now you have several petabytes of data to sort through.
Convolutional neural nets are a deep learning technique for automatically classifying the content of an image using previously trained models. Josh Patterson and Kirit Basu explain how some of the most sophisticated big data deployments are using convolutional neural nets to classify images and add rich context about their content, in real time, while ingesting data at scale. Specifically, they demonstrate how to use Skymind’s DL4J with the pretrained VGG16 model to classify images and how StreamSets Data Collector can execute these machine learning models while ingesting image data at scale to populate the data lake. You’ll learn how to do all this by designing a dataflow pipeline using a drag-and-drop UI and writing a few lines of code.
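As a rough illustration of the routing idea described above (classify each image during ingest, then admit only relevant images to the data lake), here is a minimal Java sketch. It is not the speakers' pipeline and makes no DL4J or StreamSets calls. The `Classifier` interface, the `route` method, the label names, and the confidence threshold are all hypothetical. In a real pipeline the classifier would presumably be backed by a pretrained model such as VGG16.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical routing step: keep an image only if its top predicted label is wanted. */
final class ImageRouter {

    /** Stand-in for a pretrained model (assumption): maps image bytes to label -> probability. */
    interface Classifier {
        Map<String, Double> classify(byte[] imageBytes);
    }

    static final String DATA_LAKE = "data-lake";
    static final String DISCARD = "discard";

    /** Returns the highest-probability prediction, or empty if there are no predictions. */
    static Optional<Map.Entry<String, Double>> topPrediction(Map<String, Double> predictions) {
        return predictions.entrySet().stream().max(Map.Entry.comparingByValue());
    }

    /**
     * Routes an image to the data lake when its top label is on the allow-list
     * and the model is at least minConfidence sure; otherwise discards it.
     */
    static String route(byte[] imageBytes, Classifier classifier,
                        Set<String> wantedLabels, double minConfidence) {
        return topPrediction(classifier.classify(imageBytes))
                .filter(top -> top.getValue() >= minConfidence)
                .filter(top -> wantedLabels.contains(top.getKey()))
                .map(top -> DATA_LAKE)
                .orElse(DISCARD);
    }

    public static void main(String[] args) {
        // Stub classifier with made-up labels and scores, used only for demonstration.
        Classifier stub = bytes -> Map.of("microscope", 0.91, "seashore", 0.05);
        System.out.println(route(new byte[0], stub, Set.of("microscope"), 0.5));
    }
}
```

Keeping the classifier behind a small interface means the routing rule can be tested with a stub, independently of whichever model or ingest tool eventually supplies the predictions.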
Josh Patterson is the director of field engineering for Skymind. Previously, Josh ran a big data consultancy, worked as a principal solutions architect at Cloudera, and was an engineer at the Tennessee Valley Authority, where he was responsible for bringing Hadoop into the smart grid during his involvement in the openPDC project. Josh is a cofounder of the DL4J open source deep learning project and is a coauthor of Deep Learning: A Practitioner’s Approach. Josh has over 15 years’ experience in software development and continues to contribute to projects such as DL4J, Canova, Apache Mahout, Metronome, IterativeReduce, openPDC, and JMotif. Josh holds a master’s degree in computer science from the University of Tennessee at Chattanooga, where he did research in mesh networks and social insect swarm algorithms.
Kirit Basu is director of product management at StreamSets.
©2017, O'Reilly Media, Inc.