Building a data lake with unstructured data such as audio and images is always challenging. You have to be able to bring in useful data and keep out data that doesn’t serve your goals. For example, image data that was captured for research purposes over the years is useful and needs further analysis, but corporate vacation photos on the same servers don’t need to make it into the analytics cluster. Unfortunately, no one classified those images as they accumulated, and now you have several petabytes of data to sort through.
Convolutional neural networks are a deep learning technique that can automatically classify the content of an image using previously trained models. Josh Patterson and Kirit Basu explain how some of the most sophisticated big data deployments use convolutional neural networks to classify images and add rich context about their content, in real time, while ingesting data at scale. Specifically, they demonstrate how to classify images with Skymind’s Deeplearning4j (DL4J) and the VGG16 model, and how StreamSets Data Collector can execute these machine learning models while ingesting image data at scale to populate the data lake. You’ll learn how to do all this by designing a dataflow pipeline in a drag-and-drop UI and writing a few lines of code.
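The classify-while-ingesting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speakers' pipeline: `ImageRecord`, `enrich`, `route`, and `stub_classifier` are hypothetical names, and the stub stands in for a trained model such as VGG16. It uses neither DL4J nor StreamSets Data Collector.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List, Set, Tuple

# A classifier maps raw image bytes to (label, confidence) pairs.
Classifier = Callable[[bytes], List[Tuple[str, float]]]


@dataclass
class ImageRecord:
    """One image moving through the pipeline (hypothetical record shape)."""
    path: str
    data: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


def enrich(records: Iterable[ImageRecord], classify: Classifier,
           top_k: int = 3) -> Iterator[ImageRecord]:
    """Attach the top-k predicted labels to each record as it streams through."""
    for record in records:
        predictions = sorted(classify(record.data), key=lambda p: p[1], reverse=True)
        for rank, (label, confidence) in enumerate(predictions[:top_k], start=1):
            record.attributes[f"label_{rank}"] = label
            record.attributes[f"confidence_{rank}"] = f"{confidence:.3f}"
        yield record


def route(records: Iterable[ImageRecord], keep_labels: Set[str],
          min_confidence: float = 0.5) -> Tuple[List[ImageRecord], List[ImageRecord]]:
    """Split records into (keep, skip) based on the top prediction."""
    keep, skip = [], []
    for record in records:
        label = record.attributes.get("label_1")
        confidence = float(record.attributes.get("confidence_1", "0"))
        if label in keep_labels and confidence >= min_confidence:
            keep.append(record)
        else:
            skip.append(record)
    return keep, skip


def stub_classifier(data: bytes) -> List[Tuple[str, float]]:
    """Assumption: a stand-in for a trained model; it only inspects a byte prefix."""
    if data.startswith(b"LAB"):
        return [("microscope", 0.91), ("petri dish", 0.06)]
    return [("seashore", 0.84), ("sandbar", 0.10)]
```

In this sketch, records whose top label is in `keep_labels` would go on to the data lake with their labels attached as attributes, and the rest would be set aside. Swapping `stub_classifier` for a call into a real model is the only change the routing logic should need.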
Josh Patterson is CEO of Patterson Consulting, a solution integrator at the intersection of big data and applied machine learning. In this role, he brings a perspective that blends a decade of big data experience with wide-ranging deep learning experience to Fortune 500 projects. At the Tennessee Valley Authority (TVA), Josh drove the integration of Apache Hadoop for large-scale storage and processing of smart grid phasor measurement unit (PMU) data. After TVA, Josh joined Cloudera (CLDR), then a young Hadoop startup, as a principal solutions architect and employee 34. After leaving Cloudera, he co-founded the Deeplearning4j project and co-wrote Deep Learning: A Practitioner’s Approach (O’Reilly Media). He was also VP of field engineering at Skymind and co-wrote the upcoming O’Reilly book “Kubeflow Operations.”
Kirit Basu is director of product management at StreamSets.