Rigorous improvement of an image recognition model often requires multiple iterations of eyeballing outliers, inspecting statistics of the output labels, then modifying and retraining the model. When the test data reaches petabyte scale, simply retrieving all the images that have been assigned a specific label becomes a technical challenge in its own right.
Marton Balassi, Mirko Kämpf, and Jan Kunigk share a solution that automates the process of running the model on the testing data and populating an index of the labels so they become searchable. Images and labels are stored in HBase. The model is encapsulated in a PySpark program, while the images are indexed with Solr and can be accessed from a Hue dashboard.
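To make the idea of a searchable label index concrete, here is a minimal in-memory sketch. It is not the speakers' implementation: plain Python structures stand in for HBase, PySpark, and Solr, and every name in it (`build_label_index`, `search`, `toy_classify`, the sample image IDs) is hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple


def build_label_index(
    images: Iterable[Tuple[str, bytes]],
    classify: Callable[[bytes], List[str]],
) -> Dict[str, Set[str]]:
    """Run the model on every image and build an inverted index: label -> image IDs."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for image_id, payload in images:
        for label in classify(payload):
            index[label].add(image_id)
    return dict(index)


def search(index: Dict[str, Set[str]], label: str) -> List[str]:
    """Return the IDs of all images assigned the given label, in sorted order."""
    return sorted(index.get(label, set()))


def toy_classify(payload: bytes) -> List[str]:
    """Hypothetical stand-in for the real model: 'detects' labels by substring match."""
    labels = [name for name in ("cat", "dog") if name.encode() in payload]
    return labels or ["unknown"]


# Assumed sample data; in the talk's setup the images would come from HBase.
IMAGES = [
    ("img-001", b"cat"),
    ("img-002", b"dog"),
    ("img-003", b"cat dog"),
    ("img-004", b"tree"),
]

INDEX = build_label_index(IMAGES, toy_classify)

if __name__ == "__main__":
    for label in ("cat", "dog", "unknown"):
        print(label, search(INDEX, label))
```

After the model is modified and retrained, rebuilding the index and querying the same labels again gives a quick way to see which images changed class. At scale, the classification loop would presumably be distributed and the index held in a search engine rather than in a dictionary.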
Marton Balassi is a solutions architect at Cloudera, where he focuses on data science and stream processing with big data tools. Marton is a PMC member at Apache Flink and a regular contributor to open source. He is a frequent speaker at big data-related conferences and meetups, including Hadoop Summit, Spark Summit, and Apache Big Data.
Mirko Kämpf is a solutions architect on the CEMEA team at Cloudera, where he applies tools from the Hadoop ecosystem, such as Spark, HBase, and Solr, to solve customers’ problems. He is also working on graph-based knowledge representation using Apache Jena to enable semantic search at scale. Mirko’s research focuses on time-dependent networks and time series analysis at scale. He loves to deliver data-centric workshops and has spoken at several big data-related conferences and meetups. He holds a PhD in statistical physics.
Jan Kunigk has worked on enterprise Hadoop solutions since 2010. Before joining Cloudera in 2014, he built optimized systems architectures for Hadoop at IBM and implemented a Hadoop-as-a-service offering at T-Systems. In his current role as a solutions architect, he makes Hadoop projects at Cloudera’s enterprise customers successful on a day-to-day basis, covering everything from architectural decisions to the implementation of big data applications across all industry sectors.