Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Improving computer vision models at scale

Marton Balassi (Cloudera), Mirko Kämpf (Cloudera), Jan Kunigk (Cloudera)
14:55–15:35 Thursday, 24 May 2018
Data engineering and architecture
Location: Capital Suite 2/3 Level: Intermediate
Average rating: 5.00 (2 ratings)

Who is this presentation for?

  • Architects and data scientists

Prerequisite knowledge

  • A basic understanding of Hadoop
  • Familiarity with Python and Scala (useful but not required)

What you'll learn

  • Explore a solution that automates the process of running a computer vision model on testing data and populating an index of labels so they become searchable


Rigorous improvement of an image recognition model often requires multiple iterations of eyeballing outliers, inspecting statistics of the output labels, then modifying and retraining the model. When testing data is present at the petabyte scale, the ability to seamlessly access all the images that have been assigned specific labels poses a technical challenge by itself.

Marton Balassi, Mirko Kämpf, and Jan Kunigk share a solution that automates the process of running the model on the testing data and populating an index of the labels so they become searchable. Images and labels are stored in HBase. The model is encapsulated in a PySpark program, while the images are indexed with Solr and can be accessed from a Hue dashboard.
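The flow described above (run the model over stored images, then index the resulting labels so they can be searched) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the speakers' implementation: a dictionary stands in for the HBase table, an in-memory inverted index stands in for Solr, and `toy_model`, `label_images`, and `build_label_index` are hypothetical names invented for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# Hypothetical stand-in for the image recognition model:
# maps raw image bytes to a list of labels.
Model = Callable[[bytes], List[str]]


def label_images(store: Dict[str, bytes], model: Model) -> Dict[str, List[str]]:
    """Run the model over every stored image (the role of the PySpark job)."""
    return {image_id: model(data) for image_id, data in store.items()}


def build_label_index(labels: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert image -> labels into label -> image ids (the role of the search index)."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for image_id, image_labels in labels.items():
        for label in image_labels:
            index[label].add(image_id)
    return dict(index)


def toy_model(data: bytes) -> List[str]:
    # Toy rule for illustration only; not a real classifier.
    return ["large"] if len(data) > 4 else ["small"]


# A dict stands in for the image table (assumption: row key -> image bytes).
image_store = {
    "img-001": b"\x00" * 8,
    "img-002": b"\x00" * 2,
    "img-003": b"\x00" * 16,
}

labels = label_images(image_store, toy_model)
index = build_label_index(labels)

# Look up every image that was assigned a given label.
large_images = sorted(index["large"])
```

In a real deployment, the lookup in the last line would be a query against the search index, which is what makes it practical to pull up all images carrying a specific label when inspecting outliers between training iterations.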

Marton Balassi


Marton Balassi is a solutions architect at Cloudera, where he focuses on data science and stream processing with big data tools. Marton is a PMC member at Apache Flink and a regular contributor to open source. He is a frequent speaker at big data-related conferences and meetups, including Hadoop Summit, Spark Summit, and Apache Big Data.

Mirko Kämpf


Mirko Kämpf is a solutions architect on the CEMEA team at Cloudera, where he applies tools from the Hadoop ecosystem, such as Spark, HBase, and Solr, to solve customers’ problems. He is also working on graph-based knowledge representation using Apache Jena to enable semantic search at scale. Mirko’s research focuses on time-dependent networks and time series analysis at scale. He loves to deliver data-centric workshops and has spoken at several big data-related conferences and meetups. He holds a PhD in statistical physics.

Jan Kunigk


Jan Kunigk has worked on enterprise Hadoop solutions since 2010. Before joining Cloudera in 2014, his tasks included building optimized systems architectures for Hadoop at IBM and implementing a Hadoop-as-a-service offering at T-Systems. In his current role as a solutions architect, he helps Cloudera’s enterprise customers across all industry sectors make their Hadoop projects successful, with day-to-day work that ranges from architectural decisions to the implementation of big data applications.