The case for a common metadata layer for machine learning platforms
Who is this presentation for?
- Machine learning engineers, data scientists, and data architects
With the rapid and recent rise of data science, machine learning (ML) platforms are becoming more complex. For example, consider the various Kubeflow components: distributed training, Jupyter notebooks, CI/CD, hyperparameter optimization, feature stores, and more. Each of these components produces metadata: different (versions of) datasets, different versions of Jupyter notebooks, different training parameters, test/training accuracy, different features, model serving statistics, and much more. For production use, it’s critical to have a common view across all this metadata, because we need to answer questions such as which Jupyter notebook was used to build a model currently running in production, or which models currently serving in production have to be updated when there’s new data for a given dataset.
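The kind of cross-component lineage query this implies can be sketched with a toy, in-memory metadata graph. This is only an illustration of the idea, not any existing API: the class, the vertex IDs, and the attribute names are all hypothetical.

```python
# Toy sketch of a metadata lineage graph: vertices are artifacts
# (datasets, notebooks, models), edges record "was derived from".
# All names here are hypothetical illustrations, not a real API.
from collections import defaultdict

class MetadataGraph:
    def __init__(self):
        self.vertices = {}                  # artifact id -> attribute dict
        self.downstream = defaultdict(set)  # artifact -> artifacts built from it

    def add(self, vid, **attrs):
        self.vertices[vid] = attrs

    def link(self, source, target):
        self.downstream[source].add(target)

    def affected_models(self, dataset_id):
        """Models (transitively) built from a dataset -- the ones that
        may need updating when new data arrives for that dataset."""
        seen, stack, models = set(), [dataset_id], []
        while stack:
            node = stack.pop()
            for nxt in self.downstream[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    if self.vertices[nxt]["kind"] == "model":
                        models.append(nxt)
                    stack.append(nxt)
        return sorted(models)

g = MetadataGraph()
g.add("ds:v2", kind="dataset")
g.add("nb:train", kind="notebook")
g.add("model:churn-17", kind="model", status="serving")
g.link("ds:v2", "nb:train")         # notebook consumed dataset version v2
g.link("nb:train", "model:churn-17")  # model was produced by that notebook
print(g.affected_models("ds:v2"))   # -> ['model:churn-17']
```

In a real platform this traversal would of course run inside the metadata store rather than in application memory, but the query shape is the same: start at a dataset vertex and follow derivation edges to the serving models.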
As the overall machine learning stack is still rapidly changing (and different companies choose different components for their stack), with new tools coming out every month (if not every week), it seems key to first specify a generic API that can support new and different components. Furthermore, data scientists need a simple model and an intuitive interface to query across all metadata.
Max Neunhöffer and Joerg Schad propose a first draft of a common metadata API. One of the challenges is the variety of data types arising naturally in this context: unstructured JSON data, structured data from fixed APIs, and highly nested and interlinked data that comes in the form of graphs (vertices and edges). Furthermore, the queries are highly variable and can involve complex joins, graph traversals, and index lookups. This turns out to be a case in which the relatively new breed of native multimodel databases, which support a combination of unstructured documents and a graph structure between those documents, is a perfect fit.
They demonstrate the first implementation of this API in Kubeflow using ArangoDB, a native multimodel database. ArangoDB provides a simple way to implement custom APIs using the Foxx framework, which makes it a perfect fit for such a metadata store.
Prerequisites
- A basic knowledge of machine learning (useful but not required)
What you'll learn
- Understand that the actual challenges in machine learning lie not necessarily in model building but in building a production-grade ML platform
Max Neunhöffer is a mathematician turned database developer. He’s a senior developer and architect at ArangoDB. In his academic career, he worked for 16 years on the development and implementation of new algorithms in computer algebra, where he juggled with mathematical big data like group orbits containing trillions of points. He recently returned from St. Andrews to Germany, shifted his focus to NoSQL databases, and now helps develop ArangoDB. He’s spoken at international conferences including O’Reilly Software Architecture London, J On The Beach, and MesosCon Seattle.
Jörg Schad is head of machine learning at ArangoDB. In a previous life, he built machine learning pipelines in healthcare, worked on distributed systems at Mesosphere, and developed in-memory databases. He received his PhD for research on distributed databases and data analytics. He’s a frequent speaker at meetups, international conferences, and lecture halls.