Sep 23–26, 2019

The case for a common metadata layer for machine learning platforms

Max Neunhöffer (ArangoDB), Joerg Schad (Suki)
4:35pm–5:15pm Wednesday, September 25, 2019
Location: 1A 23

Who is this presentation for?

Machine Learning Engineers, Data Scientists, Data Architects

Level

Intermediate

Description

With the rapid and recent rise of data science, the Machine Learning platforms being built are becoming more complex. Consider, for example, the various Kubeflow components: Distributed Training, Jupyter Notebooks, CI/CD, Hyperparameter Optimization, Feature Store, and more. Each of these components produces metadata: different datasets and dataset versions, different versions of Jupyter notebooks, different training parameters, test/training accuracy, different features, model serving statistics, and many more.
For production use, it is critical to have a common view across all of this metadata, because we have to answer questions such as: Which Jupyter notebook was used to build model xyz, which is currently running in production? If there is new data for a given dataset, which models (currently serving in production) have to be updated?

The overall Machine Learning stack is still changing rapidly, with new tools coming out every month (if not every week), and different companies typically choose different components for their stack. It therefore seems key to first specify a generic API that supports new and different components. Furthermore, Data Scientists need a simple model and an intuitive interface to query across all metadata.

In this talk, we propose a first draft of a common Metadata API. One challenge is the variety of data types that arise naturally in this context: rather unstructured JSON data, structured data from fixed APIs, and highly nested and interlinked data that comes in the form of graphs (vertices and edges). Furthermore, the above-mentioned queries are highly variable and can involve complex joins, graph traversals, and index lookups. This makes it a use case for which the relatively new breed of “native multi-model databases”, which combine unstructured documents with a graph structure between those documents, seems like a perfect fit.
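To make the document-plus-graph idea concrete, here is a minimal in-memory sketch in Python. It is not the API proposed in the talk and does not use ArangoDB; the class `MetadataGraph`, its methods, the vertex ids, the edge labels, and the `kind` and `in_production` fields are all hypothetical names chosen for illustration. It only shows how the two example questions from the description reduce to graph traversals over documents connected by edges.

```python
from collections import defaultdict, deque


class MetadataGraph:
    """Hypothetical toy store: vertices are JSON-like documents, edges link them."""

    def __init__(self):
        self.vertices = {}                  # vertex id -> document (dict)
        self.out_edges = defaultdict(list)  # source id -> [(label, target id)]
        self.in_edges = defaultdict(list)   # target id -> [(label, source id)]

    def add_vertex(self, vid, **doc):
        self.vertices[vid] = doc

    def add_edge(self, src, label, dst):
        # Edges point in the direction the data flows.
        self.out_edges[src].append((label, dst))
        self.in_edges[dst].append((label, src))

    def _reach(self, start, kind, edges):
        # Breadth-first traversal; collect reachable vertices of the given kind.
        seen, queue, found = {start}, deque([start]), []
        while queue:
            current = queue.popleft()
            for _label, nxt in edges.get(current, ()):
                if nxt in seen:
                    continue
                seen.add(nxt)
                queue.append(nxt)
                if self.vertices[nxt].get("kind") == kind:
                    found.append(nxt)
        return found

    def downstream(self, start, kind):
        """Vertices of `kind` derived (directly or indirectly) from `start`."""
        return self._reach(start, kind, self.out_edges)

    def upstream(self, start, kind):
        """Vertices of `kind` that `start` was (directly or indirectly) built from."""
        return self._reach(start, kind, self.in_edges)


# Illustrative data only; none of these ids come from a real system.
g = MetadataGraph()
g.add_vertex("dataset/clicks", kind="dataset", version=3)
g.add_vertex("notebook/train-v2", kind="notebook")
g.add_vertex("model/xyz", kind="model", in_production=True)
g.add_vertex("model/abc", kind="model", in_production=False)
g.add_edge("dataset/clicks", "read_by", "notebook/train-v2")
g.add_edge("notebook/train-v2", "produced", "model/xyz")
g.add_edge("notebook/train-v2", "produced", "model/abc")

# Which notebook was used to build model xyz?
notebooks = g.upstream("model/xyz", "notebook")

# New data arrived for a dataset: which production models need retraining?
stale = [m for m in g.downstream("dataset/clicks", "model")
         if g.vertices[m].get("in_production")]
```

In a multi-model database, the vertex documents would live in collections and the traversals would be expressed in the database's query language rather than in application code; the sketch only mirrors the shape of the data and of the queries.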

We demo the first implementation of this API in Kubeflow using ArangoDB, which is a native multi-model database. ArangoDB furthermore provides a simple way to implement custom APIs using the Foxx framework, which makes it a perfect fit for such a Metadata store.

Prerequisite knowledge

Basic knowledge of Machine Learning is helpful, but not required.

What you'll learn

The actual challenges in Machine Learning lie not necessarily in model building but in building a production-grade ML platform. For such a platform, we need a holistic metadata view of the different components.

Max Neunhöffer

ArangoDB

Max Neunhöffer is a mathematician turned database developer. In his academic career, he worked for 16 years on the development and implementation of new algorithms in computer algebra. During this time, he handled a lot of mathematical big data, such as group orbits containing trillions of points. He recently returned from St. Andrews to Germany, shifted his focus to NoSQL databases, and now helps to develop ArangoDB. He has spoken at international conferences including O’Reilly Software Architecture London, J On The Beach, and MesosCon Seattle.


Joerg Schad

Suki

Jörg is a Machine Learning Platform Engineer at Suki. Previously, he worked on distributed systems at Mesosphere, implemented distributed and in-memory databases, and conducted research in the Hadoop and cloud area. His speaking experience includes various meetups, international conferences, and lecture halls.
