The case for a common metadata layer for machine learning platforms
Who is this presentation for?
Machine Learning Engineers, Data Scientists, Data Architects
With the recent rapid rise of data science, machine learning platforms are becoming more complex. Consider, for example, the various Kubeflow components: distributed training, Jupyter notebooks, CI/CD, hyperparameter optimization, feature store, and more. Each of these components produces metadata: different versions of datasets, different versions of Jupyter notebooks, different training parameters, test and training accuracy, different features, model serving statistics, and much more.
For production use, it is critical to have a common view across all of this metadata, because we have to answer questions such as: Which Jupyter notebook was used to build model xyz, which is currently running in production? If new data arrives for a given dataset, which models currently serving in production have to be updated?
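The two questions above are lineage queries, and a minimal Python sketch can show their shape. Everything here is an assumption made for illustration: the artifact names, the `EDGES` list, the `IN_PRODUCTION` set, and the helper functions are hypothetical and are not part of Kubeflow or of the API proposed in the talk.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source artifact, artifact derived from it).
EDGES = [
    ("dataset/sales-v1", "notebook/train-xyz"),
    ("notebook/train-xyz", "model/xyz"),
    ("dataset/sales-v1", "notebook/train-abc"),
    ("notebook/train-abc", "model/abc"),
    ("dataset/clicks-v3", "notebook/train-old"),
    ("notebook/train-old", "model/old"),
]

# Hypothetical set of models currently serving in production.
IN_PRODUCTION = {"model/xyz", "model/abc"}


def build_adjacency(edges, reverse=False):
    """Map each node to its direct successors (or predecessors if reverse)."""
    adjacency = defaultdict(list)
    for source, target in edges:
        if reverse:
            source, target = target, source
        adjacency[source].append(target)
    return adjacency


def reachable(start, adjacency):
    """Breadth-first search: all nodes reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def notebooks_behind(model):
    """Which notebooks were used, directly or indirectly, to build this model?"""
    upstream = reachable(model, build_adjacency(EDGES, reverse=True))
    return sorted(n for n in upstream if n.startswith("notebook/"))


def production_models_affected_by(dataset):
    """Which production models depend on this dataset?"""
    downstream = reachable(dataset, build_adjacency(EDGES))
    return sorted(downstream & IN_PRODUCTION)


print(notebooks_behind("model/xyz"))
print(production_models_affected_by("dataset/sales-v1"))
```

Both questions reduce to a traversal over the same edges, once against and once along the direction of derivation, which is why a shared view of the metadata matters.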
The overall machine learning stack is still changing rapidly, with new tools coming out every month (if not every week), and different companies typically choose different components for their stack. It therefore seems key to first specify a generic API that can support new and different components. Furthermore, data scientists need a simple model and an intuitive interface to query across all metadata.
In this talk, we propose a first draft of a common metadata API. One of the challenges is the particular variety of data types that arises naturally in this context: there is rather unstructured JSON data, structured data from fixed APIs, and also highly nested and interlinked data, which comes in the form of graphs (vertices and edges). Furthermore, the above-mentioned queries are highly variable and can involve complex joins, graph traversals, and index lookups. This is therefore a use case for which the relatively new breed of “native multi-model databases”, which support a combination of unstructured documents and a graph structure between these documents, seems like a perfect fit.
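To make the combination of documents and graph structure concrete, here is a small sketch in plain Python. It is not the ArangoDB API or the proposed metadata API: the document keys, attributes, the `used_by` label, and the helper functions are all hypothetical. Only the `_from` and `_to` attribute names follow the convention ArangoDB uses for edge documents.

```python
# Hypothetical metadata documents with free-form, differing attributes.
DOCUMENTS = {
    "runs/1": {"type": "training_run", "params": {"lr": 0.01}, "test_accuracy": 0.93},
    "runs/2": {"type": "training_run", "params": {"lr": 0.1, "dropout": 0.5}, "test_accuracy": 0.71},
    "datasets/a": {"type": "dataset", "version": 3},
    "datasets/b": {"type": "dataset", "version": 1, "tags": ["raw"]},
}

# Hypothetical edges linking documents; each edge is itself a small document.
EDGES = [
    {"_from": "datasets/a", "_to": "runs/1", "label": "used_by"},
    {"_from": "datasets/b", "_to": "runs/2", "label": "used_by"},
    {"_from": "datasets/a", "_to": "runs/2", "label": "used_by"},
]


def filter_documents(doc_type, predicate):
    """Document-style lookup: keys of documents of a type matching a predicate."""
    return [
        key
        for key, doc in DOCUMENTS.items()
        if doc["type"] == doc_type and predicate(doc)
    ]


def inbound(key, label):
    """Graph-style lookup: sources of all edges with this label pointing at key."""
    return sorted(e["_from"] for e in EDGES if e["_to"] == key and e["label"] == label)


def datasets_of_weak_runs(threshold):
    """Combine both: runs below an accuracy threshold and the datasets they used."""
    weak = filter_documents(
        "training_run", lambda d: d.get("test_accuracy", 1.0) < threshold
    )
    return {run: inbound(run, label="used_by") for run in weak}


print(datasets_of_weak_runs(0.8))
```

The query mixes an attribute filter over loosely structured documents with a hop along edges, which is the kind of access pattern the paragraph above attributes to multi-model databases.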
We demo the first implementation of this API in Kubeflow using ArangoDB, a native multi-model database. ArangoDB also provides a simple way to implement custom APIs using the Foxx framework, which makes it a perfect fit for such a metadata store.
Prerequisite knowledge
Basic knowledge of machine learning is helpful, but not required.
What you'll learn
Max Neunhöffer is a mathematician turned database developer. In his academic career, he worked for 16 years on the development and implementation of new algorithms in computer algebra. During this time, he worked extensively with mathematical big data, such as group orbits containing trillions of points. He recently returned from St. Andrews to Germany, shifted his focus to NoSQL databases, and now helps to develop ArangoDB. He has spoken at international conferences including O’Reilly Software Architecture London, J On The Beach, and MesosCon Seattle.
Jörg is a Machine Learning Platform Engineer at Suki. Previously, he worked on distributed systems at Mesosphere, implemented distributed and in-memory databases, and conducted research in the Hadoop and cloud area. His speaking experience includes various meetups, international conferences, and lecture halls.