Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Faster ML over joins of tables

Arun Kumar (University of California, San Diego)
1:50pm–2:30pm Thursday, March 28, 2019
Secondary topics:  Automation in data science and big data, Storage, Streaming, realtime analytics, and IoT

Who is this presentation for?

  • Data scientists, data analysts, statisticians, ML engineers, and ML software developers



Prerequisite knowledge

  • A basic understanding of machine learning, databases, and R or Python

What you'll learn

  • Understand the benefits of query optimization in ML systems and frameworks
  • Explore new technical connections between learning theory and databases, as well as practical tools and libraries for faster ML over multitable data


Most relational/tabular datasets in real-world data-driven applications are multitable, connected by key-foreign key (KFK) relationships. Yet almost all ML training tools are designed for single-table data. This disconnect forces ML users to join all base tables and materialize a single table before training. For example, a recommender system has at least three tables: ratings, users, and products. Building, say, a content-based classifier requires joining all of them into a single table. Alas, such join materialization can blow up the size of the data, wasting memory and storage, while the data redundancy caused by joins increases the runtime of ML training, often by as much as an order of magnitude. In turn, these slowdowns can hurt ML user productivity.

Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python.

First, inspired by database query optimization, Arun shows how to “avoid joins physically” (i.e., not materialize the KFK joins but instead push ML computations down through joins to the base tables). This technique, factorized ML, can dramatically reduce memory usage and runtimes for several ML methods such as popular generalized linear models, k-means clustering, and matrix factorization. Crucially, the ML model obtained, including its accuracy, is unaffected. In the recommender systems example, this means ML executes directly on the three base tables. Arun explains how this general technique can be realized in various system environments, including in-database ML, ML on Spark, and in-memory R and Python. He also generalizes this technique to arbitrary ML methods written with bulk matrix algebra. Arun presents software prototypes in both R and Python, including sample code of a few factorized ML methods, to show how ML users can reap these benefits. This technique was adopted or explored for internal use cases by LogicBlox, Microsoft, and Google, while Oracle explored it for a banking customer’s use case. Avito of Russia is exploring the tool in Python for production ecommerce use cases.
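
To make the idea of pushing computation through a join concrete, here is a minimal NumPy sketch for a least-squares gradient over a two-table KFK join. It is an illustration under stated assumptions, not the code from the talk's prototypes: the names S_X (fact-table features), R_X (foreign-table features), fk (foreign key column, stored as row indices into R_X), and both function names are hypothetical. The factorized version never builds the joined table, yet it returns the same gradient as the materialized version.

```python
import numpy as np

def factorized_gradient(S_X, fk, R_X, y, w):
    """Least-squares gradient over the join of S and R without materializing it.

    Hypothetical names: S_X holds the fact table's own features, R_X the
    foreign table's features, and fk[i] is the R_X row that fact row i joins to.
    """
    d_S = S_X.shape[1]
    w_S, w_R = w[:d_S], w[d_S:]
    # Inner products with R's features are computed once per R row,
    # then looked up through the foreign key instead of repeated per fact row.
    pred = S_X @ w_S + (R_X @ w_R)[fk]
    resid = pred - y
    grad_S = S_X.T @ resid
    # Sum residuals per foreign-key value (a group-by), then multiply on the small table.
    resid_by_key = np.bincount(fk, weights=resid, minlength=R_X.shape[0])
    grad_R = R_X.T @ resid_by_key
    return np.concatenate([grad_S, grad_R])

def materialized_gradient(S_X, fk, R_X, y, w):
    """Baseline: materialize the join, then compute the same gradient."""
    T = np.hstack([S_X, R_X[fk]])  # joined table with redundant copies of R rows
    return T.T @ (T @ w - y)

# Synthetic data (made up for illustration): 1000 fact rows, 20 foreign rows.
rng = np.random.default_rng(0)
n_S, n_R, d_S, d_R = 1000, 20, 3, 5
S_X = rng.normal(size=(n_S, d_S))
R_X = rng.normal(size=(n_R, d_R))
fk = rng.integers(0, n_R, size=n_S)
y = rng.normal(size=n_S)
w = rng.normal(size=d_S + d_R)

g_fact = factorized_gradient(S_X, fk, R_X, y, w)
g_mat = materialized_gradient(S_X, fk, R_X, y, w)
gradients_match = np.allclose(g_fact, g_mat)
```

Because the two gradients agree up to floating-point error, gradient descent driven by either function follows the same trajectory, which is the sense in which the learned model is unaffected. The savings come from touching each row of R_X once rather than once per matching fact row.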

Second, Arun connects learning theory with KFK joins to show that in some cases, you can also “avoid joins logically.” By this, he means a rather radical capability: some of the foreign tables being joined can be ignored outright without significantly reducing ML classifier accuracy. In the recommender systems example, this means you could sometimes ignore the products table for training, for instance. Arun explains how this is even possible using the theory of the bias-variance trade-off and discusses the pros and cons for accuracy and interpretability, including how to mitigate such issues. He distills this analysis into an easy-to-understand decision rule based on the numbers of tuples in the joining tables to help ML users quickly decide based on their error tolerance if a foreign table can be avoided—without even looking into the table’s data. This technique is even more widely applicable, since it’s agnostic to both the ML classifier (linear models, trees, neural networks, etc.) and the system environment. This technique has seen adoption in practice by numerous companies, including LogicBlox, Facebook, and MakeMyTrip.
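
A rule of this kind might be sketched as follows. The abstract only says the rule is based on the numbers of tuples in the joining tables; the specific form used here (the ratio of fact-table rows to foreign-table rows compared against a threshold) and the function name can_avoid_join are assumptions for illustration, and the threshold is left as a caller-supplied parameter because the right value depends on the error tolerance.

```python
def can_avoid_join(n_fact_rows, n_foreign_rows, threshold):
    """Hypothetical tuple-ratio check for skipping a foreign table.

    Assumption: a foreign table is a candidate for being ignored when each of
    its rows is referenced by many fact rows on average, so that the foreign
    key column alone has enough training examples per distinct value.
    """
    if n_fact_rows <= 0 or n_foreign_rows <= 0:
        raise ValueError("row counts must be positive")
    tuple_ratio = n_fact_rows / n_foreign_rows
    return tuple_ratio >= threshold

# Example row counts and threshold are made up for illustration.
skip_products = can_avoid_join(1_000_000, 1_000, threshold=20.0)
skip_users = can_avoid_join(10_000, 5_000, threshold=20.0)
```

Note that the check reads only row counts, which matches the claim that the decision can be made without looking into the foreign table's data.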


Arun Kumar

University of California, San Diego

Arun Kumar is an assistant professor in the Department of Computer Science and Engineering at the University of California, San Diego. He’s a member of the Database Lab and CNS and an affiliate member of the AI Group. His primary research interests are in data management and systems for machine learning- and artificial intelligence-based data analytics. Systems and ideas based on his research have been released as part of the MADlib open source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and used internally by Facebook, LogicBlox, Microsoft, and other companies. He’s a recipient of the ACM SIGMOD 2014 Best Paper Award, the 2016 Graduate Student Research Award for the best dissertation research in UW-Madison CS, and a 2016 Google Faculty Research Award.
