Whenever efficient data access is needed, index structures are the answer, and a wide variety of choices exist to address the different needs of various access pattern. For example, B-trees are the best choice for range requests (e.g., retrieve all records in a certain timeframe); HashMaps are hard to beat in performance for key-based lookups; and, bloom filters are typically used to check for record existence. Yet all of those indexes remain general-purpose data structures, assuming the worst-case distribution of data and not taking advantage of more common patterns prevalent in real-world data.
For example, if the goal is to build a highly tuned system to store and query fixed-length records with continuous integers keys (e.g., the keys 100M to 200M), you wouldn’t use a conventional B-tree index over the keys since the key itself can be used as an offset, making it an O(1) rather than O(log n) operation to look up any key or the beginning of a range of keys. Maybe surprisingly, the same optimizations are still possible for other data patterns. In other words, knowing the exact data distributions enables highly optimizing almost any index the database system uses.
Tim Kraska explains how fundamental data structures can be enhanced using machine learning with wide-reaching implications even beyond indexes, arguing that all existing index structures can be replaced with other types of models, including deep learning models (i.e., learned indexes). The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. Initial results show that simple neural nets are able to outperform cache-optimized B-trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world datasets. More importantly though, replacing core components of a (data management) system through learned models has far reaching implications for future systems designs. To quote Steven Sinofsky, board partner at A16Z and former president at Microsoft, “This paper [about learned indexes] blew my mind. . . .ML meets 1960s data structures and crushes them.”
Tim Kraska is an associate professor of electrical engineering and computer science in MIT’s Computer Science and Artificial Intelligence Laboratory and codirector of the Data System and AI Lab at MIT (DSAIL@CSAIL). His research focuses on building systems for machine learning and using machine learning for systems. Previously, Tim was an assistant professor at Brown, spent time at Google Brain, and was a postdoc in the AMPLab at UC Berkeley after his PhD at ETH Zurich. Tim’s a 2017 Alfred P. Sloan Research Fellow in computer science and received several awards including the 2018 VLDB Early Career Research Contribution Award, the 2017 VMware Systems Research Award, an NSF CAREER Award, as well as several best paper and demo awards at VLDB and ICDE.
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org