Whenever efficient data access is needed, index structures are the answer, and a wide variety of choices exist to address the needs of different access patterns. For example, B-trees are the best choice for range requests (e.g., retrieve all records in a certain timeframe); HashMaps are hard to beat in performance for key-based lookups; and Bloom filters are typically used to check for record existence. Yet all of those indexes remain general-purpose data structures: they assume a worst-case distribution of data and do not take advantage of the more common patterns prevalent in real-world data.
For example, if the goal is to build a highly tuned system to store and query fixed-length records with continuous integer keys (e.g., the keys 100M to 200M), you wouldn’t use a conventional B-tree index over the keys, since the key itself can be used as an offset, making it an O(1) rather than O(log n) operation to look up any key or the beginning of a range of keys. Perhaps surprisingly, similar optimizations are still possible for other data patterns. In other words, knowing the exact data distribution makes it possible to highly optimize almost any index the database system uses.
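The key-as-offset idea can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the talk: the record layout (one 8-byte integer payload per key), the helper names `build_store`, `lookup`, and `range_scan`, and the small key range are all assumptions made for the example.

```python
import struct

# Assumed layout: every record is one little-endian 64-bit integer payload.
RECORD_FORMAT = "<q"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 8 bytes


def build_store(first_key, count):
    """Pack `count` fixed-length records for keys first_key, first_key+1, ..."""
    buf = bytearray()
    for key in range(first_key, first_key + count):
        buf += struct.pack(RECORD_FORMAT, key * 2)  # illustrative payload
    return bytes(buf)


def lookup(store, first_key, key):
    """O(1) lookup: the key itself determines the byte offset."""
    offset = (key - first_key) * RECORD_SIZE
    if offset < 0 or offset + RECORD_SIZE > len(store):
        raise KeyError(key)
    return struct.unpack_from(RECORD_FORMAT, store, offset)[0]


def range_scan(store, first_key, lo, hi):
    """Return payloads for keys lo..hi inclusive; the start is found in O(1)."""
    if lo > hi:
        return []
    start = (lo - first_key) * RECORD_SIZE
    end = (hi - first_key + 1) * RECORD_SIZE
    if start < 0 or end > len(store):
        raise KeyError((lo, hi))
    return [v for (v,) in struct.iter_unpack(RECORD_FORMAT, store[start:end])]
```

No tree traversal is needed here because the data distribution (dense, consecutive keys) is known in advance; the arithmetic in `lookup` plays the role the index would otherwise play.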
Tim Kraska explains how fundamental data structures can be enhanced using machine learning, with wide-reaching implications even beyond indexes. He argues that all existing index structures can be replaced with other types of models, including deep learning models (i.e., learned indexes). The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. Initial results show that simple neural nets are able to outperform cache-optimized B-trees by up to 70% in speed while saving an order of magnitude in memory over several real-world datasets. More importantly, replacing core components of a (data management) system with learned models has far-reaching implications for future system designs. To quote Steven Sinofsky, board partner at A16Z and former president at Microsoft, “This paper [about learned indexes] blew my mind. . . .ML meets 1960s data structures and crushes them.”
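To make the "model predicts the position" idea concrete, here is a deliberately simplified sketch. It swaps in a single least-squares line for the neural nets mentioned above, so it should not be read as the architecture from the talk; the class name `LinearLearnedIndex` and its methods are hypothetical. The model predicts where a key sits in a sorted array, records its worst prediction error at build time, and then searches only within that error window.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Toy learned index: a straight line fitted to (key, position) pairs."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)  # must already be sorted ascending
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")
        # Ordinary least squares for position ~ slope * key + intercept.
        mean_x = sum(self.keys) / n
        mean_y = (n - 1) / 2
        var_x = sum((x - mean_x) ** 2 for x in self.keys)
        cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(self.keys))
        self.slope = cov / var_x if var_x else 0.0
        self.intercept = mean_y - self.slope * mean_x
        # Worst-case prediction error over the stored keys bounds the search.
        self.max_err = max(
            abs(i - self._predict(x)) for i, x in enumerate(self.keys)
        )

    def _predict(self, key):
        """Predicted position, clamped to the valid index range."""
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of `key`, or None if it is not stored."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        i = bisect_left(self.keys, key, lo, hi)  # search only the window
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

The closer the keys follow the fitted line, the smaller `max_err` becomes and the less searching remains; for perfectly evenly spaced keys the window shrinks to a single slot, which recovers the key-as-offset case. Any speed or memory benefit on real data depends on how well the model fits, which this sketch does not measure.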
Tim Kraska is an associate professor of electrical engineering and computer science at MIT’s Computer Science and Artificial Intelligence Laboratory. Currently, his research focuses on building systems for machine learning and using machine learning for systems. Tim spent the majority of 2017 at Google Research, where he invented the concept of learned index structures with the MLX and Brain teams. Tim was recently selected as a 2017 Alfred P. Sloan Research Fellow in computer science. He has also received the 2017 VMware Systems Research Award, NSF CAREER Award, an Air Force Young Investigator award, two Very Large Data Bases (VLDB) conference best demo awards, and a best paper award from the IEEE International Conference on Data Engineering (ICDE).
©2018, O'Reilly Media, Inc.