Most traditional big data systems store data in a columnar format and rely on sequential scans for query processing. This approach is storage-efficient and cheap, but typical queries are slow and not suitable for online applications. Other systems, such as search engines like Elasticsearch, build inverted indices on the data, which makes simple queries faster. However, they’re not able to satisfy complex queries, and the cost is a concern for bigger datasets.
Igor Canadi and Dhruba Borthakur offer an overview of converged indexing with Rockset, a system that combines columnar and search indices to enable both low-latency and complex queries. Two ideas make converged indexing performant and cost-effective: using log-structured merge trees as the underlying storage system and utilizing the elasticity and storage hierarchy provided by the cloud environment.
Igor and Dhruba explain how Rockset maps a semistructured document into individual keys and values to be stored in RocksDBCloud’s log-structured merge (LSM) tree. In a B-tree-based system, performance is significantly affected by the number of indices: if one database update has to be reflected in many indices, it causes many random writes to storage. An LSM engine buffers random writes in memory and writes them out as one big sequential write, making it efficient to maintain multiple indices. Using an LSM engine enables Rockset to efficiently store both columnar and search indices for all columns of a data record.
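To make the mapping concrete, here is a minimal Python sketch of the two ideas in this paragraph. It is an illustration under stated assumptions, not Rockset's or RocksDBCloud's actual encoding: the "C|" and "S|" key prefixes, the "|" separator, and the names flatten, index_entries, and WriteBuffer are all hypothetical. The first half turns one nested document into key-value pairs for a columnar index and a search index; the second half buffers those writes in memory and emits them as one sorted batch, standing in for a single sequential write.

```python
import json


def flatten(doc, prefix=""):
    """Yield (field_path, scalar_value) pairs from a nested document."""
    if isinstance(doc, dict):
        for name, value in doc.items():
            path = f"{prefix}.{name}" if prefix else name
            yield from flatten(value, path)
    else:
        # Scalars only in this sketch; array handling is left out.
        yield prefix, doc


def index_entries(doc_id, doc):
    """Map one document to key-value pairs for both index types.

    The key layout is an assumption made for illustration.
    """
    entries = []
    for path, value in flatten(doc):
        # Columnar entry: field first, so one field's values for all
        # documents sit in a contiguous key range.
        entries.append((f"C|{path}|{doc_id}", value))
        # Search entry: field and value first, so "which documents have
        # this value" is a prefix lookup. The key carries all the data.
        entries.append((f"S|{path}|{json.dumps(value)}|{doc_id}", None))
    return entries


class WriteBuffer:
    """Toy stand-in for an LSM memtable: buffer writes, flush sorted."""

    def __init__(self, limit):
        self.limit = limit
        self.memtable = {}
        self.runs = []  # each run models one sequential file write

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}


if __name__ == "__main__":
    buffer = WriteBuffer(limit=4)
    document = {"user": {"city": "Zagreb"}, "age": 30}
    for key, value in index_entries("d1", document):
        buffer.put(key, value)
    for key, value in buffer.runs[0]:
        print(key, value)
```

In this sketch, one document update produces several index entries with unrelated keys, but they reach storage together as one sorted run rather than as separate random writes, which is the property the paragraph attributes to LSM engines.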
The second component that makes the architecture economically feasible is the elasticity of the cloud environment. Rockset uses an open source technology called RocksDBCloud, which stores data durably in S3 but is able to replicate that data onto local SSDs and into RAM during times of load. Igor and Dhruba describe the custom-built Rockset scheduler, which runs as part of Kubernetes, manages the placement of data across available resources, and requests new resources from the cloud provider when needed.
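The storage hierarchy described here can be pictured with a small Python sketch. This is a simplification under stated assumptions, not RocksDBCloud's API or replication logic: the TieredStore class is hypothetical, plain dictionaries stand in for S3, local SSD, and RAM, and a read-through cache is swapped in for whatever load-driven replication the real system performs.

```python
class TieredStore:
    """Toy three-tier store: durable copy plus two faster local tiers."""

    def __init__(self, durable):
        self.durable = durable  # stands in for S3, the durable copy
        self.ssd = {}           # stands in for local SSD
        self.ram = {}           # stands in for RAM

    def get(self, key):
        """Return (value, tier_that_served_the_read)."""
        if key in self.ram:
            return self.ram[key], "ram"
        if key in self.ssd:
            # Promote to RAM so the next read is served from memory.
            self.ram[key] = self.ssd[key]
            return self.ram[key], "ssd"
        # Slowest path: fetch from the durable tier and populate
        # both faster tiers on the way back.
        value = self.durable[key]
        self.ssd[key] = value
        self.ram[key] = value
        return value, "durable"


if __name__ == "__main__":
    store = TieredStore({"C|age|d1": 30})
    print(store.get("C|age|d1"))  # first read comes from the durable tier
    print(store.get("C|age|d1"))  # repeat read comes from RAM
```

The faster tiers hold only copies, so in this sketch they can be dropped or rebuilt at any time without losing data, which is what lets capacity follow load.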
Igor Canadi is a software engineer at Rockset, where he is developing its data indexing and distributed SQL query engine. Previously, Igor was an engineer at Facebook, working on the database engineering and product infrastructure teams, where he contributed to RocksDB, developed MongoRocks (MongoDB with the RocksDB storage engine), drove RocksDB open source initiatives, worked on core GraphQL infrastructure for Facebook’s Android application, and owned GraphQL developer tooling for hundreds of developers. Igor holds a master’s degree in computer science from the University of Wisconsin-Madison and a bachelor’s degree from the University of Zagreb. In his free time, he likes sailing and snowboarding.
Dhruba Borthakur is cofounder and CTO at Rockset, a company building software to enable data-powered applications. Dhruba was the founding engineer of the open source RocksDB database at Facebook and one of the founding engineers of the Hadoop File System at Yahoo. Dhruba was also an early contributor to the open source Apache HBase project. Previously, he was a senior engineer at Veritas Software, where he was responsible for the development of VxFS and the Veritas SanPointDirect storage system; was the cofounder of Oreceipt.com, an ecommerce startup based in Sunnyvale; and was a senior engineer at IBM-Transarc Labs, where he contributed to the development of Andrew File System (AFS), a part of IBM’s ecommerce initiative, WebSphere. Dhruba holds an MS in computer science from the University of Wisconsin-Madison and a BS in computer science from BITS Pilani, India. He has 25 issued patents.
©2019, O'Reilly Media, Inc.