Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics

Igor Canadi (Rockset), Dhruba Borthakur (Rockset)

3:50pm–4:30pm Thursday, March 28, 2019

Data Engineering & Architecture
Location: 2002

Secondary topics: AI and Data technologies in the cloud, Storage, Streaming, realtime analytics, and IoT

Who is this presentation for?

CTOs, data scientists, and data engineers

Level

Intermediate

Prerequisite knowledge

A basic understanding of existing big data technologies (Hadoop, Hive, Spark, Kafka, etc.)

What you'll learn

Explore ROCKSET, an approach that reduces the engineering effort needed to derive intelligence from machine data

Description

Most traditional big data systems store data in a columnar format and rely on sequential scans for query processing. This approach is efficient and cheap, but typical queries are slow, which makes it unsuitable for online applications. Search systems such as Elasticsearch instead build inverted indices on the data, which makes simple queries faster. However, they cannot satisfy complex queries, and cost becomes a concern for larger datasets.
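
The trade-off between the two layouts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the field names, and the helper functions (`scan_filter`, `index_lookup`, `sum_column`) are hypothetical and are not taken from any of the systems named above.

```python
from collections import defaultdict

# Hypothetical sample records; field names are illustrative only.
docs = [
    {"id": 0, "city": "SF", "amount": 10},
    {"id": 1, "city": "NYC", "amount": 25},
    {"id": 2, "city": "SF", "amount": 5},
]

# Columnar layout: one contiguous list of values per field.
columns = {field: [d[field] for d in docs] for field in ("id", "city", "amount")}

def scan_filter(column, value):
    """Sequential scan: inspects every row of the column to find matches."""
    return [row for row, v in enumerate(columns[column]) if v == value]

# Inverted index: (field, value) -> list of matching row numbers.
inverted = defaultdict(list)
for row, d in enumerate(docs):
    for field, value in d.items():
        inverted[(field, value)].append(row)

def index_lookup(field, value):
    """Point lookup: goes straight to the matching rows without a scan."""
    return inverted.get((field, value), [])

def sum_column(column, rows=None):
    """Aggregation over one column, optionally restricted to some rows."""
    values = columns[column]
    if rows is None:
        return sum(values)
    return sum(values[r] for r in rows)

sf_rows_scan = scan_filter("city", "SF")        # work grows with the row count
sf_rows_index = index_lookup("city", "SF")      # work grows with the match count
sf_total = sum_column("amount", sf_rows_index)  # filter by index, aggregate by column
```

The last line hints at why combining both structures is attractive: the inverted index narrows down the rows, and the columnar layout supplies the values to aggregate.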

Igor Canadi and Dhruba Borthakur offer an overview of converged indexing with ROCKSET, a system that combines columnar and search indices to enable queries that are both low-latency and complex. Two ideas make converged indexing performant and cost-effective: using log-structured merge trees as the underlying storage system and exploiting the elasticity and storage hierarchy of the cloud environment.

Igor and Dhruba explain how ROCKSET maps a semistructured document into individual keys and values that are stored in RocksDBCloud’s log-structured merge (LSM) tree. In a B-tree-based system, performance is significantly affected by the number of indices: if one database update has to be reflected in many indices, it causes many random writes to storage. An LSM engine buffers random writes in memory and writes them out as one large sequential write, which makes maintaining multiple indices efficient. Using an LSM engine enables ROCKSET to efficiently store both columnar and search indices for all columns of a data record.
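
A minimal sketch of this idea follows, under loudly stated assumptions: the key layouts (the `"C"` and `"S"` prefixes), the `shred` and `index_entries` helpers, and the `ToyLSM` class are all hypothetical and written for illustration. They are not ROCKSET's actual key encoding and not the RocksDBCloud API. The sketch flattens a nested document into one columnar-style key and one search-style key per field, then buffers those writes in memory and flushes them as a single sorted run.

```python
def shred(doc, prefix=""):
    """Flatten a nested document into (field_path, value) pairs."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from shred(value, prefix=path + ".")
        else:
            yield path, value

def index_entries(doc_id, doc):
    """Emit one columnar-style and one search-style entry per field.

    The key layouts are assumptions made for this sketch only.
    """
    for path, value in shred(doc):
        yield ("C", path, doc_id), value              # values of one field sort together
        yield ("S", path, repr(value), doc_id), None  # field + value leads to the doc id

class ToyLSM:
    """Toy LSM store: an in-memory buffer plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.runs = []  # each run is a sorted list of (key, value); newest last

    def put(self, key, value):
        # Writes for many indices land in memory first, in any order.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # One sorted, sequential write replaces many random ones.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def scan_prefix(self, prefix):
        """Return entries whose key starts with `prefix`; newer data wins."""
        n = len(prefix)
        merged = {}
        for run in self.runs:
            merged.update((k, v) for k, v in run if k[:n] == prefix)
        merged.update((k, v) for k, v in self.memtable.items() if k[:n] == prefix)
        return sorted(merged.items())

db = ToyLSM(memtable_limit=4)
doc = {"user": {"name": "ann"}, "amount": 7}
for key, value in index_entries(1, doc):
    db.put(key, value)

# Search-style lookup: which documents have user.name == "ann"?
matches = [key[-1] for key, _ in db.scan_prefix(("S", "user.name", repr("ann")))]
# Columnar-style read: all stored values of the "amount" field.
amounts = [value for _, value in db.scan_prefix(("C", "amount"))]
```

In this toy, the four index entries produced by one document reach storage as a single flushed run. A real LSM engine adds a write-ahead log, compaction, and per-run lookup structures, none of which are modeled here.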

The second component that makes the architecture economically feasible is the elasticity of the cloud environment. ROCKSET uses an open source technology called RocksDBCloud, which stores data durably in S3 but can replicate that data onto local SSD and into RAM during times of load. Igor and Dhruba describe the custom-built ROCKSET scheduler, which runs as part of Kubernetes, manages the placement of data across available resources, and requests new resources from the cloud provider when needed.
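
The storage hierarchy described here can be illustrated with a toy read-through cache. Everything in this sketch is an assumption: the `TieredStore` class, its capacities, and its first-in-first-out eviction are hypothetical, plain dictionaries stand in for S3, local SSD, and RAM, and nothing here reflects the RocksDBCloud API or the ROCKSET scheduler.

```python
class TieredStore:
    """Toy hierarchy: a durable tier plus two smaller, faster cache tiers."""

    def __init__(self, ram_capacity=2, ssd_capacity=4):
        self.durable = {}  # stands in for object storage such as S3
        self.ssd = {}      # stands in for local SSD
        self.ram = {}      # stands in for RAM
        self.ram_capacity = ram_capacity
        self.ssd_capacity = ssd_capacity

    def put(self, key, value):
        # The durable tier is the source of truth; drop stale cached copies.
        self.durable[key] = value
        self.ssd.pop(key, None)
        self.ram.pop(key, None)

    @staticmethod
    def _cache(tier, capacity, key, value):
        # Evict the oldest inserted entry (FIFO) when the tier is full.
        if key not in tier and len(tier) >= capacity:
            tier.pop(next(iter(tier)))
        tier[key] = value

    def get(self, key):
        """Return (value, tier_that_served_the_read); promote on the way up."""
        if key in self.ram:
            return self.ram[key], "ram"
        if key in self.ssd:
            value, source = self.ssd[key], "ssd"
        else:
            value, source = self.durable[key], "durable"
            self._cache(self.ssd, self.ssd_capacity, key, value)
        self._cache(self.ram, self.ram_capacity, key, value)
        return value, source

store = TieredStore(ram_capacity=1, ssd_capacity=2)
store.put("a", 1)
store.put("b", 2)

first = store.get("a")   # cold read, served from the durable tier
second = store.get("a")  # now cached in RAM
third = store.get("b")   # cold read; evicts "a" from the one-slot RAM tier
fourth = store.get("a")  # still on SSD, so no trip to the durable tier
```

Data that is read often migrates toward the faster tiers, while the durable tier keeps every record. Deciding how much fast capacity to provision, and when to request more, is the kind of job the talk attributes to the scheduler.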

Igor Canadi

Rockset

Igor Canadi is a software engineer at Rockset, where he is developing its data indexing and distributed SQL query engine. Previously, Igor was an engineer at Facebook, working on the database engineering and product infrastructure teams, where he contributed to RocksDB, developed MongoRocks (MongoDB with the RocksDB storage engine), drove RocksDB open source initiatives, worked on core GraphQL infrastructure for Facebook’s Android application, and owned GraphQL developer tooling for hundreds of developers. Igor holds a master’s degree in computer science from the University of Wisconsin-Madison and a bachelor’s degree from the University of Zagreb. In his free time, he likes sailing and snowboarding.

Dhruba Borthakur

Rockset

Dhruba Borthakur is cofounder and CTO at Rockset, a company building software to enable data-powered applications. Dhruba was the founding engineer of the open source RocksDB database at Facebook and one of the founding engineers of the Hadoop File System at Yahoo. Dhruba was also an early contributor to the open source Apache HBase project. Previously, he was a senior engineer at Veritas Software, where he was responsible for the development of VxFS and the Veritas SanPointDirect storage system; was the cofounder of Oreceipt.com, an ecommerce startup based in Sunnyvale; and was a senior engineer at IBM-Transarc Labs, where he contributed to the development of the Andrew File System (AFS), a part of IBM’s ecommerce initiative, WebSphere. Dhruba holds an MS in computer science from the University of Wisconsin-Madison and a BS in computer science from BITS, Pilani, India. He has 25 issued patents.
