Companies with batch and stream processing pipelines need to serve the insights they glean back to their users, an often-overlooked problem that is hard to solve reliably and at scale. Felix GV and Yan Yan offer an overview of Venice, a new data store capable of ingesting data from Hadoop and Kafka, merging them, replicating the results globally, and serving them online at low latency. (LinkedIn runs Venice as a multitenant, self-service, globally replicated system.)
Venice was designed as the next-generation replacement for the Voldemort Read-Only system, intended to provide a broader feature set, better availability characteristics, and a more efficient architecture. Venice supports high-throughput ingestion from both Hadoop and Kafka, and these data sources can be merged at ingestion time to provide semantics similar to those of a lambda architecture, but with a simpler, faster, and more available read path. Robustness is a primary architectural concern; as such, Venice provides highly available reads and writes, self-healing, stringent data-validation guarantees, and the ability to roll back entire datasets when bad data is pushed.
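The lambda-style merge described above can be illustrated with a minimal sketch. This is not Venice's actual API; the function and data shapes below are hypothetical, assuming only the behavior the abstract describes: a batch snapshot pushed from Hadoop is overlaid with newer real-time updates from Kafka, with the stream write winning for any key present in both.

```python
# Hypothetical sketch of merging batch and stream data at ingestion time.
# Names and data shapes are illustrative, not Venice's real interface.

def merge_at_ingestion(batch_snapshot, stream_updates):
    """Return the serving view: the batch push overlaid with stream writes."""
    view = dict(batch_snapshot)        # start from the Hadoop snapshot
    for key, value in stream_updates:  # replay Kafka events in order
        if value is None:              # tombstone: delete the record
            view.pop(key, None)
        else:                          # later writes win over batch data
            view[key] = value
    return view

batch = {"member:1": {"score": 10}, "member:2": {"score": 7}}
stream = [("member:2", {"score": 9}),   # overrides the batch value
          ("member:3", {"score": 4}),   # key not in the batch push
          ("member:1", None)]           # deletes a batch record
print(merge_at_ingestion(batch, stream))
# {'member:2': {'score': 9}, 'member:3': {'score': 4}}
```

Serving a single pre-merged view like this is what lets the read path stay simple: clients query one store instead of reconciling batch and speed layers at read time, as a classic lambda architecture would require.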
Felix GV is a software engineer working on LinkedIn’s data infrastructure. He leads the Venice project and keeps a close eye on Hadoop, Kafka, Samza, Azkaban, ZooKeeper, Helix, and Avro.
Yan Yan is an engineer at LinkedIn, where he works on the Voldemort and Venice team within the company’s data infrastructure organization. He has extensive experience with cluster management, ZooKeeper, Helix, and distributed systems in general.
©2017, O'Reilly Media, Inc.