17–19 October 2016: Conference & Tutorials
19–20 October 2016: Training
London, UK

Building a powerful data tier from open source datastores

Joseph Lynch (Yelp)
13:35–14:15 Tuesday, 18/10/2016
Location: Sandringham Level: Intermediate
Average rating: *****
(5.00, 3 ratings)

Prerequisite knowledge

  • A firm grasp on large-scale system architecture
  • Knowledge of cloud hardware deployment (e.g., AWS, GCE), configuration management (e.g., Puppet, Chef), software CI (e.g., Jenkins, Docker, Linux packaging), and monitoring solutions (e.g., Nagios, Sensu) (useful but not required)

What you'll learn

  • Understand the trade-offs when choosing various open source data stores to deploy to your data tier
  • Explore concrete advice on how to get your data store to production
Description

    Today’s open source databases are plentiful and offer wildly different capabilities. As technology companies push traditional RDBMSs to their limits, we’ve seen significant innovation in open source “distributed first” data stores, including key-value stores, search engines, document stores, caches, and even distributed locking systems. Joseph Lynch explores how Yelp made the hard technical choices and built a bulletproof data tier from these distributed data stores.

    Joseph starts with a survey of the open source data store landscape, outlining the high-level trade-offs that have to be made when choosing between different classes of data stores (e.g., relational versus NoSQL), as well as the limitations of those choices. Joseph then explains how Yelp chose among its options for search engines (Elasticsearch versus Solr); configuration systems (ZooKeeper versus etcd); key-value stores (Cassandra versus HBase); and caching (MySQL versus Cassandra versus Memcached versus Redis).

    Regardless of which set of open source data stores a company chooses, the hard part is getting it to production. In order to keep up with all the new options, Yelp invested early in building a common platform for deploying, configuring, and monitoring data stores. Joseph discusses some of these shared abstractions including:

    • Robust deployment of data stores to machines in the cloud using Terraform
    • Cluster-first Puppet modules for configuring fleets of data stores
    • How to develop, build, and deploy packaged toolsets for operators (and which tools to build)
    • Autoscaling these myriad databases up and down
    • Integrating with open source monitoring solutions like Sensu or Nagios
    • Handling schemas and evolving data models in a consistent way
    • How to build robust clients and proxies to decouple applications from implementation choices
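    As a minimal sketch of the last point in the list above (decoupling applications from implementation choices), application code can depend on a small interface while the concrete backend and its retry policy are swapped in behind it. Everything here is a hypothetical illustration rather than Yelp's actual client code: the names KeyValueStore, InMemoryStore, RetryingStore, and save_display_name are assumptions, and the in-memory backend merely stands in for a real data store.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class KeyValueStore(ABC):
    """Hypothetical interface; applications depend on this, not on a backend."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]:
        ...

    @abstractmethod
    def put(self, key: str, value: bytes) -> None:
        ...


class InMemoryStore(KeyValueStore):
    """Stand-in backend for illustration; a real one would wrap a database driver."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value


class RetryingStore(KeyValueStore):
    """Wraps any backend and retries calls that raise ConnectionError."""

    def __init__(self, backend: KeyValueStore, attempts: int = 3) -> None:
        if attempts < 1:
            raise ValueError("attempts must be at least 1")
        self._backend = backend
        self._attempts = attempts

    def _call(self, fn, *args):
        last_error: Optional[Exception] = None
        for _ in range(self._attempts):
            try:
                return fn(*args)
            except ConnectionError as exc:  # assumed to be transient
                last_error = exc
        raise last_error

    def get(self, key: str) -> Optional[bytes]:
        return self._call(self._backend.get, key)

    def put(self, key: str, value: bytes) -> None:
        self._call(self._backend.put, key, value)


def save_display_name(store: KeyValueStore, user_id: int, name: str) -> None:
    """Application code: knows only the interface, never the backend."""
    store.put(f"user:{user_id}:name", name.encode("utf-8"))


if __name__ == "__main__":
    store = RetryingStore(InMemoryStore(), attempts=3)
    save_display_name(store, 42, "Joey")
    print(store.get("user:42:name"))
```

    Because save_display_name accepts any KeyValueStore, the backend could in principle be replaced without touching application code, which is the kind of decoupling the bullet describes.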

    Joseph ends by covering the implications of giving developers so many choices in your data store infrastructure and sharing some lessons learned about the education required in a DevOps data store world. Companies are scaling and iterating far beyond the days when one could run a single database cluster, and just as monoliths are becoming microservices, catch-all databases are turning into polyglot data stores.

    Joseph Lynch


    Joseph Lynch is a software engineer at Yelp who focuses on building data store and service infrastructure. Joey is a core contributor to Yelp’s data store platform, which has allowed Yelp to go from a primarily MySQL data tier to a polyglot data tier including Elasticsearch, Cassandra, and ZooKeeper. He loves pushing the edge of how Yelp uses DevOps tools to automate infrastructure and has never met a problem he didn’t want to automate away. When not wrangling clusters of data stores, Joey enjoys building service discovery, reliable communication, fast deployment, and monitoring into Yelp’s SOA.