
Source of Truth: Using Open Source Tools to Manage and Monitor Large Deployments in Multiple Datacenters

Dale Hamel (Shopify)
Operations
Grand Ballroom CD
Tutorial
Please note: to attend, your registration must include Tutorials on Tuesday.
Average rating: 2.67 (42 ratings)
Slides:   1-PDF 

THIS TUTORIAL HAS REQUIREMENTS AND INSTRUCTIONS LISTED BELOW

Shopify is a growing eCommerce platform whose customer base has approximately doubled each year. Having started off literally out of a coffee shop, our SRE and Operations department has grown very organically, and capacity planning, server management, and monitoring got into a fairly messy state as a result.

Its status as a relatively new, quickly growing, high-availability operation makes Shopify a very interesting case study. By moving from cluttered and nearly unmaintainable spreadsheets to a Collins-based infrastructure management stack, we have been able to streamline our Operations and make them more efficient.

By the time of the Velocity conference, we plan to have released our open source Source of Truth stack, tentatively called “OpenSRE” for “open site reliability engineering”. It will encompass our intake and provisioning system and the tools and processes used to keep our Source of Truth consistent with reality, and will integrate with other tools such as configuration management systems like Chef.
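As a rough illustration of what keeping a Source of Truth consistent with reality can involve, the sketch below compares recorded inventory against what is actually observed on the network. It does not use the Collins API; the `Asset` fields and the `find_drift` helper are hypothetical names chosen for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """A hypothetical inventory record; the field names are assumptions."""
    tag: str       # asset tag used as the unique key
    hostname: str
    status: str    # free-form lifecycle state, e.g. "Allocated"


def find_drift(recorded, observed):
    """Compare recorded inventory against observed reality.

    Both arguments map asset tag -> Asset. Returns three sorted lists of tags:
    missing (recorded but not observed), unknown (observed but not recorded),
    and changed (present in both, but the records differ).
    """
    recorded_tags = set(recorded)
    observed_tags = set(observed)
    missing = sorted(recorded_tags - observed_tags)
    unknown = sorted(observed_tags - recorded_tags)
    changed = sorted(
        tag for tag in recorded_tags & observed_tags
        if recorded[tag] != observed[tag]
    )
    return missing, unknown, changed
```

A periodic job could run a check like this and flag each of the three lists for review, rather than silently overwriting either side.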

We will be discussing the following tools, and how they can fit into the stack or benefit from interacting with a central Source of Truth:

  • Collins – the “Source of Truth” that other components connect to
  • iPXE, dnsmasq, tftp for provisioning servers from baremetal
  • Alchemy Linux and Alchemy Transmuter for intake and burnin
  • Chef for provisioning servers into a useable state
  • Docker for containerization to “draw a line in the sand” between Ops and Devs, with CoreOS or Mesos to manage capacity
  • Realtime system monitoring and alerting, briefly contrasting Datadog (proprietary SaaS) with OpenTSDB and Graphite (both open source)
  • EC2 and other cloud options – for on-demand virtualized servers

We will demonstrate how the components above can work together to facilitate the following tasks:

  • Server intake, burnin, and bootstrapping
  • Managing physical and virtual server assets (keeping track, knowing what is what and where)
  • Physical resource hierarchies and Rack diagrams
  • IP address management
  • IPMI control over servers
  • Network resource graphs and eventually heatmaps
  • Provisioning and reprovisioning servers
  • Capacity planning
  • Adjusting capacity based on requirements and resource availability

We will conclude by comparing how we used to manage our infrastructure with how our Collins-based system has relieved us of many mundane and frustrating tasks, and with a summary of our ideas for future improvements.

TUTORIAL REQUIREMENTS AND INSTRUCTIONS FOR ATTENDEES

Everyone who wants to participate in the software demo should have a laptop with Vagrant installed. Also, please download the following Vagrant images with documentation PRIOR to the conference.

* https://github.com/OpenSRE/OpenSRE.github.io/releases/download/pre-velocity/alchemy-linux.box
* https://github.com/OpenSRE/OpenSRE.github.io/releases/download/pre-velocity/collins-and-transmuter.box
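One possible way to fetch and register both images ahead of time is sketched below. It assumes the `vagrant` command is on your PATH. The local box names are derived from the file names, which is a choice made for this example and not something the tutorial prescribes.

```python
import subprocess
from pathlib import PurePosixPath
from urllib.parse import urlparse

# The two box URLs listed above.
BOX_URLS = [
    "https://github.com/OpenSRE/OpenSRE.github.io/releases/download/pre-velocity/alchemy-linux.box",
    "https://github.com/OpenSRE/OpenSRE.github.io/releases/download/pre-velocity/collins-and-transmuter.box",
]


def box_name(url):
    """Derive a local box name from the file name (an assumption of this sketch)."""
    return PurePosixPath(urlparse(url).path).stem


def box_add_command(url):
    """Build the `vagrant box add` invocation for one box URL."""
    return ["vagrant", "box", "add", "--name", box_name(url), url]


def add_boxes(urls=BOX_URLS):
    """Download and register each box; raises if vagrant exits with an error."""
    for url in urls:
        subprocess.run(box_add_command(url), check=True)
```

Calling `add_boxes()` performs the downloads. Running `vagrant box list` afterwards should show both names if the downloads succeeded.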



Dale Hamel

Shopify

Dale Hamel is a Linux and Open Source Software advocate with a background in system operations, administration, and development.

Well known for creating the Open Source media centre “RasPlex” for Raspberry Pi, Dale believes strongly in the open source community and has experience managing Open Source projects.

At his day job, he helps take care of Shopify, a growing eCommerce platform based out of Ottawa, Canada.