Build Systems that Drive Business
Sep 30–Oct 1, 2018: Training
Oct 1–3, 2018: Tutorials & Conference
New York, NY

Sell cron, buy Airflow: Modern data pipelines in finance

James Meickle (Quantopian)
4:45pm–5:25pm Tuesday, October 2, 2018
Distributed Data
Location: Nassau
Level: Beginner
Secondary topics: Resilient, Performant & Secure Distributed Systems
Average rating: 5.00 out of 5 (2 ratings)

Prerequisite knowledge

  • Familiarity with data pipelines (ETL), distributed systems, and job scheduling
  • A basic understanding of Kubernetes (useful but not required)

What you'll learn

  • How Quantopian rearchitected brittle crontabs into resilient, recoverable pipelines with Apache Airflow

Description

The Quantopian data pipeline begins every night after equity trading in the US ends, when the company ingests the day’s financial data from several vendors. Its Python infrastructure reconciles and cleans data to produce a unified view of history, repackages cleaned data into higher-performance formats, and produces analytics data that is provided to Quantopian’s worldwide community as a free portfolio risk model (usually only available to institutions).

But that high-level view of Quantopian’s business is an abstraction; as the company scaled its research and trading infrastructure, the engine keeping Quantopian running grew to almost a hundred cron jobs. These brittle scheduling systems regularly failed in the real world, where vendors are late, data is missing, and services fail. As the company considered adding support for global markets, it knew it needed to invest in a more resilient and flexible approach.
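
To make the contrast with time-based cron scheduling concrete, here is a minimal sketch of dependency-driven execution with retries, the general idea behind DAG schedulers such as Airflow. It uses only the Python standard library and is not Airflow's API or Quantopian's code. The stage names, the `run_pipeline` helper, and the `max_retries` parameter are all hypothetical, chosen to mirror the nightly steps described above.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names, loosely following the nightly steps described above.
# Each key maps to the set of stages that must finish before it can start.
PIPELINE = {
    "ingest_vendor_data": set(),
    "reconcile_and_clean": {"ingest_vendor_data"},
    "repackage_formats": {"reconcile_and_clean"},
    "build_risk_model": {"reconcile_and_clean"},
}


def run_pipeline(tasks, graph, max_retries=2):
    """Run each stage once its upstream stages have succeeded.

    A failing stage is retried up to max_retries times. Stages downstream
    of a failure are skipped rather than run against missing data.
    Returns a dict of stage name -> "success", "failed", or "upstream_failed".
    """
    status = {}
    # static_order() yields every stage after all of its prerequisites.
    for name in TopologicalSorter(graph).static_order():
        if any(status[dep] != "success" for dep in graph[name]):
            status[name] = "upstream_failed"
            continue
        for _attempt in range(max_retries + 1):
            try:
                tasks[name]()
                status[name] = "success"
                break
            except Exception:
                # Assumption for this sketch: any exception is retryable.
                status[name] = "failed"
    return status


# Example: the vendor file is late on the first attempt, then arrives.
attempts = {"ingest": 0}


def flaky_ingest():
    attempts["ingest"] += 1
    if attempts["ingest"] == 1:
        raise RuntimeError("vendor file not delivered yet")


demo_tasks = {name: (lambda: None) for name in PIPELINE}
demo_tasks["ingest_vendor_data"] = flaky_ingest
demo_status = run_pipeline(demo_tasks, PIPELINE)
```

With a crontab, each downstream job would start at its fixed time whether or not the late vendor file had arrived. In this sketch the downstream stages wait on the retried ingest, and they are marked `upstream_failed` if it never succeeds.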

Quantopian began researching a data pipeline solution in late 2017 and rapidly converged on Apache Airflow as the right tool for the job. The team spent a month on research and prototyping and another month developing a detailed implementation plan to introduce Airflow to the rest of the company: adoption targets, documentation, code samples, and test suites.

James Meickle explains how, in less than six months, Quantopian went from not knowing how it would ever escape its cron jobs to putting Airflow on the critical path for its high-reliability trading infrastructure. And the best news? Since Quantopian has done the research for you, you can do it even faster.

James Meickle

Quantopian

James Meickle is a site reliability engineer at Quantopian, a Boston startup making algorithmic trading accessible to everyone. In past roles, he’s been responsible for processing MRI scans at the Center for Brain Science at Harvard University, sales engineering and developer evangelism at AppNeta, and release engineering during the Romney for President 2012 campaign. Between NYSE trading days, he advises devopsdays Boston and conducts Ansible trainings on O’Reilly’s Safari platform. What free time remains is dedicated to cooking, sci-fi, permadeath video games, and Satanism.