Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Schema on read and the new logging way

David Josephsen (Sparkpost)
12:05–12:45 Thursday, 2 May 2019
Average rating: 3.50 (2 ratings)

Who is this presentation for?

  • Practitioners and engineers responsible for streaming data and pipeline architectures



Prerequisite knowledge

  • Familiarity with the fact computers make logs

What you'll learn

  • Get real-world advice for building streaming data infrastructure on AWS


For as long as people have been interacting with databases, we’ve been mapping our data to a schema at write time. We take our raw data in one hand and a description of what the data should look like in the other, and we combine the two, writing the result to disk in a binary, preformatted way. Users can then run queries against the data, because it is stored in a schema-fied, normalized, query-able format.
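As a rough illustration of the schema-on-write pattern described above, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and sample rows are invented for this example and are not taken from the talk.

```python
import sqlite3

def load_with_schema(rows):
    """Schema-on-write: declare the schema first, then write rows that fit it."""
    conn = sqlite3.connect(":memory:")
    # The schema is fixed before any data is stored (hypothetical columns).
    conn.execute(
        "CREATE TABLE requests ("
        "client TEXT NOT NULL, path TEXT NOT NULL, status INTEGER NOT NULL)"
    )
    # Data is shaped to the schema at write time.
    conn.executemany(
        "INSERT INTO requests (client, path, status) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn

# Invented sample rows, already parsed into the schema's shape.
conn = load_with_schema([
    ("10.0.0.1", "/accounts", 200),
    ("10.0.0.2", "/accounts", 503),
])

# Queries work directly because the stored data is already structured.
errors = conn.execute(
    "SELECT COUNT(*) FROM requests WHERE status >= 500"
).fetchone()[0]
```

The cost is paid up front: every record must be parsed and shaped before it can be stored at all.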

Schema-on-read systems, by comparison, enable us to map our schema to our raw data at query time. That is, the data isn’t preformatted; it isn’t “query-able” in the classical database sense. It’s just bytes sitting somewhere on disk in their native format (JSON, newline-separated lines of text, whatever...). The schema itself is stored as a set of ETL-like instructions (or even a regular expression), which the system can use as a map to transform the at-rest data into named fields on demand. There’s no “database” in a schema-on-read system. In its place is some metadata linking the location of some at-rest data to a schema we can use to parse it when we want to.
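To make the idea concrete, here is a minimal sketch of schema-on-read in Python, assuming the schema is a regular expression with named groups. The log format, sample lines, and function names are hypothetical and chosen only for illustration.

```python
import re

# Raw data at rest: plain text lines, never loaded into a database.
# These sample lines are invented for illustration.
RAW_LINES = [
    '10.0.0.1 "GET /accounts HTTP/1.1" 200',
    '10.0.0.2 "POST /accounts HTTP/1.1" 503',
    '10.0.0.3 "GET /health HTTP/1.1" 200',
]

# The "schema": a regular expression whose named groups become fields.
SCHEMA = re.compile(
    r'(?P<client>\S+) "(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3})'
)

def read_with_schema(lines, schema):
    """Apply the schema at query time, yielding one dict per matching line."""
    for line in lines:
        match = schema.match(line)
        if match:
            row = match.groupdict()
            row["status"] = int(row["status"])
            yield row

def count_server_errors(lines, schema, path):
    """Count 5xx responses for one path, parsing the raw lines on demand."""
    return sum(
        1
        for row in read_with_schema(lines, schema)
        if row["path"] == path and 500 <= row["status"] <= 599
    )

accounts_errors = count_server_errors(RAW_LINES, SCHEMA, "/accounts")
```

Nothing is transformed until a query runs, so changing the schema means swapping the regular expression rather than rewriting stored data.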

David Josephsen tells the story of what happened when Sparkpost’s SRE team stole a page from data engineering and abandoned ELK log processing for a DIY schema-on-read logging infrastructure. You’ll learn the architectural details, along with the trials and tribulations, of the company’s Internal Event Hose data ingestion pipeline project. It uses Fluentd, Kinesis Firehose, Parquet, SNS, and S3 to ship event data from sources like NGINX logs into at-rest columnar data on S3. There, automated processes, engineers, managers, and support personnel all make use of a democratic SQL query engine to answer questions like, “How many customers of type X received a 5xx when attempting to access the /accounts API in the last 24 hours?”
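The example question can be read as a filter over parsed events. The sketch below expresses that filter in plain Python rather than SQL; it is not Sparkpost's implementation, and the field names (customer_id, customer_type, path, status, time) and the sample events are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

def customers_with_5xx(events, customer_type, path, now,
                       window=timedelta(hours=24)):
    """Return the set of customer IDs of one type that got a 5xx on a path
    within the window ending at `now`."""
    cutoff = now - window
    return {
        e["customer_id"]
        for e in events
        if e["customer_type"] == customer_type
        and e["path"] == path
        and 500 <= e["status"] <= 599
        and cutoff <= e["time"] <= now
    }

# Invented sample events; a fixed `now` keeps the example deterministic.
NOW = datetime(2019, 5, 2, 12, 0, tzinfo=timezone.utc)
EVENTS = [
    {"customer_id": "a", "customer_type": "X", "path": "/accounts",
     "status": 503, "time": NOW - timedelta(hours=1)},
    {"customer_id": "a", "customer_type": "X", "path": "/accounts",
     "status": 500, "time": NOW - timedelta(hours=2)},
    {"customer_id": "b", "customer_type": "X", "path": "/accounts",
     "status": 502, "time": NOW - timedelta(hours=30)},  # outside the window
    {"customer_id": "c", "customer_type": "Y", "path": "/accounts",
     "status": 500, "time": NOW - timedelta(hours=3)},   # wrong customer type
    {"customer_id": "d", "customer_type": "X", "path": "/accounts",
     "status": 200, "time": NOW - timedelta(hours=1)},   # not an error
]

affected = customers_with_5xx(EVENTS, "X", "/accounts", NOW)
```

Returning a set counts each customer once, however many failing requests they made; whether the original question means customers or requests is an interpretation on our part.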

If you are challenged by the maintenance overhead of Splunk or ELK; if you have vast quantities of at-rest log data and wish you could run SQL queries on it without any ETL; if grep is no longer a viable answer; if your computers make logs all the livelong day and you just want to scream: this is the talk for you.


David Josephsen


Dave Josephsen runs the telemetry engineering team at Sparkpost. He thinks you’re pretty great.