For as long as people have been working with databases, we’ve been mapping our data to a schema at write time. We take our raw data in one hand and a description of what the data should look like in the other, and we combine the two, writing the result to disk in a binary, preformatted way. Users can subsequently run queries against the data because it is stored in a schema-fied, normalized, queryable format.
Schema-on-read systems, by comparison, enable us to map our schema to our raw data at query time. That is, the data isn’t preformatted; it isn’t “queryable” in the classical database sense. It’s just bytes sitting somewhere on disk in their native format (JSON, newline-separated lines of text, whatever...). The schema itself is stored as a set of ETL-like instructions (or even a regular expression), which the system can use as a map to transform the at-rest data into named fields on demand. There’s no “database” in a schema-on-read system. In its place is some metadata linking the location of some at-rest data to a schema we can use to parse it when we want to.
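To make the idea concrete, here is a minimal Python sketch of schema-on-read using the regular-expression case mentioned above. The log layout, the field names (client, method, path, status), and the helper read_with_schema are all hypothetical, chosen only for illustration; they are not taken from the system described in the talk.

```python
import re

# Raw, at-rest data: plain text lines, stored exactly as they were written.
RAW_LINES = [
    "10.0.0.1 GET /accounts 502",
    "10.0.0.2 GET /health 200",
    "10.0.0.3 POST /accounts 201",
]

# The "schema": a regular expression whose named groups define the fields.
# (Hypothetical log layout, assumed for this example.)
SCHEMA = re.compile(
    r"(?P<client>\S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d{3})"
)


def read_with_schema(lines, schema):
    """Apply the schema at read time, yielding a dict of named fields per line."""
    for line in lines:
        match = schema.fullmatch(line)
        if match:  # lines that do not fit the schema are skipped at read time
            yield match.groupdict()


# The "query" runs over fields that only exist once the schema is applied.
rows = list(read_with_schema(RAW_LINES, SCHEMA))
server_errors = [row for row in rows if row["status"].startswith("5")]
```

Nothing about the stored lines changes when the schema does: swapping in a different regular expression gives a different set of named fields over the same bytes.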
David Josephsen tells the story of what happened when SparkPost’s SRE team stole a page from data engineering and abandoned ELK log processing for a DIY schema-on-read logging infrastructure. You’ll learn the architectural details, along with the trials and tribulations, of the company’s Internal Event Hose data ingestion pipeline project, which uses Fluentd, Kinesis Firehose, Parquet, SNS, and S3 to ship event data from sources like NGINX logs into at-rest columnar data on S3. There, automated processes, engineers, managers, and support personnel all use a democratic SQL query engine to answer questions like, “How many customers of type X received a 5xx when attempting to access the /accounts API in the last 24 hours?”
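As a rough illustration of that kind of question, the Python sketch below counts distinct customers of a given type who received a 5xx on /accounts within a 24-hour window, reading raw JSON lines and applying field names only at query time. It is a stand-in for the SQL query engine, not a reproduction of it: the event layout, the field names (ts, customer_id, customer_type, path, status), and the function customers_with_5xx are assumptions made for the example.

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical at-rest events, one JSON document per line.
RAW_EVENTS = [
    '{"ts": "2019-05-01T11:30:00+00:00", "customer_id": "c1", "customer_type": "enterprise", "path": "/accounts", "status": 503}',
    '{"ts": "2019-05-01T09:00:00+00:00", "customer_id": "c1", "customer_type": "enterprise", "path": "/accounts", "status": 500}',
    '{"ts": "2019-05-01T10:00:00+00:00", "customer_id": "c2", "customer_type": "enterprise", "path": "/accounts", "status": 200}',
    '{"ts": "2019-05-01T10:00:00+00:00", "customer_id": "c3", "customer_type": "free", "path": "/accounts", "status": 502}',
    '{"ts": "2019-04-29T10:00:00+00:00", "customer_id": "c4", "customer_type": "enterprise", "path": "/accounts", "status": 504}',
    '{"ts": "2019-04-30T13:00:00+00:00", "customer_id": "c6", "customer_type": "enterprise", "path": "/accounts", "status": 502}',
]


def customers_with_5xx(raw_events, customer_type, path, now,
                       window=timedelta(hours=24)):
    """Count distinct customers of one type that saw a 5xx on path in the window."""
    affected = set()
    for raw in raw_events:
        event = json.loads(raw)  # fields are named only now, at read time
        ts = datetime.fromisoformat(event["ts"])
        if (
            now - window <= ts <= now
            and event["customer_type"] == customer_type
            and event["path"] == path
            and 500 <= event["status"] <= 599
        ):
            affected.add(event["customer_id"])
    return len(affected)


now = datetime(2019, 5, 1, 12, 0, tzinfo=timezone.utc)
count = customers_with_5xx(RAW_EVENTS, "enterprise", "/accounts", now)
```

With this sample data, customers c1 and c6 match; c1 is counted once despite two errors, and c4 falls outside the window.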
If you are challenged by the maintenance overhead of Splunk or ELK; if you have vast quantities of at-rest log data and wish you could run SQL queries on it without any ETL; if grep is no longer a viable answer; if your computers make logs all the livelong day and you just want to scream, then this is the talk for you.
Dave Josephsen runs the telemetry engineering team at SparkPost. He thinks you’re pretty great.
©2019, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com