Presented By
O’Reilly + Cloudera

Make Data Work

29 April–2 May 2019
London, UK

Schema on read and the new logging way

David Josephsen (Sparkpost)

12:05–12:45 Thursday, 2 May 2019

Data Engineering and Architecture
Location: S11 A

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Data Platforms, Streaming and realtime analytics

Who is this presentation for?

Practitioners and engineers responsible for streaming data and pipeline architectures

Level

Beginner

Prerequisite knowledge

Familiarity with the fact that computers make logs

What you'll learn

Get real-world advice for building streaming data infrastructure on AWS

Description

For as long as people have been interacting with databases, we’ve been mapping our data to a schema at write time. We take our raw data in one hand and a description of what the data should look like in the other, and we combine the two, writing the result to disk in a binary, preformatted layout. Users can subsequently run queries against the data, because it is stored in a schematized, normalized, queryable format.

Schema-on-read systems, by comparison, enable us to map our schema to our raw data at query time. That is, the data isn’t preformatted; it isn’t “queryable” in the classical database sense. It’s just bytes sitting somewhere on disk in their native format (JSON, newline-separated lines of text, whatever...). The schema itself is stored as a set of ETL-like instructions (or even a regular expression), which the system can use as a map to transform the at-rest data into named fields on demand. There’s no “database” in a schema-on-read system. In its place is some metadata linking the location of some at-rest data to a schema we can use to parse it when we want to.
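To make the idea concrete, here is a minimal Python sketch of schema-on-read, assuming the "schema" is a regular expression with named groups. The sample log lines, the field names (client, method, path, status), and the helper read_with_schema are all invented for illustration and are not taken from the talk.

```python
import re
from typing import Dict, Iterable, Iterator, Pattern

# Raw data "at rest": plain text, stored exactly as it was emitted.
# These sample lines are hypothetical.
RAW_LOG = """\
203.0.113.7 GET /accounts 502
198.51.100.4 GET /health 200
203.0.113.9 POST /accounts 201
"""

# The "schema": a regular expression with named groups. Nothing is
# applied to the data until someone actually reads it.
SCHEMA = re.compile(
    r"(?P<client>\S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d{3})"
)


def read_with_schema(
    lines: Iterable[str], schema: Pattern[str]
) -> Iterator[Dict[str, str]]:
    """Map the schema onto raw lines at read time, yielding named fields."""
    for line in lines:
        match = schema.fullmatch(line.strip())
        if match:  # lines that do not fit the schema are skipped
            yield match.groupdict()


# "Query time": parse on demand, then filter on the named fields.
rows = list(read_with_schema(RAW_LOG.splitlines(), SCHEMA))
server_errors = [row for row in rows if row["status"].startswith("5")]
```

Changing the schema here means editing the regular expression only; the stored bytes are never rewritten, which is the property the paragraph above describes.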

David Josephsen tells the story of what happened when Sparkpost’s SRE team stole a page from data engineering and abandoned ELK log processing for a DIY schema-on-read logging infrastructure. You’ll learn the architectural details, along with the trials and tribulations, of the company’s Internal Event Hose data ingestion pipeline project. The pipeline uses Fluentd, Kinesis Firehose, Parquet, SNS, and S3 to ship event data from sources like NGINX logs into at-rest columnar data on S3. There, automated processes, engineers, managers, and support personnel all use a democratic SQL query engine to answer questions like, “How many customers of type X received a 5xx when attempting to access the /accounts API in the last 24 hours?”
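As a rough illustration of the kind of question quoted above, the following Python sketch answers it directly over newline-delimited JSON events. It is a stand-in for the SQL query engine, not the pipeline described in the talk: the event layout, the field names (ts, customer_id, customer_type, path, status), the sample data, and the function customers_with_5xx are all assumptions made for this example.

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical at-rest events, one JSON object per line.
EVENTS = """\
{"ts": "2019-05-02T10:00:00+00:00", "customer_id": "c1", "customer_type": "X", "path": "/accounts", "status": 503}
{"ts": "2019-05-02T11:00:00+00:00", "customer_id": "c1", "customer_type": "X", "path": "/accounts", "status": 500}
{"ts": "2019-05-02T11:30:00+00:00", "customer_id": "c2", "customer_type": "X", "path": "/accounts", "status": 200}
{"ts": "2019-05-02T09:00:00+00:00", "customer_id": "c3", "customer_type": "Y", "path": "/accounts", "status": 502}
{"ts": "2019-04-30T09:00:00+00:00", "customer_id": "c4", "customer_type": "X", "path": "/accounts", "status": 504}
{"ts": "2019-05-02T08:00:00+00:00", "customer_id": "c5", "customer_type": "X", "path": "/users", "status": 500}
{"ts": "2019-05-01T13:00:00+00:00", "customer_id": "c6", "customer_type": "X", "path": "/accounts", "status": 502}
"""


def customers_with_5xx(raw, customer_type, path, now, window=timedelta(hours=24)):
    """Count distinct customers of one type that got a 5xx on a path in the window."""
    seen = set()
    for line in raw.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)  # fields are parsed only at query time
        ts = datetime.fromisoformat(event["ts"])
        if (
            event["customer_type"] == customer_type
            and event["path"] == path
            and 500 <= event["status"] <= 599
            and now - window <= ts <= now
        ):
            seen.add(event["customer_id"])
    return len(seen)


# A fixed "now" keeps the example deterministic.
NOW = datetime(2019, 5, 2, 12, 0, tzinfo=timezone.utc)
count = customers_with_5xx(EVENTS, "X", "/accounts", NOW)
```

In a SQL engine the same logic would typically be a filtered COUNT(DISTINCT ...) over the event table; the loop above only spells out the filter conditions explicitly.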

If you are challenged by the maintenance overhead of Splunk or ELK; if you have vast quantities of at-rest log data and wish you could run SQL queries on it without any ETL; if grep is no longer a viable answer; if your computers make logs all the livelong day and you just want to scream, this is the talk for you.

David Josephsen

Sparkpost

Dave Josephsen runs the telemetry engineering team at Sparkpost. He thinks you’re pretty great.
