Apache Beam equips users with a novel programming model in which the classic batch-streaming data processing dichotomy is erased. Beam also offers a rich set of I/O connectors to popular storage systems. Beam adopts the philosophy that interacting with a storage system is just another parallel data-processing task, so the I/O connectors are packaged as simple Beam transforms, which offers surprisingly large benefits in both modularity and scalability.
However, the batch-streaming dichotomy persisted for authors of new I/O connectors: it was common knowledge that efficiently ingesting batch data and streaming data are fundamentally different problems that, at a low level, require fundamentally different (and rather heavyweight) APIs.
Over the years, a number of issues with these APIs were identified, and attempts to improve them yielded an unexpected result: the most fundamental data processing primitive, the Map operation (DoFn in Beam), was generalized. Eugene Kirpichov offers an overview of this generalization, called Splittable DoFn in Beam, which not only allows data ingestion APIs to be developed in a way that is agnostic to batch versus streaming but is also lightweight and blends transparently with the rest of the Beam programming model, enabling and popularizing new, highly modular design patterns for data ingestion APIs.
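To give a flavor of the idea, the core of a Splittable DoFn is that processing one element is driven by a "restriction" (for example, an offset range in a file) whose tracker can be split at any time, letting a runner checkpoint or rebalance work dynamically. The following is a simplified, standalone Python sketch in the spirit of Beam's OffsetRangeTracker; the class and method names here are illustrative only and are not the actual Beam API.

```python
class OffsetRangeTracker:
    """Tracks a claimable restriction [start, stop); simplified illustration,
    not Beam's real OffsetRangeTracker."""

    def __init__(self, start, stop):
        self.start = start          # inclusive
        self.stop = stop            # exclusive
        self.last_claimed = None

    def try_claim(self, offset):
        """Claim the right to process `offset`; fails once the range is done."""
        if offset >= self.stop:
            return False
        self.last_claimed = offset
        return True

    def try_split(self, split_offset):
        """Shrink this restriction to [start, split_offset) and hand back the
        unprocessed remainder [split_offset, stop) for another worker."""
        done = self.last_claimed + 1 if self.last_claimed is not None else self.start
        if done < split_offset < self.stop:
            residual = (split_offset, self.stop)
            self.stop = split_offset
            return residual
        return None


def read_records(source, tracker):
    """A DoFn-style processing body: claim each position before emitting it,
    so a concurrent split is always honored."""
    offset = tracker.start
    while tracker.try_claim(offset):
        yield source[offset]
        offset += 1


# Usage sketch: a runner "steals" the tail of the range mid-flight.
source = list("abcdefgh")
tracker = OffsetRangeTracker(0, 8)
records = read_records(source, tracker)
head = [next(records) for _ in range(3)]   # processes a, b, c
residual = tracker.try_split(5)            # remainder (5, 8) goes elsewhere
tail = list(records)                       # finishes d, e and stops at 5
```

Because splitting operates on the restriction rather than on the source itself, the same processing body serves bounded (batch) and unbounded (streaming) reads alike, which is what lets I/O connectors remain ordinary transforms.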
Eugene Kirpichov is a staff software engineer on the Cloud Dataflow team at Google, where he works on the Apache Beam programming model and APIs. Previously, Eugene worked on Cloud Dataflow’s autoscaling and straggler elimination techniques. He is interested in programming language theory, data visualization, and machine learning.
©2018, O’Reilly UK Ltd