Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Schedule: Data Integration and Data Pipelines sessions

Machine learning applications rely on data. The first step is to bring together existing data sources and, when appropriate, enrich them with other data sets. In most cases, data needs to be refined and prepared before it is ready for analytic applications. This series of talks showcases modern approaches to data integration and to the creation and maintenance of data pipelines.

11:00am–11:40am Wednesday, March 7, 2018
Gwen Shapira (Confluent)
Average rating: 4.93 (14 ratings)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.
11:50am–12:30pm Wednesday, March 7, 2018
Eugene Kirpichov (Google)
Average rating: 4.75 (4 ratings)
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn.
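The core idea behind Splittable DoFn, as the abstract describes it, is that a single input element (say, a file to ingest) carries a "restriction" describing work that can be split into independently processable pieces. That idea can be illustrated without Beam; the sketch below is purely conceptual, and names such as `split_restriction` are illustrative rather than Beam's actual API:

```python
# Conceptual sketch of the Splittable DoFn idea: one element carries a
# restriction (here, a record range) that can be split into sub-ranges,
# each processed independently. Not Apache Beam's actual API.
from dataclasses import dataclass

@dataclass
class OffsetRange:
    start: int
    stop: int  # exclusive

def split_restriction(r: OffsetRange, chunk: int):
    """Split a range into sub-ranges of at most `chunk` records."""
    return [OffsetRange(s, min(s + chunk, r.stop))
            for s in range(r.start, r.stop, chunk)]

def process(element: str, r: OffsetRange):
    """Process only the records claimed by this restriction."""
    return [f"{element}:{i}" for i in range(r.start, r.stop)]

# A single "read this file" element becomes many parallelizable pieces.
parts = split_restriction(OffsetRange(0, 10), chunk=4)
records = [rec for part in parts for rec in process("file-a", part)]
```

This is what makes ingestion "just another data processing task": the runner can rebalance or parallelize the sub-ranges like any other work items.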
1:50pm–2:30pm Wednesday, March 7, 2018
Jordan Hambleton (Cloudera), GuruDharmateja Medasani (Domino Data Lab)
Average rating: 4.25 (4 ratings)
When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.
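The abstract doesn't spell out an implementation, but the underlying pattern is to persist the last processed offset alongside the results and resume from it on restart, so a failure causes neither loss nor reprocessing. A minimal stdlib sketch of that idea (the log and the `store` dict are stand-ins for a Kafka partition and an external offset store):

```python
# Stand-in for one Kafka partition: an ordered log of (offset, record).
LOG = [(i, f"event-{i}") for i in range(10)]

def process_from(log, store, key="partition-0"):
    """Resume after the last committed offset; commit after each record."""
    start = store.get(key, -1) + 1
    out = []
    for offset, record in log[start:]:
        out.append(record.upper())   # the actual stream-processing work
        store[key] = offset          # "commit" the offset with the result
    return out

store = {}
first = process_from(LOG[:6], store)   # pipeline dies after 6 records
resumed = process_from(LOG, store)     # restart: picks up at offset 6
```

In a real deployment the commit and the result write should be atomic (e.g., in the same transaction or the same sink), which is exactly the state-restoration concern the talk addresses.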
2:40pm–3:20pm Wednesday, March 7, 2018
Sean Ma (Trifacta)
Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines.
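As one illustration of the kind of inconsistency detection the abstract describes, a schema diff between a source and a target data model might compare column sets and types. This is a hypothetical sketch, not Trifacta's method:

```python
def schema_diff(source: dict, target: dict):
    """Compare two {column: type} schemas and report inconsistencies."""
    missing = sorted(set(source) - set(target))
    extra = sorted(set(target) - set(source))
    type_drift = {c: (source[c], target[c])
                  for c in source.keys() & target.keys()
                  if source[c] != target[c]}
    return {"missing_in_target": missing,
            "unexpected_in_target": extra,
            "type_mismatches": type_drift}

src = {"id": "int", "email": "str", "signup": "date"}
tgt = {"id": "int", "email": "str", "signup": "str", "score": "float"}
report = schema_diff(src, tgt)
```

A report like this is the raw material for the visualization and resolution steps the session covers.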
4:20pm–5:00pm Wednesday, March 7, 2018
Dorna Bandari (Jetlore)
Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth.
5:10pm–5:50pm Wednesday, March 7, 2018
Abe Gong (Superconductive Health), James Campbell (USG)
Average rating: 5.00 (4 ratings)
Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.
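Great Expectations frames pipeline tests as declarative "expectations" evaluated against a batch of data rather than against code. Without depending on the library, the core idea can be sketched in plain Python; the function names below mimic, but are not, the Great Expectations API:

```python
# Pipeline tests: assertions about a batch of records, returning a
# result object instead of crashing, so failures can be reported.
# Names mimic Great Expectations' style; this is not its actual API.
def expect_column_values_to_be_between(rows, column, low, high):
    bad = [r for r in rows if not (low <= r[column] <= high)]
    return {"success": not bad, "unexpected_count": len(bad)}

def expect_column_values_to_not_be_null(rows, column):
    bad = [r for r in rows if r.get(column) is None]
    return {"success": not bad, "unexpected_count": len(bad)}

batch = [{"age": 34, "name": "a"}, {"age": 151, "name": None}]
r1 = expect_column_values_to_be_between(batch, "age", 0, 120)
r2 = expect_column_values_to_not_be_null(batch, "name")
```

Running such expectations on every batch is what brings a data pipeline "under test" in the sense the talk describes: bad data is caught at the boundary instead of propagating downstream.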
5:10pm–5:50pm Wednesday, March 7, 2018
Gwen Shapira (Confluent)
Average rating: 5.00 (3 ratings)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.