Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Schedule: Data Integration and Data Pipelines sessions

Machine learning applications rely on data. The first step is to bring together existing data sources and, when appropriate, enrich them with other data sets. In most cases, data needs to be refined and prepared before it is ready for analytic applications. This series of talks showcases modern approaches to data integration and to the creation and maintenance of data pipelines.

11:00am–11:40am Wednesday, March 7, 2018
Gwen Shapira (Confluent)
Average rating: 4.93 (14 ratings)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.
11:50am–12:30pm Wednesday, March 7, 2018
Eugene Kirpichov (Google)
Average rating: 4.75 (4 ratings)
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn.
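The core idea behind Splittable DoFn, as the abstract describes it, is that a single input element (say, a file to ingest) carries a "restriction" describing work that can be split into independently processable pieces. That idea can be illustrated without Beam; the sketch below is purely conceptual, and names such as `split_restriction` are illustrative rather than Beam's actual API:

```python
# Conceptual sketch of the Splittable DoFn idea: one element carries a
# restriction (here, a record range) that can be split into sub-ranges,
# each processed independently. Not Apache Beam's actual API.
from dataclasses import dataclass

@dataclass
class OffsetRange:
    start: int
    stop: int  # exclusive

def split_restriction(r: OffsetRange, chunk: int):
    """Split a range into sub-ranges of at most `chunk` records."""
    return [OffsetRange(s, min(s + chunk, r.stop))
            for s in range(r.start, r.stop, chunk)]

def process(element: str, r: OffsetRange):
    """Process only the records claimed by this restriction."""
    return [f"{element}:{i}" for i in range(r.start, r.stop)]

# A single "read this file" element becomes many parallelizable pieces.
parts = split_restriction(OffsetRange(0, 10), chunk=4)
records = [rec for part in parts for rec in process("file-a", part)]
```

This is what makes ingestion "just another data processing task": the runner can rebalance or parallelize the sub-ranges like any other work items.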
1:50pm–2:30pm Wednesday, March 7, 2018
Jordan Hambleton (Cloudera), GuruDharmateja Medasani (Domino Data Lab)
Average rating: 4.25 (4 ratings)
When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.
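The abstract doesn't spell out an implementation, but the underlying pattern is to persist the last processed offset alongside the results and resume from it on restart, so a failure causes neither loss nor reprocessing. A minimal stdlib sketch of that idea (the log and the `store` dict are stand-ins for a Kafka partition and an external offset store):

```python
# Stand-in for one Kafka partition: an ordered log of (offset, record).
LOG = [(i, f"event-{i}") for i in range(10)]

def process_from(log, store, key="partition-0"):
    """Resume after the last committed offset; commit after each record."""
    start = store.get(key, -1) + 1
    out = []
    for offset, record in log[start:]:
        out.append(record.upper())   # the actual stream-processing work
        store[key] = offset          # "commit" the offset with the result
    return out

store = {}
first = process_from(LOG[:6], store)   # pipeline dies after 6 records
resumed = process_from(LOG, store)     # restart: picks up at offset 6
```

In a real deployment the commit and the result write should be atomic (e.g., in the same transaction or the same sink), which is exactly the state-restoration concern the talk addresses.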
2:40pm–3:20pm Wednesday, March 7, 2018
Sean Ma (Trifacta)
Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines.
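As one illustration of the kind of inconsistency detection the abstract describes, a schema diff between a source and a target data model might compare column sets and types. This is a hypothetical sketch, not Trifacta's method:

```python
def schema_diff(source: dict, target: dict):
    """Compare two {column: type} schemas and report inconsistencies."""
    missing = sorted(set(source) - set(target))
    extra = sorted(set(target) - set(source))
    type_drift = {c: (source[c], target[c])
                  for c in source.keys() & target.keys()
                  if source[c] != target[c]}
    return {"missing_in_target": missing,
            "unexpected_in_target": extra,
            "type_mismatches": type_drift}

src = {"id": "int", "email": "str", "signup": "date"}
tgt = {"id": "int", "email": "str", "signup": "str", "score": "float"}
report = schema_diff(src, tgt)
```

A report like this is the raw material for the visualization and resolution steps the session covers.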
4:20pm–5:00pm Wednesday, March 7, 2018
Dorna Bandari (Jetlore)
Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth.
5:10pm–5:50pm Wednesday, March 7, 2018
Abe Gong (Superconductive Health), James Campbell (USG)
Average rating: 5.00 (4 ratings)
Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.
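Great Expectations frames pipeline tests as declarative "expectations" evaluated against a batch of data rather than against code. Without depending on the library, the core idea can be sketched in plain Python; the function names below mimic, but are not, the Great Expectations API:

```python
# Pipeline tests: assertions about a batch of records, returning a
# result object instead of crashing, so failures can be reported.
# Names mimic Great Expectations' style; this is not its actual API.
def expect_column_values_to_be_between(rows, column, low, high):
    bad = [r for r in rows if not (low <= r[column] <= high)]
    return {"success": not bad, "unexpected_count": len(bad)}

def expect_column_values_to_not_be_null(rows, column):
    bad = [r for r in rows if r.get(column) is None]
    return {"success": not bad, "unexpected_count": len(bad)}

batch = [{"age": 34, "name": "a"}, {"age": 151, "name": None}]
r1 = expect_column_values_to_be_between(batch, "age", 0, 120)
r2 = expect_column_values_to_not_be_null(batch, "name")
```

Running such expectations on every batch is what brings a data pipeline "under test" in the sense the talk describes: bad data is caught at the boundary instead of propagating downstream.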
5:10pm–5:50pm Wednesday, March 7, 2018
Gwen Shapira (Confluent)
Average rating: 5.00 (3 ratings)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.