Presented By O’Reilly and Cloudera

Make Data Work

September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Schedule: Data Integration and Data Pipelines sessions

Machine learning applications rely on data. The first step is to bring together existing data sources and, when appropriate, enrich them with other data sets. In most cases, data needs to be refined and prepared before it’s ready for analytic applications. This series of talks showcases some modern approaches to data integration and the creation and maintenance of data pipelines.

11:20am–12:00pm Wednesday, 09/12/2018

The future of ETL isn’t what it used to be.

Location: 1A 23/24 Level: Intermediate

Gwen Shapira (Confluent)

Average rating: 4.00 (4 ratings)

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering. You'll learn how modern engineering organizations use Apache Kafka, microservices, and event streams to efficiently build data pipelines that are scalable, reliable, and built to evolve.

11:20am–12:00pm Wednesday, 09/12/2018

Next-generation cybersecurity via data fusion, AI, and big data: Pragmatic lessons from the front lines in financial services

Location: Expo Hall Level: Non-technical

Usama Fayyad (Open Insights & OODA Health, Inc.), Troels Oerting (WEF Global Cybersecurity Center)

Average rating: 3.00 (1 rating)

Usama Fayyad and Troels Oerting share outcomes and lessons learned from building and deploying a global data fusion, incident analysis/visualization, and effective cybersecurity defense based on big data and AI at a major EU bank, in collaboration with several financial services institutions.

11:20am–12:00pm Wednesday, 09/12/2018

Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber

Location: 1A 10 Level: Intermediate

Felix Cheung (Uber)

Average rating: 4.60 (5 ratings)

Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame.

1:15pm–1:55pm Wednesday, 09/12/2018

Lessons learned building a scalable and extendable data pipeline for Call of Duty

Location: 1A 23/24 Level: Intermediate

Yaroslav Tkachenko (Activision)

Average rating: 4.67 (3 ratings)

What's easier than building a data pipeline? You add a few Apache Kafka clusters and a way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse... wait, this looks like a lot of things. Join Yaroslav Tkachenko to learn best practices for building a data pipeline, drawn from his experience at Demonware/Activision.

2:05pm–2:45pm Wednesday, 09/12/2018

Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework

Location: 1A 23/24 Level: Intermediate

Danny Chen (Uber Technologies), Omkar Joshi (Uber), Eric Sayle (Uber Technologies)

Average rating: 3.80 (5 ratings)

Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores, and you'll take a deep dive into the architecture to see how it all works.

2:55pm–3:35pm Wednesday, 09/12/2018

Real-time analytics and BI with data lakes and data warehouses using Kudu, HBase, Spark, and Kafka: Lessons learned

Location: 1A 23/24 Level: Intermediate

Mauricio Aristizabal (Impact)

Average rating: 2.67 (3 ratings)

Mauricio Aristizabal shares lessons learned from migrating Impact's traditional ETL platform to a real-time platform on Hadoop (leveraging the full Cloudera EDH stack). Mauricio also discusses the company's data lake in HBase, Spark Streaming jobs (with Spark SQL), using Kudu for "fast data" BI queries, and using Kafka's data bus for loose coupling between components.

4:35pm–5:15pm Wednesday, 09/12/2018

Tracking data lineage at Stitch Fix

Location: 1A 23/24 Level: Intermediate

Neelesh Salian (Stitch Fix)

Average rating: 1.33 (3 ratings)

Neelesh Srinivas Salian explains how Stitch Fix built a service to better understand the movement and evolution of data within the company's data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh covers why and how Stitch Fix built the service and details some use cases.

5:25pm–6:05pm Wednesday, 09/12/2018

Circuit breakers to safeguard for garbage in, garbage out

Location: 1A 23/24 Level: Beginner

Sandeep Uttamchandani (Intuit)

Do your analysts always trust the insights generated by your data platform? Ensuring insights are always reliable is critical for use cases in the financial sector. Sandeep Uttamchandani outlines a circuit breaker pattern developed for data pipelines, similar to the common design pattern used in service architectures, that detects and corrects problems so that insights remain reliable.
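
For readers unfamiliar with the idea, the following is a minimal Python sketch of what a circuit breaker for a data pipeline might look like. It is not taken from the talk and is not Intuit's implementation: the names `QualityCheck`, `DataCircuitBreaker`, and `publish_if_healthy`, as well as the two example checks, are hypothetical. The sketch only detects bad batches and quarantines them; it does not attempt the correction step the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QualityCheck:
    """A named predicate over a batch of records (hypothetical helper)."""
    name: str
    passes: Callable[[List[dict]], bool]


class DataCircuitBreaker:
    """Opens (blocks publishing) when any quality check fails on a batch."""

    def __init__(self, checks: Iterable[QualityCheck]):
        self.checks = list(checks)
        self.is_open = False
        self.failed: List[str] = []

    def evaluate(self, batch: List[dict]) -> bool:
        """Run every check; return True if the batch may flow downstream."""
        self.failed = [c.name for c in self.checks if not c.passes(batch)]
        self.is_open = bool(self.failed)
        return not self.is_open


def publish_if_healthy(batch, breaker, sink, quarantine):
    """Send a batch to the sink, or to quarantine when the circuit is open."""
    if breaker.evaluate(batch):
        sink.extend(batch)
    else:
        # Keep the failed check names next to the batch for later inspection.
        quarantine.append((list(breaker.failed), batch))


# Illustrative checks (assumptions): the batch is non-empty and every
# record carries a non-null "amount" field.
checks = [
    QualityCheck("non_empty", lambda b: len(b) > 0),
    QualityCheck(
        "amount_present",
        lambda b: all(r.get("amount") is not None for r in b),
    ),
]
```

In this sketch a failing batch never reaches the sink, so downstream consumers see stale but trustworthy data rather than fresh but suspect data, which mirrors how a service-level circuit breaker stops calls to an unhealthy dependency.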

5:25pm–6:05pm Wednesday, 09/12/2018

Hudi: Unifying storage and serving for batch and near-real-time analytics

Location: 1E 07/08 Level: Beginner

Nishith Agarwal (Uber), Balaji Varadarajan (Uber), Vinoth Chandar (Apache Hudi)

Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.

11:20am–12:00pm Thursday, 09/13/2018

Data discovery and lineage: Integrating streaming data in the public cloud with on-prem, classic data stores, and heterogeneous schema types

Location: 1E 09 Level: Advanced

Barbara Eckman (Comcast)

Average rating: 4.33 (6 ratings)

Comcast’s streaming data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. Barbara Eckman explains how Comcast recently integrated on-prem data sources, including traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro.

3:30pm–4:10pm Thursday, 09/13/2018

Kafka at PayPal: Enabling 400 billion messages a day

Location: 1E 09 Level: Intermediate

Kevin Lu (PayPal), Maulin Vasavada (PayPal), Na Yang (PayPal)

Average rating: 4.00 (3 ratings)

PayPal is one of the biggest Kafka users in the industry; it manages and maintains over 40 production Kafka clusters in three geodistributed data centers and supports 400 billion Kafka messages a day. Kevin Lu, Maulin Vasavada, and Na Yang explore the management and monitoring PayPal applies to Kafka, from client-perceived statistics to configuration management, failover, and data loss auditing.

4:20pm–5:00pm Thursday, 09/13/2018

Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from multiple services with Apache Kafka and Google Cloud Platform

Location: 1A 23/24 Level: Beginner

Kenji Hayashida (Recruit Lifestyle Co., Ltd.), Toru Sasaki (NTT DATA Corporation)

Average rating: 4.50 (2 ratings)

Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform can handle around 1 TB/day of application logs generated by a number of services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices for and lessons learned about topics such as schema evolution and network architecture.
