Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Schedule: Data Integration and Data Pipelines sessions

Machine learning applications rely on data. The first step is to bring together existing data sources and, when appropriate, enrich them with other data sets. In most cases, data needs to be refined and prepared before it's ready for analytic applications. This series of talks showcases modern approaches to data integration and the creation and maintenance of data pipelines.

11:20am–12:00pm Wednesday, 09/12/2018
Location: 1A 23/24 Level: Intermediate
Gwen Shapira (Confluent)
Average rating: 4.00 (4 ratings)
Gwen Shapira shares design and architecture patterns used to modernize data engineering. You'll learn how modern engineering organizations use Apache Kafka, microservices, and event streams to efficiently build data pipelines that are scalable, reliable, and built to evolve.
11:20am–12:00pm Wednesday, 09/12/2018
Location: Expo Hall Level: Non-technical
Usama Fayyad (Open Insights & OODA Health, Inc.), Troels Oerting (WEF Global Cybersecurity Center)
Average rating: 3.00 (1 rating)
Usama Fayyad and Troels Oerting share outcomes and lessons learned from building and deploying a global platform for data fusion, incident analysis and visualization, and effective cybersecurity defense based on big data and AI at a major EU bank, in collaboration with several financial services institutions.
11:20am–12:00pm Wednesday, 09/12/2018
Location: 1A 10 Level: Intermediate
Felix Cheung (Uber)
Average rating: 4.60 (5 ratings)
Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame.
1:15pm–1:55pm Wednesday, 09/12/2018
Location: 1A 23/24 Level: Intermediate
Yaroslav Tkachenko (Activision)
Average rating: 4.67 (3 ratings)
What's easier than building a data pipeline? You add a few Apache Kafka clusters and a way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse... wait, that's a lot of moving parts. Join Yaroslav Tkachenko to learn best practices for building a data pipeline, drawn from his experience at Demonware/Activision.
2:05pm–2:45pm Wednesday, 09/12/2018
Location: 1A 23/24 Level: Intermediate
Danny Chen (Uber Technologies), Omkar Joshi (Uber Technologies), Eric Sayle (Uber Technologies)
Average rating: 3.80 (5 ratings)
Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores and take a deep dive into the architecture to see how it all works.
2:55pm–3:35pm Wednesday, 09/12/2018
Location: 1A 23/24 Level: Intermediate
Average rating: 2.67 (3 ratings)
Mauricio Aristizabal shares lessons learned from migrating Impact's traditional ETL platform to a real-time platform on Hadoop (leveraging the full Cloudera EDH stack). Mauricio also discusses the company's data lake in HBase, Spark Streaming jobs (with Spark SQL), using Kudu for "fast data" BI queries, and using Kafka's data bus for loose coupling between components.
4:35pm–5:15pm Wednesday, 09/12/2018
Location: 1A 23/24 Level: Intermediate
Average rating: 1.33 (3 ratings)
Neelesh Srinivas Salian explains how Stitch Fix built a service to better understand the movement and evolution of data within the company's data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh covers why and how Stitch Fix built the service and details some use cases.
5:25pm–6:05pm Wednesday, 09/12/2018
Location: 1A 23/24 Level: Beginner
Do your analysts always trust the insights generated by your data platform? Ensuring insights are always reliable is critical for use cases in the financial sector. Sandeep Uttamchandani outlines a circuit breaker pattern developed for data pipelines, similar to the common design pattern used in service architectures, that detects and corrects problems and ensures always reliable insights.
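The circuit breaker idea this talk borrows from service architectures can be illustrated with a minimal sketch. This is not the speaker's actual implementation; the class and method names below are hypothetical. The breaker trips after repeated data-quality check failures and withholds insights from consumers until checks pass again:

```python
class PipelineCircuitBreaker:
    """Illustrative circuit breaker for a data pipeline stage.

    Hypothetical sketch: after `failure_threshold` consecutive
    data-quality check failures, the breaker "opens" and downstream
    publication of insights is blocked until a check passes again.
    """

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.is_open = False  # open circuit = insights withheld

    def record_check(self, passed: bool) -> None:
        """Record the outcome of one data-quality check."""
        if passed:
            # A passing check resets the breaker and resumes publication.
            self.consecutive_failures = 0
            self.is_open = False
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # Trip the breaker: stop serving unreliable insights.
                self.is_open = True

    def can_publish(self) -> bool:
        """Downstream consumers only see insights while the circuit is closed."""
        return not self.is_open
```

As in the service-mesh version of the pattern, the key design choice is failing closed: when quality checks degrade, the pipeline prefers serving no insights over serving wrong ones.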
5:25pm–6:05pm Wednesday, 09/12/2018
Location: 1E 07/08 Level: Beginner
Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.
11:20am–12:00pm Thursday, 09/13/2018
Location: 1E 09 Level: Advanced
Barbara Eckman (Comcast)
Average rating: 4.33 (6 ratings)
Comcast’s streaming data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. Barbara Eckman explains how Comcast recently integrated on-prem data sources, including traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro.
3:30pm–4:10pm Thursday, 09/13/2018
Location: 1E 09 Level: Intermediate
Kevin Lu (PayPal), Maulin Vasavada (PayPal), Na Yang (PayPal)
Average rating: 4.00 (3 ratings)
PayPal is one of the biggest Kafka users in the industry; it manages and maintains over 40 production Kafka clusters in three geodistributed data centers and supports 400 billion Kafka messages a day. Kevin Lu, Maulin Vasavada, and Na Yang explore the management and monitoring PayPal applies to Kafka, from client-perceived statistics to configuration management, failover, and data loss auditing.
4:20pm–5:00pm Thursday, 09/13/2018
Location: 1A 23/24 Level: Beginner
Kenji Hayashida (Recruit Lifestyle Co., Ltd.), Toru Sasaki (NTT DATA Corporation)
Average rating: 4.50 (2 ratings)
Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform can handle around 1 TB/day of application logs generated by a number of services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices and lessons learned on topics such as schema evolution and network architecture.