Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously

Henry Cai (Pinterest), Yi Yin (Pinterest)
2:40pm–3:20pm Wednesday, March 7, 2018
Average rating: 3.00 (1 rating)

Who is this presentation for?

  • Data engineers, software engineers, architects, project managers, machine learning engineers, data scientists, and data users

What you'll learn

  • Learn how Pinterest solved the problem of moving hundreds of terabytes of MySQL data offline on a daily basis to power continuous computation


Pinterest helps people discover, save, and do things that they love. The company has a hundred billion core objects (pins, boards, and users) stored in MySQL at the scale of a hundred terabytes. Most of that data is used to build data-driven products, such as personalized recommendations, A/B experiments, and search indexes.

As Pinterest moves toward real-time computation, the company faces much more stringent SLA requirements, such as making MySQL data available in S3/Hadoop within 15 minutes and serving DB data incrementally in stream processing. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest’s continuous DB ingestion system for streaming MySQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indexes. The system listens for MySQL BinLog changes, publishes the change logs as an Apache Kafka change stream, and ingests and compacts the stream into columnar tables in S3/Hadoop within 15 minutes.
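To make the pipeline concrete, here is a minimal sketch of what a single BinLog-derived change event might look like as it is serialized for a Kafka change stream. The field names and event shape are illustrative assumptions, not WaterMill’s actual schema.

```python
import json

# Hypothetical shape of a MySQL BinLog change event published to a
# Kafka change stream; field names are illustrative only.
def make_change_event(table, op, primary_key, before, after, log_pos):
    return {
        "table": table,      # source MySQL table
        "op": op,            # "insert", "update", or "delete"
        "pk": primary_key,   # primary key of the changed row
        "before": before,    # row image before the change (None for inserts)
        "after": after,      # row image after the change (None for deletes)
        "log_pos": log_pos,  # BinLog position, usable for ordering and rewind
    }

event = make_change_event(
    table="pins", op="update", primary_key=42,
    before={"id": 42, "title": "old"}, after={"id": 42, "title": "new"},
    log_pos=1001,
)
payload = json.dumps(event)  # message value to publish to the Kafka topic
```

Carrying both the before and after row images, plus a log position, is what lets downstream consumers replay, deduplicate, or rewind the stream.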

Topics include:

  • Scalable data partitioning with 100 TB of MySQL data
  • Building an efficient compaction algorithm
  • Schema migration, rewind, and recovery
  • PII (personally identifiable information) processing
  • Columnar storage for efficient incremental queries
  • How the DB change stream powers other use cases, such as cache invalidation in multi-data center deployments
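The compaction topic above boils down to merging a base snapshot with an ordered batch of change events, keeping only the latest row image per primary key and dropping deleted rows. A minimal sketch of that semantics, assuming the hypothetical event shape (`pk`, `op`, `after`, `log_pos`) rather than WaterMill’s real format:

```python
def compact(base_rows, change_events):
    """Merge a base snapshot with change events ordered by BinLog
    position, keeping the latest row image per primary key and
    dropping rows that were deleted."""
    table = {row["id"]: row for row in base_rows}
    for ev in sorted(change_events, key=lambda e: e["log_pos"]):
        if ev["op"] == "delete":
            table.pop(ev["pk"], None)  # row removed upstream
        else:
            table[ev["pk"]] = ev["after"]  # insert or update wins
    return sorted(table.values(), key=lambda r: r["id"])

base = [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}]
changes = [
    {"pk": 2, "op": "upsert", "after": {"id": 2, "title": "b2"}, "log_pos": 10},
    {"pk": 1, "op": "delete", "after": None, "log_pos": 11},
    {"pk": 3, "op": "upsert", "after": {"id": 3, "title": "c"}, "log_pos": 12},
]
result = compact(base, changes)
# result keeps rows 2 (updated) and 3 (new); row 1 is dropped
```

At Pinterest’s scale the same idea would run as a distributed batch job over partitioned columnar files rather than an in-memory dict, but the keep-latest-by-key merge is the core of the technique.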

Henry Cai


Henry Cai is a software engineer on the data engineering team at Pinterest, where he designs large-scale big data infrastructure. Previously, he worked at LinkedIn. Henry is a maintainer of and contributor to many open source data ingestion systems, including Camus, Kafka, Gobblin, and Secor.


Yi Yin


Yi Yin is a software engineer on the data engineering team at Pinterest, where he works on Kafka-to-S3 persistence tools and schema generation for Pinterest’s data.