Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously

Henry Cai (Pinterest), Yi Yin (Pinterest)
2:40pm–3:20pm Wednesday, March 7, 2018
Average rating: 3.00 (1 rating)

Who is this presentation for?

  • Data engineers, software engineers, architects, project managers, machine learning engineers, data scientists, and data users

What you'll learn

  • Learn how Pinterest solved the problem of moving hundreds of terabytes of MySQL data offline on a daily basis to power continuous computation


Pinterest helps people discover, save, and do things that they love. The company has a hundred billion core objects (pins, boards, and users) stored in MySQL at the scale of a hundred terabytes. Most of that data is used to build data-driven products, such as personalized recommendations, A/B experiments, and search indexes.

As Pinterest moves toward real-time computation, the company faces much more stringent SLA requirements, such as making MySQL data available in S3/Hadoop within 15 minutes and serving DB data incrementally in stream processing. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest’s continuous DB ingestion system for streaming MySQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indexes. The system listens for MySQL BinLog changes, publishes the change logs as an Apache Kafka change stream, and ingests and compacts the stream into columnar tables in S3/Hadoop within 15 minutes.
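To make the pipeline concrete, here is a minimal sketch of what a single BinLog-derived change event might look like as it is serialized for a Kafka change stream. The field names and event shape are illustrative assumptions, not WaterMill’s actual schema.

```python
import json

# Hypothetical shape of a MySQL BinLog change event published to a
# Kafka change stream; field names are illustrative only.
def make_change_event(table, op, primary_key, before, after, log_pos):
    return {
        "table": table,      # source MySQL table
        "op": op,            # "insert", "update", or "delete"
        "pk": primary_key,   # primary key of the changed row
        "before": before,    # row image before the change (None for inserts)
        "after": after,      # row image after the change (None for deletes)
        "log_pos": log_pos,  # BinLog position, usable for ordering and rewind
    }

event = make_change_event(
    table="pins", op="update", primary_key=42,
    before={"id": 42, "title": "old"}, after={"id": 42, "title": "new"},
    log_pos=1001,
)
payload = json.dumps(event)  # message value to publish to the Kafka topic
```

Carrying both the before and after row images, plus a log position, is what lets downstream consumers replay, deduplicate, or rewind the stream.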

Topics include:

  • Scalable data partitioning with 100 TB of MySQL data
  • Building an efficient compaction algorithm
  • Schema migration, rewind, and recovery
  • PII (personally identifiable information) processing
  • Columnar storage for efficient incremental queries
  • How the DB change stream powers other use cases, such as cache invalidation in multi-data center deployments
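The compaction topic above boils down to merging a base snapshot with an ordered batch of change events, keeping only the latest row image per primary key and dropping deleted rows. A minimal sketch of that semantics, assuming the hypothetical event shape (`pk`, `op`, `after`, `log_pos`) rather than WaterMill’s real format:

```python
def compact(base_rows, change_events):
    """Merge a base snapshot with change events ordered by BinLog
    position, keeping the latest row image per primary key and
    dropping rows that were deleted."""
    table = {row["id"]: row for row in base_rows}
    for ev in sorted(change_events, key=lambda e: e["log_pos"]):
        if ev["op"] == "delete":
            table.pop(ev["pk"], None)  # row removed upstream
        else:
            table[ev["pk"]] = ev["after"]  # insert or update wins
    return sorted(table.values(), key=lambda r: r["id"])

base = [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}]
changes = [
    {"pk": 2, "op": "upsert", "after": {"id": 2, "title": "b2"}, "log_pos": 10},
    {"pk": 1, "op": "delete", "after": None, "log_pos": 11},
    {"pk": 3, "op": "upsert", "after": {"id": 3, "title": "c"}, "log_pos": 12},
]
result = compact(base, changes)
# result keeps rows 2 (updated) and 3 (new); row 1 is dropped
```

At Pinterest’s scale the same idea would run as a distributed batch job over partitioned columnar files rather than an in-memory dict, but the keep-latest-by-key merge is the core of the technique.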

Henry Cai


Henry Cai is a software engineer on the data engineering team at Pinterest, where he designs large-scale big data infrastructure. Previously, he worked at LinkedIn. Henry is a maintainer of and contributor to many open source data ingestion systems, including Camus, Kafka, Gobblin, and Secor.


Yi Yin


Yi Yin is a software engineer on the data engineering team at Pinterest, where he works on Kafka-to-S3 persistence tools and schema generation for Pinterest’s data.