Pinterest helps people discover, save, and do things that they love. The company has a hundred billion core objects (pins, boards, and users) stored in MySQL at the scale of a hundred terabytes. Most of that data is used to build data-driven products, such as personalized recommendations, A/B experiments, and search indexes.
As Pinterest moves toward real-time computation, the company faces much more stringent SLA requirements, such as making MySQL data available in S3/Hadoop within 15 minutes and serving DB data incrementally for stream processing. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest’s continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indexes. The system listens for MySQL binlog changes, publishes the MySQL change logs as an Apache Kafka change stream, and ingests and compacts the stream into columnar tables in S3/Hadoop within 15 minutes.
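To make the compaction step concrete, here is a minimal sketch of how a stream of row-level changes could be folded into the latest table state. It is not WaterMill's implementation: the `ChangeEvent` shape, its field names, and the `compact` function are hypothetical, and the sketch uses in-memory dictionaries where a real pipeline would read from Kafka and write columnar files.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row-level change (hypothetical shape, not WaterMill's schema)."""
    key: int              # primary key of the changed row
    op: str               # "insert", "update", or "delete"
    seq: int              # position in the change log; higher means later
    row: Optional[dict]   # full row image after the change; None for deletes


def compact(snapshot: Dict[int, dict],
            events: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Apply change events to a snapshot, keeping the latest state per key."""
    result = dict(snapshot)  # copy so the input snapshot is left untouched
    # Replay in log order so that the last write for each key wins.
    for ev in sorted(events, key=lambda e: e.seq):
        if ev.op == "delete":
            result.pop(ev.key, None)
        elif ev.op in ("insert", "update"):
            result[ev.key] = ev.row
        else:
            raise ValueError(f"unknown operation: {ev.op!r}")
    return result


# Usage: fold three out-of-order changes into an existing snapshot.
snapshot = {1: {"id": 1, "title": "a"}, 2: {"id": 2, "title": "b"}}
events = [
    ChangeEvent(key=2, op="delete", seq=3, row=None),
    ChangeEvent(key=1, op="update", seq=1, row={"id": 1, "title": "a2"}),
    ChangeEvent(key=3, op="insert", seq=2, row={"id": 3, "title": "c"}),
]
compacted = compact(snapshot, events)
```

Replaying by sequence number means the result does not depend on the order in which events arrive, which is one simple way to tolerate out-of-order delivery within a compaction window.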
Henry Cai is a software engineer on the data engineering team at Pinterest, where he designs large-scale big data infrastructure. Previously, he worked at LinkedIn. Henry is a maintainer of and contributor to many open source data ingestion systems, including Camus, Kafka, Gobblin, and Secor.
Yi Yin is a software engineer on the data engineering team at Pinterest, where he works on tools for persisting Kafka data to S3 and on schema generation for Pinterest’s data.
©2018, O'Reilly Media, Inc.