Pinterest operates on data at petabyte scale. Previously, the company’s fact tables were generated daily using Hadoop, resulting in data that was frequently 24–48 hours old. To support real-time decision-making, stats, and analytics, Pinterest modeled its warehouse on a quasi-Kappa architecture, treating batch processing as a special case of stream processing and warehousing data with sub-15-minute lag.
Swaminathan Sundaramurthy and Mark Cho offer an overview of Pinterest’s real-time data pipeline, discussing the company’s decision to warehouse data in near real time so that downstream systems can operate on much fresher data, the platform’s architecture, and its impact on Pinterest’s systems, tools, and processes. They conclude by demonstrating how Pinterest models real-time ads analytics use cases on the platform and sharing lessons learned along the way.
Swaminathan Sundaramurthy is a Director of Engineering at Salesforce Einstein, where he manages the Machine Learning Services and Orchestration teams. Prior to Salesforce, Swami worked at Pinterest, where he initiated and managed the company’s stream platform and machine learning training platform and led anti-spam and fraud efforts. He began his career as an individual contributor, spending more than 12 years building distributed systems and cloud platforms at Amazon, Yahoo, Microsoft, and Ask Jeeves. Swami is passionate about technology, distributed systems, promoting diversity, and eliminating bias in the workplace.
Mark Cho is a software engineer at Pinterest.
©2017, O'Reilly Media, Inc.