Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo’s Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load.
The resulting platform is based on Mesos. Mesos has allowed Criteo to scale on demand and make better use of resources, iterate on development much faster than on bare metal, and roll out new versions without downtime for its users. Finally, it has allowed the company to eliminate the last single point of failure (SPOF) in its Hive stack. Szehon and Pawel detail Criteo’s data architecture and explain how the company solved challenges in security, monitoring, scheduling, and load balancing on multiple layers. They also discuss the gains made by this process.
Szehon Ho is a staff software engineer on the analytics data storage team at Criteo, where he works on Criteo’s Hive platform. Previously, he was a software engineer on the Hive team at Cloudera. He was a committer and PMC member in the Apache Hive open source community, working on features like Hive on Spark and Hive monitoring and metrics, among others.
Pawel Szostek is a senior software engineer on Criteo’s analytics data storage team, where he works on various projects, including implementing an improved HyperLogLog algorithm. Previously, he was a researcher at CERN in Geneva.