Criteo has a main production cluster of 2,000 nodes that runs over 300,000 jobs a day, along with a backup cluster of 1,200 nodes. Criteo must keep both clusters running together while it builds a new cluster to replace the backup cluster. These clusters are in the company’s own data centers, as running in the cloud would be many times more expensive. The two existing clusters were meant to provide a redundant solution to Criteo’s storage and compute needs, including a tested failover mechanism.
Building a cluster requires testing hardware from several manufacturers and choosing the most cost-effective option. Stuart Pook offers an overview of the project, shares challenges and lessons learned, and discusses Criteo’s progress in building another cluster to survive the loss of a full data center. Criteo has now run these tests twice and can provide advice on how to get them right the first time. The tests were effective except for the RAID controller for the company’s 35,000 disks. Criteo had so many problems with the new controller that it had to replace it, and it is now working on a solution that will help the company better manage its disks.
Stuart Pook is a senior DevOps engineer at Criteo, where he is part of Criteo’s Lake team, which runs some small and two rather large Hadoop clusters. Stuart loves storage (208 PB at Criteo) and automation with Chef, because configuring more than 3,000 Hadoop nodes by hand is just too slow. Before discovering Hadoop, he developed user interfaces and databases for biotech companies. Stuart has presented at ACM CHI 2000, Devoxx 2016, NABD 2016, Hadoop Summit Tokyo 2016, Apache Big Data Europe 2016, Big Data Tech Warsaw 2017, and Apache Big Data North America 2017.
©2018, O’Reilly UK Ltd