Criteo has a main production cluster of 2000 nodes that runs over 300000 jobs/day and a backup cluster of 1200 nodes. Our job is to keep these clusters running together while we build a new cluster to replace the backup cluster. These clusters are in our own data centres, as running in the cloud would be many times more expensive.
These two clusters were meant to provide a redundant solution to Criteo’s storage and compute needs, including a tested failover mechanism. We will explain our project, what went wrong, and our progress in building yet another cluster to finally create a computing system that will survive the loss of an entire data centre.
This presentation will also describe what we have learnt from building and running Hadoop clusters.
Building a cluster requires testing hardware from several manufacturers and choosing the most cost-effective option. We have now done these tests twice and can provide advice on how to do it right the first time.
Our tests were effective except for the RAID controller for our 35000 disks. We had so many problems with our new controller that we had to replace it, and we are now working with the manufacturers on a solution that will help us manage our disks better.
Hadoop, especially at this scale, does not run itself. What operational skills and tools are required to keep the clusters healthy, the data safe, and the jobs running 24 hours a day, every day?
Stuart loves storage (208 PB at Criteo) and is part of Criteo’s Lake team that runs some small and two rather large Hadoop clusters. He also loves automation with Chef because configuring more than 3000 Hadoop nodes by hand is just too slow. Before discovering Hadoop he developed user interfaces and databases for biotech companies.
Stuart has presented at ACM CHI 2000, Devoxx 2016, NABD 2016, Hadoop Summit Tokyo 2016, Apache Big Data Europe 2016, Big Data Tech Warsaw 2017, and Apache Big Data North America 2017.
©2018, O’Reilly UK Ltd