I am a senior software engineer at Criteo and part of the team that manages a 26808 core cluster with 39 PB of raw storage and 11 TB RAM that runs up to 90000 jobs per day on a Hadoop cluster with Kerberos security. We are currently installing a new 14000 core cluster with 48 PB of raw storage.
Criteo has a 26808 core cluster with 39 PB of raw storage and 11 TB RAM that runs up to 90000 jobs per day. Criteo decided to create a cluster with the capacity to temporarily replace, in both storage and compute, the existing cluster while it is upgraded. We will then run both clusters in parallel as part of our disaster recovery program and for extra capacity. The new hardware was chosen after manufacturers were asked to make competing bids to supply the new machines. In this talk we will describe:
the criteria we established for the 800 new computers and our comparison tests between suppliers' hardware and prices
our non-blocking level 3 network infrastructure with 10 Gb endpoints scalable to 5000 machines
the installation and configuration, using Chef, of a new cluster with a new version of Hadoop on new hardware
the problems encountered in moving our jobs and data from the old CDH4 cluster to the new CDH5 cluster
the plans for running the two clusters in parallel and for failing over between them in case of disaster
the operational issues we face running two large clusters and three smaller clusters