Predicting Criteo’s internet traffic load using Bayesian structural time series models
Criteo connects 1.5 billion active shoppers with the things they need and love. Its technology takes an algorithmic approach to predict which user it shows an ad to, when, and for what products. Criteo’s infrastructure evolution is driven by its traffic forecast. Its infrastructure provides capacity and connectivity to host the Criteo platform and applications. Located in six different countries across the Americas, Europe, and Asia, its footprint covers nine data centers, two high-performance computing (HPC) clusters, more than 35K physical servers, and more than 5M queries per second (QPS) at peak hours.
Because traffic forecasting is so critical, one of the principal tasks of the product data science team is to build machine learning models that forecast traffic demand across services and data centers, so that the company can make sound investment decisions when scaling its infrastructure. These models let Criteo predict accurately how many machines each service will need in the future. Capacity prediction is especially useful for allocating hardware for periods when the traffic load is very high, for example, during Black Friday, Cyber Monday, or Christmas sales in the Americas and Europe.
Hamlet Jesse Medina Ruiz explains how to forecast Criteo’s traffic load using Bayesian dynamic time series models. He details the general Bayesian framework, its advantages and limitations, and alternatives to solve the problem.
To forecast the traffic load, the company uses Bayesian state space models that predict daily traffic several months in advance. In contrast to classical econometric or time series models, the Bayesian framework lets you infer time-varying components of the series, such as local trends and local seasonalities, capture special events and holidays in a hierarchical way, or induce sparsity in the model. The Bayesian treatment also lets you include domain knowledge flexibly in the form of prior distributions. This modeling approach has proven very valuable for Criteo when there isn’t enough data available to train its models. Over the last two years, its models have predicted the extreme peak periods six months in advance with an error lower than 6%.
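To make the idea of a state space forecast with uncertainty more concrete, here is a minimal sketch of a local linear trend model filtered with a Kalman filter in NumPy. It is not Criteo's model: the function name `local_trend_forecast`, the variance values, and the synthetic series are all assumptions chosen for illustration. Seasonality, holidays, and sparsity are left out, and the variances are fixed by hand, whereas a full Bayesian treatment would place priors on them and infer them from the data.

```python
import numpy as np

def local_trend_forecast(y, horizon, obs_var=1.0, level_var=0.1, slope_var=0.01):
    """Filter y with a local linear trend model, then forecast `horizon` steps.

    State is [level, slope]. All variances are illustrative assumptions.
    Returns the predictive mean and standard deviation for each future step.
    """
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # level += slope; slope persists
    H = np.array([[1.0, 0.0]])               # only the level is observed
    Q = np.diag([level_var, slope_var])      # how fast level and slope may drift

    # Vague Gaussian prior on the initial state
    m = np.array([y[0], 0.0])
    P = np.eye(2) * 1e4

    for obs in y:
        # Predict step: propagate the state one day ahead
        m = F @ m
        P = F @ P @ F.T + Q
        # Update step: condition on the new observation
        S = (H @ P @ H.T).item() + obs_var   # predictive variance of obs
        K = (P @ H.T) / S                    # Kalman gain, shape (2, 1)
        m = m + K[:, 0] * (obs - (H @ m).item())
        P = P - K @ H @ P

    means, stds = [], []
    for _ in range(horizon):
        # No new data: uncertainty accumulates at every step
        m = F @ m
        P = F @ P @ F.T + Q
        means.append((H @ m).item())
        stds.append(np.sqrt((H @ P @ H.T).item() + obs_var))
    return np.array(means), np.array(stds)

# Synthetic daily series (not real traffic data): linear growth plus noise
rng = np.random.default_rng(0)
t = np.arange(200)
y = 100.0 + 2.0 * t + rng.normal(0.0, 1.0, size=t.size)

means, stds = local_trend_forecast(y, horizon=30)
lower = means - 1.96 * stds   # approximate 95% predictive interval
upper = means + 1.96 * stds
```

The predictive interval widens as the horizon grows, which is the kind of uncertainty a capacity planner would want to see alongside the point forecast.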
- A basic understanding of machine learning and time series concepts
What you'll learn
- Learn how to analyze time series using Bayesian modeling, in particular how to make a good forecast by including uncertainty in your estimates
Hamlet Jesse Medina Ruiz
Hamlet Jesse Medina Ruiz is a senior data scientist at Criteo. Previously, he was a control system engineer for Petróleos de Venezuela. Hamlet has placed highly in multiple data science competitions, including 4th place in a challenge on predicting return volatility on the New York Stock Exchange hosted by Collège de France and CFM in 2018 and 25th place in a challenge on predicting stock returns hosted by G-Research in 2018. Hamlet holds two master’s degrees in mathematics and machine learning from Pierre and Marie Curie University and a PhD in applied mathematics from Paris-Sud University in France, where he focused on statistical signal processing and machine learning.