Sep 23–26, 2019

We run, we improve, we scale - XGBoost story in Uber

Nan Zhu (Uber), Felix Cheung (Uber)
2:05pm–2:45pm Wednesday, September 25, 2019
Location: 1A 08/10
Secondary topics: Deep dive into specific tools, platforms, or frameworks, Transportation and Logistics

Who is this presentation for?

Machine Learning Engineers, Data Scientists



Prerequisite knowledge

A beginner's understanding of tree-based machine learning models, a general idea of what XGBoost is, and brief experience with XGBoost

What you'll learn

(1) An overview of the business problems Uber is solving with XGBoost
(2) How we improved XGBoost model training to deliver business impact at greater scale
(3) What is coming for XGBoost in the near future


As Uber’s business grows rapidly in scale, the agility and scalability of our machine learning system are core prerequisites for making data-driven decisions that improve our user experience.

XGBoost fits our requirements well and plays a role across our business. It not only produces accurate models but also scales to handle billions of records and thousands of features. We have XGBoost models that improve driver safety on the road, recommend food and restaurants, and estimate the arrival time of rides, among other uses.

This talk, given by the XGBoost team at Uber and a committee member of the open source XGBoost project, offers insights into:

(1) The internals of how XGBoost scales training to hundreds or even thousands of workers while guaranteeing accuracy. This is the first time a core community member has presented the detailed internals of distributed training to a public audience.

(2) Uber’s journey with the latest version of XGBoost. We will talk about the problems we encountered with the earlier version of XGBoost and how we identified and fixed them, eventually unblocking ourselves by improving XGBoost and contributing the changes back to the community. Finally, we will summarize the lessons we learned and our future plans for XGBoost.


Nan Zhu


Nan Zhu is a software engineer at Uber. He works on optimizing Apache Spark for Uber's use cases and scaling XGBoost in Uber's machine learning platform. Nan has been a committee member of XGBoost since 2016. He started the XGBoost4J-Spark project, which integrates XGBoost and Spark, as well as the work on the fast histogram algorithm in distributed training.

Felix Cheung


Felix Cheung is an engineer at Uber and a PMC member and committer for Apache Spark. Felix started his journey in the big data space about five years ago with the then state-of-the-art MapReduce. Since then, he’s (re-)built Hadoop clusters from metal more times than he would like, created a Hadoop distro from two dozen or so projects, and juggled hundreds to thousands of cores in the cloud or in data centers. He built a few interesting apps with Apache Spark and ended up contributing to the project. In addition to building stuff, he frequently presents at conferences, meetups, and workshops. He was also a teaching assistant for the first set of edX MOOCs on Apache Spark.

Contact list

View a complete list of Strata Data Conference contacts