Improve the speed of ML innovations at LinkedIn

Zhe Zhang (LinkedIn)

13:45–14:25 Thursday, 17 October 2019

Location: King's Suite - Balmoral

Machine learning engineering differs fundamentally from traditional software engineering in how uncertain and unpredictable an idea remains until it is fully verified in production. A deciding factor in the success of ML-based products (e.g., recommendation, ranking) is therefore the speed of the trial-and-error loop.

As a result, across the LinkedIn data organization, multiple teams focus on attacking different aspects of the “speed of the loop,” including tooling, data movement, and computation. Zhe Zhang provides an overview of these efforts while focusing on the performance and efficiency of ML computation.

For a typical LinkedIn ML engineer, the cycle of experimentation begins with exploring the data to understand the schema and basic statistical patterns. The managed datasets (tracked from the LinkedIn site and published through Kafka) are stored on Hadoop. For this kind of lightweight, ad hoc computation, the team’s focus is to provide a secure and easy-to-use interactive environment. You’ll get an introduction to the team’s work on a hosted notebooks solution based on JupyterHub.

The next step is to massage the managed datasets into ML features. This step requires a wide range of data processing; Spark, as a general-purpose big data framework, is ideal for this job. At LinkedIn’s scale, the team has discovered that joining large datasets is a severe challenge, especially with unevenly distributed keys (skew). Zhe dives into LinkedIn’s ongoing work on optimizing Spark SQL.
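To make the skew problem concrete, here is a minimal pure-Python sketch of key salting, one common way to spread a hot join key across partitions. It is not LinkedIn's Spark SQL optimization, which the abstract does not describe. The data, the toy partitioner, the round-robin salt, and every name here (`partition_of`, `salt_left`, `explode_right`, `NUM_PARTITIONS`, `SALT_BUCKETS`) are assumptions made for illustration.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # assumed number of shuffle partitions
SALT_BUCKETS = 4    # assumed number of salts (must not exceed NUM_PARTITIONS)


def partition_of(key, salt=0):
    """Toy deterministic partitioner: hash the key, then offset by the salt."""
    return (zlib.crc32(key.encode("utf-8")) + salt) % NUM_PARTITIONS


def hash_join(left, right):
    """Inner join of two lists of (key, value) pairs using a hash table on right."""
    table = {}
    for key, value in right:
        table.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in table.get(key, [])]


def salt_left(rows):
    """Attach a round-robin salt to each row of the large, skewed side."""
    return [((key, i % SALT_BUCKETS), value) for i, (key, value) in enumerate(rows)]


def explode_right(rows):
    """Replicate each row of the small side once per salt value."""
    return [((key, salt), value) for key, value in rows for salt in range(SALT_BUCKETS)]


def max_load(partitions):
    """Number of rows landing in the busiest partition."""
    return max(Counter(partitions).values())


# Synthetic data: one hot key accounts for 90% of the large side.
events = [("hot_member", i) for i in range(9000)]
events += [(f"member_{i}", i) for i in range(1000)]
profiles = [("hot_member", "profile_hot")]
profiles += [(f"member_{i}", f"profile_{i}") for i in range(1000)]

# Plain join: every row with the hot key is routed to a single partition.
plain = hash_join(events, profiles)
plain_load = max_load(partition_of(key) for key, _ in events)

# Salted join: join on (key, salt), then drop the salt from the result.
salted_events = salt_left(events)
salted = hash_join(salted_events, explode_right(profiles))
unsalted = [(key, lv, rv) for (key, _salt), lv, rv in salted]
salted_load = max_load(partition_of(key, salt) for (key, salt), _ in salted_events)

print("rows joined:", len(plain), "| same result:", sorted(plain) == sorted(unsalted))
print("busiest partition, plain vs. salted:", plain_load, "vs.", salted_load)
```

The trade-off in this sketch is that the small side grows by a factor of `SALT_BUCKETS`, in exchange for a busiest partition that holds roughly that factor fewer rows for the hot key.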

After the feature data is prepared, a variety of algorithms are used to train the models. The most significant types of training are linear models based on GLMix (Spark), tree models based on XGBoost (Spark), and neural network models based on TensorFlow. Zhe explores LinkedIn’s recent optimization work on TensorFlow and the challenges of shuffling data between compute stages in GLMix and XGBoost.

After an ML model has been generated, the most time-consuming part is deploying it to the LinkedIn site (online) and conducting A/B experiments to analyze model performance. You’ll also briefly explore LinkedIn’s model deployment solution and its A/B testing framework, XLNT.

What you'll learn

Discover how LinkedIn handles the speed of the trial-and-error loop

Zhe Zhang

Zhe Zhang is a senior manager of core big data infrastructure at LinkedIn, where he leads an excellent engineering team that provides big data services (Hadoop Distributed File System [HDFS], YARN, Spark, TensorFlow, and beyond) to power LinkedIn’s business intelligence and relevance applications. Zhe is an Apache Hadoop PMC member; he led the design and development of HDFS Erasure Coding (HDFS-EC).