Presented By O'Reilly and Cloudera
December 5–6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Web-scale machine learning on Apache Spark

Jason (Jinquan) Dai (Intel), Yiheng Wang (Intel)
11:15am–11:55am Thursday, December 8, 2016
Data science and advanced analytics
Location: Summit 2
Level: Intermediate
Average rating: 2.00 (2 ratings)

Prerequisite Knowledge

  • A general understanding of big data analytics and machine learning

What you'll learn

  • Understand how to do large-scale, distributed machine learning on Apache Spark for big data analytics


Web-scale machine learning plays a central role in today’s Internet applications and intelligent systems; these problem settings have pushed the field to address issues of scale that were almost inconceivable even a decade ago. Today, a typical industrial machine-learning application may analyze trillions of training samples using a correspondingly large model with up to tens of billions of unique features. Web-scale machine learning is driving the need for scalable, distributed learning algorithms and systems that can handle big data. Unfortunately, existing open source big data systems fail to readily support the rapid increase in the magnitude and complexity of these analytic tasks, especially the challenges associated with datasets and models of massive size and dimensionality.

Jason Dai and Yiheng Wang share their experience building web-scale machine learning using Apache Spark, focusing on “war stories” (e.g., in-game purchase, fraud detection, and deep learning). They outline best practices for scaling these learning algorithms and discuss trade-offs in designing learning systems for the Spark framework.

Jason (Jinquan) Dai


Jason (Jinquan) Dai is a senior principal engineer and CTO of big data technologies at Intel, where he is responsible for leading the global engineering teams (located in both Silicon Valley and Shanghai) on the development of advanced big data analytics (including distributed machine and deep learning), as well as collaborations with leading research labs (e.g., UC Berkeley AMPLab and RISELab). Jason is an internationally recognized expert on big data, cloud, and distributed machine learning; he is the program cochair of the O’Reilly AI Conference in Beijing, a founding committer and PMC member of Apache Spark, and the creator of BigDL, a distributed deep learning framework on Apache Spark.

Yiheng Wang


Yiheng Wang is a software development engineer on the Big Data Technology team at Intel, working in the area of big data analytics. Yiheng and his colleagues develop and optimize distributed machine learning algorithms (e.g., neural networks and logistic regression) on Apache Spark. He also helps Intel customers build and optimize their big data analytics applications.