Web-scale machine learning plays a central role in today’s Internet applications and intelligent systems; these problem settings have pushed the field to address issues of scale that were almost inconceivable even a decade ago. Today, a typical industrial machine-learning application may analyze trillions of training samples using a correspondingly large model with up to tens of billions of unique features. Web-scale machine learning is driving the need for scalable, distributed learning algorithms and systems that can handle big data. Unfortunately, existing open source big data systems fail to readily support the rapid increase in the magnitude and complexity of these analytic tasks, especially the challenges associated with datasets and models of massive size and dimensionality.
Jason Dai and Yiheng Wang share their experience building web-scale machine learning using Apache Spark, focusing specifically on “war stories” (e.g., in-game purchase, fraud detection, and deep learning). They outline best practices for scaling these learning algorithms and discuss trade-offs in designing learning systems for the Spark framework.
Jason (Jinquan) Dai is a senior principal engineer and CTO of big data technologies at Intel, where he is responsible for leading the global engineering teams (located in both Silicon Valley and Shanghai) on the development of advanced big data analytics (including distributed machine and deep learning), as well as collaborations with leading research labs (e.g., UC Berkeley AMPLab and RISELab). Jason is an internationally recognized expert on big data, cloud, and distributed machine learning; he is the program cochair of the O’Reilly AI Conference in Beijing, a founding committer and PMC member of Apache Spark, and the creator of BigDL, a distributed deep learning framework on Apache Spark.
Yiheng Wang is a software development engineer on the Big Data Technology team at Intel, working in the area of big data analytics. Yiheng and his colleagues are developing and optimizing distributed machine learning algorithms (e.g., neural networks and logistic regression) on Apache Spark. He also helps Intel customers build and optimize their big data analytics applications.
©2016, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.