Apache Spark ML has become hugely popular for data analytics in the big data ecosystem and has attracted a great number of developers across the globe who actively contribute to the project. It has evolved from a standard ML library into a powerful Spark component that supports complex workflows and production requirements.
Intel has been deeply involved in Spark from its earliest days, working with the community on feature development, bug fixing, and performance optimization. Vincent Xie and Peng Meng share what Intel has been working on with Spark ML and introduce the methodology behind Intel’s work on Spark ML optimization: profile, analyze, and optimize.
At the profiling stage, Intel leverages HiBench to benchmark the target Spark ML algorithms on datasets at different scales. With HiBench ML workloads, it’s easier to expose the bottlenecks of the algorithm under test. Intel uses a set of tools, including Intel VTune, Intel PAT, and VisualVM, to collect and analyze the performance data of the HiBench ML workloads across different metrics (CPU, memory, disk, network I/O, etc.). With such detailed performance data, it’s usually possible to spot opportunities to optimize the ML algorithms, either through software engineering or by leveraging hardware support. With this methodology, Intel has sped up the training process for logistic regression by ~1.7x, random forest and GBT by ~1.4x, and SVM by ~1.4x, and Intel saw a more than 60x performance boost for ALS prediction. Vincent and Peng discuss these achievements and illustrate Intel’s three-stage working model for Spark optimization.
Vincent Xie (谢巍盛) is the chief data scientist and a senior director at Orange Financial, where, as head of the AI Lab, he built the big data and artificial intelligence team from scratch, established the company’s big data and AI infrastructure, and landed numerous businesses on top of it; this thorough data-driven transformation strategy boosted the company’s total revenue many times over. Previously, he worked at Intel for about eight years, mainly on machine learning- and big data-related open source technologies and products.
Peng Meng is a senior software engineer on the big data and cloud team at Intel, where he focuses on Spark and MLlib optimization. Peng is interested in machine learning algorithm optimization and large-scale data processing. He holds a PhD from the University of Science and Technology of China.
©2018, O'Reilly Media, Inc.