In recent years, deep learning has significantly improved several AI applications, including recommendation engines, voice/speech recognition, and image/video recognition. BigDL, an open source distributed deep learning framework developed by Intel and built for big data platforms using Apache Spark, brings native support of deep learning functionalities to Spark.
In order to scale beyond a cluster, gradient aggregation has to span across servers on a network. In regular reduce or aggregate functions in Apache Spark (and the original MapReduce), all partitions have to send their computed local gradients to the driver machine, and that machine spends linear time on the number of partitions (due to the CPU cost in merging partial results and the network bandwidth limit). This process becomes a bottleneck when there are many partitions. TreeReduce and treeAggregate in Apache Spark are new aggregation communication patterns based on multilevel aggregation trees. They can reduce the load the driver has to deal with but still don’t allow for near-linear scaling. Besides, as the number of partitions grows, there is additional overhead from having a centralized driver that schedules all the tasks in the system. Especially with a large number of partitions, this overhead cannot be ignored and results in decreased throughput and increased training time.
Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that along with coarse-grained scheduling can provide significant speedups for deep learning models like Inception and VGG. To allow for near-linear scaling, they use the new AllReduce operation, a part of the parameter manager in BigDL. The AllReduce operation works in two phases: First, gradients from all the partitions within a single worker are aggregated locally. The gradient is then sliced into chunks, and chunks are exchanged between all the nodes in the cluster. Now each node has its own partition of aggregated gradients and computes its own partition of weights, thus ensuring scalability. In the end, each node ends up with the same updated model weights, but the driver overhead is eliminated. In this way, even as the number of partitions grows, data transferred on each worker is still proportional to the size of the gradients (or weights), and the driver is not involved in the communication.
To address the scheduling overhead, they use Drizzle, a recently proposed scheduling framework for Apache Spark. Currently, Spark uses a BSP computation model and notifies the scheduler at the end of each task, which adds overheads and results in decreased throughput and increased latency. Drizzle uses group scheduling, where multiple iterations (or a group) of computation are scheduled at once, helping decouple the granularity of task execution from scheduling and amortize the costs of task serialization and launch.
Shivaram and Sergey share results from using the AllReduce operation and Drizzle on a number of common deep learning models, including VGG, GoogLeNet, and Inception, using benchmarks run on Amazon EC2 and Google DataProc.
Sergey Ermolin is a software solutions architect for deep learning, Spark analytics, and big data technologies at Intel. A Silicon Valley veteran with a passion for machine learning and artificial intelligence, Sergey has been interested in neural networks since 1996, when he used them to predict aging behavior of quartz crystals and cesium atomic clocks made by Hewlett-Packard. Sergey holds an MSEE and a certificate in mining massive datasets from Stanford and BS degrees in both physics and mechanical engineering from California State University, Sacramento.
Shivaram Venkataraman is a postdoctoral researcher at Microsoft Research. Starting in Fall 2018, he will be an assistant professor in computer science at the University of Wisconsin-Madison. Shivaram holds a PhD from the University of California, Berkeley, where he was advised by Mike Franklin and Ion Stoica. His work spans distributed systems, operating systems, and machine learning, and his recent research has looked at designing systems and algorithms for large-scale data analysis.
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org