Modern data processing applications execute many diverse operators. For example, a complete machine learning pipeline, from label generation to model training, may involve regular expressions, relational joins, and image convolutions, each of which has many known implementations of the same functionality. However, the performance of these implementations can vary dramatically with the characteristics of the input data and the hardware they run on, so it can be difficult for developers to choose among them when writing their applications. Traditional database query optimizers and offline autotuners attempt to solve this problem by automatically picking the best operator variants, but they require developers to build optimization rules and cost models, or to collect representative workloads and profile applications offline.
Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls to its tuners into their code. Cuttlefish tuners automatically pick operator implementations online, using multi-armed bandit reinforcement learning techniques to quickly learn which operator variants are best for each application. The tuners repeatedly try out operator variants during execution, balancing exploration and exploitation, observe the resulting application performance, and use those observations to influence later decisions.
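To make the choose-then-observe loop concrete, here is a minimal Python sketch of a bandit-style tuner. It is not Cuttlefish's API or its learning algorithm: the class name `EpsilonGreedyTuner`, its methods, and the epsilon-greedy strategy are all assumptions chosen for brevity, and the reward is simply the negative wall-clock time of each call.

```python
import random
import re
import time


class EpsilonGreedyTuner:
    """Hypothetical tuner (not the Cuttlefish API): picks among
    functionally equivalent operator variants with epsilon-greedy."""

    def __init__(self, variants, epsilon=0.1, seed=None):
        self.variants = list(variants)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * len(self.variants)
        self.mean_reward = [0.0] * len(self.variants)

    def choose(self):
        # Try every variant once before trusting the running means.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        if self.rng.random() < self.epsilon:
            # Explore: pick a variant uniformly at random.
            return self.rng.randrange(len(self.variants))
        # Exploit: pick the variant with the best mean reward so far.
        return max(range(len(self.variants)), key=lambda i: self.mean_reward[i])

    def observe(self, index, reward):
        # Incremental update of the running mean reward for this variant.
        self.counts[index] += 1
        self.mean_reward[index] += (reward - self.mean_reward[index]) / self.counts[index]

    def run(self, *args):
        # Choose a variant, time it, and feed the negative runtime back.
        index = self.choose()
        start = time.perf_counter()
        result = self.variants[index](*args)
        self.observe(index, -(time.perf_counter() - start))
        return result


if __name__ == "__main__":
    # Two equivalent ways to test whether a line contains "error".
    pattern = re.compile(r"error")
    tuner = EpsilonGreedyTuner(
        [lambda line: pattern.search(line) is not None,
         lambda line: "error" in line],
        seed=0,
    )
    lines = ["ok", "disk error", "all good"] * 1000
    matches = sum(tuner.run(line) for line in lines)
    print(matches, tuner.counts)
```

The application code only calls `run`, so the tuner sits behind what looks like an ordinary function call. Epsilon-greedy is used here only because it fits in a few lines; any bandit policy with the same `choose` and `observe` shape could be substituted.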
Cuttlefish tuners can incorporate contextual features about the input data when they are available, such as the dimensions of each input image to a convolution operator. They can effectively tune applications in shared-nothing distributed environments even as clusters grow in size. Finally, they can adaptively react to changes in an application’s workload. Although Cuttlefish was prototyped in Apache Spark, it can easily be added to other big data systems. To evaluate the prototype, Cuttlefish tuners were used to optimize a wide range of large-scale data processing applications involving image convolution, regular expression matching, and relational joins. Compared to the original unoptimized applications, the tuners achieved 3–6x higher convolution throughput and up to 75x higher regular expression throughput. They also outperformed Spark SQL’s default query optimizer, speeding up relational joins by up to 2.6x.
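One simple way to picture how contextual features can steer the choice is to keep separate statistics per context bucket, so that the tuner may learn a different best variant for small and large images. The Python sketch below does exactly that. It is an illustrative assumption rather than Cuttlefish's contextual method: `BucketedContextTuner`, `bucket_fn`, the 256x256 size threshold, and the reward values in the usage section are all hypothetical.

```python
import random
from collections import defaultdict


class BucketedContextTuner:
    """Hypothetical contextual tuner (not the Cuttlefish API): runs an
    independent epsilon-greedy bandit for each context bucket."""

    def __init__(self, num_variants, bucket_fn, epsilon=0.1, seed=None):
        self.num_variants = num_variants
        self.bucket_fn = bucket_fn  # maps a context to a hashable bucket key
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # bucket key -> one [count, mean_reward] pair per variant
        self.stats = defaultdict(
            lambda: [[0, 0.0] for _ in range(num_variants)]
        )

    def choose(self, context):
        arms = self.stats[self.bucket_fn(context)]
        # Try each variant once within this bucket first.
        untried = [i for i, (n, _) in enumerate(arms) if n == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.num_variants)  # explore
        return self.best(context)  # exploit

    def observe(self, context, index, reward):
        # Incremental mean update, scoped to the context's bucket.
        arm = self.stats[self.bucket_fn(context)][index]
        arm[0] += 1
        arm[1] += (reward - arm[1]) / arm[0]

    def best(self, context):
        arms = self.stats[self.bucket_fn(context)]
        return max(range(self.num_variants), key=lambda i: arms[i][1])


if __name__ == "__main__":
    # Context is (width, height); the threshold is an arbitrary assumption.
    def size_bucket(dims):
        return "small" if dims[0] * dims[1] < 256 * 256 else "large"

    # Made-up rewards (negative runtimes) for two convolution variants.
    fake_reward = {"small": [-1.0, -5.0], "large": [-9.0, -2.0]}

    tuner = BucketedContextTuner(2, size_bucket, seed=0)
    for dims in [(64, 64), (1024, 1024)] * 100:
        i = tuner.choose(dims)
        tuner.observe(dims, i, fake_reward[size_bucket(dims)][i])
    print(tuner.best((64, 64)), tuner.best((1024, 1024)))
```

Bucketing keeps the sketch dependency-free, but it does not share information across buckets. A model that predicts reward directly from the feature values would generalize better to contexts it has not seen.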
Tomer Kaftan is a second-year PhD student at the University of Washington, working with Magdalena Balazinska and Alvin Cheung. His research interests include machine learning systems, distributed systems, and query optimization. Previously, Tomer was a staff engineer in UC Berkeley’s AMPLab, working on systems for large-scale machine learning. He holds a degree in EECS from UC Berkeley. He is a recipient of an NSF Graduate Research Fellowship.
©2018, O'Reilly Media, Inc.