Organizations deploying Hadoop are storing, organizing, processing, and analyzing more data than ever before, and the number of analytic applications that integrate natively with Hadoop has grown rapidly in recent years. Consequently, hundreds or even thousands of business and data analysts often leverage shared Hadoop clusters to explore, wrangle, visualize, and operationalize data for diverse use cases. As cluster utilization increases, however, maintaining the performance of both exploratory and production workloads becomes critical.
Sean Kandel and Kaushal Gandhi share best practices for building and deploying Hadoop applications that support large-scale data exploration and analysis across an organization. They demonstrate techniques for amortizing exploratory workloads across clients, allowing deployments to scale while limiting performance degradation. Along the way, Sean and Kaushal explain how to flexibly compile queries across multiple runtime engines to optimize both analytic and transformation queries, and they compare benchmarks across multiple architectures, demonstrating the effects of these techniques in data lake initiatives.
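As a rough illustration of the multi-engine compilation idea (this sketch is not from the session itself; the Query class, compile_for_engine function, and the 512 MB threshold are all hypothetical), a query planner might route small exploratory samples to a fast local engine while compiling large production jobs down to a distributed engine on the cluster:

    # Hypothetical sketch: routing a query to one of several runtime
    # engines based on estimated input size. Names and thresholds are
    # illustrative assumptions, not Trifacta's actual implementation.
    from dataclasses import dataclass

    @dataclass
    class Query:
        sql: str
        estimated_input_bytes: int

    SMALL_DATA_LIMIT = 512 * 1024 * 1024  # 512 MB: fits in memory

    def compile_for_engine(query: Query) -> str:
        """Pick a runtime engine for the query; return an engine-tagged plan."""
        if query.estimated_input_bytes <= SMALL_DATA_LIMIT:
            # Small exploratory samples run on a fast local engine,
            # keeping interactive latency low and off the shared cluster.
            return f"[local-engine] {query.sql}"
        # Large production runs compile to a distributed engine
        # (e.g., Spark or MapReduce on the Hadoop cluster).
        return f"[spark] {query.sql}"

    if __name__ == "__main__":
        explore = Query("SELECT * FROM sales LIMIT 1000", 20 * 1024 * 1024)
        publish = Query("SELECT region, SUM(amount) FROM sales GROUP BY region",
                        5 * 1024 ** 3)
        print(compile_for_engine(explore))  # routed to the local engine
        print(compile_for_engine(publish))  # routed to the distributed engine

The design choice being illustrated is that exploratory and production workloads have very different latency and scale requirements, so compiling the same logical query to different runtimes keeps interactive work responsive without overloading the cluster.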
Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.
Kaushal Gandhi is a senior software engineer at Trifacta, where he built Trifacta's fast interactive transformation engine, Photon, along with various data transformation features that improve the product's utility and usability. Previously, Kaushal built prediction and estimation software at NVIDIA. He holds an MS in computer science and engineering.