Public cloud usage for large-scale data processing is rapidly increasing, and running data engineering workloads in the cloud is becoming easier and more cost-effective. Compute engines have adapted to leverage cloud infrastructure, including object storage and elastic compute. For example, the Hive, Spark, and Impala compute engines can read input from and write output directly to AWS S3 storage. Moreover, these read and write paths have been optimized for fast processing, lowering the overall cost of running a job. In addition, platform-as-a-service offerings for data processing in the cloud have evolved to minimize the operational overhead of managing clusters, allowing the end user to focus instead on workloads: developing, running, and troubleshooting jobs.
Data engineering, which transforms raw data at scale into clean, structured data, is a foundational workload run prior to most analytic and operational database use cases. It’s important for end users to be able to implement data pipeline workflows that transition seamlessly from one stage of the pipeline to the next. Jennifer Wu, Paul George, Fahd Siddiqui, and Eugene Fratkin lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud. Along the way, they share AWS infrastructure best practices and explain how data engineering workloads interoperate with data analytic workloads.
Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud services and data engineering. Previously, Jennifer was a product line manager at VMware, where she worked on the vSphere and Photon system management platforms.
Fahd Siddiqui is a software engineer at Cloudera, where he’s working on cloud products, such as Cloudera Altus and Cloudera Director. Previously, Fahd worked at Bazaarvoice developing EmoDB, an open source data store built on top of Cassandra. His interests include highly scalable and distributed systems. He holds a master’s degree in computer engineering from the University of Texas at Austin.
Paul George is a software engineer at Cloudera, working on cloud products such as Cloudera Altus. Previously, Paul worked at Palantir Technologies and cofounded a company focused on building data systems for genomics. He holds a PhD in electrical and computer engineering from Cornell University.
Eugene Fratkin is a director of engineering at Cloudera, heading Cloud R&D. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.
©2017, O'Reilly Media, Inc.