Over the past several years, we have observed ever-increasing quantities of data being processed within public clouds. The cloud promises to address some of the limitations of conventional single, multipurpose clusters by offering hyperscale storage that is decoupled from elastic, on-demand compute and by allowing data to be shared between on-demand provisioned processing engines such as Hive, Spark, and Impala.
But to fulfill this promise, one needs to solve several technical challenges: simple resource allocation, cross-cluster metadata sharing, and a common authorization framework. Without comprehensive answers to these challenges, the problems of the single-cluster model are simply duplicated inside a public cloud environment.
During our tutorial, we’ll discuss some of the new paradigms that allow one to effectively run production-level pipelines with minimal operational overhead and remove the barriers to data discovery, metadata sharing, and access control. As part of the deep dive, we will provide a hands-on example of creating such pipelines and executing data processing and data analytics workflows.
Mala Ramakrishnan heads product initiatives for data engineering at Cloudera. She has 17+ years of experience in marketing, product management, and software development in organizations of varied sizes that deliver middleware, software security, network optimization, and mobile computing. She holds a master’s degree in computer science from Stanford University. Outside of work, she is a mom of two boys, ages 6 and 9.
Eugene Fratkin is a director of engineering at Cloudera, where he leads cloud infrastructure efforts. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems in genomics. He holds a PhD in computer science from Stanford University’s AI lab.
Mark Samson is a principal systems engineer at Cloudera, helping customers solve their big data problems using enterprise data hubs based on Hadoop. Mark has 17 years’ experience working with big data and information management software in technical sales, service delivery, and support roles.
©2018, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org