Building a best-in-class data lake on AWS and Azure
Who is this presentation for?
Architect, data engineer, data scientist, analyst
Prerequisite knowledge
Basic Linux and SQL skills are beneficial in order to follow the examples.
What you'll learn
Data lakes have become a key ingredient in the data architecture of most companies. In the cloud, object storage systems such as S3 and ADLS make it easier than ever to operate a data lake. However, there are still a number of key challenges when it comes to building a cloud-based data lake:
- Most data in the cloud doesn’t start in S3 or ADLS. Instead, it’s stored in a variety of data sources, ranging from relational databases like Amazon RDS and Azure SQL DB to NoSQL databases like MongoDB and Elasticsearch. Logs often start their life in a data pipeline layer such as Kafka.
- The data needs to be processed, explored, and analyzed using a variety of engines, including Spark, Impala, Athena, and Dremio. While an on-premises data lake is static, a cloud data lake enables these engines to run independently on a common storage layer, each with its own lifecycle and scale.
- S3 and ADLS are typically slower than HDFS. This introduces challenges for real-time workloads.
In this talk we describe how companies are building data lakes in the cloud using S3 and ADLS as storage layers, while leveraging multiple processing engines to address various needs including batch processing, ad-hoc data exploration, reporting and ML/AI. In addition to exploring best practices, we provide several real-world examples from different industries.
We also discuss the significance of Apache Arrow to the future of the heterogeneous data lake. Arrow is now downloaded 2.5 million times per month, up 100x in the last year, and is a foundational component in dozens of open source technologies such as Pandas, R, Spark, and Dremio. With the introduction of Arrow Flight, it will soon be possible to exchange large volumes of data between various distributed systems in the data lake, thereby unlocking additional use cases and dramatically increasing the performance of data lake workloads.
Tomer Shiran is cofounder and CEO of Dremio. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, roadmap, and new feature development. As a member of the executive team, he helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He is the author of five US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.
Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.