Building a cloud data lake: Ingesting, processing, and analyzing big data on AWS
Who is this presentation for?
Data engineers, data architects, and developers
Data lake architectures have changed in recent years. While on-premises data lakes include colocated compute and storage, cloud data lakes are composed of separate compute and storage. The storage layer in AWS data lakes is S3, an infinitely scalable and cost-efficient object store. To create a production data lake on AWS, a number of services need to be assembled in order to ingest and prepare the data and to enable analysts and data scientists to consume it. Tomer Shiran and Jacques Nadeau explore the building blocks of an AWS data lake.
You can use AWS Glue, a Hive Metastore, or self-describing files and directories on S3 for the data catalog and table format. Irrespective of the catalog, the data can be stored as simple files or wrapped in a table format, such as the new open source Apache Iceberg project, which enables inserts and mutations for data lakes.
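To make the catalog idea concrete, here is a minimal sketch of what a catalog entry for an S3-backed table might contain, regardless of whether it lives in AWS Glue or a Hive Metastore. The field names, bucket, and table are illustrative, not a real Glue API payload:

```python
# Hypothetical sketch: a minimal catalog entry mapping a table name to an
# S3 location, file format, schema, and partition keys. Real Glue/Hive
# entries carry more metadata; the names below are illustrative only.
catalog_entry = {
    "database": "sales",
    "table": "orders",
    "location": "s3://my-data-lake/sales/orders/",  # hypothetical bucket
    "format": "parquet",
    "partition_keys": ["order_date"],
    "columns": [
        {"name": "order_id", "type": "bigint"},
        {"name": "customer_id", "type": "bigint"},
        {"name": "amount", "type": "double"},
    ],
}

def partition_path(entry, **partition_values):
    """Build the Hive-style key=value partition path under the table root."""
    parts = "/".join(f"{k}={partition_values[k]}" for k in entry["partition_keys"])
    return entry["location"] + parts + "/"

print(partition_path(catalog_entry, order_date="2019-06-01"))
# s3://my-data-lake/sales/orders/order_date=2019-06-01/
```

A query engine resolves the table name through the catalog to this location and schema; a table format like Iceberg adds manifest files on top of this layout to track snapshots and support mutations.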
You can use AWS services such as AWS Glue and AWS Lake Formation to ingest the data. You can also use Apache Kafka to collect the data and load it into the data lake. In addition, a variety of ETL services are available from traditional ETL vendors, as well as many startups. These services can help ingest data from hundreds of data sources.
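The heart of any streaming ingestion path is deciding where each micro-batch of events lands in S3. The sketch below is a toy stand-in for that logic, not a real Kafka consumer; in production it would sit behind a Kafka consumer loop or a managed service, and the topic and key layout here are hypothetical:

```python
# Illustrative only: derive a partitioned S3 object key for a micro-batch
# of stream events. A real loader would wrap this in a Kafka consumer and
# an S3 put; the "raw/" prefix and naming scheme are assumptions.
from datetime import datetime, timezone

def object_key(topic, event_time, batch_id):
    """Partition landed objects by topic and event hour, Hive-style."""
    dt = datetime.fromtimestamp(event_time, tz=timezone.utc)
    return (f"raw/{topic}/dt={dt:%Y-%m-%d}/hour={dt:%H}/"
            f"batch-{batch_id:05d}.json")

events = [{"user": "a", "ts": 1560000000}, {"user": "b", "ts": 1560003600}]
key = object_key("clickstream", events[0]["ts"], batch_id=1)
print(key)
# raw/clickstream/dt=2019-06-08/hour=13/batch-00001.json
```

Partitioning landed data by event time this way keeps downstream scans cheap, because engines can prune whole date/hour prefixes.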
The data doesn’t always land in S3 in the desired end state, so you’ll need a processing layer. In many cases additional transformations are required to prepare the data for consumption by analysts and data scientists. You can use AWS EMR, which includes Apache Spark and Apache Hive, or other managed Spark services, such as Databricks and Cloudera, to process the data.
Analysts and data scientists need a way to run SQL queries on data lake datasets at high speed. Without this capability, you have no choice but to export the data from data lake storage into a data warehouse such as Redshift or Snowflake. However, technologies such as AWS Athena and Apache Arrow-based Dremio accelerate queries on data lake storage, making it possible to achieve high-concurrency BI in an AWS data lake.
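As a sketch of querying in place, the dict below has the shape that boto3's Athena `start_query_execution` call accepts; the SQL, database, and result bucket are hypothetical. With boto3 installed and credentials configured, you would pass it as keyword arguments to `client("athena").start_query_execution(**params)`:

```python
# Hypothetical Athena query submission parameters. The table, database,
# and output bucket are illustrative; only the parameter shape mirrors
# boto3's start_query_execution.
params = {
    "QueryString": (
        "SELECT order_date, count(*) AS orders, sum(amount) AS revenue "
        "FROM sales.orders "
        "GROUP BY order_date "
        "ORDER BY order_date"
    ),
    "QueryExecutionContext": {"Database": "sales"},
    "ResultConfiguration": {
        "OutputLocation": "s3://my-athena-results/"  # hypothetical bucket
    },
}
print(params["QueryString"])
```

The key point is that the query runs directly against the files in S3, using the catalog's schema, with no load step into a separate warehouse.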
Prerequisite knowledge
- A basic understanding of cloud services and SQL
What you'll learn
- Discover a blueprint of how to build a data lake on AWS
- Understand the main categories of building blocks for an AWS data lake and popular options in each category
Tomer Shiran is cofounder and CEO of Dremio, the data lake engine company. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, road map, and new feature development and helped grow the company from 5 employees to over 300 employees and 700 enterprise customers; and he held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of eight US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.
Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.