Mar 15–18, 2020

Building a cloud data lake: Ingesting, processing, and analyzing big data on AWS

Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
11:50am–12:30pm Wednesday, March 18, 2020
Location: LL20A

Who is this presentation for?

Data engineers, data architects, developers

Level

Intermediate

Description

Data lake architectures have changed in recent years. While on-premises data lakes include colocated compute and storage, cloud data lakes are composed of separate compute and storage. The storage layer in AWS data lakes is S3, an infinitely scalable and cost-efficient object store. To create a production data lake on AWS, a number of services need to be assembled in order to ingest and prepare the data and to enable analysts and data scientists to consume it. Tomer Shiran and Jacques Nadeau explore the building blocks of an AWS data lake.

You can use AWS Glue, Hive Metastore, or self-describing files and directories on S3 for the data catalog and table format. Irrespective of the catalog, the data can be stored as simple files or wrapped in a table format, such as the new open source Apache Iceberg project, which enables inserts and mutations for data lakes.
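
As an illustration of what "self-describing directories" can look like, the sketch below assumes the common Hive-style partition layout, in which each directory level is named column=value so that a query engine can read partition values from the object key alone. The abstract does not name a specific layout, so this convention, the table prefix events, the file name, and both helper functions are assumptions made for the example.

```python
from urllib.parse import quote, unquote


def partition_key(table_prefix, partitions, filename):
    """Build an object key such as 'events/year=2020/month=03/part-0000.parquet'.

    Hypothetical helper: each partition column becomes one 'name=value' directory.
    """
    parts = [
        f"{quote(str(name), safe='')}={quote(str(value), safe='')}"
        for name, value in partitions.items()
    ]
    return "/".join([table_prefix.strip("/"), *parts, filename])


def parse_partitions(key):
    """Recover partition column values from the directory names in a key."""
    values = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            name, _, value = segment.partition("=")
            values[unquote(name)] = unquote(value)
    return values


key = partition_key("events", {"year": "2020", "month": "03"}, "part-0000.parquet")
print(key)
print(parse_partitions(key))
```

Because the partition values live in the path, an engine can skip whole directories that a query's filter rules out, without consulting a separate catalog.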

You can use AWS services such as AWS Glue and AWS Lake Formation to ingest the data. You can also use Apache Kafka to collect the data and load it into the data lake. In addition, a variety of ETL services are available from traditional ETL vendors, as well as many startups. These services can help ingest data from hundreds of data sources.

The data doesn’t always land in S3 in its desired end state, so you’ll need a processing layer. In many cases additional transformations are required to prepare the data for consumption by analysts and data scientists. You can use AWS EMR, which includes Apache Spark and Apache Hive, or other managed Spark services such as Databricks and Cloudera to process the data.

Analysts and data scientists need a way to run SQL queries on data lake datasets at high speed. Without this capability, you have no choice but to export the data from data lake storage into a data warehouse such as Redshift or Snowflake. However, technologies such as AWS Athena and Apache Arrow-based Dremio accelerate queries on data lake storage, making it possible to achieve high-concurrency BI in an AWS data lake.

Prerequisite knowledge

  • A basic understanding of cloud services and SQL

What you'll learn

  • Discover a blueprint of how to build a data lake on AWS
  • Understand the main categories of building blocks for an AWS data lake and popular options in each category

Tomer Shiran

Dremio

Tomer Shiran is cofounder and CEO of Dremio, the data lake engine company. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, roadmap, and new feature development and helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. He also held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of eight US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.


Jacques Nadeau

Dremio

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.
