Building a best-in-class data lake on AWS and Azure

Tomer Shiran (Dremio), Jacques Nadeau (Dremio)

2:05pm–2:45pm Wednesday, September 25, 2019

Location: 1E 09

Business Analytics and Visualization, Data Engineering and Architecture

Secondary topics: BI, Interactive Analytics and Visualization, Cloud Platforms and SaaS, Data Management and Storage

Who is this presentation for?

Architects, data engineers, data scientists, and analysts

Level

Intermediate

Description

Data lakes have become a key ingredient in the data architecture of most companies. In the cloud, object storage systems such as S3 and ADLS make it easier than ever to operate a data lake. However, a number of key challenges remain when it comes to building a cloud-based data lake. Most data in the cloud doesn’t start in S3 or ADLS. Instead, it’s stored in a variety of data sources, ranging from relational databases like Amazon RDS and Azure SQL DB to NoSQL databases like MongoDB and Elasticsearch. Logs often start their life in a data pipeline layer such as Kafka. The data also needs to be processed, explored, and analyzed using a variety of engines, including Spark, Impala, Athena, and Dremio. While an on-premises data lake is static, a cloud data lake enables these engines to run independently on a common storage layer, each with its own lifecycle and scale. In addition, S3 and ADLS are typically slower than the Hadoop Distributed File System (HDFS), which introduces challenges for real-time workloads.

Tomer Shiran and Jacques Nadeau explain how you can build data lakes in the cloud using S3 and ADLS as storage layers while leveraging multiple processing engines to address needs including batch processing, ad hoc data exploration, reporting, and ML and AI. In addition to exploring best practices, they provide several real-world examples from different industries.

They also dive into the significance of Apache Arrow to the future of the heterogeneous data lake. Arrow is downloaded 2.5 million times per month, up 100x in the last year, and is a foundational component in dozens of open source technologies such as pandas, R, Spark, and Dremio. With the introduction of Arrow Flight, it’ll soon be possible to exchange big data between distributed systems in the data lake, unlocking use cases and dramatically increasing the performance of data lake workloads.

Prerequisite knowledge

A basic understanding of Linux and SQL

What you'll learn

Learn how to build a cloud data lake on AWS and Azure

Tomer Shiran

Dremio

Tomer Shiran is cofounder and CEO of Dremio, the data lake engine company. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, road map, and new feature development and helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. He also held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of eight US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.

Jacques Nadeau

Dremio

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.