Presented By
O’Reilly + Cloudera

Make Data Work

29 April–2 May 2019
London, UK

Running SQL-based workloads in the cloud at 20x–200x lower cost using Apache Arrow

Jacques Nadeau (Dremio)

12:05–12:45 Wednesday, 1 May 2019

Data Engineering and Architecture
Location: S11 A

Secondary topics: AI and Data technologies in the cloud

Average rating: 4.75 (4 ratings)

Who is this presentation for?

Data architects, data engineers, and data scientists

Level

Intermediate

Prerequisite knowledge

An understanding of Apache Arrow and big data infrastructure

What you'll learn

Learn how to use Apache Arrow and other open source projects to significantly lower the cost of deploying SQL-based workloads on cloud infrastructure

Description

As companies move their workloads to cloud platforms like AWS and Azure, they must consider not only performance SLAs but also the requisite costs for delivering an SLA on a specific dataset. Likewise, as companies look to take advantage of cloud object stores like Amazon S3 and Azure ADLS, they must weigh the cost advantages of these services against the functional capabilities of legacy systems like the relational database.

In the enterprise data warehouse (EDW), indexes and materialized views are powerful features that allow the system to process queries orders of magnitude more efficiently than scanning full tables for each query. The power of these approaches lies in both their efficiency and their ability to be applied to different workloads without impacting the behavior of end users. Add an index or a materialized view, and most users will never need to change their query or the resources they connect to.
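To make the transparency point concrete, here is a minimal sketch in plain Python. It is an illustration only, not the approach presented in the session and not the behavior of any particular engine: the `Engine` class, the `SALES` rows, and the method names are all hypothetical. The caller always uses the same `total_for` call; once a precomputed summary exists, the engine answers from it instead of scanning every row.

```python
from collections import defaultdict

# Hypothetical raw "table": one row per sale (illustrative data only).
SALES = [
    {"region": "EU", "amount": 10.0},
    {"region": "EU", "amount": 5.0},
    {"region": "US", "amount": 7.5},
]


class Engine:
    """Toy query engine (hypothetical) that can answer from a precomputed summary."""

    def __init__(self, rows):
        self.rows = rows
        self.summary = None    # per-region totals, once materialized
        self.rows_scanned = 0  # counts rows read by full scans

    def materialize_totals(self):
        # Precompute the aggregate once, like building a materialized view.
        totals = defaultdict(float)
        for row in self.rows:
            totals[row["region"]] += row["amount"]
        self.summary = dict(totals)

    def total_for(self, region):
        # Same call for the user either way; the engine picks the cheaper path.
        if self.summary is not None:
            return self.summary.get(region, 0.0)
        total = 0.0
        for row in self.rows:
            self.rows_scanned += 1
            if row["region"] == region:
                total += row["amount"]
        return total


engine = Engine(SALES)
before = engine.total_for("EU")   # full scan of the raw rows
engine.materialize_totals()
after = engine.total_for("EU")    # answered from the summary, no scan
```

The sketch ignores everything that makes this hard in practice, such as keeping the summary fresh when the underlying rows change and deciding automatically which summaries can answer which queries.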

In the cloud, when storing data in systems like Amazon S3 and Azure ADLS, some SQL engines are available for querying the data, but these offerings don’t provide the same rich features in terms of indexes and materialized views.

Jacques Nadeau shares a novel solution that uses Apache Arrow, Apache Parquet, and Apache Calcite to provide features similar to materialized views in relational databases. The key differences are that this approach is integrated into the separation of compute and storage available on cloud platforms and scales to any number of nodes, works with nested data structures like JSON natively, and is fully open source.

Jacques Nadeau

Dremio

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.
