Presented By O’Reilly and Cloudera

September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

From flat files to deconstructed database: The evolution and future of the big data ecosystem

Julien Le Dem (WeWork)

5:25pm–6:05pm Wednesday, 09/12/2018

Data engineering and architecture
Location: 1A 10

Level: Intermediate

Who is this presentation for?

Data engineers and data architects

Prerequisite knowledge

Knowledge of how to use a database

What you'll learn

Learn the purpose of the various projects in the ecosystem (e.g., Parquet, Arrow, and Calcite) and how they fit together

Description

Over the past 10 years, big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. With Hadoop, we started from a system that was good at looking for a needle in a haystack using snowplows. We had a lot of horsepower and scalability but lacked the subtlety and efficiency of relational databases. But since Hadoop provided the ultimate flexibility compared to the more constrained and rigid RDBMSs, we didn’t mind and plowed through.

However, machine learning, recommendations, matching, abuse detection, and data-driven products in general require a more flexible infrastructure. Over time, we started applying everything that had been known to the database world for decades to this new environment. We’d been told loudly enough that Hadoop was a huge step backward, and it was true to some degree. The key difference was the flexibility of the Hadoop stack. A relational database contains many highly integrated components, and decoupling them took some time.

Today, we see the emergence of key components, such as optimizers, columnar storage, in-memory representation, table abstraction, and batch and streaming execution, as standards that provide the glue between the options available to process, analyze, and learn from our data. We’ve been deconstructing the tightly integrated relational database into flexible, reusable open source components. Storage, compute, multitenancy, and batch or streaming execution are all decoupled and can be modified independently to fit every use case.

Julien Le Dem discusses the key open source components of the big data ecosystem—including Apache Calcite, Parquet, Arrow, Avro, and Kafka as well as batch and streaming systems—and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem. (Parquet is the columnar data layout to optimize data at rest for querying. Arrow is the in-memory representation for maximum throughput execution and overhead-free data exchange. Calcite is the optimizer to make the most of our infrastructure capabilities.) Julien also explores the emerging components that are still missing or haven’t become standard yet to fully materialize the transformation to an extremely flexible database that lets you innovate with your data.
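As a rough illustration of why a columnar layout helps querying, here is a minimal sketch in plain Python. It is not Parquet's or Arrow's actual format or API; the sample records, field names, and helper functions are all hypothetical and exist only to show the idea of storing each column contiguously so that a query reads just the columns it needs.

```python
from array import array

# Hypothetical sample data in row-oriented form: one dict per record.
rows = [
    {"user_id": 1, "country": "US", "amount": 12.5},
    {"user_id": 2, "country": "FR", "amount": 7.0},
    {"user_id": 3, "country": "US", "amount": 30.0},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous sequence per column."""
    return {
        # Typed arrays keep numeric values packed together in memory.
        "user_id": array("q", (r["user_id"] for r in records)),
        "country": [r["country"] for r in records],
        "amount": array("d", (r["amount"] for r in records)),
    }


def total_amount_row_oriented(records):
    # Every whole record is visited just to pull out a single field.
    return sum(r["amount"] for r in records)


def total_amount_columnar(cols):
    # Only the "amount" column is scanned; other columns are never touched.
    return sum(cols["amount"])


columns = to_columns(rows)
print(total_amount_row_oriented(rows))
print(total_amount_columnar(columns))
```

Both functions return the same total; the difference is in how much data each has to walk through. Real columnar formats add encoding, compression, and metadata on top of this basic idea.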

Julien Le Dem

WeWork

Julien Le Dem is a principal engineer at WeWork. He’s also the coauthor of Apache Parquet and the PMC chair of the project, and he’s a committer and PMC member on Apache Pig, Apache Arrow, and a few other projects. Previously, he was an architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Comments

Julien Le Dem | PRINCIPAL ENGINEER

09/13/2018 4:55am EDT

Slides here: https://www.slideshare.net/julienledem/strata-ny-2018-the-deconstructed-database
