Presented By O’Reilly and Cloudera

San Jose • London • New York

Make Data Work

March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Data reflections: Making data fast and easy to use without making copies

Tomer Shiran (Dremio), Jacques Nadeau (Dremio)

2:40pm–3:20pm Thursday, March 8, 2018

Big data and data science in the cloud, Data engineering and architecture
Location: 230 C

Who is this presentation for?

Data engineers and data scientists

Prerequisite knowledge

A basic understanding of database, data warehouse, and data lake concepts

What you'll learn

Explore data reflections, a new approach to making data available that dramatically reduces the need for data copies

Description

Raw data is rarely suitable for business users or data scientists. First, it is typically spread across many systems and may not be accessible to those who want to consume the data. Second, it is often structured to suit the needs of the application developer, as opposed to the data consumer. Finally, it is usually not readily available in a single system that can magically respond to any analytical query at interactive speed.

In order to make data available, companies develop complex ETL pipelines in which data is copied many times between systems. For example, in order to achieve high performance, a subset of the overall dataset may be copied into a relational data warehouse, and then preaggregated into aggregation tables or external cubes or extracted into BI servers. This leads to tremendous cost and impacts the organization’s agility. It also eliminates the possibility of self-service, because it becomes impossible for a data consumer to know which system and table they should utilize to answer a specific question. As a result, in most enterprises, it can take months to develop a new BI dashboard or machine learning model.

Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies. When used in conjunction with a cost-based optimizer such as Apache Calcite, data reflections can help accelerate queries without the need for data engineers to manually create data copies or for data consumers to interact with different materializations of data to achieve the desired performance. In addition, data reflections provide separation between the logical world, where analysts and data scientists need to curate and transform the model of the data, and the physical world, where data must be physically optimized in order to enable execution engines to respond to queries in real time.
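
To make the idea of a cost-based optimizer picking among materializations more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Dremio's implementation and not the Apache Calcite API: the Reflection and Query classes, the covers and choose_plan functions, the row-count cost model, and the sample data are all hypothetical. It also assumes every measure is a sum, so a finer-grained aggregate can be rolled up to answer a coarser query.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Reflection:
    """Hypothetical physical materialization of a logical dataset."""
    name: str
    dataset: str
    dimensions: FrozenSet[str]  # columns the stored rows are grouped by
    measures: FrozenSet[str]    # summed columns (assumed re-aggregatable)
    row_count: int              # crude stand-in for scan cost


@dataclass(frozen=True)
class Query:
    """Hypothetical logical query: sum of measures grouped by columns."""
    dataset: str
    group_by: FrozenSet[str]
    measures: FrozenSet[str]


def covers(reflection: Reflection, query: Query) -> bool:
    """A reflection can answer a query if it holds every needed column."""
    return (
        reflection.dataset == query.dataset
        and query.group_by <= reflection.dimensions
        and query.measures <= reflection.measures
    )


def choose_plan(query: Query, reflections: List[Reflection],
                source_rows: int) -> str:
    """Pick the cheapest covering reflection, else fall back to the source."""
    candidates = [r for r in reflections if covers(r, query)]
    best = min(candidates, key=lambda r: r.row_count, default=None)
    if best is None or best.row_count >= source_rows:
        return f"source:{query.dataset}"
    return f"reflection:{best.name}"


# Made-up example data: one logical dataset with two aggregate reflections.
reflections = [
    Reflection("sales_by_region_product_day", "sales",
               frozenset({"region", "product", "day"}),
               frozenset({"amount"}), 500_000),
    Reflection("sales_by_region_day", "sales",
               frozenset({"region", "day"}),
               frozenset({"amount"}), 10_000),
]

by_region = Query("sales", frozenset({"region"}), frozenset({"amount"}))
by_customer = Query("sales", frozenset({"customer"}), frozenset({"amount"}))

plan_region = choose_plan(by_region, reflections, source_rows=50_000_000)
plan_customer = choose_plan(by_customer, reflections, source_rows=50_000_000)
print(plan_region)
print(plan_customer)
```

In this sketch the data consumer only ever names the logical dataset ("sales"). The choice between the two reflections and the raw source happens inside choose_plan, which mirrors the logical/physical separation described above. A real optimizer would use a far richer cost model and query-rewriting rules than a single row count.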

In addition to providing an overview of data reflections and explaining the technological underpinnings, Tomer and Jacques offer a live demo of an open source implementation that shows how data science workloads, such as those in Python/pandas and R, and BI workloads can be automatically accelerated without having to explicitly move or copy data and while affording users the freedom to curate and transform the logical model of the data.

Tomer Shiran

Dremio

Tomer Shiran is the CEO and cofounder of Dremio. Previously, he was vice president of product at MapR, where he was responsible for product strategy, roadmap, and new feature development and, as a member of the executive team, helped grow the company from 5 to 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. Tomer is the founder of the open source Apache Drill project. He holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion, the Israel Institute of Technology. He has authored five US patents.

Jacques Nadeau

Dremio

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.
