Raw data is rarely suitable for business users or data scientists. First, it is typically spread across many systems and may not be accessible to those who want to consume the data. Second, it is often structured to suit the needs of the application developer, as opposed to the data consumer. Finally, it is usually not readily available in a single system that can magically respond to any analytical query at interactive speed.
To make data available, companies develop complex ETL pipelines in which data is copied many times between systems. For example, to achieve high performance, a subset of the overall dataset may be copied into a relational data warehouse and then preaggregated into aggregation tables or external cubes, or extracted into BI servers. This leads to tremendous cost and reduces the organization’s agility. It also eliminates the possibility of self-service, because it becomes impossible for a data consumer to know which system and table they should use to answer a specific question. As a result, in most enterprises, it can take months to develop a new BI dashboard or machine learning model.
Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies. When used in conjunction with a cost-based optimizer such as Apache Calcite, data reflections can help accelerate queries without requiring data engineers to manually create data copies or data consumers to interact with different materializations of the data to achieve the desired performance. In addition, data reflections provide separation between the logical world, where analysts and data scientists need to curate and transform the model of the data, and the physical world, where data must be physically optimized to enable execution engines to respond to queries in real time.
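To make the logical/physical separation concrete, here is a minimal sketch in Python of how a planner might pick a physical materialization for a query written against a logical dataset. It is not Dremio's or Apache Calcite's implementation: the `Reflection` and `Query` classes, the `covers` and `plan` functions, and the use of row count as a cost estimate are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence


@dataclass(frozen=True)
class Reflection:
    """Hypothetical physical materialization of a logical dataset."""
    name: str
    dataset: str                 # logical dataset this materialization derives from
    dimensions: FrozenSet[str]   # columns the materialized rows are grouped by
    measures: FrozenSet[str]     # additive pre-aggregated columns (e.g. sums)
    row_count: int               # crude stand-in for a real cost estimate


@dataclass(frozen=True)
class Query:
    """Hypothetical aggregate query expressed against a logical dataset."""
    dataset: str
    group_by: FrozenSet[str]
    measures: FrozenSet[str]


def covers(reflection: Reflection, query: Query) -> bool:
    """True if the query can be answered by rolling up the reflection.

    Assumes all measures are additive, so a finer-grained aggregate
    can be re-aggregated to a coarser grouping.
    """
    return (
        reflection.dataset == query.dataset
        and query.group_by <= reflection.dimensions
        and query.measures <= reflection.measures
    )


def plan(query: Query, reflections: Sequence[Reflection], raw_row_count: int) -> str:
    """Return the name of the cheapest source able to answer the query.

    Falls back to the logical dataset itself (the raw data) when no
    reflection covers the query or none is cheaper than scanning raw data.
    """
    candidates = [r for r in reflections if covers(r, query)]
    best = min(candidates, key=lambda r: r.row_count, default=None)
    if best is not None and best.row_count < raw_row_count:
        return best.name
    return query.dataset
```

In this sketch the data consumer only ever names the logical dataset (for example `"sales"`); which materialization is scanned is decided by `plan`, so materializations can be added or dropped without changing any query.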
In addition to providing an overview of data reflections and explaining their technological underpinnings, Tomer and Jacques offer a live demo of an open source implementation. The demo shows how data science workloads, such as those in Python/pandas and R, and BI workloads can be accelerated automatically, without explicitly moving or copying data, while leaving users free to curate and transform the logical model of the data.
Tomer Shiran is the CEO and cofounder of Dremio. Previously, he was vice president of product at MapR, where he was responsible for product strategy, roadmap, and new feature development, and as a member of the executive team, helped grow the company from 5 to 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. Tomer is the founder of the open source Apache Drill project. He holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion, the Israel Institute of Technology. He has authored five US patents.
Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.
©2018, O'Reilly Media, Inc.