Organizations are increasingly adopting modern data stores (Hadoop, NoSQL, etc.) and public cloud infrastructure for new applications. This introduces many challenges for data consumers such as business analysts and data scientists. For example, standard BI tools don’t work with NoSQL databases (because those databases don’t support SQL), and the tools’ performance on large Hadoop and cloud storage datasets is often prohibitively slow.
As a result, IT and data engineering often resort to ETLing the data from these systems into a relational data warehouse. This process is complex and expensive and introduces significant delays in data availability.
Tomer Shiran and Jacques Nadeau offer an overview of Apache Arrow, an open source in-memory columnar technology that makes it possible to combine multiple data sources and expose them as a virtual data lake to users of Spark, SQL-on-Hadoop, Python, and R. Tomer and Jacques outline the architecture of an Arrow-based solution for querying data from disparate data sources before highlighting best practices for accelerating queries, such as in-memory caching and pre-aggregation. They conclude with a live demo exploring and analyzing data across HDFS, S3, MongoDB, and Elasticsearch using popular client applications, including Tableau, Excel, Python, and R.
Tomer Shiran is cofounder and CEO of Dremio, the data lake engine company. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, road map, and new feature development and helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. He has also held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of eight US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.
Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.
©2017, O’Reilly UK Ltd