One of the key challenges in working with real-time and streaming data is that the format used to capture data is not necessarily the best format for ad hoc analytic queries. For example, Avro is a convenient and popular serialization format that is well suited to initially bringing data into HDFS. Avro has native integration with Flume and other tools, which makes it a good choice for landing data in Hadoop. But columnar file formats, such as Parquet and ORC, are much better optimized for ad hoc queries that aggregate over large numbers of similar rows.
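The difference between a row-oriented capture format and a columnar query format can be sketched in a few lines of plain Python. This is only an illustration of the layout idea, not how Avro, Parquet, or ORC are implemented; the records, the field names, and the `to_columns` helper are all hypothetical.

```python
# Illustrative only: the same three records in a row layout and a column layout.
# Field names and values are made up for this sketch.
rows = [
    {"user": "a", "region": "east", "amount": 10.0},
    {"user": "b", "region": "west", "amount": 5.5},
    {"user": "c", "region": "east", "amount": 4.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited just to read one field.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout: only the "amount" values are visited.
total_from_columns = sum(columns["amount"])
```

Both totals are the same, but the columnar version touches only the values the aggregate needs, which is the property that makes columnar formats attractive for queries over many similar rows.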
Ruhollah Farchtchi explores best practices for dealing with these challenges and the append-only nature of HDFS and discusses how to make sure data is distributed appropriately. Distributing data well is challenging with static data and even tougher with real-time, dynamic data. Ruhollah also explains how to deal with updates to existing data, whether due to restatements or a need to compact the data.
Ruhollah then offers an overview of Kudu, a new storage layer for Hadoop that is specifically designed for fast analytics on rapidly changing data, and demonstrates how Kudu simplifies the architecture of such systems. He also reviews a number of lessons learned from working with Kudu, including how to use dictionary attributes to optimize storage of denormalized dimensional data; how to achieve a high degree of parallelization of queries via data distribution and by sizing the number of tablets to the available cores; and how to balance insert rates against read-heavy workloads.
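The general idea behind dictionary encoding, which is what makes repeated dimension values cheap to store in denormalized data, can be sketched as follows. This is a minimal illustration in plain Python, not Kudu's actual implementation; the function names and the sample column are assumptions made for the example.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): each distinct value is stored once,
    and the column itself becomes a list of small integer codes."""
    dictionary = []   # distinct values, in order of first appearance
    index = {}        # value -> position in dictionary
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


# Hypothetical denormalized dimension column with few distinct values.
country = ["United States", "Canada", "United States", "United States", "Canada"]
dictionary, codes = dictionary_encode(country)
```

The longer and more repetitive the column, the more the encoded form saves, since each repeated string is replaced by a small integer.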
Ruhollah Farchtchi is chief technologist and vice president of Zoomdata Labs. Ruhollah has over 15 years’ experience in enterprise data management architecture and systems integration. Prior to Zoomdata, he held management positions at BearingPoint, Booz-Allen, and Unisys. Ruhollah holds an MS in information technology from George Mason University.
©2016, O’Reilly UK Ltd
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.