As data is ingested into Hadoop at an increasing rate from a diverse range of data sources, it is becoming more and more important for users that new data be accessible for analysis as quickly as possible—because “data freshness” can have a direct impact on business results.
In the traditional ETL process, raw data is transformed from the source into a target schema, possibly requiring flattening and condensing, and then loaded into an MPP DBMS. However, this approach has multiple drawbacks that make it unsuitable for real-time, “at-source” analytics—for example, the “ETL lag” reduces data freshness, and the inherent complexity of the process makes it costly to deploy and maintain and reduces the speed at which new analytic applications can be introduced.
In this talk, attendee will learn about Impala’s approach to on-the-fly, automatic data transformation, which in conjunction with the ability to handle nested structures such as JSON and XML documents, addresses the needs of at-source analytics—including direct querying of your input schema, immediate querying of data as it lands in HDFS, and high performance on par with specialized engines. This performance level is attained in spite of the most challenging and diverse input formats, which are addressed through an automated background conversion process into Parquet, the high-performance, open source columnar format that has been widely adopted across the Hadoop ecosystem.
Tech lead at Cloudera for new products. Graduated in 2000 with a PhD in databases from UC Berkeley, followed by engineering jobs at a few database-related startup companies. Marcel joined Google in 2003, where he worked on several ads serving and storage infrastructure projects. His last engagement was as the tech lead for the distributed query engine component of Google’s F1 project.
For exhibition and sponsorship opportunities, email email@example.com
For information on trade opportunities with O'Reilly conferences, email firstname.lastname@example.org
For media-related inquiries, contact Maureen Jennings at email@example.com
View a complete list of Strata + Hadoop World contacts
©2015, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.