Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Getting Started with Interactive SQL-on-Hadoop

John Russell (Cloudera), Alan Choi (Cloudera)
9:00am–12:30pm Wednesday, 02/18/2015
Hadoop & Beyond
Location: 210 A/E
Average rating: 1.80 (5 ratings)
Slides: 1 (ZIP), 2 (PDF), 3 (PDF)

Materials or downloads needed in advance

- Experience with other SQL: any level
- QuickStart VM installation (via VMware or VirtualBox) – requires laptop with 64-bit OS & virtualization enabled

Description

THIS TUTORIAL HAS REQUIREMENTS AND INSTRUCTIONS LISTED BELOW

Impala is the massively parallel analytic database delivering interactive performance on Hadoop. In this half-day tutorial, we’ll walk you through hands-on exercises, taking you from zero to up and running with Impala.

You’ll get an overview of how the Impala engine brings the computation to the data to achieve high performance and near-linear scalability for distributed queries. You’ll see many different ways to ingest data so that it’s queryable through Impala. You’ll explore all the kinds of tables (internal and external, various file formats, partitioned and unpartitioned) and learn when to use each one. You’ll try many kinds of queries in real time, and understand what factors influence their performance.
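
As a taste of the table types the tutorial covers, here is a minimal sketch (table names, columns, and HDFS paths below are hypothetical, not from the tutorial materials) contrasting an internal table with an external table that points at files already in HDFS:

```sql
-- Internal (managed) table: Impala controls the data files;
-- DROP TABLE removes the underlying data too.
CREATE TABLE page_views (url STRING, visits BIGINT)
  STORED AS PARQUET;

-- External table: points at files that already live in HDFS;
-- DROP TABLE leaves the data files in place.
CREATE EXTERNAL TABLE raw_logs (line STRING)
  LOCATION '/user/hive/warehouse/raw_logs';
```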

Audience:

  • Developers
  • Data analysts
  • SQL coders

Subjects covered in the tutorial include:

Logical and Physical aspects of the system:

  • Logical:
    • Databases and tables
    • Views
    • Table properties
    • Built-in functions
  • Physical:
    • Distributed queries
    • Partitions
    • HDFS directories
    • Data files
    • Hosts, nodes, cores
    • Read buffers
    • In-memory hash tables
    • Sources and sinks
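
A partitioned table makes the logical-to-physical mapping above concrete: each partition in the logical schema corresponds to an HDFS directory of data files. A sketch, with hypothetical table and column names:

```sql
-- Each (year, month) partition maps to an HDFS directory such as
-- .../sales/year=2015/month=2/ containing that partition's data files.
CREATE TABLE sales (id BIGINT, amount DECIMAL(10,2))
  PARTITIONED BY (year INT, month INT)
  STORED AS PARQUET;

-- Queries that filter on the partition key only read the matching
-- directories (partition pruning), skipping the rest of the table.
SELECT SUM(amount) FROM sales WHERE year = 2015 AND month = 2;
```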

SQL features and use cases:

  • New features in latest Impala releases
  • Partitioning
  • Date and time
  • Data ingest
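
Two common ingest paths, sketched with hypothetical table names and paths:

```sql
-- 1. Move files already in HDFS into a table's directory.
LOAD DATA INPATH '/incoming/sales_feb' INTO TABLE sales_staging;

-- 2. Copy/convert between tables (e.g., from a text staging table
-- into a partitioned Parquet table), with partitions filled
-- dynamically from the SELECT results.
INSERT INTO sales PARTITION (year, month)
  SELECT id, amount, year, month FROM sales_staging;
```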

Tuning:

  • EXPLAIN plan
  • SUMMARY
  • PROFILE
  • COMPUTE STATS
  • File format
  • Partitioning
  • Figuring out what the query is “bound” by: I/O, CPU, network
  • Native code generation
  • Spill to disk
  • Local vs. remote reads
  • HDFS caching
  • TPC-style tables and queries
  • Synthesized data for particular use cases
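
The core tuning loop looks roughly like this (table and query are hypothetical; SUMMARY and PROFILE are impala-shell commands, not SQL statements):

```sql
-- Gather table and column statistics so the planner can choose
-- join orders and strategies sensibly.
COMPUTE STATS sales;

-- Inspect the plan before running the query...
EXPLAIN SELECT year, SUM(amount) FROM sales GROUP BY year;

-- ...and, in impala-shell after running the query, see where the
-- time actually went:
--   SUMMARY;   -- per-operator timing overview
--   PROFILE;   -- full detailed runtime profile
```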

Different ways to “drive” Impala:

  • impala-shell
  • JDBC
  • ODBC
  • Perl
  • Python
  • PHP
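
For the Python route, one option is the third-party impyla package, which exposes a DB-API interface over Impala's HiveServer2 port. A sketch, assuming impyla is installed and an impalad is reachable (the host name and table are hypothetical; this requires a live cluster to run):

```python
# Hypothetical example: connect to Impala via impyla's DB-API interface.
from impala.dbapi import connect

# 21050 is the default HiveServer2 port for impalad.
conn = connect(host='quickstart.cloudera', port=21050)
cur = conn.cursor()
cur.execute('SELECT COUNT(*) FROM sample_07')
print(cur.fetchall())
cur.close()
conn.close()
```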

TUTORIAL REQUIREMENTS AND INSTRUCTIONS FOR ATTENDEES

* Experience with other SQL: any level
* QuickStart VM installation (via VMware or VirtualBox) – requires laptop with 64-bit OS & virtualization enabled

Attendees should bring a laptop, and download/install in advance:

  • CDH 5.3 QuickStart VM from here
  • VirtualBox, KVM, or VMware Player – whichever is most convenient for running the VM on your setup.
  • 7zip-compatible archive tool to unpack the downloaded QuickStart VM. Alternatives available here
  • Configure the QuickStart VM with as much memory and as many CPUs as practical for the capacity of your laptop. For example, an 8 GB MacBook Air might only be able to handle 1 CPU / 2 GB memory for the virtual machine; 4 or 8 GB of memory and 2 or 4 CPUs are better if laptop capacity allows.

    There WILL NOT be bandwidth for everybody to download during the tutorial, so please prepare in advance.


    John Russell

    Cloudera

    John Russell is a software developer and technical writer, and he’s currently the documentation lead for Impala. He is the author of the forthcoming book from O’Reilly, “Getting Started with Impala”.


    Alan Choi

    Cloudera

    Alan Choi is a software engineer at Cloudera working on the Impala project. Previously, he worked at Greenplum on the Greenplum-Hadoop integration and worked extensively on PL/SQL and SQL at Oracle.

    Comments on this page are now closed.

    Comments

    John Russell
    02/24/2015 2:41pm PST

    Arthur, stats go into the metastore database – the same fields used for stats on Hive tables. The stats values are not dependent on the file format. (It might take more or less time to compute them, based on the efficiency of querying a particular file format.)

    With the latest Impala release, ‘incremental stats’ help to cut down the time to compute stats for big partitioned tables – the incremental stats avoid rescanning the whole table when a new partition is added.
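
For reference, the statement in question looks like this (table name and partition spec are hypothetical):

```sql
-- Recompute stats only for new or changed partitions
-- instead of rescanning the whole table.
COMPUTE INCREMENTAL STATS sales;

-- Or target one specific partition:
COMPUTE INCREMENTAL STATS sales PARTITION (year=2015, month=2);
```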

    John Russell
    02/24/2015 2:37pm PST

    Slides, the Q&A material, and sample code and output will be downloadable from this session page when it is posted by the O’Reilly folks. I improved the readability of the sample code by color-coding the SQL statements, output, and commentary.

    Arthur Yeo
    02/18/2015 4:11am PST

    Where are the stats stored?
    Are stats dependent on file format, like Parquet?
    Can Impala compute stats and use them if the file format is non-Parquet?

    Alan Choi
    02/18/2015 3:58am PST

    Stats can get stale. So, no, Impala cannot know the number of rows without scanning.

    Arthur Yeo
    02/18/2015 3:34am PST

    If stats are computed, does Impala know the current # of rows without doing a table scan?

    Alan Choi
    02/18/2015 2:10am PST

    The other advantage of Parquet is that it compresses better than the other formats. So, you’ll still get reduced I/O even though you’re scanning all the columns.

    But I/O is only part of the equation. The other part of the equation is CPU. Assembling a row from columnar storage (i.e., Parquet) requires more CPU, and there are a few cases where scanning all the columns is slower with Parquet.

    For intermediate tables, it becomes even more interesting. Writing Parquet files is a bit more CPU-intensive than other formats. While you write fewer bytes (because of better compression), you use more CPU cycles. So it’s a trade-off, and it depends largely on your transformation.

    Arthur Yeo
    02/18/2015 1:56am PST

    Is Parquet recommended for cases where the entire table will need to be scanned for most of the columns?
    I’m thinking about intermediate tables that are created during a transformation (data massaging) step.

    If Parquet is not useful in this situation, which other format is recommended?

    Picture of Alan Choi
    Alan Choi
    02/18/2015 1:25am PST

    Right now, the only way is to use the “shell action” in Oozie to connect to Impala through either impala-shell or your own JDBC application. Full integration with Oozie will come down the road.

    Arthur Yeo
    02/18/2015 1:15am PST

    Are your slides available online?
    Where will they be located?

    Arthur Yeo
    02/18/2015 1:03am PST

    Quick question:
    Is Impala integrated with Oozie?