Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Just-in-time optimizing a database

Ted Dunning (MapR)
4:20pm–5:00pm Wednesday, 03/30/2016
Data Innovations

Location: 230 C
Average rating: 4.12 (8 ratings)

Prerequisite knowledge

Attendees should have a basic knowledge of SQL, but all terms will be explained as we go.

Description

SQL is normally a very static language: it assumes a fixed, well-known schema and flat data structures with noncomplex field values. The received wisdom is that these assumptions about static data are required for performance. However, big data systems depend on staying flexible and dynamic so that technical debt does not grow nonlinearly as they scale. This contradiction can make it difficult to process important data streams using SQL.

Apache Drill squares this circle by rethinking many of the assumptions that have been built into query systems over the last few decades. By moving much of the optimization and type specificity out of query parsing and static optimization and into execution itself, the Drill query engine can deal very efficiently with data that has deep structure and unknown schema. The structure of the parallel computation can often be optimized without much detailed schema information, while detailed optimization using type and structure information can be done very late in execution, based on the schema actually observed in the data. Drill can even switch to alternative optimizations as changes in data structure are observed over the course of a large query.

The only strong assumption that Drill makes a priori is that the data being processed conforms to the JSON data model; there is not even a guarantee that any record resembles any other. Drill can still use schema information if it is available early, or it can defer exploiting it until it becomes available. Supporting this requires a wholesale restructuring of query parsing, optimization, and execution.
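As a hypothetical illustration of how little Drill needs to know up front, the query below runs directly against a raw JSON file of nested and possibly irregular records; the file path and field names are invented for this sketch:

    -- Query raw JSON in place; no schema is declared anywhere.
    -- Nested fields are reached with dotted paths, and repeated
    -- fields can be expanded with FLATTEN.
    SELECT c.name,
           c.address.city    AS city,
           FLATTEN(c.orders) AS single_order
    FROM   dfs.`/data/customers.json` c
    WHERE  c.address.state = 'CA';

The dfs storage plugin shown here ships with Drill; the point of the sketch is simply that the table is the raw file itself, with types and structure discovered as the data is read.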

Ted Dunning walks attendees through Apache Drill, explaining potential use cases for the technology and why these extended capabilities matter to all big data practitioners.

Topics include:

  • How Drill can process data as it exists in the wild without expensive ETL processes (see the sketch after this list)
  • Why SQL has a strong future in the big data world, especially on NoSQL databases
  • How Drill brings together the sophistication and familiarity of SQL with the flexibility of the Hadoop ecosystem
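To make the first topic above concrete, the following sketch joins a freshly landed JSON file against an existing Parquet file in place, with no load or schema-definition step; the paths and column names are assumptions made for illustration:

    -- Join data as it exists in the wild: raw JSON on one side,
    -- Parquet written by some other job on the other.
    SELECT o.order_id,
           o.amount,
           c.name
    FROM   dfs.`/landing/orders.json`         o
    JOIN   dfs.`/warehouse/customers.parquet` c
      ON   o.cust_id = c.id;

The same pattern extends to Drill's other storage plugins, so SQL tooling can sit directly on top of NoSQL stores and Hadoop file formats without an intermediate ETL pass.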

Ted Dunning

MapR

Ted Dunning is the chief technology officer at MapR. He’s also a board member for the Apache Software Foundation; a PMC member and committer of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects; and a mentor for various incubator projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He’s contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library, and he designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.