SQL is normally a very static language: it assumes a fixed, well-known schema and flat data structures with simple, scalar field values. The received wisdom is that these assumptions about static data are required for performance. However, big data systems depend on flexibility and dynamism to keep technical debt from growing nonlinearly as they scale. This tension can make it difficult to process important data streams using SQL.
Apache Drill squares this circle by rethinking many of the assumptions that have been built into query systems over the last few decades. By moving much of the optimization and type specificity out of query parsing and static optimization and into the execution process itself, the Drill query engine can deal very efficiently with data that has deep structure and unknown schema. The parallel structure of the computation can often be optimized without much detailed schema information, while type- and structure-specific optimization can be deferred until very late in execution, based on the schema actually observed in the data. This even allows alternative optimizations to be applied as changes in the data structure are observed over the course of a large query.
The only strong assumption that Drill makes a priori is that the data being processed conforms to the JSON data model; there is not even a guarantee that any record resembles any other. Drill can still use schema information if it is available early, or it can defer exploiting that information until it becomes available. Supporting this requires a wholesale restructuring of the query parsing, optimization, and execution process.
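As a concrete illustration of this schema-on-read style, a Drill query can address nested JSON fields with ordinary dot notation, with no table definition or schema registration beforehand. (The file path and field names below are hypothetical; `dfs` is Drill's standard filesystem storage plugin.)

```sql
-- Query a directory of raw JSON files directly; Drill discovers the
-- structure of each record at execution time rather than from a catalog.
SELECT t.`user`.`id`      AS user_id,
       t.`event`.`type`   AS event_type
FROM   dfs.`/data/clicks` t
WHERE  t.`user`.`country` = 'US';
```

Records that lack a referenced field generally just yield NULL for it, which is one way the engine accommodates records that differ in structure from one another.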
Ted Dunning walks attendees through Apache Drill, explaining potential use cases for the technology and why these extended capabilities matter to all big data practitioners.
Ted Dunning is chief application architect at MapR. He’s also a board member for the Apache Software Foundation; a PMC member and committer of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects; and a mentor for various incubator projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He’s contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.
©2016, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.