Modern data is often messy and does not fit into the old schema-on-write or even the newer schema-on-read paradigms. Some data effectively has no schema at all. For example, in a MongoDB collection or a Mixpanel log file, different records may have different fields, and identically named fields in different records may have different types. This makes even simple analysis difficult, because a query cannot assume that a given field exists or that it has a consistent type.
Apache Drill has been built with this sort of data in mind. Tomer Shiran explores how to analyze such data with Drill, covering Drill’s internal architecture and explaining how type introspection can be used to query JSON and JSON-structured data—such as data in MongoDB—without requiring a schema.
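To make the idea of type introspection concrete, here is a minimal sketch in plain Python. It is not Drill's implementation and does not use any Drill API; the sample records and the helper names `discover_types` and `select_typed` are invented for illustration. It only shows the general technique of inspecting each record's fields and value types as the data is read, instead of relying on a schema declared up front.

```python
import json
from collections import defaultdict

# Hypothetical records, invented for illustration. Fields differ between
# records, and "age" is an int in one record and a string in another.
RAW_RECORDS = [
    '{"user": "ana", "age": 31}',
    '{"user": "bo", "age": "unknown", "plan": "pro"}',
    '{"user": "cy", "tags": ["a", "b"]}',
]


def discover_types(lines):
    """Map each field name to the sorted type names observed for it."""
    seen = defaultdict(set)
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


def select_typed(lines, field, wanted_type):
    """Yield values of `field` from records where it has exactly `wanted_type`."""
    for line in lines:
        record = json.loads(line)
        value = record.get(field)
        # Exact type match, so records with a missing or mistyped field are skipped.
        if type(value) is wanted_type:
            yield value


observed = discover_types(RAW_RECORDS)
ages = list(select_typed(RAW_RECORDS, "age", int))

print(observed)
print(ages)
```

In this sketch the "schema" is whatever the records turn out to contain: `observed` reports that `age` appears both as an integer and as a string, and `select_typed` keeps only the records where the field is present with the requested type.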
Tomer Shiran is cofounder and CEO of Dremio, the data lake engine company. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, road map, and new feature development and helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. He also held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of eight US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.
©2016, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.