Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Real-world NoSQL schema design

Ted Dunning (MapR)
5:25pm–6:05pm Wednesday, 09/30/2015
Production Ready Hadoop
Location: 3D 05/08 Level: Intermediate
Average rating: 3.70 (10 ratings)

There are lots of claims about the benefits of NoSQL databases, but few realistic demonstrations of the impact that such a database can have on anything more than toy-sized data. In this talk, I will deconstruct a real-world database schema into the corresponding NoSQL design.

The database that I will use is the MusicBrainz database, which exhibits many important idioms found in real databases, such as factoring relations into multiple tables to implement column families, linkage tables, and many-to-one relationships. The transformations that I will highlight show how almost all of the auxiliary tables in the original design reduce to a form that is much simpler to understand: nested data structures. As a result, the number of tables drops by nearly 5x, and the design becomes easier to understand to a similar degree.
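As a rough illustration of this kind of transformation, the sketch below folds a many-to-one table and a linkage table into one nested document per artist. The table and field names are hypothetical and heavily simplified; they are not the actual MusicBrainz schema, and this is not necessarily the method shown in the talk.

```python
from collections import defaultdict

# Hypothetical, simplified relational rows (not the real MusicBrainz schema).
artists = [
    {"id": 1, "name": "Artist A"},
    {"id": 2, "name": "Artist B"},
]
# Many-to-one: several alias rows point at one artist.
artist_aliases = [
    {"artist_id": 1, "alias": "A."},
    {"artist_id": 1, "alias": "The A"},
]
tags = [{"id": 10, "name": "rock"}, {"id": 11, "name": "jazz"}]
# Linkage table: pairs of (artist, tag) keys.
artist_tags = [
    {"artist_id": 1, "tag_id": 10},
    {"artist_id": 2, "tag_id": 10},
    {"artist_id": 2, "tag_id": 11},
]


def nest_artists(artists, artist_aliases, tags, artist_tags):
    """Fold the auxiliary tables into one nested document per artist."""
    tag_names = {t["id"]: t["name"] for t in tags}

    aliases_by_artist = defaultdict(list)
    for row in artist_aliases:
        aliases_by_artist[row["artist_id"]].append(row["alias"])

    tags_by_artist = defaultdict(list)
    for row in artist_tags:
        tags_by_artist[row["artist_id"]].append(tag_names[row["tag_id"]])

    return [
        {
            "id": a["id"],
            "name": a["name"],
            "aliases": aliases_by_artist.get(a["id"], []),
            "tags": tags_by_artist.get(a["id"], []),
        }
        for a in artists
    ]


documents = nest_artists(artists, artist_aliases, tags, artist_tags)
```

In this toy case four tables collapse into a single collection of documents, and the alias and tag lists travel with the artist they describe.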

In spite of such radical structural changes, the resulting denormalized and nested data can still be queried with SQL by using Apache Drill, and the queries are often noticeably simpler than the queries used against the original data structures. The methods presented in this talk are practical and easy to apply, and can sometimes even be largely automated.
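To give a feel for what such a query might look like, the sketch below holds a Drill-style query as a string and then computes the same rows in plain Python. FLATTEN is Drill's function for expanding a repeated field into one row per element; the file path, the field names, and the sample documents are assumptions made for illustration, and the query string is not executed here.

```python
# Illustrative only: a Drill-style query over a hypothetical JSON file of
# nested artist documents. The path and field names are assumptions.
DRILL_QUERY = """
SELECT a.name, FLATTEN(a.tags) AS tag
FROM dfs.`/tmp/artists.json` a
"""


def flatten_tags(documents):
    """Yield one (name, tag) row per element of each document's tags list."""
    for doc in documents:
        for tag in doc["tags"]:
            yield (doc["name"], tag)


# Hypothetical nested documents, one per artist.
documents = [
    {"name": "Artist A", "tags": ["rock"]},
    {"name": "Artist B", "tags": ["rock", "jazz"]},
]

rows = list(flatten_tags(documents))
```

Against a normalized design, the same result would typically need a join from the artist table through a linkage table to a tag table; here the nested list is expanded directly.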

I will also show how a percolator pattern can be used to allow the resulting NoSQL database to be automatically maintained in multiple NoSQL technologies simultaneously, so that full text search, recommendations, and the HBase API can all be used to access the same data.
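The abstract does not spell out the percolator pattern. One common reading is that every change to a primary table triggers observers that propagate the change to derived systems. The sketch below follows that reading with in-memory stand-ins: the class names are hypothetical, the observers run synchronously for simplicity, and no real search engine or HBase client is involved.

```python
from typing import Callable, Dict, List, Set

Document = Dict[str, object]
Observer = Callable[[str, Document], None]


class PrimaryStore:
    """Hypothetical primary table that notifies observers on every write."""

    def __init__(self) -> None:
        self._rows: Dict[str, Document] = {}
        self._observers: List[Observer] = []

    def subscribe(self, observer: Observer) -> None:
        self._observers.append(observer)

    def put(self, key: str, doc: Document) -> None:
        self._rows[key] = doc
        # A real system would likely do this asynchronously and durably.
        for observer in self._observers:
            observer(key, doc)


class TextIndex:
    """Toy inverted index standing in for a full-text search engine."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = {}
        self._words_by_key: Dict[str, Set[str]] = {}

    def on_change(self, key: str, doc: Document) -> None:
        # Drop postings left over from the previous version of this row.
        for word in self._words_by_key.get(key, set()):
            self._postings[word].discard(key)
        words = set(str(doc.get("name", "")).lower().split())
        for word in words:
            self._postings.setdefault(word, set()).add(key)
        self._words_by_key[key] = words

    def search(self, word: str) -> List[str]:
        return sorted(self._postings.get(word.lower(), set()))


class KeyValueView:
    """Toy key-value copy standing in for a table read through another API."""

    def __init__(self) -> None:
        self.rows: Dict[str, Document] = {}

    def on_change(self, key: str, doc: Document) -> None:
        self.rows[key] = dict(doc)


store = PrimaryStore()
index = TextIndex()
view = KeyValueView()
store.subscribe(index.on_change)
store.subscribe(view.on_change)

store.put("a1", {"name": "Blue Train"})
store.put("a1", {"name": "Giant Steps"})  # the update reaches both views
```

Because every write goes through `put`, the derived views stay in step with the primary table without the application having to update each one by hand.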

Ted Dunning

MapR

Ted Dunning is the chief technology officer at MapR. He’s also a board member for the Apache Software Foundation; a PMC member and committer of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects; and a mentor for various incubator projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He’s contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library, and he designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.