Assessing a large, unfamiliar dataset for usability and data distribution is time-consuming and difficult. Real-life data is generally not nicely distributed: it contains all kinds of ‘artefacts’, which are often not errors but meaningful features.
We examined income tax data on 16 million inhabitants of the Netherlands for unexpected patterns in univariate and bivariate distributions. The dataset contains a large number of variables (400+), most of them continuous, plus a few dozen categorical ones. Analyzing the full dataset means sifting through 400 × 400 pairs of variables. We have found a way to quickly separate the noninformative pairs from the rest. Seemingly boring distributions may contain interesting patterns that can be found by subselecting the data along suitable categorical variables. Finding interesting patterns in a dataset with many records (N > 10M) and many variables (p > 400) is a daunting task, so we use machine learning and image processing techniques to recognize similar distributions.
Furthermore, we provide guidelines for using histograms and heatmaps to find interesting patterns.
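To make the idea of screening variable pairs concrete, here is a minimal sketch in Python. It is not the method presented in the talk, which the abstract does not specify. It assumes that each pair is summarised as a binned heatmap (a 2D histogram) and that a pair counts as noninformative when its heatmap is close to the product of its two marginal histograms, measured here by mutual information. The function names, the bin count, and the synthetic `income`, `tax`, and `noise` columns are all hypothetical and chosen for illustration only.

```python
import itertools

import numpy as np


def heatmap(x, y, bins=32):
    """Bin two variables into a 2D histogram, normalised to sum to 1."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    return counts / counts.sum()


def dependence_score(p_xy):
    """Mutual information (in nats) of a binned joint distribution.

    A score near zero means the heatmap is roughly the product of its
    marginals, so the pair adds little beyond the univariate histograms.
    """
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal over rows
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal over columns
    expected = p_x * p_y                   # joint under independence
    mask = p_xy > 0                        # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / expected[mask])))


def rank_pairs(columns, bins=32):
    """Score every unordered pair of columns, most dependent first."""
    scores = {}
    for a, b in itertools.combinations(columns, 2):
        scores[(a, b)] = dependence_score(heatmap(columns[a], columns[b], bins))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Synthetic stand-in data (assumption): one dependent pair, one unrelated column.
rng = np.random.default_rng(0)
n = 20_000
income = rng.lognormal(mean=10, sigma=0.5, size=n)
data = {
    "income": np.log(income),
    "tax": np.log(income * 0.3) + rng.normal(0, 0.05, n),
    "noise": rng.normal(size=n),
}

ranking = rank_pairs(data)
for pair, score in ranking:
    print(pair, round(score, 3))
```

Pairs at the bottom of such a ranking are candidates to skip, while pairs at the top are worth inspecting as heatmaps. With a finite sample, binned mutual information is biased slightly above zero even for independent variables, so in practice a threshold would have to be calibrated rather than set to exactly zero.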
Alex Priem is a statistical consultant and data scientist at Statistics Netherlands: the Dutch government agency that is responsible for producing official demographic, economic, social and environmental statistics. His primary focus is data analysis and data visualization, and he is fluent in C, Python and various flavours of SQL. Although his work requires him to crunch and analyze ‘Big Data’, he doesn’t mind programming microcontrollers for fun in his spare time.
Edwin de Jonge is a statistical consultant and data scientist at Statistics Netherlands: the Dutch government agency that is responsible for producing official demographic, economic, social and environmental statistics. His expertise is statistical computing, data visualisation and exploratory techniques. He is well versed in several programming languages, including R and Python. Edwin is the author of several R packages and of a book on using RStudio. Currently he is writing a book on data cleaning with applications in R.
©2015, O’Reilly UK Ltd
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.