Patterns and Metapatterns in Income Tax Data

Alex Priem (Statistics Netherlands), Edwin De Jonge (Statistics Netherlands)
Government/Open Data
Location: 212

Assessing a large, unfamiliar dataset for usability and data distribution is time-consuming and difficult. Real-life data is generally not nicely distributed; it contains all kinds of ‘artefacts’, which are often not errors but meaningful features of the data.

We examined income tax data of 16 million inhabitants of the Netherlands for unexpected patterns in univariate and bivariate distributions. The dataset contains a large number of variables (400+), most of them continuous, along with a few dozen categorical variables. Analyzing the full dataset means sifting through roughly 400×400 pairs of variables. We have found a way to quickly separate the noninformative pairs from the rest. Seemingly boring distributions may contain interesting patterns that can be found by subsetting the data along suitable categorical variables. Finding interesting patterns in a dataset with many records (N > 10M) and many variables (p > 400) is a daunting task, so we use machine learning and image processing techniques to distinguish similar distributions.
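The abstract does not say how noninformative pairs are separated from the rest. Purely as an illustration, and not as the authors' method, one simple way to rank variable pairs is to bin each pair into a 2D histogram (the same grid a heatmap would show) and score how far it departs from the product of its marginals. The sketch below uses binned mutual information for that score; the function names, the bin count and the synthetic variables `income`, `tax` and `noise` are all hypothetical.

```python
from itertools import combinations

import numpy as np


def binned_mutual_information(x, y, bins=32):
    """Mutual information (in nats) estimated from a 2D histogram of x and y."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    joint = counts / counts.sum()
    # Marginal distributions along each axis of the heatmap grid.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    expected = px * py  # what the heatmap would look like under independence
    mask = joint > 0    # empty cells contribute nothing to the sum
    return float(np.sum(joint[mask] * np.log(joint[mask] / expected[mask])))


def rank_pairs(data, bins=32):
    """Score every unordered pair of columns; highest score first.

    `data` is a dict mapping a (hypothetical) variable name to a 1D array.
    """
    scores = []
    for a, b in combinations(sorted(data), 2):
        score = binned_mutual_information(data[a], data[b], bins)
        scores.append((score, a, b))
    return sorted(scores, reverse=True)


# Synthetic stand-in data; the real tax dataset is not available here.
rng = np.random.default_rng(0)
n = 20_000
income = rng.lognormal(10, 1, n)
data = {
    "income": income,
    "tax": 0.3 * income + rng.normal(0, 100, n),  # strongly tied to income
    "noise": rng.normal(0, 1, n),                 # unrelated to both
}

ranking = rank_pairs(data)
for score, a, b in ranking:
    print(f"{a:>6} x {b:<6} {score:.3f}")
```

Pairs near the bottom of such a ranking are candidates for the "noninformative" group. Equal-width bins are a crude choice for skewed variables such as income, so a log or rank transform before binning would likely be needed in practice.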

Furthermore, we provide guidelines for using histograms and heatmaps to find interesting patterns.

Alex Priem

Statistics Netherlands

Alex Priem is a statistical consultant and data scientist at Statistics Netherlands: the Dutch government agency that is responsible for producing official demographic, economic, social and environmental statistics. His primary focus is data analysis and data visualization, and he is fluent in C, Python and various flavours of SQL. Although his work requires him to crunch and analyze ‘Big Data’, he doesn’t mind programming microcontrollers for fun in his spare time.

Edwin De Jonge

Statistics Netherlands

Edwin de Jonge is a statistical consultant and data scientist at Statistics Netherlands: the Dutch government agency that is responsible for producing official demographic, economic, social and environmental statistics. His expertise is statistical computing, data visualisation and exploratory techniques. He is well versed in several programming languages, including R and Python. Edwin is the author of several R packages and a book on using RStudio. Currently he is writing a book on data cleaning with applications in R.