Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Rent, rain, and regulations: leveraging structure in big data to predict criminal activity

Jorie Koster-Hale (Dataiku)
17:25–18:05 Wednesday, 23 May 2018

Who is this presentation for?

Data scientists, criminologists, law and policy makers

Prerequisite knowledge

This session is intended for those who are familiar with some machine learning and data science techniques and want to add a more targeted, sophisticated set of techniques to their toolbox for dealing with specific kinds of data.

What you'll learn

How to:
- Leverage a large and diverse set of open source data
- Engineer and combine spatial and temporal features
- Use targeted statistical techniques to handle complex data (e.g. Bayesian time series, spatial kriging)
- Combine machine learning with classical statistical techniques
- Manage geo-data (geo-hashing and PostGIS)
- Use machine learning techniques on data with a strong temporal (time series) dimension
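To make the geo-hashing item concrete, here is a minimal sketch of the standard geohash encoding in pure Python. It is an illustration only, not the speaker's implementation; the function name `geohash_encode`, the default precision, and the sample coordinates are assumptions introduced for this example.

```python
from collections import Counter

# Standard geohash base-32 alphabet (digits plus letters, omitting a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=7):
    """Encode a latitude/longitude pair as a geohash string.

    The encoder repeatedly bisects the longitude and latitude ranges,
    alternating between them (longitude first), and packs each group of
    five resulting bits into one base-32 character.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # geohash bit interleaving starts with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits = bits << 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits = bits << 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


# Hypothetical incident coordinates (lat, lon), made up for illustration.
incidents = [(57.64911, 10.40744), (57.64950, 10.40700), (40.0, -75.0)]

# Bucket incidents into coarse grid cells by truncating to 5 characters.
cell_counts = Counter(geohash_encode(lat, lon, precision=5) for lat, lon in incidents)
```

Because a shorter geohash is a prefix of a longer one for the same point, truncating the string is a cheap way to aggregate incidents at a coarser spatial resolution.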


Predicting when and where crime will occur poses a particularly interesting data challenge: the data are sparse, have complex internal dependencies, and may be affected by many different types of features, including weather, city infrastructure, population demographics, public events, and government policy. As the data being modeled with machine learning become more complex, our machine learning tools need to be able to handle this complexity. Here, with a model that predicts the time and place of future crimes, I offer an approach for dealing with datasets that are not straightforwardly amenable to classic ML techniques, including those with spatial and temporal dependencies and sparse coding.

The project leverages a variety of public data sets, including police reports released by the US National Institute of Justice, the US census, Foursquare, newspapers, and the weather. I focus on how to merge, visualize, model, and deploy this type of multi-dimensional data. Specifically, I engineer spatial features using PostGIS and spatial mapping, employ targeted statistical techniques (e.g. Bayesian time series decomposition; spatial kriging), dimensionality reduction (e.g. PCA), and machine learning (e.g. XGBoost, artificial neural nets) to predict future crime. I show that this combination of machine learning, time series modeling, and geostatistics is more effective at predicting future crime than any of these techniques alone (capturing up to 95% of crime hot spots).
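One simple way to expose temporal structure to a machine learning model is to feed it lagged counts per spatial cell. The sketch below is an assumption-laden illustration of that general idea, not the feature set used in the talk; the function names, the cell identifiers, and the toy incident list are all hypothetical.

```python
from collections import Counter


def weekly_counts(incidents):
    """Count incidents per (cell, week) pair.

    `incidents` is an iterable of (cell_id, week_index) tuples.
    """
    return Counter((cell, week) for cell, week in incidents)


def lag_features(counts, cell, week, n_lags=3):
    """Return the incident counts for the `n_lags` weeks before `week`.

    The list is ordered most recent first; weeks with no recorded
    incidents contribute 0.
    """
    return [counts.get((cell, week - k), 0) for k in range(1, n_lags + 1)]


# Hypothetical toy data: (cell identifier, week index).
incidents = [
    ("u4pru", 1),
    ("u4pru", 1),
    ("u4pru", 2),
    ("u4prv", 2),
    ("u4pru", 3),
]

counts = weekly_counts(incidents)

# Features for predicting week 4 in cell "u4pru": counts for weeks 3, 2, 1.
features = lag_features(counts, "u4pru", week=4)
```

Vectors like `features` could then be combined with spatial and demographic attributes as inputs to a downstream model; how the talk actually combines them is not specified in this abstract.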

I deploy this model using a public REST API, allowing real-time modeling of crime “hotspots” for the coming week. I consider the challenges of deploying complex “ensembled” models, and discuss techniques to support scalability.

Finally, I discuss the features that are most predictive of future crime, including poverty, familial instability, and lack of commercial infrastructure. We’ll discuss how we can use these types of models to understand where crime will occur next, what we can do to prevent it in the future, and the dangers and ethical considerations of building and deploying these types of models.


Jorie Koster-Hale


Jorie Koster-Hale is a lead scientist at Dataiku, with expertise in neuroscience, healthcare data, and machine learning. Prior to joining Dataiku, she completed her Ph.D. in Cognitive Neuroscience at Massachusetts Institute of Technology and worked as a Postdoctoral Fellow at Harvard. Jorie currently resides in Paris, where she builds predictive models and eats pain au chocolat.
