Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Rent, rain, and regulations: Leveraging structure in big data to predict criminal activity

Jorie Koster-Hale (Dataiku)
17:25–18:05 Wednesday, 23 May 2018
Average rating: *****
(5.00, 3 ratings)

Who is this presentation for?

  • Data scientists, criminologists, and law and policy makers

Prerequisite knowledge

  • Familiarity with machine learning and data science techniques

What you'll learn

  • Explore an approach for predicting crime that leverages a large and diverse set of open source data, combines spatial and temporal features, manages geodata, and pairs machine learning with classical and targeted statistical techniques suited to complex data with a strong temporal (time series) dimension


Predicting when and where crime will occur poses a particularly interesting data challenge. Crime data is sparse, often has complex internal dependencies, and may be affected by many different types of features, including weather, city infrastructure, population demographics, public events, and government policy. Jorie Koster-Hale shares an approach using a combination of open source data, machine learning, time series modeling, and geostatistics to determine where crime will occur, what predicts it, and what we can do to prevent it in the future. The approach offers a way to deal with datasets that are not straightforwardly amenable to classic ML techniques, including those with spatial and temporal dependencies and sparse coding.
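To make the sparsity point concrete, here is a minimal sketch of one common way to give point-level incident records a spatial and temporal structure: binning them into grid cells and ISO weeks. It is not taken from the talk. The grid resolution, the function names, and the sample coordinates are all illustrative assumptions.

```python
import math
from collections import Counter
from datetime import date

CELL_SIZE_DEG = 0.01  # hypothetical grid resolution, in degrees


def cell_of(lat, lon, cell_size=CELL_SIZE_DEG):
    """Map a coordinate to integer (row, col) grid indices."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))


def week_of(day):
    """Return (ISO year, ISO week number) for a date."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def bin_incidents(incidents, cell_size=CELL_SIZE_DEG):
    """Count incidents per (grid cell, ISO week) slot."""
    counts = Counter()
    for lat, lon, day in incidents:
        counts[(cell_of(lat, lon, cell_size), week_of(day))] += 1
    return counts


def sparsity(counts, n_cells, n_weeks):
    """Fraction of (cell, week) slots that contain no incidents."""
    return 1.0 - len(counts) / (n_cells * n_weeks)


# Made-up sample records: (latitude, longitude, date of incident).
sample = [
    (45.523, -122.676, date(2018, 5, 21)),
    (45.527, -122.679, date(2018, 5, 23)),
    (45.541, -122.662, date(2018, 5, 30)),
]
weekly_counts = bin_incidents(sample)
```

Even in this toy case most (cell, week) slots stay empty, which is the kind of sparsity that makes a naive regression on raw counts unreliable.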

The project leverages a variety of public datasets, including police reports released by the US National Institute of Justice, the US Census, Foursquare, newspapers, and weather records. Jorie explains how to merge, visualize, model, and deploy this type of multidimensional data, specifically by engineering spatial features using PostGIS and spatial mapping and employing targeted statistical techniques (e.g., Bayesian time series decomposition and spatial kriging), dimensionality reduction (e.g., PCA), and machine learning (e.g., XGBoost and artificial neural nets) to predict future crime. This combination is more effective at predicting future crime than any of these techniques alone (capturing up to 95% of crime hot spots). The model is deployed using a public REST API, allowing real-time modeling of crime hot spots for the coming week.
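The abstract does not say how the models are combined or how "capturing" a hot spot is measured, so the following is only a sketch under stated assumptions: per-cell risk scores from two hypothetical models are averaged, and capture is taken to mean the share of the k busiest observed cells that also appear in the top k predictions. All names and numbers are invented for illustration.

```python
def average_scores(*model_scores):
    """Average per-cell risk scores from several models (a simple ensemble)."""
    cells = set().union(*(scores.keys() for scores in model_scores))
    return {
        cell: sum(scores.get(cell, 0.0) for scores in model_scores) / len(model_scores)
        for cell in cells
    }


def top_k_cells(scores, k):
    """Return the k highest-scoring cells, breaking ties by cell id."""
    ranked = sorted(scores, key=lambda cell: (-scores[cell], cell))
    return set(ranked[:k])


def hot_spot_capture(predicted_scores, observed_counts, k):
    """Share of the k busiest observed cells found in the top-k predictions."""
    predicted = top_k_cells(predicted_scores, k)
    actual = top_k_cells(observed_counts, k)
    return len(predicted & actual) / k


# Invented scores from two hypothetical models, plus invented observed counts.
time_series_scores = {"a": 0.9, "b": 0.2, "c": 0.3, "d": 0.1}
spatial_scores = {"a": 0.5, "b": 0.8, "c": 0.6, "d": 0.1}
observed = {"a": 12, "b": 9, "c": 3, "d": 0}

ensemble_scores = average_scores(time_series_scores, spatial_scores)
ensemble_capture = hot_spot_capture(ensemble_scores, observed, k=2)
```

In this toy data each model alone finds only one of the two busiest cells, while the averaged scores find both. That mirrors the claim that the combination outperforms its parts, but it is a constructed example, not a reproduction of the reported result.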

Jorie details the challenges of deploying complex ensembled models and discusses techniques to support scalability. Jorie then concludes by exploring the features that are most predictive of future crime, including poverty, familial instability, and lack of commercial infrastructure, and how to use these types of models to understand where crime will occur next, what we can do to prevent it in the future, and the dangers and ethical considerations of building and deploying these types of models.

Jorie Koster-Hale


Jorie Koster-Hale is a lead data scientist at Dataiku with expertise in neuroscience, healthcare data, and machine learning. Previously, she was a postdoctoral fellow at Harvard. Jorie holds a PhD in cognitive neuroscience from the Massachusetts Institute of Technology. She currently resides in Paris, where she builds predictive models and eats pain au chocolat.