Predicting when and where crime will occur poses a particularly interesting data challenge. Crime data is sparse, often has complex internal dependencies, and may be affected by many different types of features, including weather, city infrastructure, population demographics, public events, and government policy. Jorie Koster-Hale shares an approach using a combination of open source data, machine learning, time series modeling, and geostatistics to determine where crime will occur, what predicts it, and what we can do to prevent it in the future. The approach offers a way to deal with datasets that are not straightforwardly amenable to classic ML techniques, including those with spatial and temporal dependencies and sparse coding.
The project draws on a variety of public datasets, including police reports released by the US National Institute of Justice, the US census, Foursquare, newspapers, and weather records. Jorie explains how to merge, visualize, model, and deploy this type of multidimensional data. Specifically, she covers engineering spatial features using PostGIS and spatial mapping and employing targeted statistical techniques (e.g., Bayesian time series decomposition and spatial kriging), dimensionality reduction (e.g., PCA), and machine learning (e.g., XGBoost and artificial neural nets) to predict future crime. This combination predicts future crime more effectively than any of these techniques alone, capturing up to 95% of crime hot spots. The model is deployed behind a public REST API, allowing real-time modeling of crime hot spots for the coming week.
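To make the idea of spatial features and hot-spot capture concrete, here is a minimal sketch in plain Python. It is not the pipeline described in the talk (which relies on PostGIS, kriging, and XGBoost, among others). It only assumes that each incident is a (date, latitude, longitude) tuple, bins incidents into a regular grid, and scores a naive baseline that flags last week's busiest cells as next week's hot spots. The function names, the 0.01-degree cell size, and the "top k cells" rule are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

CELL_SIZE_DEG = 0.01  # hypothetical grid resolution, in degrees


def grid_cell(lat, lon, cell_size=CELL_SIZE_DEG):
    """Map a coordinate to an integer (row, col) grid cell."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))


def weekly_cell_counts(incidents, cell_size=CELL_SIZE_DEG):
    """Count incidents per (ISO year, ISO week, grid cell).

    `incidents` is an iterable of (datetime.date, lat, lon) tuples.
    """
    counts = Counter()
    for day, lat, lon in incidents:
        iso_year, iso_week, _ = day.isocalendar()
        counts[(iso_year, iso_week, grid_cell(lat, lon, cell_size))] += 1
    return counts


def top_cells(counts, year, week, k):
    """Return the k cells with the most incidents in the given ISO week."""
    week_counts = {
        cell: n for (y, w, cell), n in counts.items() if (y, w) == (year, week)
    }
    # Sort by descending count, breaking ties by cell index for determinism.
    ranked = sorted(week_counts, key=lambda cell: (-week_counts[cell], cell))
    return set(ranked[:k])


def capture_rate(predicted_cells, counts, year, week):
    """Fraction of a week's incidents that fall inside the predicted cells."""
    total = 0
    hits = 0
    for (y, w, cell), n in counts.items():
        if (y, w) != (year, week):
            continue
        total += n
        if cell in predicted_cells:
            hits += n
    return hits / total if total else 0.0
```

A caller would build `counts` once from historical reports, take `top_cells` for one week, and pass the result to `capture_rate` for the following week. A persistence baseline like this gives a reference point against which a richer model, such as the ensemble described above, can be compared.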
Jorie details the challenges of deploying complex ensembled models and discusses techniques to support scalability. She concludes by exploring the features most predictive of future crime, including poverty, familial instability, and lack of commercial infrastructure. She also discusses how to use these types of models to understand where crime will occur next and what can be done to prevent it, along with the dangers and ethical considerations of building and deploying them.
Jorie Koster-Hale is a lead data scientist at Dataiku with expertise in neuroscience, healthcare data, and machine learning. Previously, she was a postdoctoral fellow at Harvard. Jorie holds a PhD in cognitive neuroscience from the Massachusetts Institute of Technology. She currently resides in Paris, where she builds predictive models and eats pain au chocolat.
©2018, O’Reilly UK Ltd