Predicting when and where crime will occur poses a particularly interesting data challenge: the data are sparse, have complex internal dependencies, and may be affected by many different types of features, including weather, city infrastructure, population demographics, public events, and government policy. As the data modeled with machine learning become more complex, our machine learning tools need to be able to handle this complexity. Here, using a model that predicts the time and place of future crimes, I offer an approach for dealing with datasets that are not straightforwardly amenable to classic ML techniques, including those with spatial and temporal dependencies and sparse coding.
The project draws on a variety of public data sets, including police reports released by the US National Institute of Justice, the US census, Foursquare, newspapers, and weather data. I focus on how to merge, visualize, model, and deploy this type of multi-dimensional data. Specifically, I engineer spatial features using PostGIS and spatial mapping, and combine targeted statistical techniques (e.g. Bayesian time series decomposition; spatial kriging), dimensionality reduction (e.g. PCA), and machine learning (e.g. XGBoost, artificial neural nets) to predict future crime. I show that this combination of machine learning, time series modeling, and geostatistics is more effective at predicting future crime than any of these techniques alone (capturing up to 95% of crime hot spots).
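To make the spatial and temporal framing concrete, here is a minimal sketch, not the pipeline used in the talk. It assumes incidents arrive as (latitude, longitude, date) tuples, bins them into a regular grid by ISO week, and scores a naive persistence baseline (last week's busiest cells predict next week's hot spots) by the share of incidents it captures. The cell size, the function names, and the sample coordinates are all hypothetical, and the capture metric is one plausible reading of "capturing crime hot spots" rather than the talk's definition.

```python
from collections import Counter
from datetime import date
import math

CELL_DEG = 0.01  # hypothetical grid cell size, in degrees


def cell_of(lat, lon, cell_deg=CELL_DEG):
    """Map a coordinate to an integer (row, col) grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def week_of(day):
    """Return the (ISO year, ISO week) pair for a date."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def weekly_counts(incidents):
    """Count incidents per (cell, week); absent pairs are implicit zeros."""
    counts = Counter()
    for lat, lon, day in incidents:
        counts[(cell_of(lat, lon), week_of(day))] += 1
    return counts


def top_cells(counts, week, k):
    """Return the k busiest cells in a week (ties broken by cell id)."""
    in_week = {cell: n for (cell, w), n in counts.items() if w == week}
    return sorted(in_week, key=lambda c: (-in_week[c], c))[:k]


def capture_rate(predicted_cells, counts, week):
    """Share of a week's incidents that fall inside the predicted cells."""
    total = sum(n for (_, w), n in counts.items() if w == week)
    if total == 0:
        return 0.0
    hits = sum(counts[(cell, week)] for cell in set(predicted_cells))
    return hits / total


# Made-up incidents, for illustration only.
incidents = [
    (45.523, -122.676, date(2017, 3, 6)),
    (45.527, -122.671, date(2017, 3, 7)),   # same cell as the first
    (45.541, -122.652, date(2017, 3, 8)),
    (45.523, -122.676, date(2017, 3, 14)),
    (45.561, -122.612, date(2017, 3, 15)),
]
counts = weekly_counts(incidents)
last_week = week_of(date(2017, 3, 6))
next_week = week_of(date(2017, 3, 13))
predicted = top_cells(counts, last_week, k=1)
print(capture_rate(predicted, counts, next_week))
```

Storing only the non-empty (cell, week) pairs keeps the sparse grid cheap to hold in memory. A learned model would replace `top_cells` with its own ranking, and the same `capture_rate` could then compare it against this baseline.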
I deploy this model using a public REST API, allowing real-time modeling of crime “hot spots” in the next week. I consider the challenges of deploying complex “ensembled” models, and discuss techniques to support scalability.
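As an illustration of one way such an endpoint might be kept cheap to serve, the sketch below answers requests from precomputed scores instead of running an ensemble per request. This is an assumption about a possible design, not a description of the deployed API: the `/hotspots` path, the `k` query parameter, the cell identifiers, and the risk values are all hypothetical. It uses only the Python standard library.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical precomputed scores: grid cell id -> predicted risk for next week.
# In a real deployment a batch job would refresh these from the model.
SCORES = {"4552_-12268": 0.81, "4554_-12266": 0.35, "4556_-12262": 0.62}


def hotspot_payload(scores, k):
    """Build the JSON-serializable response for the k highest-risk cells."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {"hotspots": [{"cell": c, "risk": r} for c, r in ranked[:k]]}


class HotspotHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/hotspots":
            self.send_error(404, "unknown endpoint")
            return
        try:
            # Default to the top 10 cells when k is not supplied.
            k = int(parse_qs(url.query).get("k", ["10"])[0])
        except ValueError:
            self.send_error(400, "k must be an integer")
            return
        body = json.dumps(hotspot_payload(SCORES, max(k, 0))).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Run the sketch server until interrupted."""
    HTTPServer((host, port), HotspotHandler).serve_forever()
```

Calling `serve()` and requesting `/hotspots?k=2` would return the two highest-scoring cells as JSON. Because the handler only sorts a small in-memory table, the cost of the ensemble is paid in the batch job rather than on each request.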
Finally, I discuss the features that are most predictive of future crime, including poverty, familial instability, and lack of commercial infrastructure. I also cover how we can use these types of models to understand where crime will occur next and what we can do to prevent it, as well as the dangers and ethical considerations of building and deploying them.
Jorie Koster-Hale is a lead scientist at Dataiku, with expertise in neuroscience, healthcare data, and machine learning. Prior to joining Dataiku, she completed her Ph.D. in Cognitive Neuroscience at the Massachusetts Institute of Technology and worked as a Postdoctoral Fellow at Harvard. Jorie currently resides in Paris, where she builds predictive models and eats pain au chocolat.
©2018, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org