This session uses the speaker’s experience in building a crime forecasting package to outline some tools and techniques useful in modeling space-time event data. While the case study focuses on modeling crime, the techniques and tools presented are applicable to a broad selection of domains. In particular, attendees will leave the session with:
While many data scientists work with data that includes geographic information, this data is often used in rather rudimentary ways or limited to vector data sets such as the point locations of stores or users. The session will introduce the strengths and weaknesses of raster-based geographic analysis and discuss some challenges faced when modeling data at a fine geographic and temporal resolution. For example, how can uncertainty around the time of occurrence for events be represented? Finally, the approach of modeling space-time events as stochastic point processes will be outlined.
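As one illustration of the time-uncertainty question, the sketch below uses aoristic weighting, a technique from the crime analysis literature in which an event known only to fall within a time window is spread uniformly across the time bins that window overlaps. The abstract does not say which method the session covers, so this is an assumed example rather than the speaker's approach. The function names, the `(x, y, start_hour, end_hour)` event layout, and the square-cell grid are all hypothetical choices made for the illustration.

```python
from collections import defaultdict
import math


def aoristic_weights(start_hour, end_hour, bin_hours=1.0):
    """Spread one event uniformly over the time bins its window overlaps.

    start_hour and end_hour bound when the event could have occurred,
    in hours since an arbitrary origin. Returns {bin_index: weight};
    the weights sum to 1.
    """
    if end_hour < start_hour:
        raise ValueError("end_hour must be >= start_hour")
    if end_hour == start_hour:
        # Exact time known: all weight goes to a single bin.
        return {int(start_hour // bin_hours): 1.0}
    span = end_hour - start_hour
    first = int(start_hour // bin_hours)
    last = int(math.ceil(end_hour / bin_hours)) - 1
    weights = {}
    for b in range(first, last + 1):
        lo = max(start_hour, b * bin_hours)
        hi = min(end_hour, (b + 1) * bin_hours)
        weights[b] = (hi - lo) / span
    return weights


def rasterize(events, cell_size):
    """Accumulate weighted events into (row, col, time_bin) raster cells.

    events is an iterable of (x, y, start_hour, end_hour) tuples
    (a hypothetical layout). Cells are squares of side cell_size.
    """
    grid = defaultdict(float)
    for x, y, start, end in events:
        col = int(x // cell_size)
        row = int(y // cell_size)
        for t_bin, w in aoristic_weights(start, end).items():
            grid[(row, col, t_bin)] += w
    return dict(grid)
```

For instance, `aoristic_weights(22, 26)` assigns a weight of 0.25 to each of the four hourly bins 22 through 25, so a burglary discovered after a four-hour absence contributes a quarter of an event to each hour. The total event count is preserved across the raster, which keeps the weighted cells usable as input to a count model.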
The case study leverages the open source GeoTrellis framework to conduct geographic processing. GeoTrellis is currently an incubating project within the Eclipse Foundation’s LocationTech working group. The project provides fast and scalable geographic processing with an emphasis on raster-based analysis and routing through transportation networks. Already written in Scala, GeoTrellis is currently being extended to integrate with Apache Spark.
The modeling pipeline within the case study consists of several loosely coupled components. In addition to GeoTrellis, the project uses R for machine learning and the Amazon Simple Workflow service for pipeline orchestration. The presentation will outline the basic structure of the modeling process including details of the statistical techniques utilized within the process.
Several statistical techniques were examined during the development of the project. The final approach was a stacked model in which a gradient boosting machine (GBM) models the presence of events and a generalized additive model (GAM) transforms these predictions into expected counts. The session will conclude by outlining some approaches to evaluating predictive accuracy for these types of data sets.
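To make the evaluation question concrete, the sketch below computes the Predictive Accuracy Index (PAI), a metric used in the crime forecasting literature: the share of observed events that fall in the cells flagged as highest risk, divided by the share of the study area those cells cover. The abstract does not name the metrics the session presents, so PAI is an assumed example. The function name, the flat per-cell lists, and the equal-area-cell assumption are hypothetical choices for illustration.

```python
def predictive_accuracy_index(predicted, observed, top_fraction=0.05):
    """PAI = (share of events in flagged cells) / (share of area flagged).

    predicted: per-cell risk scores or expected counts.
    observed:  per-cell event counts for the evaluation period.
    Assumes every raster cell has the same area, so the area share is
    simply the fraction of cells flagged.
    """
    if not predicted or len(predicted) != len(observed):
        raise ValueError("predicted and observed must be equal-length, non-empty")
    total_events = sum(observed)
    if total_events == 0:
        raise ValueError("PAI is undefined when no events were observed")

    n_cells = len(predicted)
    n_flagged = max(1, round(n_cells * top_fraction))

    # Flag the cells with the highest predicted risk.
    ranked = sorted(range(n_cells), key=lambda i: predicted[i], reverse=True)
    flagged = ranked[:n_flagged]

    hit_rate = sum(observed[i] for i in flagged) / total_events
    area_fraction = n_flagged / n_cells
    return hit_rate / area_fraction
```

A PAI of 1 means the flagged cells capture events in proportion to their area, which is what random selection would achieve on average; larger values indicate that the forecast concentrates events into a small area. The value depends on `top_fraction`, so comparisons between models should use the same flagged share.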
I’m the Senior Data Scientist at Azavea, a geospatial software firm located in Philadelphia. My primary focus is working with crime data to model patterns and forecast risk — the intersection of geography, data science, and social good.
Keywords: geographic data, raster processing, predictive analysis, space-time event modeling, weather, demographics, machine learning, early warning systems, R, Scala, Python