Machine learning is new to many US government agencies, which need to transparently document each step of a model, from data preparation to final prediction. One US defense agency used the Jupyter Notebook to document its steps and show results in the model-building process for a series of recurrent neural network (RNN) models. The project was so successful that the team has recommended the Jupyter Notebook as a key component of model documentation for all government scientists.
Catherine Ordun walks you through a notebook built to test the feasibility of developing multivariate time series models to predict weekly pertussis case counts collected over a 10-year period. The models were built in Keras with a TensorFlow backend and developed in Jupyter to transparently show training and testing progress as part of a US defense agency technical approach. The notebook chronicles the team’s data science workflow, from data acquisition and preprocessing to neural network building to evaluation and final model selection.
The project used EpiArchive, a publicly available source of weekly time series data from Los Alamos National Laboratory. The team used the Python requests library to query the EpiArchive API and convert the data for a dozen different infectious diseases into a pandas DataFrame. They also added weekly NOAA temperature and precipitation time series as additional features, then converted and normalized the data.
To establish a baseline before building neural networks, the team fit a basic ARIMA time series model to predict weekly pertussis cases, which achieved a mean absolute error of 6.633. They then built initial LSTM and GRU models, visualizing the training and validation loss in matplotlib and using a Keras callback to visualize training in TensorBoard (outside of the Jupyter Notebook). As the team experimented with the hyperparameters and layers of the LSTM and GRU by adding dropout and changing the activation functions, optimizers, and learning rate, they arrived at a set of final models in the notebook.
After several more iterations of hyperparameter tuning, they selected a nonstateful LSTM as the final model, consisting of one input layer, one layer with 10 units, and two fully connected layers. This model applied a 20% dropout layer, used tanh (hyperbolic tangent) activation, and was run for 100 epochs with a batch size of 20. The final model achieved a mean absolute error of 0.0896.
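The preprocessing, evaluation, and model structure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's notebook code: the helper names (`min_max_scale`, `make_windows`, `mean_absolute_error`, `build_lstm`), the choice of min-max scaling, the lookback window, the width and activation of the first dense layer, the placement of the dropout layer, and the Adam optimizer are all hypothetical, since the abstract does not specify them. Note also that mean absolute error depends on the scale of the target, and the abstract does not say whether the two reported figures were computed on the same scale.

```python
from typing import List, Sequence, Tuple


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale one series to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def make_windows(
    features: Sequence[Sequence[float]],
    target: Sequence[float],
    lookback: int,
) -> Tuple[List[List[List[float]]], List[float]]:
    """Turn aligned weekly rows into supervised (samples, targets) pairs.

    Each sample holds `lookback` consecutive weeks of feature rows, and its
    target is the value for the week immediately after that window.
    """
    if len(features) != len(target):
        raise ValueError("features and target must have the same length")
    samples, targets = [], []
    for end in range(lookback, len(features)):
        samples.append([list(row) for row in features[end - lookback:end]])
        targets.append(target[end])
    return samples, targets


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def build_lstm(lookback: int, n_features: int):
    """Hypothetical Keras model loosely following the abstract's description."""
    # Imported here so the helpers above work without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(lookback, n_features)),
        keras.layers.LSTM(10, activation="tanh"),  # 10 units, tanh activation
        keras.layers.Dropout(0.2),                 # 20% dropout; placement assumed
        keras.layers.Dense(8, activation="relu"),  # width and activation assumed
        keras.layers.Dense(1),                     # one predicted value per sample
    ])
    model.compile(optimizer="adam", loss="mae")    # optimizer is an assumption
    return model
```

With windows produced by `make_windows` converted to an array of shape `(samples, lookback, n_features)`, training with the settings reported in the abstract would be a call along the lines of `model.fit(x, y, epochs=100, batch_size=20)`. The default Keras LSTM layer is nonstateful, which matches the final model described above.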
Catherine Ordun is a Washington, DC-based senior data scientist at Booz Allen Hamilton. Catherine’s background is in biology, public health, and business. A self-taught Python programmer, she has led data science work across the US government, including intelligence and public health agencies and the DoD. She serves on the Women in Data Science Committee at Booz Allen, has presented to the National Academy of Medicine, and led her team to the top three in a Health and Human Services opioid codeathon. Catherine is a two-time recipient of the Women of Color (WoC) award and is currently a program reviewer for SciPy 2018. She is passionate about machine learning, has recently started participating in Kaggle challenges, and has started an internal firm-wide machine intelligence meetup.