Earth’s climate is changing at a rate unprecedented in human history. This change brings profound challenges for human society, including rising seas, more severe droughts and floods, and more intense hurricanes. To understand and respond to these challenges, the climate science community is deploying an ever-growing array of satellites, autonomous sensor systems, and computer simulations, resulting in petabytes of new data generated every year. This volume of data is quickly overwhelming our community’s capacity for storage, analysis, and visualization. Paradoxically, rather than accelerating climate science, big data is slowly grinding it to a halt. Our inability to deal with this explosive growth in climate datasets has become a major technical obstacle, holding back scientific progress just when we need it most.
Climate scientists employ a wide range of data science techniques, from simple descriptive statistics to sophisticated spatiotemporal analysis to neural network-based learning. Interactivity, the ability to quickly iterate and refine a particular analysis pipeline, is highly valued. Like most scientific fields, data analysis in climate science has traditionally followed a download model: datasets stored on FTP servers are downloaded and analyzed on a user’s personal computer. This works fine for MB-scale datasets, but it becomes cumbersome for GB-scale datasets, expensive and difficult for TB-scale datasets, and impossible for PB-scale datasets.
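A back-of-envelope calculation makes the breakdown of the download model concrete. The transfer rate below is an illustrative assumption (a fast 1 Gb/s research connection, roughly 125 MB/s sustained), not a measurement:

```python
# Rough transfer times for the dataset scales mentioned above,
# assuming an idealized 1 Gb/s (~125 MB/s) sustained connection.
rate_mb_per_s = 125

for label, size_mb in [("1 GB", 1e3), ("1 TB", 1e6), ("1 PB", 1e9)]:
    seconds = size_mb / rate_mb_per_s
    print(f"{label}: {seconds:,.0f} s (~{seconds / 86400:.1f} days)")
```

Even under these generous assumptions, a petabyte takes on the order of three months to download, before any analysis begins — which is why moving the computation to the data, rather than the data to the computation, is central to the approach described below.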
Part of the difficulty is that existing big data tools (e.g., Spark and Hadoop) were designed around tabular data and are poorly suited to the multidimensional numerical arrays found in climate science. A central goal of the Pangeo project is to meet this challenge by developing data and software infrastructure that enables interactive-speed analysis of the largest climate datasets, built by integrating existing open source scientific Python technologies within a cloud environment. These include xarray, a Python package for working with labeled, multidimensional array data, as commonly found in climate science; Dask, a parallel computing library for Python that helps xarray represent huge datasets and distribute computations across clusters; JupyterHub and JupyterLab, computing environments that enable users to interact with cloud-based resources; and Kubernetes, a versatile, cloud-agnostic scheduler for running interactive and batch workloads.
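To give a flavor of how xarray and Dask fit together, here is a minimal sketch of the kind of analysis the talk describes. The dataset is synthetic (random values standing in for a sea surface height field), since the real satellite data lives in the cloud; the variable names and sizes are illustrative assumptions:

```python
import numpy as np
import pandas as pd
import xarray as xr

# A small synthetic stand-in for a gridded satellite product:
# monthly "sea surface height" on a 1-degree global grid.
ssh = xr.DataArray(
    np.random.rand(120, 180, 360).astype("float32"),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2008-01-01", periods=120, freq="MS"),
        "lat": np.linspace(-89.5, 89.5, 180),
        "lon": np.linspace(0.5, 359.5, 360),
    },
    name="ssh",
)

# Chunking converts the in-memory array into a Dask-backed one; on a
# Pangeo deployment these chunks become tasks spread across a cluster.
ssh = ssh.chunk({"time": 12})

# Lazily build a global-mean time series, weighting each grid cell by
# cos(latitude) to account for cell area, then trigger the computation.
weights = np.cos(np.deg2rad(ssh.lat))
global_mean = ssh.weighted(weights).mean(dim=("lat", "lon")).compute()
```

Nothing is computed until `.compute()` is called; the same code runs unchanged on a laptop or, pointed at a chunked cloud-hosted dataset via `xr.open_dataset(..., chunks=...)`, on a Kubernetes-managed Dask cluster.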
Ryan Abernathey and Yuvi Panda offer an overview of these tools and describe how they work together. They then conduct a live demo using a Pangeo environment running on Google Cloud Platform to analyze global patterns of sea-level rise based on satellite observations of the ocean. Ryan and Yuvi conclude by outlining remaining challenges regarding how climate data is stored and accessed on the cloud.
Acknowledgements: The Pangeo project recently received support from the National Science Foundation and Google to develop this platform in both traditional high-performance computing environments and on Google Cloud Platform. This award supports scientists and developers from Lamont Doherty Earth Observatory of Columbia University, the National Center for Atmospheric Research, and Anaconda Inc. It has also benefited from volunteer contributions from institutions such as UC Berkeley, UK Met Office, US Geological Survey, and the HDF Group.
Ryan Abernathey is an assistant professor of Earth and environmental science at Columbia University and Lamont Doherty Earth Observatory. Ryan is a physical oceanographer who studies the large-scale ocean circulation and its relationship with Earth’s climate. High-resolution numerical modeling and satellite remote sensing are key tools in this research, which has led to an interest in high-performance computing and big data. Previously, he held a postdoc at Scripps Institution of Oceanography. In 2016, Ryan was awarded an Alfred P. Sloan Research Fellowship in ocean sciences and an NSF CAREER award for a project entitled “Evolution of Mesoscale Turbulence in a Changing Climate” and received a NASA New Investigator Award in 2013. He is an active participant in and advocate for open source software, open data, and reproducible science. He holds a PhD from MIT and a BA from Middlebury College.
Yuvi Panda is infrastructure lead for the Data Science Education Program at UC Berkeley, where he works on scaling JupyterHub for use by thousands of students. A programmer and DevOps engineer, he wants to make it easy for people who don’t traditionally consider themselves programmers to do things with code and builds tools (Quarry, PAWS, etc.) to sidestep the list of historical accidents that constitute the “command-line tax” that people have to pay before doing productive things with computing. He’s a core member of the JupyterHub team and works on mybinder.org as well. Yuvi is also a Wikimedian, since you can check out of Wikimedia, but you can never leave.
©2018, O'Reilly Media, Inc.