Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Data at risk: Backing up the world's research data

Max Ogden (Independent)
4:20pm–5:00pm Thursday, March 16, 2017
Law, ethics, governance
Location: 210 A/E

What you'll learn

  • Explore Data Refuge, a nationwide volunteer effort led by librarians, scientists, and coders to discover and back up research data at risk of disappearing

Description

Recently, the first dedicated effort to make a mirror of the 2.5 million datasets/40 TB of data contained within Data.gov was completed, with the mirror being placed on the University of California infrastructure in partnership with the California Digital Library. In addition to Data.gov, which relies on federal departments to self-report their raw data, there are hundreds of federal FTP servers that contain data and thousands of federal websites that may contain links to data over HTTP. Very few of these FTP/HTTP resources have machine-readable metadata, and many require scraping or custom data export that crawlers and bots can’t do.

With studies showing that one in five HTTP links in papers published in scientific journals breaks within five years, the issue of disappearing science based on brittle web infrastructure is coming to a head with the new fear of data being deleted. Due to the fragmented nature of scientific data sharing, with scientific publishers and funders only recently moving toward mandatory open data, most science can’t be reproduced because the underlying data was never shared, and if it was, it may no longer be online.

Code for Science, developers of the Dat Project, a distributed filesystem for the web that distributes datasets securely over a peer-to-peer network, is working to track millions of datasets across the web, create cryptographic fingerprints for the files, and build a dataset mirroring network that will allow trusted mirrors of large datasets to be placed across the world, preventing a single point of failure if the original data source goes offline.
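As a rough illustration of what fingerprinting files and checking a mirror against those fingerprints could look like, here is a minimal Python sketch. It is not the Dat Project's implementation: SHA-256 from the standard library stands in for whatever hashing scheme Dat actually uses, and the function names (fingerprint, build_manifest, verify_mirror) are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat for very large datasets.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its fingerprint."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): fingerprint(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_mirror(manifest, mirror_root):
    """Return the relative paths that are missing or altered in a mirror."""
    mirror_root = Path(mirror_root)
    bad = []
    for rel, expected in manifest.items():
        candidate = mirror_root / rel
        if not candidate.is_file() or fingerprint(candidate) != expected:
            bad.append(rel)
    return bad
```

Under these assumptions, a manifest built once from the original source lets anyone holding a copy confirm that the copy is complete and unmodified, even after the original server has gone offline.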

Max Ogden offers an overview of Data Refuge, a nationwide volunteer effort led by librarians, scientists, and coders to discover and back up research data at risk of disappearing. Max is working with teams at Data.gov, NASA, the Internet Archive, and the California Digital Library to aggregate metadata for thousands of servers that hold scientific research data scattered across the web, especially data dealing with climate change. Max discusses his work to uncover hundreds of federal data servers containing petabytes of publicly funded research data and his plan to keep it online and useful to researchers in the future.


Max Ogden

Independent

Max Ogden is the director of Code for Science, a nonprofit behind the Dat Project, an open source distributed filesystem for the web that synchronizes large datasets over a peer-to-peer network. Max is a computer programmer working in civic media, open data, and open source as well as a former Code for America fellow, Node.js and JavaScript community organizer, and the author of hundreds of small open source modules. Max is passionate about teaching and enabling the sharing of information.
