Recently, the first dedicated effort to make a mirror of the 2.5 million datasets/40 TB of data contained within Data.gov was completed, with the mirror being placed on the University of California infrastructure in partnership with the California Digital Library. In addition to Data.gov, which relies on federal departments to self-report their raw data, there are hundreds of federal FTP servers that contain data and thousands of federal websites that may contain links to data over HTTP. Very few of these FTP/HTTP resources have machine-readable metadata, and many require scraping or custom data export that crawlers and bots can’t do.
With studies showing that one in five HTTP links in published scientific papers breaks within five years, the problem of science disappearing because of brittle web infrastructure is coming to a head, compounded by the new fear of data being deleted. Because scientific data sharing is fragmented, and scientific publishers and funders have only recently moved toward mandatory open data, most science can’t be reproduced: the underlying data was never shared, or if it was, it may no longer be online.
Code for Science develops the Dat Project, a distributed filesystem for the web that distributes datasets securely over a peer-to-peer network. The group is working to track millions of datasets across the web, create cryptographic fingerprints for the files, and build a dataset mirroring network. The network will allow trusted mirrors of large datasets to be placed across the world, preventing a single point of failure if the original data source goes offline.
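The idea of fingerprinting files so that a mirror can be trusted can be illustrated with a minimal sketch. This is not the Dat Project's actual mechanism; it assumes plain SHA-256 digests from Python's standard library, and the names `fingerprint`, `build_manifest`, and `verify_mirror` are hypothetical helpers chosen for illustration.

```python
import hashlib
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large datasets need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its fingerprint."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): fingerprint(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_mirror(manifest, mirror_root):
    """Return the relative paths that are missing or altered in a mirror."""
    mirror_root = Path(mirror_root)
    problems = []
    for rel, expected in manifest.items():
        candidate = mirror_root / rel
        if not candidate.is_file() or fingerprint(candidate) != expected:
            problems.append(rel)
    return problems
```

Under these assumptions, the original host would publish the manifest once, and anyone holding a mirror could call `verify_mirror(manifest, "path/to/mirror")` to confirm the copy matches the source byte for byte; an empty list means every file checked out.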
Max Ogden offers an overview of Data Refuge, a nationwide volunteer effort led by librarians, scientists, and coders to discover and back up research data at risk of disappearing. Max is working with teams at Data.gov, NASA, the Internet Archive, and the California Digital Library to aggregate metadata for thousands of servers that hold scientific research data scattered across the web, especially data dealing with climate change. Max discusses his work to uncover hundreds of federal data servers containing petabytes of publicly funded research data and his plan to keep it online and useful to researchers in the future.