eLife’s mission is to help scientists accelerate discovery by operating a platform for research communication that encourages and recognizes the most responsible behaviors in science, such as promoting collaboration over competition, giving constructive feedback, and sharing resources. The technology and data science team at eLife applies the same principles to its software, building everything in the open and releasing it under permissive open source licenses so that others can use it. In 2017, eLife started to invest more in data science experiments and has since developed two projects, ScienceBeam and PeerScout, along with a few smaller projects looking at the way scholarly articles are cited.
Daniel Ecer and Paul Shannon detail eLife’s journey in using NLP, computer vision, and similarity algorithms to find more diverse peer reviewers, apply semantics to archive content, automate the submission process, and gain insight into the sentiment of scholarly content. Daniel and Paul talk about the successes and failures of the projects, the technical approach and models used, and the reasons why these projects are important for making science more open, collaborative, and constructive. They also detail how you can get involved with these open source projects to help solve some interesting and beneficial problems.
ScienceBeam uses computer vision and NLP with Apache Beam and TensorFlow to liberate the vast trove of science currently locked inside the PDF format. From preprints to peer-reviewed literature and historical research, millions of scientific manuscripts today can only be found in a print-era format that is effectively inaccessible to the web of interconnected online services and APIs that are increasingly becoming the digital scaffold of today’s research infrastructure. Services that extract key data from PDFs are also used in eLife’s journal submission systems. Such solutions let authors submit raw PDFs of their work and cut down on the considerable manual labor required to enter the relevant manuscript metadata at submission, turning the whole process from a form-filling exercise into a form-checking one.
PeerScout helps editors find suitable peer reviewers, which is an important part of the editorial process. Relying on editors alone to find potential reviewers can introduce unwanted bias and draw on a much smaller pool of candidates. Editors will often choose people they have used before, or people known to them in their geographic region, which limits the diversity of reviewers. Part of eLife’s mission is to make science more collaborative, which also means promoting early career researchers and those from a variety of countries. PeerScout uses machine learning, including NLP, to find relevant reviewers while purposefully creating a positive bias to surface early career researchers, yet still leaves the final decision to the editors.
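As a rough illustration of the idea of ranking reviewers by relevance with a deliberate positive bias, here is a minimal Python sketch. It is not PeerScout's actual model: it swaps in a plain bag-of-words cosine similarity for the machine learning and NLP the project uses, and the `Reviewer` fields, the `rank_reviewers` function, and the `early_career_boost` value are all hypothetical names and numbers chosen for the example.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Reviewer:
    name: str
    expertise: str       # free-text description of past work (hypothetical field)
    early_career: bool   # hypothetical flag; a real system would derive this from data


def tokenize(text):
    """Lowercase the text and keep purely alphabetic tokens."""
    return [word for word in text.lower().split() if word.isalpha()]


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_reviewers(manuscript, reviewers, early_career_boost=0.1):
    """Return (name, score) pairs, most relevant first.

    The boost is an assumed, illustrative value. It is only applied to
    reviewers who are already relevant (similarity > 0), so it reorders
    suitable candidates rather than surfacing unrelated ones.
    """
    manuscript_vec = Counter(tokenize(manuscript))
    scored = []
    for reviewer in reviewers:
        score = cosine(manuscript_vec, Counter(tokenize(reviewer.expertise)))
        if reviewer.early_career and score > 0:
            score += early_career_boost
        scored.append((reviewer.name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The output is only a ranked suggestion list; in keeping with the approach described above, the final choice would stay with the editor.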
Citations are an important link in academic literature. They are often used as a measure of the impact an article has. But not all citations are equal. They vary not only in context (where in a manuscript the citation appears) but also in sentiment. However, an off-the-shelf sentiment model trained on Twitter does not perform with the same accuracy on the more subtle language used by academics. A number of projects have explored the context and sentiment of citations using NLP and general machine learning techniques.
In all three project areas, the team has struggled to obtain training data that is reliable and consistent, or datasets that are large enough. They have forged partnerships with other publishers and data science teams to gather more data and insight, making the results open so others can benefit too. The benefits of ScienceBeam and citation analysis are fairly obvious, but with PeerScout the team also faced a social problem: convincing editors that its recommendations are worthwhile.
Daniel Ecer is an ASI Fellow and a data scientist at eLife Sciences Publications. He has spent most of his career designing software to solve specific problems. Now he’s enjoying feeding the computer data and allowing the computer to make its own decisions. What could possibly go wrong?
Paul Shannon is head of technology at eLife, a unique collaboration between the funders and practitioners of research to improve the way it is selected, presented, and shared. He’s responsible for the technology strategy at eLife, ensuring the team is committed to openness in the products it produces to encourage broad change across the research communication landscape. Previously, he was vice president of technology at innovative digital music platform 7digital, where he grew the team and scaled the API platform to support the vastly changing music technology industry. Paul is also a regular speaker at international conferences.
©2018, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.