Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

rosettaHUB: A global hub for reproducible and collaborative data science

Karim Chine (RosettaHUB)
4:35pm–5:15pm Thursday, September 28, 2017
Emerging Technologies, Machine Learning & Data Science
Location: 1A 08/10
Level: Non-technical
Secondary topics: Cloud

Who is this presentation for?

  • Data scientists, data engineers, and educators

Prerequisite knowledge

  • A general knowledge of R, Python, or Scala

What you'll learn

  • Explore rosettaHUB, which aims to establish a global open data science metacloud centered on usability, reproducibility, auditability, and shareability


Karim Chine offers an overview of rosettaHUB—which aims to establish a global open data science metacloud centered on usability, reproducibility, auditability, and shareability—and shares the results of the rosettaHUB/AWS Educate initiative, which involved 30 higher education institutions and research labs and over 3,000 researchers, educators, and students.

RosettaHUB leverages public and private clouds and makes them easy to use for everyone. RosettaHUB’s federation platform allows any higher education institution or research laboratory to create a virtual organization within the hub. The institution’s members (researchers, educators, students) receive active AWS accounts consolidated under one paying account. These accounts are supervised in terms of budget and cloud resource usage, protected with safeguarding microservices, and centrally monitored and managed by the institution’s administrator. The cloud resources are generally paid for using the coupons provided by Amazon as part of the AWS Educate program. The members’ AWS accounts are placed under the control of a collaboration portal, which dramatically simplifies interaction with AWS and its collaborative use by communities of researchers, educators, and students. The portal offers similar capabilities for Google Compute Engine, Azure, and OpenStack-based and OpenNebula-based clouds.

RosettaHUB leverages Docker and allows users to work with containers seamlessly. Its simple web interface allows users to create containers, connect them to data storage, snapshot them, share snapshots with collaborators, and migrate them from one cloud to another. The rosettaHUB perspectives make it possible to use the containers to securely serve noVNC, RStudio, and Jupyter and to enable those tools for real-time collaboration. The rosettaHUB real-time collaborative containerized workbench is a universal IDE for data scientists, making it possible to interact in a stateful manner with hybrid kernels glued together in a single process and allowing those different environments to share their workspace and their variables in memory. The rosettaHUB kernels and objects model breaks down the silos between data science environments and makes it possible to use them simultaneously in an effective and flexible manner. A simplified reactive programming framework makes it possible to create reactive data science microservices and interactive web applications based on multilanguage macros and visual widgets. A scientific web-based spreadsheet makes it possible to interact with R, Python, and Scala capabilities from within cells. Spreadsheet cells can also contain code and code execution results, making it a flexible multilanguage notebook. Ubiquitous Docker containers, coupled with the rosettaHUB workbench’s checkpointing capability and the logging to embedded databases of all the interactions the users have with their environments, make everything created within rosettaHUB reproducible and auditable.

RosettaHUB’s APIs (700+ functions) cover the full spectrum of programmatic interaction between users and clouds, containers, and R/Python/Scala kernels. Clients for the APIs are available as an R package, a Python module, a Java library, an Excel add-in, and a Word add-in. RosettaHUB provides a CloudFormation-like service that makes it easy to create and manage templates, which are collections of related cloud resources, container images, R/Python/Scala scripts, macros, and visual widgets, along with optional cloud credentials. Those templates are cloud agnostic, and they make it possible for anyone to easily create and distribute complex data science applications and services. The user with whom a template is shared can trigger the on-the-fly reconstruction and wiring of all the artifacts and dependencies with one click. RosettaHUB templates constitute a powerful sharing mechanism for rosettaHUB’s e-science and e-learning environments, and rosettaHUB’s marketplace transforms those templates into products that can be shared or sold.
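
As a rough illustration of the template idea described above, the following Python sketch models a template as a provider-neutral list of artifacts with dependencies and derives an order in which they could be rebuilt. It is not rosettaHUB's API or template schema: the names Artifact, Template, build_order, and rebuild, the artifact kinds, and the provider strings are all hypothetical, and the "rebuild" only produces a log of steps rather than touching any cloud.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Artifact:
    """One item in a template (hypothetical model, not rosettaHUB's schema)."""
    name: str
    kind: str  # e.g. "cloud_resource", "container_image", "script", "widget"
    depends_on: tuple[str, ...] = ()


@dataclass
class Template:
    """A cloud-agnostic bundle of artifacts; no provider is named here."""
    name: str
    artifacts: list[Artifact] = field(default_factory=list)

    def build_order(self) -> list[str]:
        """Return artifact names ordered so that dependencies come first."""
        graph = {a.name: set(a.depends_on) for a in self.artifacts}
        return list(TopologicalSorter(graph).static_order())


def rebuild(template: Template, provider: str) -> list[str]:
    """Simulate a one-click rebuild on a chosen provider; returns a step log."""
    kinds = {a.name: a.kind for a in template.artifacts}
    return [
        f"{provider}: create {kinds[name]} '{name}'"
        for name in template.build_order()
    ]


# Hypothetical template: a script that needs an image, which needs a VM.
demo = Template(
    name="demo-analysis",
    artifacts=[
        Artifact("notebook-script", "script", depends_on=("r-python-image",)),
        Artifact("r-python-image", "container_image", depends_on=("vm",)),
        Artifact("vm", "cloud_resource"),
    ],
)

for step in rebuild(demo, provider="aws"):
    print(step)
```

Because the template itself never names a provider, the same demo object can be passed to rebuild with a different provider string. That is the sense in which this sketch is cloud agnostic; a real service would also have to map each artifact kind to provider-specific resources.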

Karim Chine


Karim Chine is a London-based software architect and entrepreneur and the author and designer of RosettaHUB. Previously, he held positions within academic research laboratories and industrial R&D departments, including Imperial College London, EBI, IBM, and Schlumberger. Karim’s interests include large-scale distributed software design, cloud computing applications in research and education, open source software ecosystems, and open science. Since 2009, he has collaborated with the European Commission as an independent expert for the Research E-infrastructure Program and for the Future and Emerging Technologies Program. He has also served as an evaluator and a reviewer of many of the EU’s flagship projects related to grids, desktop grids, scientific clouds, and science gateways. Karim holds degrees from Ecole Polytechnique and Telecom ParisTech.