Jim Dowling describes how HopsWorks lets organizations securely share a single Hadoop cluster using projects and a new metadata layer that supports protection domains while still allowing data sharing.
HopsWorks is a frontend to Hadoop that provides a new, project-based model for multitenancy. A project is like a GitHub project in that its owner manages membership. Each member holds one of two roles: data scientists can only run programs, while data owners can also curate, import, and export data. Users can’t copy data between projects or run programs that process data from different projects, even if they are members of multiple projects. That is, we implement multitenancy with dynamic roles, where the user’s role is determined by the currently active project. (Users can still share datasets between projects, however.)
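The project model described above can be illustrated with a minimal Python sketch. It is not HopsWorks code: the names `Role`, `Project`, `Session`, and `share_dataset` are hypothetical, and the sketch assumes that the project owner is also a data owner and that only data owners may share a dataset with another project. Neither assumption is stated in the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    DATA_SCIENTIST = "data scientist"  # may only run programs
    DATA_OWNER = "data owner"          # may also curate, import, and export data


@dataclass
class Project:
    name: str
    owner: str
    members: dict = field(default_factory=dict)  # user name -> Role
    datasets: set = field(default_factory=set)   # datasets created in this project
    shared_in: set = field(default_factory=set)  # datasets shared from other projects

    def __post_init__(self):
        # Assumption: the project owner always holds the data owner role.
        self.members.setdefault(self.owner, Role.DATA_OWNER)

    def add_member(self, actor, user, role):
        # Only the project owner manages membership.
        if actor != self.owner:
            raise PermissionError("only the project owner manages membership")
        self.members[user] = role


@dataclass
class Session:
    user: str
    active: Project  # the currently active project

    @property
    def role(self):
        # Dynamic role: looked up in the active project only.
        return self.active.members.get(self.user)

    def can_run(self, inputs):
        # A program may read only data visible in the active project,
        # regardless of the user's membership in other projects.
        visible = self.active.datasets | self.active.shared_in
        return self.role is not None and set(inputs) <= visible

    def can_export(self):
        return self.role is Role.DATA_OWNER


def share_dataset(session, dataset, target):
    # Assumption: sharing is restricted to data owners of the source project.
    if session.role is not Role.DATA_OWNER:
        raise PermissionError("only a data owner may share a dataset")
    if dataset not in session.active.datasets:
        raise PermissionError("dataset does not belong to the active project")
    target.shared_in.add(dataset)
```

In this sketch a user who belongs to two projects still cannot run a program over both projects' data in one session, because `can_run` consults only the active project. Once a data owner calls `share_dataset`, the shared dataset becomes visible in the target project without being copied.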
HopsWorks also supports Apache Zeppelin with access control, free-text search for files in HDFS using Elasticsearch, and a metadata designer tool that enables users to curate files and directories in HDFS. HopsWorks is made possible by migrating all HDFS and YARN metadata into NDB, an open source, shared-nothing, in-memory distributed database. HopsWorks is open source and licensed under Apache v2, with database connectors licensed under GPL v2. From late January 2016, HopsWorks will be provided as software as a service for researchers and companies in Sweden by the Swedish ICT SICS Data Center.
Jim Dowling is an Associate Professor at the School of Information and Communications Technology in the Department of Software and Computer Systems at KTH Royal Institute of Technology, a Senior Researcher at SICS RISE, and CEO of Logical Clocks AB. He received his Ph.D. in Distributed Systems from Trinity College Dublin (2005) and worked at MySQL AB (2005-2007). His research interests are large-scale distributed systems and machine learning. He is the lead architect of Hops Hadoop (www.hops.io), the world’s most scalable Hadoop distribution. He teaches the first and largest course in Sweden on Deep Learning, ID2223, and he is a regular speaker at Big Data industry conferences.
©2016, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.