Skip to main content

Apache Mesos as an SDK for Building Distributed Frameworks

Paco Nathan (derwen.ai)
Hadoop and Beyond
GA Ballroom J
Average rating: ****.
(4.00, 4 ratings)
Slides:   1-PDF 

Apache Mesos is an open source cluster manager that provides efficient resource isolation for distributed frameworks — similar to Google’s “Borg” and “Omega” projects for warehouse scale computing. It is based on isolation features in the modern kernel: “cgroups” in Linux, “zones” in Solaris.

Google’s “Omega” research paper shows that while 80% of the jobs on a given cluster may be batch (e.g., MapReduce), 55-60% of cluster resources go toward services. The batch jobs on a cluster are the easy part — services are much more complex to schedule efficiently. However by mixing workloads, the overall problem of scheduling resources can be greatly improved.

Given the use of Mesos as the kernel for a “data center OS”, two additional open source components Chronos (like Unix “cron”) and Marathon (like Unix “init.d”) serve as the building blocks for creating distributed, fault-tolerant, highly-available apps at scale.

This talk will examine case studies of Mesos uses in production at scale: ranging from Twitter (100% on prem) to Airbnb (100% cloud), plus MediaCrossing, Categorize, HubSpot, etc. How have these organizations leveraged Mesos to build better, more scalable and efficient distributed apps? Lessons from the Mesos developer community show that one can port an existing framework with a wrapper in approximately 100 line of code. Moreover, an important lesson from Spark is that based on “data center OS” building blocks one can rewrite a distributed system much like Hadoop to be 100x faster within a relatively small amount of source code.

These case studies illustrate the obvious benefits over prior approaches based on virtualization: scalability, elasticity, fault-tolerance, high availability, improved utilization rates, etc. Less obvious outcomes also include: reduced time for engineers to ramp-up new services at scale; reduced latency between batch and services, enabling new high-ROI use cases; and enabling dev/test apps to run on a production cluster without disrupting operations.

Photo of Paco Nathan

Paco Nathan

Evil Mad Scientist, derwen.ai

Paco Nathan is a “player/coach” who has led innovative Data teams building large-scale apps for the past decade. He is a recognized expert in Hadoop, R, cloud computing, distributed systems, machine learning, predictive analytics, and NLP. He received his BS Math Sciences and MS Computer Science degrees from Stanford, and has 25+ years experience in the tech industry ranging from Bell Labs to early-stage start-ups.

Paco is an evangelist for the Mesos and Cascading open source projects, and is also an O’Reilly author for “Enterprise Data Workflows with Cascading”.