Engineering the Future of Software
April 10–11, 2016: Training
April 11–13, 2016: Conference
New York, NY

How to have your causality and wall clocks too

Jonathan Moore (Comcast Cable)
3:50pm–4:40pm Tuesday, 04/12/2016
Distributed systems
Location: Beekman Parlor
Level: Intermediate

Prerequisite knowledge

Attendees should have a basic familiarity with NTP.

Description

Jon Moore describes evolving research into distributed monotonic clocks (DMCs), which can reflect causality like Lamport clocks while retaining a component that stays close to the wall-clock time that is meaningful to human operators, allowing application timestamps to come out in the right order even without perfect clock synchronization.

Jon very briefly covers the background of the problem and prior art, including Lamport clocks and the challenges NTP brings. The key issue is that if there is imperfect clock synchronization in a service-oriented architecture where application logs get generated and then collected for central analysis, the timestamps on log entries from different servers can come out in the wrong order, making production support difficult.
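The misordering described above can be reproduced in a few lines. This is only an illustrative sketch: the server names, the 50 ms skew, and the 10 ms network delay are all made-up values, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    server: str
    timestamp_ms: int  # local wall-clock reading when the entry was written
    message: str


# Hypothetical skew: server B's clock runs 50 ms behind server A's.
SKEW_A_MS = 0
SKEW_B_MS = -50


def local_time(true_time_ms: int, skew_ms: int) -> int:
    """Wall-clock reading of a server whose clock is off by skew_ms."""
    return true_time_ms + skew_ms


# A sends a request at true time 1000 ms; B receives it 10 ms later.
entries = [
    LogEntry("A", local_time(1000, SKEW_A_MS), "sent request"),
    LogEntry("B", local_time(1010, SKEW_B_MS), "received request"),
]

# Central analysis sorts by timestamp, which puts the effect before its cause.
merged = sorted(entries, key=lambda e: e.timestamp_ms)
for entry in merged:
    print(entry.timestamp_ms, entry.server, entry.message)
```

Because B's reading (960) is smaller than A's (1000), the merged log lists "received request" before "sent request", even though the send happened first.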

Jon explores the hybrid logical clock (HLC) proposed by Kulkarni, Demirbas, Madappa, Avva, and Leone and explains why this solution almost provides the best of both worlds by combining system clock time with Lamport-clock-like causality tracking. However, HLC has a glaring problem: a single server with an out-of-sync clock that is significantly ahead can drag all the logical clocks in the cluster into the “future” with no easy recovery mechanism.
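A minimal sketch of the HLC update rules as published by Kulkarni et al. may help make both the appeal and the failure mode concrete. The class and method names are illustrative, and the injectable `physical_clock` function is an assumption added here so the behavior can be shown deterministically.

```python
import time
from typing import Callable, Optional, Tuple

Timestamp = Tuple[int, int]  # (l, c): wall-clock-like component, tie-break counter


class HybridLogicalClock:
    def __init__(self, physical_clock: Optional[Callable[[], int]] = None):
        # Default to the system clock in milliseconds (assumed unit).
        self._now = physical_clock or (lambda: time.time_ns() // 1_000_000)
        self.l = 0
        self.c = 0

    def tick(self) -> Timestamp:
        """Timestamp a local event or an outgoing message."""
        prev = self.l
        self.l = max(prev, self._now())
        # If the physical clock did not advance l, bump the counter instead.
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def receive(self, remote: Timestamp) -> Timestamp:
        """Merge the timestamp carried by an incoming message."""
        rl, rc = remote
        prev = self.l
        self.l = max(prev, rl, self._now())
        if self.l == prev and self.l == rl:
            self.c = max(self.c, rc) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == rl:
            self.c = rc + 1
        else:
            self.c = 0
        return (self.l, self.c)


# One node's clock reads 100; another's is far ahead at 5000.
healthy = HybridLogicalClock(lambda: 100)
fast = HybridLogicalClock(lambda: 5000)

sent = fast.tick()                # timestamp attached to a message
received = healthy.receive(sent)  # healthy node adopts l = 5000
after = healthy.tick()            # and stays there on later local events
```

Timestamps compare as tuples, so `received > sent` preserves causality. The cost is visible in `after`: the healthy node's `l` remains at 5000 although its own clock reads 100, which is the "dragged into the future" problem.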

Jon describes two brand-new extensions to HLC. First, he demonstrates how to introduce an “epoch” component to allow for an operational “reset” of a system that has gotten stuck in the future. Second, he outlines a novel coordination protocol that allows systems to detect when their clocks are significantly ahead of the median system clock for the cluster and avoid dragging everyone else with them.
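One way an epoch component could work is sketched below. This is a guess at the general idea, not the design presented in the talk: the `(epoch, l, c)` tuple layout, the `reset` method, and the rule for handling stale or newer epochs on receipt are all assumptions made for illustration. The median-based coordination protocol is not sketched here.

```python
from typing import Callable, Tuple

Timestamp = Tuple[int, int, int]  # (epoch, l, c), compared lexicographically


class EpochHLC:
    """HLC-style clock with a hypothetical operator-controlled epoch."""

    def __init__(self, physical_clock: Callable[[], int]):
        self._now = physical_clock
        self.epoch = 0
        self.l = 0
        self.c = 0

    def reset(self, new_epoch: int) -> None:
        """Start a newer epoch and discard the inflated l and c."""
        if new_epoch > self.epoch:
            self.epoch, self.l, self.c = new_epoch, 0, 0

    def tick(self) -> Timestamp:
        prev = self.l
        self.l = max(prev, self._now())
        self.c = self.c + 1 if self.l == prev else 0
        return (self.epoch, self.l, self.c)

    def receive(self, remote: Timestamp) -> Timestamp:
        r_epoch, rl, rc = remote
        if r_epoch > self.epoch:
            self.reset(r_epoch)  # adopt the newer epoch before merging
        elif r_epoch < self.epoch:
            return self.tick()   # stale epoch: any local timestamp already exceeds it
        prev = self.l
        self.l = max(prev, rl, self._now())
        if self.l == prev and self.l == rl:
            self.c = max(self.c, rc) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == rl:
            self.c = rc + 1
        else:
            self.c = 0
        return (self.epoch, self.l, self.c)


node = EpochHLC(lambda: 100)
stuck = node.receive((0, 5000, 0))  # dragged into the future in epoch 0
node.reset(1)                       # operator-triggered reset
fresh = node.tick()                 # l tracks the local clock again

peer = EpochHLC(lambda: 105)
peer.receive(stuck)                 # peer is stuck in epoch 0 as well
recovered = peer.receive(fresh)     # learns about epoch 1 from a message
```

Because the epoch sorts first, `fresh` is still greater than `stuck`, so ordering is preserved across the reset while `l` returns to a value near the wall clock. In this sketch a reset only helps once the misbehaving clock has been corrected; otherwise it would inflate the new epoch too.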

Because this new DMC scheme runs as a “piggyback” protocol on top of existing application message exchanges, the coordination protocol has several unusual constraints: nodes do not have awareness of cluster topology or size, can only send messages to neighbors, and cannot decide when messages are sent or even to which neighbors they will be sent. The protocol also has to scale to very large clusters while requiring only a modest amount of per-node state and per-message space. Jon presents some preliminary, promising research results about DMC and discusses some of the open questions and future opportunities in this space.

Although Jon focuses mostly on the DMC coordination protocol, he also includes a brief introduction to the general class of population protocols that can be used to model ad hoc sensor networks and other IoT settings.
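For readers new to population protocols, a textbook example is leader election among anonymous agents that interact in pairs chosen by a scheduler they do not control. The simulation below is a generic illustration of that model, not part of the DMC protocol; the function name, the uniform random scheduler, and the seed are assumptions.

```python
import random
from typing import List, Tuple


def elect_leader(n: int, seed: int = 0) -> Tuple[List[bool], int]:
    """Simulate pairwise interactions until at most one leader remains.

    Every agent starts as a leader. When two leaders meet, the second
    one steps down; all other interactions change nothing.
    """
    rng = random.Random(seed)
    is_leader = [True] * n
    interactions = 0
    while sum(is_leader) > 1:
        # The scheduler, not the agents, decides who interacts.
        a, b = rng.sample(range(n), 2)
        interactions += 1
        if is_leader[a] and is_leader[b]:
            is_leader[b] = False
    return is_leader, interactions


state, steps = elect_leader(20)
print(sum(state), "leader after", steps, "interactions")
```

Each agent keeps a single bit of state and never learns the population size, which mirrors the constraints on per-node state and topology awareness described above.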


Jonathan Moore

Comcast Cable

Jon Moore is the chief software architect at Comcast Cable, where he focuses on delivering a core set of scalable, performant, robust software components for the company’s varied software product development groups. Jon specializes in the “art of the possible,” finding ways to coordinate working solutions for complex problems and deliver them on time. He is equally comfortable leading and managing teams and personally writing production-ready code. He has a passion for software engineering, continuous learning, and teaching colleagues new ways to deliver working, maintainable software with ever-higher quality and ever-shorter delivery times. His interests include distributed systems, fault tolerance, building healthy and engaging engineering cultures, and Texas Hold’em. Jon holds a PhD in computer and information science from the University of Pennsylvania. He resides in West Philadelphia, although he was neither born nor raised there and does not spend most of his days on playgrounds.