A Look at The Network: Searching for Truth in Distributed Applications

Location: D139-140 Level: Intermediate
Average rating: 4.11 (9 ratings)

As patterns in webops infrastructure evolve into increasingly interdependent networks of distributed applications, the process of illuminating and responding to failures and abnormalities often treads an uncomfortable border between instrumented decision-making and stumbling in the dark. Even with thorough application-level instrumentation, several classes of problems and misconfigurations can evade diagnosis and monitoring until it’s too late.

The network itself is a phenomenal source of truth in distributed environments. As the common communications medium for all apps across a cluster, the network is the glue holding applications together. Though an app may appear healthy enough to fool monitoring – especially in the case of partial failures – changes in communications patterns immediately signal unusual behavior, making failures stick out like a sore thumb. While failures of network devices are not terribly common, failures in the distributed applications which communicate over them are. As such, a peek at the network itself offers one of the most powerful and effective techniques for evaluating the health and behavior of distributed applications from second to second.

What’s more, a rendered response is only as good as its transit to the client. By instrumenting applications at the network level, one can gain an uncommon level of insight into the performance of mobile apps as data passes over unreliable channels to fleets of embedded devices. The ability to measure and make assertions about traffic patterns to clients based on their network endpoints – whether served by specific wireline providers or lossy mobile networks – enables operations teams to quickly identify the root cause of poor user experience, determine which segments of customers are affected, and tie these stats directly back to business metrics and value.

This session offers a deep dive into how application-level problems manifest at the network level. Cases range from basic network partitions and node outages to subtler application-level events: garbage collections on managed runtimes, classes of bugs which evade conventional monitoring but constitute partial failures, and changes in network activity driven by database partitioning, load balancing, and sharding. These and other warning signs crop up at layer three long before they wreak havoc at layer seven as customer-visible failures. Combining application-level metrics with network analytics is a powerful cocktail for identifying hot spots quickly, and connecting the dots out to the client closes the loop.
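
As a rough illustration of how an application-level event can surface in traffic alone, the sketch below flags seconds in which a flow's byte count departs sharply from its recent baseline. It is a minimal example under stated assumptions, not the method presented in the session: the function name `flag_anomalies`, the window size, the thresholds, and the sample data are all hypothetical.

```python
from collections import deque
from statistics import median


def flag_anomalies(bytes_per_second, window=5, low=0.2, high=5.0):
    """Return indices of seconds whose byte count departs sharply from
    the median of the preceding `window` normal seconds.

    `window`, `low`, and `high` are illustrative defaults, not tuned values.
    """
    history = deque(maxlen=window)
    flagged = []
    for i, count in enumerate(bytes_per_second):
        if len(history) == window:
            baseline = median(history)
            if baseline > 0 and (count < low * baseline or count > high * baseline):
                flagged.append(i)
                continue  # keep anomalous samples out of the baseline
        history.append(count)
    return flagged


# Hypothetical per-second byte counts for one app-to-app flow; the two
# near-silent seconds stand in for something like a stop-the-world GC pause.
samples = [1000, 1040, 980, 1010, 995, 1005, 12, 0, 990, 1020]
print(flag_anomalies(samples))  # prints [6, 7]
```

A median baseline is used here so that a single odd second does not drag the reference level with it. A real deployment would likely compare many flows at once and correlate the flagged seconds with application-level metrics.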

The session also explores several approaches to network monitoring using a variety of open-source software and commercial applications to understand normal application behavior at the network level. With a focus on practical strategies and warning signs to look for, attendees can expect to leave with a solid understanding of the field, a survey of approaches that can be implemented quickly, and a path for further exploration and research.

Scott Andreas

Boundary, Inc.

Engineer at Boundary, Inc. hacking distributed ESP/CEP systems in Scala, Java, and Erlang. Former engineer at Urban Airship, Inc. designing and implementing low-latency messaging systems for mobile devices.

Comments on this page are now closed.


Scott Andreas
08/03/2012 3:13am PDT

Hey Shantanu,

Yep sure thing. They’re up on the OSCON SlideShare account here: www.slideshare.net/OReillyO...

Shantanu Bhattacharyya
08/02/2012 10:03pm PDT

Thanks for the talk. I really enjoyed it. Do you have your slides available somewhere for us to download?

