Building and maintaining complex distributed systems
June 19–20, 2017: Training
June 20–22, 2017: Tutorials & Conference
San Jose, CA

Lessons and best practices learned from monitoring next-generation infrastructure (sponsored by SignalFx)

Arijit Mukherji (SignalFx)
2:10pm–2:50pm Wednesday, June 21, 2017
Sponsored
Location: LL20 C
Level: Beginner
Average rating: 4.20 (5 ratings)

Prerequisite knowledge

  • Familiarity with modern infrastructure trends

What you'll learn

  • Understand lessons and best practices learned from monitoring next-generation infrastructure

Description

The compute infrastructure landscape is evolving rapidly (clouds, containers, autoscaling, serverless, CI/CD, etc.), and these trends pose a new set of monitoring challenges. Arijit Mukherji shares real-world examples demonstrating what these challenges are, some approaches that worked, and the metrics-system capabilities that helped SignalFx deal with them.

  • Lesson 1: There are far more metrics than you expect; plan for much higher scale than seems necessary at first glance. Virtualized clouds enable the deployment of numerous short-lived, “right-sized” instances, and containerization is multiplying the number of microinstances that need to be monitored, which are even shorter lived. At SignalFx, every service instance container is its own AWS instance. Container IDs change with each push, for inter-microservice isolation and to ease management and deployment.

  • Lesson 2: History is hard. Tracking performance over time becomes a compute challenge; identify key derived metrics and preaggregate them. Select a metrics/metadata store that can hold the amount of history you require. DevOps practices like continuous or frequent deployment rapidly increase the number of versions and container IDs over time. Historical trending (e.g., month-over-month comparisons) has to consider so many time series that it becomes slow and both I/O and CPU intensive. For one large SignalFx customer, a third of all time series reported on any given day are brand new. That’s 33% churn per day.

  • Lesson 3: Today’s highly varied environments make it easy to lose sight of the forest for the trees. Model your data with care, and implement “join keys” (i.e., dimensions/tags that let you correlate and analyze relevant metrics from different sources). Use scalable analytics to calculate KPIs (e.g., p99 API latency across all servers), and focus on monitoring those KPIs instead of the “little things.” A typical service uses many different software and hardware components, each of which has its own way of reporting metrics and being monitored. Looking at each component in isolation is a losing proposition; you must have a holistic view. SignalFx has CloudWatch metrics, host metrics, container metrics, application metrics, and third-party service metrics, to name just a few. They all use different models.

  • Lesson 4: Timeliness of your metrics system is critical to maintaining your SLA. Measure at high resolution. You must have an alerting system that works in an event-driven manner rather than through periodic polling. Suppose you need “four nines” uptime (roughly four minutes of outage per month). If your metrics system takes minutes to alert you to an issue, there is no chance of fixing it in time. The world is getting faster. Containers live for days or even minutes, serverless (FaaS) functions live for seconds or less, and autoscaling changes things rapidly. We need to adapt. For example, one SignalFx customer was able to roll back a faulty software version in less than a minute.
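
The “four nines” figure in Lesson 4 can be checked with a little arithmetic. The following is a minimal sketch, not something taken from the talk: the function names, the 30-day month, and the five-minute detection delay are all assumptions made for illustration.

```python
def downtime_budget_minutes(availability: float, days: float = 30.0) -> float:
    """Return the minutes of outage allowed over `days` at a given availability.

    `availability` is a fraction (0.9999 for "four nines"). The 30-day
    default is an assumed month length, not a value from the talk.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)


def remaining_repair_minutes(
    availability: float, detection_delay_minutes: float, days: float = 30.0
) -> float:
    """Return the budget left for the fix once the alert has finally fired.

    A negative result means the budget was spent before anyone was told.
    """
    return downtime_budget_minutes(availability, days) - detection_delay_minutes


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.9999)
    print(f"Monthly budget at four nines: {budget:.2f} minutes")

    # Hypothetical comparison: a 5-minute polling delay versus a
    # 10-second (event-driven) detection delay.
    for delay in (5.0, 10.0 / 60.0):
        left = remaining_repair_minutes(0.9999, delay)
        print(f"Detection delay {delay:.2f} min leaves {left:.2f} min to repair")
```

Under these assumptions the monthly budget is about 4.3 minutes, so an alert that arrives five minutes after the fault has already used up the whole month's allowance, while near-real-time detection leaves most of it for the actual fix.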

This session is sponsored by SignalFx.

Arijit Mukherji

SignalFx

Arijit Mukherji was the first employee at SignalFx, where he has spent the last four years designing, developing, and managing many aspects of the product. Arijit has focused on the monitoring space for the past 10 years, in a career that has spanned IP telephony, VoIP conferencing, and network virtualization. Previously, he was an original developer on Facebook’s Metric Infrastructure (ODS) team and managed Facebook’s network tools development as well as data visualization for monitoring. He holds a BTech from the Indian Institute of Technology and an MS from UC Davis.