Days in Green (DIG): Forecasting the Life of a Healthy Service @Twitter

Vibhav Garg (Twitter), Arun Kejariwal (Independent)
Operations
Mission City Ballroom B4

Organic growth, dynamic traffic patterns, frequent deployments, and similar factors make real-world capacity planning non-trivial. One of the biggest challenges is to correctly characterize a healthy service and to estimate how long it is expected to remain healthy. If we can determine when a service is expected to switch from a healthy to an unhealthy state, we can plan for additional capacity in advance, and we can size that additional capacity based on the forecast. This lets us build our infrastructure efficiently, since it reduces the chances of both under-allocation and over-allocation.

Determining the health of a system raises some interesting challenges:

  • Which metric – RPS, CPU, etc. – to monitor? Each service in a Service Oriented Architecture (SOA) has its own specific performance bottleneck(s).
  • With multiple data centers, how do we combine and forecast data for Disaster Recovery (DR) compliance?
  • Determining the trend in resource utilization when the number of servers changes across multiple data centers that may run different hardware platforms.
  • Accounting for noisy data. Bad code deployments, spikes, data collection problems, etc. add noise to the time series, which makes forecasting difficult.
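
As a rough illustration of the last three challenges, the sketch below shows one possible way to prepare a utilization time series before forecasting. It is not the method used at Twitter: the function names, the 7-day window, and the assumption that all servers are interchangeable (same hardware) are hypothetical choices made only for this example.

```python
from statistics import median
from typing import Dict, List, Sequence


def rolling_median(values: Sequence[float], window: int = 7) -> List[float]:
    """Trailing rolling median; damps short spikes (e.g. a bad deployment)."""
    return [
        median(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


def per_server_load(
    total_load: Sequence[float], server_count: Sequence[int]
) -> List[float]:
    """Normalize daily total load by the number of servers serving that day,
    so that adding or removing servers does not distort the trend."""
    return [load / servers for load, servers in zip(total_load, server_count)]


def load_after_failover(
    load_by_dc: Dict[str, Sequence[float]],
    servers_by_dc: Dict[str, Sequence[int]],
    failed_dc: str,
) -> List[float]:
    """Per-server load if the surviving data centers had to absorb all traffic.

    Assumption for this sketch: servers are interchangeable across data centers.
    """
    days = len(next(iter(load_by_dc.values())))
    total = [sum(series[d] for series in load_by_dc.values()) for d in range(days)]
    surviving = [
        sum(servers_by_dc[dc][d] for dc in servers_by_dc if dc != failed_dc)
        for d in range(days)
    ]
    return per_server_load(total, surviving)
```

A smoothed, per-server, failover-adjusted series of this kind could then serve as the input to a forecast. Handling different hardware platforms would need an additional per-platform weighting that this sketch leaves out.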

To this end, we developed the notion of a Green Zone and Red Zone. The former indicates that the health of a given service is sound from a capacity perspective; the latter suggests that the corresponding service potentially warrants capacity allocation to avoid performance impact. Further, we determine Days in Green (DIG), an estimate – determined statistically – of the number of days left in the Green Zone for a given service. Broadly speaking, DIG is computed as follows:

  1. For a given SLA, determine the threshold for the most important resource constraint.
  2. Using the time series of the resource constraint from Step 1, forecast the number of days until the threshold is reached. Forecasting is carried out using linear/quadratic regression or more advanced statistical techniques such as ARIMA. We employ model selection to determine the “best” forecasting technique for the time series at hand.
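
The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used at Twitter: it covers only linear and quadratic regression (no ARIMA), and it picks between them by squared error on a held-out tail of the series, which is just one possible form of model selection. All function names, the 7-day holdout, and the 365-day horizon are hypothetical.

```python
from typing import List, Optional, Sequence


def fit_polynomial(ys: Sequence[float], degree: int) -> List[float]:
    """Least-squares polynomial fit over x = 0, 1, ..., len(ys) - 1.

    Returns coefficients [c0, c1, ...] of c0 + c1*x + c2*x**2 + ...
    """
    n = degree + 1
    # Normal equations (A^T A) c = A^T y, stored as an augmented matrix.
    rows = [
        [sum(float(x) ** (i + j) for x in range(len(ys))) for j in range(n)]
        + [sum(y * float(x) ** i for x, y in enumerate(ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


def evaluate(coeffs: Sequence[float], x: float) -> float:
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def select_degree(
    ys: Sequence[float], degrees: Sequence[int] = (1, 2), holdout: int = 7
) -> int:
    """Pick the degree with the lowest squared error on the last `holdout` days."""
    train, test = ys[:-holdout], ys[-holdout:]

    def holdout_error(degree: int) -> float:
        coeffs = fit_polynomial(train, degree)
        return sum(
            (evaluate(coeffs, len(train) + i) - y) ** 2 for i, y in enumerate(test)
        )

    return min(degrees, key=holdout_error)


def days_in_green(
    ys: Sequence[float], threshold: float, horizon: int = 365
) -> Optional[int]:
    """Days until the forecast first reaches `threshold`.

    Returns 0 if the latest observation is already at or above the threshold
    (Red Zone), and None if the forecast stays below it for `horizon` days.
    """
    if ys[-1] >= threshold:
        return 0
    coeffs = fit_polynomial(ys, select_degree(ys))
    last_day = len(ys) - 1
    for ahead in range(1, horizon + 1):
        if evaluate(coeffs, last_day + ahead) >= threshold:
            return ahead
    return None
```

Here `ys` would be one value per day of the bottleneck metric chosen in Step 1 (for example, daily peak CPU utilization), and `threshold` the limit derived from the SLA. The series needs to be comfortably longer than the holdout window for the fits to be meaningful.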

We will walk the audience through how DIG is computed and used at Twitter, using real data.

Vibhav Garg

Twitter

I have 14 years of experience developing and working with distributed systems. For the last 4 years I have focused on performance problems in large-scale distributed systems at orbitz.com, salesforce.com, and now @twitter. My current interests include solving capacity-related problems using statistical techniques and the performance of distributed systems.

Arun Kejariwal

Independent

@arun_kejariwal is currently a Staff Capacity Engineer at Twitter, where he works on research and development of novel techniques to improve the accuracy of capacity models and demand forecasts. Prior to joining Twitter, @arun_kejariwal worked on research and development of practical and statistically rigorous methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques developed have been published in peer-reviewed international conferences and journals.

@arun_kejariwal received his Bachelor’s degree in EE from IIT Delhi and doctorate in CS from UCI.