Yahoo!’s front page has had a remarkable track record of stability. Over the years, more and more techniques have been implemented to ensure the stability of the critical and highly profitable www.yahoo.com. A thorough incident management system ensures that the lessons learned from each incident are followed up on and continually add to the robustness of the application.
This session will cover in depth the top 5 techniques that contributed to its stability.
Description of techniques will include:
- Error proofing change: make the change (with forked production traffic) before really making the change
- Global load balancing and performance optimization
- Redundancy for everything: hardware, software, network, DNS… heck, the entire internet
- Failure modes: everything can and will break, so have a band-aid ready
- Monitoring/alerting: monitor every part of your application as well as everyone else’s application
Lastly, we will go into the causes of last year’s outage and why each of these techniques failed to prevent it.
Error proofing change:
Description of the software release process, covering the multiple phases of a release:
1. Continuous Integration environment with automated build, unit test, deploy, and test for each check-in.
2. QA environment with automated tests and debug statements, where logs and monitors are closely watched during testing
3. Staging environment where the rollout process is tested with forked copies of production traffic
4. Production deployment – all code is dark launched and reviewed before activating in a phased rollout
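The last phase above, dark launching followed by a phased rollout, can be sketched as a percentage gate. This is a minimal illustration, not Yahoo!’s actual mechanism: the function names `rollout_bucket` and `is_active`, the hashing scheme, and the use of a user ID as the bucketing key are all assumptions made for the example.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature.

    Hypothetical helper: hashing feature and user together keeps a user's
    bucket stable across requests while varying it between features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_active(user_id: str, feature: str, percent: int) -> bool:
    """Return True if the dark-launched feature is switched on for this user.

    percent=0 means the code is deployed but dark; percent=100 means fully
    activated. Values in between give a phased rollout.
    """
    return rollout_bucket(user_id, feature) < percent
```

Because the bucket is deterministic, raising `percent` from 5 to 25 to 100 only ever adds users to the activated group, so a user who has seen the new code path keeps seeing it as the rollout widens.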
Global load balancing and performance optimization
- Route traffic to nearest of over a dozen colos worldwide
- Ability to serve any country from any location
- Use in failure scenarios, maintenance, code changes, testing, etc.
- Able to sustain a complete outage in any country or region, whether caused by network, power, or act of God
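The routing behaviour in the list above can be sketched as "pick the lowest-latency healthy colo, and let any healthy colo serve any region". This is only an illustration under stated assumptions: the `Colo` class, its `latency_ms` table, and the `pick_colo` function are hypothetical names, and real global load balancing is typically done at the DNS or network layer rather than in application code.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class Colo:
    """A hypothetical data center record used only for this sketch."""
    name: str
    healthy: bool
    # Assumed input: measured latency in milliseconds per client region.
    latency_ms: Dict[str, float] = field(default_factory=dict)


def pick_colo(colos: Iterable[Colo], region: str) -> Optional[Colo]:
    """Return the healthy colo with the lowest latency for the region.

    A colo with no latency measurement for the region is still eligible
    (any location can serve any country); it simply sorts last.
    Returns None only when no colo is healthy at all.
    """
    healthy = [c for c in colos if c.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda c: c.latency_ms.get(region, float("inf")))
```

Marking a colo unhealthy, whether for an outage, maintenance, or a code change, moves its traffic to the next-nearest location without any other change to the routing logic.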
Redundancy for everything
- Description of how to make DNS, network, servers, software, colo, dependencies, etc. redundant
Failsafe measures/Degrade gracefully
- Static page created every 15 mins to serve traffic in failure or traffic spike scenario
- Failed Dependencies degrade gracefully
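The static-page failsafe above can be sketched as a wrapper that keeps a recent snapshot of the rendered page and serves it when live rendering fails. This is a simplified sketch, not the production design: the `StaticFallback` class and its parameters are hypothetical, and here the snapshot is refreshed as a side effect of successful requests rather than by a separate job.

```python
import time
from typing import Callable, Optional


class StaticFallback:
    """Serve a live page, falling back to a periodic static snapshot."""

    def __init__(
        self,
        render: Callable[[], str],
        max_age_s: float = 15 * 60,  # snapshot interval from the text: 15 minutes
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._render = render
        self._max_age_s = max_age_s
        self._clock = clock
        self._snapshot: Optional[str] = None
        self._taken_at = 0.0

    def serve(self) -> str:
        try:
            page = self._render()
        except Exception:
            # Live rendering failed: degrade to the last snapshot if we have one.
            if self._snapshot is None:
                raise
            return self._snapshot
        now = self._clock()
        # Refresh the snapshot only when it is missing or older than max_age_s.
        if self._snapshot is None or now - self._taken_at >= self._max_age_s:
            self._snapshot = page
            self._taken_at = now
        return page
```

The same snapshot could also be served deliberately during a traffic spike, since returning a stored string costs far less than rendering the page against all of its dependencies.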
Monitoring/Alerting areas – senior engineers with data cards begin debugging within 5 mins of any alert, 24/7
Extensive monitoring includes:
- System level monitoring
- End-to-end functionality checks per host
- All dependencies: success/failure rates
- Content “freshness”
- Performance – Server side duration
- Traffic levels – week over week
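The week-over-week traffic check in the list above can be sketched as a comparison of the current request rate against the same point one week earlier. The function names and the 30% threshold are assumptions chosen for illustration; the source does not state what deviation triggers an alert.

```python
def week_over_week_change(current: float, last_week: float) -> float:
    """Return the fractional change versus the same time last week.

    For example, 0.2 means traffic is 20% higher than a week ago.
    """
    if last_week <= 0:
        raise ValueError("last_week must be positive to compute a ratio")
    return (current - last_week) / last_week


def traffic_alert(current: float, last_week: float, threshold: float = 0.3) -> bool:
    """Return True if traffic deviates from last week by more than threshold.

    The 0.3 default is a hypothetical value, not one taken from the source.
    Both drops and spikes trigger the alert.
    """
    return abs(week_over_week_change(current, last_week)) > threshold
```

Comparing against the same hour of the previous week, rather than the previous hour or day, accounts for the normal daily and weekly cycle in front-page traffic.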
Lastly, techniques will not cover everything: kick-ass operational engineers are essential, as is close interaction with the development team.
Jake Loomis is currently a VP of Service Engineering at Yahoo!, where he pioneered Yahoo!’s efforts to be consistently reliable in a fast-growing, rapidly changing environment. He is a contributor to O’Reilly’s Web Operations book, drawing on his experience of owning operational responsibility for many widely varied Yahoo! applications including Yahoo! Mail, Yahoo! Messenger, Flickr, Yahoo! Finance, www.yahoo.com and numerous others.