We all do everything in our power to avoid outages and performance incidents. Regardless, they occur for reasons both within and outside of our control, such as growth spikes, human error, hardware failure, and software bugs. We cannot avoid these occurrences, so shouldn’t we work to improve our responses to them? We came to this conclusion at WebMD and invested significant effort into our investigation and analysis work.
The aftermath of an outage can be the perfect time to gain priority and focus for improvements and enhancements. The output of your analysis should be specific, detailed action items that you can turn into operations tickets, vendor support or enhancement request tickets, or software change requests for Dev. This collaborative process can leave everyone feeling positive and optimistic, rather than aimless and frustrated, after an outage.
Analysis must go beyond what failed and how to prevent it from failing next time. The severity of an outage is measured both in time and in impact to the user. Timeline analysis should always be the starting point for outage and incident analysis.
Your outage involves time intervals that are meaningful and on which you can improve:
• Time to Detect – how long until your monitoring system or a user detected and reported an issue
• Time to Notify – how long from problem identification to notification of an engineer
• Time to Respond – how long from an engineer being notified until they are available to engage
• Time to Troubleshoot – how long it takes the engineer to diagnose the underlying cause
• Time to Repair – how long it takes the engineer to determine and implement the solution
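The intervals above can be computed directly from the timestamps in an incident timeline. The following is a minimal sketch, not a description of WebMD's tooling: the `IncidentTimeline` class, its field names, and the sample timestamps are all hypothetical, and it assumes that the start of the problem is known or can be estimated and that each interval ends where the next one begins.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Hypothetical record of the milestones in one outage."""
    started: datetime    # when the problem began (assumed known or estimated)
    detected: datetime   # monitoring or a user reported the issue
    notified: datetime   # an engineer was notified
    responded: datetime  # the engineer was available to engage
    diagnosed: datetime  # the underlying cause was identified
    repaired: datetime   # the solution was implemented

    def intervals(self) -> dict:
        """Return each interval as the gap between consecutive milestones."""
        return {
            "time_to_detect": self.detected - self.started,
            "time_to_notify": self.notified - self.detected,
            "time_to_respond": self.responded - self.notified,
            "time_to_troubleshoot": self.diagnosed - self.responded,
            "time_to_repair": self.repaired - self.diagnosed,
        }


# Made-up timestamps, for illustration only.
timeline = IncidentTimeline(
    started=datetime(2024, 1, 1, 10, 0),
    detected=datetime(2024, 1, 1, 10, 12),
    notified=datetime(2024, 1, 1, 10, 15),
    responded=datetime(2024, 1, 1, 10, 25),
    diagnosed=datetime(2024, 1, 1, 11, 5),
    repaired=datetime(2024, 1, 1, 11, 20),
)

intervals = timeline.intervals()
total = sum(intervals.values(), timedelta())
longest = max(intervals, key=intervals.get)

for name, duration in intervals.items():
    print(f"{name}: {duration}")
print(f"total outage: {total}; largest contributor: {longest}")
```

Breaking the total down this way shows which interval contributed most to the outage, which is one way to decide where an action item is likely to pay off first.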
Teresa Dietrich’s passion for Operational Excellence, Technical Innovation and Web Scale Systems has been the focus of her 15 years in the Internet Industry. She is currently the VP of Technical Operations at WebMD. She joined the company 5 years ago to create a Network Operations team. Teresa acquired responsibility for Database, Web, and CMS Operations, as well as Corporate Desktop Services. She also created the Operations Service Center and a Site Reliability Engineering team to fill needs she identified in her organization. Previously, Teresa spent nearly 10 years in positions throughout AOL Technology.
Find out more @ www.teresadietrich.net
Director of SRE at WebMD