Once you are monitoring more than a few machines and applications, identifying and fixing issues in your environment quickly becomes complicated. Add the kind of dynamic infrastructure that cloud providers and container orchestration introduce, and static monitoring strategies will most likely fail to scale. Knowing which metrics to watch, and how to troubleshoot based on them, will help you solve problems more quickly.
Ilan Rabinovitch outlines a framework for organizing your metrics and explains how to use it to find solutions to the issues that come up. Ilan covers the three types of monitoring data, what to collect, what should trigger an alert (so you avoid alert storms and pager fatigue), and how to follow your resources to find the root causes of problems. The talk is not tool-specific, so you'll leave with strategies and frameworks you can implement in your environment today, regardless of the platforms and tools you use.
Ilan Rabinovitch is vice president of product and community at Datadog, where he spends his days diving into container monitoring metrics, collaborating with Datadog’s open source community, and evangelizing observability best practices. Previously, Ilan spent a number of years leading infrastructure and reliability engineering teams at organizations such as Ooyala and Edmunds.com. He’s active in the open source and DevOps communities, where he is a co-organizer of events such as SCALE and Texas Linux Fest as well as a number of devopsdays events.
©2016, O’Reilly UK Ltd