Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-service.
Monitoring a single node for anomalies is hard; monitoring hundreds of nodes working together in a distributed system is harder. Mark and Arup discuss how Lyft monitors and troubleshoots issues with its Presto and Hadoop clusters. A time series database of metrics for both Presto and Hadoop is Lyft’s first line of defense, because patterns are easier to spot in graphs. Searchable application logs also play a pivotal role. When metrics and application logs do not reveal a root cause, Lyft falls back to its audit logs: it uses Hive hooks for HiveServer2 and event listeners for Presto to capture and record every query executed on its platforms.
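As a rough illustration of the audit-log idea, the sketch below writes one JSON line per executed query and then filters the log for failures. It is not Lyft's implementation: the record fields, the `QueryAuditRecord` class, and the helper names are all assumptions made for this example, and a real deployment would populate such records from a Hive hook or a Presto event listener rather than by hand.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator, List


@dataclass
class QueryAuditRecord:
    """Hypothetical audit record; the field names are illustrative only."""
    engine: str        # "presto" or "hive"
    user: str
    query_id: str
    sql: str
    state: str         # for example "FINISHED" or "FAILED"
    wall_time_ms: int


def to_audit_line(record: QueryAuditRecord) -> str:
    """Serialize one record as a single JSON line, ready to append to a log."""
    return json.dumps(asdict(record), sort_keys=True)


def parse_audit_lines(lines: Iterable[str]) -> Iterator[QueryAuditRecord]:
    """Parse JSON lines back into records, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield QueryAuditRecord(**json.loads(line))


def failed_queries(records: Iterable[QueryAuditRecord], engine: str) -> List[QueryAuditRecord]:
    """Return the failed queries for one engine, a typical first search when debugging."""
    return [r for r in records if r.engine == engine and r.state == "FAILED"]
```

Keeping each record on its own line means the log can be appended to cheaply and searched with ordinary text tools as well as with code like `failed_queries`.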
Being able to reproduce issues makes them easier to troubleshoot and debug, and Lyft relies heavily on audit logs to do so. Being in the cloud lets the company easily spin up a cluster that is equivalent in spec to its production clusters, while workload replay tools let it reproduce the workload and the bugs on that cluster.
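The replay idea can be sketched in a few lines, under stated assumptions: audit records are plain dictionaries with hypothetical `query_id`, `start_time`, and `sql` keys, and `run_query` is a caller-supplied function that submits SQL to the test cluster. Neither reflects Lyft's actual tooling, which the abstract does not describe.

```python
from typing import Callable, Dict, Iterable, List


def replay_workload(
    audit_records: Iterable[Dict],
    run_query: Callable[[str], None],
) -> List[Dict]:
    """Re-run audited queries in their original start order and record each outcome.

    `run_query` is a hypothetical hook: it should submit the SQL to the
    test cluster and raise an exception if the query fails.
    """
    results = []
    for rec in sorted(audit_records, key=lambda r: r["start_time"]):
        try:
            run_query(rec["sql"])
            outcome, error = "ok", None
        except Exception as exc:  # keep replaying even if one query fails
            outcome, error = "error", str(exc)
        results.append({"query_id": rec["query_id"], "outcome": outcome, "error": error})
    return results
```

Passing the runner in as a function keeps the replay loop independent of any particular client library, so the same loop can target Presto, Hive, or a stub in tests.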
Mark and Arup conclude with some thoughts on what the future holds.
Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.
Arup Malakar is a software engineer at Lyft.
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com