Communication and collaboration platform Slack has experienced exponential user growth since its launch in 2014. Slack was originally designed for small teams, and some of the design decisions that served it well at that scale became liabilities once the company had to support hundreds of thousands of users communicating at once.
By 2016, Slack faced a problem: the load on its backend servers had increased by 1,000×. In one incident, an entire team was knocked offline and couldn’t reconnect after uploading thousands of emojis, a use case that hadn’t been anticipated. The resulting spike of events caused a wave of client reconnections that cascaded into database failures.
Bing Wei explains how rearchitecting the system with lazy loading, a publish/subscribe model, and an edge cache service overcame the problem with zero downtime while improving latency, reliability, and availability. Bing also discusses Slack’s ongoing effort to build a generalized publish/subscribe framework and how the company handles data synchronization between clients and backend servers, work that should further improve latency and reduce backend cost. She also compares her time at Slack with her experience on the Twitter infrastructure team, detailing how the companies’ approaches differ and what Slack could learn from other web-scale companies.
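To make the three ideas named above concrete, here is a minimal sketch of how lazy loading, publish/subscribe, and an edge cache can fit together. It is not Slack's implementation: the `PubSub` and `EdgeCache` classes, the `"updates"` topic, and the `backend_calls` counter are all hypothetical names chosen for illustration, and the bus is a simple in-process stand-in for a real messaging system.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class PubSub:
    """In-process stand-in for a publish/subscribe bus (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, key: str) -> None:
        # Deliver the changed key to every handler registered on the topic.
        for handler in list(self._subscribers[topic]):
            handler(key)


class EdgeCache:
    """Cache that loads values lazily and drops them on update events."""

    def __init__(self, backend_fetch: Callable[[str], str], bus: PubSub,
                 topic: str = "updates") -> None:
        self._fetch = backend_fetch
        self._data: Dict[str, str] = {}
        self.backend_calls = 0  # counts how often the backend is hit
        bus.subscribe(topic, self._invalidate)

    def get(self, key: str) -> str:
        # Lazy loading: nothing is fetched until a client asks for it.
        if key not in self._data:
            self.backend_calls += 1
            self._data[key] = self._fetch(key)
        return self._data[key]

    def _invalidate(self, key: str) -> None:
        # An update event evicts the stale entry; the next get() refetches it.
        self._data.pop(key, None)
```

In this sketch, repeated reads of the same key reach the backend only once, and a published update evicts just the affected entry rather than forcing clients to reload everything, which is the general shape of the load reduction the abstract describes.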
Bing Wei is a software engineer on the infrastructure team at Slack, working on its edge cache service. Previously, she was at Twitter, where she contributed to the open source RPC library Finagle, worked on core services for tweets and timelines, and led the migration of tweet writes from a monolithic Rails application to JVM-based microservices.
©2018, O'Reilly Media, Inc.