The 11-Second Bottleneck: How Slack Rewrote Their Message Store at 3AM to Stop 10 Million Users From Switching to Teams
In March 2019, Slack's database was dying. Every message took 11 seconds to load. Microsoft Teams was gaining 300,000 users a day. Three engineers had 72 hours to rewrite the architecture that stored 9 billion messages, without anyone noticing.
The 3AM Slack Message That Changed Everything
March 14th, 2019, 3:47 AM Pacific Time. Cal Henderson, Slack's CTO, was staring at a Datadog dashboard that looked like a heart attack in progress. The red line representing database query latency had spiked from 200 milliseconds to 11,000 milliseconds. Eleven seconds. To load a single message.
In the #incident-response channel, the messages were flying:
"ChannelStore is down." "MessageStore is timing out." "Enterprise customers can't load history."
And then, the one that made Henderson's stomach drop:
"WeWork just filed a ticket. They're evaluating Teams."
WeWork had 100,000 employees on Slack. If they left, others would follow. Microsoft Teams was adding 300,000 new daily active users every single day, running TV ads during the Super Bowl, bundled free with Office 365. Slack was burning $140 million a year, still not profitable, and their IPO was two months away.
The platform that had revolutionized workplace communication, the company worth $20 billion, was being killed by its own database.
The Architecture That Worked (Until It Didn't)
When Stewart Butterfield and his team pivoted from a failed gaming company to Slack in 2013, they built fast. The initial architecture was elegantly simple:
- MySQL databases for message storage, sharded by team ID
- Memcached for caching recent messages
- A monolithic PHP application serving the web and API
For the first two years, it was perfect. Slack grew from 15,000 users to 1 million. The database hummed along. Sharding by team meant each customer's data lived on one shard. Simple. Fast. Scalable.
But by 2019, Slack had 10 million daily active users. 9 billion messages stored. 750,000 paying customers. And the cracks were showing everywhere.
The problem wasn't the technology. It was the access pattern.
In the early days, most Slack users belonged to one team. Load messages, cache them, done. But by 2019, power users belonged to 20+ workspaces. Engineers at companies like Stripe or Shopify were in:
- Their company workspace
- 5 different project workspaces
- 3 industry communities
- 2 open-source project Slacks
- 10 contractor/client workspaces
Every time they opened Slack, the app had to:
- Query 20 different MySQL shards (one per workspace)
- Join across channels, threads, and mentions
- Sort by timestamp across all workspaces
- Fetch user metadata, presence, typing indicators
- Do this for every single scroll, every search, every thread expansion
The database was doing distributed joins across 20 shards for every user session. At 10 million DAU, with each user in 15 workspaces on average, that was at least 150 million cross-shard queries per day, even if every user opened Slack only once.
Memcached couldn't help: the cache hit rate was 40% because every user's access pattern was unique. You couldn't pre-warm a cache when you didn't know which 15 workspaces someone would load.
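The fan-out read described above can be sketched roughly as follows. This is a minimal illustration, not Slack's code: the `Message` tuple, `fetch_from_shard`, and `load_unified_view` are hypothetical names, and plain Python lists stand in for the per-workspace MySQL shards.

```python
import heapq
import itertools
from typing import Dict, List, Tuple

# Hypothetical row shape: (timestamp, workspace_id, text).
Message = Tuple[float, str, str]


def fetch_from_shard(rows: List[Message], limit: int) -> List[Message]:
    """Stand-in for one per-workspace query: newest `limit` rows, newest first."""
    return sorted(rows, reverse=True)[:limit]


def load_unified_view(shards: Dict[str, List[Message]], limit: int) -> List[Message]:
    """Scatter one query to every shard, then gather with a k-way merge by timestamp."""
    per_shard = [fetch_from_shard(rows, limit) for rows in shards.values()]
    # heapq.merge expects each input to be sorted in the same (descending) order.
    merged = heapq.merge(*per_shard, reverse=True)
    return list(itertools.islice(merged, limit))


shards = {
    "acme": [(1.0, "acme", "standup at 10"), (4.0, "acme", "deploy done")],
    "oss": [(3.0, "oss", "PR merged"), (2.0, "oss", "CI is red")],
}
latest = load_unified_view(shards, limit=3)
```

The cost grows with the number of workspaces: every session pays for one query per shard before the merge can even start, which is why 15 to 20 workspaces per user hurt so much.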
MySQL replication lag was hitting 30 seconds. Write storms from large channels (like #general at a 50,000-person company) would lock tables and cascade across reads. The database was spending 80% of its CPU on query planning for these distributed joins, not actual data retrieval.
And the kicker? Slack's infrastructure team had seen this coming for 18 months. They'd built POCs, written RFCs, proposed migrations. But the business was growing too fast to stop and refactor. Ship features, add servers, optimize queries, repeat.
Until March 14th, when the entire platform almost fell over.
The 72-Hour Rewrite
Henderson pulled in three people:
Bing Wei, Staff Engineer, who'd built message indexing.
Siavash Ghorbani, Database Architect, who knew every MySQL quirk.
Julia Grace, VP of Infrastructure (formerly at IBM's distributed systems team).
The brief was simple: "We have 72 hours before the WeWork contract review. Fix this, or we're dead."
They couldn't rewrite the entire storage layer in 72 hours. But they could change the read path: how messages were fetched and served.
The plan:
1. Introduce Vitess (MySQL Sharding Middleware)
Vitess was YouTube's open-source solution for horizontal MySQL scaling. Instead of application-level sharding (where Slack's PHP code decided which shard to query), Vitess would:
- Pool connections across shards
- Route queries intelligently based on shard key
- Scatter-gather for cross-shard queries and merge results
- Cache query plans to eliminate per-query planning overhead
But Vitess wasn't a magic bullet. It still required queries to be shard-aware. You couldn't just scatter-gather 20 shards for every message load; that would be slower than before.
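To make "shard-aware" concrete, here is a small sketch of the routing decision a sharding layer has to make. It is not Vitess's API; `shard_for`, `target_shards`, the CRC32 hash, and the shard count of 20 are all assumptions chosen for illustration.

```python
import zlib
from typing import List, Optional

NUM_SHARDS = 20  # assumed shard count, for illustration only


def shard_for(team_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key to exactly one shard using a stable hash."""
    return zlib.crc32(team_id.encode("utf-8")) % num_shards


def target_shards(team_id: Optional[str], num_shards: int = NUM_SHARDS) -> List[int]:
    """Return one shard when the query carries the shard key, else all of them."""
    if team_id is not None:
        return [shard_for(team_id, num_shards)]
    # No shard key in the query: the only option is to scatter to every shard.
    return list(range(num_shards))
```

A query that names its team touches one shard. A query that does not, such as "everything for this user across all workspaces", touches all of them, no matter how smart the middleware is.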
So they needed a new data model.
2. Build the Flannel Layer (Message Routing Cache)
The breakthrough came from Bing Wei, at 6 AM on day two, after no sleep.
"What if we denormalize the user's message timeline into a single cached structure?"
The idea: instead of querying 20 workspaces on every load, pre-build a unified timeline per user and cache it in Redis.
Here's how it worked:
Every time a message was sent, a background worker (running on Kafka consumers) would:
- Identify all users who needed to see this message (channel members, mentioned users, thread participants)
- Append the message ID to each user's personal timeline in Redis (a sorted set, scored by timestamp)
- Store the message content in a separate Redis hash (keyed by message ID)
When a user loaded Slack:
- Fetch their timeline from Redis (a single sorted-set range read, O(log N + M) for M entries, already in timestamp order)
- Hydrate message content from Redis (batch fetch from hash, O(n) for n messages)
- Fall back to MySQL only for cache misses or deep history (pagination beyond 1000 messages)
This inverted the access pattern. Instead of "query all workspaces, join, sort," it was "fetch pre-sorted timeline, hydrate."
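The write and read paths above can be sketched with an in-memory stand-in for Redis. Everything here is illustrative: `TimelineCache` and its methods are hypothetical names, a sorted Python list plays the role of the sorted set, and a dict plays the role of the hash. In Redis itself the corresponding operations would be along the lines of `ZADD`, `ZREVRANGE`, and `HMGET`.

```python
import bisect
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class TimelineCache:
    """In-memory stand-in for the per-user timeline layer described above."""

    def __init__(self) -> None:
        # user_id -> ascending list of (timestamp, message_id); plays the sorted set.
        self.timelines: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
        # message_id -> content; plays the hash.
        self.bodies: Dict[str, str] = {}

    def fan_out(self, message_id: str, timestamp: float, text: str,
                recipients: Iterable[str]) -> None:
        """Write path: store the body once, add the ID to every recipient's timeline."""
        self.bodies[message_id] = text
        for user_id in recipients:
            bisect.insort(self.timelines[user_id], (timestamp, message_id))

    def load(self, user_id: str, limit: int = 50) -> List[str]:
        """Read path: take the newest `limit` IDs, then hydrate their content."""
        newest_first = self.timelines.get(user_id, [])[-limit:][::-1]
        # A real read path would fall back to the database for missing bodies;
        # this sketch simply skips them.
        return [self.bodies[mid] for _, mid in newest_first if mid in self.bodies]


cache = TimelineCache()
cache.fan_out("m1", 1.0, "hello", ["alice", "bob"])
cache.fan_out("m2", 2.0, "world", ["alice"])
```

Note that the work has moved, not vanished: `load` is cheap because `fan_out` already did one insert per recipient.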
The latency dropped from 11,000ms to 47ms. P99 was 120ms.
But there was a problem.
3. The Write Amplification Disaster
Redis was fast for reads. But now every message sent in a 10,000-person channel required 10,000 timeline updates, one per user.
At peak, Slack was handling 26 million messages per day. If the average channel had 50 members, that was 1.3 billion Redis writes per day.
The Redis cluster couldn't keep up. Write latency spiked. Replication lag hit 10 seconds. They were trading a read bottleneck for a write bottleneck.
Siavash Ghorbani had the fix: lazy timeline updates.
Instead of updating 10,000 timelines immediately:
- Store the message in MySQL (source of truth)
- Publish a Kafka event: "Message X sent to Channel Y"
- Update timelines only for users currently online (using presence data from WebSocket connections)
- For offline users, mark their timeline as "stale" (a simple boolean in Redis)
- When a stale user reconnects, rebuild their timeline on-demand from MySQL, then cache it
This cut Redis writes by 90%. Only active users got real-time updates. Inactive users got a fast rebuild on next login.
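A minimal sketch of the lazy scheme, under stated assumptions: `LazyTimelines` is a hypothetical class, the `rebuild` callback stands in for the on-demand MySQL query, and Python sets stand in for presence data and the "stale" flags.

```python
from typing import Callable, Dict, List, Set, Tuple

Entry = Tuple[float, str]  # (timestamp, message_id)


class LazyTimelines:
    """Update timelines eagerly for online users, lazily for everyone else."""

    def __init__(self, rebuild: Callable[[str], List[Entry]]) -> None:
        self.rebuild = rebuild  # hypothetical loader reading the source of truth
        self.online: Set[str] = set()
        self.stale: Set[str] = set()
        self.timelines: Dict[str, List[Entry]] = {}

    def on_message(self, entry: Entry, members: List[str]) -> int:
        """Apply one message; return how many timeline writes it actually cost."""
        writes = 0
        for user in members:
            if user in self.online:
                self.timelines.setdefault(user, []).append(entry)
                writes += 1
            else:
                self.stale.add(user)  # one cheap flag instead of a timeline write
        return writes

    def on_connect(self, user: str) -> List[Entry]:
        """Mark the user online, rebuilding the timeline if it is stale or missing."""
        self.online.add(user)
        if user in self.stale or user not in self.timelines:
            self.timelines[user] = sorted(self.rebuild(user))
            self.stale.discard(user)
        return self.timelines[user]

    def on_disconnect(self, user: str) -> None:
        self.online.discard(user)
```

The trade-off is that a returning user pays the rebuild cost once at login, in exchange for the cluster not paying a write for them on every message sent while they were away.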
Write amplification: solved.
4. Deploying Without Downtime
They had the architecture. Now they had to ship it without taking Slack offline.
Julia Grace's deployment plan:
- Shadow reads: Run both old and new code paths in parallel, compare results, log discrepancies. (8 hours)
- Canary rollout: Route 1% of traffic to new path, monitor error rates. (12 hours)
- Gradual ramp: 5%, 10%, 25%, 50%, 100% over 36 hours.
- Kill switch: One environment variable flip to roll back instantly.
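The shape of that plan can be sketched in a few lines. This is an illustration under assumptions, not the actual rollout code: the environment variable name `NEW_READ_PATH_DISABLED`, the CRC32 bucketing, and the function names are all hypothetical.

```python
import os
import zlib
from typing import Callable, List

KILL_SWITCH_ENV = "NEW_READ_PATH_DISABLED"  # hypothetical variable name


def use_new_path(user_id: str, rollout_percent: int) -> bool:
    """Decide per user whether to serve from the new path."""
    if os.environ.get(KILL_SWITCH_ENV) == "1":
        return False  # kill switch wins over any rollout percentage
    # Stable bucket in [0, 100): a user stays in the same bucket as the ramp grows.
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    return bucket < rollout_percent


def shadow_read(user_id: str,
                old_read: Callable[[str], list],
                new_read: Callable[[str], list],
                mismatches: List[str]) -> list:
    """Always serve the old result; run the new path alongside and log differences."""
    expected = old_read(user_id)
    try:
        candidate = new_read(user_id)
    except Exception as exc:  # the new path must never break a user-facing read
        mismatches.append(f"{user_id}: new path raised {exc!r}")
        return expected
    if candidate != expected:
        mismatches.append(f"{user_id}: results differ")
    return expected
```

Hashing the user ID rather than picking randomly per request keeps each user on one path, so moving from 1% to 5% only adds users to the new path and never flips anyone back and forth.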
They deployed at 2 AM on day three (March 17th). By 6 AM, 100% of traffic was on the new architecture.
Average message load latency: 38 milliseconds. P99: 95 milliseconds. Down from 11,000 milliseconds.
WeWork renewed their contract the next morning.
The Numbers That Tell the Story
Before (March 14, 2019):
- Message load latency (P50): 11,000ms
- Database CPU utilization: 87% (mostly query planning)
- Cache hit rate: 40%
- Cross-shard queries per user session: 20+
- MySQL replication lag: 30 seconds
After (March 18, 2019):
- Message load latency (P50): 38ms (290x faster)
- Database CPU utilization: 31%
- Cache hit rate: 94%
- Cross-shard queries per user session: 0.2 (only cache misses)
- Redis read QPS: 2.3 million
- Redis write QPS: 180,000 (down from 1.8 million with lazy updates)
The Legacy: Why This Still Matters
Slack's near-death experience in March 2019 exposed the fundamental tension in modern distributed systems: read optimization vs. write amplification.
The pattern they pioneered, user-specific denormalized caches with lazy write propagation, is now standard:
- Instagram's feed uses a similar timeline-per-user model (stored in Cassandra instead of Redis)
- Twitter rebuilt its timeline architecture in 2017 around the same principle (fanout-on-read for celebrities, fanout-on-write for everyone else)
- LinkedIn's notification system uses Redis sorted sets for per-user timelines, exactly like Flannel
But the deeper lesson is about technical debt at scale. Slack's MySQL-sharded architecture wasn't wrong in 2013. It was right for the access pattern at the time. The mistake was assuming it would scale forever.
By the time they hit 10 million DAU, the cost to migrate was existential. They couldn't stop shipping features. They couldn't take the platform offline. They had to rewrite the data layer while the plane was in the air.
And they did it in 72 hours.
Cal Henderson later wrote in Slack's engineering blog:
"The best architecture is the one that survives contact with reality. We didn't over-engineer in 2013, and we didn't under-react in 2019. We built what we needed, when we needed it, with the constraints we had."
Today, Slack handles 50 million daily active users, serves 150 billion messages per year, and maintains P99 latency under 100 milliseconds.
Because three engineers refused to let 11 seconds kill a $27 billion company.