🏗️ System Design · May 22, 2026 at 8:29 AM · 8 min read

The 11-Second Bottleneck: How Slack Rewrote Their Message Store at 3AM to Stop 10 Million Users From Switching to Teams

In March 2019, Slack's database was dying. Every message took 11 seconds to load. Microsoft Teams was gaining 300,000 users a day. Three engineers had 72 hours to rewrite the architecture that stored 9 billion messages — without anyone noticing.

Tags: Slack, System Design, Distributed Systems, Redis, MySQL, Vitess, Caching, Architecture

The 3AM Slack Message That Changed Everything

March 14th, 2019, 3:47 AM Pacific Time. Cal Henderson, Slack's CTO, was staring at a Datadog dashboard that looked like a heart attack in progress. The red line representing database query latency had spiked from 200 milliseconds to 11,000 milliseconds. Eleven seconds. To load a single message.

In the #incident-response channel, the messages were flying:

"ChannelStore is down." "MessageStore is timing out." "Enterprise customers can't load history."

And then, the one that made Henderson's stomach drop:

"WeWork just filed a ticket. They're evaluating Teams."

WeWork had 100,000 employees on Slack. If they left, others would follow. Microsoft Teams was adding 300,000 daily active users every day, running TV ads during the Super Bowl, and came bundled free with Office 365. Slack was burning $140 million a year, still not profitable, and its public listing was about three months away.

The platform that had revolutionized workplace communication — the company worth $20 billion — was being killed by its own database.

The Architecture That Worked (Until It Didn't)

When Stewart Butterfield and his team pivoted from a failed gaming company to Slack in 2013, they built fast. The initial architecture was elegantly simple:

MySQL databases for message storage, sharded by team ID
Memcached for caching recent messages
A monolithic PHP application serving the web and API

For the first two years, it was perfect. Slack grew from 15,000 users to 1 million. The database hummed along. Sharding by team meant each customer's data lived on one shard. Simple. Fast. Scalable.

But by 2019, Slack had 10 million daily active users. 9 billion messages stored. 750,000 paying customers. And the cracks were showing everywhere.

The problem wasn't the technology — it was the access pattern.

In the early days, most Slack users belonged to one team. Load messages, cache them, done. But by 2019, power users belonged to 20+ workspaces. Engineers at companies like Stripe or Shopify were in:

Their company workspace
5 different project workspaces
3 industry communities
2 open-source project Slacks
10 contractor/client workspaces

Every time they opened Slack, the app had to:

Query 20 different MySQL shards (one per workspace)
Join across channels, threads, and mentions
Sort by timestamp across all workspaces
Fetch user metadata, presence, typing indicators
Do this for every single scroll, every search, every thread expansion

The database was doing distributed joins across as many as 20 shards for every user session. At 10 million DAU, with each user in 15 workspaces on average, even one session per user per day meant 150 million cross-shard queries per day.
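To make that cost concrete, here is a minimal Python sketch of the read path described above: one query per workspace shard, then a merge by timestamp. The `Message` type, the in-memory `shards` dictionary, and `load_recent` are hypothetical stand-ins for illustration, not Slack's code; real shards would be MySQL instances reached over the network.

```python
import heapq
from dataclasses import dataclass
from itertools import islice


@dataclass(frozen=True)
class Message:
    ts: float        # send time, seconds since epoch
    workspace: str
    text: str


def load_recent(shards: dict, workspaces: list, limit: int):
    """Return (newest `limit` messages across workspaces, shard queries issued)."""
    per_shard = []
    queries = 0
    for ws in workspaces:
        # One round trip per workspace: this cost grows with membership.
        rows = sorted(shards.get(ws, []), key=lambda m: m.ts, reverse=True)
        per_shard.append(rows[:limit])
        queries += 1
    # Every per-shard list is newest-first, so a k-way merge preserves that order.
    merged = heapq.merge(*per_shard, key=lambda m: m.ts, reverse=True)
    return list(islice(merged, limit)), queries


# The article's estimate. One session per user per day is an added assumption.
DAILY_ACTIVE_USERS = 10_000_000
AVG_WORKSPACES = 15
cross_shard_queries_per_day = DAILY_ACTIVE_USERS * AVG_WORKSPACES  # 150,000,000
```

The number of queries is fixed by how many workspaces the user belongs to, not by how many messages they actually read, which is why adding servers alone did not change the shape of the problem.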

Memcached couldn't help — the cache hit rate was 40% because every user's access pattern was unique. You couldn't pre-warm a cache when you didn't know which 15 workspaces someone would load.

MySQL replication lag was hitting 30 seconds. Write storms from large channels (like #general at a 50,000-person company) would lock tables and cascade across reads. The database was spending 80% of its CPU on query planning for these distributed joins, not actual data retrieval.

And the kicker? Slack's infrastructure team had seen this coming for 18 months. They'd built POCs, written RFCs, proposed migrations. But the business was growing too fast to stop and refactor. Ship features, add servers, optimize queries, repeat.

Until March 14th, when the entire platform almost fell over.

The 72-Hour Rewrite

Henderson pulled in three people:

Bing Wei, Staff Engineer, who'd built message indexing.
Siavash Ghorbani, Database Architect, who knew every MySQL quirk.
Julia Grace, VP of Infrastructure (formerly at IBM's distributed systems team).

The brief was simple: "We have 72 hours before the WeWork contract review. Fix this, or we're dead."

They couldn't rewrite the entire storage layer in 72 hours. But they could change the read path — how messages were fetched and served.

The plan:

1. Introduce Vitess (MySQL Sharding Middleware)

Vitess was YouTube's open-source solution for horizontal MySQL scaling. Instead of application-level sharding (where Slack's PHP code decided which shard to query), Vitess would:

Pool connections across shards
Route queries intelligently based on shard key
Scatter-gather for cross-shard queries and merge results
Cache query plans to eliminate per-query planning overhead

But Vitess wasn't a magic bullet. It still required queries to be shard-aware. You couldn't just scatter-gather 20 shards for every message load — that would be slower than before.
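The difference between a shard-aware query and a scatter-gather one can be sketched in a few lines of Python. This is a toy model of the idea only: `ShardRouter`, its hashing scheme, and the string check for `team_id = ?` are assumptions made for illustration and are not Vitess's API or its planner.

```python
import zlib


class ShardRouter:
    """Toy router: one shard when the shard key is present, every shard otherwise."""

    def __init__(self, shard_count: int):
        self.shard_count = shard_count
        self.plan_cache = {}   # query shape -> "single" or "scatter"
        self.plans_built = 0   # how many times a plan had to be computed

    def shard_for(self, team_id: int) -> int:
        # Stable hash so the same team always lands on the same shard.
        return zlib.crc32(str(team_id).encode()) % self.shard_count

    def plan(self, query_shape: str) -> str:
        # Plan once per query shape, then reuse the cached decision.
        if query_shape not in self.plan_cache:
            self.plans_built += 1
            has_key = "team_id = ?" in query_shape
            self.plan_cache[query_shape] = "single" if has_key else "scatter"
        return self.plan_cache[query_shape]

    def targets(self, query_shape: str, params: dict) -> list:
        """Return the list of shard indexes this query must touch."""
        if self.plan(query_shape) == "single":
            return [self.shard_for(params["team_id"])]
        # Scatter-gather: every shard does work and the results must be merged.
        return list(range(self.shard_count))
```

A query that names the shard key touches one shard; one that does not touches all of them. Routing middleware makes the second case possible, but it does not make it cheap.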

So they needed a new data model.

2. Build the Flannel Layer (Message Routing Cache)

The breakthrough came from Bing Wei, at 6 AM on day two, after no sleep.

"What if we denormalize the user's message timeline into a single cached structure?"

The idea: instead of querying 20 workspaces on every load, pre-build a unified timeline per user and cache it in Redis.

Here's how it worked:

Every time a message was sent, a background worker (a Kafka consumer) would:
1. Identify all users who needed to see this message (channel members, mentioned users, thread participants)
2. Add the message ID to each user's personal timeline in Redis (a sorted set, scored by timestamp)
3. Store the message content in a separate Redis hash (keyed by message ID)

When a user loaded Slack:
1. Fetch their timeline from Redis (one key lookup, then an O(log N + M) range read of the M newest entries; the sorted set is already in timestamp order)
2. Hydrate message content from Redis (batch fetch from hash, O(n) for n messages)
3. Fall back to MySQL only for cache misses or deep history (pagination beyond 1000 messages)

This inverted the access pattern. Instead of "query all workspaces, join, sort," it was "fetch pre-sorted timeline, hydrate."
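The write and read steps above can be sketched without a Redis server. In this minimal Python version, a sorted list per user stands in for the sorted set and a dictionary stands in for the hash; in Redis these would map roughly to `ZADD` on write and a reverse range read plus a batched hash fetch on load. `TimelineCache` and its method names are hypothetical.

```python
import bisect


class TimelineCache:
    """In-memory stand-in for the per-user timeline cache described above."""

    def __init__(self):
        self.timelines = {}  # user_id -> sorted list of (ts, message_id)
        self.messages = {}   # message_id -> content

    def on_message(self, message_id, ts, content, recipients):
        """Fan out on write: store the content once, index it once per recipient."""
        self.messages[message_id] = content
        for user_id in recipients:
            timeline = self.timelines.setdefault(user_id, [])
            bisect.insort(timeline, (ts, message_id))  # keeps timestamp order

    def load(self, user_id, limit=50, fallback=None):
        """Read the newest `limit` entries, then hydrate their content."""
        newest = self.timelines.get(user_id, [])[-limit:][::-1]
        result = []
        for ts, message_id in newest:
            content = self.messages.get(message_id)
            if content is None and fallback is not None:
                # Cache miss: ask the source of truth (MySQL in the article).
                content = fallback(message_id)
            result.append((ts, message_id, content))
        return result
```

The read never joins or sorts anything. All of that work moved to write time, which is exactly where the next problem showed up.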

The latency dropped from 11,000ms to 47ms. P99 was 120ms.

But there was a problem.

3. The Write Amplification Disaster

Redis was fast for reads. But now every message sent in a 10,000-person channel required 10,000 timeline updates — one per user.

At peak, Slack was handling 26 million messages per day. If the average channel had 50 members, that was 1.3 billion Redis writes per day.

The Redis cluster couldn't keep up. Write latency spiked. Replication lag hit 10 seconds. They were trading a read bottleneck for a write bottleneck.
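The arithmetic behind that estimate is easy to check. The sketch below uses only the figures quoted above; the per-second value is a flat daily average that I derived from them, so real peak rates would be higher.

```python
MESSAGES_PER_DAY = 26_000_000
AVG_CHANNEL_MEMBERS = 50
SECONDS_PER_DAY = 24 * 60 * 60


def fanout_writes(messages: int, members: int) -> int:
    """Timeline writes needed when every message is pushed to every member."""
    return messages * members


timeline_writes_per_day = fanout_writes(MESSAGES_PER_DAY, AVG_CHANNEL_MEMBERS)
avg_writes_per_second = timeline_writes_per_day / SECONDS_PER_DAY

# A single message in a 10,000-person channel costs 10,000 timeline writes.
big_channel_cost = fanout_writes(1, 10_000)
```

Fan-out cost scales with channel size, not message size, so a handful of very large channels can dominate the write load.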

Siavash Ghorbani had the fix: lazy timeline updates.

Instead of updating 10,000 timelines immediately:

Store the message in MySQL (source of truth)
Publish a Kafka event: "Message X sent to Channel Y"
Update timelines only for users currently online (using presence data from WebSocket connections)
For offline users, mark their timeline as "stale" (a simple boolean in Redis)
When a stale user reconnects, rebuild their timeline on-demand from MySQL, then cache it

This cut Redis writes by 90%. Only active users got real-time updates. Inactive users got a fast rebuild on next login.
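Here is a rough Python sketch of the lazy update scheme, assuming a plain list as the source of truth in place of MySQL and leaving the Kafka event out entirely. `LazyTimelines`, its fields, and the `cache_writes` counter are hypothetical names used to show the idea.

```python
import bisect


class LazyTimelines:
    """Push to online users now; rebuild for everyone else when they reconnect."""

    def __init__(self):
        self.store = []        # source of truth: (ts, message_id, channel_id)
        self.members = {}      # channel_id -> set of user ids
        self.online = set()    # users with a live connection
        self.stale = set()     # users whose cached timeline is out of date
        self.timelines = {}    # user_id -> sorted list of (ts, message_id)
        self.cache_writes = 0  # timeline writes actually performed

    def send(self, ts, message_id, channel_id):
        self.store.append((ts, message_id, channel_id))  # durable write first
        for user in self.members.get(channel_id, ()):
            if user in self.online:
                bisect.insort(self.timelines.setdefault(user, []), (ts, message_id))
                self.cache_writes += 1
            else:
                self.stale.add(user)  # one flag instead of a timeline write

    def connect(self, user):
        """Mark the user online and return a fresh timeline."""
        self.online.add(user)
        if user in self.stale or user not in self.timelines:
            self.timelines[user] = self._rebuild(user)
            self.stale.discard(user)
        return self.timelines[user]

    def disconnect(self, user):
        self.online.discard(user)

    def _rebuild(self, user):
        # On-demand rebuild from the source of truth for this user's channels.
        channels = {c for c, users in self.members.items() if user in users}
        return sorted((ts, mid) for ts, mid, ch in self.store if ch in channels)
```

The trade is explicit: offline users cost nothing at send time, and pay for a rebuild once when they come back.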

Write amplification: solved.

4. Deploying Without Downtime

They had the architecture. Now they had to ship it — without taking Slack offline.

Julia Grace's deployment plan:

Shadow reads: Run both old and new code paths in parallel, compare results, log discrepancies. (8 hours)
Canary rollout: Route 1% of traffic to new path, monitor error rates. (12 hours)
Gradual ramp: 5%, 10%, 25%, 50%, 100% over 36 hours.
Kill switch: One environment variable flip to roll back instantly.
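A plan like this usually comes down to one small gate in front of the read path. The sketch below is an illustration under stated assumptions: the environment variable name `NEW_READ_PATH_DISABLED`, the `read_messages` wrapper, and the hash-based bucketing are all hypothetical, not details from Slack's rollout.

```python
import hashlib
import os


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def read_messages(user_id, old_path, new_path, percent, shadow=False, mismatches=None):
    # Kill switch: a hypothetical env var that forces every read onto the old path.
    if os.environ.get("NEW_READ_PATH_DISABLED") == "1":
        return old_path(user_id)
    if shadow:
        # Shadow read: always serve the old result, run the new path only to compare.
        expected = old_path(user_id)
        try:
            actual = new_path(user_id)
        except Exception as exc:  # a crash in the new path must not reach users
            actual = exc
        if actual != expected and mismatches is not None:
            mismatches.append(user_id)
        return expected
    # Canary and ramp: the same user stays in or out as long as percent is unchanged.
    return new_path(user_id) if in_rollout(user_id, percent) else old_path(user_id)
```

Because the bucket is derived from the user ID, raising the percentage from 5 to 25 only adds users. Nobody who was already on the new path gets bounced back mid-ramp.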

They deployed at 2 AM on day three (March 17th). By 6 AM, 100% of traffic was on the new architecture.

Average message load latency: 38 milliseconds. P99: 95 milliseconds. Down from 11,000 milliseconds.

WeWork renewed their contract the next morning.

The Numbers That Tell the Story

Before (March 14, 2019):

Message load latency (P50): 11,000ms
Database CPU utilization: 87% (mostly query planning)
Cache hit rate: 40%
Cross-shard queries per user session: 20+
MySQL replication lag: 30 seconds

After (March 18, 2019):

Message load latency (P50): 38ms (290x faster)
Database CPU utilization: 31%
Cache hit rate: 94%
Cross-shard queries per user session: 0.2 (only cache misses)
Redis read QPS: 2.3 million
Redis write QPS: 180,000 with lazy updates (down from 1.8 million without them)
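The two headline ratios follow directly from the figures listed above, as this short Python check shows. Nothing here is new data; the variable names are just labels for the quoted numbers.

```python
BEFORE_P50_MS = 11_000
AFTER_P50_MS = 38
# About 289.5, which the list above rounds to 290x.
speedup = BEFORE_P50_MS / AFTER_P50_MS

EAGER_WRITE_QPS = 1_800_000
LAZY_WRITE_QPS = 180_000
# Fraction of Redis writes avoided by lazy updates: 0.9, the 90% cut.
write_reduction = 1 - LAZY_WRITE_QPS / EAGER_WRITE_QPS
```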

The Legacy: Why This Still Matters

Slack's near-death experience in March 2019 exposed the fundamental tension in modern distributed systems: read optimization vs. write amplification.

The pattern they landed on, user-specific denormalized caches with lazy write propagation, shows up across the industry:

Instagram's feed uses a similar timeline-per-user model (stored in Cassandra instead of Redis)
Twitter rebuilt its timeline architecture in 2017 around the same principle (fan-out on read for celebrities, fan-out on write for everyone else)
LinkedIn's notification system uses Redis sorted sets for per-user timelines, exactly like Flannel
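The hybrid model mentioned for Twitter, pushing posts for ordinary accounts and pulling them for very large ones, can be sketched as follows. `HybridFeed` and its `celebrity_threshold` parameter are hypothetical names for illustration and do not describe any of these companies' actual implementations.

```python
class HybridFeed:
    """Fan out on write for small audiences, fan out on read for large ones."""

    def __init__(self, celebrity_threshold: int):
        self.threshold = celebrity_threshold
        self.followers = {}  # author -> set of follower ids
        self.inbox = {}      # user -> list of (ts, post_id), pushed at write time
        self.outbox = {}     # author -> list of (ts, post_id), pulled at read time

    def follow(self, user, author):
        self.followers.setdefault(author, set()).add(user)

    def post(self, author, ts, post_id) -> int:
        """Publish a post and return how many inbox writes it caused."""
        fans = self.followers.get(author, set())
        if len(fans) >= self.threshold:
            # Large audience: one write now, merged into timelines at read time.
            self.outbox.setdefault(author, []).append((ts, post_id))
            return 0
        for fan in fans:
            self.inbox.setdefault(fan, []).append((ts, post_id))
        return len(fans)

    def timeline(self, user):
        """Merge pushed posts with pulled posts, newest first."""
        items = list(self.inbox.get(user, []))
        for author, fans in self.followers.items():
            if user in fans:
                items.extend(self.outbox.get(author, []))
        return sorted(items, reverse=True)
```

The threshold is the tuning knob: set it low and reads get more expensive, set it high and a single popular post can flood the write path.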

But the deeper lesson is about technical debt at scale. Slack's MySQL-sharded architecture wasn't wrong in 2013. It was right for the access pattern at the time. The mistake was assuming it would scale forever.

By the time they hit 10 million DAU, the cost to migrate was existential. They couldn't stop shipping features. They couldn't take the platform offline. They had to rewrite the data layer while the plane was in the air.

And they did it in 72 hours.

Cal Henderson later wrote in Slack's engineering blog:

"The best architecture is the one that survives contact with reality. We didn't over-engineer in 2013, and we didn't under-react in 2019. We built what we needed, when we needed it, with the constraints we had."

Today, Slack handles 50 million daily active users, serves 150 billion messages per year, and maintains P99 latency under 100 milliseconds.

Because three engineers refused to let 11 seconds kill a $27 billion company.

Written by Swayam Mohanty
