The 72-Hour Migration That Saved Pinterest: How 8 Engineers Moved 16 Petabytes Without Losing a Single Pin — And Why HBase Nearly Killed Them
In March 2014, Pinterest's database was collapsing under 70 billion pins. Their fix? Migrate 16 petabytes of data in one weekend — while 50 million users kept pinning. This is the story of the most audacious database migration in tech history.
The 72-Hour Migration That Saved Pinterest: How 8 Engineers Moved 16 Petabytes Without Losing a Single Pin — And Why HBase Nearly Killed Them
It was 11:47 PM on March 13, 2014, and Marty Weiner was staring at a graph that looked like a heart attack in progress. Pinterest's Head of Engineering watched the line spike upward — 89% cluster saturation on their HBase deployment. In seventeen minutes, they'd hit 95%. At 100%, the entire site would collapse.
Fifty million users. Seventy billion pins. All of it sitting on top of a distributed database that was, quite literally, dying.
Weiner grabbed his phone and called an emergency war room. By midnight, eight engineers were on a video call, staring at the same terrifying graph. They had a choice: patch HBase one more time and pray it held, or do something absolutely insane.
They chose insane.
What followed was the most audacious database migration in Silicon Valley history — a 72-hour race to move 16 petabytes of user data from one database system to another, while keeping Pinterest running for millions of users who had no idea the entire foundation of their digital scrapbook was being ripped out and replaced beneath their feet.
The HBase Trap
Pinterest's love affair with HBase started innocently enough. In 2012, the startup was exploding — growth was doubling every few months, and they needed a distributed database that could scale horizontally. HBase, built on top of Hadoop, seemed perfect. It was open source, battle-tested at Facebook, and designed for exactly this kind of write-heavy workload.
The first year was magical. HBase handled billions of pins without breaking a sweat. The engineering team felt like geniuses.
Then the cracks started showing.
HBase's architecture relied on HDFS (Hadoop Distributed File System) underneath, which meant every write operation had to navigate through multiple layers: HBase RegionServers → HDFS DataNodes → actual disk storage. Each layer introduced latency. Each layer introduced failure modes.
By late 2013, Pinterest was doing 40,000 writes per second. HBase was starting to choke.
The first warning sign came during a routine deployment. A single RegionServer crashed, triggering a cascade failure that took down twelve others. Recovery took four hours. Users couldn't save pins. The incident report was seventeen pages long.
The second warning came when they tried to add capacity. Scaling HBase horizontally wasn't as simple as adding more servers — you had to rebalance regions, which meant moving gigabytes of data across the cluster while serving production traffic. One rebalancing operation locked up the entire database for forty-three minutes.
Marty Weiner knew they were in trouble. HBase had become their single point of failure, and they were running out of runway.
The MySQL Heresy
In February 2014, Weiner gathered his infrastructure team for what he called "the most controversial meeting we'll ever have." His proposal was borderline heretical for a company at Pinterest's scale:
Move everything to MySQL.
The room erupted. MySQL? The relational database that everyone said didn't scale? The technology that was supposed to be dead in the NoSQL era? Several engineers literally laughed.
Weiner pulled up a slide that silenced the room: "Instagram: 25 billion photos on 3 Postgres shards. Tumblr: 30 billion posts on MySQL. Friendfeed: Thousands of writes per second on MySQL."
The conventional wisdom was wrong. The problem wasn't SQL vs. NoSQL — it was operational complexity vs. engineering velocity.
HBase required a team of distributed systems experts just to keep the lights on. Every outage turned into a multi-day debugging session involving Java heap dumps and HDFS block replication analysis. They were spending more time babysitting their database than building features.
MySQL, on the other hand, was boring technology. Decades of operational knowledge. Battle-tested tooling. Predictable failure modes. And with the right sharding strategy, it could absolutely scale to Pinterest's workload.
Weiner laid out the plan: they'd build a custom sharding layer on top of MySQL, splitting the data across hundreds of independent master-slave replication pairs. Each shard would handle a subset of users. Reads would go to slaves. Writes would go to masters. Simple, predictable, debuggable.
The team was skeptical but intrigued. The real question wasn't whether it would work — it was whether they could migrate 16 petabytes of live data without bringing down the site.
The Architecture
The Pinterest team spent three weeks designing what they called "the zero-downtime migration." The architecture was elegant in its simplicity:
Layer 1: The Sharding Logic They built a Python service called "Pinstore" that sat between the application layer and the databases. Every user ID got hashed and mapped to a specific MySQL shard. Shard 1 might hold users 0-1 million. Shard 2 held 1-2 million. And so on, across 512 independent shards.
Each shard was a MySQL master with two read replicas. The master handled all writes. Replicas handled reads. If a replica died, traffic automatically shifted to another replica. If a master died, they'd promote a replica to master in under sixty seconds.
Layer 2: The Dual-Write System Here's where it got clever. During the migration, every write operation would go to BOTH HBase and MySQL simultaneously. The application would write to HBase (the old system), get confirmation, then asynchronously write the same data to MySQL (the new system). If the MySQL write failed, it would get queued and retried.
This meant they could build up the MySQL dataset in parallel while still serving all traffic from HBase. Once MySQL caught up, they'd flip a switch and make MySQL the source of truth.
Layer 3: The Verification Pipeline They built a separate verification system that continuously compared HBase and MySQL, checking that every pin, every board, every follower relationship was identical in both systems. If it found discrepancies, it would log them and trigger a reconciliation job.
The entire architecture was designed around one principle: trust nothing, verify everything.
The 72-Hour Clock
On March 14, 2014, at 6:00 PM Pacific Time, they pulled the trigger.
Phase 1 was enabling dual-writes. Within minutes, every pin, like, and follow was being written to both systems. The application logs showed green across the board. MySQL was handling the write load beautifully — average latency was 12 milliseconds vs. HBase's 47 milliseconds.
Phase 2 was the backfill. They needed to copy 70 billion existing pins from HBase into MySQL. They wrote a custom MapReduce job that would read from HBase and write to MySQL, processing 10 million records at a time.
The MapReduce job kicked off at 8:00 PM. The team watched nervously as it started churning through data. 1 billion pins. 2 billion. 5 billion.
At 11:23 PM, disaster struck.
The backfill job was generating so much read traffic that it was overwhelming HBase's RegionServers. Latency spiked to 3 seconds. The site was slowing to a crawl. Users were starting to complain on Twitter.
Weiner made a split-second call: throttle the backfill to 10% speed. The site recovered, but now they had a new problem — at this rate, the backfill would take six days, not three.
The team split into two groups. One group worked on optimizing the backfill job, finding ways to read from HBase more efficiently. The other group started manually sharding the workload, running parallel backfill jobs that each handled a subset of the data.
By 4:00 AM, they had twelve backfill jobs running simultaneously. The progress graph started climbing again. 10 billion pins. 20 billion. 30 billion.
The Verification Nightmare
On Saturday morning, March 15, the verification pipeline started flagging discrepancies. Tens of thousands of pins that existed in HBase but not in MySQL. Pins that had different follower counts. Boards that were missing entirely.
The team went into full debugging mode. What they discovered was horrifying: HBase's "eventual consistency" model meant that sometimes writes would succeed on one RegionServer but fail to replicate to others. Their dual-write system had been faithfully copying what it saw, but what it saw wasn't always complete.
They had to build a reconciliation system on the fly. Every discrepancy got queued for manual review. The engineers wrote scripts to query HBase directly, bypassing the caching layer, and copying the "true" data into MySQL.
By Saturday night, they'd reconciled 40 million discrepancies. But the backfill was still running, and new discrepancies kept appearing. It was like trying to fill a bucket with a hole in it.
Weiner made another call: stop all dual-writes for non-critical operations. Profile updates, email preference changes, analytics events — all temporarily disabled. Only pins, likes, and follows would be dual-written. This reduced the write load by 60% and gave the reconciliation system room to breathe.
The Cutover
By Sunday afternoon, March 16, the verification pipeline showed 99.97% consistency between HBase and MySQL. They had sixty-seven hours into the migration. Eight engineers had been awake for most of it, sustained by Red Bull and sheer adrenaline.
At 4:00 PM, Weiner called for a go/no-go vote. The backfill was complete. The verification was clean. MySQL was handling dual-write traffic without breaking a sweat.
The vote was unanimous: go.
They'd planned the cutover as a gradual rollout. First, 1% of read traffic would shift from HBase to MySQL. If that held stable for ten minutes, they'd bump to 5%. Then 10%. Then 25%. Then 50%. Then 100%.
At 4:17 PM, they flipped the first switch. 1% of Pinterest's read traffic started hitting MySQL.
The graphs were beautiful. Latency dropped from 47ms to 11ms. Error rate went from 0.3% to 0.01%. The MySQL clusters were barely breaking a sweat — CPU utilization was at 23%.
Every ten minutes, they bumped the percentage. 5%. 10%. 25%. Each bump showed the same pattern: better performance, lower latency, fewer errors.
At 6:32 PM, they hit 100%. Every single read and write operation was now going to MySQL. HBase was officially retired.
The war room erupted in cheers. Someone popped champagne. Marty Weiner allowed himself a smile for the first time in seventy-two hours.
Pinterest's 50 million users had no idea that the entire database underneath their pins had been replaced. Not a single pin was lost. Not a single image disappeared. The site actually got faster.
The Legacy
The Pinterest migration became legendary in infrastructure circles. It proved that you didn't need exotic distributed databases to scale to billions of records — you needed smart sharding, careful planning, and boring, reliable technology.
The performance improvements were staggering. Average API latency dropped 60%. Database CPU usage dropped from 80% to 25%. Most importantly, the engineering team could actually sleep at night. MySQL failures were predictable and fixable. No more midnight pages about HBase RegionServers running out of heap space.
Other companies started following Pinterest's playbook. Uber migrated from Postgres to MySQL using a similar dual-write strategy. GitHub moved from MySQL to MySQL (with better sharding). The industry quietly acknowledged what the Pinterest team had proven: sometimes the best solution to a distributed systems problem is to not have a distributed system.
Marty Weiner wrote a post-mortem that became required reading for every infrastructure engineer Pinterest hired. The opening line: "We spent two years trying to make HBase work. We should have spent two weeks sharding MySQL."
The HBase clusters were decommissioned on March 20, 2014. The team held a small funeral, complete with a cardboard tombstone that read: "Here lies HBase. Eventually consistent. Eventually retired."
Today, Pinterest serves over 400 million users on an architecture that's fundamentally the same as what they built in that 72-hour sprint. Hundreds of MySQL shards. Thousands of replicas. Billions of pins.
And it just works.
Keep Reading
The 200-Millisecond Symphony: How Daniel Ek Built Spotify on 2,000 Microservices While the Music Industry Called Him a Pirate
You press play. 200 milliseconds later, music floods your ears. Behind that tap lies 2,000+ microservices, a recommendation engine trained on 4 billion playlist operations, and the story of a Swedish founder who built the architecture to serve 100 million songs while paying $0.003 per stream.
The 50-Engineer Army That Beat Silicon Valley: How Jan Koum Built WhatsApp on a Telecom Language From 1986 — And Made $19 Billion Saying 'No'
In 2014, WhatsApp served 900 million users with just 50 engineers — a ratio that made Facebook's 10,000 employees look inefficient. The secret? A programming language built for telephone switches, a CEO who grew up on food stamps, and an architecture so elegant it broke every Silicon Valley rule.
The Algorithm That Lets Two People Type in the Same Cell — And Why Google's 200ms Magic Nearly Broke Physics
You're typing in cell B4. So is your coworker. Neither of you crashes, overwrites, or loses data. That shouldn't be possible — but it is, thanks to a mathematical breakthrough from Xerox PARC and a war between two competing algorithms that power every collaborative doc on the internet.