🏗️ System Design · June 11, 2026 at 8:29 AM · 9 min read

The 4AM Database Fire: How Uber's Engineer Discovered MySQL Was Deleting Rider Data — And Built Schemaless in 6 Months to Save 100 Million Trips

In 2014, Uber's databases were silently corrupting trip data across 300 cities. One engineer's 4AM discovery led to a complete rewrite of how Uber stores everything — and why they abandoned MySQL for a system that broke every database rule.

Uber, System Design, Schemaless, Distributed Systems, MySQL, Database Architecture, Evan Klitzke, Append-Only, Eventual Consistency, NoSQL, Infrastructure, Data Engineering

It was 4:17 AM on a Tuesday in September 2014. Uber engineer Evan Klitzke was staring at his laptop in the San Francisco office, watching something that shouldn't be possible: rider trip data was disappearing from the database. Not crashing. Not corrupting. Just... vanishing.

The production MySQL cluster serving 300 cities and 100 million trips per month was experiencing what the team would later call "silent data loss." Writes were succeeding. API requests were returning 200 OK. But somewhere between the application layer and disk, reality was being quietly rewritten.

By sunrise, Klitzke had discovered something worse: Uber's entire data architecture was fundamentally incompatible with the physics of how the company actually worked.

This is the story of how one engineer's all-nighter led Uber to abandon MySQL as a relational database and build a storage system so radical that it deleted the concept of "schema" from the codebase.

The Database That Lied

Uber's architecture in 2014 looked like every other successful startup's: a fleet of MySQL masters and replicas, carefully sharded across geographic regions. San Francisco trips lived in sfo-mysql-01. New York trips lived in nyc-mysql-03. Standard stuff.

Except Uber wasn't a standard company.

A rider in San Francisco could request a trip to the airport, get picked up by a driver who started in Oakland, cross through Daly City, and end at SFO — crossing four different database shards in a single 22-minute trip. The trip record would start writing to sfo-mysql-01, then need to update in oakland-mysql-02, then again in daly-mysql-01, then finally in sfo-airport-mysql-04.

MySQL's replication model couldn't handle it. The database was designed for a world where data lived in one place and stayed there. Uber's data lived everywhere and moved constantly.

Klitzke discovered the problem while debugging a customer support ticket: a rider claimed they'd been charged for a trip that didn't show in their history. When he queried the database directly, the trip existed. When he queried through the API, it didn't. When he queried again five minutes later, it was gone entirely.

The culprit: replication lag and cross-shard write conflicts were creating a scenario where MySQL's binlog was silently discarding writes that violated foreign key constraints across shards.

In plain English: Uber's database was designed for a single restaurant. Uber had built a global food delivery service where the kitchen moved at 60 mph.

The Meeting Where Travis Kalanick Said "Fix It or We're Dead"

Klitzke brought the findings to a 9 AM architecture meeting with CTO Thuan Pham and CEO Travis Kalanick. The presentation was blunt:

Current state: MySQL is losing approximately 0.02% of trip data due to cross-shard replication conflicts
Scale projection: At current growth (3x year-over-year), data loss would hit 0.1% by 2015 — unacceptable for financial transactions
Root cause: Relational databases assume static data with fixed relationships. Uber's data is inherently graph-like and geospatially distributed.

Kalanick's response, according to engineers in the room: "We're a transportation company that can't reliably track transportation. Fix it or we're dead."

The constraint: Uber couldn't stop the product to rebuild the database. The company was processing 1 million trips per day across 300 cities. A migration would need to happen while the car was driving.

Pham gave Klitzke and a team of eight engineers six months to build a replacement. The codename: Schemaless.

The Heresy: Deleting the Schema

Traditional databases work like filing cabinets with labeled folders. Before you can store anything, you define the structure: "Trip records will have these 47 fields in this exact format." Change the format? Migrate the entire database.

Uber's engineers made a radical bet: what if we just stored JSON blobs and let the application layer handle structure?

This was heresy in 2014. Every database textbook warned against "schema-on-read" architectures. You'd lose:

Data integrity: No foreign keys means no guaranteed relationships
Query performance: Can't index what you don't know exists
Consistency: Different application versions could write incompatible data

But Uber's team realized something: they were already losing those guarantees with MySQL. The sharding and cross-region replication meant foreign keys were advisory at best. Queries were already slow because they had to fan out across dozens of shards. Consistency was already eventual because replication lag could hit 30+ seconds during peak hours.

Schemaless wouldn't make things worse. It would make the existing problems explicit and solvable.

The Architecture: Cells, Buffers, and Append-Only Reality

Here's how Schemaless actually works under the hood:

1. Storage Cells (Not Shards)

Instead of sharding by geography (which created the cross-shard nightmare), Schemaless shards by entity type and UUID. Every record gets a UUID. That UUID deterministically maps to a storage cell. A trip record with UUID 550e8400-e29b-41d4-a716-446655440000 always lives in the same cell, regardless of where the trip happens geographically.

Each cell is a cluster of MySQL instances (ironically, Schemaless is built on MySQL — they just use it like a key-value store, not a relational database). But the UUID-based routing means zero cross-shard writes.
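
The routing idea can be sketched in a few lines. This is a minimal illustration, not Uber's actual code: the cell count, the choice of SHA-256, and the name cell_for are all assumptions made for the example.

```python
import hashlib
import uuid

NUM_CELLS = 16  # hypothetical cell count, chosen only for illustration


def cell_for(record_id: uuid.UUID, num_cells: int = NUM_CELLS) -> int:
    """Map a record UUID to a storage cell index, deterministically."""
    # Hash the raw UUID bytes so the result does not depend on Python's
    # per-process hash randomization.
    digest = hashlib.sha256(record_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


trip_id = uuid.UUID("550e8400-e29b-41d4-a716-446655440000")
# The same UUID resolves to the same cell no matter which host asks.
same_cell = cell_for(trip_id) == cell_for(uuid.UUID(str(trip_id)))
```

Because the cell depends only on the UUID, every later write for that trip goes to the same place. One trade-off of plain modulo hashing is that changing the cell count remaps most records, so a real system would likely fix the number of logical cells up front.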

2. Append-Only Logs (Not Updates)

Schemaless never updates records. Ever.

When a trip's status changes from "requested" to "accepted" to "completed," Schemaless doesn't UPDATE the row. It appends a new version of the record with an incremented version number.

The data model looks like this:

Primary Key: (UUID, version_number)
Payload: JSON blob
Timestamp: Write time
Deleted: Boolean flag

To read the current state of a trip, you query for the highest version number. To see history, you scan all versions. To "delete" something, you write a new version with deleted=true.
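
As a rough sketch of that data model, here is an in-memory version. The class and method names (AppendOnlyStore, append, latest, history, delete) are hypothetical, and a Python dict stands in for the MySQL-backed storage cell.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Version:
    uuid: str
    version: int
    payload: str       # JSON blob, opaque to the store
    timestamp: float   # write time
    deleted: bool = False


@dataclass
class AppendOnlyStore:
    # uuid -> versions in ascending version order; rows are never modified
    _log: Dict[str, List[Version]] = field(default_factory=dict)

    def append(self, uuid: str, body: dict, deleted: bool = False) -> Version:
        versions = self._log.setdefault(uuid, [])
        row = Version(uuid, len(versions) + 1, json.dumps(body), time.time(), deleted)
        versions.append(row)
        return row

    def latest(self, uuid: str) -> Optional[dict]:
        """Current state: the highest version, unless it is a tombstone."""
        versions = self._log.get(uuid)
        if not versions or versions[-1].deleted:
            return None
        return json.loads(versions[-1].payload)

    def history(self, uuid: str) -> List[Version]:
        return list(self._log.get(uuid, []))

    def delete(self, uuid: str) -> Version:
        # "Deleting" writes one more version with deleted=True.
        return self.append(uuid, {}, deleted=True)


store = AppendOnlyStore()
trip = "550e8400-e29b-41d4-a716-446655440000"
for status in ("requested", "accepted", "completed"):
    store.append(trip, {"status": status})
```

After the loop, store.latest(trip) returns the "completed" payload while store.history(trip) still holds all three versions.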

This sounds insane from a storage perspective, since every update duplicates the record. But it solved Uber's core problem: append-only writes never overwrite existing data, so concurrent writers cannot destroy each other's changes. Two datacenters can independently write new versions of the same record. When replication catches up and both have written the same version number, you simply keep the version with the higher timestamp.

3. In-Memory Buffer Layer

Reading the highest version from disk for every query would be slow. Schemaless adds an in-memory buffer cache (originally Memcached, later migrated to a custom C++ solution) that holds the latest version of every recently accessed record.

Read flow:

Application requests trip 550e8400...
Buffer cache returns latest version if present (cache hit: ~800μs)
If miss, query storage cell for highest version number (cache miss: ~15ms)
Store result in buffer cache
Return to application

Write flow:

Application writes new version of trip
Generate UUID (if new record) or increment version (if update)
Write to local storage cell (async, ~5ms acknowledgment)
Invalidate buffer cache entry
Asynchronously replicate to other regions
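
The two flows above can be sketched as a read-through cache with invalidate-on-write. Everything here is a simplified stand-in: two dicts play the storage cell and the buffer cache, the CachedStore name is made up for the example, and the asynchronous cross-region replication step is left out.

```python
from typing import Dict, List, Optional


class CachedStore:
    """Read-through cache in front of a versioned store (both in-memory)."""

    def __init__(self) -> None:
        self._cell: Dict[str, List[dict]] = {}  # stand-in for a storage cell
        self._cache: Dict[str, dict] = {}       # stand-in for the buffer cache
        self.hits = 0
        self.misses = 0

    def read(self, uuid: str) -> Optional[dict]:
        if uuid in self._cache:           # cache hit
            self.hits += 1
            return self._cache[uuid]
        self.misses += 1                  # cache miss: go to the storage cell
        versions = self._cell.get(uuid)
        if not versions:
            return None
        latest = versions[-1]             # highest version is stored last
        self._cache[uuid] = latest        # populate the cache for later reads
        return latest

    def write(self, uuid: str, payload: dict) -> int:
        versions = self._cell.setdefault(uuid, [])
        versions.append(payload)          # append; never overwrite
        self._cache.pop(uuid, None)       # invalidate the stale cache entry
        return len(versions)              # the new version number


s = CachedStore()
s.write("trip-1", {"status": "requested"})
first = s.read("trip-1")    # miss, fills the cache
second = s.read("trip-1")   # hit
s.write("trip-1", {"status": "accepted"})
third = s.read("trip-1")    # miss again, because the write invalidated the entry
```

Invalidating rather than updating the cache on write keeps the write path simple, at the cost of one slower read after every write.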

4. Eventually Consistent Geography

Schemaless embraced eventual consistency explicitly. When you request a trip in San Francisco and the driver is in Oakland, here's what happens:

Trip record writes to the SF region storage cell (because that's where the trip's UUID hashes to)
Driver accepts trip in Oakland region
Oakland writes a new version with status=accepted
SF and Oakland eventually sync (typically <100ms, worst case 2-3 seconds)
Both regions converge on the same version history

The key insight: Uber doesn't need strong consistency. It needs eventual consistency with monotonic version ordering. As long as version numbers always increase and timestamps are synchronized, the system will converge on the same reality.
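
Here is one way that convergence rule could look in code. It is a sketch under the article's stated assumptions (synchronized timestamps, increasing version numbers); the Row type and merge function are hypothetical names, and the payload tie-break is an extra assumption added so that equal timestamps still resolve the same way everywhere.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Row(NamedTuple):
    uuid: str
    version: int
    timestamp: float
    payload: str


def merge(*replicas: Iterable[Row]) -> List[Row]:
    """Union the rows from several regions into one ordered history.

    If two regions wrote the same (uuid, version), the row with the later
    timestamp wins; ties fall back to comparing payloads so that every
    region picks the same winner.
    """
    winners: Dict[Tuple[str, int], Row] = {}
    for replica in replicas:
        for row in replica:
            key = (row.uuid, row.version)
            current = winners.get(key)
            if current is None or (row.timestamp, row.payload) > (
                current.timestamp, current.payload
            ):
                winners[key] = row
    return sorted(winners.values(), key=lambda r: (r.uuid, r.version))


sf = [Row("t1", 1, 100.0, '{"status": "requested"}')]
oakland = [
    Row("t1", 1, 100.0, '{"status": "requested"}'),
    Row("t1", 2, 101.5, '{"status": "accepted"}'),
]
converged = merge(sf, oakland)
```

The merge gives the same result whichever region's rows arrive first and however many times it is repeated, which is what lets both regions settle on one history.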

The 6-Month Migration: Moving 25 Billion Records Without Stopping the Car

By March 2015, Schemaless was ready for production. But Uber had accumulated 25 billion trip records in MySQL. The migration plan was surgical:

Phase 1: Dual Writes (Weeks 1-4)

Application layer writes to both MySQL and Schemaless
Reads still come from MySQL
Verify write consistency between systems

Phase 2: Shadow Reads (Weeks 5-8)

Reads come from MySQL (primary)
Shadow reads query Schemaless and compare results
Alert on any discrepancies
Fix bugs in Schemaless query layer
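
Phases 1 and 2 can be sketched together as a thin proxy in front of both stores. This is an illustration only: MigrationProxy is a made-up name, plain dicts stand in for the MySQL and Schemaless paths, and a real shadow read would more likely run off the request path than inline.

```python
from typing import Dict, List, Optional


class MigrationProxy:
    """Dual-writes to both stores; serves reads from the old one and
    compares each result against the new one."""

    def __init__(self, old: Dict[str, dict], new: Dict[str, dict]) -> None:
        self.old = old   # stand-in for the legacy MySQL path
        self.new = new   # stand-in for the Schemaless path
        self.mismatches: List[str] = []

    def write(self, key: str, value: dict) -> None:
        # Phase 1: every write goes to both systems.
        self.old[key] = value
        self.new[key] = dict(value)

    def read(self, key: str) -> Optional[dict]:
        # Phase 2: the old store stays authoritative; the new store is
        # queried only to check that it agrees.
        primary = self.old.get(key)
        shadow = self.new.get(key)
        if shadow != primary:
            self.mismatches.append(key)  # would raise an alert in production
        return primary


proxy = MigrationProxy(old={}, new={})
proxy.write("trip-1", {"status": "completed", "fare": 23.5})
result = proxy.read("trip-1")
```

Callers never see the new store's answer during this phase, so a bug in the new query layer shows up as a logged mismatch rather than as a user-facing error.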

Phase 3: Gradual Cutover (Weeks 9-20)

Route 1% of read traffic to Schemaless
Monitor latency (target: p99 < 50ms), error rate (< 0.01%), data consistency
Increase to 5%, 10%, 25%, 50%, 100% over 12 weeks
Keep MySQL as fallback
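
A percentage rollout like this is commonly implemented by hashing a stable key into buckets. The sketch below assumes that approach; the article does not say how Uber split the traffic, and routes_to_new and ROLLOUT_STEPS are hypothetical names.

```python
import hashlib


def routes_to_new(request_key: str, percent: float) -> bool:
    """Return True if this key falls in the slice served by the new store."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100                        # 1% -> buckets 0..99


ROLLOUT_STEPS = [1, 5, 10, 25, 50, 100]  # percentages from the plan above

keys = [f"trip-{i}" for i in range(10_000)]
share_at_50 = sum(routes_to_new(k, 50) for k in keys) / len(keys)
```

Hashing the key instead of rolling a random number keeps each record on the same side of the split as the percentage grows, so a key that moved to the new store at 5% stays there at 10% and beyond.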

Phase 4: Historical Backfill (Weeks 21-24)

Migrate 25 billion historical records from MySQL to Schemaless
Use Hadoop MapReduce jobs to read MySQL, transform to append-only format, write to Schemaless
Rate-limited to avoid overwhelming storage cells (target: 10K records/second/cell)
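
The rate limit and the backfill window can be checked with a small sketch. The pacing function is a generic throttle, not the Hadoop job itself, and the clock and sleep parameters are assumptions added so it can be tested without waiting. The number of cells in use during the migration is not given here, so it is left as a parameter.

```python
import time
from typing import Callable, Iterable, Iterator


def rate_limited(
    records: Iterable[dict],
    per_second: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield records no faster than `per_second`, spacing them evenly."""
    interval = 1.0 / per_second
    next_slot = clock()
    for record in records:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)
        yield record
        # Never schedule in the past, so a stall is not followed by a burst.
        next_slot = max(next_slot, clock()) + interval


def backfill_days(total_records: int, per_second_per_cell: int, cells: int) -> float:
    """Estimated wall-clock days to backfill at a steady rate."""
    return total_records / (per_second_per_cell * cells) / 86_400


single_cell_days = backfill_days(25_000_000_000, 10_000, cells=1)
```

At 10K records per second, a single cell would need about 29 days for 25 billion records, so the four-week window only works comfortably if the load is spread across several cells.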

The full migration took 24 weeks. By August 2015, 100% of Uber's trip data lived in Schemaless. The legacy sharded MySQL cluster was decommissioned three months later.

The Numbers: Why It Worked

Before Schemaless (MySQL sharded architecture):

Write latency: p99 of 200ms (cross-shard writes)
Read latency: p99 of 150ms (fan-out queries)
Data loss incidents: 2-3 per week
Schema changes: Required 6-8 week migration windows

After Schemaless:

Write latency: p99 of 12ms (single-cell writes)
Read latency: p99 of 8ms (buffer cache hits), p99 of 45ms (cache miss)
Data loss incidents: Zero in first 18 months
Schema changes: Zero — applications just write different JSON structures

By 2016, Schemaless was handling:

40 million trips per day
300+ billion stored versions (remember: append-only)
15 petabytes of storage across 1,200 storage cells
Serving 5 million QPS at peak hours

The Legacy: The Database That Taught Silicon Valley to Embrace Chaos

Schemaless proved something controversial: sometimes the right architecture isn't the one with the strongest guarantees — it's the one that matches how your data actually behaves.

Uber's data is messy, distributed, and constantly moving. Trying to force it into a relational model with ACID guarantees was like trying to fit the ocean into a filing cabinet.

The Schemaless pattern influenced an entire generation of database design:

Stripe built a similar append-only ledger system for payment records
Airbnb adopted schema-on-read for listing data that varies by country and regulation
DoorDash uses append-only writes for order tracking across restaurants, drivers, and customers

Evan Klitzke left Uber in 2016 to work on distributed systems at Google. But his 4AM discovery changed how tech companies think about data at scale.

In 2024, Schemaless still runs under the hood of every Uber trip. The system that was supposed to be a temporary fix became permanent infrastructure.

Because sometimes, the best database is the one that admits what databases can't do — and gets out of the way.

Written by Swayam Mohanty

Untold stories behind the tech giants, legendary moments, and the code that changed the world.
