The 4AM Database Fire: How Uber's Engineer Discovered MySQL Was Deleting Rider Data — And Built Schemaless in 6 Months to Save 100 Million Trips
In 2014, Uber's databases were silently corrupting trip data across 300 cities. One engineer's 4AM discovery led to a complete rewrite of how Uber stores everything, and to the decision to abandon relational MySQL for a system that broke every database rule.
It was 4:17 AM on a Tuesday in September 2014. Uber engineer Evan Klitzke was staring at his laptop in the San Francisco office, watching something that shouldn't be possible: rider trip data was disappearing from the database. Not crashing. Not corrupting. Just... vanishing.
The production MySQL cluster serving 300 cities and 100 million trips per month was experiencing what the team would later call "silent data loss." Writes were succeeding. API requests were returning 200 OK. But somewhere between the application layer and disk, reality was being quietly rewritten.
By sunrise, Klitzke had discovered something worse: Uber's entire data architecture was fundamentally incompatible with the physics of how the company actually worked.
This is the story of how one engineer's all-nighter led Uber to abandon MySQL as a relational database for trip data, and to build a storage system so radical that it deleted the concept of "schema" from the codebase.
The Database That Lied
Uber's architecture in 2014 looked like every other successful startup's: a fleet of MySQL masters and replicas, carefully sharded across geographic regions. San Francisco trips lived in sfo-mysql-01. New York trips lived in nyc-mysql-03. Standard stuff.
Except Uber wasn't a standard company.
A rider in San Francisco could request a trip to the airport, get picked up by a driver who started in Oakland, cross through Daly City, and end at SFO — crossing four different database shards in a single 22-minute trip. The trip record would start writing to sfo-mysql-01, then need to update in oakland-mysql-02, then again in daly-mysql-01, then finally in sfo-airport-mysql-04.
MySQL's replication model couldn't handle it. The database was designed for a world where data lived in one place and stayed there. Uber's data lived everywhere and moved constantly.
Klitzke discovered the problem while debugging a customer support ticket: a rider claimed they'd been charged for a trip that didn't show in their history. When he queried the database directly, the trip existed. When he queried through the API, it didn't. When he queried again five minutes later, it was gone entirely.
The culprit: replication lag and cross-shard write conflicts were creating a scenario where MySQL's binlog was silently discarding writes that violated foreign key constraints across shards.
In plain English: Uber's database was designed for a single restaurant. Uber had built a global food delivery service where the kitchen moved at 60 mph.
The Meeting Where Travis Kalanick Said "Fix It or We're Dead"
Klitzke brought the findings to a 9 AM architecture meeting with CTO Thuan Pham and CEO Travis Kalanick. The presentation was blunt:
- Current state: MySQL is losing approximately 0.02% of trip data due to cross-shard replication conflicts
- Scale projection: At current growth (3x year-over-year), data loss would hit 0.1% by 2015 — unacceptable for financial transactions
- Root cause: Relational databases assume static data with fixed relationships. Uber's data is inherently graph-like and geospatially distributed.
Kalanick's response, according to engineers in the room: "We're a transportation company that can't reliably track transportation. Fix it or we're dead."
The constraint: Uber couldn't stop the product to rebuild the database. The company was processing 1 million trips per day across 300 cities. A migration would need to happen while the car was driving.
Pham gave Klitzke and a team of eight engineers six months to build a replacement. The codename: Schemaless.
The Heresy: Deleting the Schema
Traditional databases work like filing cabinets with labeled folders. Before you can store anything, you define the structure: "Trip records will have these 47 fields in this exact format." Change the format? Migrate the entire database.
Uber's engineers made a radical bet: what if we just stored JSON blobs and let the application layer handle structure?
This was heresy in 2014. Every database textbook warned against "schema-on-read" architectures. You'd lose:
- Data integrity: No foreign keys means no guaranteed relationships
- Query performance: Can't index what you don't know exists
- Consistency: Different application versions could write incompatible data
But Uber's team realized something: they were already losing those guarantees with MySQL. The sharding and cross-region replication meant foreign keys were advisory at best. Queries were already slow because they had to fan out across dozens of shards. Consistency was already eventual because replication lag could hit 30+ seconds during peak hours.
Schemaless wouldn't make things worse. It would make the existing problems explicit and solvable.
The Architecture: Cells, Buffers, and Append-Only Reality
Here's how Schemaless actually works under the hood:
1. Storage Cells (Not Shards)
Instead of sharding by geography (which created the cross-shard nightmare), Schemaless shards by entity type and UUID. Every record gets a UUID. That UUID deterministically maps to a storage cell. A trip record with UUID 550e8400-e29b-41d4-a716-446655440000 always lives in the same cell, regardless of where the trip happens geographically.
Each cell is a cluster of MySQL instances (ironically, Schemaless is built on MySQL — they just use it like a key-value store, not a relational database). But the UUID-based routing means zero cross-shard writes.
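The UUID-to-cell routing described above can be sketched in a few lines. This is a minimal illustration, not Uber's actual routing code: the hash function (SHA-256), the modulo reduction, and the names `NUM_CELLS` and `cell_for` are all assumptions made for the example. The cell count of 1,200 is borrowed from the 2016 figure quoted later in this article.

```python
import hashlib
import uuid

# Hypothetical cell count, borrowed from the 2016 figure in this article.
NUM_CELLS = 1200


def cell_for(record_uuid: uuid.UUID, num_cells: int = NUM_CELLS) -> int:
    """Map a UUID to a storage cell index, deterministically."""
    # Hash the 16 raw bytes so the result does not depend on the UUID's layout.
    digest = hashlib.sha256(record_uuid.bytes).digest()
    # Read the first 8 bytes as an unsigned integer, then reduce to a cell index.
    return int.from_bytes(digest[:8], "big") % num_cells


trip_id = uuid.UUID("550e8400-e29b-41d4-a716-446655440000")

# The same UUID always lands in the same cell, wherever the trip takes place.
print(cell_for(trip_id) == cell_for(trip_id))
```

One caveat with this sketch: a plain modulo remaps most records whenever the cell count changes. Systems of this kind usually hash to a fixed number of logical shards and then assign those shards to physical clusters, so capacity can grow without rehashing.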
2. Append-Only Logs (Not Updates)
Schemaless never updates records. Ever.
When a trip's status changes from "requested" to "accepted" to "completed," Schemaless doesn't UPDATE the row. It appends a new version of the record with an incremented version number.
The data model looks like this:
Primary Key: (UUID, version_number)
Payload: JSON blob
Timestamp: Write time
Deleted: Boolean flag
To read the current state of a trip, you query for the highest version number. To see history, you scan all versions. To "delete" something, you write a new version with deleted=true.
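To make the data model concrete, here is a small in-memory sketch of an append-only store with the four fields listed above. The class and method names (`Version`, `AppendOnlyStore`, `put`, `get`, `history`, `delete`) are invented for illustration, and a Python dictionary stands in for the MySQL-backed storage cell.

```python
import json
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    uuid: str
    version: int       # (uuid, version) together act as the primary key
    payload: str       # JSON blob
    timestamp: float   # write time
    deleted: bool = False


class AppendOnlyStore:
    """Toy stand-in for one storage cell: rows are appended, never updated."""

    def __init__(self) -> None:
        self._rows: dict[str, list[Version]] = {}

    def put(self, uuid: str, body: dict, deleted: bool = False) -> Version:
        history = self._rows.setdefault(uuid, [])
        # A new version is appended; existing rows are left untouched.
        row = Version(uuid, len(history) + 1, json.dumps(body), time.time(), deleted)
        history.append(row)
        return row

    def delete(self, uuid: str) -> Version:
        # "Deleting" writes a tombstone version instead of removing rows.
        return self.put(uuid, {}, deleted=True)

    def get(self, uuid: str) -> Optional[dict]:
        history = self._rows.get(uuid)
        if not history:
            return None
        # Current state is the row with the highest version number.
        latest = max(history, key=lambda v: v.version)
        return None if latest.deleted else json.loads(latest.payload)

    def history(self, uuid: str) -> list[Version]:
        return sorted(self._rows.get(uuid, []), key=lambda v: v.version)


store = AppendOnlyStore()
for status in ("requested", "accepted", "completed"):
    store.put("trip-1", {"status": status})
print(store.get("trip-1"))
print(len(store.history("trip-1")))
```

Every status change costs one extra row, which is the storage overhead the article goes on to discuss.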
This sounds insane from a storage perspective, because you're duplicating the record on every update. But it solved Uber's core problem: append-only writes never overwrite each other in place. Two datacenters can independently write new versions of the same record. If both produce the same version number, replication resolves the collision once it catches up by keeping the version with the higher timestamp.
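The reconciliation rule can be sketched as a merge over two regions' rows. This is an illustrative last-write-wins merge under assumptions, not the production algorithm: the `Row` type, the `merge` function, and the region-name tie-breaker for equal timestamps are all invented for the example.

```python
from typing import NamedTuple


class Row(NamedTuple):
    uuid: str
    version: int
    timestamp: float
    region: str
    payload: str


def merge(*replicas: list[Row]) -> list[Row]:
    """Combine rows from several regions into one version history."""
    winners: dict[tuple[str, int], Row] = {}
    for replica in replicas:
        for row in replica:
            key = (row.uuid, row.version)
            current = winners.get(key)
            # Keep the row with the higher timestamp. The region name is an
            # assumed tie-breaker so that every replica picks the same winner.
            if current is None or (row.timestamp, row.region) > (
                current.timestamp,
                current.region,
            ):
                winners[key] = row
    return sorted(winners.values(), key=lambda r: (r.uuid, r.version))


sf = [
    Row("trip-1", 1, 100.0, "sf", '{"status": "requested"}'),
    Row("trip-1", 2, 105.0, "sf", '{"status": "cancelled"}'),
]
oakland = [
    Row("trip-1", 1, 100.0, "sf", '{"status": "requested"}'),
    Row("trip-1", 2, 106.5, "oakland", '{"status": "accepted"}'),
]

# Merge order does not matter: both regions converge on the same history.
print(merge(sf, oakland) == merge(oakland, sf))
print(merge(sf, oakland)[-1].payload)
```

A rule like this is only as trustworthy as the clocks behind the timestamps, so it relies on the regions keeping their clocks closely synchronized.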
3. In-Memory Buffer Layer
Reading the highest version from disk for every query would be slow. Schemaless adds an in-memory buffer cache (originally Memcached, later migrated to a custom C++ solution) that holds the latest version of every recently-accessed record.
Read flow:
- Application requests trip 550e8400...
- Buffer cache returns latest version if present (cache hit: ~800μs)
- If miss, query storage cell for highest version number (cache miss: ~15ms)
- Store result in buffer cache
- Return to application
Write flow:
- Application writes new version of trip
- Generate UUID (if new record) or increment version (if update)
- Write to local storage cell (async, ~5ms acknowledgment)
- Invalidate buffer cache entry
- Asynchronously replicate to other regions
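The read and write flows above amount to a cache-aside read path with invalidation on write. The sketch below shows that pattern with plain dictionaries standing in for the buffer cache and the storage cell. The `BufferedStore` class and its counters are hypothetical, and cross-region replication is left out.

```python
from typing import Optional


class BufferedStore:
    """Cache-aside reads and invalidate-on-write, using in-memory stand-ins."""

    def __init__(self) -> None:
        self.cell: dict[str, list[dict]] = {}  # uuid -> versions (storage cell)
        self.cache: dict[str, dict] = {}       # uuid -> latest version (buffer)
        self.cell_reads = 0                    # counts trips to the storage cell

    def read(self, uuid: str) -> Optional[dict]:
        # Step 1: serve from the buffer cache when possible.
        if uuid in self.cache:
            return self.cache[uuid]
        # Step 2: on a miss, fetch the highest version from the storage cell.
        self.cell_reads += 1
        versions = self.cell.get(uuid)
        if not versions:
            return None
        latest = versions[-1]  # versions are appended in increasing order
        # Step 3: remember the result for the next reader.
        self.cache[uuid] = latest
        return latest

    def write(self, uuid: str, payload: dict) -> int:
        versions = self.cell.setdefault(uuid, [])
        row = {"version": len(versions) + 1, "payload": payload}
        versions.append(row)
        # Drop the cached entry so the next read sees the new version.
        self.cache.pop(uuid, None)
        return row["version"]


store = BufferedStore()
store.write("trip-1", {"status": "requested"})
store.read("trip-1")   # miss: goes to the storage cell
store.read("trip-1")   # hit: served from the buffer
store.write("trip-1", {"status": "accepted"})
print(store.read("trip-1")["payload"], store.cell_reads)
```

Invalidating on write rather than updating the cache in place keeps the write path simple, at the cost of one cache miss after every write.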
4. Eventually Consistent Geography
Schemaless embraced eventual consistency explicitly. When you request a trip in San Francisco and the driver is in Oakland, here's what happens:
- Trip record writes to the SF region storage cell (because that's where the trip's UUID hashes)
- Driver accepts trip in Oakland region
- Oakland writes a new version with status=accepted
- SF and Oakland eventually sync (typically <100ms, worst case 2-3 seconds)
- Both regions converge on the same version history
The key insight: Uber doesn't need strong consistency. It needs eventual consistency with monotonic version ordering. As long as version numbers always increase and timestamps are synchronized, the system will converge on the same reality.
The 6-Month Migration: Moving 25 Billion Records Without Stopping the Car
By March 2015, Schemaless was ready for production. But Uber had accumulated 25 billion trip records in MySQL. The migration plan was surgical:
Phase 1: Dual Writes (Weeks 1-4)
- Application layer writes to both MySQL and Schemaless
- Reads still come from MySQL
- Verify write consistency between systems
Phase 2: Shadow Reads (Weeks 5-8)
- Reads come from MySQL (primary)
- Shadow reads query Schemaless and compare results
- Alert on any discrepancies
- Fix bugs in Schemaless query layer
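Phases 1 and 2 can be pictured as a thin proxy in front of both stores. The sketch below is a generic dual-write and shadow-read wrapper under assumptions: dictionaries stand in for MySQL and Schemaless, and the `MigrationProxy` class and its `mismatches` list are invented for the example.

```python
from typing import Any, Optional


class MigrationProxy:
    """Dual-write to both stores, serve reads from the legacy store,
    and compare every read against the candidate store."""

    def __init__(self, legacy: dict, candidate: dict) -> None:
        self.legacy = legacy          # stand-in for the MySQL cluster
        self.candidate = candidate    # stand-in for the new store
        self.mismatches: list[tuple[str, Any, Any]] = []

    def write(self, key: str, value: Any) -> None:
        # Phase 1: every write goes to both systems.
        self.legacy[key] = value
        self.candidate[key] = value

    def read(self, key: str) -> Optional[Any]:
        primary = self.legacy.get(key)
        # Phase 2: the shadow read never affects what the caller receives.
        shadow = self.candidate.get(key)
        if shadow != primary:
            # A real system would raise an alert here; this sketch just records it.
            self.mismatches.append((key, primary, shadow))
        return primary


proxy = MigrationProxy(legacy={}, candidate={})
proxy.write("trip-1", {"status": "completed"})
proxy.candidate["trip-1"] = {"status": "accepted"}  # simulate a bug in the new store
print(proxy.read("trip-1"))
print(proxy.mismatches)
```

Because callers only ever see the legacy result, bugs in the new query layer surface as recorded discrepancies rather than as user-facing errors.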
Phase 3: Gradual Cutover (Weeks 9-20)
- Route 1% of read traffic to Schemaless
- Monitor latency (target: p99 < 50ms), error rate (< 0.01%), data consistency
- Increase to 5%, 10%, 25%, 50%, 100% over 12 weeks
- Keep MySQL as fallback
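A gradual cutover like this is commonly driven by deterministic bucketing plus a health gate. The sketch below assumes that approach; it is not taken from Uber's code. The function names, the use of CRC32 for bucketing, and the rule of advancing one step at a time are illustrative choices, while the percentages and thresholds come from the plan above.

```python
import zlib

# Percentages named in the cutover plan.
ROLLOUT_STEPS = [1, 5, 10, 25, 50, 100]


def use_schemaless(record_uuid: str, rollout_percent: int) -> bool:
    """Decide which store serves reads for this record."""
    # Stable bucket in [0, 100): the same record always gets the same bucket,
    # so a record that moved to the new store stays there as the percentage grows.
    bucket = zlib.crc32(record_uuid.encode("utf-8")) % 100
    return bucket < rollout_percent


def may_advance(p99_ms: float, error_rate: float) -> bool:
    """Health gate from the plan: p99 under 50 ms and error rate under 0.01%."""
    return p99_ms < 50.0 and error_rate < 0.0001


def next_step(current_percent: int, p99_ms: float, error_rate: float) -> int:
    """Move to the next rollout percentage only while the metrics are healthy."""
    if not may_advance(p99_ms, error_rate):
        return current_percent
    index = ROLLOUT_STEPS.index(current_percent)
    return ROLLOUT_STEPS[min(index + 1, len(ROLLOUT_STEPS) - 1)]


print(next_step(1, p99_ms=42.0, error_rate=0.00005))   # healthy: advance
print(next_step(10, p99_ms=61.0, error_rate=0.00005))  # too slow: hold
```

Falling back to MySQL is then a matter of setting the percentage back to zero, since the routing decision is recomputed on every read.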
Phase 4: Historical Backfill (Weeks 21-24)
- Migrate 25 billion historical records from MySQL to Schemaless
- Use Hadoop MapReduce jobs to read MySQL, transform to append-only format, write to Schemaless
- Rate-limited to avoid overwhelming storage cells (target: 10K records/second/cell)
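The per-cell rate limit on the backfill could be enforced with a token bucket, sketched below. The article does not say which mechanism was used, so the `TokenBucket` class is an assumption for illustration. Only the 10,000 records per second target comes from the plan above.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow at most `rate_per_sec` records per second on average."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def try_acquire(self, n: int = 1) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond the capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


# One limiter per storage cell, using the 10K records/second target.
limiter = TokenBucket(rate_per_sec=10_000, capacity=10_000)
batch = 500
if limiter.try_acquire(batch):
    print("write batch of", batch, "records")
else:
    print("back off and retry later")
```

A backfill worker would call `try_acquire` before each batch and sleep briefly when it returns `False`, which keeps historical writes from crowding out live traffic on the same cell.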
The full migration took 22 weeks. By August 2015, 100% of Uber's trip data lived in Schemaless. The legacy sharded MySQL cluster was decommissioned three months later.
The Numbers: Why It Worked
Before Schemaless (MySQL sharded architecture):
- Write latency: p99 of 200ms (cross-shard writes)
- Read latency: p99 of 150ms (fan-out queries)
- Data loss incidents: 2-3 per week
- Schema changes: Required 6-8 week migration windows
After Schemaless:
- Write latency: p99 of 12ms (single-cell writes)
- Read latency: p99 of 8ms (buffer cache hits), p99 of 45ms (cache miss)
- Data loss incidents: Zero in first 18 months
- Schema changes: Zero — applications just write different JSON structures
By 2016, Schemaless was handling:
- 40 million trips per day
- 300+ billion stored versions (remember: append-only)
- 15 petabytes of storage across 1,200 storage cells
- Serving 5 million QPS at peak hours
The Legacy: The Database That Taught Silicon Valley to Embrace Chaos
Schemaless proved something controversial: sometimes the right architecture isn't the one with the strongest guarantees — it's the one that matches how your data actually behaves.
Uber's data is messy, distributed, and constantly moving. Trying to force it into a relational model with ACID guarantees was like trying to fit the ocean into a filing cabinet.
The Schemaless pattern influenced an entire generation of database design:
- Stripe built a similar append-only ledger system for payment records
- Airbnb adopted schema-on-read for listing data that varies by country and regulation
- DoorDash uses append-only writes for order tracking across restaurants, drivers, and customers
Evan Klitzke left Uber in 2016 to work on distributed systems at Google. But his 4AM discovery changed how tech companies think about data at scale.
In 2024, Schemaless still runs under the hood of every Uber trip. The system that was supposed to be a temporary fix became permanent infrastructure.
Because sometimes, the best database is the one that admits what databases can't do — and gets out of the way.