Move Fast and Don't Break Everything: How Facebook Built the Architecture Behind 3 Billion Users
๐Ÿ—๏ธSystem DesignMarch 5, 2026 at 2:30 PMยท14 min read


In 2004, Mark Zuckerberg's PHP app served 1,200 Harvard students. By 2024, it served 3 billion users with 99.999% uptime. This is the system design story of the most scaled social network ever built.

FacebookMetaSystem DesignDistributed Systems

A Dorm Room in Kirkland House

February 4, 2004. Mark Zuckerberg launched thefacebook.com from his Harvard dorm room. The stack was dead simple: a single Apache server running PHP, connected to a single MySQL database. The entire application (profiles, the wall, friend connections, photo uploads) lived in one PHP codebase that Zuckerberg had written in a few weeks.

Within 24 hours, 1,200 Harvard students had signed up. Within a month, half of Harvard's undergraduates were on it. The server was already struggling.

Nobody looking at that PHP script could have imagined that it would evolve into an infrastructure serving 3 billion monthly active users, millions of photo requests per second, and 100+ billion API calls daily, while storing over an exabyte of data. But the seeds of Facebook's architecture, the scrappy, pragmatic, "make it work now and fix it later" philosophy, were planted in that dorm room.

The PHP Era: Scaling the Unscalable

Facebook's first scaling challenge came fast. By 2005, they had a million users, and PHP, an interpreted language designed for rendering web pages, was buckling under the load.

Most companies at this point would have rewritten everything in Java or C++. Facebook didn't. Instead, they made a decision that would define their engineering culture for the next decade: make PHP fast enough, rather than abandon it.

The reasoning was pragmatic. Facebook had hundreds of PHP files, dozens of engineers who knew PHP, and a deployment process optimized for PHP. A rewrite would take months, and in those months, MySpace (then 10x their size) could bury them. Speed of iteration mattered more than technical elegance.

So they built tools to make PHP viable at scale:

APC (Alternative PHP Cache): Opcode caching to avoid re-parsing PHP files on every request. This alone gave them a 3-5x throughput improvement.

Memcached at scale: Facebook became the world's largest Memcached user, eventually running thousands of Memcached servers storing hundreds of terabytes of cached data. The insight: if you can serve 95% of reads from cache, your database only needs to handle 5% of the traffic. They later built mcrouter, a Memcached proxy that handled consistent hashing, replication, and failover across global Memcached pools.
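The read path this enables is the classic cache-aside pattern. Here is a minimal sketch, with dict-like stand-ins for Memcached and MySQL (the class and its names are illustrative, not Facebook's actual mcrouter API):

```python
# Cache-aside: serve reads from cache when possible; on a miss,
# read from the database and populate the cache for next time.

class CacheAsideStore:
    def __init__(self, cache, db):
        self.cache = cache  # dict-like cache (stands in for Memcached)
        self.db = db        # dict-like store (stands in for MySQL)

    def get(self, key):
        value = self.cache.get(key)
        if value is not None:
            return value          # cache hit: no database load at all
        value = self.db.get(key)  # cache miss: one database read
        if value is not None:
            self.cache[key] = value  # fill the cache for future reads
        return value

    def set(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)  # invalidate rather than update in place


store = CacheAsideStore(cache={}, db={"user:42": {"name": "Alice"}})
print(store.get("user:42"))  # first read misses and fills the cache
print(store.get("user:42"))  # second read is served from the cache
```

Invalidate-on-write (rather than write-through) is the simpler choice here; it avoids the cache ever holding a value newer than the database.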

HipHop for PHP (2010): A transpiler that converted PHP source code into optimized C++ code, then compiled it into a native binary. This cut CPU usage across Facebook's web tier by roughly half. A single HipHop binary contained the entire Facebook PHP codebase: millions of lines of PHP compiled into one massive C++ executable.

HHVM (2012): HipHop's successor. Instead of ahead-of-time compilation, HHVM used a JIT (Just-In-Time) compiler, similar to Java's HotSpot VM. PHP code was compiled to bytecode, then dynamically compiled to machine code at runtime based on actual execution patterns. HHVM eventually became so different from standard PHP that Facebook created their own language, Hack, which added static typing, generics, and async/await to PHP.

This PHP journey tells you everything about Facebook's engineering philosophy: don't rewrite; evolve. Make incremental improvements that compound over time. Ship today, optimize tomorrow.

The Social Graph: The Hardest Data Structure in Tech

At its core, Facebook is a graph problem. 3 billion nodes (users) connected by hundreds of billions of edges (friendships, follows, group memberships, page likes). The News Feed algorithm needs to traverse this graph in real-time, asking: "Of the 5,000 posts created by this user's 500 friends in the last 24 hours, which 50 should appear in their feed?"

The data structure that stores this graph has evolved through three eras:

Era 1: MySQL (2004-2009). Friend relationships were stored in a simple MySQL table: (user_id_1, user_id_2, timestamp). Finding mutual friends was a JOIN. Finding friends-of-friends was a JOIN on a JOIN. At 100 million users, these queries were killing the database.
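To make the pain concrete, here is a toy version of that Era-1 schema in SQLite, with the mutual-friends query as a self-JOIN (the schema and the both-directions edge convention are illustrative assumptions):

```python
import sqlite3

# One row per friendship edge, stored in both directions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (user_id INTEGER, friend_id INTEGER)")
edges = [(1, 2), (2, 1), (1, 3), (3, 1), (2, 3), (3, 2), (2, 4), (4, 2)]
conn.executemany("INSERT INTO friends VALUES (?, ?)", edges)

# Mutual friends of users 1 and 2: a self-JOIN on the edge table.
# Friends-of-friends would need yet another JOIN on top of this.
mutual = conn.execute("""
    SELECT a.friend_id
    FROM friends a
    JOIN friends b ON a.friend_id = b.friend_id
    WHERE a.user_id = 1 AND b.user_id = 2
""").fetchall()
print(mutual)  # [(3,)]
```

Each extra hop in the graph multiplies the rows the database must scan, which is why this approach collapsed at scale.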

Era 2: TAO (2009-present). Facebook built TAO (The Associations and Objects): a custom, globally distributed graph store optimized for the social graph. TAO's key design decisions:

  • Objects and Associations: Everything is either a node (user, post, photo) or an edge (friendship, like, comment). This maps perfectly to the social graph.
  • Read-optimized: TAO is heavily optimized for reads over writes (the social graph is read 1000x more than it's written). Writes go to a leader region and replicate asynchronously.
  • Cache-first: TAO has an integrated caching layer. Most reads never hit the underlying database (MySQL, initially). Cache hit rates exceed 99.9%.
  • Geographic distribution: TAO caches are deployed in every Facebook datacenter worldwide. A user in Tokyo reads from the Tokyo TAO cache, not from a server in Oregon. Write consistency is maintained through a leader-follower model per data shard.

TAO handles billions of reads per second globally. It's arguably the most scaled graph storage system ever built.
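The objects-and-associations model can be sketched in a few lines. The operation names below (obj_add, assoc_add, assoc_get) echo the published TAO API, but this in-memory version is a toy illustration, not the real system:

```python
import itertools
from collections import defaultdict

# Toy TAO data model: typed objects (nodes) and typed,
# time-ordered associations (directed edges).
_ids = itertools.count(1)
objects = {}                # object id -> (type, data)
assocs = defaultdict(list)  # (id1, assoc_type) -> [(time, id2)]

def obj_add(otype, **data):
    oid = next(_ids)
    objects[oid] = (otype, data)
    return oid

def assoc_add(id1, atype, id2, t):
    assocs[(id1, atype)].append((t, id2))

def assoc_get(id1, atype):
    # Edges come back newest-first; most queries read only a prefix
    # (e.g. the most recent likes on a post).
    return [id2 for _, id2 in sorted(assocs[(id1, atype)], reverse=True)]

alice = obj_add("user", name="Alice")
bob = obj_add("user", name="Bob")
post = obj_add("post", text="Hello")
assoc_add(alice, "friend", bob, t=1)
assoc_add(bob, "friend", alice, t=1)   # friendship is symmetric: two edges
assoc_add(alice, "authored", post, t=2)
print(assoc_get(alice, "friend"))
```

Everything in the product (likes, comments, tags, group memberships) reduces to these two primitives, which is what makes a single read-optimized, cache-first store viable.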

Era 3: TAO + ML (2018-present). Modern TAO integrates with Facebook's ML infrastructure. Graph embeddings (vector representations of users and their connections) are precomputed and stored alongside the graph data. This enables ML-powered features like "People You May Know" and content ranking to operate at graph-query speed.

News Feed: The Algorithm That Changed the Internet

The News Feed is Facebook's most important, and most controversial, product. When it launched in 2006, users hated it (there was literally a "Students Against Facebook News Feed" group with 700,000 members). Today, it's the core engagement driver.

The architecture behind News Feed is a multi-stage ranking pipeline:

1. Inventory Collection (thousands of candidates) When you open Facebook, the system collects all potential stories: posts, photos, videos, and life events from your friends and the pages you follow. For a user with 500 moderately active friends, this is typically 1,000-5,000 candidate stories.

2. Feature Extraction For each candidate story, the system computes hundreds of features: how close you are to the author (based on interaction history), the post type (photo vs. text vs. link), engagement signals (how many others have liked/commented), recency, and the predicted relevance based on your past behavior.

3. First-Pass Ranking (Lightweight Model) A lightweight ML model quickly scores all candidates, reducing the set from thousands to ~500. This model prioritizes recall: it's better to keep a potentially good story than to mistakenly drop it.

4. Deep Ranking (Heavy Model) A heavy ranking model (initially logistic regression, now a massive deep neural network) scores the remaining ~500 candidates across multiple objectives: predicted probability of like, comment, share, click, and, controversially, time spent viewing. This model runs on Facebook's GPU inference infrastructure and is the most computationally expensive step.

5. Integrity Filtering Before final display, stories pass through integrity classifiers that detect misinformation, hate speech, and policy violations. These classifiers can demote or remove stories from the feed.

6. Final Composition The top-ranked stories are assembled into the feed, with additional logic for diversity (don't show 5 posts from the same friend in a row), ad insertion, and story format optimization.
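The funnel above can be sketched as a toy pipeline. The scoring functions, feature names, and weights here are invented for illustration; the real models use hundreds of features and learned parameters:

```python
# Toy ranking funnel: cheap scoring to prune, expensive scoring on
# the survivors, then diversity-aware composition of the final feed.

def cheap_score(story):
    # Stand-in for the lightweight first-pass model.
    return story["affinity"] + 0.1 * story["recency"]

def heavy_score(story):
    # Stand-in for the deep model: weighted predicted actions.
    return 2 * story["p_like"] + 5 * story["p_comment"] + 3 * story["p_share"]

def rank_feed(candidates, first_pass_k=3, feed_size=2):
    # First-pass ranking keeps only the top-k by the cheap model.
    survivors = sorted(candidates, key=cheap_score, reverse=True)[:first_pass_k]
    # The heavy model re-ranks the survivors.
    ranked = sorted(survivors, key=heavy_score, reverse=True)
    # Composition with a simple diversity rule: one post per author.
    feed, seen_authors = [], set()
    for story in ranked:
        if story["author"] not in seen_authors:
            feed.append(story)
            seen_authors.add(story["author"])
        if len(feed) == feed_size:
            break
    return feed

candidates = [
    {"id": 1, "author": "ann", "affinity": 0.9, "recency": 1.0,
     "p_like": 0.5, "p_comment": 0.1, "p_share": 0.0},
    {"id": 2, "author": "ann", "affinity": 0.8, "recency": 0.9,
     "p_like": 0.9, "p_comment": 0.2, "p_share": 0.1},
    {"id": 3, "author": "bob", "affinity": 0.7, "recency": 0.5,
     "p_like": 0.4, "p_comment": 0.4, "p_share": 0.2},
    {"id": 4, "author": "cat", "affinity": 0.1, "recency": 0.1,
     "p_like": 0.9, "p_comment": 0.9, "p_share": 0.9},
]
print([s["id"] for s in rank_feed(candidates)])  # [3, 2]
```

Note that story 4 has the best heavy-model score but is pruned by the cheap model, which is exactly the recall risk the first-pass stage must manage.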

This entire pipeline executes in under 500 milliseconds. For 2+ billion daily active users. Every time they open the app.

The Data Layer: An Exabyte and Counting

Facebook stores more data than almost any organization on Earth. The storage architecture is layered:

Hot storage (Memcached + TAO Cache): The most frequently accessed data (recent posts, active user profiles, current session state). Terabytes across thousands of cache servers. Sub-millisecond reads.

Warm storage (MySQL + RocksDB): The primary data store for structured data. Facebook runs one of the world's largest MySQL deployments: thousands of database instances, sharded by user ID. In recent years, they've migrated many workloads to MyRocks, a MySQL storage engine built on RocksDB (an LSM-tree based key-value store developed by Facebook). MyRocks reduces storage space by about 50% compared to InnoDB through better compression.
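Sharding by user ID keeps all of one user's rows on a single logical shard. A minimal sketch of the idea (the shard count, modulo mapping, and host names are all hypothetical; real deployments use a remappable shard directory):

```python
# Toy user-ID sharding: a fixed logical shard count, with a separate
# shard -> host map. Logical shards can move between physical hosts
# without changing application code, because the app only computes
# the logical shard.
NUM_SHARDS = 4096  # hypothetical; chosen much larger than the host count

def shard_for(user_id):
    return user_id % NUM_SHARDS

def host_for(user_id, shard_to_host):
    return shard_to_host[shard_for(user_id)]

# Example shard map spreading 4096 logical shards over 8 hosts.
shard_to_host = {s: f"mysql-host-{s % 8}" for s in range(NUM_SHARDS)}
print(shard_for(123456), host_for(123456, shard_to_host))
```

Over-provisioning logical shards relative to hosts is the standard trick: rebalancing means editing the shard map, not re-hashing every row.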

Cold storage (Haystack + f4): Photos and videos. Facebook's Haystack system was built specifically for photo storage after they discovered that traditional file systems (with their metadata overhead) couldn't handle billions of small files efficiently. Haystack stores photos in large, append-only log files, with an in-memory index mapping photo IDs to file offsets. A single Haystack machine can serve thousands of photo reads per second.

For rarely accessed older photos, f4 provides a more space-efficient storage tier using Reed-Solomon erasure coding, trading CPU for storage: an effective replication factor of about 2.8 (later lowered to 2.1), versus the 3.6x of Haystack's fully replicated storage.

Analytics storage (Hive + Presto): Facebook's data warehouse runs on a modified version of Apache Hive, storing petabytes of event data in HDFS. For interactive analytics, they built Presto โ€” a distributed SQL query engine that can query data across Hive, MySQL, Cassandra, and custom data sources in seconds rather than the hours that Hive MapReduce jobs would take. Facebook open-sourced Presto, and it's now used by Netflix, Uber, Airbnb, and hundreds of other companies.

The Turning Point: The 2012 IPO and the Mobile Bet

When Facebook went public in May 2012, it had a dirty secret: it had no mobile strategy. The Facebook mobile app was a WebView wrapper around the mobile website: slow, buggy, and losing users to native apps. Wall Street panicked. The stock dropped 50%.

Zuckerberg made a radical call: every engineer, regardless of team, would learn mobile development. The company rewrote its iOS and Android apps from scratch as fully native applications, and later built React Native (2015) to bridge the gap between web and mobile development speed.

But the bigger architectural shift was on the backend. Mobile clients had different requirements than the web:

  • Bandwidth sensitivity: Mobile networks are slow. You can't send the same payload you'd send to a desktop browser.
  • Intermittent connectivity: Mobile apps need to work offline and sync gracefully.
  • Battery constraints: Chatty protocols that poll every few seconds drain batteries.

These constraints led to two infrastructure innovations:

GraphQL (2012, open-sourced 2015): Instead of REST APIs that returned fixed data shapes (often over-fetching or under-fetching), GraphQL let mobile clients specify exactly which fields they needed. A mobile News Feed query could request just titles and thumbnails, while the desktop version requested full content. This single change reduced mobile data transfer by 50%+.
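The core idea (the client names the fields, the server returns only those) can be shown with a drastically simplified resolver. The story fields below are hypothetical, and this is field selection only, not a real GraphQL implementation:

```python
# Toy illustration of GraphQL-style field selection: the same object,
# two clients, two differently-sized payloads.

STORY = {
    "id": 101,
    "title": "Trip to Kyoto",
    "thumbnail_url": "https://example.com/t/101.jpg",
    "full_text": "A long body that the mobile client never asked for.",
    "comments": [{"author": "bob", "text": "Nice!"}],
}

def resolve(obj, requested_fields):
    # Return only the requested subset of fields.
    return {f: obj[f] for f in requested_fields if f in obj}

# Mobile asks for just enough to render a feed cell...
mobile_payload = resolve(STORY, ["id", "title", "thumbnail_url"])
# ...while desktop asks for the full story.
desktop_payload = resolve(STORY, ["id", "title", "full_text", "comments"])
print(sorted(mobile_payload))  # ['id', 'thumbnail_url', 'title']
```

With REST's fixed response shapes, both clients would have received the desktop payload; the bandwidth saving comes entirely from letting the client declare its needs.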

MQTT for Push (2011): For real-time features like messaging, Facebook adopted MQTT, a lightweight pub/sub protocol designed for constrained devices. MQTT's minimal overhead (2 bytes for a keepalive packet vs. HTTP's hundreds of bytes) made it ideal for maintaining persistent connections on mobile devices without destroying battery life.
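The 2-byte figure is literal: an MQTT PINGREQ is the control-packet-type byte 0xC0 followed by a zero remaining-length byte. A quick comparison against a minimal HTTP request (the `/ping` path and headers are made up for illustration):

```python
# MQTT keepalive packets, per the MQTT spec:
# PINGREQ  = packet type 12 (0xC0) + remaining length 0
# PINGRESP = packet type 13 (0xD0) + remaining length 0
PINGREQ = bytes([0xC0, 0x00])
PINGRESP = bytes([0xD0, 0x00])

# A bare-minimum HTTP/1.1 request for comparison.
http_probe = (
    "GET /ping HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
).encode()

print(len(PINGREQ), len(http_probe))  # 2 vs. 65
```

Multiply that per-keepalive difference by millions of phones holding persistent connections, and the radio (and battery) savings are substantial.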

The Legacy: What Facebook Teaches Every Architect

Facebook's architecture story is ultimately about pragmatic evolution under existential pressure. They didn't design for 3 billion users. They designed for 1,200 Harvard students and then fought, feature by feature, system by system, crisis by crisis, to keep the thing running as it grew 2,500,000x.

The lessons:

Don't rewrite; evolve. Facebook's PHP-to-HipHop-to-HHVM-to-Hack journey saved years of rewrite time while delivering the same performance gains incrementally.

Cache everything, everywhere. Facebook's architecture is, at its core, an elaborate caching system. TAO, Memcached, CDNs: the database is the last resort, not the first call.

Build the tools you need. When existing technology couldn't solve their problems, Facebook built their own: Cassandra, Presto, GraphQL, React, RocksDB, PyTorch. Many became industry standards.

Architecture follows product. The shift to mobile didn't just change the app; it changed the entire backend (GraphQL, MQTT, data transfer optimization). Let product requirements drive technical decisions, not the other way around.

From a dorm room PHP script to the most complex social infrastructure ever built, Facebook's architecture is proof that you don't need to get it right the first time. You just need to keep getting it right, over and over, faster than anyone else.

โœ๏ธ
Written by Swayam Mohanty
Untold stories behind the tech giants, legendary moments, and the code that changed the world.
