Introduction to Linux RCU
If you've spent any time in the depths of the Linux kernel, you've undoubtedly encountered a synchronization mechanism called RCU (Read-Copy-Update). It's a powerful tool that enables blazingly fast read performance under the right circumstances. While there are now many flavors of RCU (like Sleepable RCU, or the preemptible RCU used on PREEMPT_RT kernels), it all started with a brilliant, core idea often called "Classic RCU".
This post is a deep dive into that classic design: its history, its soul, and why it remains a cornerstone of high-performance computing.
Why Do We Need RCU Anyway?
Before RCU, kernel developers primarily used locks—like spinlocks or mutexes—to protect shared data structures. Locks work on a simple principle: only one thread can "hold" the lock at a time, preventing concurrent access.
The problem? Locks are a bottleneck.
- Readers block Writers: If a reader holds a lock, a writer must wait.
- Writers block Readers: If a writer holds a lock, all readers must wait.
- Scalability Suffers: On systems with hundreds of CPUs, this contention for a single lock can bring performance to a grinding halt.
RCU emerged to solve this for a specific but common scenario: data structures where reads are very frequent, and writes are quite rare. Think of routing tables, process directories, or certain kernel configuration parameters. For these cases, wouldn't it be ideal if readers could just fly through without any locking overhead?
That's exactly what RCU provides.
A Brief History: The Origin of an Idea
The core concepts of what would become RCU were first published by academics in the early 1990s. However, it was Paul McKenney, while working at Sequent Computer Systems (which later became part of IBM), who recognized its immense potential for scaling the Linux kernel on large NUMA systems.
He spearheaded its adoption into the Linux kernel in the early 2000s (it was merged around 2.5.43). The driving force was the need to make the kernel scale efficiently to dozens, and eventually hundreds, of CPUs. Classic RCU was the first implementation and served as the foundation for all the variants that followed.
The Soul of Classic RCU: A Three-Act Play
The magic of RCU isn't in complex algorithms; it's in a simple, elegant design built on three core pillars. Understanding this is key to grasping its power.
1. The Reader (The Graceful Flight)
Readers are incredibly lightweight. Their code path is designed to be as fast as possible.
- No Locking: Readers do not acquire locks. They simply don't.
- No Writing: Readers promise not to modify the shared data.
- The Protocol: Readers delimit their access with rcu_read_lock() and rcu_read_unlock(). In classic RCU, these are often just compiler barriers (on non-preemptible kernels) or preemption disable/enable (on preemptible ones), meaning they have almost zero overhead.
What readers are really doing is announcing: "Hey, I might be looking at this old version of the data. Please don't get rid of it just yet."
2. The Writer (The Careful Sculptor)
Writers have a more complex job. They need to update data without disrupting the ongoing readers.
- Create a Copy: A writer first creates a copy of the data element it wishes to update.
- Modify the Copy: The writer then modifies this copy, leaving the original data intact and available for any current readers.
- Publish with a Pointer Swap: Finally, in a single, atomic operation, the writer swaps the pointer to make the new copy visible to all future readers. This is the "Update."
This atomic swap is the moment of change. Readers that started before the swap will see the old data. Readers that start after will see the new data. This is safe because the old data remains consistent.
3. The Reclaimer (The Patient Janitor)
This is the most ingenious part. After the writer swaps the pointer, the old version of the data is now obsolete. But we can't just free() it immediately because there might still be pre-existing readers looking at it.
This is where the concept of a "Grace Period" comes in.
- A grace period is a span of time long enough to guarantee that every CPU in the system has undergone at least one context switch (or been idle). Why? Because a context switch means any reader that was in a critical section (between rcu_read_lock() and rcu_read_unlock()) must have finished.
- After a full grace period has elapsed, the kernel knows that no readers could possibly still be holding a reference to the old data. It is now safe to reclaim (free) the memory.
This waiting is typically done via synchronize_rcu() or call_rcu() (which schedules the reclamation asynchronously).
Putting It All Together: A Simple Example
Imagine a linked list. A writer wants to update node B.
- Reader CPU 1 & 2: Traverse the list A->B->C. They see the old B.
- Writer: Creates B', a copy of B with the new data. It carefully links B' to point to C, and then links A to point to B' in one atomic operation. The list is now A->B'->C, but the old B->C path still exists.
- Reader CPU 3: Starts after the swap. It traverses A->B'->C and sees the new data.
- Reader CPU 1: Is still looking at the old B. This is perfectly safe!
- Grace Period: The kernel waits. Eventually, CPUs 1 and 2 schedule other tasks, proving they are no longer in their RCU read-side critical sections.
- Reclaim: After the grace period, the kernel knows no one is using the old B anymore, so it can safely kfree(B).
The Beauty and The Trade-off
The soul of classic RCU is this trade-off:
- Blazingly fast reads (no locks, no writes, minimal overhead).
- Synchronized, safe writes.
- The cost is deferred memory reclamation, which can cause higher memory pressure if updates are extremely frequent.
It's a perfect trade-off for its target use case. By understanding the history and the three-act play of readers, writers, and the reclaimer, you understand the elegant soul of one of the kernel's most important synchronization primitives. While newer RCU variants add features like preemption or sleepability, they all build upon this brilliant classic foundation.