Cache coherence is the system that keeps data consistent when multiple processor cores each have their own private cache. Every core in a modern CPU stores copies of frequently used data in a small, fast memory called a cache. When two cores hold copies of the same data and one core changes its copy, the other core’s copy is instantly outdated. Cache coherence protocols detect and resolve these conflicts automatically, ensuring that every core always reads the most recent version of any piece of data.
Why Caches Create a Consistency Problem
A single-core processor never has this issue. There’s one cache, one copy of any given data, and no conflict. But modern CPUs have 4, 8, 16, or more cores, each with its own private cache. When multiple cores work on shared data, something as simple as updating a counter can go wrong: Core 0 reads the counter as 5, Core 1 reads it as 5, Core 0 increments it to 6 in its cache, and now Core 1 still thinks the value is 5. Without coherence, Core 1 could make decisions based on stale data, or both cores could overwrite each other’s work.
The fundamental rule a coherence protocol enforces is straightforward: any read of shared data must return the most recently written value. That sounds simple, but implementing it across dozens of cores running billions of operations per second is one of the hardest problems in processor design.
How Cores Track Data: The MESI Protocol
The most widely used coherence scheme is MESI, which assigns every block of cached data (called a “cache line”) one of four states. These states tell the hardware what each core is allowed to do with its copy and what needs to happen when another core wants access.
- Modified: This core has changed the data, and no other core has a copy. The value in main memory is outdated. This core is responsible for eventually writing the updated data back.
- Exclusive: This core has the only cached copy, and it matches main memory. The core can read it freely, and if it decides to write, it can switch directly to Modified without notifying anyone else.
- Shared: Multiple cores hold copies of this data, and all copies match. Any core can read, but none can write without first invalidating everyone else’s copies.
- Invalid: This cache line is stale or empty. The core must fetch a fresh copy before using it.
The transitions between these states are what enforce coherence. When a core wants to write to data that other cores are also caching in the Shared state, it broadcasts an “invalidate” message. Every other core marks its copy as Invalid, and the writing core moves to Modified. If another core later needs that data, the core holding the Modified copy intercepts the request, supplies its updated copy, and both cores move to Shared. This back-and-forth happens entirely in hardware, invisible to software, in a matter of nanoseconds.
AMD processors use an extended version called MOESI, which adds a fifth state: Owned. A core in the Owned state holds a modified copy that other cores are also reading in Shared. The difference is that the Owned core can supply data directly to other cores on a read without writing it back to main memory first, reducing memory traffic.
Snooping vs. Directory-Based Systems
The hardware needs a mechanism to actually deliver those invalidation and data-sharing messages. Two main approaches exist, and the choice between them depends largely on how many cores the system has.
In a snooping system, all cores share a common communication bus. Every memory request is broadcast to all cores, and each core “snoops” on the bus to see if the request involves data it has cached. If it does, it responds accordingly: supplying data, invalidating its copy, or both. Snooping is simple and fast because buses are natural broadcast channels. It dominates in consumer processors and smaller systems where the core count stays manageable.
The problem is that buses don’t scale. As you add more cores, the bus becomes a bottleneck. The bandwidth is finite, and every core’s traffic competes for it. Transmission latency also rises with each additional node on the bus. For large systems with dozens or hundreds of cores, this approach breaks down.
Directory-based systems solve this by replacing the broadcast model with a centralized (or distributed) directory that tracks which cores have cached copies of each memory block. When a core needs to invalidate or fetch data, the directory tells it exactly which cores to contact, and the system sends targeted, point-to-point messages instead of broadcasting to everyone. This scales far better because traffic grows proportionally to actual sharing rather than to the total number of cores.
The tradeoff is overhead. The directory itself consumes memory. In the simplest “full-map” design, every memory block has a directory entry with one bit per core in the system. The total memory used by the directory scales with the square of the number of processors, because both the amount of memory and the size of each directory entry grow with core count. MIT research on large-scale multiprocessors found this to be a significant constraint, and much of the complexity in modern directory schemes goes toward compressing or distributing these entries to keep the overhead manageable.
Write-Back vs. Write-Through Caches
Cache coherence protocols also interact with how the cache handles writes in the first place. In a write-through cache, every write immediately updates both the cache and main memory. This keeps memory up to date at all times, but it’s slow because every single write operation has to travel all the way to main memory. You get no speed benefit on writes, only on reads.
In a write-back cache, writes only update the cache line and mark it as “dirty.” The data gets written to main memory later, typically when the cache line is evicted to make room for something else. This is much faster for programs that write to the same data repeatedly, because those writes stay local to the cache. Nearly all modern processors use write-back caches for this reason.
The consequence for coherence is that write-back caches are more complex to keep consistent. Because main memory can be out of date at any time, the coherence protocol must be able to pull the latest data from another core’s cache rather than from memory. This is exactly what the Modified and Owned states in MESI and MOESI handle: they signal that a core holds the only correct copy and must supply it on request.
False Sharing: A Hidden Performance Trap
Even when software doesn’t intentionally share data between cores, the coherence protocol can still cause significant slowdowns through a phenomenon called false sharing. It happens because coherence operates on entire cache lines (typically 64 bytes), not on individual variables.
Imagine two unrelated variables that happen to sit next to each other in memory, close enough to land on the same cache line. Core 0 frequently updates one variable, while Core 1 frequently reads the other. From the coherence protocol’s perspective, both cores are accessing the same cache line. Every time Core 0 writes, it invalidates Core 1’s copy of the entire line. Core 1 must reload the full cache line from Core 0 just to read its variable, even though that variable never actually changed.
The Linux kernel documentation describes a concrete example: a structure containing a reference counter (updated constantly) and a name string (set once and never changed). When many cores access this structure, cores that only read the name are forced to reload their cache line over and over because another core keeps bumping the counter. The data they care about hasn’t changed, but it shares a cache line with data that has.
False sharing gets worse as core counts increase, because more cores means more potential for unrelated data to be accessed concurrently. The fix is usually to pad data structures so that frequently written variables land on their own cache lines, separated from read-heavy data. It’s a software-level solution to a hardware-level problem, and diagnosing it typically requires specialized profiling tools that can detect excessive cache line invalidations.
Scalability Limits in Many-Core Systems
Cache coherence works well for the 4-to-16 core processors in most consumer hardware, but it becomes increasingly expensive as core counts climb into the dozens or hundreds. The core challenge is that coherence generates communication traffic proportional to how much data is shared, and the more cores you add, the more sharing (and invalidation) occurs.
Bus-based snooping systems hit a hard bandwidth wall. The bus simply cannot carry enough messages when dozens of high-speed cores are all broadcasting requests. Directory-based systems push this limit higher, but they face their own bottleneck: the latency of write operations. When a core writes to data that many other cores have cached, it must send invalidation messages to all of them and wait for acknowledgments. In some directory designs that use linked-list structures (called chained directories), invalidations happen sequentially rather than in parallel, and write latency grows directly with the number of cores sharing that data.
This is one reason why some highly parallel processors, like GPUs, use different memory models that don’t enforce strict coherence across all cores. And it’s why chip designers spend enormous effort on coherence protocol optimizations: the protocol’s efficiency directly determines how well a processor’s cores can work together without stepping on each other.