A cache line is a fixed-size chunk of data, typically 64 bytes, that serves as the smallest unit of transfer between your CPU’s cache and main memory. When the processor needs even a single byte from memory, it doesn’t fetch just that byte. It loads the entire 64-byte block surrounding it into the cache. This design exploits a simple pattern in how programs use memory: if you access one address, you’ll probably access nearby addresses very soon.
Why the CPU Fetches 64 Bytes at a Time
Programs tend to access memory in predictable patterns. When you loop through an array, you read elements one after another. When you use a data structure, its fields sit next to each other in memory. This behavior is called spatial locality, and cache lines are built to take advantage of it. By pulling in a 64-byte neighborhood on every memory request, the CPU bets that much of that block will be needed shortly. Most of the time, it wins that bet.
The 64-byte size is a tradeoff. A larger cache line would capture more surrounding data, improving hit rates for sequential access patterns, but it would also waste more space when programs jump around in memory unpredictably. A smaller line would reduce waste but require more trips to main memory for sequential workloads. Modern x86 and ARM processors have settled on 64 bytes as the sweet spot, and this has been the standard for roughly two decades.
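You don't have to take the 64-byte figure on faith: on Linux you can ask the OS what line size the hardware reports. This sketch uses a glibc-specific sysconf name; some systems return 0 or -1 instead of a value.

```c
#include <unistd.h>

/* Report the L1 data cache line size in bytes. _SC_LEVEL1_DCACHE_LINESIZE
   is a glibc extension, not portable C; the call returns 0 or -1 where
   the value is not exposed. Current x86 and ARM parts report 64. */
long l1_line_size(void) {
    return sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
}
```

The same family of names (`_SC_LEVEL2_CACHE_LINESIZE`, and so on) covers the other levels where the kernel exposes them.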
How the CPU Finds a Cache Line
When your processor generates a memory address, it splits that address into three parts to locate data in the cache. The offset bits identify the exact byte within the 64-byte line. The index bits tell the cache which set of slots to look in. The tag bits are then compared against stored tags in that set to confirm whether the right data is actually there.
How many slots each set contains determines the cache’s “associativity.” In a direct-mapped cache, each memory address maps to exactly one slot, making lookups fast but inflexible. In a 4-way set associative cache, each address can map to any of four slots within its set, reducing conflicts where two frequently used addresses compete for the same location (modern L1 caches are typically 4- to 12-way). A fully associative cache allows a line to go anywhere, which is the most flexible but requires comparing every tag on each lookup.
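The three-way split is just bit slicing. This sketch assumes 64-byte lines (6 offset bits) and 64 sets (6 index bits, as in a hypothetical 16 KB 4-way cache); the field widths are illustrative, not any specific CPU's layout.

```c
#include <stdint.h>

/* Split an address into offset, index, and tag fields for an illustrative
   cache with 64-byte lines and 64 sets. */
enum { LINE_BITS = 6, SET_BITS = 6 };   /* 2^6 = 64 bytes, 2^6 = 64 sets */

uint64_t line_offset(uint64_t addr) { return addr & ((1ULL << LINE_BITS) - 1); }
uint64_t set_index(uint64_t addr)   { return (addr >> LINE_BITS) & ((1ULL << SET_BITS) - 1); }
uint64_t line_tag(uint64_t addr)    { return addr >> (LINE_BITS + SET_BITS); }
```

Concatenating tag, index, and offset reproduces the original address, which is exactly why storing only the tag in the cache is enough to identify a line.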
Cache Levels and Speed
Modern processors organize their caches into multiple levels, each trading size for speed. The L1 cache sits closest to the processor core and responds in about 1 to 5 clock cycles. It’s small, usually 32 to 64 kilobytes per core, but extraordinarily fast. The L2 cache is larger and takes 5 to 20 cycles. The L3 cache is shared across all cores, can be several megabytes, and takes 30 to 100 cycles.
For comparison, fetching data from main memory (RAM) costs roughly 200 to 300 cycles. That means a cache line hit in L1 can be 50 to 100 times faster than going to RAM. This enormous speed gap is the entire reason caches exist. The cache line is the vehicle that moves data across these levels.
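These latencies combine into an average cost through hit rates. A back-of-the-envelope sketch: the cycle counts below fall within the ranges above, the hit rates are made-up illustrative numbers, and main memory is ignored by assuming L3 always hits.

```c
/* Average memory access time: each level's miss rate forwards the request
   to the next level. With a 4-cycle L1 hitting 95% of the time, a 12-cycle
   L2 hitting 90% of the remainder, and a 40-cycle L3 assumed to always
   hit, the average stays close to the L1 latency. */
double amat_cycles(void) {
    double l1 = 4.0, l2 = 12.0, l3 = 40.0;   /* per-level latencies (cycles) */
    double h1 = 0.95, h2 = 0.90;             /* hypothetical hit rates */
    return l1 + (1.0 - h1) * (l2 + (1.0 - h2) * l3);
}
```

Even a 5% L1 miss rate only lifts the average from 4 to about 4.8 cycles here, which is why keeping hit rates high matters more than shaving a cycle or two off any single level.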
What Happens on a Cache Miss
When the processor requests an address and the corresponding cache line isn’t present, that’s a cache miss. The CPU stalls (or works on other tasks if it can) while the 64-byte line is fetched from a lower cache level or from main memory. Once loaded, the line occupies one slot in the cache.
If the cache set is already full, one existing line must be evicted to make room. Processors use replacement policies to decide which line to discard. The most common approaches are least recently used (LRU), which evicts the line that hasn’t been accessed for the longest time, and pseudo-LRU, a cheaper approximation that requires less hardware to implement. The evicted line is written back to the next cache level if it was modified, or simply discarded if it’s still identical to its copy in memory.
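The LRU idea can be sketched as a small software model of one set. This is true LRU with explicit age counters; real hardware typically uses the cheaper pseudo-LRU approximation, and the struct layout here is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of true-LRU replacement inside one 4-way cache set: each way
   carries an age counter (0 = most recently used), and a miss evicts
   the oldest valid way. */
#define WAYS 4

typedef struct {
    uint64_t tag[WAYS];
    int      age[WAYS];
    bool     valid[WAYS];
} CacheSet;

/* Returns true on a hit; on a miss, installs tag over a free or LRU way. */
bool access_set(CacheSet *s, uint64_t tag) {
    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == tag) {      /* hit: move to front */
            for (int v = 0; v < WAYS; v++)
                if (s->valid[v] && s->age[v] < s->age[w])
                    s->age[v]++;
            s->age[w] = 0;
            return true;
        }
    }
    int victim = 0;                                 /* miss: pick a victim */
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w]) { victim = w; break; }    /* prefer a free way */
        if (s->age[w] > s->age[victim]) victim = w; /* else the oldest */
    }
    for (int v = 0; v < WAYS; v++)                  /* every other line ages */
        if (s->valid[v]) s->age[v]++;
    s->tag[victim]   = tag;
    s->age[victim]   = 0;
    s->valid[victim] = true;
    return false;
}
```

Tracking an exact recency ordering like this costs several bits per way, which is why hardware often settles for pseudo-LRU's one-bit-per-decision tree instead.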
Cache Line States in Multicore Processors
When multiple cores share memory, each core’s cache may hold its own copy of the same cache line. Keeping these copies consistent is called cache coherency, and most processors handle it with a protocol called MESI. Each cache line is tracked in one of four states:
- Modified: This core has changed the data. No other cache has a copy, and the line differs from main memory.
- Exclusive: This core has the only cached copy, and it matches main memory. The core can modify it without notifying others.
- Shared: Multiple caches may hold copies. All match main memory. If any core wants to write, it must first invalidate the other copies.
- Invalid: The line contains no usable data, either because it was evicted or another core invalidated it.
These state transitions happen automatically in hardware. When one core writes to a cache line, the hardware broadcasts an invalidation signal. Other cores holding that line mark their copies as invalid and must re-fetch the data if they need it. This bus traffic has real performance costs, which is why cache-line-aware programming matters in multithreaded code.
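The protocol can be sketched as a per-line state machine. This is the textbook core of MESI, not any specific CPU's implementation; real protocols add more bus events, and some (like AMD's MOESI) add states.

```c
#include <stdbool.h>

/* One cache line's MESI state in one core, driven by that core's own
   accesses and by bus events observed from other cores. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } MesiEvent;

MesiState mesi_next(MesiState s, MesiEvent e, bool others_have_copy) {
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                      /* read miss: fetch the line */
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;                              /* M/E/S hit: no change */
    case LOCAL_WRITE:
        return MODIFIED;                       /* from S or I, peers get invalidated first */
    case REMOTE_READ:
        if (s == MODIFIED || s == EXCLUSIVE)   /* M also supplies/writes back its data */
            return SHARED;
        return s;
    case REMOTE_WRITE:
        return INVALID;                        /* another core took ownership */
    }
    return s;
}
```

Notice that the only transition requiring no bus traffic at all is Exclusive to Modified, which is precisely the benefit the Exclusive state exists to provide.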
False Sharing
False sharing is one of the most common cache-related performance traps in parallel programming. It happens when two threads on different cores modify different variables that happen to sit on the same 64-byte cache line. Neither thread is actually sharing data with the other, but the hardware doesn’t know that. It sees two cores writing to the same cache line and forces constant invalidation and re-fetching between them.
The fix is straightforward: pad your per-thread data so each thread’s working variables occupy their own cache line. This is why some performance-critical code aligns data structures to 64-byte boundaries. Intel’s documentation specifically recommends this for per-thread allocations to avoid the unnecessary coherency traffic that false sharing creates.
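In C11 the padding can be expressed directly with alignas; the struct and array names here are illustrative.

```c
#include <stdalign.h>

/* Give each thread's counter its own 64-byte line. alignas(64) on the
   member raises the struct's alignment to 64, and since a struct's size
   is always a multiple of its alignment, sizeof rounds up to 64 too, so
   adjacent array elements can never share a cache line. */
#define CACHE_LINE 64

typedef struct {
    alignas(CACHE_LINE) long count;
} PaddedCounter;

PaddedCounter per_thread[8];   /* one padded slot per worker thread */
```

The cost is memory: each 8-byte counter now occupies 64 bytes. That trade is almost always worth it for hot per-thread data, and rarely worth it for cold data.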
Alignment and Performance Penalties
When a piece of data straddles two cache lines, the processor may need to load both lines to complete a single read or write. For most operations, this penalty is modest. But in specific scenarios it can be severe. Benchmarks on AMD Ryzen processors show that a simple 4-byte store crossing a cache line boundary runs about 5 times slower than one staying within a single line. Crossing a page boundary (where virtual memory mappings change) is even worse, around 21 times slower in the same tests.
SIMD operations, which process multiple values at once using wider registers, are even more sensitive. On the same hardware, unaligned 16-byte SIMD stores showed penalties even within a cache line depending on the offset, with slowdowns of 2 to 5 times compared to aligned stores. If you’re working with performance-critical inner loops, aligning your data to cache line boundaries is one of the simpler optimizations available.
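For heap buffers, C11's aligned_alloc gives line-aligned storage. One caveat baked into the sketch: the standard requires the requested size to be a multiple of the alignment, hence the round-up.

```c
#include <stdlib.h>

/* Allocate n floats starting on a 64-byte boundary, so vector loads and
   stores over the buffer start line-aligned. aligned_alloc is C11 and
   requires size to be a multiple of the alignment, so round up. */
float *alloc_line_aligned(size_t n) {
    size_t bytes = ((n * sizeof(float)) + 63) & ~(size_t)63;
    return aligned_alloc(64, bytes);
}
```

Free the result with plain free(). On pre-C11 POSIX systems, posix_memalign serves the same purpose.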
Hardware Prefetching
Modern CPUs don’t just react to cache misses. They try to predict which cache lines you’ll need next and fetch them before you ask. This is called hardware prefetching, and it runs in the background without any action from the programmer.
A stream prefetcher detects when your program is reading consecutive cache lines (like iterating through an array) and starts loading upcoming lines ahead of your code’s progress. A stride prefetcher handles patterns where accesses skip a regular number of bytes, common in loops that step through every Nth element of an array. An adjacent line prefetcher simply loads the neighboring cache line whenever you access a given line, effectively doubling the fetch size for workloads that benefit from it.
These prefetchers are why sequential memory access is so much faster than random access in practice. Walking through an array in order lets the prefetcher stay ahead of your code, hiding most of the memory latency. Jumping to random locations defeats the prefetcher entirely, and every access risks a full cache miss with its 200-plus cycle penalty.
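The contrast is easy to demonstrate: sum the same array once in order and once through a shuffled index table, and time the two loops. This sketch shows just the two access patterns (the timing harness is omitted; actual slowdowns vary with the CPU and array size).

```c
#include <stddef.h>

/* Prefetcher-friendly: consecutive cache lines, so the stream prefetcher
   stays ahead of the loop. */
long sum_sequential(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Prefetcher-hostile: the same elements visited in shuffled order, so
   nearly every access can miss once the array outgrows the caches. */
long sum_shuffled(const long *a, const size_t *order, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) s += a[order[i]];
    return s;
}
```

Both functions touch identical data and do identical arithmetic; any measured difference between them is purely the memory system at work.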

