A TLB, or Translation Lookaside Buffer, is a small, fast cache inside your CPU that speeds up memory access by storing recent translations between virtual memory addresses and physical memory addresses. Every time your processor needs to read or write data, it has to convert the address your software uses (a virtual address) into the actual location in RAM (a physical address). Without a TLB, this translation would require looking through page tables stored in main memory on every single access, slowing everything down dramatically. The TLB keeps the most recently used translations on hand so the CPU can skip that lookup most of the time.
Why Virtual-to-Physical Translation Matters
Modern operating systems don’t let programs access RAM directly. Instead, each program gets its own virtual address space, a kind of private map of memory that the OS manages behind the scenes. When a program reads a variable or loads an instruction, the CPU must translate that virtual address into a real physical location in RAM. This translation happens through data structures called page tables, which the OS maintains for every running process.
Page tables live in main memory, and looking something up in main memory is slow compared to what the CPU can do internally. A single main memory access can take dozens of processor cycles. If the CPU had to consult page tables for every memory operation, it would spend most of its time waiting. The TLB solves this by caching translations right on the chip, where access takes just a cycle or two.
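The translation itself is simple arithmetic: the virtual address splits into a page number (looked up in the page table) and an offset (kept as-is). A minimal sketch, assuming 4KB pages and a toy dictionary in place of a real multi-level page table:

```python
# Illustrative sketch only: splitting a virtual address into a page
# number and an offset, assuming 4KB (2^12-byte) pages.
PAGE_SIZE = 4096          # 4KB pages -> 12 offset bits
OFFSET_BITS = 12

def split_address(vaddr):
    """Return (virtual page number, offset within the page)."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_SIZE - 1)

# A toy page table mapping virtual page numbers to physical frame numbers.
page_table = {0x1: 0x2A, 0x2: 0x07}

def translate(vaddr):
    vpn, offset = split_address(vaddr)
    frame = page_table[vpn]           # the slow lookup a TLB short-circuits
    return (frame << OFFSET_BITS) | offset
```

Because the offset bits never change, the TLB only needs to cache the page-number-to-frame mapping, not every individual address.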
How a TLB Hit and Miss Work
When the CPU needs an address translation, it checks the TLB first. If the translation is there, that’s called a TLB hit, and the physical address is returned almost instantly. The program continues without delay.
If the translation isn’t in the TLB, that’s a TLB miss. The hardware (or software, depending on the architecture) must then walk the page tables, which costs additional main memory accesses; on x86-64, a full walk can require up to four of them, one per level of the page table. During this time the access stalls: even if the data the program wants is already sitting in the CPU’s cache, the system won’t complete the access until it has confirmed the correct mapping and permission information. Once the translation is found, it gets loaded into the TLB, and the original memory access is retried, this time hitting in the TLB.
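The hit/miss/refill cycle can be sketched in a few lines. This is a toy model, not a real CPU interface; the dictionary standing in for the TLB and the `stats` counter are both invented for illustration:

```python
# Minimal sketch of a TLB sitting in front of a page-table lookup.
page_table = {0x1: 0xAA, 0x2: 0xBB}   # VPN -> physical frame number
tlb = {}                              # the small on-chip cache
stats = {"hits": 0, "misses": 0}

def lookup(vpn):
    if vpn in tlb:                    # TLB hit: answered in a cycle or two
        stats["hits"] += 1
        return tlb[vpn]
    stats["misses"] += 1              # TLB miss: walk the page tables
    frame = page_table[vpn]
    tlb[vpn] = frame                  # cache the translation for next time
    return frame

lookup(0x1)   # miss: first touch of this page
lookup(0x1)   # hit: the translation is now cached
lookup(0x2)   # miss
```

The second access to page 0x1 hits because the first miss left the translation behind, which is exactly the locality the TLB is betting on.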
Hardware-Managed vs. Software-Managed TLBs
Not all processors handle TLB misses the same way. In x86 chips from Intel and AMD, as well as ARM Cortex processors and IBM’s POWER server chips, the TLB is hardware-managed. When a miss occurs, dedicated circuitry automatically walks through the page tables, finds the right entry, loads it into the TLB, and retries the access. The operating system doesn’t need to get involved at all.
Other architectures, including MIPS and UltraSPARC, use software-managed TLBs. On these processors, a TLB miss triggers a special exception, and the operating system’s code is responsible for searching the page tables, loading the correct translation into the TLB, and restarting the instruction that caused the miss. This gives the OS more flexibility in how it organizes translations, but it adds overhead, especially in virtualized environments. Running virtual machines on a software-managed TLB architecture can cause severe slowdowns because two layers of translation need to happen in software. Testing on one such architecture showed performance degradation averaging 72%, with some workloads nearly doubling in execution time.
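The software-managed model can be sketched as an exception-and-handler loop. Everything here (the fault class, handler names) is invented to illustrate the control flow, not any real architecture's interface:

```python
# Sketch of the software-managed model: a miss raises a fault and an
# OS-level handler (ordinary code, not hardware) refills the TLB.
class TLBMissFault(Exception):
    pass

tlb = {}
page_table = {0x3: 0x10}

def hw_lookup(vpn):
    """The hardware only checks the TLB; it never walks page tables."""
    if vpn not in tlb:
        raise TLBMissFault(vpn)
    return tlb[vpn]

def access(vpn):
    try:
        return hw_lookup(vpn)
    except TLBMissFault:
        tlb[vpn] = page_table[vpn]    # OS handler searches its own structures
        return hw_lookup(vpn)         # restart the faulting access
```

The flexibility comes from the handler being ordinary OS code, which can consult any data structure it likes; the overhead comes from taking an exception on every miss.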
Split TLBs and Multiple Levels
Modern CPUs typically don’t have just one TLB. Most processors split the first level into two: an instruction TLB (ITLB) for code the CPU is about to execute, and a data TLB (DTLB) for data the program is reading or writing. This split allows the CPU to look up an instruction translation and a data translation at the same time without conflict.
Beyond the first level, many chips include a larger, unified second-level TLB that handles misses from either the ITLB or DTLB. Intel’s Golden Cove cores (used in 12th-gen Alder Lake processors) illustrate the scale: the ITLB holds 256 entries for standard 4KB pages, the L1 data TLB holds 96 entries, and the ITLB also has 32 dedicated entries for large 2MB or 4MB pages. Intel increased the L1 data TLB capacity by 50% over the previous generation, reflecting how important TLB performance is to overall speed.
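A two-level lookup can be sketched as a chain of fallbacks. The structure below is illustrative (the interface is invented), showing how an L1 miss can still be caught by the unified L2 before a full page-table walk is needed:

```python
# Sketch of a two-level TLB: split L1 (instruction/data) backed by a
# unified L2; a miss in both falls through to the page tables.
l1_itlb, l1_dtlb, l2_tlb = {}, {}, {}
page_table = {0x4: 0xC0}

def lookup(vpn, is_instruction):
    l1 = l1_itlb if is_instruction else l1_dtlb
    if vpn in l1:
        return l1[vpn]                # L1 hit: fastest path
    if vpn in l2_tlb:
        l1[vpn] = l2_tlb[vpn]         # L1 miss, L2 hit: refill L1
        return l1[vpn]
    frame = page_table[vpn]           # both miss: walk the page tables
    l2_tlb[vpn] = l1[vpn] = frame
    return frame
```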
Huge Pages and TLB Reach
By default, x86 processors manage memory in 4KB pages. With a TLB that holds, say, 256 entries, that covers only 1MB of memory at a time. For applications working with large datasets, this means constant TLB misses as the processor cycles through translations for thousands of pages.
Huge pages solve this by using 2MB or even 1GB pages instead of 4KB ones. A single TLB entry that would normally cover 4KB of memory now covers 2MB or more. The total amount of memory the TLB can map, known as TLB reach, jumps dramatically. Fewer page table entries are needed overall, which means less memory consumed by page tables themselves and far fewer TLB misses. This is particularly valuable for memory-intensive workloads like databases, virtual machines, and scientific computing. Red Hat’s documentation notes that deploying virtual machines with huge page support significantly increases performance by boosting CPU cache hits against the TLB.
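The reach arithmetic is straightforward: reach is simply entry count times page size. A quick sketch of the numbers from the paragraphs above:

```python
# TLB reach = number of entries x page size. Illustrative arithmetic only.
def tlb_reach(entries, page_size):
    return entries * page_size

KB, MB = 1024, 1024 * 1024

# 256 entries of 4KB pages reach just 1MB...
small_page_reach = tlb_reach(256, 4 * KB)

# ...while even 32 entries of 2MB huge pages reach 64MB.
huge_page_reach = tlb_reach(32, 2 * MB)
```

A handful of huge-page entries maps more memory than hundreds of small-page entries, which is why the technique pays off so well for large working sets.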
Context Switching and Process Identifiers
When the operating system switches from one process to another, the virtual address mappings change entirely. Process A’s virtual address 0x1000 points to a completely different physical location than process B’s 0x1000. Historically, this meant the OS had to flush (clear) the entire TLB on every context switch, throwing away all cached translations. The new process would start cold, suffering TLB misses on every memory access until the cache warmed up again.
Modern processors avoid this with a feature called PCID (Process-Context Identifier) on Intel chips, or ASID (Address Space Identifier) more generally. Each process gets a small numeric tag, and every TLB entry is stamped with the tag of the process that created it. When the CPU switches to a different process, it simply starts using a different tag. The old translations stay in the TLB, and if the OS switches back to the previous process, those cached translations are still valid and ready to use. The tag is stored in the low bits of the same CPU register (CR3) that already holds the page table pointer, so switching it adds essentially no extra cost.
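Tagging can be sketched by keying each TLB entry on an (ASID, page number) pair rather than the page number alone. The interface below is invented for illustration:

```python
# Sketch of ASID/PCID tagging: entries are keyed by (asid, vpn), so a
# context switch just changes the current tag instead of flushing.
tlb = {}

def fill(asid, vpn, frame):
    tlb[(asid, vpn)] = frame

def lookup(current_asid, vpn):
    return tlb.get((current_asid, vpn))      # None models a TLB miss

fill(asid=1, vpn=0x1000 >> 12, frame=0x5)    # process A's 0x1000
fill(asid=2, vpn=0x1000 >> 12, frame=0x9)    # process B's 0x1000
```

Both processes’ translations for virtual address 0x1000 coexist in the cache; which one a lookup sees depends only on the tag currently in use.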
Linux adopted PCID support to speed up transitions between user-space and kernel-space, which became especially important after security mitigations in 2018 required separating kernel and user page tables. Without PCID, those mitigations would have required flushing the TLB on every system call, creating a measurable performance hit.
TLB Shootdowns in Multi-Core Systems
On a single-core processor, the OS only needs to manage one TLB. Multi-core systems are more complicated. When two cores are running threads from the same process, both cores may have cached the same virtual-to-physical mappings in their local TLBs. If the OS changes or removes a mapping (for example, when freeing memory or swapping a page to disk), it needs to ensure every core’s TLB reflects that change. Stale translations could let a process access memory it shouldn’t, or read data that’s no longer there.
The mechanism for this is called a TLB shootdown. The core initiating the change flushes the stale entry from its own TLB, then sends an inter-processor interrupt (IPI) to every other core that might have the same entry cached. Each remote core pauses what it’s doing, flushes the relevant TLB entries, signals back that it’s done, and then resumes. The initiating core waits for all acknowledgments before continuing.
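The sequence above can be modeled in miniature. Real systems deliver inter-processor interrupts through hardware; here a plain function call stands in for the IPI, and all names are invented:

```python
# Toy model of a TLB shootdown: the initiating core invalidates locally,
# then asks every other core to drop the same entry and counts the acks.
cores = {cid: {"tlb": {0x7: 0xAB}} for cid in range(4)}

def shootdown(initiator, vpn):
    cores[initiator]["tlb"].pop(vpn, None)   # flush the local TLB first
    acks = 0
    for cid, core in cores.items():
        if cid == initiator:
            continue
        core["tlb"].pop(vpn, None)           # remote core flushes on "IPI"
        acks += 1                            # and signals completion
    return acks                              # initiator waits for all acks

shootdown(0, 0x7)
```

In the toy model every remote core is interrupted whether or not it ever cached the entry; that over-broadcast is exactly what the optimization research mentioned below tries to avoid.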
This coordination isn’t free. IPI delivery alone can take several hundred CPU cycles, and the entire shootdown process can stretch into microseconds. On systems with many cores running highly multithreaded applications, shootdowns can become a real bottleneck. The overhead is significant enough that performance-conscious programmers are sometimes advised to avoid frequent memory mapping changes or to structure applications to minimize the need for shootdowns in the first place. Research into optimizing shootdowns, such as tracking which cores actually accessed a given page, continues to be an active area of systems engineering.

