Pinned memory is a region of RAM that the operating system is prevented from moving or swapping to disk. Normally, your OS can shuffle memory pages around or write them to a swap file to free up physical RAM. Pinned memory (also called page-locked memory) stays fixed at a specific physical address for as long as it’s allocated, guaranteeing that hardware devices can access it at any moment without the risk that the data has been moved or paged out.
This concept matters most in GPU computing and high-speed networking, where hardware needs to read or write system memory directly, bypassing the CPU. If the OS moved that memory to disk at the wrong moment, the transfer would fail or corrupt data. Pinning solves that problem.
How Virtual Memory Normally Works
Your operating system gives every program the illusion of a large, continuous block of memory. Behind the scenes, it breaks that memory into small chunks called pages (typically 4 KB each) and maps them to physical RAM. When RAM fills up, the OS can “page out” some of those chunks to your hard drive or SSD, freeing physical memory for other tasks. When the program needs that data again, the OS quietly loads it back. This is called paging or swapping, and it happens constantly without programs noticing.
Pinned memory opts out of this system. When you pin a memory region, you’re telling the OS: keep these pages in physical RAM at all times, at the same physical address. The OS will not move them, swap them out, or remap them. This makes the memory predictable for external hardware but comes at a cost, since that RAM is now unavailable for anything else, even if the program isn’t actively using it.
Why GPUs Need Pinned Memory
The most common reason developers encounter pinned memory is GPU programming, particularly with NVIDIA’s CUDA platform. When you transfer data between your CPU’s main memory (the “host”) and a GPU (the “device”), the GPU’s DMA engine handles the copy. DMA, or Direct Memory Access, lets hardware move data without involving the CPU, but it can only target memory at a known, fixed physical address. That means it needs pinned memory.
If your data lives in regular, pageable memory, the system has to do extra work. It first allocates a temporary pinned buffer, copies your data into that buffer, and only then transfers it to the GPU. That intermediate copy adds measurable overhead. Benchmarks using NVIDIA’s profiling tools show pinned memory achieving roughly 6.68 GB/s for host-to-device transfers and 6.70 GB/s in the reverse direction. Pageable memory, by contrast, managed only about 3.69 GB/s and 3.87 GB/s respectively. That’s nearly double the throughput just by eliminating the extra copy step.
By allocating your host arrays as pinned memory from the start, you skip that temporary buffer entirely. The GPU’s DMA engine reads directly from your data’s fixed location in RAM.
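A minimal CUDA sketch of that pattern (illustrative only; error checking and the actual data fill are omitted, and the 256 MB buffer size is an arbitrary choice):

```cuda
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 256 * 1024 * 1024;  /* 256 MB transfer buffer */
    float *host_pinned, *device;

    /* cudaMallocHost returns page-locked host memory: the GPU's DMA
       engine can read it directly, with no staging copy. */
    cudaMallocHost((void **)&host_pinned, bytes);
    cudaMalloc((void **)&device, bytes);

    /* Host-to-device copy straight from the pinned buffer. */
    cudaMemcpy(device, host_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(device);
    cudaFreeHost(host_pinned);  /* releases the pinned pages */
    return 0;
}
```

Had `host_pinned` come from a plain `malloc()` instead, the same `cudaMemcpy` call would silently route through a driver-managed staging buffer.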
RDMA and High-Speed Networking
GPUs aren’t the only hardware that needs pinned memory. Remote Direct Memory Access (RDMA) networking relies on the same principle. RDMA lets one machine read or write another machine’s memory directly over a network, bypassing both CPUs for extremely low latency. For this to work, the network card needs a guaranteed, stable mapping between virtual memory addresses and physical ones.
When you register a memory region for RDMA, the system pins its pages and records the virtual-to-physical address mapping in a translation table. The network card caches entries from this table in a small onboard SRAM (just a few megabytes), and a cache miss is expensive. If the OS were free to remap or swap those pages, the entire mechanism would break. Pinning the memory is what makes the address translation reliable and the whole system fast.
How To Pin Memory
The specific method depends on your platform and what you’re doing with the memory.
- Linux: The `mlock()` system call locks a specified address range so the kernel won’t page it out. You provide a starting address and a length in bytes. There’s also `mlockall()`, which locks every mapped page in the process. To release them, use `munlock()` or `munlockall()`. For real-time applications, you can also pass the `MAP_LOCKED` flag when allocating memory with `mmap()`, combining allocation and pinning in one step.
- Windows: The equivalent is `VirtualLock()`, which prevents a range of committed pages from being paged to disk.
- CUDA: Functions like `cudaHostAlloc()` or `cudaMallocHost()` allocate memory on the host side that is already pinned and ready for GPU transfers.
In all cases, unpinning (unlocking) the memory when you’re done is important. Forgetting to release pinned memory is a common source of resource leaks.
The Tradeoff: System Memory Pressure
Pinned memory is not free. Every page you lock into RAM is a page the operating system can no longer manage flexibly. On a system with 16 GB of RAM, pinning 4 GB means the OS has only 12 GB left to work with for all other processes, file caches, and buffers. Pin too much and you starve the rest of the system, forcing other applications to swap more aggressively, which slows everything down.
Allocation is also slower than for regular memory. The OS has to fault in each physical page and update its internal bookkeeping to mark the pages as non-swappable. For small, short-lived allocations, this overhead can outweigh the transfer speed benefits. The general rule is to pin memory for large buffers that you’ll reuse across many transfers, not for every small allocation.
Most operating systems also impose per-process limits on how much memory can be locked. On Linux, unprivileged users typically have a default limit of 64 KB, though administrators can raise it. This exists precisely to prevent a single application from locking up all physical RAM.
Zero-Copy Memory: A Step Beyond Pinning
Standard pinned memory still requires an explicit copy command to move data between host and device. Zero-copy memory takes the concept further: it’s pinned host memory that the GPU can access directly over the PCIe bus, without any copy at all. The GPU reads from (or writes to) your system RAM as if it were local memory.
This works well when the GPU only needs to touch the data once or when the data set is too large to fit in GPU memory. Performance evaluations have shown zero-copy memory outperforming standard pinned transfers by roughly 19% in some workloads, because it eliminates the transfer step entirely. The tradeoff is that repeated access to the same data is slower, since every read crosses the PCIe bus instead of hitting fast GPU memory. For data accessed many times during a computation, copying it to the GPU’s own memory with a pinned transfer is still faster overall.
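A minimal zero-copy sketch in CUDA (illustrative; error checking omitted, and the trivial `scale` kernel is a stand-in for a real single-pass workload):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   /* each access crosses the PCIe bus */
}

int main(void) {
    const int n = 1 << 20;
    float *host, *dev_view;

    /* cudaHostAllocMapped pins the memory AND maps it into the GPU's
       address space, so no explicit cudaMemcpy is needed. */
    cudaHostAlloc((void **)&host, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dev_view, host, 0);

    /* The kernel reads and writes host RAM directly. */
    scale<<<(n + 255) / 256, 256>>>(dev_view, n);
    cudaDeviceSynchronize();

    cudaFreeHost(host);
    return 0;
}
```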
When Pinned Memory Makes Sense
Pinned memory is worth using when external hardware needs direct, reliable access to system RAM. The three most common scenarios are GPU data transfers, RDMA networking, and real-time systems where page faults (the brief stall when the OS loads a swapped-out page) are unacceptable. Audio processing, robotics control loops, and financial trading systems all fall into that last category.
For general-purpose programming with no hardware DMA or real-time constraints, pinned memory adds complexity without benefit. The OS’s normal paging system is well optimized, and letting it manage memory freely gives better overall system performance. Pin only what you need, keep it pinned only as long as you need it, and profile before and after to confirm it actually helps.