A CPU processes data by repeatedly performing a simple three-step loop: fetching an instruction from memory, decoding what that instruction means, and executing it. A modern processor running at 3.2 GHz completes 3.2 billion of these cycles every second, and clever engineering tricks let it handle multiple instructions at once. The basic idea is straightforward, but the details of how it all works at speed are what make modern computing possible.
The Fetch-Decode-Execute Cycle
Every piece of work a CPU does, whether it’s adding two numbers or checking if a password matches, comes down to the same repeating loop.
In the fetch stage, the CPU grabs the next instruction from main memory (RAM). It keeps track of where it is in the program using an internal counter that holds the address of the next instruction. That address gets sent out to RAM, and the instruction stored at that location travels back into the processor. The counter then advances past the instruction just fetched, pointing to the next one in line.
In the decode stage, the CPU’s control unit figures out what the instruction is actually asking for. Every instruction has two parts: an operation code that says what to do (add, compare, move data) and an operand that says what to do it with (a number, a memory address, a register). The control unit splits these apart and sends the right signals to the rest of the processor so it’s ready to act.
In the execute stage, the CPU carries out the instruction. If it’s a math problem, the arithmetic circuits fire. If it’s a comparison, the logic circuits handle it. If the instruction says to store a result, the data gets written back to a register or to memory. Then the cycle starts over with the next instruction.
This loop runs billions of times per second. Every app on your phone, every frame of a video game, every cell in a spreadsheet is ultimately the product of this cycle repeating at extraordinary speed.
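The loop above can be sketched in a few lines of Python. This is a toy model, not a real instruction set: the opcodes, register names, and instruction format are invented for illustration, but the fetch/decode/execute structure mirrors the three stages described.

```python
# A toy fetch-decode-execute loop. Opcodes, registers, and the
# instruction format are invented for illustration.

def run(program):
    registers = {"A": 0, "B": 0}
    pc = 0  # program counter: address of the next instruction

    while pc < len(program):
        instruction = program[pc]      # fetch the instruction at pc
        pc += 1                        # advance the counter
        opcode, operand = instruction  # decode: split opcode from operand

        if opcode == "LOAD_A":         # execute
            registers["A"] = operand
        elif opcode == "LOAD_B":
            registers["B"] = operand
        elif opcode == "ADD":          # A = A + B
            registers["A"] += registers["B"]
        elif opcode == "HALT":
            break

    return registers["A"]

result = run([("LOAD_A", 2), ("LOAD_B", 3), ("ADD", None), ("HALT", None)])
print(result)  # 5
```

A real CPU does the same bookkeeping in hardware, with the "while loop" replaced by the clock and the dictionary lookups replaced by wired circuits.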
The Two Engines Inside the CPU
The actual work of processing splits between two main components: the control unit and the arithmetic logic unit (ALU).
The control unit is the manager. It doesn’t do any math itself. Instead, it reads each decoded instruction and sends electrical signals that coordinate the rest of the processor. It tells the ALU what operation to perform, opens and closes the right data pathways, and keeps everything synchronized with the clock.
The ALU is the workhorse. It contains two types of circuits: arithmetic circuits that handle addition, subtraction, multiplication, and division, and logic circuits that perform comparisons and binary operations like AND, OR, and XOR. When your computer checks whether a value is greater than another, or adds up a column of numbers, the ALU is doing that work. Despite the complexity of modern software, every computation ultimately breaks down into these simple operations performed at incredible speed.
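The ALU's job can be modeled as a single dispatch over operation codes. The operation names below are invented; the point is the split between arithmetic circuits and logic circuits that the text describes.

```python
# A miniature ALU: dispatch to "arithmetic" or "logic" circuits based on
# an operation code. Operation names here are invented for illustration.

def alu(op, a, b):
    arithmetic = {
        "ADD": a + b,
        "SUB": a - b,
        "MUL": a * b,
    }
    logic = {
        "AND": a & b,   # bitwise AND
        "OR":  a | b,   # bitwise OR
        "XOR": a ^ b,   # bitwise XOR
        "GT":  int(a > b),  # comparison: 1 if a is greater than b
    }
    if op in arithmetic:
        return arithmetic[op]
    return logic[op]

print(alu("ADD", 6, 3))  # 9
print(alu("XOR", 6, 3))  # 5  (0b110 ^ 0b011 = 0b101)
print(alu("GT", 6, 3))   # 1
```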
Clock Speed and What GHz Actually Means
A CPU’s clock is an internal signal that ticks at a fixed rate, and each tick is one “cycle.” Clock speed, measured in gigahertz, tells you how many of those ticks happen per second. A 3.2 GHz processor ticks 3.2 billion times per second, with each cycle lasting roughly 0.3 nanoseconds.
Not every instruction finishes in a single cycle, though. Some simple operations complete in one tick. More complex instructions, like dividing two numbers, can take many cycles. And sometimes the reverse is true: the processor finishes multiple instructions in a single cycle by working on different stages of different instructions simultaneously. So clock speed gives you a rough sense of how fast a processor is, but it's not the whole picture.
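The arithmetic behind these figures is simple: cycle time is the reciprocal of clock frequency. Using the 3.2 GHz example from the text:

```python
# Converting clock speed to cycle time, and cycle counts to wall-clock
# time. The 3.2 GHz figure matches the example in the text; the 20-cycle
# division latency is an illustrative assumption.

frequency_hz = 3.2e9                # 3.2 billion cycles per second
cycle_time_ns = 1e9 / frequency_hz  # nanoseconds per cycle

print(f"{cycle_time_ns:.4f} ns per cycle")  # 0.3125 ns

# A hypothetical 20-cycle division instruction at this clock speed:
print(f"{20 * cycle_time_ns:.2f} ns")  # 6.25 ns
```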
How Cache Keeps the CPU Fed
RAM is fast by everyday standards, but it’s painfully slow compared to the CPU itself. Fetching data from RAM takes roughly 50 to 100 nanoseconds, which translates to hundreds of wasted clock cycles where the processor would just be sitting idle. To solve this, CPUs have small, ultra-fast memory banks called caches built directly onto the chip.
Cache comes in three levels, each trading size for speed:
- L1 cache is the smallest (usually tens of kilobytes per core) but the fastest, with a latency of just 1 to 4 clock cycles, under 1 nanosecond.
- L2 cache is larger (256 KB to 2 MB) with latency of 7 to 14 cycles, roughly 3 to 5 nanoseconds.
- L3 cache is shared across all cores, ranges from 4 to 64 MB, and takes 20 to 40 cycles to access, about 10 to 20 nanoseconds.
When the CPU needs a piece of data, it checks L1 first, then L2, then L3, and only goes to RAM as a last resort. Because programs tend to reuse the same data and access nearby memory locations in sequence, cache catches the vast majority of requests before they ever reach RAM. This is one of the biggest reasons modern processors feel fast despite the gap between CPU speed and memory speed.
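This check-each-level-in-order behavior can be modeled as a short lookup loop. The per-level latencies below sit inside the ranges quoted above; which addresses live in which level is invented for illustration.

```python
# Modeling the cache hierarchy as an ordered lookup with per-level
# latency. Latencies (in cycles) fall within the ranges quoted above;
# the contents of each level are invented for illustration.

LEVELS = [
    ("L1", 3),     # ~1-4 cycles
    ("L2", 10),    # ~7-14 cycles
    ("L3", 30),    # ~20-40 cycles
    ("RAM", 250),  # hundreds of cycles
]

def access(address, contents):
    """Return (where the data was found, total cycles spent searching)."""
    cycles = 0
    for level, latency in LEVELS:
        cycles += latency
        if address in contents.get(level, set()):
            return level, cycles
    return "RAM", cycles  # fell through every cache: RAM always has it

contents = {"L1": {0x10}, "L2": {0x20}, "L3": {0x30}}
print(access(0x10, contents))  # ('L1', 3)
print(access(0x30, contents))  # ('L3', 43)
print(access(0x99, contents))  # miss everywhere: ('RAM', 293)
```

The numbers make the payoff obvious: an L1 hit costs 3 cycles while a full miss costs nearly 300, which is why a high hit rate matters far more than raw clock speed.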
Pipelining: Overlapping Instructions
If the CPU finished every instruction completely before starting the next one, it would waste a lot of time. During the fetch stage, for instance, the ALU would be sitting idle. Pipelining solves this by overlapping instructions like an assembly line.
While instruction #1 is being executed, instruction #2 is being decoded, and instruction #3 is being fetched from memory, all at the same time. Each stage of the pipeline handles a different instruction simultaneously. The first instruction still takes the full number of stages to complete, but after that, the processor can finish one instruction per cycle, as long as nothing stalls the pipeline.
The math is simple. Without pipelining, running 100 instructions through a 4-stage pipeline takes 400 cycles. With pipelining, it takes only 103: 4 cycles to fill the pipeline, then one cycle per remaining instruction. That’s nearly a 4x speedup for the same clock speed.
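That arithmetic generalizes to any stage count: an ideal pipeline with S stages needs S + (N − 1) cycles for N instructions, versus S × N without pipelining. A quick sketch, assuming no stalls:

```python
# The pipeline timing arithmetic from the text: S stages, N instructions,
# assuming an ideal pipeline with no stalls.

def cycles_without_pipeline(stages, instructions):
    return stages * instructions

def cycles_with_pipeline(stages, instructions):
    # stages cycles to fill the pipeline, then one per remaining instruction
    return stages + (instructions - 1)

print(cycles_without_pipeline(4, 100))  # 400
print(cycles_with_pipeline(4, 100))     # 103
print(round(400 / 103, 2))              # 3.88: nearly 4x for a 4-stage pipeline
```

As N grows, the speedup approaches the number of stages, which is why the text calls it "nearly a 4x speedup."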
Branch Prediction and Guessing Ahead
Pipelining has a weakness: decisions. When your code hits an “if” statement, the CPU doesn’t know which path to take until the comparison is actually executed. But by then, it’s already fetched and started working on the next few instructions. If it guessed wrong about which path the code would follow, it has to throw away that work and start over, costing several cycles.
To minimize this, modern CPUs use branch prediction. The processor keeps a record of how each decision point resolved in the past and bets that it will go the same way next time. This sounds crude, but it works remarkably well. Basic prediction tables achieve 80 to 90 percent accuracy, and more advanced two-level predictors (which track patterns across multiple recent branches) reach over 95 percent accuracy. That means the pipeline stalls on a wrong guess fewer than 5 times out of 100.
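A classic building block of those basic prediction tables is the two-bit saturating counter: each branch gets a tiny counter that must be wrong twice in a row before the prediction flips, so a single anomaly (like a loop's final exit) doesn't derail an otherwise stable pattern. A minimal sketch:

```python
# A two-bit saturating counter, the classic building block of basic
# branch prediction tables. Two consecutive wrong outcomes are needed
# before the prediction flips direction.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # states 0-1 predict "not taken", 2-3 predict "taken"

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        # Saturate at the ends of the 0..3 range.
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

p = TwoBitPredictor()
outcomes = [True, True, False, True, True]  # a loop branch: mostly taken
correct = 0
for actual in outcomes:
    if p.predict() == actual:
        correct += 1
    p.update(actual)
print(f"{correct}/{len(outcomes)} correct")  # 4/5: only the one not-taken missed
```

The two-level predictors mentioned above extend this idea by indexing a table of such counters with the recent history of branch outcomes, which is what lets them capture patterns across multiple branches.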
When the CPU is confident in its prediction, it goes further with speculative execution: it doesn’t just fetch the predicted instructions, it actually starts executing them. If the prediction turns out to be correct, the results are already done. If not, the processor discards the speculative work and picks up on the correct path. It’s a gamble that pays off the overwhelming majority of the time.
Scale: Billions of Transistors at Work
All of these mechanisms, the ALU, control unit, caches, pipeline stages, and prediction tables, are built from transistors: tiny switches that flip between on and off to represent 1s and 0s. During each clock cycle, billions of these transistors open and close to carry out calculations.
The scale of modern chips is staggering. Apple's M4 processor contains 28 billion transistors on a chip manufactured with a 3-nanometer process. The M3 Ultra, a dual-chip design, packs 184 billion. These transistors aren't just making the processor faster in raw clock speed. They're enabling wider pipelines, larger caches, better branch predictors, and more cores, all of which let the CPU do more useful work per second.
At its core, though, the principle hasn’t changed since the earliest computers. Fetch an instruction, figure out what it means, do the work, repeat. Everything else is engineering designed to make that loop run as many times per second as physically possible.