Simultaneous multithreading (SMT) is a processor design technique that lets a single CPU core run instructions from multiple program threads at the same time. Instead of one thread monopolizing the core’s resources, two or more threads share them, filling gaps that would otherwise go unused. Intel markets this technology as Hyper-Threading, while AMD simply calls it SMT, but the underlying concept is the same.
How SMT Works Inside a Processor
Modern processors are “superscalar,” meaning they have multiple execution units, each of which can handle a separate instruction in the same clock cycle. Think of these units as separate workstations on a factory floor: one handles math, another handles memory lookups, another handles comparisons. In theory, the processor can complete several instructions every clock cycle. In practice, a single program rarely keeps all those workstations busy at once. Instructions often depend on each other, or the processor stalls while waiting for data from memory. Those idle execution units represent wasted potential.
SMT addresses this by giving the processor awareness of more than one thread. Each thread gets its own set of registers (small, fast storage slots that track a program’s state) and its own program counter (the marker that tracks which instruction comes next). But the threads share the core’s execution units, caches, and scheduling hardware. Every cycle, the processor’s scheduler picks instructions from all available threads and sends them to whichever execution units are free. No special scheduling hardware is needed for this: the dynamic scheduling logic already present in modern out-of-order processors can handle instructions from multiple threads.
The result: when Thread A stalls waiting on a slow memory fetch, Thread B’s instructions can flow into the execution units that would have sat idle. The core extracts both instruction-level parallelism (doing multiple things from one thread) and thread-level parallelism (doing things from different threads) in the same cycle.
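This filling-in behavior can be sketched with a toy simulation. Everything here is illustrative: the issue width, stall penalty, and instruction mix are assumed numbers, not a model of any real core.

```python
STALL_CYCLES = 3   # assumed penalty, in cycles, for a memory stall
ISSUE_WIDTH = 2    # assumed number of issue slots (execution units) per cycle

def run(threads):
    """Cycles needed to issue every instruction from every thread.

    'A' is a one-cycle ALU op; 'M' is a load that stalls its thread
    for STALL_CYCLES cycles, leaving issue slots free for other threads.
    """
    pc = [0] * len(threads)       # per-thread program counter
    stall = [0] * len(threads)    # cycles each thread must still wait
    cycles = 0
    while any(pc[t] < len(threads[t]) for t in range(len(threads))):
        cycles += 1
        slots = ISSUE_WIDTH
        for t in range(len(threads)):
            if stall[t] > 0:      # thread is waiting on memory this cycle
                stall[t] -= 1
                continue
            while slots and pc[t] < len(threads[t]):
                op = threads[t][pc[t]]
                pc[t] += 1
                slots -= 1
                if op == 'M':     # load issued; thread stalls, core moves on
                    stall[t] = STALL_CYCLES
                    break
    return cycles

stream = ['A', 'M', 'A', 'A', 'M', 'A'] * 4   # a memory-heavy mix
solo = run([stream])                  # one thread: stalls leave slots idle
smt = run([stream, list(stream)])     # two threads: stalls overlap
print(solo, smt)   # twice the work takes far less than twice the cycles
```

In this model the second thread finishes almost for free, because its instructions slot into cycles the first thread would have spent stalled.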
Hardware Threads vs. Software Threads
The term “thread” means different things at different levels of a computer system, and SMT sits at the hardware level. Software threads are managed by the operating system’s kernel. The OS schedules them, each can make independent system calls, and switching between them costs many CPU cycles because the OS has to save and restore program state. Hardware threads, by contrast, are managed by the processor itself. Switching between them can happen in a single cycle because the core already holds separate registers for each thread. It just changes which program counter it’s reading from.
When your operating system sees an SMT-enabled processor, it reports each hardware thread as a “logical core.” A 6-core CPU with two-way SMT appears as 12 logical cores. The OS can schedule 12 software threads across them, but only 6 are full physical cores with their own dedicated execution resources. The other 6 logical cores are sharing those same resources with their siblings.
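On Linux, one way to see this topology is to group the logical CPUs listed in /proc/cpuinfo by their (physical id, core id) pairs: logical CPUs that share a pair are SMT siblings on the same physical core. The sketch below parses a hypothetical sample rather than reading a live system.

```python
# Hypothetical /proc/cpuinfo excerpt for a 2-core CPU with two-way SMT.
# On a real Linux machine you would read the file itself.
SAMPLE_CPUINFO = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 0
core id : 1
"""

def count_cores(cpuinfo_text):
    """Return (logical, physical) core counts from cpuinfo-style text."""
    logical = 0
    physical = set()          # unique (package, core) pairs
    package = core = None
    for line in cpuinfo_text.splitlines():
        if ':' not in line:
            continue
        key, _, value = line.partition(':')
        key, value = key.strip(), value.strip()
        if key == 'processor':
            logical += 1      # each 'processor' entry is one logical CPU
        elif key == 'physical id':
            package = value
        elif key == 'core id':
            core = value
            physical.add((package, core))
    return logical, len(physical)

logical, physical = count_cores(SAMPLE_CPUINFO)
print(logical, physical)   # 4 logical CPUs, 2 physical cores
```

Logical CPUs 0 and 2 share core 0, and 1 and 3 share core 1, which is exactly the sibling arrangement described above.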
Real-World Performance Gains
SMT does not double your performance. Two threads sharing one core’s resources inevitably compete for those resources. The actual benefit depends heavily on the workload. Historically, SMT has delivered single-digit to low double-digit percentage improvements in throughput on common benchmarks. Some workloads see meaningful gains; others see no improvement or even slight degradation.
The workloads that benefit most are those where individual threads frequently stall, particularly on memory accesses. When one thread is waiting for data, the other thread can use the core’s execution units productively. Server workloads, virtualization, and heavily multithreaded software like video encoding or 3D rendering tend to benefit because they run many threads that independently wait on memory or I/O. Conversely, workloads that already keep the execution units saturated with a single thread gain little from sharing those units with a second thread. Compute-heavy tasks with data already in cache often fall into this category.
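A crude utilization model shows why the benefit splits this way. If a single thread keeps only a fraction u of the core’s issue slots busy, n independent threads can together fill at most all of them, so the SMT speedup is roughly min(1, n·u)/u. This is a deliberate oversimplification that ignores contention for caches and other shared resources.

```python
def smt_speedup(u, n=2):
    """Estimated throughput of n SMT threads relative to one thread.

    u: fraction of the core's issue slots a single thread keeps busy
       (assumed, simplified model; ignores cache and resource contention).
    """
    return min(1.0, n * u) / u

print(smt_speedup(0.4))   # memory-bound thread, 40% busy: ~2.0x
print(smt_speedup(0.9))   # compute-bound thread, 90% busy: ~1.1x
```

The stall-heavy thread leaves enough slack for a sibling to double throughput, while the already-saturated thread leaves almost nothing to share, matching the pattern seen in real benchmarks.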
Testing on IBM’s 8-core POWER7 processor with four-way SMT illustrates the inconsistency. When researchers quadrupled the thread count by enabling SMT4, some benchmarks improved while others, like the Equake simulation, actually got slower. The average gains for workloads that did benefit hovered above 15%, but the spread was wide.
Power and Efficiency Tradeoffs
Adding SMT support to a core increases its power consumption by roughly 38 to 46 percent, according to research from the University of Virginia and IBM. The extra power comes from duplicated registers, additional physical resources needed to prevent new bottlenecks, and the simple fact that more instructions are executing per second. However, the increase in useful work (throughput) more than compensates for the higher power draw in most scenarios, making SMT a net win for energy efficiency per instruction completed.
Where SMT particularly shines in efficiency is on workloads with high rates of cache misses, where threads spend a lot of time waiting on slow main memory. In those situations, the idle execution resources are so abundant that a second thread can fill them cheaply. For workloads that rarely miss the cache, adding more independent cores (chip multiprocessing) tends to be more energy-efficient than sharing one core between threads.
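This tradeoff can be framed as energy per completed instruction, which is simply power divided by throughput. All figures below are assumed for illustration, not measurements.

```python
def energy_per_instruction(watts, instructions_per_sec):
    """Joules per instruction = power / throughput."""
    return watts / instructions_per_sec

# Baseline: assumed 50 W core retiring 1 billion instructions per second.
base = energy_per_instruction(50.0, 1.0e9)

# Miss-heavy workload: assume SMT adds 40% power but 60% throughput.
miss_heavy = energy_per_instruction(50.0 * 1.40, 1.0e9 * 1.60)

# Cache-resident workload: assume the same 40% power, only 10% throughput.
cache_resident = energy_per_instruction(50.0 * 1.40, 1.0e9 * 1.10)

print(miss_heavy < base)       # True: SMT is an efficiency win here
print(cache_resident < base)   # False: SMT costs energy per instruction
```

Under these assumed numbers, SMT only improves energy per instruction when the throughput gain outruns the power increase, which is why miss-heavy workloads favor SMT and cache-resident ones favor more cores instead.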
Security Concerns With Shared Resources
Because SMT threads share physical resources inside a core, one thread can sometimes infer what the other is doing by observing timing differences in those shared resources. This has led to a series of demonstrated security attacks. PortSmash exploits contention on execution ports to extract encryption keys from a co-resident thread. TLBleed leaks a victim thread’s memory access patterns through the shared translation lookaside buffer. CacheBleed and MemJam target shared cache banks and the core’s memory disambiguation logic, respectively, to extract secret data.
These vulnerabilities are serious enough that Google disabled SMT by default in Chrome OS starting with version 74, specifically in response to a class of attacks called MDS (microarchitectural data sampling). Cloud providers, where untrusted code from different customers might run on the same physical core, have been especially cautious. Some disable SMT entirely on sensitive workloads to eliminate the shared-resource side channel.
Why Some Chips Are Dropping SMT
Intel’s Arrow Lake desktop processors, released in late 2024, removed Hyper-Threading entirely. This reflects a broader shift in how chip designers think about core count and efficiency.
The logic is straightforward. Modern desktop and laptop processors now pack enough cores that the original problem SMT solved, keeping execution units busy, can be addressed differently. When you have many cores and one stalls on a memory access, other cores continue working independently. They don’t even share cache with the stalled core, so they’re less likely to interfere with each other than two SMT threads on the same core would be. Meanwhile, modern out-of-order execution engines have grown so large that a single thread can usually keep the execution units busy on its own, because the processor looks far ahead in the instruction stream to find independent work.
Intel’s recent desktop chips also use a hybrid design with a mix of high-performance and high-efficiency cores. In this architecture, the chip area and power budget that SMT would consume on big cores is better spent adding more small, efficient cores. Three to five efficient cores can fit in the space of one large core and deliver far higher total throughput. With that tradeoff available, SMT on the big cores becomes redundant for multi-threaded performance while still carrying the costs of added complexity, power draw, and security surface area.
Server processors, which run heavily threaded workloads with frequent memory stalls, still widely use SMT. AMD’s latest server chips support two-way SMT, and IBM’s POWER processors support up to eight-way SMT. The calculus is different in a data center, where squeezing maximum throughput from every core directly affects operating costs and where workloads are tuned to benefit from the extra hardware threads.

