A quantized model is an AI model whose internal numbers have been converted from high-precision formats (like 32-bit or 16-bit floating point) to lower-precision formats (like 8-bit or 4-bit integers), making it smaller, faster, and cheaper to run. The core trade-off is straightforward: you lose a small amount of accuracy in exchange for dramatic reductions in memory, storage, and energy use. Quantization is the single most common technique people use to run large language models on consumer hardware instead of expensive server GPUs.
How Quantization Works
Every AI model stores its learned knowledge as millions (or billions) of numerical values called weights. In a full-precision model, each weight is a 32-bit floating-point number, meaning it uses 32 ones and zeros to represent a single value with high decimal precision. That’s a lot of data. A model with 7 billion weights at 32 bits per weight takes about 28 gigabytes just for the weights alone.
Quantization compresses those numbers by mapping them onto a smaller grid. Think of it like rounding: instead of storing a value as 0.7834921, you store it as 0.78, or even just 1. The process uses a scale factor and a zero-point to translate each floating-point value to the nearest integer on a fixed grid. For 8-bit quantization, that grid has 256 possible values. For 4-bit, it has just 16. Values that fall outside the grid’s range get clamped to the nearest boundary.
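That scale-and-zero-point mapping fits in a few lines. The sketch below is a minimal illustration in NumPy; the function names are ours, not from any particular library, and it assumes the input isn't constant (so the scale is nonzero):

```python
import numpy as np

def quantize(x, bits=8):
    """Map floats onto a fixed integer grid using a scale and zero-point.
    Assumes x is not constant (so the scale is nonzero)."""
    qmin, qmax = 0, 2**bits - 1                 # 256 levels at 8-bit, 16 at 4-bit
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = round(-x.min() / scale)        # integer offset for the grid
    q = np.round(x / scale + zero_point)
    return np.clip(q, qmin, qmax).astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer grid."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.array([-1.2, 0.0, 0.7834921, 2.5], dtype=np.float32)
q, scale, zp = quantize(weights)
recovered = dequantize(q, scale, zp)   # close to the originals, never exact
```

Dequantizing recovers values near the originals, and the gap between them is exactly the rounding error the model must tolerate.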
Dropping from 32-bit floats to 8-bit integers gives you a 4x reduction in data size. Going to 4 bits cuts it by 8x. That 28-gigabyte model shrinks to roughly 7 GB at 8-bit precision, or about 3.5 GB at 4-bit. That’s the difference between needing a high-end data center GPU and running the model on a laptop.
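The arithmetic behind those figures is simple: bytes = parameters × bits ÷ 8. A quick sanity check, ignoring the small overhead of scale factors and metadata that real quantized files carry:

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes),
    ignoring the small overhead of scales and metadata in real files."""
    return n_params * bits_per_weight / 8 / 1e9

print(model_size_gb(7e9, 32))  # 28.0 -- full precision
print(model_size_gb(7e9, 8))   # 7.0  -- 4x smaller
print(model_size_gb(7e9, 4))   # 3.5  -- 8x smaller
```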
Why Accuracy Doesn’t Collapse
Neural networks are surprisingly tolerant of imprecise numbers. The billions of weights in a large model contain a lot of redundancy, and most individual values don’t need to be exact for the model to produce good output. Small rounding errors across millions of weights tend to partially cancel each other out rather than compounding into large mistakes.
That said, not all weights are equally important. Some carry far more influence on the model’s output than others, and rounding those aggressively can hurt quality. Modern quantization methods address this by allocating more precision to important weights and less to unimportant ones. The result is that a well-quantized 4-bit model often performs remarkably close to its full-precision parent, with perplexity (a measure of language model quality) only slightly worse.
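One way to act on that insight, sketched here as a toy with made-up thresholds, is to keep a small fraction of large-magnitude "outlier" weights at full precision and quantize only the rest. Real outlier-aware methods are more sophisticated, but the effect is visible even in this simplified form:

```python
import numpy as np

def quantize_with_outliers(w, bits=4, outlier_frac=0.01):
    """Toy mixed-precision scheme: keep the largest-magnitude weights in
    float32 and round everything else onto a coarse symmetric grid."""
    k = max(1, int(len(w) * outlier_frac))
    outliers = np.argsort(np.abs(w))[-k:]       # indices kept at full precision
    qmax = 2**(bits - 1) - 1                    # 7 levels per sign at 4-bit
    scale = np.abs(np.delete(w, outliers)).max() / qmax
    wq = np.round(w / scale).clip(-qmax, qmax) * scale
    wq[outliers] = w[outliers]                  # restore the important weights
    return wq

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, 10_000)
w[:5] = [4.0, -3.5, 2.8, 3.1, -2.9]            # a few dominant weights

err = np.abs(quantize_with_outliers(w) - w).max()

# naive 4-bit quantization for comparison: one scale stretched by the outliers
naive_scale = np.abs(w).max() / 7
naive_err = np.abs(np.round(w / naive_scale).clip(-7, 7) * naive_scale - w).max()
```

Because the handful of outliers no longer stretch the grid, the remaining weights get a much finer scale, and the worst-case rounding error drops sharply compared with the naive version.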
Post-Training vs. Quantization-Aware Training
There are two main approaches to quantizing a model, and they differ in when the compression happens.
Post-training quantization (PTQ) takes a model that’s already been fully trained and converts it to lower precision after the fact. A small calibration dataset is run through the model to measure the typical range of values at each layer, and those ranges determine how the conversion is done. PTQ is fast, simple, and doesn’t require access to the original training data or compute. The downside is that the model never had a chance to adjust to the precision loss, so accuracy can suffer, especially at very low bit-widths like 4-bit.
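A PTQ calibration pass can be sketched roughly like this: feed a handful of batches through a layer, record the largest activation magnitude observed, and derive a scale from it. The layer and data below are stand-ins, not any real model:

```python
import numpy as np

def calibrate_scale(layer_fn, calibration_batches, bits=8):
    """Run calibration data through a layer and derive a symmetric scale
    covering the largest activation magnitude observed."""
    max_abs = 0.0
    for batch in calibration_batches:
        acts = layer_fn(batch)
        max_abs = max(max_abs, float(np.abs(acts).max()))
    return max_abs / (2**(bits - 1) - 1)

# stand-in for one layer of a trained model: a fixed random linear map
rng = np.random.default_rng(1)
W = rng.normal(0, 0.1, (64, 64))
layer = lambda x: x @ W

batches = [rng.normal(0, 1, (8, 64)) for _ in range(10)]
scale = calibrate_scale(layer, batches)   # frozen once, reused for every input
```

If the calibration data is unrepresentative, the frozen scale will clip or waste range on real inputs, which is one reason PTQ accuracy can suffer at low bit-widths.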
Quantization-aware training (QAT) builds the quantization process into training itself. During each training step, the model’s weights are temporarily quantized and then used to compute the output, so the model learns to compensate for rounding errors as it trains. QAT generally produces better results than PTQ because the model actively adapts to its lower-precision constraints. The trade-off is cost: it requires retraining (or fine-tuning) the model, which demands significant compute and the original training pipeline.
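The core mechanic can be shown in a toy example: quantize in the forward pass, then pretend the rounding didn't happen when computing gradients (the "straight-through estimator"). Everything here is illustrative, not a real QAT pipeline: one weight, a fixed quantization grid, plain gradient descent:

```python
import numpy as np

def fake_quant(w, step=0.25):
    """Quantize-dequantize onto a fixed grid, so the forward pass
    sees the same rounding error the deployed model will see."""
    return np.round(w / step) * step

rng = np.random.default_rng(0)
w = 0.1                                    # single trainable weight
for _ in range(300):
    x = rng.uniform(-1, 1, 32)
    y = 2.0 * x                            # target function: y = 2x
    wq = fake_quant(w)                     # forward pass uses the quantized weight
    grad = np.mean(2 * (wq * x - y) * x)   # straight-through: treat d(wq)/dw as 1
    w -= 0.1 * grad

# the quantized weight settles on the grid point 2.0, even though w itself
# may sit anywhere inside the interval that rounds to that point
```

The weight update is driven by the error of the quantized weight, so training steers w into a region whose rounded value fits the data, which is exactly the compensation QAT provides.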
For large language models, PTQ dominates in practice because retraining a billion-parameter model is expensive. QAT is more common in smaller models deployed on phones or embedded devices, where squeezing out every bit of accuracy at low precision matters and training costs are manageable.
Weight Quantization vs. Activation Quantization
A model has two kinds of numbers flowing through it: weights and activations. Weights are the fixed values the model learned during training. They don’t change at inference time, making them easy to quantize because their distribution is static and known in advance.
Activations are the intermediate values computed as data passes through each layer of the network. They change with every input, which makes them harder to compress. The ideal scaling factor for activations can vary significantly depending on what text or image the model is processing. Two common strategies handle this: static quantization uses a calibration dataset to fix the activation scaling factors once, while dynamic quantization recalculates them on the fly for each input. Dynamic quantization adapts better to varying inputs but adds a small computational overhead.
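The difference between the two strategies comes down to when the scale is computed. A dynamic scheme, sketched minimally below with our own function name, derives a fresh scale from each incoming tensor:

```python
import numpy as np

def dynamic_quantize(acts, bits=8):
    """Derive a symmetric scale from this specific tensor, then quantize.
    A static scheme would instead reuse a scale frozen at calibration time."""
    qmax = 2**(bits - 1) - 1
    max_abs = float(np.abs(acts).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0   # guard all-zero input
    q = np.clip(np.round(acts / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# each input gets its own scale, so small and large activations
# both use the full 8-bit grid instead of sharing one compromise scale
small = np.array([0.01, -0.02, 0.015])
large = np.array([10.0, -20.0, 15.0])
_, s_small = dynamic_quantize(small)
_, s_large = dynamic_quantize(large)
```

The per-tensor max computation is the small runtime overhead the paragraph above mentions; a static scheme skips it by trusting its calibration data.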
Most consumer-facing quantized models you’ll encounter (like GGUF files for local LLMs) quantize only the weights. Full weight-and-activation quantization is more common in production deployments where both memory and raw throughput matter.
Common Quantization Formats for LLMs
If you’ve looked into running AI models locally, you’ve probably seen format names like GGUF, GPTQ, AWQ, and EXL2. These are different implementations of quantization, each with distinct strengths.
- GGUF (used by llama.cpp) is the most popular format for running models on CPUs or mixed CPU/GPU setups. GGUF files are fast to create, taking just a few minutes, and achieve surprisingly good accuracy. They’re the go-to choice for most people running models on personal computers.
- GPTQ is a GPU-focused format that was one of the first widely adopted quantization methods for large language models. It performs well but has been somewhat overtaken by newer options.
- AWQ (Activation-aware Weight Quantization) produces smaller files than GPTQ with similar or better accuracy, though it tends to use more GPU memory at runtime.
- EXL2 (used by ExLlama v2) is the speed champion. It generates tokens 85% faster than llama.cpp and 147% faster than basic 4-bit loading. EXL2 also supports variable bit-widths, letting you choose precise quality levels like 4.0, 4.25, or 4.65 bits per weight. It requires a dedicated GPU.
In benchmarks comparing these formats on a 13-billion parameter model, EXL2 consistently offered the best balance of accuracy and file size. GGUF held its ground on accuracy despite being far simpler to produce. AWQ delivered strong accuracy per file size but consumed notably more video memory. The practical choice often comes down to your hardware: GGUF if you’re running on CPU or have limited GPU memory, EXL2 if you have a dedicated GPU and want maximum speed.
Energy and Performance Gains
Quantization doesn’t just save memory. Because the processor is working with smaller numbers, every operation requires less energy. Low-precision computation can reduce energy consumption by up to 50%, which matters enormously at data center scale, where AI inference is one of the fastest-growing sources of electricity demand. On mobile devices, it translates directly to less battery drain and less heat.
Speed improvements come from two sources. First, smaller numbers mean more of the model fits into fast cache memory, reducing the time spent waiting for data. Second, modern processors have dedicated hardware for integer arithmetic that can process 8-bit or 4-bit operations several times faster than equivalent floating-point math. The combined effect is that a quantized model can run 2 to 4 times faster than its full-precision version on the same hardware.
What You Lose
Quantization is not free. At 8-bit precision, most models lose so little quality that the difference is often indistinguishable in practice. At 4-bit, the degradation becomes measurable but usually remains acceptable for conversational AI, creative writing, and general-purpose tasks. Below 4 bits, quality drops more noticeably: the model may become less coherent, lose factual precision, or struggle with complex reasoning.
The impact also depends on model size. A 70-billion parameter model quantized to 4 bits typically retains more of its original quality than a 7-billion parameter model at the same precision, because the larger model has more redundancy to absorb the rounding errors. This is why quantization has been so transformative for making large models accessible: the models that are hardest to run at full precision are exactly the ones that tolerate compression best.