Knowledge distillation is a machine learning technique that transfers what a large, powerful AI model has learned into a smaller, faster one. Think of it as a seasoned expert teaching an apprentice not just the right answers, but how to think through problems. The large model is called the “teacher” and the smaller one is the “student.” The technique has become essential for running AI on phones, sensors, and other devices that lack the raw computing power of data center hardware.
How the Teacher-Student Setup Works
In conventional machine learning, you train a model by showing it examples with correct answers: this image is a cat, this email is spam. The model learns to match those hard labels. Knowledge distillation works differently. Instead of training the student model on raw examples, you train it to replicate the teacher model’s predictions, including the ones the teacher got wrong.
Those “wrong” predictions are actually where the richest information lives. When an image classification model looks at a photo of a fox, it might assign a 70% probability to “fox,” a 25% probability to “dog,” and a tiny fraction to “sandwich.” That distribution reveals something important: the model has learned that foxes and dogs share visual features, while foxes and sandwiches do not. These probability distributions, called “soft targets,” carry far more information per training example than a simple label that just says “fox.”
The student model trains on these soft targets, learning not just what the right answer is but how the teacher weighs alternatives. The result is a compact model that captures much of the teacher’s reasoning ability despite having far fewer parameters.
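The fox example can be made concrete. In the minimal sketch below, the class names and probabilities are illustrative, not taken from any real model:

```python
# Hard label: a one-hot vector that only says "fox".
# Soft target: the teacher's full probability distribution, which also
# encodes that a fox looks more like a dog than a sandwich.

classes = ["fox", "dog", "sandwich"]

hard_label = [1.0, 0.0, 0.0]       # just the answer
soft_target = [0.70, 0.25, 0.05]   # the teacher's belief over all classes

# The student trains to match soft_target rather than hard_label,
# so it inherits the teacher's learned similarity structure.
for name, p in zip(classes, soft_target):
    print(f"{name}: {p:.2f}")
```

Note how the soft target carries extra information per example: the hard label treats “dog” and “sandwich” as equally wrong, while the teacher’s distribution does not.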
The Role of Temperature
One practical problem with soft targets is that a well-trained teacher model is often very confident. It might assign 99.8% probability to the correct answer and spread the remaining 0.2% across hundreds of other classes. Those tiny probabilities are hard for a student to learn from because the differences between them are so small.
To fix this, distillation uses a setting called “temperature” that softens the teacher’s output. Raising the temperature spreads probability more evenly across classes, making the relationships between incorrect answers more visible. A higher temperature reveals, for instance, that the teacher considers “dog” a much more plausible alternative to “fox” than “car” is. During training, both teacher and student use the same elevated temperature so the student can absorb these subtle patterns. At inference time, the student switches back to a normal temperature to make sharp, confident predictions.
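The softening effect is easy to see in code. This sketch applies a standard temperature-scaled softmax to hypothetical teacher scores; the logit values are made up for illustration:

```python
import math

def softmax_with_temperature(logits, T):
    """Convert raw model scores (logits) into probabilities.
    T = 1 is the normal softmax; T > 1 flattens the distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes [fox, dog, car]
logits = [9.0, 4.0, 1.0]

sharp = softmax_with_temperature(logits, T=1)  # near-certain "fox"
soft = softmax_with_temperature(logits, T=4)   # alternatives become visible
```

At T=1 the teacher puts over 99% of its probability on “fox,” leaving almost nothing for the student to learn from the other classes. At T=4 the same logits yield roughly 70% “fox,” 20% “dog,” and 10% “car,” making it plain that “dog” is the far more plausible alternative.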
What the Student Actually Optimizes
The student model typically balances two goals during training. First, it tries to match the teacher’s softened probability distributions as closely as possible. This alignment is measured with a function called the Kullback–Leibler divergence, which quantifies how different two probability distributions are, penalizing the student more when its predictions diverge sharply from the teacher’s. Second, the student also learns from the original hard labels (the actual correct answers), ensuring it stays grounded in reality rather than just mimicking the teacher’s biases.
The final training objective blends these two signals. A weighting factor controls how much the student prioritizes matching the teacher versus getting the hard labels right. In practice, the teacher signal dominates because it carries so much more nuanced information per example.
Real-World Compression Results
The gains from distillation can be dramatic. One of the best-known examples is DistilBERT, a distilled version of the widely used BERT language model. DistilBERT has 66 million parameters compared to BERT’s 110 million, retains 97% of BERT’s performance on language understanding benchmarks, and cuts inference time from 0.30 seconds to 0.12 seconds per question on standard hardware.
More aggressive compression is possible when some accuracy loss is acceptable. Recent research has systematically evaluated compression ratios ranging from 2.2x all the way to 1,122x, mapping the tradeoff between size and accuracy at each step. In one case, a distilled model achieved a 4.1x compression factor while actually improving test accuracy by 1.1 percentage points over a non-distilled model of the same size, cutting inference time from 140 milliseconds to 13 milliseconds. Distillation doesn’t just shrink models; it can make small models smarter than they’d be on their own.
Why It Matters for Edge Devices
Large AI models like GPT-3, with its 175 billion parameters, require high-performance GPUs or TPUs to run. That hardware is standard in cloud data centers but absent from the places where AI is increasingly needed: smartphones, medical wearables, factory sensors, autonomous vehicles. These edge devices are designed to be small, energy-efficient, and cost-effective. They run on batteries, have limited memory, and can’t afford the latency of sending data to a remote server and waiting for a response.
Distillation bridges this gap. By compressing a powerful model into one that fits on constrained hardware, it makes sophisticated AI capabilities available in real time, on device, without a network connection. This matters for applications where speed and privacy are non-negotiable, like real-time medical monitoring or on-device speech recognition.
Distillation for Large Language Models
The most common approach to distilling large language models doesn’t require access to the teacher model’s internal workings at all. Instead, you use the large model to generate labeled data in bulk: you feed it thousands of prompts and collect its responses. That labeled dataset then becomes the training material for a smaller student model. The student learns to produce similar outputs without needing the teacher’s massive architecture.
The teacher can also provide richer signals than simple labels. Instead of just marking a response as good or bad, it can assign numerical scores that capture degrees of quality. This gives the student a more detailed training signal, helping it learn to handle ambiguous or borderline cases rather than just clear-cut ones. Google describes this as the teacher “funneling its knowledge” to the student through labeled data, producing a model that generates predictions much faster at a fraction of the computational and energy cost.
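The data-generation loop described above can be sketched as follows. The `teacher_generate` function is a hypothetical stand-in for querying a real large model’s API; here it is a trivial stub so the flow is runnable:

```python
# Sketch of black-box distillation for a language model: no access to the
# teacher's internals is needed, only its outputs.

def teacher_generate(prompt):
    # Hypothetical stand-in for a call to the large teacher model.
    return f"Answer to: {prompt}"

def build_distillation_dataset(prompts):
    """Collect (prompt, teacher response) pairs as student training data."""
    return [(p, teacher_generate(p)) for p in prompts]

prompts = ["Summarize this memo.", "Translate 'bonjour' to English."]
dataset = build_distillation_dataset(prompts)
# `dataset` now serves as supervised training material for the student.
```

In practice the teacher could also return a quality score alongside each response, giving the student the richer, graded signal described above.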
The Capacity Gap Problem
Distillation has an unintuitive limitation: a bigger teacher doesn’t always produce a better student. When the gap in complexity between teacher and student is too large, the student’s performance can actually degrade. Researchers call this the “curse of the capacity gap.” A student model often learns best from a moderately sized teacher rather than the largest available one.
This happens because the teacher’s reasoning patterns become too complex for the student to approximate. The soft targets from an enormous model may encode relationships that a small student simply lacks the capacity to represent, leading to worse generalization rather than better. The practical implication is that distillation works best when you choose a teacher that’s larger than your target student but not overwhelmingly so. Finding that sweet spot, the optimal capacity gap, requires experimentation for each specific task and model architecture.

