The vanishing gradient problem occurs when the error signals used to train a neural network shrink to near-zero as they travel backward through the network’s layers. This means early layers barely learn at all, while later layers closer to the output do most of the updating. The problem was first identified by Sepp Hochreiter in his 1991 diploma thesis at the Technical University of Munich, and it was the primary reason most early attempts to train deep neural networks failed.
How Gradients Vanish During Training
Neural networks learn through a process called backpropagation, which works backward from the output layer to the input layer, calculating how much each weight contributed to the network’s error. This calculation relies on the chain rule from calculus: to figure out how a weight in an early layer affected the final output, you multiply together a series of small derivatives, one for each layer the signal passed through.
The problem is that many of these derivatives are fractions less than 1.0. When you multiply many fractions together, the result shrinks exponentially. In a network with five layers, this might be manageable. In a network with fifty layers, the gradient reaching the first layer can be astronomically small, effectively zero for practical purposes. Hochreiter’s original analysis showed this mathematically: if each scaling factor in the chain is less than 1.0, the error flow “vanishes” exponentially as it travels back through time steps or layers.
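The exponential shrinkage is easy to see with a toy calculation. This is only an illustration, not a model of any particular network: it multiplies a single per-layer derivative (0.25, sigmoid's maximum) across different depths.

```python
# Toy illustration: multiplying many per-layer derivatives below 1.0
# shrinks the gradient exponentially with depth.
def backprop_gradient(per_layer_derivative, depth):
    grad = 1.0
    for _ in range(depth):
        grad *= per_layer_derivative
    return grad

print(backprop_gradient(0.25, 5))   # ~9.8e-4: small but usable
print(backprop_gradient(0.25, 50))  # ~7.9e-31: effectively zero
```

Five layers leaves a workable signal; fifty layers leaves nothing a floating-point optimizer can act on.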
Why Sigmoid and Tanh Make It Worse
The choice of activation function plays a central role. The sigmoid function, which squashes any input into a value between 0 and 1, has a derivative defined as its output multiplied by one minus its output. The maximum value this derivative can reach is 0.25, which happens when the input is exactly zero. For most inputs, the derivative is much smaller. This means every layer that uses sigmoid multiplies the gradient by at most 0.25, and typically far less.
Hochreiter’s analysis showed that with sigmoid activation functions, the scaling factor at each layer stays below 1.0 as long as the weights are less than 4.0 in absolute value. Since networks are typically initialized with small weights, this condition holds at the start of training, meaning the error flow tends to vanish especially during the critical early phase when the network is supposed to be learning its foundational representations. The tanh function, whose derivative peaks at 1.0 but drops toward zero for large or small inputs, has a similar issue. When a neuron’s output gets pushed into these flat regions (called saturation), the gradient through that neuron effectively dies.
The practical result: neurons get “stuck.” When a unit’s output is close to 0 or 1 under sigmoid, gradient descent changes it extremely slowly. Training stalls or converges painfully slowly, particularly for the layers closest to the input.
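The saturation behavior described above can be checked directly. This sketch evaluates the sigmoid derivative at its peak and in a saturated region, and does the same spot-check for tanh:

```python
import numpy as np

# The sigmoid derivative is sigma(x) * (1 - sigma(x)): it peaks at
# 0.25 when x = 0 and collapses toward zero in the saturated regions.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_derivative(0.0))   # 0.25, the maximum
print(sigmoid_derivative(5.0))   # ~0.0066: a saturated neuron
print(1.0 - np.tanh(5.0) ** 2)   # tanh derivative, also near zero when saturated
```

A gradient passing through a saturated sigmoid unit is multiplied by well under one percent, which is why a neuron "stuck" near 0 or 1 is nearly frozen.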
What It Looks Like During Training
You can spot vanishing gradients without understanding the math. The most obvious sign is that the training loss decreases very slowly or plateaus early, well before the network has learned anything useful. If you plot the gradient distributions across layers, networks suffering from this problem show gradient values clustered tightly around zero in the early layers, while later layers still have healthy gradient magnitudes. The weights in those early layers barely change from their initialized values across thousands of training steps.
If you compare a deep network using sigmoid activations against the same architecture using modern alternatives, the sigmoid version’s loss decreases noticeably more slowly. The network isn’t broken in an obvious way. It just sits there, learning at a glacial pace or not at all, which made the problem especially frustrating for researchers in the 1990s and 2000s who couldn’t always tell whether their architecture was flawed or just undertrained.
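The per-layer diagnostic described above can be reproduced in a few lines. This is a hypothetical setup (a 10-layer sigmoid MLP with random weights, no training involved) that runs one forward and backward pass by hand and reports the mean absolute gradient at each layer:

```python
import numpy as np

# Diagnostic sketch: backpropagate through a toy 10-layer sigmoid MLP
# and report the mean absolute gradient per layer. Gradients in the
# layers closest to the input cluster near zero.
rng = np.random.default_rng(0)
depth, width = 10, 32
weights = [rng.normal(0.0, 0.5, (width, width)) for _ in range(depth)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass, keeping each layer's output for the backward pass.
x = rng.normal(size=width)
activations = [x]
for W in weights:
    x = sigmoid(W @ x)
    activations.append(x)

# Backward pass with an arbitrary unit error signal at the output.
grad = np.ones(width)
layer_grad_norms = []
for W, a in zip(reversed(weights), reversed(activations[1:])):
    grad = W.T @ (grad * a * (1.0 - a))  # chain rule through sigmoid
    layer_grad_norms.append(np.abs(grad).mean())

layer_grad_norms.reverse()  # index 0 = layer closest to the input
for i, g in enumerate(layer_grad_norms):
    print(f"layer {i}: mean |grad| = {g:.2e}")
```

The printout shows exactly the signature described above: healthy magnitudes near the output, values orders of magnitude smaller near the input.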
Recurrent Networks and Long Sequences
The problem is especially severe in recurrent neural networks (RNNs), which process sequences like text or time-series data by passing information from one time step to the next. An RNN processing a 200-word sentence needs to backpropagate gradients through 200 effective layers. The gradient shrinks at every step, so by the time the error signal reaches the beginning of the sentence, it carries almost no useful information. The network can learn short-range patterns (the last few words) but fails to capture dependencies spanning dozens or hundreds of steps.
This limitation motivated the development of Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). LSTMs use memory cells with gating mechanisms that can carry information forward through many time steps without it being repeatedly multiplied by small derivatives. The key insight is a “constant error carousel,” a pathway where gradients can flow backward through time without being squeezed at every step. GRUs achieve similar benefits with a simpler architecture, using fewer gates. Both designs made it practical to learn long-range dependencies in sequential data for the first time.
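The contrast between the two recurrence styles can be sketched with scalars. This is a deliberately simplified illustration, not an LSTM implementation: it compares a multiplicative path (plain-RNN-like, squeezed by a factor of 0.5 per step) against an additive identity path, the core idea behind the constant error carousel.

```python
# Gradient of the final state w.r.t. the initial state over 200 steps,
# under a multiplicative recurrence versus an additive "carousel" path.
steps = 200
multiplicative_grad = 1.0
additive_grad = 1.0
for _ in range(steps):
    multiplicative_grad *= 0.5  # squeezed at every time step
    additive_grad *= 1.0        # identity path: carried unchanged

print(multiplicative_grad)  # ~6.2e-61: long-range signal is gone
print(additive_grad)        # 1.0: intact after 200 steps
```

The real LSTM gates decide what enters and leaves the carousel, but the gradient-preserving identity path is what makes 200-step dependencies learnable at all.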
ReLU: A Simpler Activation Function
One of the most impactful solutions turned out to be surprisingly simple: change the activation function. The Rectified Linear Unit (ReLU) outputs zero for negative inputs and passes positive inputs through unchanged. Its derivative is either 0 or 1. For any active neuron (one receiving positive input), the gradient passes through completely undiminished, no matter how deep the network.
This property largely eliminates the multiplicative shrinking that causes vanishing gradients. ReLU has its own issue: neurons that always receive negative input have a permanently zero gradient (sometimes called “dying ReLU”), but this is far less crippling than the systematic gradient decay sigmoid causes across all layers. Variants like Leaky ReLU, which allow a small nonzero gradient for negative inputs, address even this edge case.
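The 0-or-1 derivative is trivial to verify. A small sketch, using 0.01 as the Leaky ReLU slope (a common but arbitrary choice, not mandated by the text):

```python
import numpy as np

# ReLU's derivative is exactly 1 for positive inputs, so active units
# pass the gradient through undiminished; Leaky ReLU keeps a small
# nonzero slope for negative inputs so no unit can die completely.
def relu_grad(x):
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.3, 4.0])
print(relu_grad(x))        # [0. 0. 1. 1.]
print(leaky_relu_grad(x))  # [0.01 0.01 1.   1.  ]
```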
Skip Connections and Residual Networks
For very deep networks (dozens or hundreds of layers), even ReLU isn’t enough on its own. Residual networks, introduced in 2015, add skip connections that let the input to a block of layers bypass those layers entirely and be added directly to the output. Instead of each block learning a complete transformation, it only needs to learn the difference (the “residual”) between its input and the desired output.
The effect on gradient flow is dramatic. During backpropagation, the gradient has a direct path through the skip connection that doesn’t pass through any activation functions or weight matrices. This identity mapping preserves gradient magnitude across the entire network depth, ensuring that even the earliest layers receive meaningful error signals. This architecture made it possible to train networks with over 100 layers effectively, something that was impossible before.
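The identity path's effect can be checked numerically. In a residual block y = x + F(x), the chain rule gives dy/dx = 1 + F'(x), so even when F's own gradient is tiny the overall gradient stays near 1. This scalar sketch uses a hypothetical F chosen to have a near-zero gradient:

```python
import numpy as np

# A block whose own gradient is deliberately tiny (hypothetical F):
def F(x):
    return 0.001 * np.tanh(x)

def residual_forward(x):
    return x + F(x)  # the skip connection adds the input back

# Central-difference estimate of the block's gradient w.r.t. its input:
x, eps = 0.7, 1e-6
grad = (residual_forward(x + eps) - residual_forward(x - eps)) / (2 * eps)
print(grad)  # ~1.0006: dominated by the identity skip path
```

Stacking a hundred such blocks multiplies the gradient by values near 1.0 rather than values near 0.0, which is why depth stops being fatal.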
Weight Initialization Strategies
How you set the initial weights before training begins also matters enormously. If weights start too small, activations and gradients shrink as they propagate. Too large, and they explode. Two initialization methods were developed to hit the sweet spot.
Xavier initialization (also called Glorot initialization) scales initial weights based on the number of inputs and outputs of each layer. The variance of the weights is set to 2 divided by the sum of those two numbers. This keeps the variance of activations roughly constant across layers when using sigmoid or tanh activations, preventing the signal from fading or blowing up as it passes through the network.
He initialization, proposed in 2015, adapts this idea for ReLU. Because ReLU zeroes out roughly half of its inputs, it effectively cuts the variance in half at each layer. He initialization compensates by doubling the variance: it sets the weight variance to 2 divided by just the number of inputs. Using Xavier initialization with ReLU, or He initialization with sigmoid, leads to suboptimal gradient flow. Matching the initialization to the activation function is one of those small details that makes a large practical difference.
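The two rules differ only in the variance formula. A minimal sketch of both, sampling one He-initialized weight matrix to confirm its standard deviation matches the target:

```python
import numpy as np

# Xavier/Glorot: variance = 2 / (fan_in + fan_out), for sigmoid/tanh.
def xavier_std(fan_in, fan_out):
    return np.sqrt(2.0 / (fan_in + fan_out))

# He: variance = 2 / fan_in, doubled to compensate for ReLU zeroing
# out roughly half the activations.
def he_std(fan_in):
    return np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)
W = rng.normal(0.0, he_std(512), (512, 512))
print(W.std())  # close to sqrt(2/512) = 0.0625
```

For a 512-unit layer, Xavier gives a standard deviation of about 0.044 and He about 0.0625: the factor-of-√2 difference is exactly the ReLU compensation.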
Batch Normalization
Batch normalization adds a step between layers that normalizes each layer’s inputs to have zero mean and unit variance across each training batch. This keeps activations centered in the region where activation functions have their steepest gradients, preventing neurons from drifting into the saturated, flat regions where gradients vanish. As a side benefit, it stabilizes training enough to allow higher learning rates, which speeds up convergence. Combined with ReLU and proper initialization, batch normalization is now a standard component in most deep network architectures.
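The normalization step itself is a few lines. This sketch omits the learnable scale (gamma) and shift (beta) parameters of the full technique for brevity; eps guards against division by zero, as in the standard formulation:

```python
import numpy as np

# Normalize each feature to zero mean and unit variance across a batch.
def batch_norm(x, eps=1e-5):
    mean = x.mean(axis=0)  # per-feature mean over the batch
    var = x.var(axis=0)    # per-feature variance over the batch
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(5.0, 3.0, (64, 8))  # activations drifted off-center
normed = batch_norm(batch)
print(normed.mean(axis=0).round(6))  # ~0 for every feature
print(normed.std(axis=0).round(3))   # ~1 for every feature
```

Inputs that drifted to a mean of 5 are pulled back to the region around zero, which for sigmoid and tanh is precisely where the derivatives are largest.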