ReLU, short for Rectified Linear Unit, is the most widely used activation function in neural networks. It takes any input value and returns either that value (if positive) or zero (if negative). That simple rule, applied at each neuron in a network, is what allows deep learning models to learn complex patterns from data.
How ReLU Works
An activation function decides whether a neuron “fires” and by how much. ReLU’s rule is straightforward: if the input is positive, pass it through unchanged. If the input is zero or negative, output zero. Mathematically, that’s f(x) = max(0, x).
Picture a graph with a sharp corner at the origin. Everything to the left of zero is flat along the x-axis. Everything to the right is a straight line rising at a 45-degree angle. That’s the entire function. There are no curves, no squishing of values into a narrow range. A positive input of 0.5 outputs 0.5. A positive input of 500 outputs 500. A negative input of any size outputs 0.
The derivative (the rate of change that neural networks use during training) is equally simple: it’s 1 for any positive input and 0 for any negative input. Technically, the derivative isn’t defined right at zero, but in practice neural network frameworks just treat it as 0 or 1 and move on.
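In code, both the function and its derivative are one-liners. Here is a minimal NumPy sketch (the function names are illustrative, and it follows the common convention of treating the derivative at exactly zero as 0):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # derivative: 1 for positive inputs, 0 otherwise
    # (at exactly x == 0 this returns 0, a common convention)
    return (x > 0).astype(float)

x = np.array([-500.0, -0.5, 0.0, 0.5, 500.0])
print(relu(x))       # negatives become 0; positives pass through unchanged
print(relu_grad(x))  # 0 on the left of zero, 1 on the right
```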
Why ReLU Replaced Sigmoid and Tanh
Before ReLU became standard, most neural networks used sigmoid or tanh activation functions. These older functions squash their outputs into a narrow range (0 to 1 for sigmoid, -1 to 1 for tanh). That squashing causes a serious training problem: when a neuron’s output gets close to either end of that range, the derivative shrinks toward zero. During training, the network adjusts its weights using these derivatives. Near-zero derivatives mean near-zero adjustments, so the neuron gets “stuck” and stops learning.
This is called the vanishing gradient problem, and it gets worse with every layer you add. Most early attempts to train deep neural networks failed because of it. ReLU solved the problem by keeping a constant derivative of 1 for all positive values. No matter how large the input, the gradient flows through unchanged, so neurons keep learning at a steady pace. UC Berkeley lecture materials on neural networks summarize it simply: ReLUs are preferred over sigmoids because they’re much less likely to get stuck.
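The arithmetic behind that failure mode is easy to check. A sigmoid's derivative, s(x)(1 − s(x)), never exceeds 0.25, so backpropagating through many layers multiplies many such factors together and the product shrinks geometrically. ReLU's positive-side derivative of exactly 1 leaves the same product untouched. A small illustrative sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The sigmoid derivative peaks at 0.25 (at x = 0). Backpropagating
# through n layers multiplies n such factors, so even the best case
# shrinks geometrically with depth.
layers = 20
best_case = 0.25 ** layers
print(f"upper bound on a {layers}-layer sigmoid gradient: {best_case:.2e}")

# ReLU's derivative on the positive side is exactly 1, so the same
# product stays 1 no matter how many layers the gradient crosses.
print(f"{layers}-layer ReLU gradient along a positive path: {1.0 ** layers}")
```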
ReLU is also cheaper to compute. Sigmoid and tanh require exponential calculations, while ReLU just compares a number to zero. That difference adds up fast when you’re training a network with millions of neurons.
Sparsity: Why Zeroing Out Neurons Helps
Because ReLU outputs zero for every negative input, a large portion of the neurons in any given layer are “off” for any particular input. Research published in PMC found that only about 15 to 30 percent of neurons in a ReLU network actually activate for a given input. This creates what’s called a sparse representation: the network encodes information using only a small fraction of its neurons rather than spreading it across all of them.
Sparsity has real benefits. A sparse code is more resistant to noise, because small, meaningless fluctuations in the input are likely to fall among the neurons that are already zeroed out. By contrast, in a dense representation where every neuron is partially active, even minor input changes ripple across the entire network, making it harder to extract a clean signal. This helps explain why ReLU networks often generalize better to new data they haven’t seen during training.
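You can see the mechanism in a toy example. With zero-mean random pre-activations, roughly half the neurons fire; in trained networks, learned negative biases push more pre-activations below zero, which is where the lower activation fractions come from. The fixed bias shift below is arbitrary, chosen only to illustrate the effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy layer: 1000 neurons with zero-mean random pre-activations.
pre = rng.standard_normal(1000)
print("active fraction, zero-mean inputs:", np.mean(np.maximum(0.0, pre) > 0))

# A negative bias pushes pre-activations below zero, producing the
# sparser codes observed in trained ReLU networks.
biased = pre - 0.8
print("active fraction, with negative bias:", np.mean(np.maximum(0.0, biased) > 0))
```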
The Dying ReLU Problem
ReLU’s simplicity comes with a tradeoff. If a neuron’s weights shift during training so that its input is always negative, its output is permanently stuck at zero. And because ReLU’s derivative is zero for negative inputs, the gradient flowing back through that neuron is also zero, so its weights never update and the neuron never recovers. It’s effectively dead. In a large network, a significant number of neurons can die this way, reducing the model’s capacity to learn.
The flip side of the vanishing gradient problem also applies. Because ReLU has no upper bound on its output, very large activations can produce very large gradient updates, causing weights to swing wildly. This is known as the exploding gradient problem. Proper weight initialization helps prevent both issues. A method called He initialization, designed specifically for ReLU, sets starting weights at a scale that accounts for the fact that half of each neuron’s inputs will be zeroed out.
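He initialization itself is a one-line formula: draw each weight from a zero-mean Gaussian with variance 2 / fan_in, where fan_in is the number of inputs to the layer. The factor of 2 is the correction for ReLU zeroing roughly half the activations. A minimal sketch (layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

def he_init(fan_in, fan_out):
    # He initialization: weights ~ N(0, 2 / fan_in). The factor of 2
    # compensates for ReLU zeroing roughly half of each layer's outputs.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_init(512, 256)
print("sample std:", W.std())  # close to sqrt(2/512) ~ 0.0625
```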
Common ReLU Variants
Several variations have been developed to address the dying neuron problem while keeping ReLU’s core advantages.
- Leaky ReLU allows a small, non-zero output for negative inputs instead of hard zero. For negative values, it multiplies the input by 0.01, so a neuron receiving -10 outputs -0.1 instead of 0. This keeps a tiny gradient flowing, preventing neurons from dying completely.
- Parametric ReLU (PReLU) works like Leaky ReLU, but lets the network learn the best slope for negative values during training rather than fixing it at 0.01. The network treats that slope as another adjustable parameter.
- ELU (Exponential Linear Unit) uses a smooth exponential curve for negative inputs instead of a straight line. This makes the function fully smooth at zero, which can improve training stability. Outputs for negative values curve down toward a minimum rather than continuing to decrease linearly.
- SELU (Scaled Exponential Linear Unit) is a scaled version of ELU with specific fixed parameters that cause neuron outputs to naturally settle around a mean of zero and a variance of one. This self-normalizing behavior reduces the need for additional normalization layers in the network.
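Each of these variants is only a line or two of code. A sketch (PReLU is just Leaky ReLU with the slope exposed as a trainable parameter, so it is omitted; SELU’s constants are the fixed values from the original paper):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # small fixed slope for negative inputs keeps a gradient flowing
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # smooth exponential curve below zero, approaching -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x):
    # fixed self-normalizing constants from the SELU paper
    alpha, scale = 1.6732632423543772, 1.0507009873554805
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-10.0, -1.0, 0.0, 1.0])
print(leaky_relu(x))  # -10 becomes -0.1 instead of 0
print(elu(x))         # negatives curve toward -1.0
print(selu(x))        # scaled so activations settle near mean 0, variance 1
```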
ReLU vs. GELU in Modern Models
In transformer architectures (the type behind large language models and many modern AI systems), a newer function called GELU has become the default choice. Unlike ReLU’s sharp cutoff at zero, GELU uses a smooth curve that gradually transitions between blocking and passing inputs. This means values close to zero aren’t abruptly killed but are scaled down proportionally.
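The difference is visible in a few lines. Below is the widely used tanh approximation of GELU (the exact form uses the Gaussian CDF) next to ReLU; note how a small negative input is scaled down rather than zeroed:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, common in transformer codebases
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x**3)))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # hard cutoff: every negative becomes exactly 0
print(gelu(x))  # smooth: -0.5 maps to about -0.15, not 0
```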
Benchmarks comparing activation functions in transformer models found that GELU reached 96.53% validation accuracy, while standard ReLU achieved 94.84%. Swish, another smooth alternative, landed at 95.83%. The smoother functions also produced lower loss values, suggesting they’re better at capturing the non-linear relationships transformers need to model.
That said, ReLU remains a strong choice when computational resources are limited. It trains faster per step and uses less memory, making it practical for smaller projects, real-time applications, or environments without powerful GPUs. For most standard deep learning tasks outside of transformers, ReLU is still the go-to starting point. If you’re building a convolutional network for image classification or a basic feedforward network, ReLU will almost certainly be the first activation function you try, and often the only one you need.

