What Is Gradient Descent and How Does It Work?

Gradient descent is the core optimization algorithm behind most of modern machine learning. It’s how a model learns: by repeatedly adjusting its internal settings to reduce its errors, one small step at a time. Nearly every AI system you interact with, from voice assistants to image generators, was trained using some form of this algorithm.

The Core Idea

Imagine you’re standing on a hilly landscape in dense fog. You can’t see the lowest valley, but you can feel the slope of the ground beneath your feet. Your strategy is simple: take a step in whichever direction goes downhill most steeply. Repeat until the ground flattens out and you can’t go any lower. That’s gradient descent.

In mathematical terms, the “landscape” is a function that measures how wrong a model’s predictions are, often called a cost function or loss function. The “ground beneath your feet” is the slope of that function, calculated using derivatives. The algorithm computes which direction reduces the error most quickly, then nudges the model’s parameters in that direction. Each step shrinks the error a little. Over hundreds or thousands of steps, the model converges on a set of parameters that produce good predictions.

As the algorithm approaches a low point, the slope naturally flattens out. The steps get smaller and smaller until the slope is essentially zero, meaning the model has settled into a minimum.

How the Update Step Works

At each step, gradient descent follows a straightforward rule: take the current value of a parameter, calculate the slope of the error with respect to that parameter, and subtract a small fraction of that slope. The “small fraction” is controlled by a setting called the learning rate.

In plain terms: if the slope says that increasing a parameter makes the error worse, the algorithm decreases that parameter; if increasing it makes the error smaller, the algorithm increases it. The learning rate determines how big each adjustment is.
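The update rule can be sketched in a few lines of Python on a toy one-parameter problem. Everything here (the function, the starting point, the learning rate) is an illustrative choice, not part of any library:

```python
# Minimize the toy error function f(w) = (w - 3)**2 with plain
# gradient descent. Its slope with respect to w is 2 * (w - 3).

def grad(w):
    return 2 * (w - 3)

w = 0.0              # starting guess for the parameter
learning_rate = 0.1  # the "small fraction" of the slope

for step in range(100):
    w = w - learning_rate * grad(w)  # the core update rule

print(round(w, 4))  # converges to the minimum at w = 3
```

Each iteration subtracts a fraction of the slope, so the distance to the minimum shrinks by a constant factor per step and the parameter settles at the bottom of the bowl.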

Why the Learning Rate Matters

The learning rate is the single most important setting you choose when running gradient descent. Set it too low and the algorithm inches toward a solution so slowly it may need an impractical number of steps to get there. Set it too high and something worse happens: the adjustments overshoot the minimum, bounce back and forth, and the error actually increases instead of decreasing. The model never settles down.

Finding the right learning rate is part science, part experimentation. A rate that works well on one problem may fail on another. This is why many modern systems use adaptive learning rates that automatically shrink or grow during training based on how the optimization is progressing.
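A small experiment makes the tradeoff concrete. The sketch below (a toy function and made-up learning rates, chosen purely for illustration) minimizes f(w) = w², whose slope is 2w:

```python
# Run gradient descent on f(w) = w**2 with different learning rates
# and report how far from the minimum (w = 0) we end up.

def run(learning_rate, steps=50, w=1.0):
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return abs(w)  # distance from the minimum

print(run(0.001))  # too low: barely moves from the starting point
print(run(0.4))    # reasonable: essentially reaches the minimum
print(run(1.1))    # too high: every step overshoots, error keeps growing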

Convex vs. Non-Convex Problems

Whether gradient descent finds the best possible solution depends on the shape of the error landscape. For convex problems (think of a smooth, bowl-shaped surface with a single lowest point), gradient descent is guaranteed to reach the global minimum no matter where it starts. Every path downhill leads to the same bottom.

Most real-world machine learning problems, especially deep learning, aren’t convex. Their error landscapes are rugged, with many valleys, hills, and plateaus. In these non-convex settings, gradient descent can get stuck in a local minimum, a valley that isn’t the deepest one overall. Starting from different initial positions can lead to completely different solutions. This isn’t always a disaster, though. In practice, many local minima in large neural networks produce models that work nearly as well as the theoretical best.

Saddle Points and Plateaus

Local minima aren’t the only trap. In high-dimensional problems (models with millions of parameters), saddle points are actually more common. A saddle point is a spot where the slope is zero, just like a minimum, but it’s only a low point in some directions while being a high point in others. Picture the middle of a horse saddle: it curves down side to side but curves up front to back.

The trouble is that standard gradient descent can stall at these points because the slope is flat, so the algorithm thinks it has arrived somewhere useful. Research on high-dimensional optimization has shown that the number of saddle points grows exponentially with the number of dimensions. For modern neural networks with millions or billions of parameters, saddle points vastly outnumber true minima. This is one of the key reasons more sophisticated optimization techniques have been developed.
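The stalling behavior can be seen on the classic saddle surface f(x, y) = x² − y², which curves up along x and down along y. This toy sketch (all names and starting values are illustrative) runs the descent first exactly on the ridge, then slightly off it:

```python
# Gradient descent on f(x, y) = x**2 - y**2, which has a saddle
# point at the origin: a minimum along x but a maximum along y.

def descend(x, y, lr=0.1, steps=200):
    for _ in range(steps):
        gx, gy = 2 * x, -2 * y           # gradient of x**2 - y**2
        x, y = x - lr * gx, y - lr * gy  # standard update
    return x, y

print(descend(1.0, 0.0))   # starts on the ridge: stalls at the saddle (0, 0)
print(descend(1.0, 1e-6))  # a tiny nudge off the ridge: y grows and escapes
```

Started exactly on the ridge, the y-slope is zero at every step, so the algorithm slides into the saddle and stops. The tiniest perturbation is enough to escape, which is part of why the noisy updates of stochastic methods (covered below) help in practice.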

Three Flavors of Gradient Descent

The original version, called batch gradient descent, computes the slope using the entire training dataset before making a single update. This gives a precise direction to step in, but it’s slow and wasteful, especially when much of the data is redundant.

At the other extreme, stochastic gradient descent (SGD) computes the slope from just one randomly chosen data point at a time. This is much faster per step and can escape shallow local minima because the noisy updates keep the algorithm bouncing around. The tradeoff is that individual steps are imprecise, and processing examples one by one doesn’t take advantage of the parallel processing power of modern GPUs.

Mini-batch gradient descent splits the difference. It computes the slope from a small batch of data, typically 32, 64, or 128 examples at a time. This balances accuracy with speed. Benchmarks from the “Dive into Deep Learning” textbook show that a mini-batch of just 10 examples already outperforms pure SGD in efficiency, while a batch of 100 can beat full batch gradient descent in total runtime. Mini-batch is the default approach in virtually all modern deep learning.
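A minimal mini-batch loop might look like the following sketch, which fits a one-parameter line to synthetic data. The batch size, learning rate, and dataset are illustrative assumptions:

```python
# Mini-batch gradient descent for fitting y = w * x by least squares.
import random

random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 101)]  # true slope is 2.0

w, lr, batch_size = 0.0, 1e-4, 32
for epoch in range(20):
    random.shuffle(data)  # the "stochastic" part: a new order each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # average gradient of the squared error (w*x - y)**2 over the batch
        g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * g

print(round(w, 3))  # settles close to the true slope of 2.0
```

Averaging the gradient over 32 examples smooths out most of the noise of single-example SGD while still making roughly three updates per pass through the data for every one that full batch gradient descent would make.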

Gradient Descent in Neural Networks

When training a neural network, gradient descent works hand in hand with a process called backpropagation. Here’s the cycle: the network makes a prediction, the cost function measures how far off the prediction was, then backpropagation traces backward through the network’s layers to calculate how much each parameter (each weight and bias) contributed to the error. Those calculations produce the gradient, the set of slopes for every parameter in the model. Gradient descent then uses that gradient to update all the parameters simultaneously. One pass through this cycle is one training step, and training a modern model can involve millions of these steps.
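The cycle can be illustrated on the tiniest possible "network": a single linear unit with prediction w·x + b and a squared-error cost. The data, learning rate, and names here are illustrative assumptions; real backpropagation applies the same chain rule across many layers:

```python
# One forward pass, one backward pass, one update, repeated.
w, b, lr = 0.0, 0.0, 0.05
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated by y = 2x + 1

for step in range(500):
    for x, y in data:
        y_hat = w * x + b                # forward pass: make a prediction
        error = y_hat - y                # cost is error**2
        dw = 2 * error * x               # backward pass: chain rule gives
        db = 2 * error                   #   the slope for each parameter
        w, b = w - lr * dw, b - lr * db  # update both parameters together

print(round(w, 2), round(b, 2))  # recovers w = 2 and b = 1
```

The two lines computing dw and db are a hand-worked miniature of what backpropagation automates: differentiating the cost with respect to every parameter so that gradient descent can update them all at once.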

Modern Optimizers Built on Gradient Descent

Plain gradient descent has known weaknesses: it can be slow on flat surfaces, it treats every parameter the same regardless of how frequently it’s updated, and it has no memory of previous steps. Modern optimizers address these issues while keeping gradient descent as their foundation.

The most widely used optimizer today is Adam, short for Adaptive Moment Estimation. It improves on basic gradient descent in two ways. First, it maintains a running average of recent gradients, giving the algorithm momentum. Like a ball rolling downhill, this momentum helps it push through flat regions and saddle points instead of stalling. Second, it adapts the step size for each individual parameter by also tracking a running average of squared gradients: parameters that have been receiving large, consistent gradients get smaller effective steps to prevent overshooting, while parameters with small or infrequent gradients get larger ones to speed them up. Default settings that work well across a wide range of problems (a learning rate of 0.001, with decay rates of 0.9 and 0.999 for the two running averages) have made Adam the go-to choice for most practitioners.
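The two ideas combine into a short update rule. The sketch below follows the standard published Adam algorithm for a single parameter, using the default settings mentioned above; the toy objective and learning rate are illustrative choices:

```python
# Adam for one parameter, minimizing the toy error f(w) = (w - 5)**2.
import math

def grad(w):
    return 2 * (w - 5)  # slope of (w - 5)**2

w, lr = 0.0, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = v = 0.0  # running averages of the gradient and the squared gradient

for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g      # momentum: average recent gradients
    v = beta2 * v + (1 - beta2) * g * g  # average recent squared gradients
    m_hat = m / (1 - beta1 ** t)         # bias correction for the early
    v_hat = v / (1 - beta2 ** t)         #   steps, when m and v start at 0
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive per-parameter step

print(round(w, 2))  # lands near the minimum at w = 5
```

Dividing by the square root of the average squared gradient is what scales the step per parameter: a parameter with consistently large gradients sees a large v_hat and therefore a smaller effective step.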

Why It Dominates Machine Learning

Gradient descent isn’t the only optimization algorithm that exists. Methods like evolutionary algorithms or random search don’t require computing slopes at all. But gradient descent has a decisive advantage: it uses information about the direction of improvement, not just whether things got better or worse. This makes it vastly more efficient in high-dimensional spaces. A modern language model might have billions of parameters. Trying to optimize those by random trial and error would take longer than the age of the universe. Gradient descent, by following the slope, navigates that enormous space in a directed way, which is why it remains the engine behind virtually every machine learning system in production today.