What Is Learning Rate in Gradient Descent?

The learning rate is a number that controls how much a model adjusts its internal settings during each step of training. In gradient descent, the algorithm figures out which direction to move to reduce errors, and the learning rate determines how big each step in that direction actually is. It’s typically a small positive number, often somewhere between 0.00001 and 0.1, and getting it right is one of the most consequential decisions in training a machine learning model.

How Gradient Descent Uses the Learning Rate

Gradient descent works by repeatedly tweaking a model’s parameters to minimize a measure of error called the loss. At each step, the algorithm calculates the gradient, which tells it the direction of steepest increase in error. It then moves the parameters in the opposite direction to reduce that error. The learning rate is the multiplier that scales how far to move.

Think of it like descending a foggy mountain. The gradient tells you which way is downhill. The learning rate decides whether you take a cautious half-step or a bold leap. A small learning rate means tiny, careful steps. A large one means big strides. The update at each step is simply: new parameter = old parameter minus (learning rate × gradient). That multiplication is where the learning rate does its work, shrinking or stretching the raw gradient signal into an actual step size.
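That update rule can be sketched in a few lines of Python. The quadratic loss, starting point, and step count below are illustrative choices for the sketch, not part of any particular library:

```python
def gradient_descent(grad, x, lr, steps):
    """Repeat: new parameter = old parameter - (learning rate * gradient)."""
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = gradient_descent(lambda x: 2 * (x - 3), x=0.0, lr=0.1, steps=100)
print(x_final)  # ends up very close to the minimum at 3
```

Swapping in a different `lr` here is all it takes to see the behaviors described in the next two sections.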

What Happens When the Learning Rate Is Too High

A learning rate that’s too large causes the model to overshoot the best solution. Instead of settling into the lowest point of the error landscape, the parameters leap past it, land on the other side, and bounce back even further. The loss oscillates wildly, and in the worst case the model diverges entirely, with errors spiraling toward infinity. Even on a simple one-dimensional problem, a learning rate of just 1.1 can make each step overshoot the minimum at zero and land farther away than the last, so the error compounds instead of shrinking.

This happens because gradient descent relies on a local approximation of the error surface. It assumes the landscape looks roughly like a smooth slope in the immediate neighborhood. When the step size is too big, you land somewhere far from where that approximation holds, and the update no longer guarantees any improvement at all.
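The overshoot is easy to reproduce. The sketch below is my own example, minimizing f(x) = x² (gradient 2x) from x = 1, contrasting a stable rate with the 1.1 mentioned above:

```python
def descend(lr, x=1.0, steps=20):
    """Gradient descent on f(x) = x^2; each update is x - lr * (2 * x)."""
    for _ in range(steps):
        x = x - lr * (2 * x)
    return x

print(abs(descend(lr=0.1)))  # shrinks steadily toward the minimum at 0
print(abs(descend(lr=1.1)))  # each step multiplies x by -1.2, so it diverges
```

With lr = 1.1 the update multiplies x by (1 − 2.2) = −1.2, so every step lands on the opposite side of zero, slightly farther out than before.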

What Happens When It’s Too Small

A very small learning rate avoids the overshooting problem but introduces its own cost. Each update barely nudges the parameters, so the model needs far more iterations to reach a good solution. Training that could take hours stretches into days. The computational expense adds up quickly, especially with large models and big datasets.

There’s also a subtler risk. The error landscape of a neural network isn’t a simple bowl with one clear bottom. It has flat regions, saddle points, and shallow local dips. A tiny learning rate can leave the model stuck in one of these suboptimal spots, inching along a flat plateau without taking steps large enough to escape and find a better solution.
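The iteration cost can be quantified on a toy problem. This sketch (again using f(x) = x², an assumption made for illustration) counts how many updates each rate needs to get within a small tolerance of the minimum:

```python
def steps_to_converge(lr, tol=1e-6, cap=10_000_000):
    """Count updates until x is within tol of the minimum of f(x) = x^2."""
    x, steps = 1.0, 0
    while abs(x) > tol and steps < cap:
        x -= lr * (2 * x)
        steps += 1
    return steps

print(steps_to_converge(0.1))     # converges in dozens of steps
print(steps_to_converge(0.0001))  # needs tens of thousands of steps
```

A thousandfold-smaller rate costs roughly a thousandfold more iterations here, which is exactly the kind of blowup that turns hours of training into days.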

The Relationship Between Learning Rate and Curvature

Not all regions of the error landscape are shaped the same way. Some areas curve gently, while others curve sharply. The learning rate interacts directly with this curvature. Research published in Nature Communications describes a phenomenon called the “edge of stability,” where, during training, the sharpest curvature of the loss surface consistently hovers near a threshold of 2 divided by the learning rate. When curvature exceeds that threshold, the training trajectory becomes unstable and errors start expanding.

In practical terms, this means a single fixed learning rate can be appropriate for some phases of training and dangerous for others. As the model moves through differently shaped regions of the landscape, the same step size can go from perfectly fine to destabilizing. This is a core reason why simply picking one number and leaving it alone for the entire training run often isn’t the best strategy.
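On a quadratic, this threshold can be verified directly. For f(x) = (c/2)·x² the curvature is c and the gradient is c·x, so each update multiplies x by (1 − lr·c), a factor that exceeds 1 in magnitude exactly when c > 2/lr. The specific numbers below are illustrative:

```python
def final_distance(curvature, lr=0.1, steps=50):
    """Gradient descent on f(x) = curvature / 2 * x^2 from x = 1.
    Each step multiplies x by (1 - lr * curvature)."""
    x = 1.0
    for _ in range(steps):
        x -= lr * curvature * x
    return abs(x)

threshold = 2 / 0.1          # the 2 / learning-rate stability limit, here 20
print(final_distance(19.0))  # curvature just below 2/lr: iterates shrink
print(final_distance(21.0))  # curvature just above 2/lr: iterates grow
```

The same learning rate is safe in the gently curved region and destabilizing in the sharply curved one, which is the fixed-rate problem in miniature.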

Learning Rate Schedules

Most modern training runs change the learning rate over time using a schedule. The idea is straightforward: start with a larger learning rate to make fast progress early on, then gradually shrink it as the model gets closer to a good solution and needs more precision.

Several common schedules exist:

  • Step decay reduces the learning rate by a fixed factor at predetermined points during training. A standard approach divides the rate by 10 at 30%, 60%, and 90% of the way through training.
  • Cosine decay smoothly decreases the learning rate following a cosine curve from its starting value down to near zero. In comparative tests, cosine decay matches or outperforms step decay on most tasks.
  • Linear decay reduces the rate at a constant pace from start to finish. Recent research has shown it performs competitively with cosine schedules across a range of problems.
  • 1/t decay shrinks the rate proportionally to the inverse of the step number. It appears frequently in theoretical analysis but is rarely used by practitioners.
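The four schedules can be written as simple functions of the current step t out of the total step count. The names, signatures, and default constants here are my own shorthand for the sketch, not any framework’s API:

```python
import math

def step_decay(base_lr, t, total, milestones=(0.3, 0.6, 0.9), factor=0.1):
    """Divide the rate by 10 at 30%, 60%, and 90% of training."""
    drops = sum(1 for m in milestones if t / total >= m)
    return base_lr * factor ** drops

def cosine_decay(base_lr, t, total):
    """Follow a cosine curve from base_lr smoothly down to zero."""
    return base_lr * 0.5 * (1 + math.cos(math.pi * t / total))

def linear_decay(base_lr, t, total):
    """Shrink at a constant pace from base_lr to zero."""
    return base_lr * (1 - t / total)

def inverse_time_decay(base_lr, t, k=0.01):
    """1/t decay: rate proportional to the inverse of the step number."""
    return base_lr / (1 + k * t)
```

Each function maps training progress to a rate, so plugging any one of them into a training loop in place of a constant is a one-line change.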

Choosing a schedule is partly empirical. You try a few options, watch how the loss decreases over training, and pick the one that converges fastest without becoming unstable.

Adaptive Optimizers

Rather than applying the same learning rate to every parameter, adaptive optimizers adjust the rate individually for each parameter based on its history of gradients. If a particular parameter has been receiving consistently large gradient signals, its effective learning rate gets scaled down. If another parameter’s gradients have been small and noisy, its rate gets scaled up.

The most widely used adaptive optimizer is Adam, which tracks both the average gradient direction and the average gradient magnitude for each parameter, then uses those running averages to tailor the step size. This reduces sensitivity to the initial learning rate you pick, because the algorithm compensates on the fly. You still set a base learning rate, but the optimizer reshapes it continuously.
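A minimal single-parameter version of Adam shows the mechanics. This is a bare sketch of the update rule, not a replacement for a framework implementation; the test problem and step count are my own choices:

```python
import math

def adam(grad, x, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=10_000):
    """Adam for one parameter: running averages of the gradient (m) and the
    squared gradient (v) reshape the base learning rate at every step."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g       # average gradient direction
        v = beta2 * v + (1 - beta2) * g * g   # average gradient magnitude
        m_hat = m / (1 - beta1 ** t)          # correct the startup bias
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = (x - 3)^2 from 0 using the standard default rate of 0.001.
x_adam = adam(lambda x: 2 * (x - 3), x=0.0)
print(x_adam)  # settles near the minimum at 3
```

Note that the division by the root of v_hat is what tailors the step size: parameters with consistently large gradients see their effective rate shrink, regardless of the base rate you chose.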

Standard gradient descent with a fixed learning rate remains a strong choice for generalization, meaning how well the trained model performs on new data it hasn’t seen. But its fixed rate makes it more sensitive to manual tuning. Adaptive methods trade some of that raw generalization strength for much easier setup and more stable training, which is why they dominate in practice.

Finding a Good Starting Value

A practical technique for choosing an initial learning rate is the learning rate range test, introduced by researcher Leslie Smith. The process is simple: start training with a very small learning rate and increase it linearly over a few epochs. Track the loss at each step. Early on, the loss will barely change because the rate is too small. As the rate grows, the loss starts dropping quickly. Eventually, the rate gets too large and the loss spikes upward or becomes erratic.

The sweet spot is the learning rate where the loss was decreasing most steeply, typically a bit before the point where it started climbing. This gives you a reasonable starting value without extensive trial and error. From there, you can pair it with a schedule or adaptive optimizer and fine-tune based on how training progresses.
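A simplified range test can be sketched as follows. For clarity, this sketch evaluates each candidate rate with a short run from the same starting point, rather than raising the rate within one continuous run as Smith’s original procedure does; the toy loss f(x) = (x − 3)² and the sweep bounds are my own choices:

```python
def lr_range_test(grad, loss, x0, lr_lo=1e-5, lr_hi=2.0, num=40, updates=5):
    """Sweep candidate learning rates geometrically and record the loss
    reached after a few gradient descent updates at each rate."""
    history = []
    for i in range(num):
        lr = lr_lo * (lr_hi / lr_lo) ** (i / (num - 1))
        x = x0
        for _ in range(updates):
            x = x - lr * grad(x)
        history.append((lr, loss(x)))
    return history

history = lr_range_test(lambda x: 2 * (x - 3), lambda x: (x - 3) ** 2, x0=0.0)
for lr, l in history:
    print(f"lr={lr:.5f}  loss={l:.4f}")
# Tiny rates barely move the loss, mid-range rates drive it down sharply,
# and the largest rates send it far above where it started.
```

Reading the sweet spot off this curve means picking a rate from the steeply falling region, comfortably below the point where the loss turns upward.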

Common default values also serve as useful starting points. For Adam, 0.001 is the standard default. For plain gradient descent on neural networks, values between 0.01 and 0.1 are typical first guesses. These aren’t magic numbers, but they land in a workable range for many problems and give you a baseline to adjust from.