The learning rate is a number that controls how much a machine learning model adjusts its parameters each time it learns from data. It’s one of the most important settings in training any neural network, and getting it wrong can mean the difference between a model that learns effectively and one that never converges on a useful answer. Typical values range from 0.00001 to 0.001 for modern large language models, though the right choice depends heavily on the model, the data, and the training setup.
How Learning Rate Works
When a machine learning model trains, it repeatedly checks how wrong its predictions are, then nudges its internal parameters to be a little less wrong next time. The learning rate determines the size of that nudge. The core formula looks like this: new weight equals old weight minus the learning rate multiplied by the slope of the error (the gradient). The slope tells the model which direction to move, and the learning rate tells it how far to step in that direction.
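That update rule is a single line of code. Here's a minimal sketch on a toy one-parameter problem (the quadratic error and its slope are illustrative choices, not from any particular library):

```python
# Gradient descent on a toy quadratic error: error(w) = (w - 3)**2.
# The slope (gradient) of this error at w is 2 * (w - 3).

def gradient_descent_step(weight, learning_rate):
    slope = 2 * (weight - 3)               # which direction is "less wrong"
    return weight - learning_rate * slope  # new weight = old weight - lr * slope

w = 0.0
for _ in range(100):
    w = gradient_descent_step(w, learning_rate=0.1)
# w converges toward 3, the minimum of the error
```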
Think of it like descending a foggy mountain. You can feel the slope under your feet, so you know which direction is downhill. The learning rate is your step size. Too large and you might leap right over the valley and end up on the opposite ridge. Too small and you’ll inch along so slowly that you run out of daylight before reaching the bottom.
What Happens When It’s Too High or Too Low
A learning rate that’s too high causes overshooting. The model overcorrects its mistakes at each step, bouncing back and forth across the optimal solution without settling down. In severe cases, the error actually grows with each training step and the model diverges completely, producing nonsensical outputs.
A learning rate that’s too low creates the opposite problem. The model updates its parameters so cautiously that training takes an impractically long time. Worse, it can get trapped in local minima, which are points where the error seems low relative to nearby values but isn’t actually the best the model could achieve. The steps are simply too small to climb out of these shallow traps and find a better solution.
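Both failure modes are easy to see on the same toy quadratic from before. This sketch compares a too-high, too-low, and reasonable learning rate on that assumed one-parameter error:

```python
# Same toy quadratic error: minimum at w = 3, slope = 2 * (w - 3).
def train(learning_rate, steps=50):
    w = 0.0
    for _ in range(steps):
        w = w - learning_rate * 2 * (w - 3)
    return abs(w - 3)  # distance from the optimum after training

too_high   = train(1.1)    # each step overshoots and amplifies the error
too_low    = train(0.001)  # error shrinks by only 0.2% per step
just_right = train(0.1)    # settles close to the optimum
```

With the high rate the distance from the optimum explodes; with the low rate the model barely moves in 50 steps; the middle value converges.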
Common Starting Values
There’s no universal “correct” learning rate, but certain ranges have become standard starting points. For fine-tuning transformer language models like GPT-2, researchers commonly test values of 0.00001, 0.00005, 0.0001, and 0.0005. Most large language models in recent years use the AdamW optimizer paired with a cosine decay schedule, which starts at a peak learning rate and gradually reduces it over the course of training.
One practical method for finding a good learning rate is the range test, introduced by researcher Leslie Smith. You start training with a very small learning rate and linearly increase it over a few epochs, tracking the error at each step. The error will drop steadily at first, then start climbing once the learning rate gets too large. The sweet spot is typically just before the error begins rising again.
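The range test can be sketched in a few lines. This version ramps the rate over individual steps on the toy quadratic used earlier (real runs ramp over a few epochs and track training-batch loss; the bounds here are arbitrary toy choices):

```python
def range_test(lr_min=1e-3, lr_max=3.0, steps=30):
    # Linearly increase the learning rate each step, recording the loss.
    w = 0.0
    lrs, losses = [], []
    for i in range(steps):
        lr = lr_min + (lr_max - lr_min) * i / (steps - 1)  # linear ramp
        lrs.append(lr)
        losses.append((w - 3) ** 2)   # loss on the toy quadratic
        w = w - lr * 2 * (w - 3)
    return lrs, losses

lrs, losses = range_test()
# Loss falls while the rate is reasonable, then blows up once it's too large.
```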
Learning Rate Schedules
Most serious training runs don’t use a single fixed learning rate. Instead, they follow a schedule that changes the rate over time. The most common approaches are:
- Step decay: The learning rate drops by a fixed factor at predetermined points during training, such as dividing by 10 at 30%, 60%, and 90% completion. This was the standard approach for years, particularly in image recognition.
- Cosine decay: The learning rate follows a smooth cosine curve from its peak value down to near zero. This has become the default for most large language models.
- Linear decay: The learning rate decreases in a straight line from peak to zero. Recent research shows this matches or outperforms cosine decay on most tasks.
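The three schedules above can each be written as a function mapping training progress t (from 0 to 1) to a multiplier on the peak learning rate. The boundaries and peak value below are illustrative assumptions:

```python
import math

def step_decay(t):
    # Divide by 10 at 30%, 60%, and 90% of training.
    return 0.1 ** sum(t >= boundary for boundary in (0.3, 0.6, 0.9))

def cosine_decay(t):
    # Smooth curve from 1 at the start to 0 at the end.
    return 0.5 * (1 + math.cos(math.pi * t))

def linear_decay(t):
    # Straight line from 1 down to 0.
    return 1.0 - t

peak_lr = 3e-4  # an assumed peak; real values depend on model and optimizer
lr_halfway = peak_lr * cosine_decay(0.5)
```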
Older theoretical schedules that decrease inversely with the square root of the training step, or inversely with the step number itself, are common in textbooks but rarely used by practitioners. They perform poorly in practice for deep learning despite having nice mathematical properties.
Why Warmup Matters
Many training schedules begin with a warmup phase, where the learning rate starts near zero and gradually ramps up to its target value over the first portion of training. This might seem counterintuitive: why slow down at the beginning when the model has the most to learn?
The benefit is that warmup pushes the model’s parameters into regions of the error landscape that are more stable and well-conditioned before the full learning rate kicks in. This lets the model tolerate a higher peak learning rate than it otherwise could, which improves both final performance and the robustness of hyperparameter tuning. Without warmup, jumping straight to an aggressive learning rate can destabilize early training, especially in large models.
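Warmup is usually combined with one of the decay schedules. Here's a sketch of linear warmup followed by cosine decay; the 2% warmup fraction and peak rate are assumed example values, not prescriptions:

```python
import math

def warmup_cosine(step, total_steps, peak_lr, warmup_frac=0.02):
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from near zero up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from the peak down to near zero.
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * t))

schedule = [warmup_cosine(s, 1000, 3e-4) for s in range(1000)]
```

The resulting curve rises for the first 20 steps, peaks at 3e-4, and then decays smoothly toward zero.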
Adaptive Optimizers and Per-Parameter Rates
Standard gradient descent applies the same learning rate to every parameter in the model. Adaptive optimizers like Adam take a different approach: they maintain a separate effective learning rate for each parameter based on that parameter’s recent gradient history. Adam tracks two running averages for every parameter: the average direction of recent gradients and the average magnitude of recent gradients. It then uses these to scale each parameter’s update individually.
In practice, this means parameters that consistently receive large gradients get smaller effective steps, while parameters with small or noisy gradients get relatively larger ones. This automatic adjustment makes Adam far less sensitive to the initial learning rate choice compared to plain gradient descent, which is a major reason it became the default optimizer for most deep learning work. You still set a base learning rate, but Adam adapts it on the fly for each of the model’s millions or billions of parameters.
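The bookkeeping Adam does per parameter fits in a few lines. This is a single-parameter sketch with bias correction and the common default hyperparameters; the state dictionary is an illustrative structure, not any library's API:

```python
import math

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad       # avg direction
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2  # avg magnitude
    m_hat = state["m"] / (1 - beta1 ** state["t"])             # bias-corrected
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # The effective step for this parameter is lr scaled by m_hat / sqrt(v_hat).
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"m": 0.0, "v": 0.0, "t": 0}
w = 0.0
for _ in range(5000):
    w = adam_step(w, 2 * (w - 3), state)  # same toy quadratic error as before
```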
Cyclical Learning Rates
Rather than only decreasing the learning rate over time, cyclical learning rates oscillate between a lower and upper bound throughout training. This approach, also developed by Leslie Smith, often achieves better accuracy without requiring careful tuning of a decay schedule, and can converge in fewer training steps.
The key insight is that periodically increasing the learning rate helps the model escape saddle points, which are flat regions of the error landscape where gradients become very small and training stalls. Bumping the rate up temporarily lets the model traverse these plateaus more quickly. The short-term increase in error is more than offset by the long-term benefit of reaching better solutions. This trades a brief spike in training loss for access to regions of the parameter space that a monotonically decreasing schedule would never reach.
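The simplest cyclical policy is the triangular one: the rate sweeps linearly from a lower bound up to an upper bound and back, over and over. A sketch, with the bounds and cycle length as assumed example values:

```python
def triangular_lr(step, base_lr, max_lr, cycle_steps):
    # Position within the current cycle, then a 0 -> 1 -> 0 triangle wave.
    cycle_pos = step % cycle_steps
    half = cycle_steps / 2
    frac = 1 - abs(cycle_pos - half) / half
    return base_lr + (max_lr - base_lr) * frac

lrs = [triangular_lr(s, 1e-4, 1e-3, cycle_steps=100) for s in range(300)]
```

Each 100-step cycle starts at 1e-4, peaks at 1e-3 at the midpoint, and returns to 1e-4.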
Scaling With Batch Size
When you increase the batch size (the number of training examples the model sees before each update), you typically need to adjust the learning rate as well. Two common rules of thumb exist: linear scaling, where you increase the learning rate proportionally to the batch size increase, and square root scaling, where you increase it by the square root of the batch size ratio. Linear scaling is more common for very large training runs, but the right choice depends on the optimizer and model architecture. Getting this relationship wrong is a common source of wasted compute when scaling up training infrastructure.
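Both rules of thumb are one-line formulas. In this sketch the base learning rate and batch sizes are hypothetical example values:

```python
import math

def linear_scaled_lr(base_lr, base_batch, new_batch):
    # Scale the learning rate proportionally to the batch size increase.
    return base_lr * (new_batch / base_batch)

def sqrt_scaled_lr(base_lr, base_batch, new_batch):
    # Scale by the square root of the batch size ratio.
    return base_lr * math.sqrt(new_batch / base_batch)

# Going from batch 256 to 1024 (a 4x increase):
lin = linear_scaled_lr(3e-4, 256, 1024)  # 4x the learning rate
rt = sqrt_scaled_lr(3e-4, 256, 1024)     # 2x the learning rate
```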

