An optimizer is the algorithm that adjusts a neural network’s internal parameters during training to minimize errors in its predictions. Every time a neural network processes training data and measures how wrong its output is (using a “loss function”), the optimizer decides how much and in which direction to shift each parameter so the network performs a little better on the next pass. Without an optimizer, a neural network would be a static set of random numbers with no ability to learn.
How Optimizers Update the Network
A neural network can have millions of adjustable parameters, often called weights. During training, the network makes a prediction, compares it to the correct answer, and calculates an error score. Then, through a process called backpropagation, it figures out how much each individual weight contributed to that error. This contribution is expressed as a gradient: essentially a slope that tells you whether increasing or decreasing a particular weight would reduce the error.
The optimizer takes those gradients and uses them to update every weight in the network. The simplest version of this, gradient descent, follows a straightforward rule: shift each weight in the opposite direction of its gradient by a small amount. That small amount is controlled by the learning rate, a number that determines how big each step is. A learning rate that’s too large causes the network to overshoot good solutions and become unstable. One that’s too small makes training painfully slow or causes the network to get stuck.
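To make the rule concrete, here is a minimal sketch of gradient descent on a toy one-parameter problem. The dataset, starting weight, and learning rate are all illustrative choices, not tied to any framework:

```python
# Minimal gradient descent: fit a single weight w so that w * x ≈ y.
# Loss is mean squared error over a tiny dataset; the gradient is
# computed analytically rather than by backpropagation.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # y = 2x, so the ideal w is 2.0

w = 0.0             # arbitrary starting weight
learning_rate = 0.1

for step in range(100):
    # Gradient of (1/N) * sum((w*x - y)^2) with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    # The core rule: move opposite the gradient, scaled by the learning rate
    w -= learning_rate * grad

print(round(w, 4))  # → 2.0
```

Raising the learning rate well past 0.1 in this toy makes the weight overshoot and diverge, while a much smaller value takes many more steps to converge, which is exactly the trade-off described above.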
Batch, Stochastic, and Mini-Batch Variants
The original form of gradient descent, called batch gradient descent, computes the gradient using the entire training dataset before making a single update. This gives you a precise direction to move in, but for large datasets it’s extremely slow because you need to process every example just to take one step.
Stochastic gradient descent (SGD) sits at the opposite extreme. It computes the gradient from a single training example and updates the weights immediately. Each individual step is “noisier” because it’s based on just one data point, but the network gets updated far more frequently. In practice, taking many imperfect steps is more effective at reducing error quickly than taking a small number of perfect ones, especially early in training when the weights are far from a good solution.
Mini-batch gradient descent splits the difference. You divide your dataset into small groups (typically 32, 64, or 128 examples), compute the gradient on each group, and update the weights after each one. A batch size of 1 is pure SGD. A batch size equal to the full dataset is batch gradient descent. Mini-batch is the standard approach used in virtually all modern neural network training because it balances computational efficiency with stable updates.
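A sketch of the mini-batch loop, again on a toy one-weight problem with illustrative sizes (8 examples, batches of 4):

```python
import random

# Mini-batch gradient descent: shuffle the dataset each epoch, split it
# into small batches, and update the weight once per batch rather than
# once per full pass over the data.
data = [(float(x), 2.0 * x) for x in range(1, 9)]  # 8 examples of y = 2x
batch_size = 4
w, learning_rate = 0.0, 0.02

random.seed(0)
for epoch in range(200):
    random.shuffle(data)  # fresh example order each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # MSE gradient computed on this batch only
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad

print(round(w, 3))  # → 2.0
```

Setting `batch_size = 1` here gives pure SGD; setting it to `len(data)` gives batch gradient descent, with no other code changes.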
Momentum: Building Speed in the Right Direction
Plain SGD can be slow when the loss surface has long, shallow valleys or when gradients point in slightly different directions from step to step. Momentum optimization fixes this by keeping a running average of past gradients, similar to a ball rolling downhill that accumulates speed. Instead of relying only on the current gradient, the optimizer blends it with the direction it was already heading. This helps the network accelerate through flat regions and reduces the zigzagging that wastes steps. Momentum also helps the optimizer roll past small local minima rather than getting trapped in them.
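The "running average of past gradients" can be sketched as a velocity term, shown here as classic heavy-ball momentum on a toy two-dimensional valley (all constants are illustrative):

```python
# Heavy-ball momentum sketch on a narrow 2-D valley: f(w) = w0^2 + 10*w1^2.
# The update blends the current gradient with a running velocity, so
# consistent directions accumulate speed while zigzag components cancel.
def grad(w):
    return [2 * w[0], 20 * w[1]]  # partial derivatives of f

w = [5.0, 5.0]
velocity = [0.0, 0.0]
learning_rate, beta = 0.05, 0.9   # beta: how strongly past direction persists

for step in range(300):
    g = grad(w)
    for i in range(2):
        velocity[i] = beta * velocity[i] + g[i]  # accumulate direction
        w[i] -= learning_rate * velocity[i]      # step along the velocity

loss = w[0] ** 2 + 10 * w[1] ** 2
print(loss < 1e-6)  # True: both coordinates have settled near the minimum
```

With `beta = 0.0` this reduces to plain gradient descent; values around 0.9 are a common default.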
Adaptive Learning Rates
One of the biggest breakthroughs in optimizer design was the idea that different weights might need different learning rates. A weight that consistently receives large gradients probably needs smaller steps, while one with small gradients might benefit from larger ones. Adaptive optimizers handle this automatically.
AdaGrad was the first widely used adaptive optimizer. It tracks the sum of all past squared gradients for each weight and uses that history to scale down the learning rate for frequently updated weights. This works well when training data is sparse (some features appear rarely and need bigger updates when they do show up), but for dense data the learning rate can shrink so aggressively that training stalls before reaching a good solution.
RMSprop addressed that problem by using an exponential moving average of recent squared gradients instead of the full history. This means the learning rate adapts to recent conditions rather than being permanently dragged down by early training behavior, making it more robust for complex tasks.
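The difference comes down to one line: what accumulates in the denominator. A single-weight RMSprop sketch, with the AdaGrad variant noted in a comment (all hyperparameters illustrative):

```python
import math

# RMSprop sketch for a single weight: the step is divided by a moving
# average of recent squared gradients. AdaGrad would instead use
# avg_sq = avg_sq + g * g  — a sum over all history that only ever grows,
# so its effective learning rate can shrink toward zero and stall training.
w = 5.0
avg_sq = 0.0
learning_rate, decay, eps = 0.01, 0.9, 1e-8

for step in range(1000):
    g = 2 * w  # gradient of the toy loss w^2
    avg_sq = decay * avg_sq + (1 - decay) * g * g  # recent squared gradients
    w -= learning_rate * g / (math.sqrt(avg_sq) + eps)

print(abs(w) < 0.1)  # True: the weight has been driven close to zero
```

Because the denominator tracks only recent gradients, the effective step size stays roughly constant here instead of decaying away as it would under AdaGrad.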
Adam: The Default Choice
Adam (Adaptive Moment Estimation) combines the best ideas from momentum and RMSprop. It maintains two running averages for each weight: one tracking the direction of recent gradients (like momentum) and another tracking their magnitude (like RMSprop). By combining these, Adam adapts both the direction and the size of each step on a per-weight basis.
The default settings for Adam in frameworks like PyTorch are a learning rate of 0.001, decay rates of 0.9 and 0.999 for the first- and second-moment averages (often written beta1 and beta2), and a small stability constant (epsilon) of 1e-8. These defaults work surprisingly well across a wide range of problems, which is a major reason Adam became the go-to optimizer. If you need fast results with minimal tuning, Adam is typically where you start.
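Putting the two running averages together, here is a single-weight Adam sketch using those default hyperparameters on a toy loss (the bias-correction lines compensate for initializing both averages at zero):

```python
import math

# Adam sketch for one weight, with the standard defaults:
# lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8.
w = 3.0
m, v = 0.0, 0.0  # first and second moment estimates
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 5001):
    g = 2 * w                            # gradient of the toy loss w^2
    m = beta1 * m + (1 - beta1) * g      # direction average (like momentum)
    v = beta2 * v + (1 - beta2) * g * g  # magnitude average (like RMSprop)
    m_hat = m / (1 - beta1 ** t)         # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(abs(w) < 0.1)  # True: converged near the minimum at w = 0
```

Note that `m_hat / sqrt(v_hat)` is roughly sign-sized when gradients are consistent, so the step taken is close to the learning rate per update regardless of the raw gradient's scale.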
AdamW and Decoupled Weight Decay
Standard Adam has a subtle flaw in how it handles weight decay, a regularization technique that penalizes large weights to prevent overfitting. When you add the penalty to the loss function as an L2 term (the traditional approach), its gradient gets scaled by Adam's adaptive learning rates along with everything else, so weights with large gradient histories end up regularized less than intended, weakening the effect in unpredictable ways.
AdamW fixes this by applying weight decay directly during the parameter update step, completely separate from the gradient calculations. This “decoupled” approach gives more consistent regularization and better generalization, meaning the model performs more reliably on data it hasn’t seen before. AdamW is now the preferred optimizer for training large models, including most large language models and vision transformers, because the benefits of proper regularization compound as model size increases.
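The decoupling amounts to one extra line in the update, applied after (and outside) the adaptive step. A sketch with illustrative hyperparameters:

```python
import math

# AdamW sketch: identical to Adam except for the final line, where weight
# decay shrinks the weight directly, outside the adaptive scaling. With
# L2-in-the-loss, the decay term would instead flow through m and v and
# get divided by sqrt(v_hat) — the coupling AdamW removes.
w = 3.0
m, v = 0.0, 0.0
lr, beta1, beta2, eps = 0.005, 0.9, 0.999, 1e-8
weight_decay = 0.1

for t in range(1, 2001):
    g = 2 * w                            # gradient of the loss only, no penalty
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    w -= lr * weight_decay * w           # decoupled decay: plain shrink toward 0

print(abs(w) < 0.2)  # True: the weight is held near zero
```

Because the decay line never touches `m` or `v`, every weight is shrunk by the same proportional amount each step, which is what makes the regularization consistent.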
Learning Rate Schedules
Even with an adaptive optimizer, the base learning rate often benefits from changing over the course of training. A learning rate schedule modifies the learning rate at each step or epoch according to a predefined rule. Several strategies are common:
- Step decay: The learning rate stays constant for a set number of epochs, then drops by a fixed factor. Simple and easy to tune.
- Exponential decay: The learning rate decreases quickly at first and more slowly later, which works well when early training needs big adjustments and later training needs fine-tuning.
- Cosine annealing: The learning rate follows a smooth cosine curve from its initial value down to near zero, often producing strong results in image classification tasks.
- Warmup: The learning rate starts very small and ramps up over the first few hundred or thousand steps before following another schedule. This prevents instability at the very beginning of training when gradients can be unreliable.
The right schedule depends on the problem. Many modern training recipes combine warmup with cosine annealing, but simpler schedules like step decay still work well for smaller models.
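The warmup-plus-cosine combination can be sketched as a small function of the step number; all the step counts and the peak rate here are illustrative, not defaults of any framework:

```python
import math

# Learning-rate schedule sketch: linear warmup, then cosine annealing
# from the peak rate down to zero over the remaining steps.
def lr_at(step, total_steps=1000, warmup_steps=100, peak_lr=0.001):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear ramp from 0 to peak
    # Cosine decay from peak_lr toward 0 over the rest of training
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(0))              # 0.0     (start of warmup)
print(lr_at(100))            # 0.001   (peak, warmup done)
print(round(lr_at(550), 6))  # 0.0005  (halfway through the cosine decay)
print(round(lr_at(1000), 6)) # 0.0     (fully annealed)
```

In practice a function like this is queried once per step and the result is written into the optimizer's learning rate before the update.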
Saddle Points and the Loss Landscape
Neural network loss surfaces are not smooth bowls with a single minimum. In high-dimensional spaces, saddle points (where the surface curves up in some directions and down in others) are far more common than true local minima. A naive optimizer can slow to a crawl near a saddle point because the gradients become very small, making it look like the network has converged when it hasn’t.
SGD’s inherent noise actually helps here. The randomness from computing gradients on small batches naturally perturbs the optimizer enough to escape most saddle points. Research from Berkeley’s AI lab has shown that even a simple gradient descent algorithm with small random perturbations can escape saddle points in roughly the same time it takes standard gradient descent to find any stationary point at all. Momentum and adaptive methods provide additional escape mechanisms by maintaining velocity through flat regions. This is one reason why the noisy, approximate nature of SGD turns out to be a feature rather than a bug for training deep networks.
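The escape mechanism can be seen on the simplest saddle there is. This sketch uses a hand-built 2-D function (not from the research cited above) to show why a tiny perturbation matters:

```python
import random

# Saddle-point sketch: f(x, y) = x^2 - y^2 has a saddle at the origin,
# curving up along x and down along y. Gradient descent started exactly
# on the ridge (y = 0) never leaves it; a tiny random perturbation is
# enough to escape, mirroring the role of mini-batch noise in SGD.
def step(x, y, lr=0.1):
    return x - lr * 2 * x, y + lr * 2 * y  # grad f = (2x, -2y)

# Exact start on the ridge: y stays 0 forever, so progress stalls
x, y = 1.0, 0.0
for _ in range(150):
    x, y = step(x, y)
print(y)  # 0.0 — stuck at the saddle

# Same start plus a tiny perturbation: y grows exponentially and escapes
random.seed(0)
x, y = 1.0, random.uniform(-1e-9, 1e-9)
for _ in range(150):
    x, y = step(x, y)
print(abs(y) > 1.0)  # True — the perturbation compounded into an escape
```

Each step multiplies the downhill coordinate by a factor greater than one, so even a perturbation on the order of 1e-9 compounds into a full escape within a few hundred steps.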
Choosing the Right Optimizer
For most projects, Adam or AdamW with default settings is a strong starting point. It converges quickly, requires minimal hyperparameter tuning, and handles a wide variety of architectures and data types. If you’re training large models or care about generalization, AdamW is the better choice because of its cleaner regularization behavior.
SGD with momentum remains popular in computer vision, where it often achieves slightly better final accuracy than Adam if you’re willing to invest time tuning the learning rate and schedule. For sparse data (natural language processing with large vocabularies, recommendation systems with many rare items), AdaGrad or RMSprop can outperform plain SGD because they give rare features proportionally larger updates. For very large batch training in distributed systems, specialized optimizers like LARS pair well with SGD to maintain training stability across many GPUs.
The practical reality is that optimizer choice matters less than getting the learning rate right. Spending time on a good learning rate schedule and tuning the learning rate itself will almost always improve your results more than switching between Adam and SGD.

