What Is Binary Cross Entropy? Formula and Applications

Binary cross entropy is a loss function that measures how far off a model’s predicted probabilities are from the actual correct answers in a yes-or-no classification problem. If a model predicts a 90% chance that an email is spam, and it actually is spam, the loss is small. If the model predicts 10% and it’s actually spam, the loss is large. The function converts that gap between prediction and reality into a single number the model can use to improve itself during training.

You’ll also see it called “log loss,” which is the same thing under a different name. Both terms refer to the negative average log-probability the model assigns to the correct class, and they’re used interchangeably across machine learning libraries and textbooks.

How the Formula Works

The core idea is straightforward once you break it apart. For each example in your data, you have two things: the actual label (1 for yes, 0 for no) and the model’s predicted probability that the answer is yes. The formula computes a penalty based on how confident the model was in the wrong direction.

For a single example, binary cross entropy looks like this: multiply the actual label by the log of the predicted probability, then add (1 minus the actual label) times the log of (1 minus the predicted probability), and finally flip the sign. You then average this across all examples in your dataset.
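That description translates directly into a few lines of code. The sketch below is illustrative (the function name and the `eps` clipping value are my own choices, not from any particular library):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Average binary cross entropy over a dataset.

    y_true: list of 0/1 labels; y_pred: list of predicted probabilities.
    eps clips probabilities away from 0 and 1 so log() stays finite.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident correct predictions give a small loss...
print(binary_cross_entropy([1, 0], [0.9, 0.1]))   # ≈ 0.105
# ...while confidently wrong predictions give a large one.
print(binary_cross_entropy([1, 0], [0.1, 0.9]))   # ≈ 2.303
```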

Here’s why that works. When the true label is 1, the second half of the formula drops away because it’s multiplied by zero. All that’s left is the negative log of the predicted probability. If the model predicted 0.95, the log of 0.95 is close to zero, so the loss is tiny. If the model predicted 0.05, the log of 0.05 is a large negative number, producing a heavy penalty. The same logic applies in reverse when the true label is 0: only the second half of the formula matters, and the model gets punished for predicting a high probability of “yes” when the answer was “no.”

This logarithmic scaling is important. A confident wrong answer (predicting 0.99 when the truth is 0) gets penalized far more harshly than a slightly wrong answer (predicting 0.6 when the truth is 0). That steep penalty curve pushes the model to avoid overconfident mistakes.
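You can see the steepness of that curve by evaluating the penalty at a few points. With a true label of 0, the per-example loss is the negative log of (1 minus the predicted probability):

```python
import math

# True label is 0, so the per-example loss is -log(1 - p).
for p in (0.6, 0.9, 0.99):
    print(p, round(-math.log(1 - p), 3))
# 0.6  -> 0.916
# 0.9  -> 2.303
# 0.99 -> 4.605
```

Moving from a prediction of 0.6 to 0.99 roughly quintuples the penalty, even though the prediction only moved by 0.39.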

Why It Connects to Maximum Likelihood

Binary cross entropy isn’t an arbitrary choice. It falls directly out of a foundational idea in statistics called maximum likelihood estimation. The goal of maximum likelihood is to find the model parameters that make the observed data most probable. If your training data shows that a particular user clicked on an ad, the best model is one that assigns a high probability to “click” for that user’s features.

Maximizing the product of all those probabilities across every training example is mathematically equivalent to minimizing the sum of the negative log probabilities, which is exactly the binary cross entropy formula. So when a neural network minimizes binary cross entropy during training, it’s doing the same thing as finding the statistically most likely explanation for the data it’s seen.
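The equivalence is easy to verify numerically. The example values below are made up purely for illustration:

```python
import math

# Probabilities the model assigns to the *observed* label of each example.
probs = [0.9, 0.8, 0.7, 0.95]

likelihood = math.prod(probs)                      # what MLE maximizes
total_neg_log = sum(-math.log(p) for p in probs)   # what training minimizes

# Minimizing the summed negative logs is the same as maximizing the
# product: exponentiating the negated sum recovers the likelihood exactly.
assert abs(math.exp(-total_neg_log) - likelihood) < 1e-12
```

The log turns an unwieldy product into a sum, and negating it flips a maximization problem into the minimization that gradient descent expects.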

What the Loss Values Tell You

A binary cross entropy value of zero would mean the model perfectly predicted every outcome with 100% confidence, which never happens in practice. The loss will always be greater than zero because no model can perfectly approximate the true underlying distribution of your data. Lower values mean the predicted probabilities are closely tracking reality; higher values mean the model is frequently confident in the wrong direction.

There’s no universal threshold for “good” loss. A loss of 0.3 might be excellent for one problem and mediocre for another, depending on how noisy the data is. What matters is the trend: loss should decrease during training and remain stable on data the model hasn’t seen before. If training loss keeps dropping but validation loss starts rising, the model is memorizing rather than learning.

Where Binary Cross Entropy Applies

The name says “binary,” but the function covers more ground than simple two-class problems. There are two main scenarios where it’s the right choice:

  • Binary classification: Two mutually exclusive outcomes. Spam or not spam, fraud or legitimate, click or no click.
  • Multi-label classification: Each example can belong to multiple categories at once. A photo might contain both a dog and a car. Each label is treated as its own independent yes-or-no decision, and binary cross entropy is applied to each one separately.

The situation where you should not use it is multi-class classification, where each example belongs to exactly one category out of several (like classifying an image as a cat, dog, or bird). That calls for categorical cross entropy, which handles the mutual exclusivity between classes. Binary cross entropy treats each output as independent, so applying it to a multi-class problem where only one answer can be correct will produce misleading gradients during training.
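The multi-label case can be sketched in a few lines. The helper name and the example numbers here are hypothetical:

```python
import math

def bce(y, p, eps=1e-7):
    """Per-example binary cross entropy, with clipping to keep log() finite."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# One photo, three independent labels: dog, car, person.
labels = [1, 1, 0]           # the photo contains a dog and a car
preds  = [0.8, 0.7, 0.2]     # one sigmoid output per label

# Each label is scored as its own yes/no problem, then averaged.
loss = sum(bce(y, p) for y, p in zip(labels, preds)) / len(labels)
print(round(loss, 3))  # ≈ 0.268
```

Nothing forces the predicted probabilities to sum to 1 across labels, which is exactly why this setup breaks down for mutually exclusive classes.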

The Sigmoid Connection

Binary cross entropy expects its input to be a probability between 0 and 1. Raw neural network outputs, however, can be any number on the real number line. The sigmoid activation function bridges that gap by squashing any value into the 0-to-1 range. A large positive number maps close to 1, a large negative number maps close to 0, and zero maps to 0.5.

This is why binary classification networks use a sigmoid function on their final layer. Without it, the model’s output wouldn’t represent a valid probability, and plugging it into the log function inside binary cross entropy could produce nonsensical results or numerical errors.
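A minimal sigmoid shows the squashing behavior described above:

```python
import math

def sigmoid(x):
    """Map any real number into the (0, 1) range."""
    return 1 / (1 + math.exp(-x))

print(sigmoid(5))    # ≈ 0.993  (large positive -> close to 1)
print(sigmoid(-5))   # ≈ 0.007  (large negative -> close to 0)
print(sigmoid(0))    # 0.5
```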

Numerical Stability in Practice

A practical issue arises when a model predicts a probability of exactly 0 or exactly 1. Taking the log of zero produces negative infinity, which crashes the computation. Deep learning frameworks handle this behind the scenes. PyTorch, for example, offers a combined loss function, BCEWithLogitsLoss, that merges the sigmoid and the cross entropy calculation into a single operation. This takes advantage of a mathematical rearrangement (the log-sum-exp trick) that avoids computing the log of very small numbers directly, preventing overflow and underflow errors.
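To make the trick concrete, here is a pure-Python sketch of a stable loss computed straight from the raw logit. This is an illustration of the same idea frameworks use internally, not PyTorch’s actual implementation:

```python
import math

def bce_with_logits(x, y):
    """Numerically stable BCE computed from a raw logit x (no sigmoid first).

    Algebraically equal to -(y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))),
    but rearranged so exp() is only ever called on a non-positive number,
    which cannot overflow, and log1p never receives zero.
    """
    return max(x, 0) - x * y + math.log1p(math.exp(-abs(x)))

# A huge logit makes sigmoid(x) round to exactly 1.0, so the naive formula
# would take log(0) when the true label is 0. The stable form is fine:
print(bce_with_logits(1000.0, 0))  # ≈ 1000.0
print(bce_with_logits(2.0, 1))     # ≈ 0.127
```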

If you’re implementing binary cross entropy yourself, the simplest safeguard is to clip predicted probabilities to a tiny range like 0.0000001 to 0.9999999 before taking the log.
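That safeguard is a one-liner. The constant and helper name below are just illustrative:

```python
import math

EPS = 1e-7

def safe_log_prob(p):
    # Clip p into [EPS, 1 - EPS] so log() never sees exactly 0 or 1.
    return math.log(min(max(p, EPS), 1 - EPS))

print(safe_log_prob(0.0))  # ≈ -16.118 instead of negative infinity
```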

Handling Imbalanced Classes

Standard binary cross entropy treats every example equally, which becomes a problem when one class vastly outnumbers the other. If 99% of your data is “not fraud” and 1% is “fraud,” a model can achieve very low loss by predicting “not fraud” for everything. It barely gets penalized for missing the rare fraud cases because they contribute so little to the average loss.

Weighted binary cross entropy solves this by assigning a higher penalty to mistakes on the minority class. The weight for each class is typically calculated by dividing the total number of examples by twice the count of that specific class. This means the rare class gets a much larger weight, so missing a fraud case costs the model significantly more than missing a non-fraud case. Most frameworks let you pass these weights directly into the loss function as an argument, so you don’t need to modify the formula yourself.
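The sketch below shows both the weighting formula described above and its effect on the 99-to-1 fraud example. The function name and the made-up predictions are illustrative:

```python
import math

def weighted_bce(y_true, y_pred, w_pos, w_neg, eps=1e-7):
    """BCE with per-class weights; the rare positive class gets w_pos."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
    return total / len(y_true)

# 99 "not fraud" examples and 1 "fraud": weight = n_total / (2 * n_class)
n, n_pos, n_neg = 100, 1, 99
w_pos, w_neg = n / (2 * n_pos), n / (2 * n_neg)   # 50.0 and ~0.505

# A model that confidently predicts "not fraud" for everything:
labels = [0] * 99 + [1]
preds  = [0.01] * 100

print(round(weighted_bce(labels, preds, 1.0, 1.0), 3))      # ≈ 0.056
print(round(weighted_bce(labels, preds, w_pos, w_neg), 3))  # ≈ 2.308
```

Unweighted, the lazy model looks nearly perfect; with class weights, the single missed fraud case dominates the loss and the model is forced to pay attention to it.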

This algorithm-level approach is often simpler than data-level fixes like oversampling the minority class or undersampling the majority class, and it avoids the risk of duplicating or discarding training data.