What Is Linear Discriminant Analysis and How It Works

Linear discriminant analysis (LDA) is a statistical method that finds the best way to separate data into known groups. It does this by projecting high-dimensional data onto a lower-dimensional space where the groups are as far apart as possible. LDA serves double duty: it’s both a classification technique (assigning new data points to a group) and a dimensionality reduction technique (simplifying complex datasets while preserving the information that distinguishes groups).

The Core Idea Behind LDA

Imagine you have measurements of two species of flower: petal length, petal width, stem height, and leaf size. Each flower has four numbers describing it, so it lives in a four-dimensional space you can’t visualize. LDA finds a single line (or a small set of lines) to project all those measurements onto, chosen so that the two species end up as far apart as possible on that line while each species stays tightly clustered.

Formally, LDA maximizes the ratio of between-class variance to within-class variance. Between-class variance measures how far apart the group centers are from each other. Within-class variance measures how spread out the data points are inside each group. A good projection pushes the group centers apart (large between-class variance) while keeping each group compact (small within-class variance). This ratio is known as the Fisher criterion, named after the statistician Ronald Fisher, who introduced the approach.
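The Fisher criterion is easy to compute for any candidate projection direction. Here is a minimal numpy sketch on invented two-class data (the function name and values are mine, chosen only for illustration): a direction aligned with the class separation scores far higher than one orthogonal to it.

```python
import numpy as np

# Toy data: two classes in 2-D (made-up values, not from the article)
rng = np.random.default_rng(0)
class_a = rng.normal([0, 0], 0.5, size=(50, 2))
class_b = rng.normal([3, 1], 0.5, size=(50, 2))

def fisher_criterion(w, x1, x2):
    """Ratio of between-class to within-class variance after projecting onto w."""
    w = w / np.linalg.norm(w)
    p1, p2 = x1 @ w, x2 @ w              # project each class onto the line
    between = (p1.mean() - p2.mean()) ** 2
    within = p1.var() + p2.var()
    return between / within

# A direction pointing from one class center toward the other...
good = fisher_criterion(np.array([3.0, 1.0]), class_a, class_b)
# ...versus one orthogonal to the separation
bad = fisher_criterion(np.array([-1.0, 3.0]), class_a, class_b)
print(good > bad)  # True
```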

How the Algorithm Works

LDA follows a structured sequence of calculations. First, it computes the mean of each class and the overall mean of the entire dataset. From these means, it builds two key matrices: the within-class scatter matrix, which captures how much individual data points deviate from their own class mean, and the between-class scatter matrix, which captures how much each class mean deviates from the overall mean.

Next, LDA solves what’s called a generalized eigenvalue problem using these two matrices. The resulting eigenvectors point in the directions that best separate the classes. You can think of each eigenvector as a “discriminant axis,” a new dimension along which the groups are maximally distinguishable. The data is then projected onto these axes, collapsing the original high-dimensional space into a much smaller one that retains the most class-relevant information.
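The scatter-matrix construction and eigenproblem above can be sketched in a few lines of numpy. This is a bare-bones illustration on made-up two-class, 4-D data, not a production implementation:

```python
import numpy as np

# Invented data: two classes of 40 points each in 4-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 4)), rng.normal(2, 1, (40, 4))])
y = np.array([0] * 40 + [1] * 40)

overall_mean = X.mean(axis=0)
S_w = np.zeros((4, 4))                   # within-class scatter
S_b = np.zeros((4, 4))                   # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_w += (Xc - mc).T @ (Xc - mc)       # deviations from own class mean
    diff = (mc - overall_mean).reshape(-1, 1)
    S_b += len(Xc) * diff @ diff.T       # class mean vs. overall mean

# Generalized eigenvalue problem: directions maximizing the Fisher ratio
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order].real               # discriminant axes, best first

X_proj = X @ W[:, :1]                    # project onto the top axis
```

Projecting onto the leading eigenvector collapses the 4-D data to one dimension in which the two classes sit far apart relative to their spread.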

For a problem with c classes, LDA produces at most c − 1 discriminant axes. If you’re separating two groups, you get one axis (a line). Three groups give you at most two axes (a plane). This is a hard ceiling on how many dimensions LDA can output, regardless of how many features your original data has.
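The c − 1 ceiling comes from the rank of the between-class scatter matrix: with equal class sizes, the c class-mean deviations from the overall mean sum to zero, so they span at most c − 1 directions. A quick numpy check with three invented class means in a 6-D feature space:

```python
import numpy as np

# Three hypothetical class means in 6-D (equal class sizes assumed)
means = [np.zeros(6), np.full(6, 3.0), np.r_[np.full(3, 3.0), np.zeros(3)]]
overall = np.mean(means, axis=0)

S_b = np.zeros((6, 6))
for m in means:
    d = (m - overall).reshape(-1, 1)
    S_b += d @ d.T                       # one rank-1 update per class

# c = 3 classes: rank of S_b is at most c - 1 = 2, so LDA can yield
# at most two discriminant axes no matter how many features exist.
print(np.linalg.matrix_rank(S_b))        # 2
```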

How LDA Classifies New Data

Once the discriminant axes are found, LDA classifies a new data point by applying Bayes’ theorem. It assumes the data within each class follows a bell-shaped (Gaussian) distribution and that all classes share the same spread, or covariance structure. Given a new observation, it calculates the probability that the observation belongs to each class and assigns it to whichever class has the highest probability.

Because every class shares the same covariance, the math simplifies so that the decision boundaries between classes are straight lines (or flat planes in higher dimensions). That’s where the “linear” in linear discriminant analysis comes from. The boundary between any two classes is the set of points where the probability of belonging to either class is equal.
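Because the covariance is shared, the quadratic terms cancel and each class's score is a linear function of the input. A minimal sketch with invented means, priors, and a pooled covariance (all values hypothetical):

```python
import numpy as np

# Hypothetical shared (pooled) covariance and two class means
mu = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 1.0])}
prior = {0: 0.5, 1: 0.5}
sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
sigma_inv = np.linalg.inv(sigma)

def discriminant(x, k):
    """Linear score for class k; the largest score wins. The x-dependent
    quadratic term is identical across classes, so it drops out."""
    return x @ sigma_inv @ mu[k] - 0.5 * mu[k] @ sigma_inv @ mu[k] + np.log(prior[k])

def classify(x):
    return max(prior, key=lambda k: discriminant(x, k))

print(classify(np.array([0.2, -0.1])))   # 0  (near class 0's mean)
print(classify(np.array([2.8, 1.2])))    # 1  (near class 1's mean)
```

The decision boundary is exactly the set of points where the two scores are equal, and because each score is linear in x, that set is a straight line.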

Assumptions LDA Requires

LDA rests on a few important assumptions. Violating them doesn’t always make LDA useless, but it can reduce accuracy or produce misleading results.

  • Multivariate normality. The features within each class should follow a roughly normal (bell-curve) distribution. This means LDA works best with continuous numerical variables. Categorical variables like “yes/no” or “red/blue/green” violate this assumption.
  • Equal covariance matrices. Each class should have a similar spread and shape in the feature space. If one group’s data is tightly clustered while another’s is widely scattered, LDA’s linear boundaries won’t fit well. A statistical check called Box’s M test can evaluate this assumption.
  • Independence of observations. Each data point should be independent from the others, not paired, repeated, or clustered in some way.

When the equal-covariance assumption doesn’t hold, a variant called quadratic discriminant analysis (QDA) allows each class to have its own covariance structure. The trade-off is that QDA needs more data to estimate those extra parameters reliably.
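The QDA difference can be sketched directly: fit a separate covariance per class, and keep the covariance-dependent terms in the score that LDA's shared-covariance assumption would cancel. A toy numpy version (names and data invented), with one tight class and one widely scattered class:

```python
import numpy as np

def qda_fit(X, y):
    """Per-class mean, covariance, and prior: the extra parameters QDA estimates."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def qda_score(x, mean, cov, prior):
    # The log-determinant and quadratic terms differ per class here,
    # so the resulting decision boundaries curve instead of staying linear.
    diff = x - mean
    return (-0.5 * np.log(np.linalg.det(cov))
            - 0.5 * diff @ np.linalg.inv(cov) @ diff
            + np.log(prior))

# Invented data: a tightly clustered class and a widely scattered one
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (60, 2)), rng.normal(0, 3.0, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
params = qda_fit(X, y)

def qda_classify(x):
    return max(params, key=lambda c: qda_score(x, *params[c]))
```

Points near the origin get assigned to the tight class, while distant points go to the scattered one, a boundary no straight line could draw.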

LDA vs. PCA

Both LDA and principal component analysis (PCA) reduce the number of dimensions in a dataset, but they do it for different reasons. PCA is unsupervised: it finds the directions of maximum overall variance in the data without knowing or caring about class labels. It answers the question “where does the data vary most?” LDA is supervised: it uses class labels to find the directions that best separate groups. It answers the question “where do these groups differ most?”

This distinction matters in practice. PCA might find a direction where the data varies a lot, but that variation could be entirely within a single group and useless for telling groups apart. LDA, by design, ignores within-group variation and focuses on the variation between groups. If your goal is classification or group separation, LDA is typically the better choice. If you’re exploring a dataset without predefined groups, PCA is the right tool.
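The contrast shows up clearly on data constructed so that the largest overall variance is pure within-group noise (an artificial example built for this purpose): PCA's top direction chases the noise, while LDA's direction picks out the axis that actually separates the groups.

```python
import numpy as np

# Both classes vary widely along x (within-group noise) but are
# separated only along y.
rng = np.random.default_rng(4)
class_a = np.column_stack([rng.normal(0, 5, 200), rng.normal(0, 0.3, 200)])
class_b = np.column_stack([rng.normal(0, 5, 200), rng.normal(2, 0.3, 200)])
X = np.vstack([class_a, class_b])

# PCA: top eigenvector of the overall covariance (labels ignored)
cov = np.cov(X, rowvar=False)
pca_dir = np.linalg.eigh(cov)[1][:, -1]  # eigh sorts eigenvalues ascending

# LDA for two classes: direction proportional to S_w^{-1} (m1 - m0)
S_w = np.cov(class_a, rowvar=False) + np.cov(class_b, rowvar=False)
lda_dir = np.linalg.inv(S_w) @ (class_b.mean(axis=0) - class_a.mean(axis=0))
lda_dir /= np.linalg.norm(lda_dir)

print(pca_dir)  # dominated by the noisy x axis
print(lda_dir)  # dominated by the separating y axis
```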

Common Applications

LDA appears across many fields wherever the goal is to classify observations into known categories based on multiple measurements. In medicine, researchers have used LDA to classify patients as healthy or pathological based on clinical measurements. One study applied LDA to a dataset of over 10,000 people evaluated for heart disease and achieved classification accuracy between 84.5% and 86%, with specificity above 97%. That high specificity means the model rarely flagged a healthy person as having disease, though its sensitivity (catching actual disease cases) was more moderate, at 62% to 66%.

In face recognition, LDA (sometimes called “Fisherfaces” in that context) projects facial images onto discriminant axes that highlight differences between individuals while downplaying variations in lighting or expression. It’s also used in natural language processing to separate documents by topic, in ecology to distinguish species from morphological measurements, and in marketing to segment customers into groups based on purchasing behavior.

Limitations to Be Aware Of

LDA has a well-known weakness called the small sample size problem. When the number of features exceeds the number of data points, the within-class scatter matrix becomes singular (impossible to invert), and the standard algorithm breaks down. This is common in fields like genomics or image processing, where you might have thousands of features but only dozens of samples. Workarounds exist, such as applying PCA first to reduce dimensions before running LDA, or using regularized variants of LDA that stabilize the math.
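The PCA-then-LDA workaround can be sketched in a few lines of numpy. The setting below is invented (20 samples, 100 features) precisely so that the raw within-class scatter would be singular:

```python
import numpy as np

# Small-sample setting: 20 samples, 100 features. The 100x100 within-class
# scatter would have rank at most 18, so it cannot be inverted directly.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (10, 100)), rng.normal(1, 1, (10, 100))])
y = np.array([0] * 10 + [1] * 10)

# Step 1: PCA via SVD, keeping fewer components than samples
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_red = Xc @ Vt[:10].T                   # 20 samples x 10 PCA features

# Step 2: LDA in the reduced space, where the scatter is invertible
S_w = sum(np.cov(X_red[y == c], rowvar=False) for c in (0, 1))
w = np.linalg.inv(S_w) @ (X_red[y == 1].mean(axis=0) - X_red[y == 0].mean(axis=0))
scores = X_red @ w                       # one discriminant score per sample
```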

Because LDA draws only linear (straight) boundaries, it struggles with data where the true boundary between classes is curved or highly complex. Nonlinear classifiers like support vector machines with nonlinear kernels or neural networks can handle those cases better, though they come with their own costs in interpretability and data requirements. LDA also tends to be sensitive to outliers, since extreme values can distort the class means and scatter matrices that the entire method depends on.

Despite these constraints, LDA remains popular because it’s fast, interpretable, and often competitive with more complex methods when its assumptions are reasonably met. The discriminant axes it produces have clear, understandable meanings: each one is a weighted combination of the original features, so you can inspect the weights to understand which features matter most for separating the groups.