Agglomerative clustering is an unsupervised machine learning algorithm that groups data points by starting with each point as its own cluster, then repeatedly merging the two closest clusters until everything belongs to a single group. It’s called a “bottom-up” approach because it builds larger clusters from smaller ones, step by step. The result is a tree-like diagram called a dendrogram that lets you choose how many groups best fit your data after the algorithm has already run.
How the Algorithm Works
The process starts by calculating the distance between every pair of data points. This produces a distance matrix: a grid where each cell holds the distance between two points. The matrix is symmetric (the distance from A to B is the same as B to A) and has zeroes along the diagonal (every point is zero distance from itself).
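The distance matrix and its two properties can be sketched in a few lines with SciPy. The points here are made up for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Five illustrative 2-D points (values chosen arbitrarily for this sketch)
points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0], [0.5, 0.5]])

# pdist returns the condensed upper triangle of pairwise distances;
# squareform expands it into the full symmetric matrix.
dist = squareform(pdist(points, metric="euclidean"))

print(np.allclose(dist, dist.T))        # True: symmetric (A-to-B equals B-to-A)
print(np.allclose(np.diag(dist), 0.0))  # True: zero distance from each point to itself
```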
From there, the algorithm finds the two points with the smallest distance and merges them into a single cluster. It then recalculates the distance matrix, replacing the two merged entries with one new entry that represents the combined cluster. The pair with the smallest distance in this updated matrix gets merged next. This continues, one merge at a time, until all points belong to one big cluster. For a dataset with seven items, that’s six merge steps.
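SciPy’s `linkage` function runs this merge loop and records every step. For seven points it produces exactly six merge rows, matching the count above (the data here is random, purely for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.random((7, 2))  # seven items, as in the example above

# Each row of Z records one merge: the indices of the two clusters
# joined, the distance at which they merged, and the new cluster's size.
Z = linkage(X, method="single")
print(Z.shape[0])  # 6 merge steps for 7 items
```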
What makes agglomerative clustering appealing is that you don’t need to decide the number of clusters before running it. The algorithm produces the full hierarchy of merges, and you pick your preferred number of groups afterward by “cutting” the dendrogram at a certain height. This is a significant advantage over algorithms like k-means, which require you to specify the number of clusters upfront.
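Cutting the hierarchy after the fact can be done with SciPy’s `fcluster`, either by asking for a cluster count or by specifying a height. A minimal sketch on two synthetic, well-separated blobs:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Two tight, well-separated blobs of synthetic data
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])

Z = linkage(X, method="ward")

# Cut the finished hierarchy: ask for exactly 2 clusters...
labels_k = fcluster(Z, t=2, criterion="maxclust")
# ...or cut at a height, keeping every merge below that distance.
labels_h = fcluster(Z, t=1.0, criterion="distance")

print(len(set(labels_k)))  # 2
```

Either way, the hierarchy is computed once; only the cut changes.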
Linkage Methods: Deciding What “Closest” Means
When two individual points merge into a cluster, measuring distance is straightforward. But once you have clusters containing multiple points, you need a rule for measuring the distance between one cluster and another. This rule is called the linkage criterion, and it has a major effect on the shape of your final clusters.
- Single linkage uses the minimum distance between any point in one cluster and any point in the other. This tends to produce long, chain-like clusters and works well when clusters have irregular shapes, but it is sensitive to noise: a chain of intermediate points can gradually link two genuinely distinct clusters.
- Complete linkage uses the maximum distance between any two points across the clusters. This favors compact, similarly sized groups and is less prone to the “chaining” effect where clusters stretch out in thin strings.
- Average linkage takes the average of all pairwise distances between points in the two clusters. It’s a middle ground between single and complete, producing moderately compact clusters.
- Ward’s method merges the pair of clusters that results in the smallest increase in total variance. It tends to create roughly equal-sized, spherical clusters and is the default in many software implementations.
Choosing a linkage method depends on what your data looks like and what kind of clusters you expect. Ward’s method is a solid default when you expect compact, similar-sized clusters, while single linkage handles oddly shaped groupings better.
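Since the linkage criterion is just a parameter, it’s easy to run the same data through all four rules and compare. On the cleanly separated synthetic blobs below they agree; on irregular shapes they can differ substantially:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
# Two well-separated synthetic blobs (illustrative data)
X = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(3, 0.2, (15, 2))])

# Same data, four linkage rules
for link in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    print(link, len(set(labels)))  # each finds 2 clusters here
```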
Distance Metrics
Separate from linkage is the question of how you measure the distance between two individual points. The most common options are Euclidean distance (the straight-line distance between two points, the kind you’d measure with a ruler) and Manhattan distance (the sum of absolute differences along each dimension, like navigating a city grid). Cosine distance measures the angle between two vectors rather than the gap between their endpoints, which is useful when the magnitude of your data matters less than its direction, as in text analysis.
One constraint worth knowing: Ward’s linkage only works with Euclidean distance. The other linkage methods are compatible with any distance metric.
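The three metrics behave quite differently on the same pair of points, which a quick SciPy sketch makes concrete:

```python
import numpy as np
from scipy.spatial import distance

a, b = np.array([3.0, 4.0]), np.array([0.0, 0.0])

print(distance.euclidean(a, b))  # 5.0 -- straight-line (ruler) distance
print(distance.cityblock(a, b))  # 7.0 -- Manhattan (grid) distance

# Cosine distance ignores magnitude: vectors pointing the same
# direction are at distance 0 regardless of their lengths.
u, v = np.array([1.0, 0.0]), np.array([10.0, 0.0])
print(distance.cosine(u, v))     # 0.0 -- same direction
```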
Reading a Dendrogram
The dendrogram is the primary output of agglomerative clustering and the tool you use to interpret results. It’s a branching diagram where each leaf at the bottom represents one data point. As you move upward, branches merge at specific heights, and each merge point represents a step in the algorithm.
The vertical height where two branches join tells you how different those groups were when they merged. Short vertical lines mean the merged clusters were very similar. Long vertical lines mean there was a big jump in distance, suggesting those groups are genuinely distinct. When you see a large gap between merge heights, that’s often a natural place to “cut” the tree and define your clusters.
Reading from the bottom up tells you which individual data points are most similar to each other. Reading from the top down reveals the large-scale structure of your data, showing the major groupings first. One useful detail: the left-to-right order of branches doesn’t matter. Think of the dendrogram like a mobile hanging from the ceiling. The arms can swing freely, so horizontal position is meaningless. Only the vertical height and the branching structure carry information.
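The merge heights live in the third column of SciPy’s linkage matrix, so the “large gap” heuristic for picking a cut can be checked numerically. A sketch on synthetic two-blob data, where the biggest jump should come just before the final merge:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(7)
# Two tight blobs of synthetic data, far apart
X = np.vstack([rng.normal(0, 0.1, (8, 2)), rng.normal(4, 0.1, (8, 2))])

Z = linkage(X, method="average")

# Column 2 of the linkage matrix holds the merge heights, in merge order.
heights = Z[:, 2]
gaps = np.diff(heights)

# The largest jump in height marks a natural place to cut the tree.
# Here it sits just before the last merge, which joins the two blobs.
print(int(np.argmax(gaps)) == len(heights) - 2)  # True
```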
Scalability and Performance
Agglomerative clustering has a reputation for being slow on large datasets, and the math backs that up. A basic implementation runs in O(N³) time, meaning if you double your data, processing takes roughly eight times longer. Memory usage scales as O(N²) because the algorithm stores the full distance matrix.
Faster versions exist. Using a heap data structure can bring runtime down to O(N² log N), and an approach called the NN-chain algorithm reaches O(N²), though it still requires O(N²) memory. Some speedups only work with Euclidean distance, so your choice of distance metric can limit your options.
In practical terms, agglomerative clustering handles datasets of a few thousand to tens of thousands of points comfortably on modern hardware. Once you’re working with hundreds of thousands of rows, you’ll likely need to switch to a faster algorithm or use a sampling strategy.
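The O(N²) memory cost is easy to estimate back-of-envelope: a full float64 distance matrix needs N² × 8 bytes, which reaches roughly 75 GiB at 100,000 points.

```python
# Back-of-envelope memory for the full N x N distance matrix
# at 8 bytes (float64) per entry.
for n in [1_000, 10_000, 100_000]:
    gib = n * n * 8 / 2**30
    print(f"{n:>7} points -> {gib:,.1f} GiB")
```

A condensed (upper-triangle) representation halves this, but the quadratic growth is the same.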
Where It’s Used
Gene expression analysis is one of the classic applications. Researchers use agglomerative clustering to group genes with similar activity patterns across different conditions, helping identify biologically meaningful subsets. For example, microarray data from Barrett’s esophagus patients, covering over 7,300 genes across 10 conditions, can be clustered to differentiate expression profiles across tissue types. Ovarian cancer research has used similar approaches on datasets spanning nearly 3,000 patients across 23 studies to identify molecular subtypes.
Beyond genomics, agglomerative clustering appears in customer segmentation, document grouping, image analysis, and any setting where you want to explore the natural structure of your data without committing to a specific number of groups in advance. Its hierarchical output makes it particularly valuable in exploratory analysis, where you’re still figuring out what patterns exist.
Implementation in Python
The most common implementation is scikit-learn’s AgglomerativeClustering class. The key parameters mirror the concepts above: n_clusters sets how many groups you want (defaulting to 2), linkage controls the merge strategy (defaulting to Ward’s method), and metric specifies the distance measure (defaulting to Euclidean). You can also set n_clusters to None and use a distance_threshold instead, which cuts the dendrogram at a specific height rather than a specific number of groups.
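A minimal usage sketch of both modes, on synthetic data; euclidean is the default metric, so only the parameters being varied are spelled out:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(3)
# Two well-separated synthetic blobs
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

# Mode 1: fix the number of clusters up front.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(len(set(labels)))  # 2

# Mode 2: cut the hierarchy at a height instead of a cluster count.
model_h = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0)
labels_h = model_h.fit_predict(X)
print(model_h.n_clusters_)  # number of clusters the cut produced
```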
For generating the full dendrogram visualization, scikit-learn pairs with SciPy’s dendrogram function, which renders the tree diagram and lets you visually inspect where to make the cut.
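A minimal sketch of that workflow, building the linkage matrix with SciPy directly; passing `no_plot=True` returns the tree layout as a dict, while the same call without it draws the figure via matplotlib:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(5)
X = rng.random((7, 2))  # seven illustrative points

Z = linkage(X, method="ward")

# no_plot=True computes the tree layout without drawing anything.
tree = dendrogram(Z, no_plot=True)
print(len(tree["ivl"]))  # 7 leaves, one per data point
```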

