Hierarchical clustering is a method of grouping data points based on how similar they are to each other, building a tree-like structure that shows relationships at every level of detail. Unlike methods that require you to specify how many groups you want upfront, hierarchical clustering reveals the full spectrum of groupings, from individual data points all the way up to one giant cluster, and lets you decide where to draw the line afterward. It’s widely used in biology, marketing, and data science whenever understanding the nested structure of your data matters as much as the final groupings themselves.
How the Algorithm Works
Hierarchical clustering can be built in either of two directions. The far more common version, called agglomerative clustering, works from the bottom up. It starts by treating every single data point as its own cluster. Then, step by step, it merges the two closest clusters together until everything belongs to one group. Each merge is recorded, building a complete history of how the data was combined.
The top-down approach, called divisive clustering, does the opposite. It starts with all data points in a single cluster and repeatedly splits the least coherent group into two subclusters. Each split involves choosing which cluster to divide, finding the best way to break it in two, and evaluating whether that split makes sense. Divisive methods can sometimes outperform the standard bottom-up approach, but they’re used far less often because the splitting decisions are more computationally expensive.
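The bottom-up merge history is easy to inspect directly. A minimal sketch using SciPy (the `linkage` function performs agglomerative clustering and returns one row per merge):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Six 2-D points forming two visually separate groups
points = np.array([[0, 0], [0, 1], [1, 0],
                   [5, 5], [5, 6], [6, 5]])

# Agglomerative (bottom-up) clustering. Each row of Z records one merge:
# [cluster_i, cluster_j, merge_distance, size_of_new_cluster]
Z = linkage(points, method="single")

for i, (a, b, dist, size) in enumerate(Z):
    print(f"step {i}: merge {int(a)} and {int(b)} "
          f"at distance {dist:.2f} -> cluster of {int(size)} points")
```

The early merges happen at small distances (points within each group), and the final merge, which joins the two groups, happens at a much larger distance.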
The Dendrogram: Reading the Tree
The signature output of hierarchical clustering is a diagram called a dendrogram. It looks like an upside-down tree. Each leaf at the bottom represents a single data point, and each horizontal line connecting branches represents a merge (or split) between clusters. The y-axis shows the distance or dissimilarity between the clusters being joined. The higher up a merge happens, the more different the two groups were when they were combined.
This is what makes hierarchical clustering powerful: the dendrogram preserves all the information. You can see which data points are most closely related (they merge low on the tree) and which groups are fundamentally different from each other (they only merge near the top). Long vertical lines between merges indicate natural gaps in the data, places where clusters are clearly separated before being forced together.
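The tree structure can also be read programmatically rather than visually. A sketch with SciPy, where `no_plot=True` makes `dendrogram` return the layout as a dictionary instead of drawing it:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Two tight pairs of points, far from each other
points = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
Z = linkage(points, method="average")

# no_plot=True skips drawing and returns the tree layout as a dict
tree = dendrogram(Z, no_plot=True)

# 'ivl' lists the leaves left to right; 'dcoord' holds the merge heights.
heights = sorted(max(d) for d in tree["dcoord"])
print("leaf order:", tree["ivl"])
print("merge heights:", heights)
```

The two low merges (the tight pairs) sit near the bottom of the tree, while the single high merge, joining the two fundamentally different groups, sits far above them.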
Choosing How to Measure Distance
Before the algorithm can decide which clusters to merge, it needs a way to measure how far apart individual data points are. The most common choice is Euclidean distance, the straight-line distance between two points. This works well for continuous numerical data like height, weight, or income.
Manhattan distance sums up the absolute differences along each dimension instead; because no single dimension's difference gets squared, it is less sensitive to outliers. For categorical data (like survey responses or product categories), Hamming distance counts the number of attributes that differ between two items. Mahalanobis distance accounts for correlations among your variables, which matters when features in your dataset tend to move together.
The choice of distance metric shapes the final clusters significantly. Two analyses of the same dataset can produce different groupings simply because they measured “closeness” differently.
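The difference is concrete even for a single pair of points. A quick sketch with SciPy's `pdist`, which accepts a `metric` argument:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 0.0]])

# Euclidean: straight-line distance; Manhattan ('cityblock'): sum of
# absolute coordinate differences along each axis
euclid = squareform(pdist(points, metric="euclidean"))
manhat = squareform(pdist(points, metric="cityblock"))

# The same pair of points is 5.0 apart in Euclidean terms
# but 7.0 apart in Manhattan terms
print(euclid[0, 1], manhat[0, 1])
```

A clustering built on one matrix can merge in a different order than one built on the other, which is exactly how two analyses of the same data end up with different groupings.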
Linkage Methods: Defining Cluster Distance
Once you know how far apart individual points are, you still need a rule for measuring the distance between two clusters that contain multiple points. This is called the linkage method, and it’s one of the most important decisions in hierarchical clustering. Benchmark comparisons suggest that choosing the wrong linkage for your data can reduce accuracy by roughly 16% on average.
The four main options each have distinct behavior:
- Single linkage defines cluster distance as the minimum distance between any pair of points, one from each cluster. This tends to create long, chain-like clusters and is sensitive to noise.
- Complete linkage uses the maximum distance between any pair of points across clusters. It produces compact, roughly equal-sized clusters but can be thrown off by outliers.
- Average linkage takes the mean of all pairwise distances between points in the two clusters. The naming of its variants is counterintuitive: the “unweighted” version (UPGMA) averages over every point pair, so larger subclusters carry proportionally more weight, while the “weighted” version (WPGMA) averages the two subclusters’ contributions equally regardless of how many points each contains.
- Ward’s method merges whichever pair of clusters produces the smallest increase in total variance. At each step, it picks the merge that keeps clusters as internally tight as possible. This tends to produce well-balanced, similarly sized groups and is often the default choice for general-purpose analysis.
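The behavioral differences show up even on the same data. A sketch comparing the height of the final merge (the one that joins the last two clusters) under each linkage rule, using SciPy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
# Two well-separated blobs of 2-D points
a = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))
b = rng.normal(loc=[10, 0], scale=0.5, size=(20, 2))
data = np.vstack([a, b])

# The last row of the linkage matrix is the merge that joins the two blobs;
# its recorded height depends on the linkage rule
heights = {}
for method in ("single", "complete", "average", "ward"):
    heights[method] = linkage(data, method=method)[-1, 2]
    print(f"{method:>8}: final merge at distance {heights[method]:.2f}")
```

Single linkage reports the smallest gap between the blobs, complete linkage the largest span across them, and average linkage falls in between, which is exactly the chaining-versus-compactness trade-off described above.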
Deciding How Many Clusters to Keep
A dendrogram shows every possible number of clusters, from one to the total number of data points. To get a usable grouping, you need to “cut” the tree at some height. The simplest approach is drawing a horizontal line across the dendrogram. Every branch that line crosses becomes a separate cluster. If you want three clusters, you cut at a height that intersects exactly three vertical lines.
Choosing where to cut is partly judgment and partly technique. One common strategy is to look for the largest gap between successive merge heights in the dendrogram, the equivalent of finding a natural “jump” where clusters that were comfortably separate get forced together. More sophisticated approaches use optimization: given a target number of clusters, dynamic programming can find the best pruning of the tree that minimizes a clustering objective like variance, running efficiently even on large trees.
Computational Cost and Scale
Hierarchical clustering requires storing the full distance matrix between every pair of data points, which takes O(n²) memory. The standard algorithm runs in O(n³) time because at each of its n merge steps, it searches and updates that entire matrix. With efficient data structures, some implementations can bring this down to O(n² log n), and specialized algorithms for single and complete linkage (SLINK and CLINK) achieve O(n²) time.
In practical terms, this means hierarchical clustering works well for datasets of a few thousand points but becomes slow or memory-prohibitive with tens of thousands. K-means and similar partition-based methods scale much better to large datasets. Specialized implementations have been developed to push hierarchical clustering to larger scales, particularly in fields like single-cell biology where the hierarchical structure is scientifically meaningful and worth the computational cost.
Where Hierarchical Clustering Is Used
Gene expression analysis is one of the most established applications. Researchers cluster genes or cells based on their activity patterns, and the dendrogram reveals biological hierarchies: cell subtypes nested within cell types, for instance. Single-cell RNA sequencing generates massive datasets where understanding this layered heterogeneity is the whole point, making hierarchical clustering a natural fit despite its computational demands.
In marketing, hierarchical clustering groups customers by purchasing behavior, demographics, or engagement patterns. The tree structure is especially useful here because it shows both broad segments (budget vs. premium shoppers) and finer subdivisions within them (budget shoppers who buy frequently vs. those who buy in bulk). Document clustering and search engine result grouping also benefit from the nested structure, since topics naturally contain subtopics.
Strengths and Limitations
The biggest advantage of hierarchical clustering is that you don’t need to choose the number of clusters before running the algorithm. You get the full picture and decide afterward. The dendrogram also provides interpretability that flat clustering methods can’t match: you can see exactly why two data points ended up in the same group and at what level of similarity they were joined.
The main limitation is performance. On larger datasets, hierarchical clustering tends to produce lower accuracy than methods like k-means, and its results are more sensitive to the number of features in your data. With high-dimensional data (many features per data point), accuracy can drop noticeably. The algorithm is also irreversible: once two clusters are merged in the agglomerative approach, that decision can never be undone, even if later evidence suggests they should have stayed separate.
That said, parameter tuning makes a big difference. Studies have shown that adjusting the linkage method alone can improve hierarchical clustering accuracy by over 30% on some datasets, bringing it in line with other methods. The default settings in most software packages aren’t always the best choice for your particular data, so experimenting with different linkage methods and distance metrics is worth the effort.
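One lightweight way to compare configurations without labeled data is the cophenetic correlation, which measures how faithfully a dendrogram's merge heights preserve the original pairwise distances (closer to 1 is better). A sketch using SciPy's `cophenet`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
# Three well-separated blobs of 2-D points
data = np.vstack([rng.normal(loc=c, scale=0.7, size=(30, 2))
                  for c in ([0, 0], [6, 0], [3, 5])])

dists = pdist(data)
results = {}
for method in ("single", "complete", "average", "ward"):
    # Cophenetic correlation: agreement between the dendrogram's merge
    # heights and the original pairwise distances
    corr, _ = cophenet(linkage(dists, method=method), dists)
    results[method] = corr
    print(f"{method:>8}: {corr:.3f}")
```

Running a loop like this across linkage methods (and, separately, across distance metrics) is a cheap way to check whether your software's default configuration is actually a good fit for your data.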