UMAP vs t-SNE: Choosing the Right Dimensionality Reduction

High-dimensional data, common in fields like bioinformatics, genomics, and social network analysis, often contains too many variables to be directly visualized. Dimensionality reduction is the technique used to address this by projecting complex, high-dimensional data onto a lower-dimensional space, typically a 2D plot. This allows researchers to identify underlying structures such as clusters or continuous gradients within the data. These visualizations serve as a powerful first step in data analysis, providing an intuitive map of the data’s inherent organization.

t-SNE: Focus on Local Structure

t-distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear technique designed primarily to preserve the local structure of the high-dimensional data. The core mechanism of t-SNE converts the Euclidean distances between data points in the original space into conditional probabilities. Specifically, it models the probability that one data point would select another as its neighbor, assuming a Gaussian distribution centered on the first point.

It then seeks to recreate a similar probability distribution in the lower-dimensional space, but it uses a Student’s t-distribution (with one degree of freedom) instead of a Gaussian. The heavy tails of the t-distribution allow dissimilar points to be modeled farther apart in the low-dimensional map, which helps to mitigate a problem known as the “crowding problem”. The algorithm minimizes the difference between these two probability distributions using a measure called Kullback-Leibler divergence through gradient descent.
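The two distributions and the objective described above can be sketched in a few lines of numpy. This is a deliberately simplified illustration: real t-SNE tunes a per-point Gaussian bandwidth to match the chosen perplexity and row-normalizes the conditional probabilities, whereas here a fixed bandwidth is used so the mechanics stay visible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))   # 5 toy points in 10-D
Y = rng.normal(size=(5, 2))    # a candidate 2-D embedding

def squared_dists(A):
    # Pairwise squared Euclidean distances
    sq = np.sum(A**2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

# High-dimensional affinities: Gaussian kernel, symmetrized and normalized
# (real t-SNE calibrates a per-point bandwidth to the target perplexity)
P = np.exp(-squared_dists(X))
np.fill_diagonal(P, 0.0)
P = (P + P.T) / (2 * P.sum())

# Low-dimensional affinities: Student's t with one degree of freedom,
# whose heavy tails let dissimilar points sit far apart
Q = 1.0 / (1.0 + squared_dists(Y))
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()

# Kullback-Leibler divergence KL(P || Q), the quantity that
# gradient descent drives down by moving the points in Y
mask = P > 0
kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```

In the full algorithm, the gradient of this KL divergence with respect to the low-dimensional coordinates is what moves the embedded points at each iteration.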

This process results in tight, distinct clusters in the final plot, making t-SNE excellent for visually separating groups of similar data points. A key configurable element is the perplexity parameter, which can be loosely interpreted as a smooth estimate of the effective number of nearest neighbors each point has. Perplexity governs the balance between local and global structure preservation, and typical values range from 5 to 50.
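In practice, running t-SNE and setting the perplexity is a one-liner with scikit-learn's `TSNE` estimator. The blob data below is synthetic, chosen only so the example is self-contained; it is worth re-running with several perplexity values, since the resulting layout can change noticeably.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Two well-separated Gaussian blobs in 50 dimensions
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 50)),
    rng.normal(loc=10.0, scale=1.0, size=(100, 50)),
])

# Perplexity must be smaller than the number of samples;
# values between 5 and 50 are typical starting points.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding = tsne.fit_transform(X)
print(embedding.shape)  # (200, 2)
```

Each row of `embedding` is the 2-D coordinate of the corresponding input point, ready to be passed to a scatter plot.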

UMAP: Focus on Global Manifold Approximation

Uniform Manifold Approximation and Projection (UMAP) is a newer non-linear dimensionality reduction technique built upon a strong theoretical framework. UMAP operates under the assumption that the high-dimensional data points lie on an underlying, smooth, curved surface called a Riemannian manifold. The goal is to construct a fuzzy topological representation of the data and then optimize a low-dimensional embedding that preserves this structure.

UMAP begins by building a weighted graph based on the local relationships, specifically the k-nearest neighbors, which approximates the manifold. The algorithm then optimizes the layout of this graph in the lower-dimensional space to be as structurally similar as possible to the original, which helps maintain both local and global data structures. Two main parameters control the outcome: `n_neighbors` and `min_dist`.
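The first stage described above, building a weighted k-nearest-neighbor graph, can be approximated with scikit-learn. The weighting scheme below is a loose sketch: UMAP's actual fuzzy-set weights calibrate a per-point scale so that each point has roughly $\log_2 k$ effective neighbors, while here a crude mean-distance scale stands in for that calibration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))  # toy high-dimensional data

k = 15  # plays the role of UMAP's n_neighbors
nn = NearestNeighbors(n_neighbors=k).fit(X)
dists, idx = nn.kneighbors(X)  # row i: distances/indices of i's k neighbors

# Simplified edge weights: exponential decay measured from each point's
# nearest true neighbor (column 0 is the point itself at distance 0)
rho = dists[:, 1]                   # distance to nearest non-self neighbor
sigma = dists.mean(axis=1) + 1e-12  # crude per-point scale
weights = np.exp(-np.maximum(dists - rho[:, None], 0.0) / sigma[:, None])
```

The resulting `weights` matrix assigns every point a fuzzy membership strength in (0, 1] for each of its k neighbors; UMAP then symmetrizes this graph and optimizes the low-dimensional layout against it.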

The `n_neighbors` parameter controls the size of the local neighborhood considered, directly influencing the balance between local and global structure preservation; higher values push UMAP toward a more global view. Conversely, `min_dist` controls how tightly the points can be packed in the low-dimensional space; a smaller value allows for denser clustering. The theoretical grounding and parameter design allow UMAP to often provide a more reliable representation of the overall arrangement of clusters compared to t-SNE.

Computational Performance and Scalability

t-SNE’s original implementation involves calculating pairwise similarities between all data points, leading to a computational complexity that scales quadratically with the number of data points, or $O(N^2)$. This quadratic scaling makes t-SNE computationally expensive and slow for datasets exceeding tens of thousands of points, though optimized variants such as Barnes-Hut t-SNE (approximately $O(N \log N)$) and FIt-SNE have improved this.
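A quick back-of-the-envelope calculation shows why quadratic scaling bites so early. For 70,000 points (the size of MNIST), the all-pairs affinity computation already involves billions of comparisons, and a dense float64 similarity matrix would not fit in typical workstation memory:

```python
# Cost of t-SNE's naive all-pairs affinity computation
n = 70_000                   # e.g. the MNIST dataset
pairs = n * (n - 1) // 2     # unique point pairs to compare
bytes_dense = n * n * 8      # dense float64 N x N affinity matrix
print(pairs)                 # 2_449_965_000 pairs
print(bytes_dense / 1e9)     # 39.2 GB
```

This is precisely the cost that Barnes-Hut approximations and UMAP's sparse k-nearest-neighbor graph avoid paying.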

UMAP, by contrast, demonstrates significantly better scalability, often approaching linear time complexity, or $O(N)$, for large datasets. This superior performance is a direct result of its graph-based approach, which relies on approximating the k-nearest neighbors structure rather than calculating all-pairs distances. For a dataset like the 70,000-point MNIST collection, UMAP can complete the projection in a few minutes, whereas the standard t-SNE implementation may take nearly an hour.

UMAP’s efficiency allows it to handle data with a high number of dimensions much more effectively than t-SNE. As the number of ambient dimensions increases into the tens of thousands, UMAP’s computational cost remains relatively stable, while the cost for t-SNE increases dramatically. This difference makes UMAP the more practical choice for modern big data applications where datasets frequently contain millions of data points, ensuring a quicker turnaround for exploratory analysis.

Interpreting the Visualizations: Preserving Local vs. Global Structure

t-SNE is highly effective at clustering similar data points together, creating clear, visually distinct groups. However, the distances between these separate clusters in a t-SNE plot are not meaningful. If two clusters appear far apart, it does not reliably indicate they are more dissimilar than two clusters that appear closer together.

The size and density of the clusters in a t-SNE plot are not informative of the true size or density in the original high-dimensional space. t-SNE tends to expand dense clusters and contract sparse ones in an attempt to equalize cluster sizes, which means that the visual size of a cluster should be ignored. The primary information that can be trusted is the proximity of points within a cluster, which indicates strong similarity.

UMAP, due to its foundation in manifold theory, attempts to preserve both the local and a greater degree of the global structure. This means that the relative positioning of clusters in a UMAP plot is often more trustworthy than in a t-SNE plot, providing insight into the overall topology of the data. For instance, if cluster A is positioned between clusters B and C in a UMAP plot, it suggests that cluster A is topologically intermediate to the other two in the original space. While the absolute distances in a UMAP plot are still not perfectly preserved, the inter-cluster relations and the overall shape of the data cloud are generally considered more representative of the original structure.

Choosing the Right Technique for Your Data

When the primary objective is exploratory clustering, where the analyst seeks to identify tight, distinct groups of similar data points within a moderately sized dataset, t-SNE is a suitable option. It excels at maximizing the visual separation of these local groupings, providing immediate insight into the internal structure of the data.

UMAP is the preferred technique when the overall arrangement of the data, or the global topology, is of greater interest than just the local clusters. If understanding how clusters relate to one another is necessary, UMAP’s better preservation of the relative arrangement of clusters makes it the superior choice. Moreover, UMAP is the default selection for extremely large datasets or those with very high dimensionality, due to its superior speed and near-linear scaling.