What Is t-SNE? Visualizing High-Dimensional Data

t-SNE (pronounced “tee-snee”) stands for t-distributed stochastic neighbor embedding. It’s an algorithm that takes data with hundreds or thousands of dimensions and compresses it down to two dimensions so you can see it as a flat scatter plot. Developed by Laurens van der Maaten and Geoffrey Hinton in 2008, t-SNE has become one of the most widely used tools for visualizing complex datasets, especially in fields like genomics and machine learning where data routinely has thousands of variables.

The Problem t-SNE Solves

Imagine you have data about 10,000 cells, and for each cell you’ve measured the activity of 3,000 different genes. That means each cell is described by 3,000 numbers. You can’t plot 3,000 dimensions on a screen, so you need a way to flatten all that information into just two dimensions (an x-axis and a y-axis) while keeping the important patterns intact. That’s what t-SNE does: it finds a 2D arrangement of your data points where things that were similar in the original high-dimensional space end up near each other, and things that were different end up far apart.
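In code, that flattening step is a single call. Here is a minimal sketch using scikit-learn's TSNE on a small random stand-in dataset (200 "cells" with 50 "genes" each, rather than the 10,000 × 3,000 of the example, so it runs in seconds):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy stand-in for the example in the text: 200 "cells", each described
# by 50 "gene" measurements (the real example would be 10,000 x 3,000).
X = rng.normal(size=(200, 50))

# Ask t-SNE for a 2D layout that tries to preserve local neighborhoods.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)  # one (x, y) pair per cell -> (200, 2)
```

The result is an array of 2D coordinates you can hand directly to any scatter-plot function.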

How the Algorithm Works

t-SNE operates in two stages. First, it looks at your original high-dimensional data and calculates how similar every pair of points is. It does this by centering a bell curve on each data point and asking: if this point had to pick a neighbor at random, how likely is it to pick that other point? Nearby points get high probability scores; distant points get scores close to zero. The result is a complete map of “who is close to whom” in your original data.
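The first stage can be sketched in a few lines of NumPy. One simplification here: real t-SNE tunes the width of each point's bell curve individually to hit a target perplexity, while this sketch uses one shared width for all points.

```python
import numpy as np

def gaussian_affinities(X, sigma=1.0):
    """P[i, j]: probability that point i picks point j as its neighbor,
    using a bell curve (Gaussian) centered on point i. Real t-SNE tunes
    sigma per point to match a target perplexity; a single shared sigma
    keeps this sketch simple."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                 # a point never picks itself
    return P / P.sum(axis=1, keepdims=True)  # each row sums to 1

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = gaussian_affinities(points)
# The two nearby points overwhelmingly pick each other; the distant
# third point gets a probability near zero:
print(P[0])
```

Each row of `P` is one point's "who is close to whom" distribution from the description above.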

Second, t-SNE randomly scatters all the points on a 2D canvas and then iteratively shuffles them around. At each step, it computes a similar set of similarity scores for the current 2D arrangement and compares them to the original high-dimensional scores. The algorithm measures the mismatch between these two sets of scores using a metric called Kullback-Leibler divergence, which is essentially a number that captures how different the two similarity maps are. t-SNE nudges the points to reduce that mismatch, repeating the process until the 2D layout faithfully reflects the neighborhood relationships from the original data.
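The mismatch measure itself is simple to write down. This sketch treats P and Q as normalized similarity matrices (real t-SNE symmetrizes them into joint probabilities first, but the formula is the same):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """Kullback-Leibler divergence: sum over all pairs of
    P_ij * log(P_ij / Q_ij). It is zero when the two similarity maps
    agree exactly and grows as they diverge; eps guards against log(0)."""
    P = np.clip(P, eps, None)
    Q = np.clip(Q, eps, None)
    return float(np.sum(P * np.log(P / Q)))

P = np.array([[0.7, 0.2, 0.1]])   # similarities in the original space
Q = np.array([[0.1, 0.2, 0.7]])   # similarities in a candidate 2D layout
print(kl_divergence(P, P))        # perfect match -> 0
print(kl_divergence(P, Q))        # mismatch -> positive cost to reduce
```

t-SNE's optimization loop repeatedly nudges the 2D coordinates downhill on this quantity via gradient descent.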

Why It Uses the t-Distribution

The “t” in t-SNE refers to the Student’s t-distribution, a statistical distribution with heavier tails than a normal bell curve. t-SNE’s predecessor, called SNE (developed by Hinton and Sam Roweis in 2002), used a standard bell curve in both the high-dimensional and low-dimensional spaces. This caused a “crowding problem”: when you squeeze hundreds of dimensions into two, moderately distant points have nowhere to go and pile up on top of each other, creating a crowded blob in the center. By using the t-distribution in the low-dimensional space, t-SNE gives distant points more room to spread out, producing cleaner separation between groups.
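You can see the "heavier tails" directly by comparing the two similarity kernels. The Gaussian collapses to essentially zero at moderate distances, while the t-distribution (with one degree of freedom, as t-SNE uses in 2D) keeps a meaningful value:

```python
import numpy as np

# Similarity kernels used in the low-dimensional space:
gaussian = lambda d: np.exp(-d ** 2)        # SNE's bell curve
student_t = lambda d: 1.0 / (1.0 + d ** 2)  # t-SNE's t-distribution (1 d.o.f.)

for d in (0.5, 2.0, 4.0):
    g, t = gaussian(d), student_t(d)
    print(f"distance {d}: gaussian {g:.2e}, t {t:.2e}")
# At distance 4 the Gaussian is ~1e-7 while the t kernel is still ~0.06:
# moderately distant points retain real similarity/repulsion signal, so
# they can spread out instead of crowding the center of the plot.
```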

How t-SNE Compares to PCA

PCA (principal component analysis) is the other common tool for reducing dimensions, and it works very differently. PCA is a linear method: it finds the directions along which your data varies the most and projects everything onto those axes. This preserves the overall global shape of the data well, but it can miss curved or complex patterns that don’t fall along straight lines.

t-SNE is nonlinear. It doesn’t care about preserving global distances or overall shape. Instead, it focuses on preserving local relationships, making sure that each point’s nearest neighbors stay close in the 2D plot. This makes t-SNE far better at revealing clusters and subgroups in messy, high-dimensional data, but it means the big-picture arrangement of clusters relative to each other can be unreliable. In practice, many researchers use PCA first to reduce the data to, say, 50 dimensions, and then run t-SNE on that reduced data for the final visualization.
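That two-step recipe looks like this in scikit-learn, on a hypothetical 300-sample, 100-feature dataset (the shapes are illustrative, not from any real study):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 100))  # hypothetical 300 samples, 100 features

# Step 1: PCA down to ~50 dimensions to strip noise and speed up t-SNE.
X_reduced = PCA(n_components=50, random_state=1).fit_transform(X)

# Step 2: t-SNE on the PCA output for the final 2D visualization.
embedding = TSNE(n_components=2, random_state=1).fit_transform(X_reduced)
print(X_reduced.shape, embedding.shape)  # (300, 50) (300, 2)
```

The PCA step preserves most of the variance cheaply; t-SNE then spends its effort untangling local neighborhoods rather than fighting noise dimensions.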

Where t-SNE Is Used

The most prominent use of t-SNE is in single-cell RNA sequencing, a technique that measures gene activity in individual cells. A typical experiment produces data with 3,000 or more dimensions (one per gene), and researchers use t-SNE to project all of those dimensions into a 2D plot where different cell types appear as distinct clusters. Tools like Seurat and Scanpy, the standard software packages for this kind of analysis, include t-SNE as a built-in visualization option.

Beyond biology, t-SNE is widely used in natural language processing to visualize word embeddings (where each word is represented by hundreds of numbers), in computer vision to explore how neural networks group images internally, and in cybersecurity to spot anomalous network traffic patterns. Essentially, any time someone has a large dataset with many variables and wants to see whether natural groupings exist, t-SNE is a go-to option.

The Key Setting: Perplexity

The most important parameter you’ll encounter when running t-SNE is perplexity. It controls how many neighbors each point considers when calculating similarities. A low perplexity (say, 5) means each point only pays attention to its very closest neighbors, producing tight, small clusters. A high perplexity (say, 50) makes each point consider a broader neighborhood, producing smoother, more spread-out groupings. The original authors recommend values between 5 and 50 for most datasets, and it’s common practice to try several values and compare the results.

Perplexity acts as a smooth measure of the effective number of neighbors each point has. If your dataset has dense, well-separated groups, lower perplexity tends to work well. If your data is more uniformly distributed or you have large clusters, higher perplexity often gives a clearer picture. Because t-SNE is sensitive to this setting, running it once with a single perplexity value and drawing conclusions is risky.
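A simple way to follow that advice is to sweep a few perplexity values in one loop and compare the resulting layouts side by side. A sketch on small random data:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 20))

# Run t-SNE at several perplexities; compare the plots before trusting
# any single one. (Perplexity must be smaller than the number of points.)
embeddings = {
    p: TSNE(n_components=2, perplexity=p, random_state=2).fit_transform(X)
    for p in (5, 30, 50)
}
for p, emb in embeddings.items():
    print(p, emb.shape)  # each run yields one (150, 2) layout
```

Structure that persists across several perplexity values is far more trustworthy than structure that appears at only one.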

What t-SNE Plots Can’t Tell You

t-SNE plots are powerful but easy to misread. The three most common mistakes are treating cluster size, cluster distance, and cluster existence as meaningful in ways t-SNE doesn’t guarantee.

  • Cluster size doesn’t reflect group size. A cluster that looks large on a t-SNE plot doesn’t necessarily contain more points or more variation than a smaller cluster. The visual size is an artifact of how the algorithm balances forces during optimization.
  • Distance between clusters is unreliable. Two clusters that appear far apart on a t-SNE plot aren’t necessarily more different from each other than two clusters that appear close together. t-SNE optimizes local neighborhoods, not global distances.
  • Clusters can appear from noise. Research has shown mathematically that t-SNE can produce plots with clearly separated clusters even when the original data has near-uniform distances between all points, meaning no real cluster structure exists at all. A 2025 paper on arXiv proved that the strength of clustering in the input data “cannot be reliably inferred from the low-dimensional visualization.” In one demonstration, a dataset of 100 points in 100 dimensions with essentially uniform spacing produced a t-SNE plot showing two visually distinct clusters.

This doesn’t make t-SNE useless. It means t-SNE is best treated as an exploratory tool for generating hypotheses about structure in your data, not as proof that structure exists. Any clusters you see should be validated with other methods.
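You can try the noise experiment yourself. This sketch feeds t-SNE pure uniform noise, mirroring the 100-points-in-100-dimensions demonstration described above; depending on the random seed and settings, the resulting plot may well show visually separated blobs even though the input has no cluster structure:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
# Pure noise: 100 points in 100 dimensions. High-dimensional uniform
# data like this has nearly uniform pairwise distances.
X = rng.uniform(size=(100, 100))

embedding = TSNE(n_components=2, perplexity=10, random_state=3).fit_transform(X)
# Scatter-plotting `embedding` may show apparently distinct "clusters",
# but any such grouping is an artifact of the optimization, not the data.
print(embedding.shape)
```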

t-SNE vs. UMAP

UMAP (uniform manifold approximation and projection), introduced in 2018, has become t-SNE’s main competitor. In single-cell biology, UMAP has largely overtaken t-SNE as the default because it runs faster on large datasets, better preserves the broader topology of the data (not just local neighborhoods), and scales more gracefully to millions of data points. Many analysis pipelines now offer both side by side. t-SNE still produces excellent local cluster separation and remains widely used, but if you’re choosing between the two for a new project, UMAP is increasingly the default recommendation for large or complex datasets.