How K-Means Clustering Works, Step by Step

K-means clustering is an algorithm that groups data points into a set number of clusters by repeatedly assigning each point to its nearest cluster center, then recalculating those centers until they stabilize. It’s one of the most widely used unsupervised machine learning methods, meaning it finds patterns in data without being told what to look for. The core idea is simple, but the details of how it initializes, iterates, and converges matter for getting useful results.

The Three-Phase Process

K-means works in three distinct phases: initialization, assignment, and update. The last two repeat in a loop until the algorithm finishes.

Initialization: You start by choosing a number, k, that represents how many clusters you want the algorithm to find. The algorithm then places k centroids (cluster centers) in your data space. These starting positions can be chosen randomly from the existing data points or through smarter sampling methods designed to spread them out. Where these initial centroids land matters a lot, because poor starting positions can lead to poor final results.

Assignment: Once centroids are placed, every data point in the dataset gets assigned to whichever centroid is closest to it. “Closest” is typically measured using straight-line distance (Euclidean distance), the same way you’d measure distance on a map. After this step, you have k groups, each containing all the points nearest to one centroid.

Update: Now the algorithm recalculates each centroid by taking the average position of all the points assigned to that cluster. This shifts each centroid to the true middle of its group. With the centroids in new positions, some data points are now closer to a different centroid than the one they were assigned to, so the assignment step runs again. This cycle repeats, with points reassigned and centroids recalculated, until the results stabilize.
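The whole cycle fits in a few lines. Here is a bare-bones sketch in Python with NumPy, written for readability rather than robustness (it picks random data points as starting centroids and skips the empty-cluster edge case):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means: initialize, then alternate assignment and update."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: label each point with the index of its nearest centroid
        # (straight-line / Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of the points assigned to it.
        # (Ignores the empty-cluster edge case for brevity.)
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels
```

Production implementations add smarter initialization and handle degenerate cases, but the assignment/update loop is exactly this.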

What the Algorithm Is Actually Optimizing

Behind the scenes, k-means is trying to minimize a single value: the sum of squared distances between each data point and its assigned centroid, totaled within each cluster and then summed across all clusters. This quantity is often called the within-cluster sum of squares, or inertia. Think of it as trying to make each cluster as tight and compact as possible.

Each iteration of assignment and update is guaranteed to either reduce this total or leave it unchanged. That’s why the algorithm converges: it’s always moving toward a better arrangement, never a worse one. It won’t necessarily find the globally best arrangement, but it will find a locally optimal one where no small adjustment improves the result.
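The value being minimized is easy to compute directly. A minimal sketch:

```python
import numpy as np

def inertia(points, centroids, labels):
    """Total within-cluster sum of squared distances:
    for each point, the squared distance to its assigned centroid, summed."""
    diffs = points - centroids[labels]  # vector from each point to its centroid
    return float((diffs ** 2).sum())
```

The assignment step can only lower this total (each point moves to a nearer centroid or stays put), and the update step can only lower it too (within a cluster, the mean is the position that minimizes the sum of squared distances to its members), which is why the loop converges.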

When the Algorithm Stops

K-means doesn’t run forever. It stops when one of several conditions is met. The most intuitive is when the centroids stop moving between iterations, meaning every data point stays in the same cluster from one round to the next. In practice, you can also set a threshold so the algorithm stops when improvements become negligibly small, or simply cap it at a fixed number of iterations. Most implementations use a combination: stop when centroids stabilize or when a maximum iteration count is reached, whichever comes first.
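In scikit-learn's implementation, for example, both stopping conditions are exposed as parameters. A sketch, assuming scikit-learn is installed and using a toy dataset for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# tol: stop once centroid movement between iterations drops below this threshold.
# max_iter: hard cap on iterations; whichever condition triggers first wins.
km = KMeans(n_clusters=3, tol=1e-4, max_iter=300, n_init=10, random_state=0).fit(X)
print(km.n_iter_)  # how many iterations the best run actually took
```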

Choosing the Right Number of Clusters

K-means requires you to specify k before running it, and the algorithm has no way to tell you the “right” number. Two common techniques help you decide.

The elbow method runs k-means multiple times with different values of k (say, 2 through 10) and plots the total within-cluster sum of squared distances for each. As k increases, this total drops, because more clusters mean tighter groupings. At some point, adding another cluster stops producing meaningful improvement, and the curve bends. That bend, the “elbow,” suggests the best tradeoff between cluster count and cluster quality.
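In code, the elbow method is just a loop over k. A sketch assuming scikit-learn is available, with a toy three-group dataset standing in for real data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset with three true groups (assumed for illustration).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # total within-cluster sum of squares
# Plot k against inertias (e.g. with matplotlib) and look for the bend:
# the curve drops steeply up to the true k, then flattens out.
```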

The silhouette method takes a different approach. It scores each data point on how similar it is to its own cluster compared to the nearest neighboring cluster, producing a value between negative one and one. Higher scores mean cleaner, more separated clusters. You run k-means at several values of k and pick the one with the highest average silhouette score. As a rough rule of thumb, average scores above about 0.5 indicate reasonable separation between groups, and scores above about 0.7 indicate strong structure.
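The silhouette comparison is also a short loop. A sketch assuming scikit-learn, reusing the same kind of toy dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy dataset with three true groups (assumed for illustration).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)  # the k with the cleanest separation
```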

Why Euclidean Distance Is Standard

K-means defaults to Euclidean distance, the straight-line distance between two points, because the update step (recalculating the centroid as the mean of its members) is mathematically paired with it. Other distance measures exist, including city-block distance (which sums differences along each dimension separately) and various correlation-based measures. But swapping in a different distance metric without changing the update rule can break the guarantee that each iteration improves the result. Some variants of k-means, like k-medoids, are designed to work with alternative distance measures, but standard k-means and Euclidean distance are a matched set.

Where K-Means Struggles

K-means has four well-known limitations that can catch you off guard if you don’t expect them.

  • Non-spherical clusters: Because it uses straight-line distance from a single center point, k-means naturally produces round, ball-shaped clusters. If your data forms elongated, crescent-shaped, or otherwise irregular groups, k-means will force them into spheres and misclassify points near the boundaries.
  • Sensitivity to outliers: Since centroids are calculated as averages, a few extreme data points can drag a centroid far from where most of the cluster’s data actually sits. It’s generally a good idea to detect and handle outliers before running k-means.
  • Random initialization problems: Different starting centroid positions can produce different final clusters. One run might find a clean, meaningful grouping while another converges on a poor arrangement. Running the algorithm multiple times with different starting positions and keeping the best result is standard practice.
  • Choosing k in advance: The algorithm can’t discover how many natural groups exist in your data on its own. If you pick the wrong k, you’ll either merge distinct groups together or split coherent groups apart.
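The standard remedy for the initialization problem is built into most libraries. In scikit-learn, for instance, the `n_init` parameter controls how many restarts to try, and the run with the lowest inertia is kept; a sketch on toy data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset with three true groups (assumed for illustration).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# One run from a single random start versus the best of ten restarts.
# scikit-learn keeps whichever restart ends with the lowest inertia.
single = KMeans(n_clusters=3, init="random", n_init=1, random_state=0).fit(X)
best_of_ten = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)
# best_of_ten.inertia_ can only match or beat single.inertia_
```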

A Practical Example: Image Color Reduction

One of the most intuitive applications of k-means is reducing the number of colors in a digital image. A typical photo contains millions of pixels, each with its own color defined by three values (red, green, blue). Many of those colors are nearly identical to the human eye, so you can compress the image by replacing similar colors with a single representative color.

This works in two steps. First, k-means treats every pixel’s color as a data point in three-dimensional space (one dimension for each color channel) and clusters them into k groups. If you set k to 16, the algorithm finds 16 centroid colors that best represent the entire image. Second, every pixel in the original image gets remapped to the nearest centroid color. The result is an image with only 16 distinct colors that still looks recognizably like the original. This is called color quantization, and it’s a clean demonstration of the assignment and update cycle working on real-world data.
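Both steps take only a few lines. A sketch assuming scikit-learn, using a synthetic random image as a stand-in for a real photo:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for a real photo: a synthetic 64x64 RGB image (values 0-255).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Step 1: treat each pixel's (R, G, B) as a point in 3-D space and cluster.
pixels = image.reshape(-1, 3).astype(float)
km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)
palette = km.cluster_centers_.astype(np.uint8)  # the 16 representative colors

# Step 2: remap every pixel to its nearest centroid color.
quantized = palette[km.labels_].reshape(image.shape)
```

The result, `quantized`, has the same shape as the original image but contains at most 16 distinct colors.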

How Fast It Runs

K-means is popular partly because it scales well. Each iteration requires checking the distance from every data point to every centroid, so the work per iteration grows linearly with the number of data points and the number of clusters. For most real-world datasets, the algorithm converges in relatively few iterations, making it fast enough to run on millions of data points. Worst-case theoretical bounds are much higher (super-polynomial in extreme scenarios), but these edge cases rarely appear in practice. If you’re comparing clustering methods, k-means is typically one of the fastest options available for large datasets.