Bootstrapping is a statistical technique that estimates the properties of a population by repeatedly resampling from a single dataset. Instead of collecting new data or relying on mathematical formulas that assume your data follows a specific distribution, you treat your existing sample as a stand-in for the whole population and draw thousands of new samples from it, each time with replacement. The method was introduced by statistician Bradley Efron in a 1979 paper and has since become one of the most widely used tools in applied statistics and machine learning.
How Bootstrapping Works
Say you have a dataset of 50 observations and you want to know how reliable your calculated average is. Bootstrapping works like this: you randomly draw 50 values from your dataset, but after each draw, you put that value back before drawing again. This “with replacement” step is critical. It means some observations will appear multiple times in a given resample, and others won’t appear at all. That single resample is one bootstrap sample.
You repeat this process many times, typically between 1,000 and 10,000 rounds, though some applications use as many as 100,000. For each bootstrap sample, you calculate whatever statistic you care about: the mean, median, regression coefficient, or anything else. When you’re done, you have a full distribution of that statistic across all your resamples. That distribution tells you how much your estimate would vary if you could somehow collect fresh data from the real population over and over again.
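The resampling loop described above takes only a few lines of NumPy. The dataset below is hypothetical stand-in data, and 5,000 resamples is an arbitrary choice within the range just mentioned:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset of 50 observations, standing in for a real sample.
data = rng.normal(loc=100, scale=15, size=50)

n_boot = 5_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Draw 50 values WITH replacement: one bootstrap sample.
    sample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = sample.mean()

# The spread of boot_means approximates how much the sample mean
# would vary across fresh samples from the population.
boot_se = boot_means.std(ddof=1)
```

The same loop works for any statistic: replace `sample.mean()` with a median, a regression fit, or anything else you can compute from a resample.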
The beauty of this approach is that it requires no assumptions about the shape of the underlying data. Traditional methods often assume your data is normally distributed. Bootstrapping lets the data speak for itself.
What You Can Learn From It
The most common use of bootstrapping is building confidence intervals. If you want a 95% confidence interval for some estimate, you can sort all your bootstrap estimates from lowest to highest and take the values at the 2.5th and 97.5th percentiles. This is called the percentile method, and it’s the simplest approach.
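A minimal sketch of the percentile method, using invented skewed data to show that no normality assumption is needed:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=80)   # skewed toy data

# Bootstrap distribution of the mean.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(10_000)
])

# Percentile method: the middle 95% of the bootstrap distribution.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
```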
More refined methods exist for situations where the simple percentile approach falls short. The bias-corrected and accelerated method (BCa) adjusts for cases where the bootstrap distribution is skewed or where the center of the bootstrap estimates doesn’t line up with your original estimate. It does this by calculating a correction factor based on how many bootstrap estimates fall below the original estimate, then shifting the confidence interval boundaries accordingly. For most routine analyses, the percentile method works fine, but BCa intervals are more accurate when your data has an asymmetric spread.
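SciPy's `scipy.stats.bootstrap` implements both approaches, which makes it easy to compare them on deliberately skewed data; the lognormal sample below is invented for illustration:

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=1.0, size=60)   # strongly skewed toy data

# Percentile vs. BCa intervals for the mean. scipy expects the data
# wrapped in a tuple and a statistic that accepts an axis argument.
res_pct = bootstrap((data,), np.mean, method="percentile", n_resamples=9999)
res_bca = bootstrap((data,), np.mean, method="BCa", n_resamples=9999)
```

On skewed data like this, the BCa endpoints typically shift relative to the plain percentile interval, reflecting the corrections described above.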
Beyond confidence intervals, bootstrapping gives you standard errors for virtually any statistic, even complex ones where no formula exists for calculating variability directly.
Nonparametric vs. Parametric Bootstrap
The version described above, where you resample your raw data points, is the nonparametric bootstrap. It makes no assumptions about the probability distribution your data came from. This is the most common form and the one most people mean when they say “bootstrapping.”
The parametric bootstrap takes a different approach. You first fit a statistical model to your data (assuming it follows a normal distribution, for example), then generate new samples by simulating from that fitted model rather than resampling the original data points. This can be more powerful when you have good reason to believe your data follows a known distribution, because the simulated samples incorporate that structure. When the assumed distribution is wrong, though, the results can be misleading.
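A sketch of the parametric version, assuming (for illustration) that the data is normal and that we care about the median:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=5.0, scale=2.0, size=40)   # toy data

# Step 1: fit the assumed model (here, a normal distribution).
mu_hat, sigma_hat = data.mean(), data.std(ddof=1)

# Step 2: simulate new samples from the fitted model
# rather than resampling the raw data points.
sims = rng.normal(mu_hat, sigma_hat, size=(5_000, data.size))
boot_medians = np.median(sims, axis=1)

param_se = boot_medians.std(ddof=1)
```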
In practice, the two methods often produce similar confidence intervals. The nonparametric version is the safer default when you don’t have strong prior knowledge about the shape of your data.
Bootstrapping in Machine Learning
Bootstrapping plays a foundational role in machine learning, most notably in ensemble methods. Bagging (short for “bootstrap aggregating”) trains multiple models on different bootstrap samples of the training data, then averages their predictions. This reduces overfitting and improves stability.
Random Forests, one of the most popular machine learning algorithms, rely on this principle. Each tree in a Random Forest is trained on a different bootstrap sample of the dataset. The observations left out of a particular bootstrap sample, called out-of-bag samples, serve as a built-in validation set, letting the algorithm estimate its own accuracy without needing a separate test dataset. Bootstrapping has also been proposed as an alternative to cross-validation for estimating model performance, since generating bootstrap samples and evaluating predictions on the out-of-sample observations follows a similar logic.
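The out-of-bag idea is easy to see directly: in any one bootstrap sample, the fraction of observations never drawn converges to (1 - 1/n)^n, about e⁻¹ ≈ 36.8%. A quick check with a hypothetical dataset of 1,000 rows:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000

# Draw one bootstrap sample of row indices and find which rows
# were never picked: those are the out-of-bag observations.
picked = rng.choice(n, size=n, replace=True)
oob_mask = np.ones(n, dtype=bool)
oob_mask[picked] = False

oob_fraction = oob_mask.mean()   # close to (1 - 1/n)**n, roughly 0.368
```

Each tree in a Random Forest therefore has roughly a third of the data available as a free validation set.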
Key Assumptions
Bootstrapping is often described as “assumption-free,” but that’s not quite right. The core assumption is that your data points are independent and identically distributed. Each observation should be unrelated to the others, and all should come from the same underlying process. When this holds, resampling individual data points is valid because the structure of the original sample mirrors the structure of the population.
When data points are dependent on each other, as in time series where today’s value is influenced by yesterday’s, standard bootstrapping breaks down. Resampling individual observations destroys the temporal relationships that carry real information. A specialized technique called the block bootstrap addresses this by resampling chunks of consecutive observations rather than individual points. If the blocks are long enough, the original dependence structure is reasonably preserved. One study found that using a moving block bootstrap on dependent data reduced confidence interval length by roughly 25% compared to naively applying the standard independent bootstrap, reflecting a genuine gain in precision from respecting the data’s structure.
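A minimal sketch of the moving block bootstrap on a toy autocorrelated series; the AR(1) process, block length of 10, and resample count are all illustrative choices, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(11)

# Toy AR(1) series: each value depends on the previous one.
n, phi = 200, 0.7
series = np.empty(n)
series[0] = rng.normal()
for t in range(1, n):
    series[t] = phi * series[t - 1] + rng.normal()

def moving_block_resample(x, block_len, rng):
    """Rebuild a series of len(x) by concatenating random consecutive blocks."""
    n_blocks = int(np.ceil(len(x) / block_len))
    starts = rng.integers(0, len(x) - block_len + 1, size=n_blocks)
    return np.concatenate([x[s:s + block_len] for s in starts])[:len(x)]

boot_means = np.array([
    moving_block_resample(series, block_len=10, rng=rng).mean()
    for _ in range(2_000)
])
block_se = boot_means.std(ddof=1)
```

Because each block keeps 10 consecutive observations together, the short-range dependence inside a block survives the resampling.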
Where Bootstrapping Struggles
Bootstrapping does not rescue you from having too little data. For small samples, the variance of the bootstrap distribution tends to be too small by a factor of roughly (n-1)/n, where n is your sample size. With 10 observations, for example, the bootstrap underestimates variability by about 10%. This means confidence intervals will be too tight and will fail to contain the true value as often as they should.
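For the mean, this shrinkage can be checked numerically: the variance of the bootstrap means should be about (n-1)/n times the usual unbiased estimate s²/n. A quick demonstration with n = 10 (toy data, large resample count chosen only to reduce Monte Carlo noise):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10
data = rng.normal(size=n)

# Vectorized bootstrap: 200,000 resamples of size n, one mean per row.
boot_means = rng.choice(data, size=(200_000, n), replace=True).mean(axis=1)

# Compare the bootstrap variance of the mean to the unbiased estimate s^2 / n.
ratio = boot_means.var() / (data.var(ddof=1) / n)
# ratio comes out near (n - 1) / n = 0.9
```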
The problem gets worse in stratified or multi-group designs where individual groups are tiny. A real-world example illustrates this sharply: the U.K. Department for Work and Pensions attempted to bootstrap a stratified survey on welfare fraud, but their design placed only two subjects in each stratum. With n = 2 per stratum, the (n-1)/n shrinkage factor is 1/2, so the uncorrected bootstrap variance was half what it should have been, rendering the results essentially useless. For very small samples, traditional methods that make explicit distributional assumptions often outperform bootstrapping.
Statistics that depend heavily on a few extreme observations, like the median or other quantiles, also give bootstrapping trouble in small samples. And if your data includes a rare category (say, only five observations with a particular label), there’s a meaningful chance that some bootstrap samples will contain zero observations from that category. In a regression setting, this means the model can’t estimate a coefficient for that group, causing entire bootstrap iterations to fail.
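The rare-category failure mode is easy to quantify: with 5 rare observations out of n, the chance a bootstrap sample contains none of them is ((n-5)/n)^n, which for n = 200 is about e⁻⁵ ≈ 0.6%. A hypothetical labeled dataset makes the point:

```python
import numpy as np

rng = np.random.default_rng(9)
n, n_rare = 200, 5
labels = np.array(["rare"] * n_rare + ["common"] * (n - n_rare))

# 10,000 bootstrap samples; count how many contain zero rare labels.
samples = rng.choice(labels, size=(10_000, n), replace=True)
frac_missing = 1.0 - (samples == "rare").any(axis=1).mean()
# Theory: ((n - n_rare) / n) ** n = 0.975 ** 200, roughly 0.006
```

Even a 0.6% failure rate means dozens of broken iterations among 10,000 resamples, and the rate grows quickly as the rare group shrinks.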
How It Compares to the Jackknife
Before Efron introduced the bootstrap, the primary resampling method was the jackknife, developed in the 1950s. The jackknife works by systematically leaving out one observation at a time, recalculating the statistic each time, and using the variation across those leave-one-out estimates to gauge uncertainty. With a sample of 50, the jackknife produces exactly 50 estimates. The bootstrap, by contrast, can produce as many as you want.
The jackknife is computationally simpler and was popular when processing power was limited. Mathematically, it can be viewed as a linear approximation to the bootstrap. The bootstrap is more flexible and generally more accurate, especially for statistics that aren’t smooth functions of the data. The jackknife remains useful for quick bias estimation, but for confidence intervals and standard errors, bootstrapping has largely replaced it in modern practice.
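The leave-one-out procedure is a short loop; the formula below is the standard jackknife variance estimate, shown here for the mean on invented data (for the mean it reproduces the textbook s/√n exactly):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=0.0, scale=3.0, size=50)
n = data.size

# Leave one observation out at a time and recompute the statistic.
loo_means = np.array([np.delete(data, i).mean() for i in range(n)])

# Standard jackknife variance formula: (n-1)/n times the sum of
# squared deviations of the leave-one-out estimates.
jack_se = np.sqrt((n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2))
```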
Choosing the Number of Resamples
The precision of your bootstrap results depends directly on how many resamples you generate. For estimating standard errors, 1,000 bootstrap samples is usually sufficient. For confidence intervals, 10,000 is a more common recommendation because the tails of the distribution need to be estimated accurately, and that requires more samples. If you're computing p-values for hypothesis tests, especially when testing many hypotheses simultaneously, even larger numbers (up to 100,000) are sometimes used. More resamples always buy more precision, but with diminishing returns. The practical limit is computation time, which for most datasets on modern hardware is measured in seconds.

