What Is a Bootstrap? The Statistical Resampling Method

A bootstrap is a statistical technique that uses repeated resampling from a single dataset to estimate how reliable a result is. Instead of collecting new data or relying on mathematical assumptions about how data should behave, you essentially recycle the data you already have, drawing thousands of random samples from it to see how much your results might vary. The method was introduced by Stanford statistician Bradley Efron in a 1979 paper and has since become one of the most widely used tools in data analysis, clinical research, and machine learning.

The name comes from the old phrase “pulling yourself up by your own bootstraps,” the idea being that you can build all the statistical testing you need directly from the data at hand, without outside help.

How Bootstrapping Works

The core procedure is surprisingly simple. You start with one sample of data, say 200 patient records or 1,000 survey responses. From that sample, you draw a new sample of the same size, but with replacement. “With replacement” means that after you randomly pick a data point, you put it back before picking again, so the same observation can appear more than once in your new sample, and some original observations won’t appear at all.

You then calculate whatever statistic you care about (an average, a correlation, a risk ratio) from that resampled dataset. Then you repeat the whole process hundreds or thousands of times. Each round gives you a slightly different version of your statistic, and when you line up all those results, you get a distribution that shows you how much your estimate could reasonably bounce around. That spread is your measure of uncertainty.

In four steps:

  • Collect one sample from your population.
  • Resample from that sample (same size, with replacement).
  • Calculate the statistic you’re interested in from the resample.
  • Repeat steps two and three many times to build a distribution of that statistic.

The number of repetitions depends on your dataset size and computing power, but running 1,000 to 10,000 rounds is common in practice.
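The four steps above can be sketched in a few lines of Python using only the standard library; the sample values and the choice of the mean as the statistic are purely illustrative.

```python
import random
import statistics

def bootstrap_statistic(data, stat_fn, n_resamples=2000, seed=0):
    """Return the bootstrap distribution of stat_fn over data.

    Each round draws a resample of the same size, with replacement,
    and records the statistic computed on that resample.
    """
    rng = random.Random(seed)
    return [stat_fn(rng.choices(data, k=len(data)))
            for _ in range(n_resamples)]

sample = [12.1, 9.8, 14.3, 11.0, 10.5, 13.7, 9.2, 12.9, 11.4, 10.8]
dist = bootstrap_statistic(sample, statistics.mean)

# The spread of the bootstrap distribution estimates the uncertainty
# of the sample mean (its standard error).
std_error = statistics.stdev(dist)
```

The key line is `rng.choices(data, k=len(data))`: `choices` samples with replacement, so some observations repeat and others drop out, exactly as the procedure requires.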

Why It Exists: The Problem It Solves

Traditional statistical methods often assume your data follows a specific pattern, typically a bell curve (normal distribution). If your data is skewed, has outliers, or comes from a small sample, those assumptions can break down, and the confidence intervals or significance tests you calculate may be misleading.

Bootstrapping sidesteps this problem entirely. Because it builds its estimates from the actual shape of your data rather than from a theoretical formula, it works well even when data doesn’t follow a neat distribution. This makes it especially useful for small studies, unusual datasets, or any situation where you aren’t confident the standard assumptions hold.

It also reduces human bias in the analysis process. In regression analysis, for example, researchers must choose which variables to include in a model. Bootstrap resampling can test the reliability of those choices by checking whether the same variables keep showing up as important across hundreds of resampled datasets. Variables that appear significant in your original analysis but drop out in most bootstrap samples are probably unreliable.

Types of Bootstrap Confidence Intervals

Once you have your distribution of bootstrap estimates, there are different ways to turn it into a confidence interval. The two most common approaches are the percentile method and the bias-corrected and accelerated (BCa) method.

The percentile method is the most intuitive. For a 95% confidence interval, you simply find the 2.5th and 97.5th percentiles of your bootstrap distribution and use those as your lower and upper bounds. It’s straightforward and tends to produce intervals that reliably contain the true value across a range of sample sizes.
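Given a list of bootstrap estimates from any resampling loop, the percentile interval is just a matter of sorting and slicing. A minimal sketch, with a toy sample and the mean as the statistic:

```python
import random
import statistics

def percentile_ci(boot_stats, level=0.95):
    """Percentile bootstrap CI: cut equal tails off the sorted
    bootstrap distribution (2.5% each side for a 95% interval)."""
    s = sorted(boot_stats)
    alpha = (1 - level) / 2
    lo_idx = int(alpha * len(s))
    hi_idx = min(int((1 - alpha) * len(s)), len(s) - 1)
    return s[lo_idx], s[hi_idx]

# Build a bootstrap distribution of the mean for a toy sample.
rng = random.Random(0)
sample = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.8, 3.9, 5.2]
boot_means = [statistics.mean(rng.choices(sample, k=len(sample)))
              for _ in range(5000)]

low, high = percentile_ci(boot_means, level=0.95)
```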

The BCa method adds two corrections: one for bias (if the bootstrap estimates are systematically shifted in one direction) and one for skewness (if the distribution is lopsided). In theory, this should give more accurate intervals. In practice, a simulation study published in Frontiers in Psychology found that the percentile method often produced coverage closer to the desired 95% level, while the BCa method tended to create intervals that were slightly too narrow. The BCa method did, however, produce more balanced intervals, meaning the error was more evenly split between the upper and lower tails. For most general purposes, the percentile method is the safer default.

Where Bootstrapping Is Used

Bootstrapping shows up across nearly every field that works with data. In medical research, it’s used to validate prediction models, estimate treatment effects, and quantify uncertainty in clinical decisions. One recent application used bootstrapping to help determine correct dosages of blood-thinning medications like warfarin and heparin. By generating confidence intervals around dosage predictions, the method gives physicians not just a best guess but a range of plausible outcomes, letting them choose more conservative options when the stakes are high.

In machine learning, bootstrapping is the foundation of several widely used algorithms. Random forests, for example, train hundreds of decision trees on different bootstrap samples of the same dataset, then combine their predictions. This approach, called “bagging” (bootstrap aggregating), typically improves both accuracy and stability over any single tree.
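As a toy illustration of bagging (not how a production random forest is implemented), the sketch below trains one-dimensional decision stumps on bootstrap resamples and combines them by majority vote; the data and the stump rule are invented for the example.

```python
import random

def train_stump(points):
    """Fit a 1-D decision stump: try each observed x as a threshold,
    in both orientations, and keep the split with fewest errors.
    points is a list of (x, label) pairs with labels in {0, 1}."""
    best_t, best_sign, best_err = None, 1, len(points) + 1
    for t in {x for x, _ in points}:
        for sign in (1, -1):
            err = sum((1 if (x - t) * sign >= 0 else 0) != y
                      for x, y in points)
            if err < best_err:
                best_t, best_sign, best_err = t, sign, err
    return lambda x: 1 if (x - best_t) * best_sign >= 0 else 0

def bagged_classifier(points, n_models=25, seed=0):
    """Bagging: train each stump on a bootstrap resample of the
    training data, then predict by majority vote across stumps."""
    rng = random.Random(seed)
    models = [train_stump(rng.choices(points, k=len(points)))
              for _ in range(n_models)]
    return lambda x: int(sum(m(x) for m in models) * 2 >= len(models))

# Class 0 clusters at low x, class 1 at high x.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0),
        (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
predict = bagged_classifier(data)
```

A real random forest additionally randomizes which features each split considers, but the bootstrap-then-aggregate structure is the same.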

It’s also a standard tool for validating models internally. When you don’t have a second dataset to test your model against, you can use bootstrap resampling to estimate how well the model would perform on new data, giving you a more honest picture of its accuracy than testing it on the same data used to build it.
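One common form of this is out-of-bag evaluation: on each round, roughly 36.8% of the original observations never get drawn into the resample, and those left-out points can serve as a small test set for the model trained on that resample. A sketch of the split itself:

```python
import random

def out_of_bag_split(n, rng):
    """One bootstrap round over indices 0..n-1: returns the in-bag
    indices (drawn with replacement) and the out-of-bag indices that
    were never drawn, about 36.8% of points on average for large n."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag

rng = random.Random(0)
in_bag, oob = out_of_bag_split(100, rng)
# Train on the in-bag rows, score on the out-of-bag rows, and
# average those scores across many rounds for a performance estimate.
```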

When Bootstrapping Doesn’t Work Well

Bootstrapping has a critical assumption: the observations in your sample need to be independent of each other. When data points are linked, as in time series data where today’s stock price depends on yesterday’s, standard bootstrapping breaks down because resampling destroys the connections between observations.

Specialized versions exist for dependent data, such as the “block bootstrap,” which resamples in chunks rather than individual data points to preserve local patterns. But choosing the right block size is tricky. Too small and you lose the dependency structure. Too large and you don’t get enough variation between resamples. The block size becomes an additional decision that can meaningfully affect your results.
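A moving-block variant can be sketched as follows: slice the series into overlapping blocks of consecutive observations, then build each resample by concatenating randomly chosen blocks until it reaches the original length. The `block_len` argument is exactly the tricky tuning choice described above.

```python
import random

def moving_block_bootstrap(series, block_len, seed=0):
    """Resample a time series in overlapping blocks of consecutive
    observations, so short-range dependence survives within blocks."""
    rng = random.Random(seed)
    n = len(series)
    blocks = [series[i:i + block_len]
              for i in range(n - block_len + 1)]
    out = []
    while len(out) < n:
        out.extend(rng.choice(blocks))
    return out[:n]

series = [0.5, 0.7, 0.6, 0.9, 1.1, 1.0, 1.3, 1.2, 1.4, 1.6, 1.5, 1.8]
resampled = moving_block_bootstrap(series, block_len=4)
```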

Bootstrapping also struggles with very small samples. The method can only work with what’s in your original dataset. If your sample of 15 people doesn’t capture the true range of variation in the population, no amount of resampling will fix that. The bootstrap distribution will simply reflect the limitations of the data you started with. As a rough guide, the technique becomes increasingly reliable as sample sizes grow, and results from samples under 20 or so should be interpreted cautiously.

How to Run a Bootstrap Analysis

Most statistical software makes bootstrapping straightforward. In R, the boot package provides a general-purpose boot() function where you specify your data, the statistic you want to estimate, and the number of resamples. The caret package also supports bootstrapping as a model validation method with a single line of configuration. In Python, the scipy.stats library includes a bootstrap function, and scikit-learn offers resampling utilities that work within machine learning pipelines.
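In Python, a scipy.stats.bootstrap call (available in SciPy 1.7 and later) looks like this; the sample values are illustrative, and `method='percentile'` selects the interval type discussed earlier:

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 9.8, 14.3, 11.0, 10.5,
                   13.7, 9.2, 12.9, 11.4, 10.8])

# Note the data is passed as a one-element tuple: bootstrap()
# accepts a sequence of samples to support multi-sample statistics.
res = stats.bootstrap((sample,), np.mean,
                      n_resamples=9999,
                      confidence_level=0.95,
                      method='percentile')

low, high = res.confidence_interval
```

The result object also exposes `res.standard_error`, the standard deviation of the bootstrap distribution.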

The computational cost is the main practical consideration. Each bootstrap iteration recalculates your statistic from scratch, so if your analysis is already slow on one dataset, multiplying that by 1,000 or 10,000 can add up. For simple statistics like means or medians, this is trivial on modern hardware. For complex models with large datasets, it may require some patience or access to more computing power.