What Is Reproducibility and the Replication Crisis

Reproducibility is the ability to get the same results from a study or experiment when you follow the same methods. In science, it’s the basic test of whether a finding is real or a fluke. If another researcher uses the same data and the same analysis steps and gets the same answer, the work is reproducible. If nobody can get that answer again, the finding is on shaky ground.

The concept sounds simple, but it has become one of the most pressing issues in modern science. Large-scale efforts to re-test published findings have revealed that many of them don’t hold up, a problem now widely called the “reproducibility crisis.”

Reproducibility vs. Replicability

These two words are often used interchangeably, but they mean different things. The National Academies of Sciences, Engineering, and Medicine drew a clear line between them in a 2019 consensus report. Reproducibility means obtaining the same computational results using the same input data, the same code, and the same analysis conditions. You’re essentially re-running someone else’s work to verify the output. Replicability means obtaining consistent results across entirely new studies that collect their own data while asking the same scientific question.

Think of it this way: reproducibility checks the math, replicability checks the science. If you download a researcher’s dataset and code and press “run,” you should get identical numbers. That’s reproducibility. If you design a fresh experiment testing the same hypothesis and find similar results, that’s replicability. Both matter, but they fail for different reasons and require different fixes.

Why So Many Findings Don’t Hold Up

In 2005, statistician John Ioannidis published an essay in PLoS Medicine arguing that most published research findings are false. His reasoning was mathematical: when studies are small, when the effects being measured are subtle, and when researchers have flexibility in how they analyze data, the odds tilt toward false positives. The essay became one of the most widely read and cited papers in the journal's history and helped launch a decade of reckoning.
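Ioannidis's argument rests on a standard result: the positive predictive value of a significant finding, i.e., the probability that a "significant" result reflects a real effect, depends on the prior probability that the hypothesis is true and on the study's statistical power. A minimal sketch, using illustrative numbers (10% of tested hypotheses true, 35% power) that are assumptions for this example, not figures from the essay:

```python
def ppv(prior, power, alpha=0.05):
    """Positive predictive value: P(effect is real | result is significant).

    prior -- fraction of tested hypotheses that are actually true
    power -- probability a real effect yields a significant result
    alpha -- false-positive rate when there is no effect
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Illustrative (assumed) numbers: few tested hypotheses are true,
# and studies are underpowered.
print(ppv(prior=0.10, power=0.35))  # 0.4375: most "significant" results are false
print(ppv(prior=0.50, power=0.80))  # ~0.94: strong priors and high power help
```

With weak priors and low power, more than half of all significant findings are false positives even though every individual test was run "correctly" at the 5% level.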

The numbers that followed were sobering. In 2015, the Open Science Collaboration attempted to replicate 100 published psychology studies. While 97% of the originals had reported statistically significant results, only 36% of the replications did. The effects that did replicate were, on average, about half the size of those originally reported. In cancer biology, the results were even more striking: when scientists at the pharmaceutical company Amgen tried to reproduce findings from 53 landmark preclinical cancer papers, they succeeded just 11% of the time.

These aren’t isolated examples. Similar patterns have turned up in economics, social science, and preclinical medicine. The problem isn’t that scientists are dishonest. It’s that the system they work within creates conditions where unreliable results get published and then treated as established fact.

What Causes the Problem

Several forces push published science away from reliability, and most of them are structural rather than personal.

P-hacking. This is the practice of trying different statistical analyses, data subsets, or variable combinations until a result crosses the threshold of “statistical significance” (typically a p-value below 0.05). Common forms include checking results partway through an experiment to decide whether to keep collecting data, testing many outcome variables and only reporting the ones that “worked,” dropping outliers after seeing the results, or combining and splitting groups until a pattern emerges. None of these steps are necessarily dishonest on their own, but together they inflate the chance of a false positive dramatically. A telltale sign of p-hacking across a field is a suspicious clustering of p-values just below 0.05.
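That inflation is easy to demonstrate. The sketch below simulates studies in which no real effect exists at all, with each study measuring ten independent outcomes (both counts are assumptions chosen for illustration). Under the null hypothesis a p-value is uniformly distributed, so if researchers report whichever outcome "worked," far more than 5% of these null studies produce a publishable result:

```python
import random

random.seed(0)  # fixed seed so the simulation is itself reproducible

ALPHA = 0.05
N_OUTCOMES = 10      # outcome variables measured per study (assumed)
N_STUDIES = 100_000  # simulated studies, none with a real effect

# Under the null, each p-value is uniform on [0, 1], so p < ALPHA
# occurs with probability ALPHA for every outcome tested.
studies_with_a_hit = sum(
    any(random.random() < ALPHA for _ in range(N_OUTCOMES))
    for _ in range(N_STUDIES)
)
rate = studies_with_a_hit / N_STUDIES
print(f"Studies with at least one 'significant' outcome: {rate:.0%}")
# Theory: 1 - 0.95**10, about 40%, eight times the nominal 5% error rate
```

The per-test error rate never changed; only the freedom to pick among outcomes did.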

Publication bias. Journals, especially prestigious ones, disproportionately publish positive, statistically significant results. Studies that find “nothing happened” are far less likely to be published. This creates a scientific literature that over-represents effects that may not be real while hiding the null results that would provide balance. Researchers know this, and it shapes what they study, how they analyze data, and what they choose to write up. Since career advancement often depends on publishing in high-impact journals, the incentives all point in the same direction.
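Publication bias does more than hide null results: it also inflates the effect sizes that do get published, which is one reason replications tend to find smaller effects than the originals. A toy simulation, with an assumed true effect and standard error chosen only for illustration:

```python
import random

random.seed(1)

TRUE_EFFECT = 0.2   # assumed small true effect
SE = 0.15           # assumed standard error of each study's estimate
CUTOFF = 1.96 * SE  # estimate needed for p < 0.05 in a two-sided z-test

# Every study measures the same real effect, with sampling noise.
estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(100_000)]

# Only "significant" positive results make it into the literature.
published = [e for e in estimates if e > CUTOFF]
mean_published = sum(published) / len(published)

print(f"True effect:             {TRUE_EFFECT}")
print(f"Mean published estimate: {mean_published:.2f}")  # roughly double
```

The effect is real, yet the literature systematically overstates it, because only the estimates that happened to land high cleared the significance bar.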

HARKing. This stands for “hypothesizing after results are known.” A researcher runs an exploratory analysis, finds an unexpected pattern, and then writes the paper as though that pattern was the hypothesis all along. This turns what should be a preliminary, tentative finding into something that looks like a confirmed prediction.

How Scientists Are Fixing It

The reproducibility crisis prompted a wave of reforms across science, many of which are now becoming standard practice.

Pre-registration

Pre-registration means publicly recording a study's hypotheses, methods, and analysis plan before any data are collected. Because the plan is fixed upfront, readers can compare what researchers said they would do with what they actually reported, which makes p-hacking and HARKing far harder to hide. Public pre-registration also deters selective reporting and strengthens trust in the findings that emerge.

Registered Reports

This is a newer publishing model that takes pre-registration a step further. Researchers submit their introduction, hypotheses, and detailed methods to a journal before conducting the study. Peer reviewers evaluate the research question and the rigor of the methods, and the journal makes a decision to publish based on those alone. If accepted, the researchers receive an “in-principle acceptance,” meaning the journal commits to publishing the final paper regardless of how the results turn out. This eliminates publication bias at its source: both positive and negative results get published. It also means peer review happens when it matters most, before data collection, when researchers can still improve their design.

Open Data and Code Sharing

Computational reproducibility requires that other researchers can access the original data and analysis code. A set of principles known as FAIR (Findable, Accessible, Interoperable, and Reusable) now guides how scientific data and software should be shared. The guidelines emphasize assigning unique identifiers to datasets, using standardized formats, providing clear documentation, and applying open licensing so others can actually reuse the materials. When a study’s code is available in a public repository with version tracking and clear documentation, anyone can re-run the analysis and verify the results.
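What computational reproducibility looks like in miniature: a checksum so others can confirm they hold the same input data, and an explicitly seeded random generator so re-running the analysis yields identical numbers. Everything here (the toy data and function names) is illustrative, not drawn from any particular study:

```python
import hashlib
import random

def sha256_of(data: bytes) -> str:
    """Checksum published alongside the data, so anyone can verify
    they are analyzing exactly the same input."""
    return hashlib.sha256(data).hexdigest()

def analyze(values, seed=42):
    """Toy analysis: a bootstrap mean using an explicit, seeded RNG
    (no hidden global state), so every run gives identical output."""
    rng = random.Random(seed)
    resamples = [
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(1000)
    ]
    return sum(resamples) / len(resamples)

# Toy dataset standing in for a published data file.
data = [1.2, 3.4, 2.2, 5.1, 4.0]
raw = ",".join(map(str, data)).encode()

print("data sha256:", sha256_of(raw))
print("result:", analyze(data))
```

Pinning the software environment (exact library versions) completes the picture; the checksum and fixed seed cover the data and the analysis itself.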

Funding Agency Requirements

Major funders are now building reproducibility into their grant requirements. The U.S. National Institutes of Health requires applicants to address scientific rigor and transparency in their proposals. This includes detailing experimental design, justifying sample sizes, describing plans for authentication of key biological resources, and explaining how potential biases will be minimized. Reviewers evaluate these elements as part of the scientific merit of the application. The goal is to catch methodological weaknesses before money is spent, not after a paper is retracted.
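Sample-size justification, one of the elements funders ask for, is commonly done by power analysis. A sketch via simulation, assuming a two-group comparison with a medium standardized effect (d = 0.5), unit variance, and a simple z-criterion; real proposals would use the design and effect size relevant to the study:

```python
import random

random.seed(2)

def simulated_power(n, effect=0.5, sims=2000, alpha_crit=1.96):
    """Estimate power of a two-sample comparison by simulation:
    the fraction of simulated experiments that reach significance."""
    hits = 0
    for _ in range(sims):
        a = [random.gauss(0.0, 1.0) for _ in range(n)]     # control group
        b = [random.gauss(effect, 1.0) for _ in range(n)]  # treated group
        diff = sum(b) / n - sum(a) / n
        se = (2 / n) ** 0.5  # SE of the difference for unit-variance groups
        if abs(diff / se) > alpha_crit:
            hits += 1
    return hits / sims

# Power for detecting d = 0.5 at various per-group sample sizes
for n in (20, 64, 100):
    print(n, simulated_power(n))
```

The classic benchmark is that about 64 participants per group give roughly 80% power for a medium effect; at 20 per group, power falls well below 50%, which is exactly the underpowered regime that feeds false positives.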

Why It Matters Beyond the Lab

Reproducibility isn’t just an abstract concern for scientists. Medical treatments, public health policies, environmental regulations, and technology development all rest on published research. When a drug moves to clinical trials based on preclinical findings that can’t be reproduced, time and money are wasted, and patients in those trials are exposed to risk for no benefit. When policy is built on social science findings that don’t replicate, the policy may not work as intended.

For anyone reading a news headline about a scientific breakthrough, reproducibility is the reason to pause before assuming the finding is settled. A single study, no matter how exciting, is a starting point. The finding becomes trustworthy when independent teams, using their own data and methods, arrive at the same conclusion. That process is slow and unglamorous, but it’s the mechanism that separates durable knowledge from noise.