Evidence in science is information gathered through systematic observation or experimentation that can be used to support or challenge an explanation about how the world works. What separates it from everyday evidence, like a friend’s personal story or a gut feeling, is that scientific evidence must be testable, repeatable, and open to being proven wrong. These requirements are what give scientific findings their credibility and allow knowledge to build over time.
What Makes Evidence “Scientific”
At its core, science studies repeated events and looks for causal relationships. The underlying logic is called the similarity principle: similar conditions should produce similar outcomes. A piece of evidence qualifies as scientific when it comes from a process designed to test a specific prediction. If you claim that a new fertilizer helps tomatoes grow faster, the scientific approach requires you to set up a controlled test, measure the results, and show that someone else could run the same test and get the same outcome.
This is where falsifiability comes in. The philosopher Karl Popper argued that for a claim to count as scientific, it has to make predictions that could, in principle, be disproven by an experiment. A claim that explains everything and can never be contradicted isn’t science. It’s speculation. While Popper’s framework has been debated and refined over the decades, the core idea remains a pillar of scientific thinking: real evidence must be capable of showing you’re wrong.
Observation vs. Experimentation
Scientific evidence comes in two broad forms. Observational evidence is gathered by watching what happens without intervening. Researchers might track thousands of people over years to see who develops heart disease, noting what they eat, how much they exercise, and other factors. This approach captures what happens in the real world, which makes the findings broadly applicable to everyday life. The tradeoff is that without intervention it is hard to rule out confounding: people who exercise regularly may also eat better or smoke less, so the study can show that two things go together without proving that one causes the other.
Experimental evidence comes from controlled tests where researchers deliberately change one variable and measure the effect. The gold standard here is the randomized controlled trial, where participants are randomly assigned to receive either the treatment being tested or a comparison. Randomization helps ensure that any difference in outcomes is caused by the treatment itself, not by some other factor the researchers didn’t account for. The tradeoff is that the strict conditions of an experiment can make results less reflective of what happens outside the lab, where patients are messier, less compliant, and more varied than study volunteers.
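The mechanics of random assignment are simple enough to sketch. Here is a minimal, hypothetical Python version (the participant IDs, even split, and function name are illustrative, not a real trial protocol):

```python
import random

def randomize(participants, seed=None):
    """Randomly split participants into treatment and control arms.

    Shuffling before splitting makes each person equally likely to land
    in either arm, so known and unknown confounders tend to balance
    out across the groups as the sample grows.
    """
    rng = random.Random(seed)
    shuffled = participants[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs
treatment, control = randomize(list(range(100)), seed=42)
```

The key design point is that no human decides who goes where; the shuffle does, which is exactly what blocks the sorting-by-sickness problem discussed later.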
Both types of evidence matter. Observational studies are better at showing how things play out in large, diverse populations. Experiments are better at isolating cause and effect. Strong scientific conclusions usually draw on both.
Not All Evidence Is Equal
Scientists rank evidence by how reliably it can answer a question. This ranking, often called the hierarchy of evidence, places different study types at different levels. At the top sit systematic reviews, which pool and analyze results from multiple high-quality experiments to arrive at a single, more reliable conclusion. Below that are individual randomized controlled trials, then cohort studies (which follow groups over time), then case-control studies (which look backward from an outcome to find causes), and then case series, which simply describe what happened to a handful of patients.
At the bottom of the hierarchy is expert opinion. An experienced doctor’s informed judgment still counts for something, but it’s considered the weakest form of evidence because it’s the most vulnerable to personal bias and limited experience. One surgeon might see five patients recover with a particular technique and conclude it works, while another surgeon sees five patients who didn’t. Neither sample is large enough to tell you much.
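A quick binomial calculation shows why five patients prove little. Suppose, purely hypothetically, that the technique has no effect and each patient independently recovers with 50% probability:

```python
from math import comb

def prob_k_of_n(k, n, p):
    """Binomial probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# With no real effect (p = 0.5), all five recovering is about 1 in 32,
# common enough that some surgeons will see it by chance alone.
p_all_five = prob_k_of_n(5, 5, 0.5)   # 0.03125
```

Roughly one surgeon in thirty-two would see a perfect five-for-five streak from a worthless technique, and an all-five-fail streak is equally likely, which is why both surgeons in the example can be confidently wrong.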
Why Replication Matters
A single study, no matter how well designed, isn’t enough to settle a question. Science depends on replication: other researchers running similar studies to see if they get the same results. When multiple independent teams confirm a finding, confidence grows. When a finding can’t be reproduced, it signals a problem, whether with the original methods, the statistical analysis, or simply a false positive produced by chance.
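A little arithmetic shows why independent replication is so powerful. In a simplified model (assuming each study uses the conventional 0.05 threshold and the studies are truly independent, ignoring bias and any dependence between them), the chance that pure noise survives every replication shrinks geometrically:

```python
# Under the null (no real effect), a single study clears p < 0.05
# about 5% of the time. Independent replications multiply the odds:
ALPHA = 0.05

def false_positive_rate(n_replications):
    """Chance that n independent studies ALL reach p < ALPHA
    when the effect being tested does not actually exist."""
    return ALPHA ** n_replications

rates = {n: false_positive_rate(n) for n in (1, 2, 3)}
# approximately: rates[1] = 0.05, rates[2] = 0.0025, rates[3] = 0.000125
```

One fluke is common; three independent flukes in a row are vanishingly rare, which is the quantitative reason confirmation by separate teams matters so much.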
This process of testing and retesting is what gives science its self-correcting nature. In practice, though, the system is imperfect: published studies that can’t be replicated still get cited and used to support new work, which can send entire lines of research down unproductive paths. The scientific community has increasingly recognized this as a serious issue, pushing for more transparency in data sharing and stronger incentives to publish replication studies rather than only novel findings.
The Role of Statistical Significance
When researchers analyze their results, they typically use statistical tests to determine whether their findings are likely real or could have occurred by chance. The most common threshold is a p-value of 0.05, which means that if nothing meaningful were actually going on, a result at least as extreme as the one observed would appear only 5% of the time. If the p-value falls below 0.05, the result is labeled “statistically significant.”
This threshold is more of a convention than a law of nature. Researchers can set it at 1% or 10% depending on the stakes involved. A p-value of 0.06 doesn’t mean a finding is meaningless, and a p-value of 0.04 doesn’t guarantee the finding is true. It’s also important to understand what a p-value doesn’t do: rejecting the null hypothesis (the assumption that nothing is happening) is not the same as proving the alternative hypothesis. Statistical significance tells you something is unlikely to be random noise. It doesn’t tell you the effect is large, important, or clinically meaningful.
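This can be demonstrated with a small simulation (hypothetical, standard library only): draw two groups from the same distribution, so the “treatment” truly does nothing, and count how often a simple z-test still crosses the 0.05 bar.

```python
import random
from math import sqrt, erf

def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def two_sided_p(z):
    """Two-sided p-value for a z statistic."""
    return 2 * (1 - normal_cdf(abs(z)))

rng = random.Random(1)
n, trials, hits = 50, 2000, 0
for _ in range(trials):
    # Both "groups" come from the same distribution: the treatment
    # truly does nothing, so any observed difference is pure noise.
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    z = diff / sqrt(2 / n)   # standard error of the difference (sigma = 1)
    if two_sided_p(z) < 0.05:
        hits += 1

rate = hits / trials   # hovers around 0.05: noise clears the bar ~5% of the time
```

That built-in false positive rate is why a lone “significant” result proves little, and why the replication and consensus machinery described elsewhere in this article exists.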
How Peer Review Filters Evidence
Before research reaches the public, it typically goes through peer review. When scientists submit a paper to a journal, the editors send it to other experts in the field who evaluate the work independently. Reviewers assess whether the research question matters, whether the methods are sound, whether the statistical analysis is appropriate, and whether the conclusions actually follow from the data.
Based on these reviews, a paper can be accepted, sent back for revisions, or rejected. Unconditional acceptance on first submission is very rare. Most papers go through at least one round of revision, with authors responding point by point to reviewer concerns. This process isn’t perfect. Reviewers can miss errors, hold biases, or lack expertise in a particular method. But peer review remains the primary quality filter that separates vetted evidence from unvetted claims.
Bias Can Undermine Good Evidence
Even well-intentioned research can produce misleading evidence if bias creeps in. Bias is any systematic error that pushes results in one direction. Selection bias happens when the people chosen for a study aren’t representative of the broader population. If a study on a new pain medication only enrolls young, healthy adults, the results may not apply to older patients with multiple health conditions.
Channeling bias is a subtler problem. It occurs when patients are sorted into study groups based on how sick they are rather than randomly. In surgical research, for example, younger and healthier patients tend to be offered more aggressive treatments while older patients are managed conservatively. A study comparing those two groups might conclude the aggressive treatment works better, when really it just attracted healthier patients who were more likely to do well regardless.
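A toy simulation (entirely hypothetical numbers) makes the trap concrete: give the aggressive treatment no effect at all, let recovery depend only on baseline health, and channel healthier patients into the aggressive arm.

```python
import random

rng = random.Random(7)

def simulate(n=10_000):
    """Channeling bias demo: treatment has ZERO effect, yet the
    aggressive arm looks far better because it got healthier patients."""
    agg, cons = [], []
    for _ in range(n):
        health = rng.random()              # 0 = frail, 1 = robust
        aggressive = health > 0.5          # channeling: healthy -> aggressive arm
        recovered = rng.random() < health  # recovery depends only on health
        (agg if aggressive else cons).append(recovered)
    return sum(agg) / len(agg), sum(cons) / len(cons)

agg_rate, cons_rate = simulate()
# agg_rate comes out near 0.75 and cons_rate near 0.25: a large gap
# produced entirely by patient sorting, not by the treatment.
```

Randomization, as sketched earlier, is precisely the fix: it takes the sorting decision away from anyone who can see how sick the patient is.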
Confirmation bias operates at the interpretation stage: researchers may unconsciously favor results that align with what they expected to find. Blinding (keeping researchers and participants unaware of who received which treatment) and pre-registering study plans before collecting data are two common strategies for reducing these risks.
Why Anecdotes Aren’t Scientific Evidence
Personal stories are powerful. When someone you trust tells you a supplement cured their back pain or a vaccine made their child sick, it feels convincing. But anecdotes fail the basic tests of scientific evidence. They can’t be controlled for other factors, they represent a sample size of one, and they’re filtered through memory and emotion.
Research consistently shows that anecdotes influence medical decisions even when people are also given statistical data. Vaccine hesitancy, for instance, has been partly driven by personal stories shared in online communities, despite overwhelming clinical trial data confirming safety. Media coverage often amplifies this effect by pairing study results with individual patient stories, making the anecdote more memorable than the numbers.
This doesn’t mean personal experiences are worthless. They can generate hypotheses worth testing and add context to dry statistics. But a single person’s story can never tell you whether something works reliably, because you can’t know what would have happened without the treatment, what other factors were at play, or whether the outcome was simply coincidence.
How Evidence Becomes Consensus
No single study creates scientific consensus. Consensus forms gradually as evidence accumulates from many independent sources, using different methods, across different populations. Researchers studying the link between smoking and lung cancer, for example, published findings for decades before the scientific community considered the connection an established fact. Early on, there was genuine disagreement. Over time, as evidence mounted and competing explanations were tested and discarded, the debate resolved.
The process looks messy from the outside. At any given moment, you can find studies that seem to contradict each other. That’s normal. What matters is the overall direction of the evidence, weighted by the quality of the studies producing it. A dozen well-designed trials showing a treatment works carry more weight than one poorly designed trial suggesting it doesn’t. Scientific consensus isn’t a vote. It’s the point at which the evidence becomes strong enough that continued disagreement requires ignoring or dismissing the bulk of what’s been found.

