Psychology is a science. It uses the same core method that physics, chemistry, and biology use: forming testable hypotheses, collecting data through controlled experiments, analyzing results with statistics, and publishing findings for peer review. The American Psychological Association classifies psychology as a core STEM discipline, citing both its direct scientific innovations and its contributions to education in science and technology. That said, the question persists for legitimate reasons, and understanding why reveals a lot about what makes any field scientific and where psychology’s genuine weak spots are.
What Makes Something a Science
The scientific method follows a specific sequence. It starts with observation, which leads to a question. From that question, a researcher generates a hypothesis, phrased so that evidence could disprove it. This requirement, called falsifiability, is the single most important criterion: if there’s no conceivable result that would prove your idea wrong, you’re not doing science. The hypothesis gets tested through experiments, ideally with control groups and blinding to reduce bias. The data are analyzed, and the results either support the hypothesis or contradict it, prompting a revised one. The final step is publication, which exposes the work to scrutiny from other researchers.
Psychology follows every one of these steps. A cognitive psychologist studying memory, for instance, might hypothesize that sleep deprivation impairs recall. They’d recruit participants, randomly assign some to a sleep-deprived condition and others to a control group, test recall performance, and analyze the difference statistically. The structure of a published psychology paper mirrors any other scientific paper: background, methods, results, and conclusions.
How Psychology Measures the Mind
One common objection is that you can’t measure thoughts or emotions the way you measure temperature or mass. Psychology addresses this through psychometrics, a system for building and evaluating tests that quantify mental traits and behaviors. These tools are held to strict standards.
A psychological test must demonstrate reliability, meaning it produces consistent results. There are several types: the same person should score similarly when retested days or weeks later (test-retest reliability), the individual questions on a test should all measure the same trait (internal consistency), and different evaluators scoring the same response should reach similar conclusions (inter-rater reliability). By common convention, a reliability coefficient below 0.6 marks a test as unreliable, while coefficients of 0.7 or above are generally acceptable for research purposes.
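These reliability checks are straightforward to compute. The sketch below, in plain Python with made-up scores (illustrative numbers, not data from any real test), calculates a test-retest correlation and Cronbach’s alpha, the standard internal-consistency coefficient:

```python
# Two reliability checks on synthetic data (all numbers are illustrative).

def pearson_r(xs, ys):
    """Pearson correlation: the usual test-retest reliability coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def cronbach_alpha(items):
    """Cronbach's alpha: internal consistency across test items.
    `items` is a list of per-item score lists for the same respondents."""
    k = len(items)
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    totals = [sum(scores) for scores in zip(*items)]  # per-person total score
    return (k / (k - 1)) * (1 - sum(var(it) for it in items) / var(totals))

# The same five people tested twice, two weeks apart:
week1 = [12, 15, 9, 20, 17]
week2 = [13, 14, 10, 19, 18]
print(round(pearson_r(week1, week2), 2))  # -> 0.97

# Three questionnaire items answered by the same five people:
items = [[3, 4, 2, 5, 4], [2, 4, 2, 5, 5], [3, 5, 1, 4, 4]]
print(round(cronbach_alpha(items), 2))  # -> 0.92
```

Both coefficients land above the 0.7 convention, so this hypothetical test would be judged reliable enough for research use.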
Tests must also demonstrate validity, meaning they actually measure what they claim to measure. Content validity checks whether the questions genuinely cover the trait in question. Construct validity goes deeper, using correlations with other established tests and statistical techniques like factor analysis to confirm the test is capturing the right concept. These aren’t loose guidelines. They’re quantitative benchmarks that determine whether a measurement tool gets used in research at all.
Where the Criticism Has Teeth
The philosopher Karl Popper, who established falsifiability as the gold standard for science, specifically called out Freudian psychoanalysis as an example of a non-scientific theory. Freud’s ideas could explain virtually any observation after the fact but couldn’t specify in advance what result would prove them wrong. That critique was valid, and modern psychology largely agrees. Psychoanalysis has been pushed to the margins of scientific psychology for exactly these reasons.
But the falsifiability problem isn’t limited to Freud. A 2009 analysis of current ADHD theories found that most published studies failed to meet the falsifiability requirement in practice, even when the theories had the potential to be falsifiable. Broad theories like “executive dysfunction causes ADHD” are hard to disprove because poor performance on almost any cognitive task can be interpreted as supporting them, and the studies rarely specify conditions that would count as evidence against them. Some narrower hypotheses within these theories did meet the standard. The delay aversion hypothesis, for example, made specific predictions about when differences between ADHD and control groups should appear and when they shouldn’t. But the general pattern was that psychology often states hypotheses too loosely to be rigorously tested.
The Replication Problem
In 2015, a large-scale effort called the Reproducibility Project attempted to replicate 100 published psychology studies. The results were sobering. While 97% of the original studies had reported statistically significant findings, only 36% of the replications did. The average effect size in replications was roughly half the magnitude of the originals. Only 39% of the replicated effects were subjectively rated as having successfully reproduced the original result.
This doesn’t mean those original findings were all wrong. When the original and replication data were combined, 68% still showed significant effects. But it does mean that psychology has had a serious problem with inflated results, likely driven by small sample sizes, selective reporting of positive findings, and flexibility in how data gets analyzed. A field where a coin-flip’s worth of findings don’t hold up on a second try has a credibility issue, and psychology has been unusually honest about confronting it. The open science movement, which emphasizes pre-registering hypotheses and sharing raw data, grew in large part from psychology’s self-correction.
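The inflation mechanism is easy to demonstrate. The simulation below is a sketch with made-up parameters, not the Reproducibility Project’s data: it runs many small two-group studies of a weak true effect and “publishes” only the significant ones. The published effect sizes come out well above the true value, so a faithful replication then looks like a failure to match the original:

```python
# Illustrative simulation of publication bias (assumed parameters, not
# real study data): selective reporting inflates published effect sizes.
import random, statistics

random.seed(1)

TRUE_EFFECT = 0.2   # true standardized mean difference (Cohen's d)
N = 20              # per-group sample size of each small study

def one_study():
    """Run one two-group study; return (estimated d, significant?)."""
    a = [random.gauss(0, 1) for _ in range(N)]
    b = [random.gauss(TRUE_EFFECT, 1) for _ in range(N)]
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    d = (statistics.mean(b) - statistics.mean(a)) / pooled_sd
    t = d * (N / 2) ** 0.5       # two-sample t statistic for equal groups
    return d, abs(t) > 2.02      # ~ critical t for df = 38, alpha = .05

results = [one_study() for _ in range(5000)]
all_d = [d for d, _ in results]
published = [d for d, sig in results if sig]  # only significant results

print(round(statistics.mean(all_d), 2))      # close to the true effect, 0.2
print(round(statistics.mean(published), 2))  # inflated far above 0.2
```

With samples this small, only studies that happen to overestimate the effect reach significance, which is exactly the pattern of originals roughly doubling their replications.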
The p-Value Question
Psychology, like medicine and most other empirical sciences, relies heavily on the p-value to determine whether a result is statistically meaningful. The conventional threshold is p < 0.05: a result counts as significant if, assuming no real effect exists, data this extreme would arise less than 5% of the time. This cutoff, popularized by the statistician Ronald Fisher in the 1920s, is arbitrary. Fisher himself described it as a convenient convention, not a law of nature. Stricter cutoffs (p < 0.01 or p < 0.001) are used when higher standards are needed.
The problem isn’t unique to psychology. Every field using this threshold faces the same limitation: a 5% false-positive rate means that out of every 20 tests where nothing real is happening, one will appear significant by chance, on average. Psychology’s large number of possible comparisons (dozens of behavioral measures, multiple time points, various subgroups) makes it especially vulnerable to finding patterns in noise. But this is a statistical challenge shared across the sciences, not evidence that psychology itself isn’t one.
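The multiple-comparisons point can be made concrete. Assuming, hypothetically, a study that tests 20 independent noise-only measures at p < 0.05, the chance of at least one false positive is 1 − 0.95²⁰, about 64%, which a short simulation confirms:

```python
# Sketch: with no real effects anywhere, testing many measures at
# p < 0.05 still "finds" significance. The counts of measures and
# simulated studies are illustrative choices, not from the text.
import random

random.seed(0)

ALPHA = 0.05
N_MEASURES = 20      # e.g. 20 behavioral measures, all pure noise
N_STUDIES = 1000

false_positive_studies = 0
for _ in range(N_STUDIES):
    # Under the null hypothesis, a p-value is uniform on [0, 1].
    p_values = [random.random() for _ in range(N_MEASURES)]
    if any(p < ALPHA for p in p_values):
        false_positive_studies += 1

# Theoretical rate: 1 - 0.95**20 ~= 0.64
print(false_positive_studies / N_STUDIES)
```

This is why pre-registration matters: committing in advance to which comparison counts removes the freedom to pick whichever of the 20 happened to come up significant.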
Not All Psychology Looks the Same
Psychology spans a wide range of subfields, and they vary enormously in how closely they resemble the stereotypical laboratory science. At one end, behavioral neuroscience uses brain imaging, computational modeling, and even direct measurement of neurochemical signaling. Researchers at Virginia Tech, for instance, have developed methods to measure sub-second fluctuations in brain chemicals through wires finer than a strand of hair, combining that data with behavioral tasks and clinical assessments to study depression. This work is indistinguishable from neuroscience in its methods and rigor.
At the other end, clinical psychology has traditionally relied more on behavioral observation, self-report questionnaires, and diagnostic interviews. These methods have “immense value,” as one researcher put it, “but don’t give us the full picture.” The field is increasingly moving toward integrating these approaches, connecting clinical symptoms with measurable brain processes to develop more precise, neuroscience-guided treatments. The boundaries between psychology and biology are blurring, not sharpening.
A Science With Known Limitations
Psychology is a science in the same way that ecology, epidemiology, and climate science are: it studies complex systems where perfect control is rarely possible, where variables interact in ways that are difficult to isolate, and where measurement is harder than in physics or chemistry. Its subject matter (human thought, emotion, and behavior) introduces unique challenges. You can’t randomly assign people to traumatic childhoods. You can’t blind someone to whether they’re feeling anxious. Ethical constraints, codified in frameworks like the Belmont Report, limit what experiments can be conducted. Research must protect participants through informed consent, minimize risks, and avoid selecting vulnerable populations simply because they’re convenient.
These constraints make psychology harder to do well, but they don’t make it unscientific. The field generates testable predictions, measures outcomes with validated tools, applies statistical analysis, submits to peer review, and increasingly holds itself to transparency standards that many older sciences are only beginning to adopt. Its replication crisis is real, but it’s also evidence of a field that takes self-correction seriously rather than one that ignores its flaws.

