What Is Data in Science? From Raw Facts to Evidence

In science, data is any piece of information collected through observation, measurement, or experimentation that serves as evidence for or against a claim about how the world works. Data can be a number on a thermometer, a photograph of a galaxy, a patient’s description of their symptoms, or a string of genetic code. What makes something “data” isn’t its format. It’s the fact that researchers collect, store, and use it to support or challenge what we think we know.

Why Data Isn’t Just Numbers

Many people picture spreadsheets full of numbers when they hear “scientific data,” but the concept is much broader. The Royal Society has defined data as “numbers, characters or images that designate an attribute of a phenomenon,” and philosophers of science push that definition even wider. Any research output, from a lab photograph to a field recording to letters and symbols, counts as data as long as it can be shared with others and used as potential evidence for a claim about the natural world.

This matters because science spans an enormous range of activities. An ecologist recording bird songs in a rainforest is collecting data. So is a physicist measuring particle collisions, a psychologist documenting how people respond to a survey, and a geologist photographing rock layers. The common thread is purpose: the information is gathered deliberately so it can be analyzed, compared, and used to test ideas.

Qualitative vs. Quantitative Data

Scientific data falls into two broad camps. Quantitative data is anything expressed as a number: temperature readings, reaction times, population counts, distances. It lends itself to mathematical analysis and statistical testing. Qualitative data, by contrast, captures qualities and descriptions rather than quantities. Open-ended survey responses, observational field notes, interview transcripts, and images all fall into this category.

Neither type is inherently better. A clinical trial might measure blood pressure (quantitative) while also recording how patients describe their pain (qualitative). Both contribute evidence, just through different lenses. The choice depends on the question being asked.
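
To make the distinction concrete, here is a minimal sketch (in Python, with hypothetical field names and values) of how a single clinical-trial record might hold both kinds of data at once.

```python
from dataclasses import dataclass

@dataclass
class TrialRecord:
    """One hypothetical clinical-trial observation mixing both data types."""
    patient_id: str
    systolic_bp: int          # quantitative: a number, ready for statistical analysis
    reaction_time_ms: float   # quantitative
    pain_description: str     # qualitative: a description rather than a quantity

record = TrialRecord(
    patient_id="P-017",
    systolic_bp=128,
    reaction_time_ms=342.5,
    pain_description="dull ache in the lower back, worse in the morning",
)
print(record)
```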

How Scientists Collect Data

Data collection methods generally fall into three categories based on how the information is generated.

  • Observational data comes from watching and recording things as they naturally occur. A wildlife biologist tracking migration patterns or an astronomer cataloging stars is collecting observational data. Because researchers don’t control the conditions, an observation can’t simply be rerun the way an experiment can, and it’s often harder to establish cause-and-effect relationships from observational data alone.
  • Experimental data comes from controlled tests where researchers deliberately change one variable and measure the result. Experiments can be repeated under slightly different conditions to identify trends. Analyzing experimental data often relies on underlying physical or chemical models rather than purely statistical techniques.
  • Simulation data is generated by computer models that mimic real-world systems. Climate projections, for example, come from simulations that adjust variables like carbon dioxide levels to predict future temperatures. Simulation is now recognized as a third major research methodology alongside observation and experimentation, because it lets scientists explore cause and effect at a level that isn’t possible through other means (a toy sketch of the idea follows this list).

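As a loose illustration of the simulation idea (not any real climate model), the sketch below generates data by sweeping a single input of a made-up model and recording the output. Every constant and formula here is an assumption for illustration only.

```python
import random

def toy_temperature_model(co2_ppm: float) -> float:
    """Hypothetical toy model: output warms with CO2, plus random variability.
    The coefficients are invented for illustration, not real climate physics."""
    baseline = 14.0                # assumed baseline global mean temperature (°C)
    sensitivity = 0.01             # assumed °C of warming per ppm above 280
    noise = random.gauss(0, 0.2)   # stand-in for internal variability
    return baseline + sensitivity * (co2_ppm - 280) + noise

# Generate simulation data by deliberately varying one input.
for ppm in range(300, 601, 50):
    print(f"CO2 = {ppm} ppm -> simulated mean temperature = {toy_temperature_model(ppm):.2f} °C")
```
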
The tools used to collect data range from simple (a ruler, a notebook, direct observation using the senses) to extraordinarily complex (gene sequencers, particle accelerators, satellite arrays). Surveys remain one of the most common instruments, especially for documenting perceptions, attitudes, beliefs, or knowledge within a defined group of people.

From Raw Data to Useful Knowledge

Data fresh from collection is called raw data. It’s unorganized and in its original form, often messy and difficult to interpret. A genome sequencer, for instance, spits out billions of short DNA fragments with no immediate meaning. Before those fragments become useful, researchers must clean, organize, and transform them through processing.

Processed data has been filtered for errors, organized into usable formats, and often summarized or analyzed. The tradeoff is flexibility: raw data can be reanalyzed in many ways, while processed data is tailored for specific interpretations. Both forms have value, which is why researchers typically archive the raw version alongside their results.
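
As a rough sketch of what that processing can look like (the readings, error code, and thresholds below are made up), this snippet filters obvious sensor errors out of raw temperature readings and summarizes what survives.

```python
# Hypothetical raw temperature readings (°C); -999.0 is an assumed sensor error code.
raw_readings = [21.4, 21.7, -999.0, 22.1, 150.3, 21.9, -999.0, 22.4]

# Step 1: filter out error codes and physically implausible values.
cleaned = [r for r in raw_readings if r != -999.0 and -60.0 <= r <= 60.0]

# Step 2: organize and summarize into a form ready for interpretation.
summary = {
    "n_raw": len(raw_readings),
    "n_kept": len(cleaned),
    "mean_c": round(sum(cleaned) / len(cleaned), 2),
}
print(summary)  # {'n_raw': 8, 'n_kept': 5, 'mean_c': 21.9}
```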

A useful way to think about the journey from raw numbers to real understanding is the data-information-knowledge-wisdom framework. Data on its own has little meaning. When you add context and structure, data becomes information. When you discover patterns and relationships across multiple pieces of information, you arrive at knowledge. And wisdom comes from deeply understanding and internalizing those patterns so you can apply them in new situations. Science, at its core, is the process of climbing that ladder.
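
One contrived way to picture the first rungs of that ladder in code (the numbers below are invented): the same values gain meaning as context and pattern-finding are layered on top.

```python
# Data: bare numbers, meaningless on their own.
data = [13.1, 13.4, 13.9, 14.2, 14.6]

# Information: the same numbers with context and structure attached.
information = {
    "measurement": "annual mean temperature (°C) at a hypothetical station",
    "years": [2019, 2020, 2021, 2022, 2023],
    "values": data,
}

# Knowledge: a pattern found across the information -- here, a simple
# year-over-year trend in this invented series.
diffs = [b - a for a, b in zip(information["values"], information["values"][1:])]
print(f"Average change: {sum(diffs) / len(diffs):+.3f} °C per year")
```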

Data as Evidence in the Scientific Method

Data plays its most important role when scientists use it to evaluate hypotheses. A researcher proposes an explanation for something, designs an experiment or observation to test it, collects data, and then asks: does this evidence support my hypothesis, or does it point in a different direction?

This process sounds straightforward, but interpreting data requires care. A single dataset rarely proves anything definitively. Scientists weigh the evidence for one explanation against another, given the data they have. One important principle is that absence of evidence is not evidence of absence. If a study fails to find a statistically significant effect, that doesn’t necessarily mean the effect doesn’t exist. It may mean the study didn’t collect enough data, or that the effect is too small to detect with the methods used.
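
A small simulation makes that last point concrete. Assume a real but modest effect exists; with too few participants, the measured difference is so noisy that a study can easily miss it. The group sizes and effect size below are invented for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

def observed_difference(n_per_group: int, true_effect: float = 0.3) -> float:
    """Difference in group means for one simulated study.
    A real effect of `true_effect` exists, but sampling noise can hide it."""
    control = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [random.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    return statistics.mean(treated) - statistics.mean(control)

# Small studies scatter widely around the true effect of 0.3; large ones cluster near it.
print("n=10 per group:  ", [round(observed_difference(10), 2) for _ in range(5)])
print("n=1000 per group:", [round(observed_difference(1000), 2) for _ in range(5)])
```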

As more data accumulates across multiple studies, the evidence for or against a hypothesis grows stronger. This is why replication matters so much. A single striking result is interesting; the same result repeated independently by different labs is convincing.

Why Data Integrity Matters

Reproducibility, the ability of independent researchers to obtain the same or similar results when repeating an experiment, is one of the hallmarks of good science. For this to work, data must be accurate, thoroughly documented, and transparent. Scientific records, including lab notebooks, protocols, and datasets, need to describe research in enough detail that someone else could reproduce it.

Transparency means honestly and openly disclosing all information related to the research when publishing it. This includes the study design, methods, materials, equipment, data analysis tools, study population, and any potential biases or conflicts of interest. Many journals now require authors to make their supporting data publicly available, recognizing that shared data strengthens the entire scientific enterprise.

A set of guidelines called the FAIR principles lays out what good data management looks like. Data should be Findable (with unique identifiers and rich descriptions so both people and computers can locate it), Accessible (retrievable through standard methods), Interoperable (formatted so it can be combined with other datasets and analyzed by common software), and Reusable (well-described enough that future researchers can work with it confidently). These standards have become increasingly important as datasets grow larger and more complex.
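
What this looks like in practice varies by discipline and repository, but a minimal metadata record attached to a dataset might capture the ingredients FAIR asks for. The field names and values below are illustrative, not any formal standard.

```python
import json

# Hypothetical metadata record accompanying a published dataset.
dataset_metadata = {
    "identifier": "doi:10.xxxx/example-dataset",  # Findable: persistent, unique ID
    "title": "Hourly stream temperature, hypothetical field site, 2022-2024",
    "description": "Raw and cleaned logger readings with quality-control flags.",
    "access_url": "https://repository.example.org/datasets/example-dataset",  # Accessible
    "format": "CSV (UTF-8) with documented column names and units",           # Interoperable
    "license": "CC-BY-4.0",                                                   # Reusable
    "variables": [{"name": "water_temp_c", "unit": "degC"}],
    "provenance": "Logger model, calibration, and processing steps described in the README.",
}
print(json.dumps(dataset_metadata, indent=2))
```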

The Scale of Modern Scientific Data

The volume of data generated worldwide is projected to reach 181 zettabytes in 2025, up from 120 zettabytes just a few years prior. Scientific research contributes a significant share of that growth. Genomics is a clear example: sequencing a single human genome produces roughly 200 gigabytes of raw data, and researchers now sequence thousands of genomes in a single project. Software tools called aligners map individual DNA fragments onto a reference genome, and then variant callers identify the spots where one person’s genome differs from others. Those differences may indicate disease risk or suggest which medication would work best for a specific patient.
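
Back-of-the-envelope arithmetic shows why that scale matters. Taking the rough 200-gigabyte-per-genome figure above, a hypothetical project sequencing 5,000 genomes already sits at about a petabyte of raw data.

```python
GB_PER_GENOME = 200         # rough raw-output figure cited above
genomes_in_project = 5000   # hypothetical project size ("thousands of genomes")

total_gb = GB_PER_GENOME * genomes_in_project
print(f"{genomes_in_project:,} genomes x {GB_PER_GENOME} GB "
      f"= {total_gb:,} GB ≈ {total_gb / 1_000_000:.1f} PB of raw data")
```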

Artificial intelligence systems are increasingly used to interpret these massive datasets, diagnosing diseases at early stages or predicting risk based on genomic patterns. Similar data-intensive approaches drive climate modeling, particle physics, astronomy, and dozens of other fields where the sheer volume of information exceeds what any human could analyze manually. The ability to collect, process, share, and reuse data at this scale is reshaping what science can accomplish.