Evaluating data means systematically checking whether the information in front of you is accurate, complete, and trustworthy enough to base decisions on. Whether you’re reviewing a spreadsheet at work, reading a study, or vetting a data source for a project, the process breaks down into a few core steps: assessing data quality, scrutinizing where the data came from, checking for bias, and understanding what the numbers actually tell you.
The Six Dimensions of Data Quality
Professional data governance groups evaluate data along six standard dimensions. These give you a structured way to spot problems instead of relying on gut feeling.
- Completeness: What percentage of fields are actually filled in? Blank or null values reduce the reliability of any analysis. A dataset missing 40% of responses to a key question is fundamentally different from one missing 2%.
- Accuracy: Does the data reflect what’s actually happening in the real world? A customer database listing someone at an address they left three years ago fails this test. Accuracy is measured as the percentage of entries that pass verification rules.
- Consistency: Does the same information match across different parts of the dataset or across linked systems? If one table says a product launched in March and another says April, you have a consistency problem.
- Validity: Does each value conform to the rules for that field? A date field containing “13/32/2024” is invalid on its face. Validity is measured as the proportion of entries that conform to the format and range rules for their fields.
- Uniqueness: Are there duplicate records inflating your count? Compare the number of real-world things you’re tracking to the number of records in the dataset. Duplicate entries skew totals and averages.
- Timeliness: How old is the data, and does that matter for your purpose? Census data from five years ago might be fine for broad demographic trends but useless for tracking a fast-growing neighborhood.
You don’t need to formally score every dimension for every project, but running through these six questions quickly will catch the most common problems before they compound into bad conclusions.
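That quick run-through can be sketched as a small screening script. This is a minimal sketch using Python's standard library; the field names, the staleness cutoff, and the records themselves are illustrative assumptions, not from any real dataset.

```python
from datetime import date

# Illustrative records: field names and values are assumptions for the sketch.
records = [
    {"id": 1, "email": "a@example.com", "signup": date(2024, 3, 1)},
    {"id": 2, "email": None,            "signup": date(2019, 6, 5)},
    {"id": 2, "email": "b@example.com", "signup": date(2024, 4, 2)},  # duplicate id
]

# Completeness: share of non-null values in a key field.
filled = sum(1 for r in records if r["email"] is not None)
completeness = filled / len(records)

# Uniqueness: compare distinct ids to the record count.
unique_ids = len({r["id"] for r in records})
duplicates = len(records) - unique_ids

# Timeliness: flag records older than a cutoff relevant to your purpose.
stale = sum(1 for r in records if r["signup"].year < 2023)

print(f"completeness={completeness:.0%} duplicates={duplicates} stale={stale}")
# → completeness=67% duplicates=1 stale=1
```

A real profiling pass would run checks like these per column, but even this level of screening surfaces the duplicate id and the stale record immediately.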
Evaluating Where the Data Came From
The quality of data is inseparable from the quality of its source. A widely used framework for evaluating sources is the CRAAP test, developed in academic libraries but useful far beyond them. It checks five things: currency, relevance, accuracy, authority, and purpose.
Currency asks whether the data is recent enough for your needs. Some fields move fast (technology adoption rates, disease prevalence during a pandemic), and a two-year-old dataset can be dangerously outdated. Others, like geological surveys, hold up for decades. Relevance is straightforward: does this data actually answer your question, or are you stretching it to fit?
Authority and accuracy go hand in hand. Look at who collected and published the data. Government agencies (.gov), academic institutions (.edu), and peer-reviewed journals carry more weight than anonymous blog posts or commercially motivated reports. Check whether the methodology is documented. Can you verify the findings against another independent source? If the data comes with no explanation of how it was gathered, treat it with skepticism.
Purpose is the one people most often skip. Ask why this data was collected and published. A pharmaceutical company’s internal trial data, a political advocacy group’s survey, and a university’s independent study might all report numbers on the same topic, but their motivations shape what they measure and how they frame it. Data created to inform is different from data created to persuade or sell.
Cleaning Data Before You Analyze It
Raw data almost always contains errors. Before drawing any conclusions, you need to screen for anomalies, diagnose what caused them, and apply corrections. A practical cleaning checklist covers four domains: data integrity, consistency, accuracy, and completeness. That means creating a data dictionary (a reference defining what each field should contain), quantifying how much data is missing, identifying outlier values, and checking that values make sense across related variables.
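The data-dictionary step in that checklist can be made concrete with a small validity screen. A minimal sketch; the fields, types, and allowed ranges here are invented for illustration, not a standard schema.

```python
# A data dictionary: each field's expected type and allowed range or values
# (the fields and rules are illustrative assumptions).
data_dictionary = {
    "age":    {"type": int, "min": 0, "max": 120},
    "status": {"type": str, "allowed": {"active", "lapsed", "closed"}},
}

def screen(record):
    """Return a list of (field, problem) pairs for one record."""
    problems = []
    for field, rule in data_dictionary.items():
        value = record.get(field)
        if value is None:
            problems.append((field, "missing"))
        elif not isinstance(value, rule["type"]):
            problems.append((field, "wrong type"))
        elif "min" in rule and not (rule["min"] <= value <= rule["max"]):
            problems.append((field, "out of range"))
        elif "allowed" in rule and value not in rule["allowed"]:
            problems.append((field, "not an allowed value"))
    return problems

print(screen({"age": 900, "status": "active"}))  # age fails the range check
print(screen({"age": 34}))                       # status is missing
```

Running every record through a screen like this quantifies missingness and flags invalid values in one pass, which is most of the diagnostic half of cleaning.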
Handling Missing Values
The simplest approach is to delete rows with missing data, sometimes called complete-case analysis. This works when the missing values are truly random and represent a small fraction of your total. But deleting data means losing information, and if the missingness follows a pattern (say, older respondents skipping a question about internet use), you’ll introduce bias by cutting them out.
The alternative is imputation: filling in missing values using information from the rest of the dataset. Single imputation replaces blanks with the mean, median, or mode of that column. It’s quick but can underestimate variability. Multiple imputation is more sophisticated, generating several plausible versions of the complete dataset from the patterns in observed data and combining the results. For serious analysis, multiple imputation generally produces more reliable outcomes.
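The two simple approaches, complete-case deletion and single imputation, can be sketched in a few lines; multiple imputation needs a statistics package, so only the simple cases are shown. The values are illustrative.

```python
from statistics import median

# A column with missing entries (None); the values are made up.
ages = [34, None, 29, 41, None, 38]

# Complete-case analysis: simply drop the missing entries.
complete_cases = [a for a in ages if a is not None]

# Single imputation: replace each missing value with the observed median.
fill = median(complete_cases)
imputed = [a if a is not None else fill for a in ages]

print(complete_cases)  # [34, 29, 41, 38]
print(imputed)         # [34, 36.0, 29, 41, 36.0, 38]
```

Note what single imputation does to the data: every filled-in value is identical, so the column's spread shrinks artificially. That is the variability underestimation mentioned above, and the problem multiple imputation is designed to avoid.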
Handling Outliers
Not every extreme value is an error. The first step is determining whether an outlier reflects a data entry mistake or a genuine observation. If it’s clearly wrong (a human listed as 900 years old), you can correct or remove it. If it’s real but extreme (one customer spending 50 times the average), run your analysis both with and without the outlier to see how much it affects results. This sensitivity analysis tells you whether your conclusions depend on a single unusual data point. Another option is winsorization, which replaces extreme values with the nearest non-extreme value, keeping the data point in the analysis without letting it dominate.
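Both techniques, the sensitivity analysis and winsorization, fit in a short sketch. The spending figures and the cap limits are illustrative assumptions.

```python
from statistics import mean

# Illustrative spending figures with one genuine but extreme customer.
spend = [120, 95, 140, 110, 105, 5500]

# Sensitivity analysis: run the summary with and without the outlier.
with_outlier = mean(spend)
without_outlier = mean(sorted(spend)[:-1])
print(round(with_outlier, 1), round(without_outlier, 1))  # 1011.7 114.0

# Winsorization (simple form): cap values at chosen low/high limits,
# keeping the data point in the analysis without letting it dominate.
def winsorize(values, low, high):
    return [min(max(v, low), high) for v in values]

capped = winsorize(spend, low=95, high=140)
print(round(mean(capped), 1))  # 118.3
```

The gap between 1011.7 and 114.0 is exactly what the sensitivity analysis is meant to expose: one observation is driving the entire average, so any conclusion built on that mean depends on a single customer.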
Checking for Bias in the Data
Bias is systematic error that pushes results in one direction. It’s the single biggest threat to data quality because, unlike random noise, it doesn’t cancel out with a larger sample. It compounds.
Selection bias occurs when the people or things in your dataset aren’t representative of the group you’re trying to understand. A customer satisfaction survey sent only by email misses customers without email access. An employee engagement survey with a 20% response rate likely overrepresents people who feel strongly, positive or negative, and underrepresents the indifferent middle. Survivorship bias is a specific form of this: studying only the data that “survived” to be measured. Analyzing successful startups to find common traits ignores all the failed companies that shared those same traits. The classic example is evaluating bullet holes on returning warplanes. The holes show where planes can take damage and survive, not where the vulnerabilities are.
Measurement bias creeps in when the tool or method of data collection systematically distorts results. A poorly worded survey question that leads respondents toward a particular answer, a sensor that consistently reads slightly high, or a diagnostic test that misses certain cases all introduce measurement bias. Look at how the data was actually collected, not just what it claims to represent.
Confirmation bias operates at the analysis stage. If you approach data looking for evidence to support a conclusion you’ve already reached, you’ll find it, because you’ll unconsciously focus on supporting data points and explain away contradictory ones. The best defense is to actively look for evidence that disproves your hypothesis before looking for evidence that supports it.
Understanding Statistical Significance and Effect Size
When data comes with statistical analysis, two numbers matter most: the p-value and the effect size. They answer different questions, and confusing them is one of the most common mistakes in data evaluation.
A p-value tells you how likely you’d be to see results this extreme if there were truly no effect. The traditional threshold is 0.05: if no real effect existed, results this extreme would occur by chance less than 5% of the time. But in 2016 the American Statistical Association issued a formal statement with six principles clarifying what p-values actually mean, and the core message is this: a p-value does not measure the probability that your hypothesis is true, does not measure the size or importance of an effect, and should never be the sole basis for a conclusion.
Effect size tells you how big the difference or relationship actually is. A study with thousands of participants can produce a tiny p-value for a difference so small it’s meaningless in practice. In medical research, for example, a relative risk of 1.25 for a hard outcome like mortality is considered a realistic and meaningful difference for a single intervention. The practical question is always: is this effect large enough to matter? That depends on context, and no single threshold applies universally. Always look for the effect size alongside a confidence interval, which tells you the range of plausible values. A confidence interval that’s wide signals imprecision, and one that crosses zero (for a difference) or one (for a ratio) means the data can’t confirm an effect exists at all.
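The gap between statistical and practical significance shows up clearly in a worked example. This is a sketch with made-up summary numbers (two groups of 100,000, means 100.2 vs. 100.0, standard deviation 5); the two-sample z-test and Cohen's d formulas are standard, but the data are assumptions.

```python
import math

# Made-up summary statistics: huge samples, tiny difference.
n1 = n2 = 100_000
m1, m2 = 100.2, 100.0
sd = 5.0

# Two-sample z statistic and its two-sided p-value via the normal CDF.
se = math.sqrt(sd**2 / n1 + sd**2 / n2)
z = (m1 - m2) / se
p_value = math.erfc(z / math.sqrt(2))  # two-sided tail probability

# Effect size (Cohen's d): the difference in standard-deviation units.
d = (m1 - m2) / sd

print(f"z={z:.2f}  p={p_value:.2e}  d={d:.2f}")
```

Here the p-value is astronomically small, yet d = 0.04 sits far below even the conventional "small" benchmark of 0.2: the result is statistically significant and practically negligible at the same time.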
Evaluating Sample Size and Representativeness
A dataset drawn from a sample is only useful if that sample reasonably represents the larger population you care about. Two things determine this: how the sample was selected and whether it’s large enough.
Random selection is the gold standard because it gives every member of the population an equal chance of being included, which prevents selection bias from creeping in. Convenience samples (surveying whoever is easiest to reach) are cheaper and faster but often systematically miss important subgroups. If you’re evaluating someone else’s data, look for a description of how participants or observations were selected. If it’s not documented, that’s a red flag.
Sample size depends on how precise you need your estimate to be. The generally accepted margin of error in survey research is 5 to 10%. Smaller margins require larger samples. The calculation also depends on how common the thing you’re measuring is: a rare event needs a bigger sample to detect reliably than a common one. When evaluating a study or report, check whether the authors justified their sample size. If the sample is small and the claimed effect is large, be cautious. Small samples are prone to producing dramatic results that don’t hold up when replicated.
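The link between margin of error and required sample size can be sketched with the standard formula for estimating a proportion, n = z²·p(1−p)/e². A sketch, not a full power calculation; the 95% z value of 1.96 and the worst-case p = 0.5 are conventional assumptions.

```python
import math

def required_sample_size(margin_of_error, p=0.5, z=1.96):
    """Minimum n to estimate a proportion p to within the given absolute
    margin of error at roughly 95% confidence (z = 1.96), ignoring any
    finite-population correction."""
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

# Smaller margins require much larger samples:
print(required_sample_size(0.10))           # ±10 points → 97
print(required_sample_size(0.05))           # ±5 points  → 385

# A rare outcome measured to the same *relative* precision needs far more:
# pinning a 2% prevalence to within ±0.2 points (10% relative error).
print(required_sample_size(0.002, p=0.02))  # → 18824
```

Halving the margin of error roughly quadruples the required sample, since n scales with 1/e²; that is why precise estimates of rare events demand such large datasets.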
Spotting Misleading Visualizations
Charts and graphs are powerful precisely because they bypass your analytical brain and communicate patterns at a glance. That same power makes them easy to manipulate.
The most common trick is a truncated Y-axis. A bar chart comparing two values of 98 and 100 looks nearly identical with a Y-axis starting at zero, but if the axis starts at 97, the second bar appears three times as tall. Whenever a chart shows dramatic differences, check where the axis begins. A related problem is inconsistent scaling: switching between linear and logarithmic scales, or changing the intervals between tick marks, can make trends look steeper or flatter than they are. If it’s unclear whether a chart uses a linear or logarithmic scale, you may misjudge the magnitude of change by orders of magnitude.
Cherry-picked timeframes are another red flag. A stock chart showing only the last three months of gains hides the year of losses that preceded them. Always ask what happens if you zoom out. And watch for graphs where the visual proportions don’t match the underlying numbers: a pie chart where a 30% slice looks bigger than a 35% slice, or a 3D chart that distorts relative sizes through perspective, is a sign of either incompetence or an intent to mislead.
Internal and External Validity
These two concepts help you decide how much to trust a study’s findings and how far you can extend them. Internal validity asks whether the study was designed and conducted well enough to support its own conclusions. Did the researchers control for other explanations? Were participants assigned to groups randomly? Did people drop out in ways that could skew results? High internal validity means the cause-and-effect relationship the study claims is probably real, at least within the study itself.
External validity asks whether those findings apply beyond the specific conditions of the study. Research conducted exclusively on college students in one country may not generalize to older adults or different cultures. Studies that exclude people with common complications (other health conditions, concurrent medications, severe symptoms) produce clean results but may not reflect what happens in the real world, where patients rarely fit neatly into study criteria. Short-term studies of conditions that require long-term management face the same limitation.
When you’re evaluating data from a study, internal validity tells you whether to believe the results. External validity tells you whether those results apply to your situation.

