What Is Erroneous or Flawed Data: Causes and Consequences

Erroneous or flawed data is any information that is inaccurate, incomplete, or unreliable enough to produce wrong conclusions when used. This includes everything from a typo in a spreadsheet to a sensor reading that’s wildly off, to survey responses entered into the wrong field. The distinction between “erroneous” and “flawed” is subtle: erroneous data contains specific, identifiable mistakes, while flawed data is a broader label for information so unreliable it should be excluded from analysis entirely. In practice, the two terms overlap and are often used interchangeably.

How “Erroneous” and “Flawed” Differ

The National Institute of Standards and Technology (NIST) draws a useful line between the two. In their thermodynamics databases, data that “cannot be repaired or are clearly erroneous for unknown reasons” gets flagged as flawed and pre-rejected from all calculations. The logic is straightforward: if a data point deviates so far from every other reliable source that it would skew results, it gets pulled out before it can do damage.

Flawed data is typically identified by comparing multiple sources for the same measurement. If a value sits far outside what dozens of other sources report, something went wrong, even if no one can pinpoint exactly what. Erroneous data, by contrast, often has a traceable cause: a miscalibrated instrument, a transposed digit, a formula error. Think of “erroneous” as describing a specific mistake and “flawed” as describing data that’s simply too unreliable to trust, regardless of the reason.

Common Types of Data Errors

Data quality problems fall into several categories, each with different causes and consequences:

  • Inaccurate data: values that are flat-out wrong, like a patient’s weight recorded as 1,800 pounds instead of 180.
  • Incomplete data: missing fields or unanswered questions that leave gaps in the picture.
  • Duplicate data: the same record entered twice, inflating counts or skewing averages.
  • Inconsistent formatting: dates written as “01/05/2024” in one system and “2024-05-01” in another, making them impossible to compare without cleanup.
  • Stale data: information that was accurate once but has since changed, like an outdated phone number or a product price from last year.
  • Orphaned data: records that have lost their connection to related data, like an order linked to a customer profile that no longer exists.

Any of these can exist in a dataset without anyone noticing, especially when the volume is large enough that no single person reviews every entry.

Where Errors Enter the Pipeline

Mistakes can creep in at every stage of working with data, not just at the beginning. The Regional Educational Laboratory Central identifies three critical stages where errors are introduced.

During data collection, the most common problems are unanswered questions, responses marked in the wrong box, handwriting that can’t be read, and values that fall outside any expected range. These are the errors closest to the source, and they’re often the hardest to catch later because there’s no “correct” version to compare against.

During data entry and cleaning, values get typed into the wrong field, records are accidentally deleted or duplicated, and outliers from the original instrument carry over uncorrected. One particularly insidious problem: values that were incorrectly changed during a previous round of cleaning. Someone tried to fix the data and introduced a new error in the process.

During analysis, data can be incorrectly extracted from a database, miscoded, or scrambled by sorting errors in spreadsheets. Sorting one column while leaving its neighbors in place, for instance, shifts that column out of alignment with the rest, silently linking every row to the wrong values. By this stage, the errors may be invisible because the analyst assumes earlier steps were done correctly.
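The sorting failure is easy to reproduce. In this toy example (invented names and scores), sorting the score column by itself quietly reassigns two people's scores, while sorting the rows as units preserves the pairing:

```python
# Names paired with scores; the pairing is the data's real structure.
names  = ["Avery", "Blake", "Casey"]
scores = [72, 95, 88]

# Correct: sort whole rows together, keeping each name with its score.
paired = sorted(zip(names, scores), key=lambda row: row[1])

# The spreadsheet mistake: sort the score column alone.
scores_only_sorted = sorted(scores)
misaligned = list(zip(names, scores_only_sorted))

print(paired)      # [('Avery', 72), ('Casey', 88), ('Blake', 95)]
print(misaligned)  # [('Avery', 72), ('Blake', 88), ('Casey', 95)]
# Blake's and Casey's scores are now swapped -- and nothing flags it.
```

The misaligned result looks perfectly plausible, which is exactly why this class of error survives review.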

How Flawed Data Affects Scientific Research

In clinical and translational research, flawed data is a leading cause of paper retractions. A scoping review published in the Journal of Clinical and Translational Science examined 884 retraction notices and found that 42% described problems with generating or acquiring data, while 28% described problems with preparing or analyzing data. These aren’t rare edge cases. They represent a substantial share of the research that journals have had to publicly withdraw.

The specific errors behind these retractions range from simple to devastating. Some involved basic data entry mistakes, like positive results entered as negative. Others involved misidentified study subjects: cell lines that turned out to be the wrong type, transgenic mice that were mislabeled, or entire patient cohorts where cases were missed. In several instances, researchers simply lost their data and could no longer verify their findings.

Analysis errors were equally varied. In some retracted papers, experimental and control groups were accidentally switched, completely reversing the study’s conclusions. Others used inappropriate statistical methods or failed to account for known biases. One retracted study miscoded a binary variable, essentially swapping “yes” and “no” throughout the dataset. A single flipped variable was enough to invalidate the entire paper.

The consequences extend well beyond academic embarrassment. Incorrect findings that go unnoticed for months or years can influence clinical practice, drug development, and public health policy before anyone realizes the underlying data was wrong.

Real-World Consequences Beyond Research

Flawed data costs real money. Gartner estimates that poor data quality costs organizations at least $12.9 million per year on average. That figure accounts for bad decisions made on bad information, wasted labor spent cleaning up errors, and lost opportunities when teams can’t trust their own numbers.

Some of the most visible recent failures involve AI systems trained on or generating flawed data. In 2024, New York City launched MyCity, a chatbot meant to help residents navigate business regulations and housing policy. Investigators found it falsely told users that business owners could legally take a cut of their workers’ tips, fire employees who reported sexual harassment, and serve food contaminated by rodents. Every one of those claims was wrong, and any business owner who followed the advice could have faced legal consequences.

Air Canada faced a similar problem when its virtual assistant gave a grieving passenger incorrect information about bereavement fares. The airline was ultimately ordered to pay damages. In both cases, the systems presented bad information with complete confidence, and users had no way to distinguish it from accurate guidance.

The pattern extends to AI-generated content more broadly. In May 2025, the Chicago Sun-Times and Philadelphia Inquirer published a summer reading section that recommended books that don’t exist. The author had used AI to generate the list without fact-checking it. The titles sounded plausible and were attributed to real, well-known authors, but the books themselves were fabricated.

How Flawed Data Gets Caught

Detecting bad data relies on a combination of automated checks and human review. Three of the most common automated techniques are format checks, range checks, and consistency checks.

Format checks verify that data follows an expected pattern. A date field should contain something that looks like a date, not a phone number. Range checks flag values that fall outside realistic boundaries: a human body temperature of 250°F, for instance, or a negative age. Consistency checks compare related fields to make sure they agree with each other. If someone’s date of birth says they’re 12 years old but their employment status says “retired,” something is wrong.
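The three checks can be sketched directly from the examples above. This is a minimal illustration, not a production validator; the field names and thresholds are assumptions chosen to match the examples in the text.

```python
import re
from datetime import date

def format_check(value: str) -> bool:
    """Format check: a date field must match the YYYY-MM-DD pattern."""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is not None

def range_check(temp_f: float) -> bool:
    """Range check: human body temperature (degrees F) must fall
    within survivable bounds."""
    return 90.0 <= temp_f <= 110.0

def consistency_check(record: dict, as_of: date) -> bool:
    """Consistency check: related fields must agree with each other.
    A minor cannot plausibly be retired."""
    birth = date.fromisoformat(record["date_of_birth"])
    age = (as_of - birth).days // 365  # rough age in years
    return not (record["employment_status"] == "retired" and age < 18)

print(format_check("2024-05-01"))   # True  -- looks like a date
print(range_check(250.0))           # False -- 250 F is not a body temperature
print(consistency_check(
    {"date_of_birth": "2013-06-01", "employment_status": "retired"},
    as_of=date(2025, 1, 1),
))                                  # False -- a 12-year-old marked retired
```

Each check is cheap on its own; their value comes from running all of them on every record, since each catches a class of error the others miss.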

These automated methods catch the obvious errors. Subtler problems, like a value that’s plausible but wrong, or a systematic bias introduced by a miscalibrated instrument, require comparing data against independent sources. This is the approach NIST uses: if a measurement disagrees sharply with multiple other reliable datasets measuring the same thing, it gets flagged as flawed regardless of whether anyone can explain why.
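NIST's actual screening procedure is not reproduced here, but the underlying idea, flagging a value that deviates sharply from many independent measurements of the same quantity, can be illustrated with a robust outlier test based on the median absolute deviation. The readings below are invented; the threshold of 5 is an arbitrary choice for the sketch.

```python
from statistics import median

def flag_outliers(values, threshold=5.0):
    """Flag values whose deviation from the median exceeds `threshold`
    times the median absolute deviation (a robust spread estimate
    that, unlike the standard deviation, is not inflated by the
    outlier itself)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values agree exactly; anything different is suspect.
        return [v for v in values if v != med]
    return [v for v in values if abs(v - med) / mad > threshold]

# Nine sources agree closely; one reading is far off.
readings = [4.18, 4.19, 4.17, 4.18, 4.20, 4.18, 4.17, 4.19, 4.18, 6.90]
print(flag_outliers(readings))  # [6.9]
```

No single reading is "wrong" on its face; only the comparison across sources reveals that one of them cannot be trusted, which mirrors the pre-rejection logic described above.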

For organizations handling large datasets, building validation rules into the point of entry is far more effective than trying to clean data after the fact. Requiring specific formats, setting acceptable ranges, and flagging duplicates in real time prevents many errors from entering the system at all. The errors that slip through automated checks are the ones that require experienced analysts to spot: people who know what the data should look like and can recognize when something feels off even if it passes every technical test.