De-identification is the process of removing or altering personal information in a dataset so that the people described in it can no longer be easily identified. It’s most commonly applied to health records, research data, and other sensitive datasets that organizations want to share or analyze without exposing anyone’s identity. In the United States, the concept is closely tied to HIPAA (the Health Insurance Portability and Accountability Act), which lays out specific rules for how health data must be stripped of identifying details before it can be used freely.
The core idea is straightforward: if a hospital wants to share patient records with researchers, it first removes names, addresses, birth dates, and other details that could trace back to a specific person. What remains is data that’s still useful for analysis but far harder to connect to any individual.
How De-Identification Differs From Anonymization
These two terms get used interchangeably, but they describe different levels of protection. De-identification means stripping out explicit identifiers like names, phone numbers, and Social Security numbers. It’s a necessary first step, but it doesn’t guarantee that someone couldn’t piece together who a record belongs to using the remaining details. A dataset listing a 92-year-old male admitted to a rural hospital on a specific date might only match one person in that community, even without a name attached.
Anonymization goes further. It aims to make re-linking a record to an individual virtually impossible, even with outside knowledge. De-identification can be thought of as a preliminary step toward anonymization. Whether the de-identification process actually achieves true anonymity depends on how much residual information remains and how determined someone might be to re-identify the data. In North America, “de-identified” is the standard term; in Europe and other regions, “anonymized” is more common, though legally they carry different weight.
There’s also pseudonymization, where identifying details are replaced with fake but consistent placeholders. A patient named John Smith might become “Patient 4782” throughout a dataset. The original link between the pseudonym and the real identity still exists somewhere, locked behind access controls. This makes pseudonymized data reversible, which is the key distinction from true anonymization.
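The mechanics of consistent pseudonym assignment can be sketched in a few lines. The function name, prefix, and counter scheme below are illustrative, not part of any standard:

```python
import itertools

def make_pseudonymizer(prefix="Patient"):
    """Return a function mapping each real identity to a stable pseudonym.

    The mapping table is the re-identification key: under pseudonymization
    it still exists, and must be stored separately behind access controls.
    """
    counter = itertools.count(1)
    mapping = {}

    def pseudonymize(name):
        if name not in mapping:
            mapping[name] = f"{prefix} {next(counter)}"
        return mapping[name]

    return pseudonymize, mapping

pseudonymize, key_table = make_pseudonymizer()
pseudonymize("John Smith")  # the same input always yields the same pseudonym
pseudonymize("Jane Doe")
pseudonymize("John Smith")
```

Because `key_table` survives, anyone with access to it can reverse the process, which is exactly why pseudonymized data does not count as anonymized.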
The Two HIPAA Methods
U.S. health privacy law recognizes two paths to de-identification, both defined under HIPAA’s Privacy Rule.
Safe Harbor
The Safe Harbor method is the more prescriptive of the two. It requires the removal of 18 specific categories of identifiers from a dataset:
- Names
- Geographic data smaller than a state (street address, city, county, ZIP code), though the first three digits of a ZIP code can stay if that three-digit zone contains more than 20,000 people
- All elements of dates directly related to an individual (birth date, admission date, discharge date, date of death) other than the year. All ages over 89 must be grouped into a single “90 or older” category
- Phone numbers
- Fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate and license numbers
- Vehicle identifiers, including license plates
- Device identifiers and serial numbers
- URLs
- IP addresses
- Biometric identifiers (fingerprints, voiceprints)
- Full-face photographs or comparable images
- Any other unique identifying number, characteristic, or code
If all 18 categories are removed and the organization has no actual knowledge that the remaining information could identify someone, the data qualifies as de-identified under Safe Harbor. It’s a checklist approach: follow the rules, and the data is considered compliant.
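Two of the trickier Safe Harbor rules, the ZIP code truncation and the 90-or-older age cap, can be sketched roughly as follows. The function names are made up, and the sparse-ZIP set is illustrative only; a real implementation would derive the restricted three-digit zones from current Census population data:

```python
# Three-digit ZIP prefixes whose combined population is 20,000 or fewer
# must be replaced with "000". This set is illustrative; the authoritative
# list comes from Census Bureau population figures.
SPARSE_ZIP3 = {"036", "059", "102", "203", "556", "692", "821", "823", "878", "879"}

def safe_harbor_zip(zip_code):
    """Keep only the first three ZIP digits, or '000' for sparse zones."""
    prefix = zip_code[:3]
    return "000" if prefix in SPARSE_ZIP3 else prefix

def safe_harbor_age(age):
    """Report exact ages up to 89; group everything above as '90 or older'."""
    return "90 or older" if age >= 90 else str(age)
```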
Expert Determination
The second method relies on a qualified statistical or scientific expert who analyzes the dataset and certifies that the risk of identifying any individual is “very small.” This approach is more flexible. Instead of mechanically removing a fixed list of fields, the expert evaluates the actual data, considers what outside information an attacker might have access to, and applies statistical techniques to reduce identification risk. It’s more work, but it can preserve more useful detail in the final dataset because it’s tailored to the specific situation rather than following a one-size-fits-all checklist.
Common Techniques Used in Practice
The actual work of de-identifying data involves several practical techniques, often used in combination.
Suppression is the simplest: sensitive data elements are removed or redacted entirely. A column of Social Security numbers, for instance, gets deleted from the dataset. Generalization replaces precise values with broader categories. An exact age of 47 becomes “40 to 49,” or a specific city becomes a state. Masking swaps sensitive data with fictitious but realistic-looking values. A real name gets replaced with a fake one, preserving the structure of the data without exposing the real person.
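A minimal sketch of all three techniques applied to a single record; the field names and the pool of fake names are invented for illustration:

```python
import random

def deidentify_record(record):
    """Apply suppression, generalization, and masking to one patient record."""
    out = dict(record)
    # Suppression: remove the SSN field entirely.
    out.pop("ssn", None)
    # Generalization: replace an exact age with a ten-year band.
    decade = (record["age"] // 10) * 10
    out["age"] = f"{decade} to {decade + 9}"
    # Masking: swap the real name for a fictitious but realistic-looking one.
    out["name"] = random.choice(["Alex Rivera", "Sam Chen", "Pat Okafor"])
    return out

record = {"name": "John Smith", "age": 47, "ssn": "123-45-6789", "diagnosis": "asthma"}
deidentify_record(record)
# age 47 becomes "40 to 49"; the SSN is gone; the name is fictitious
```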
More sophisticated approaches come from the field of statistical disclosure control. One widely used framework is called k-anonymity, which ensures that every combination of identifying characteristics in a dataset is shared by at least k records. If k equals 5, then any combination of age, gender, and ZIP code in the dataset applies to at least five people, so no individual can be singled out from those traits alone. An attacker trying to match someone could only narrow it down to a 1-in-5 chance at best.
K-anonymity has known weaknesses, though. If all five people sharing the same characteristics also share the same sensitive medical condition, knowing someone is in that group reveals their diagnosis. Two extensions address this. L-diversity requires that each group of matching records contains at least l different values for any sensitive attribute, so you can’t infer someone’s condition just from their group. T-closeness goes further, ensuring that the distribution of sensitive values within each group closely mirrors the distribution across the entire dataset, preventing attackers from learning anything meaningful from the group’s composition.
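Both properties are straightforward to check on a table of records. A rough sketch, assuming each record is a dict and the analyst has chosen which columns are quasi-identifiers and which are sensitive (the function names here are illustrative):

```python
from collections import defaultdict

def group_by_quasi_identifiers(records, quasi_ids):
    """Group records by their combination of quasi-identifier values."""
    groups = defaultdict(list)
    for r in records:
        key = tuple(r[q] for q in quasi_ids)
        groups[key].append(r)
    return groups

def is_k_anonymous(records, quasi_ids, k):
    """Every quasi-identifier combination must appear in at least k records."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return all(len(g) >= k for g in groups.values())

def is_l_diverse(records, quasi_ids, sensitive, l):
    """Each group must contain at least l distinct sensitive values."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return all(len({r[sensitive] for r in g}) >= l for g in groups.values())
```

Note that this simple version of l-diversity counts distinct values; the formal definition asks for l “well-represented” values, which is stricter.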
Differential Privacy
A newer approach that’s gained significant traction is differential privacy. Rather than modifying the data itself, it adds carefully calibrated random noise to the results of any analysis performed on the data. The mathematical guarantee is powerful: the output of a query is nearly identical whether or not any single individual’s data is included in the dataset. This means that participating in the dataset doesn’t meaningfully increase your privacy risk.
The amount of noise added is controlled by a parameter called epsilon. A smaller epsilon means more noise and stronger privacy, but less precise results. A larger epsilon gives more accurate answers at the cost of weaker privacy protection. Organizations must balance these tradeoffs based on how sensitive the data is and how precise the analysis needs to be. In March 2025, the National Institute of Standards and Technology (NIST) finalized updated guidelines to help organizations evaluate and implement differential privacy claims more consistently.
Why De-Identification Matters
The primary use case is enabling data sharing without compromising individual privacy. In medical research, this is especially consequential. Clinical trial data, for example, contains detailed health information that other researchers could use to verify findings, detect unreported side effects, or pursue entirely new studies. Many funding agencies, including the NIH and the Wellcome Trust, now require that data from the projects they support be made available to other researchers. De-identification is what makes that sharing legally and ethically possible.
Privacy laws in most jurisdictions allow de-identified health data to be used and shared for secondary purposes without requiring individual consent from each patient. This dramatically lowers the barrier to large-scale health research. Without de-identification, much of the data generated by hospitals, insurers, and research institutions would sit locked away, unusable for the studies that could benefit public health.
Re-Identification Risk Is Real but Manageable
No de-identification method is perfect. The concern is always that someone with enough outside information could re-identify individuals in a supposedly safe dataset. The practical question is how likely that is.
A 2025 study in the Journal of the American Medical Informatics Association analyzed re-identification risk in a cancer registry dataset of 400,000 patients. Using a frequency-based attack (where an adversary tries to match patterns in the de-identified data to known individuals), the researchers found the risk was approximately 0.0002, meaning roughly 2 patients out of every 10,000 were potentially identifiable. When stronger grouping parameters were applied, that number dropped even further.
These numbers suggest that well-executed de-identification dramatically reduces risk, though it never eliminates it entirely. The residual risk is why the distinction between de-identification and true anonymization matters. De-identified data is far safer than raw data, but organizations handling especially sensitive information may need to layer multiple techniques, restrict access to the data, or apply differential privacy on top of traditional de-identification to bring the risk down to acceptable levels.
How GDPR Treats De-Identified Data
Under Europe’s General Data Protection Regulation, truly anonymous data falls outside the regulation’s scope entirely. Recital 26 of the GDPR states that data protection principles do not apply to information that cannot be linked to an identifiable person. The catch is that the bar for “anonymous” under GDPR is high. If there is any reasonable means by which the data could be re-linked to an individual, considering all available technology and methods, the data is still considered personal data and GDPR still applies.
Pseudonymized data, where an individual’s identity is replaced with a code but could theoretically be reversed, explicitly remains within GDPR’s scope. This is a meaningful difference from U.S. law, where HIPAA’s Safe Harbor method treats data as de-identified once the 18 identifier categories are removed, regardless of whether re-identification might be theoretically possible through other means. Organizations operating across both jurisdictions need to meet the stricter standard to stay compliant everywhere.

