De-identifying data with a code means replacing personal identifiers like names, Social Security numbers, or medical record numbers with random substitute values, while keeping a separate key that links each code back to the original person. This lets researchers or organizations work with useful data without exposing who it belongs to, but it’s not the same as making data fully anonymous. The key still exists, which means re-identification is possible.
How Coded De-Identification Works
The process is straightforward. An organization takes a dataset containing personal information, strips out identifying details (name, address, date of birth, phone number), and replaces them with a random code. Person A becomes “X7R92,” Person B becomes “K3M41,” and so on. The codes have no inherent connection to the person. They aren’t derived from the person’s name or any other identifying trait.
What makes this different from simply deleting identifiers is that a master key exists somewhere, mapping each code back to its original identity. This key is stored separately from the coded dataset itself, usually by a designated custodian or the organization that collected the data in the first place. The people who receive and analyze the coded data typically never see the key. They work only with the coded version and have no way to figure out who’s who.
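The coding-plus-key arrangement described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the record fields, code length, and code alphabet are all assumptions made for the example, and in practice the key would live in a separate, access-controlled system rather than a variable alongside the data.

```python
import secrets
import string

# Hypothetical records; the field names are illustrative assumptions.
records = [
    {"name": "Ana Ruiz", "dob": "1990-04-12", "lab_value": 7.2},
    {"name": "Ben Okafor", "dob": "1985-11-03", "lab_value": 5.9},
]

ALPHABET = string.ascii_uppercase + string.digits

def random_code(length=5):
    """Generate a code with no relationship to the person it labels."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

key = {}            # master key: code -> identifiers (custodian only)
coded_dataset = []  # what analysts actually receive

for rec in records:
    code = random_code()
    while code in key:        # guard against the rare collision
        code = random_code()
    key[code] = {"name": rec["name"], "dob": rec["dob"]}
    coded_dataset.append({"code": code, "lab_value": rec["lab_value"]})

# Analysts see only coded_dataset; the custodian holds `key` separately.
```

The essential property is that `coded_dataset` contains nothing from which identity can be recovered, while `key`, stored elsewhere, preserves the path back when a legitimate need arises.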
This setup is useful because it preserves the ability to reconnect data to a specific person when there’s a legitimate reason to do so, like informing a research participant about a clinically significant finding or linking new data to an existing record over time.
Coded Data Is Not Truly Anonymous
This is the critical distinction most people miss. Coded data and fully anonymized data are legally and practically different things.
Fully anonymized data has no path back to the individual. The link is destroyed. No key exists. No one, not even the original data collector, can reconnect the information to a specific person. Under privacy frameworks like the GDPR, anonymized data is no longer considered personal data at all, which means the usual privacy rules stop applying to it. For data to qualify as truly anonymized, the original identifying information should be securely deleted. If that deletion doesn’t happen, the data is classified as pseudonymized (the European term for coded) rather than anonymized, and it’s still treated as personal data subject to privacy regulations.
Coded data, by contrast, retains that reconnection pathway. Harvard’s Institutional Review Board states this plainly: coded data is not de-identified. It’s better protected than raw data, but someone with access to the key could still identify individuals. That’s why coded data in research and healthcare settings still carries privacy obligations.
What HIPAA Requires for Coded Data
Under U.S. health privacy law, there are specific rules about when coded data qualifies as properly de-identified. The HIPAA Privacy Rule requires that 18 categories of identifiers be removed, including names, geographic details, dates, phone numbers, email addresses, and any other unique identifying number or code, unless the code meets certain conditions.
Those conditions are strict. The code cannot be derived from or related to information about the individual. You can’t, for example, use someone’s initials and birth year as their code, because that information could be reverse-engineered. The code must be a truly random or arbitrary value. The organization holding the data also cannot use or share the code for any purpose other than re-identification, and it cannot reveal the mechanism that links codes to identities.
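The difference between a derived code and a compliant random one can be shown concretely. The names and values below are invented for illustration; the point is the contrast between codes that can be reverse-engineered from a person's attributes and a code that carries no information at all.

```python
import hashlib
import secrets

name, birth_year = "Jane Q. Doe", 1989  # illustrative values

# NOT compliant: built from the person's own identifiers, so anyone
# who can enumerate plausible names and years can reverse it.
bad_code = "JQD" + str(birth_year)

# Also risky: a bare hash is still "derived from" information about
# the individual and can be broken by guessing inputs.
also_bad = hashlib.sha256(f"{name}{birth_year}".encode()).hexdigest()[:10]

# Closer to the rule's intent: a cryptographically random value with
# no relationship to the person. The code-to-identity mapping exists
# only in the separately held key.
good_code = secrets.token_hex(4)
```

Note that `good_code` is meaningless on its own, which is exactly the property the rule demands: without the key, the code reveals nothing.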
Sharing the key itself is treated the same as sharing the protected health information directly. If an organization re-identifies someone using the key, that reconnected information immediately falls back under full HIPAA protections.
Who Holds the Key and How It’s Secured
The security of coded data depends almost entirely on how the key is managed. Standard practice is to store the key in a separate, access-controlled location from the coded dataset. Only a small number of authorized people should have access to it.
In research settings, the key is often held by the principal investigator or a designated honest broker, someone who isn’t involved in analyzing the data. In larger systems, different organizations may each hold keys that only work for their own records. A hospital, for instance, might only be able to re-identify data from its own patients, while a government oversight body might hold a master key covering a broader dataset. The private keys used for re-identification must be kept secure, since anyone who obtains both the coded dataset and the key can identify every person in it.
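The split-custody idea can be sketched as each custodian holding only its own slice of the mapping, with a master key as the union of the slices. The identifiers and codes below are placeholders, not a real scheme.

```python
# Each custodian holds only the slice of the key covering its own records.
hospital_a_key = {"X7R92": "patient-001@hospital-a"}
hospital_b_key = {"K3M41": "patient-017@hospital-b"}

# An oversight body might hold the union: a master key over both.
master_key = {**hospital_a_key, **hospital_b_key}

def reidentify(code, key):
    """Resolve a code only if this custodian's key covers it."""
    return key.get(code)
```

With this structure, Hospital A attempting to resolve `"K3M41"` gets nothing, while the oversight body can resolve either code, mirroring the tiered access the text describes.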
When coded data is shared with outside researchers, data use agreements typically include a clause where the recipient agrees not to attempt to re-identify individuals. Federal advisory committees have recommended that these agreements explicitly require data recipients to protect privacy, refrain from re-identification attempts, and restrict access to only the people named in the original data request.
Why Coding Alone Doesn’t Eliminate Risk
Even without the key, coded datasets can carry re-identification risk through indirect identifiers: pieces of information that don’t identify someone on their own but can do so in combination. A study examining 70 publicly available clinical trial datasets found that sex appeared in 80% of them, age in about 73%, and weight and height in roughly 45% each. Other common indirect identifiers included race (41%), ethnicity (36%), and country (26%).
The danger is combinatorial. Knowing that a participant is a 34-year-old woman of a specific ethnicity living in a particular country may not seem like much, but in a small enough dataset or community, that combination could narrow the possibilities to a single person. Adding height, weight, or occupation makes this even more likely.
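This combinatorial risk can be measured directly by counting how many records share each combination of indirect identifiers, a basic form of the k-anonymity check. The records and the choice of quasi-identifier fields below are illustrative assumptions.

```python
from collections import Counter

# Illustrative coded records: no names, but several indirect identifiers.
coded = [
    {"code": "X7R92", "age": 34, "sex": "F", "country": "IS"},
    {"code": "K3M41", "age": 34, "sex": "F", "country": "US"},
    {"code": "P2Q88", "age": 34, "sex": "F", "country": "US"},
]

QUASI = ("age", "sex", "country")

# Count records per quasi-identifier combination.
counts = Counter(tuple(r[q] for q in QUASI) for r in coded)

# Any combination appearing exactly once singles out one person.
unique = [r["code"] for r in coded
          if counts[tuple(r[q] for q in QUASI)] == 1]
```

Here the 34-year-old woman in Iceland is uniquely identifiable from indirect identifiers alone, even though her record carries only a random code.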
This is why proper de-identification goes beyond just replacing names with codes. Organizations also need to assess whether the remaining data fields, taken together, could allow someone to be singled out. Techniques like generalizing ages into ranges (30 to 39 instead of 34), suppressing rare combinations, or limiting geographic detail all reduce this risk. The HIPAA framework addresses this through its “expert determination” method, where a qualified statistician must confirm that the risk of re-identification is very small before the data can be considered de-identified.
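The generalization and suppression techniques just mentioned can be sketched as simple transformations on each record. The field names and the decision to drop ZIP codes entirely are assumptions for the example; real de-identification would tune these choices to the dataset and, under the expert determination method, validate them statistically.

```python
def generalize_age(age, width=10):
    """Map an exact age to a decade range, e.g. 34 -> "30-39"."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_record(rec):
    """Coarsen quasi-identifiers; suppress fine-grained geography."""
    out = dict(rec)
    out["age"] = generalize_age(out["age"])
    out.pop("zip", None)  # limit geographic detail
    return out

record = {"code": "X7R92", "age": 34, "zip": "02138", "lab_value": 7.2}
safer = generalize_record(record)
```

After the transformation, `safer` retains the analytic field but reports age only as `"30-39"` and carries no ZIP code, shrinking the pool of distinguishing combinations.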
Coding vs. Other De-Identification Methods
Coding is one approach on a spectrum of privacy-preserving techniques. At one end is full anonymization, where all links to identity are permanently severed. At the other end is simply restricting access to raw, fully identified data. Coding sits in the middle, offering a practical compromise.
- Full anonymization provides the strongest privacy protection but sacrifices the ability to update records, follow up with participants, or link datasets over time. Once the connection is gone, it’s gone.
- Coding (pseudonymization) preserves that connection behind a locked door. It’s the standard approach in longitudinal research, clinical trials, and healthcare analytics where data needs to remain linkable but day-to-day users shouldn’t see identities.
- Statistical techniques like adding random noise to data or aggregating records into groups can further reduce re-identification risk. These are sometimes applied on top of coding for datasets that will be shared broadly.
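Both statistical techniques from the last bullet can be sketched briefly. The noise mechanism below is a Laplace-style perturbation of a released count (the building block of differential privacy), and the aggregation bins ages into decades; the epsilon value and bin width are illustrative parameters, not recommendations.

```python
import random

def noisy_count(true_count, epsilon=1.0):
    """Release a count with Laplace noise of scale 1/epsilon, limiting
    how much any single record can shift the published number.
    (The difference of two Exp(epsilon) draws is Laplace-distributed.)"""
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

def aggregate_ages(records, width=10):
    """Publish decade-level counts instead of individual ages."""
    bins = {}
    for r in records:
        low = (r["age"] // width) * width
        label = f"{low}-{low + width - 1}"
        bins[label] = bins.get(label, 0) + 1
    return bins
```

Applied on top of coding, these techniques mean that even a recipient holding many indirect identifiers sees only blurred counts and coarse groups rather than record-level detail.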
The right approach depends on the use case. If a hospital wants to share data with an external research team for a one-time analysis, full anonymization may be appropriate. If a multi-year clinical trial needs to track outcomes over time while protecting participant privacy, coding with a securely stored key is the standard practice. The key question is always whether there’s a legitimate need to reconnect the data to individuals later. If there is, coding is the tool. If there isn’t, stronger methods should be used.