What Is Medical Claims Data and How Is It Used?

Medical claims data is the standardized information generated every time a healthcare provider bills an insurance company for a service. Each claim captures who the patient is, what diagnosis they received, what procedures were performed, who provided the care, and how much was charged. Collectively, these millions of individual billing records form one of the largest and most widely used sources of health information in the United States, powering everything from insurance reimbursement to disease surveillance to drug safety research.

What a Single Claim Contains

Every medical claim follows a standardized format. Professional claims (from doctor’s offices and outpatient providers) use the CMS-1500 form, while institutional claims (from hospitals and facilities) use the UB-04 form, also known as CMS-1450. Despite the different formats, both capture the same core categories of information.

Patient demographics include the patient’s name, a unique identifier, address, date of birth, and sex. Insurance details include the name of the insured person and their plan identifier, such as a Medicare beneficiary number. Provider identifiers include the billing provider’s tax number and National Provider Identifier (NPI), plus the attending physician’s NPI and, for surgical cases, the operating physician’s NPI. Beyond these basics, every claim records the dates of service, the facility where care was delivered, and the dollar amounts charged.

The clinical substance of a claim lives in its codes. Five coding systems do most of the work:

ICD-10 diagnosis codes describe why a patient sought care. Every provider in every healthcare setting uses these codes, which are maintained by the CDC.
ICD-10 procedure codes describe inpatient hospital procedures specifically, such as surgeries or other interventions performed during a hospital stay.
CPT codes identify professional services and procedures across six sections: evaluation and management, anesthesia, surgery, radiology, pathology and laboratory, and medicine.
HCPCS Level II codes cover products and services not captured by CPT codes, including durable medical equipment, prosthetics, ambulance services, and certain drugs and biologicals.
NDC codes identify specific prescription drugs. Nearly all drugs in the United States carry a unique National Drug Code assigned by the FDA.
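
To make the pieces concrete, here is a minimal sketch of how a single professional claim might be represented in code. The class names, field names, and sample values are hypothetical and cover only a small subset of what a real claim form carries; the NPI is a placeholder, not a real provider.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ServiceLine:
    """One billed service on a claim (hypothetical layout)."""
    procedure_code: str          # CPT or HCPCS Level II code
    charge: float                # dollar amount billed for this line
    ndc: Optional[str] = None    # National Drug Code, if a drug was billed


@dataclass
class Claim:
    """Simplified claim record; real claim forms carry many more fields."""
    patient_id: str
    date_of_birth: date
    sex: str
    billing_npi: str
    service_date: date
    diagnosis_codes: List[str]   # ICD-10 diagnosis codes
    lines: List[ServiceLine] = field(default_factory=list)

    def total_charge(self) -> float:
        """Sum the billed amounts across all service lines."""
        return sum(line.charge for line in self.lines)


claim = Claim(
    patient_id="P001",
    date_of_birth=date(1950, 3, 14),
    sex="F",
    billing_npi="1234567890",    # placeholder, not a real NPI
    service_date=date(2024, 5, 2),
    diagnosis_codes=["E11.9"],   # type 2 diabetes without complications
    lines=[
        ServiceLine(procedure_code="99213", charge=120.0),  # office visit
        ServiceLine(procedure_code="83036", charge=35.0),   # hemoglobin A1c test
    ],
)
print(claim.total_charge())
```

The diagnosis codes say why the patient was seen, and each service line says what was done and what was charged for it.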

How a Claim Becomes Data

A claim doesn’t simply appear in a database. It goes through a multi-step adjudication process that determines whether and how much the insurer will pay. That process shapes the final data record.

First, the provider submits the claim with all required codes and documentation. The insurer’s system then runs an automated review, checking the patient’s eligibility, verifying that codes are accurate, and confirming the claim meets policy rules. If the automated system flags an issue, a licensed medical reviewer manually assesses medical necessity, prior authorizations, and treatment history.

After review, the payer makes one of three decisions: approve the claim in full, partially pay it (adjusting the amount based on coding, medical necessity, or policy limits), or deny it entirely. The insurer then generates an Explanation of Benefits for the patient and a remittance advice document for the provider, detailing what was paid, adjusted, or denied. The completed claim, with its final payment determination, is then stored as a data record. This means claims data reflects not just what care was delivered, but what an insurer agreed to cover and how much it paid.
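
The three possible outcomes can be sketched as a small function. This is a toy model under stated assumptions: the field names, the single "allowed amount" rule, and the class names are all hypothetical, and it leaves out the manual review step that real payers use for flagged claims.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPROVED = "approved"
    PARTIAL = "partially paid"
    DENIED = "denied"


@dataclass
class SubmittedClaim:
    member_eligible: bool    # did coverage apply on the date of service?
    codes_valid: bool        # did the codes pass the automated checks?
    billed_amount: float
    allowed_amount: float    # most the (hypothetical) plan will pay


@dataclass
class AdjudicatedClaim:
    decision: Decision
    billed_amount: float
    paid_amount: float


def adjudicate(claim: SubmittedClaim) -> AdjudicatedClaim:
    """Toy rules only; real payers also route flagged claims to manual review."""
    if not (claim.member_eligible and claim.codes_valid):
        return AdjudicatedClaim(Decision.DENIED, claim.billed_amount, 0.0)
    if claim.allowed_amount < claim.billed_amount:
        return AdjudicatedClaim(
            Decision.PARTIAL, claim.billed_amount, claim.allowed_amount
        )
    return AdjudicatedClaim(
        Decision.APPROVED, claim.billed_amount, claim.billed_amount
    )


record = adjudicate(SubmittedClaim(True, True, 200.0, 150.0))
print(record.decision.value, record.paid_amount)
```

The stored record keeps both the billed amount and the paid amount, which is why analysts working with claims have to be explicit about which dollar figure they mean.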

Major Public Claims Datasets

The largest publicly accessible source of claims data in the U.S. comes from Medicare. These files are organized into four main categories:

Beneficiary enrollment files contain dates of Medicare enrollment, the specific services a patient is enrolled in, and basic demographics like age, sex, race, and ZIP code. These files also include indicators for chronic conditions such as diabetes or heart failure, plus flags for disability-related conditions, mental health diagnoses, and substance use disorders.

Part A files cover inpatient hospital stays, including summary data from hospitalizations, detailed hospital claims, skilled nursing facility claims, and hospice services.

Part B files capture outpatient and ambulatory care: visits to community and hospital-based physician offices, clinical laboratory billing, durable medical equipment, and outpatient medications administered in doctor’s offices or infusion centers.

Part D files contain prescription drug claims from outpatient pharmacies, including the medication name, dose, quantity supplied, days supplied, and cost information.

Beyond Medicare, state-level databases called State Inpatient Databases provide all inpatient discharges from non-federal acute care hospitals, including diagnosis and procedure codes, insurance type, hospital charges, and length of stay. Commercial insurers also maintain their own claims databases, which researchers can access through data use agreements or commercial vendors.

How Claims Data Differs From Clinical Records

Claims data and electronic health records (EHRs) capture different slices of a patient’s healthcare experience, and the differences matter. Claims data reflects what an insurance plan covered and how it managed utilization. EHR data reflects what clinicians decided and documented during care. Neither was designed for research; both were built for operational purposes.

Claims data has a major advantage in capturing the full scope of a patient’s healthcare use across settings. One study comparing the two data sources in a large healthcare system found that EHRs substantially undercounted certain services. Only 4% of patients showed emergency department visits in EHR data, compared with 11.2% in claims. For X-rays, the gap was even wider: 4% in EHRs versus 22% in claims. CT scans followed the same direction with a narrower gap, at 5.1% versus 7.3%. EHR-based cost estimates also significantly underestimated total cost of care.
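
One way to read those figures is as a ratio of the claims rate to the EHR rate for each service. The short calculation below uses only the percentages quoted above; the variable names are illustrative.

```python
# Percent of patients with each service, as quoted in the comparison above:
# (percent in EHR data, percent in claims data).
capture = {
    "emergency department visit": (4.0, 11.2),
    "X-ray": (4.0, 22.0),
    "CT scan": (5.1, 7.3),
}

ratios = {}
for service, (ehr_pct, claims_pct) in capture.items():
    ratios[service] = claims_pct / ehr_pct
    print(f"{service}: claims show {ratios[service]:.1f}x the EHR rate")
```

By this measure claims captured roughly 2.8 times as many patients with emergency visits, 5.5 times as many with X-rays, and about 1.4 times as many with CT scans.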

The tradeoff is clinical depth. EHRs contain lab results, vital signs, physician notes, imaging findings, and other granular clinical details that never appear on a bill. Combining the clinical detail from EHRs with the cost and utilization data from claims is increasingly common in health services research.

What Claims Data Is Used For

Claims data is one of the primary tools for understanding how healthcare is delivered and consumed at a population level. Because every billable interaction generates a claim, researchers can track patterns of care across millions of patients over years or even decades. Common applications include disease monitoring, where claims reveal how many people are being diagnosed and treated for specific conditions over time. Comparative effectiveness research uses claims to evaluate how different treatments perform across real-world patient populations, outside the controlled environment of clinical trials.
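
As a rough illustration of disease monitoring, the sketch below counts how many distinct patients have at least one claim with a type 2 diabetes diagnosis in each year. The rows are made-up sample data and the three-field layout is an assumption; real claims files need far more cleaning before a count like this means anything.

```python
from collections import defaultdict

# Hypothetical claim rows: (patient_id, service_year, ICD-10 diagnosis code).
claims = [
    ("P001", 2022, "E11.9"),
    ("P001", 2022, "E11.65"),   # same patient, second diabetes claim that year
    ("P002", 2022, "I10"),      # hypertension, not counted
    ("P002", 2023, "E11.9"),
    ("P003", 2023, "E11.21"),
    ("P001", 2023, "E11.9"),
]


def patients_by_year(rows, code_prefix):
    """Count distinct patients per year with a diagnosis starting with code_prefix."""
    seen = defaultdict(set)
    for patient_id, year, dx in rows:
        if dx.startswith(code_prefix):
            seen[year].add(patient_id)
    return {year: len(ids) for year, ids in sorted(seen.items())}


counts = patients_by_year(claims, "E11")  # E11 = type 2 diabetes mellitus
print(counts)
```

Counting distinct patients rather than claims matters here, because one person with frequent visits would otherwise look like many cases.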

Insurers and employers use claims data to identify cost drivers, measure provider performance, and design benefit plans. Public health agencies use it to track disease burden and allocate resources. The broader healthcare analytics market, which encompasses claims analysis alongside other health data tools, is valued at roughly $55.5 billion in 2025 and is projected to reach $166.7 billion by 2030.

Privacy Protections for Claims Data

Because claims contain sensitive personal health information, federal law generally requires that data be stripped of identifying details before it can be used for research or shared outside the healthcare system. HIPAA provides two approved methods for de-identification.

The Safe Harbor method requires removing 18 specific types of identifiers: names; geographic units smaller than a state (the first three digits of a ZIP code may be kept only if that three-digit area contains more than 20,000 people); all date elements except year (with ages over 89 grouped into a single category of 90 or older); phone numbers; fax numbers; email addresses; Social Security numbers; medical record numbers; health plan beneficiary numbers; account numbers; certificate and license numbers; vehicle identifiers such as license plate numbers; device identifiers and serial numbers; URLs; IP addresses; biometric identifiers; full-face photographs; and any other unique identifying number, characteristic, or code.
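
A few of these transformations can be sketched in code. The example below is illustrative only: the field names, the set of direct identifiers, and the low-population ZIP prefixes are assumptions made for the example, it handles only a subset of the 18 identifier types, and it should not be treated as a compliance tool.

```python
from datetime import date

# Illustrative 3-digit ZIP prefixes treated as covering 20,000 or fewer people.
# A real list must be derived from current Census data.
LOW_POPULATION_ZIP3 = {"036", "059"}

# Hypothetical field names for identifiers that are dropped outright.
DIRECT_IDENTIFIERS = {"name", "phone", "email", "ssn", "medical_record_number"}


def safe_harbor_sketch(record: dict) -> dict:
    """Apply a few Safe Harbor style transformations to one record."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Keep only the first three ZIP digits, or "000" for low-population areas.
    zip3 = str(out.pop("zip_code"))[:3]
    out["zip3"] = "000" if zip3 in LOW_POPULATION_ZIP3 else zip3

    # Reduce the service date to its year.
    service_date = out.pop("service_date")
    out["service_year"] = service_date.year

    # Replace date of birth with age, grouping ages over 89 as "90+".
    birth = out.pop("date_of_birth")
    had_birthday = (service_date.month, service_date.day) >= (birth.month, birth.day)
    age = service_date.year - birth.year - (0 if had_birthday else 1)
    out["age"] = "90+" if age >= 90 else age
    return out


raw = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "zip_code": "02139",
    "service_date": date(2024, 5, 2),
    "date_of_birth": date(1930, 1, 15),
    "diagnosis": "E11.9",
}
deidentified = safe_harbor_sketch(raw)
print(deidentified)
```

The clinical content (the diagnosis code) survives, while location, dates, and age are coarsened to the level the method allows.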

The Expert Determination method is more flexible. A qualified statistician analyzes the dataset and certifies that the risk of re-identifying any individual is “very small,” then documents the methods and reasoning behind that conclusion. This approach allows researchers to retain more data granularity when they can demonstrate that privacy risk remains minimal.

Known Limitations

Claims data is powerful but imperfect, and its limitations stem directly from its original purpose: getting providers paid, not generating research insights.

The most fundamental limitation is that claims depend entirely on the accuracy of coding. In clinical settings, some diagnoses get missed, different types of providers may code the same condition differently, and not all coding is accurate. These codes work best when analyzing large populations with similar conditions rather than evaluating individual patient outcomes.

Claims data also lacks information about disease severity and how long a patient was ill before receiving a diagnosis. Two patients with the same diagnosis code could be at vastly different stages of illness, and the claim won’t distinguish between them. This limits the ability to make fair comparisons between patients.

Outcomes are another blind spot. Recovery, symptom improvement, disease progression, and quality of life are not captured in billing records. Any outcome information has to be inferred indirectly, for example by looking at whether a patient was readmitted or required additional procedures. And because claims only exist when someone interacts with the healthcare system, people who are uninsured, underinsured, or simply avoid medical care are invisible in these datasets.
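
As an example of inferring an outcome indirectly, the sketch below flags patients who were admitted again within 30 days of a prior discharge. The stay records are made-up sample data, and the 30-day window and tuple layout are assumptions chosen for illustration; published readmission measures apply many more inclusion and exclusion rules.

```python
from datetime import date

# Hypothetical inpatient stays: (patient_id, admit_date, discharge_date).
stays = [
    ("P001", date(2024, 1, 3), date(2024, 1, 8)),
    ("P001", date(2024, 1, 20), date(2024, 1, 22)),   # 12 days after discharge
    ("P002", date(2024, 2, 1), date(2024, 2, 4)),
    ("P002", date(2024, 4, 15), date(2024, 4, 18)),   # 71 days after discharge
]


def readmitted_within(stays, days=30):
    """Return patient IDs with a new admission within `days` of a prior discharge."""
    # Group each patient's stays in admission order.
    by_patient = {}
    for patient_id, admit, discharge in sorted(stays, key=lambda s: (s[0], s[1])):
        by_patient.setdefault(patient_id, []).append((admit, discharge))

    flagged = set()
    for patient_id, patient_stays in by_patient.items():
        # Compare each stay with the one that follows it.
        for (_, prev_discharge), (next_admit, _) in zip(patient_stays, patient_stays[1:]):
            if 0 <= (next_admit - prev_discharge).days <= days:
                flagged.add(patient_id)
    return flagged


readmitted = readmitted_within(stays)
print(readmitted)
```

A flag like this says only that a second admission was billed. It cannot say whether the patient got worse, which is exactly the kind of gap described above.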