Data management in research is the process of organizing, documenting, storing, and sharing the data you collect throughout a study. It covers every decision about how data will be handled, from the initial planning stages through long-term preservation after a project ends. Done well, it makes your work reproducible, keeps you compliant with funder requirements, and ensures your data remains usable years down the line.
The Research Data Lifecycle
Research data doesn’t just sit in one place. It moves through a series of stages, often described as a lifecycle with seven phases: plan, acquire, process, analyze, preserve, share, and discover. Each stage involves specific management decisions. During planning, you decide what data you’ll collect and how you’ll organize it. During acquisition, you’re recording data and tracking its origins. Processing involves cleaning and transforming raw data into something analyzable, while analysis is where you draw findings from it.
The later stages matter just as much. Preservation means storing your data in formats and locations that will remain accessible for years. Sharing means making it available to other researchers or the public. Discovery closes the loop: other scientists find your preserved data and use it to ask new questions, starting the cycle again. Managing data well at every stage prevents the all-too-common scenario where a researcher finishes a project and realizes their files are disorganized, poorly labeled, or stored on a hard drive that no one else can access.
What a Data Management Plan Includes
Most major funders now require a formal data management plan (DMP) as part of grant applications. The NIH, for example, has required data management and sharing plans for all competing grant applications since January 25, 2023. These plans are typically two pages or fewer, but they cover a lot of ground.
A strong DMP addresses several core elements. First, it defines the types and volume of data the project will generate, whether that’s survey responses, genomic sequences, imaging files, or something else. It specifies file formats, describes how individual observations will be recorded, and identifies the instruments used for collection. Second, it documents the metadata: the descriptive information that helps someone else understand what your data means. This includes variable definitions, data dictionaries, and any community-specific standards your field uses. Third, it lays out where the data will be stored, when it will be shared, how long it will remain available, and whether any restrictions apply.
The plan also identifies any software or code someone would need to view or analyze the data, and it explains whether the data will receive a persistent identifier (a permanent digital address that helps others locate it). If your field has no established standard for organizing this type of data, you note that too.
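The data dictionary mentioned above is often easiest to maintain in a machine-readable form deposited alongside the dataset. As a minimal sketch (the variable names and allowed values here are invented for illustration, not drawn from any particular study), one common pattern is a JSON file mapping each column to its definition:

```python
import json

# Hypothetical data dictionary for a small survey dataset.
# Each entry defines one variable: what it means, its type,
# its units, and the values it is allowed to take.
data_dictionary = {
    "participant_id": {
        "description": "Unique study identifier assigned at enrollment",
        "type": "string",
    },
    "age_years": {
        "description": "Participant age at baseline visit",
        "type": "integer",
        "units": "years",
        "range": [18, 99],
    },
    "smoking_status": {
        "description": "Self-reported smoking status",
        "type": "categorical",
        "levels": {"0": "never", "1": "former", "2": "current"},
    },
}

# Serialized next to the data, this lets a future user interpret
# every column without contacting the original team.
print(json.dumps(data_dictionary, indent=2))
```

A file like this doubles as the "variable definitions" element of the DMP itself: reviewers can see exactly how observations will be recorded.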
The FAIR Principles
The FAIR principles are the most widely adopted framework for evaluating whether research data is well-managed. FAIR stands for findable, accessible, interoperable, and reusable.
- Findable means data has a unique, persistent identifier and is described with rich metadata so both humans and search engines can locate it.
- Accessible means that once found, the data can be retrieved by authorized users through standard, secure protocols.
- Interoperable means the data uses standardized formats and vocabularies so it can be combined with other datasets and processed by common software tools.
- Reusable means the data is thoroughly described, includes clear usage licenses, and follows field-specific standards so future researchers can confidently build on it.
These aren’t just abstract ideals. Funders and journals increasingly evaluate submissions against FAIR criteria. Data that meets these standards is more likely to be cited, reused, and trusted.
Metadata and Documentation
Metadata is often called “data about data,” which sounds redundant until you try to open a spreadsheet someone else created with no column labels, no units, and no explanation of what the numbers represent. Good metadata makes data interpretable by anyone, not just the person who collected it.
Different fields have developed their own metadata standards. Biodiversity researchers use Darwin Core, a schema for describing species observations and ecological data. Biomedical researchers rely on Medical Subject Headings (MeSH), maintained by the National Library of Medicine, to categorize and search for health-related datasets. Neuroimaging researchers use the Brain Imaging Data Structure (BIDS) to organize and describe brain scan files in a consistent way. Using the right standard for your discipline makes your data far easier for colleagues in your field to find and reuse.
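To make the BIDS example concrete: BIDS encodes metadata directly in file names as ordered key-value pairs (subject, session, task, and so on) followed by a modality suffix. The parser below is a rough sketch that handles only a common subset of those entities; the real BIDS specification defines many more and is considerably stricter.

```python
import re

# Simplified parser for BIDS-style file names such as
# sub-01_ses-02_task-rest_bold.nii.gz. Only a subset of the
# entities defined by the BIDS specification is recognized here.
BIDS_NAME = re.compile(
    r"sub-(?P<sub>[0-9A-Za-z]+)"
    r"(?:_ses-(?P<ses>[0-9A-Za-z]+))?"
    r"(?:_task-(?P<task>[0-9A-Za-z]+))?"
    r"_(?P<suffix>[A-Za-z0-9]+)\.(?P<ext>nii\.gz|nii|json|tsv)$"
)

def parse_bids_name(filename: str) -> dict:
    """Extract the key-value entities embedded in a BIDS-style name."""
    match = BIDS_NAME.match(filename)
    if not match:
        raise ValueError(f"not a recognized BIDS-style name: {filename}")
    return {k: v for k, v in match.groupdict().items() if v is not None}

print(parse_bids_name("sub-01_ses-02_task-rest_bold.nii.gz"))
# -> {'sub': '01', 'ses': '02', 'task': 'rest', 'suffix': 'bold', 'ext': 'nii.gz'}
```

The design choice is the point: because the metadata lives in the name itself, any tool that understands the convention can organize, validate, or batch-process thousands of scans without opening a single file.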
Where to Store and Share Data
Choosing the right repository is a key management decision. The NIH recommends giving primary consideration to repositories that specialize in your discipline or data type, because these tend to have more detailed metadata requirements and attract the audience most likely to reuse your data. A genomics researcher, for instance, would deposit sequences in a genomics-specific archive rather than a general-purpose platform.
When no discipline-specific repository exists, generalist repositories like Zenodo, Figshare, or Dryad are solid alternatives. They accept data from any field, assign persistent identifiers, and provide long-term access. The tradeoff is that their metadata structures are broader, which can make datasets slightly harder to discover within a specialized community.
Protecting Sensitive Data
Research involving human participants introduces additional management challenges. Even after a study ends, you have obligations to protect participant privacy. The key technique is de-identification: removing or obscuring information that could link data back to a specific person.
There are three main approaches. Suppression removes identifying features entirely, such as stripping names, addresses, and dates of birth from a dataset. Generalization replaces specific values with broader categories, like recording an age range instead of an exact age, or listing a state instead of a city. Perturbation introduces small, controlled changes to data values so that individual records can’t be traced back to a person while the overall statistical patterns remain intact. Cryptographic hashing, which converts identifying information into a scrambled code using a one-way function, is another common method.
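A brief sketch of what these four techniques look like in practice. The record fields, the generalization rules, and the noise range are invented for illustration; real de-identification follows a formal protocol and a disclosure-risk review.

```python
import hashlib
import random

# Hypothetical participant record with direct and quasi-identifiers.
record = {"name": "Jane Doe", "city": "Springfield", "state": "IL",
          "age": 37, "weight_kg": 64.2}

def deidentify(rec: dict, salt: str, rng: random.Random) -> dict:
    out = dict(rec)
    # Suppression: remove direct identifiers entirely.
    del out["name"]
    del out["city"]  # keep only the broader "state" field
    # Generalization: replace exact age with a ten-year band.
    decade = rec["age"] // 10 * 10
    out["age_band"] = f"{decade}-{decade + 9}"
    del out["age"]
    # Perturbation: add small controlled noise to continuous values,
    # preserving overall statistics while blurring individual records.
    out["weight_kg"] = round(rec["weight_kg"] + rng.uniform(-1.0, 1.0), 1)
    # Cryptographic hashing: a salted one-way code that lets records be
    # linked across files without revealing the underlying identity.
    out["id_hash"] = hashlib.sha256((salt + rec["name"]).encode()).hexdigest()[:12]
    return out

print(deidentify(record, salt="study-secret", rng=random.Random(0)))
```

Note the salt passed to the hash: hashing a bare name is weak, because anyone with a list of candidate names can recompute the codes and re-identify participants.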
Even when data is technically de-identified, the NIH recommends that researchers proactively assess whether controlled-access sharing is appropriate. This means requiring potential users to apply for access and agree to specific terms before they can download the data. These protections should be outlined in your data management plan and conveyed to any repository where you deposit the data.
Budgeting for Data Management
Data management costs real money, and funders recognize that. The NIH allows grant applicants to include a range of data management expenses in their budgets. Allowable costs include curating data, developing documentation, formatting data to meet community standards, de-identifying records, preparing metadata, paying repository deposit fees, and maintaining specialized infrastructure needed for local storage before data is deposited in a permanent archive. If your plan calls for depositing data in multiple repositories, you can include costs for each one.
These expenses should be categorized under the appropriate budget lines: personnel, equipment, supplies, or other expenses. The practical takeaway is that data management shouldn’t be an afterthought squeezed into spare hours. It’s a fundable, integral part of doing research, and reviewers expect to see it reflected in your budget.
Why It Matters Beyond Compliance
Funder mandates get researchers to write data management plans, but the real benefits go further. Well-managed data is easier for your own team to work with. When a lab member leaves, their data doesn’t leave with them if it’s properly documented and stored. When a reviewer questions a finding, you can trace the analysis back to the raw data without reconstructing months of work.
Open access to well-managed data also accelerates science broadly. Studies with openly available data tend to attract more engagement and citations. Open science practices, including data sharing, sharing of research materials, and posting analysis code, are increasingly seen as markers of rigorous, transparent work. Fields that have embraced strong data management norms, like genomics, have seen exponential growth in secondary analyses that build on shared datasets, generating new discoveries from data that already exists.