Data management plan (DMP)
- Every research group should implement a data management plan to ensure long-term sustainability
- Last updated: December 23, 2025
Why implement a DMP?
A data management plan (DMP) establishes consistent, transparent, and sustainable practices for storing, organizing, and maintaining research data. Implementing a DMP provides the following benefits:
- Resource efficiency: Reduce storage costs and minimize researcher time spent locating, copying, or cleaning data.
- Effective disk usage tracking: Enable monitoring of quotas and usage at the user and project level.
- Reduction of redundancy and clutter: Prevent duplicated large files (for example, reference genomes) and eliminate orphaned or abandoned directories.
- Improved documentation and accountability: Clearly identify directory ownership, contents, and purpose to support data reuse and validation cohorts.
- Automation readiness: Support automated reporting of storage usage per user or project without manual inspection.
- Streamlined backup and archiving: Simplify long-term backup, archiving, and data restoration workflows.
Common data management pitfalls
The following practices significantly hinder scalability, reproducibility, and storage efficiency:
- Retaining large files in uncompressed formats
- Duplicating reference genomes and annotations across multiple directories
- Inconsistent directory depth and naming conventions
- Mixing unrelated content within a single directory, such as:
- Raw data
- Reference files
- Intermediate results
- Scripts and logs
- Final outputs
Recommended directory structure
All shared storage should be organized under FIVE root-level directories only:
- RAWDATA: primary, unprocessed data (for example, FASTQ files).
- RESULTS: version-controlled, processed datasets generated by standardized lab pipelines and intended for reuse.
- EXTERNAL: third-party datasets obtained from public repositories or published studies.
- REFERENCES: shared resources such as reference genomes, annotations, and indices.
- projects: active working space for analyses, scripts, and collaboration.
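As a sketch, the five roots can be initialized in one step; the storage path below is a temporary placeholder standing in for your shared mount point:

```shell
# Placeholder path; substitute the real shared storage mount in practice
STORAGE=$(mktemp -d)

# Create the five root-level directories of the DMP layout
mkdir -p "$STORAGE"/RAWDATA "$STORAGE"/RESULTS "$STORAGE"/EXTERNAL \
         "$STORAGE"/REFERENCES "$STORAGE"/projects

ls "$STORAGE"
```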
Rationale for the five-folder model
The separation into RAWDATA, RESULTS, EXTERNAL, REFERENCES, and projects is not cosmetic. Each folder has a distinct operational meaning that enables informed decisions about backup priority, cleanup strategy, and long-term sustainability:
- Predictable cleanup when storage reaches quota: When storage pressure occurs, projects is the first target for cleanup, followed by selective pruning of derived files. RAWDATA and RESULTS are protected from ad hoc deletion, reducing the risk of irreversible data loss.
- Clear backup prioritization: RAWDATA and RESULTS contain irreplaceable or high-value data and therefore receive the highest backup priority. In contrast, projects contains intermediate and reproducible outputs that can be regenerated if needed.
- Explicit separation of re-downloadable data: EXTERNAL is reserved exclusively for third-party datasets that can be re-downloaded from public repositories or publications. This prevents unnecessary backup of recoverable data and simplifies restoration strategies.
- Shared, single-source references: REFERENCES centralizes genomes, annotations, and indices that are shared across users and projects. This eliminates redundant copies, ensures consistency across analyses, and simplifies updates.
- Automated and interpretable storage accounting: A fixed set of root folders enables automated reporting of storage usage per category. Administrators can immediately assess how much space is consumed by raw data, results, references, external datasets, or active projects without manual inspection.
Suggested subdirectory organization
- RAWDATA and RESULTS: Organized by sequencing run, cohort, or experiment
- EXTERNAL: Organized by PMID, consortium name, or study identifier
- REFERENCES: Hierarchical structure species → assembly → resource type (for example, genome, annotation, tool-specific index)
- projects: Flexible structure defined by project needs
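The species → assembly → resource-type hierarchy can be sketched as below; the species and assembly names are hypothetical examples, and the temporary directory stands in for the shared REFERENCES root:

```shell
REFERENCES=$(mktemp -d)   # stand-in for the shared REFERENCES root

# Hypothetical example: one species, one assembly, three resource types
mkdir -p "$REFERENCES"/homo_sapiens/GRCh38/genome \
         "$REFERENCES"/homo_sapiens/GRCh38/annotation \
         "$REFERENCES"/homo_sapiens/GRCh38/bwa_index

find "$REFERENCES" -mindepth 1 -type d | sort
```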
Recommended project-level directory layout
Within each project directory, the following subfolders are encouraged:
- Documents: Metadata, sample annotations, and study notes
- FASTQ: Symbolic links pointing to raw data stored in RAWDATA
- Results: Intermediate and final analysis outputs specific to the project; finalized results should be copied to the RESULTS folder
- Scripts: Analysis scripts and Slurm job files, ideally under version control (for example, Git)
- Logs: Records of pipeline runs, errors, and processing steps
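A minimal sketch of setting up such a project skeleton, with raw data linked rather than copied; the project name and file names are hypothetical, and the temporary directory stands in for the shared mount:

```shell
ROOT=$(mktemp -d)   # stand-in for the shared storage mount
mkdir -p "$ROOT/RAWDATA/run01"
touch "$ROOT/RAWDATA/run01/sample1.fastq.gz"

# Hypothetical project name; subfolders follow the recommended layout
PROJECT="$ROOT/projects/my_analysis"
mkdir -p "$PROJECT/Documents" "$PROJECT/FASTQ" "$PROJECT/Results" \
         "$PROJECT/Scripts" "$PROJECT/Logs"

# Link raw data into FASTQ instead of duplicating it
ln -s "$ROOT/RAWDATA/run01/sample1.fastq.gz" "$PROJECT/FASTQ/sample1.fastq.gz"
```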
Special notes on the projects folder
- Intended for intermediate files and active analyses
- Subject to regular cleanup
- First target for space reclamation when storage is limited
- Symbolic links should be used to avoid data duplication
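A cleanup sweep can be sketched with find; the 90-day threshold is an example choice, the temporary directory stands in for the shared projects root, and the command lists candidates rather than deleting them:

```shell
PROJECTS=$(mktemp -d)   # stand-in for the shared projects root

# Simulate one stale file and one recent file
touch "$PROJECTS/old_intermediate.bam"
touch -d "120 days ago" "$PROJECTS/old_intermediate.bam"
touch "$PROJECTS/fresh_result.txt"

# List (do not delete) files untouched for more than 90 days;
# review the list with owners before reclaiming any space
find "$PROJECTS" -type f -mtime +90
```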
Additional guidelines
- Every RAWDATA and projects folder must contain a README.txt
- Shared directories under /work and /bulk must use 750 permissions
- Directory names are capitalized to indicate controlled, shared resources
README files (mandatory)
Every project directory and the RAWDATA directory must include a README file.
README files may be free-form but should, at minimum, record:
- Date of creation or update
- Project or sample description
- Data source
- Ownership and responsible contact
- Key processing steps or assumptions
Examples below:
## Under RAWDATA folder
# Project: MICROBENCH
# Generated: Jared Schlechte
# README by: Heewon Seo
# Date: November 06, 2025
# Data format: POD5
# Data source: Oxford Nanopore
# Description: This folder contains raw Nanopore sequencing data (POD5 files)
along with automatically generated statistics and report files from MinKNOW.
Some report files are missing due to MinKNOW processing failures.
## Under RESULTS folder
# Project: MICROBENCH
# Generated: Heewon Seo
# Date: November 10, 2025
# Data format: BAM and FASTQ
# Data source: Oxford Nanopore
# Description: This folder contains unaligned BAM files
from each plate and barcode-level FASTQ files that
were generated from POD5 files using the Griffin-Pipeline
(https://github.com/Snyder-Institute/Griffin-Pipeline).
Permissions and ownership
Shared directories
- Use 750 (or 770 when appropriate) permissions
- Allow group-level read access while preserving ownership clarity
Example commands:
chmod -R 750 FOLDER_NAME
chmod 750 FILE_NAME
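Note that a blanket `chmod -R 750` also marks every regular file as executable. One common refinement, sketched here on a temporary directory, is to apply 750 to directories and 640 to regular files in separate passes:

```shell
TARGET=$(mktemp -d)   # stand-in for a shared directory tree
mkdir -p "$TARGET/sub"
touch "$TARGET/sub/data.txt"

# 750 on directories (group can traverse), 640 on regular files
find "$TARGET" -type d -exec chmod 750 {} +
find "$TARGET" -type f -exec chmod 640 {} +
```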
Default permissions
Add the following to your .bash_profile:
umask 0027
Group ownership
chown -R heewon.seo:bioinformatics_hub FOLDER_NAME
chgrp -R bioinformatics_hub FOLDER_NAME
Immutable read-only data
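One way to enforce immutability on finalized data, as a sketch: remove all write bits with chmod. The stronger `chattr +i` (which blocks even the owner and root-owned processes from modifying or deleting files) requires root and filesystem support, so it is shown commented out; the path here is a temporary placeholder.

```shell
RAWDATA=$(mktemp -d)   # stand-in for the real RAWDATA root
touch "$RAWDATA/run01.pod5"

# Remove all write bits so finalized raw data cannot be modified
chmod -R a-w "$RAWDATA"

# Stronger, filesystem-level protection (root only, ext*/XFS):
# chattr +i "$RAWDATA/run01.pod5"
```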
Symbolic links
Use symbolic links to reference shared, immutable data:
ln -s SOURCE_FILE DEST_FILE
unlink DEST_FILE
Environment variables
Define shortcuts for frequently used directories:
Create ~/.bash_env:
REFERENCES=/work/bioinformatics_hub/REFERENCES/
Source it in .bash_profile:
if [ -f ~/.bash_env ]; then
source ~/.bash_env
fi
Backup and archiving strategy
The backup system follows the 3-2-1 rule:
- Maintain three copies of all critical data
- Store data on two different types of media
- Keep one copy offsite

Additional notes:
- Backup failover storage is offline, not internet-connected, and inaccessible to users to maximize data security.
- ResearchFS provides a scratch-style projects workspace that supports active analyses without disrupting existing workflows.