Data management plan (DMP)

  • Every research group should implement a data management plan to ensure long-term sustainability
  • Last updated: December 23, 2025

Why implement a DMP?

A data management plan (DMP) establishes consistent, transparent, and sustainable practices for storing, organizing, and maintaining research data. Implementing a DMP provides the following benefits:

  • Resource efficiency: Reduce storage costs and minimize researcher time spent locating, copying, or cleaning data.
  • Effective disk usage tracking: Enable monitoring of quotas and usage at the user and project level.
  • Reduction of redundancy and clutter: Prevent duplicated large files (for example, reference genomes) and eliminate orphaned or abandoned directories.
  • Improved documentation and accountability: Clearly identify directory ownership, contents, and purpose to support data reuse and validation cohorts.
  • Automation readiness: Support automated reporting of storage usage per user or project without manual inspection.
  • Streamlined backup and archiving: Simplify long-term backup, archiving, and data restoration workflows.

Common data management pitfalls

The following practices significantly hinder scalability, reproducibility, and storage efficiency:

  • Retaining large files in uncompressed formats (see the compression sketch after this list)
  • Duplicating reference genomes and annotations across multiple directories
  • Inconsistent directory depth and naming conventions
  • Mixing unrelated content within a single directory, such as:
    • Raw data
    • Reference files
    • Intermediate results
    • Scripts and logs
    • Final outputs
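
Uncompressed sequence files are a particularly common offender: FASTQ compresses substantially, and most aligners and QC tools read gzipped input directly. A minimal sketch (file names are illustrative):

# compress a single FASTQ file in place
gzip sample01.fastq

# compress every uncompressed FASTQ under the current directory
find . -name "*.fastq" -exec gzip {} \;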

The five-folder model

All shared storage should be organized under five root-level directories only:

  • RAWDATA: primary, unprocessed data (for example, FASTQ files).
  • RESULTS: version-controlled, processed datasets generated by standardized lab pipelines and intended for reuse.
  • EXTERNAL: third-party datasets obtained from public repositories or published studies.
  • REFERENCES: shared resources such as reference genomes, annotations, and indices.
  • projects: active working space for analyses, scripts, and collaboration.
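
As a minimal sketch, the five roots can be created in one command; the /work/bioinformatics_hub prefix matches the shared path used elsewhere in this document:

mkdir -p /work/bioinformatics_hub/{RAWDATA,RESULTS,EXTERNAL,REFERENCES,projects}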

Rationale for the five-folder model

The separation into RAWDATA, RESULTS, EXTERNAL, REFERENCES, and projects is not cosmetic. Each folder has a distinct operational meaning that enables informed decisions about backup priority, cleanup strategy, and long-term sustainability:

  • Predictable cleanup when storage reaches quota: When storage pressure occurs, projects is the first target for cleanup, followed by selective pruning of derived files. RAWDATA and RESULTS are protected from ad hoc deletion, reducing the risk of irreversible data loss.
  • Clear backup prioritization: RAWDATA and RESULTS contain irreplaceable or high-value data and therefore receive the highest backup priority. In contrast, projects contains intermediate and reproducible outputs that can be regenerated if needed.
  • Explicit separation of re-downloadable data: EXTERNAL is reserved exclusively for third-party datasets that can be re-downloaded from public repositories or publications. This prevents unnecessary backup of recoverable data and simplifies restoration strategies.
  • Shared, single-source references: REFERENCES centralizes genomes, annotations, and indices that are shared across users and projects. This eliminates redundant copies, ensures consistency across analyses, and simplifies updates.
  • Automated and interpretable storage accounting: A fixed set of root folders enables automated reporting of storage usage per category. Administrators can immediately assess how much space is consumed by raw data, results, references, external datasets, or active projects without manual inspection.
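
A minimal sketch of such per-category reporting, assuming the shared root path used elsewhere in this document:

# total usage per root-level category
for d in RAWDATA RESULTS EXTERNAL REFERENCES projects; do
  du -sh /work/bioinformatics_hub/$d
done

# usage per project within the active workspace
du -sh /work/bioinformatics_hub/projects/*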

Suggested subdirectory organization

  • RAWDATA and RESULTS: Organized by sequencing run, cohort, or experiment
  • EXTERNAL: Organized by PMID, consortium name, or study identifier
  • REFERENCES: Hierarchical structure of species → assembly → resource type (for example, genome, annotation, tool-specific index); see the layout sketch after this list
  • projects: Flexible structure defined by project needs
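
For REFERENCES, the species → assembly → resource-type hierarchy could be laid out as follows (the species, assembly, and tool names are illustrative):

# example layout: species → assembly → resource type
mkdir -p REFERENCES/Homo_sapiens/GRCh38/{genome,annotation,index/bwa,index/STAR}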

Within each project directory, the following subfolders are encouraged:

  • Documents: Metadata, sample annotations, and study notes
  • FASTQ: Symbolic links pointing to raw data stored in RAWDATA
  • Results: Intermediate and final analysis outputs specific to the project; finalized results should be copied to the RESULTS folder
  • Scripts: Analysis scripts and Slurm job files, ideally under version control (for example, Git)
  • Logs: Records of pipeline runs, errors, and processing steps
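
A new project skeleton with these subfolders can be scaffolded in one command (MICROBENCH is the example project name used in the READMEs below):

mkdir -p projects/MICROBENCH/{Documents,FASTQ,Results,Scripts,Logs}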

Special notes on the projects folder

  • Intended for intermediate files and active analyses
  • Subject to regular cleanup
  • First target for space reclamation when storage is limited
  • Symbolic links should be used to avoid data duplication
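
For example, raw FASTQ files should be linked into a project's FASTQ folder rather than copied; the sample file name below is hypothetical:

# link, don't copy: RAWDATA remains the single authoritative copy
ln -s /work/bioinformatics_hub/RAWDATA/MICROBENCH/sample01.fastq.gz \
      /work/bioinformatics_hub/projects/MICROBENCH/FASTQ/sample01.fastq.gz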

Additional guidelines

  • Every RAWDATA and projects folder must contain a README.txt
  • Shared directories under /work and /bulk must use 750 permissions
  • Directory names are capitalized to indicate controlled, shared resources

README files (mandatory)

Every project directory and the RAWDATA directory must include a README file.

README files may be free-form but should, at minimum, record:

  • Date of creation or update
  • Project or sample description
  • Data source
  • Ownership and responsible contact
  • Key processing steps or assumptions

Examples:

## Under RAWDATA folder
# Project: MICROBENCH
# Generated by: Jared Schlechte
# README by: Heewon Seo
# Date: November 06, 2025
# Data format: POD5
# Data source: Oxford Nanopore
# Description: This folder contains raw Nanopore sequencing data (POD5 files)
 along with automatically generated statistics and report files from MinKNOW. 
 Some report files are missing due to MinKNOW processing failures.

## Under RESULTS folder
# Project: MICROBENCH
# Generated by: Heewon Seo
# Date: November 10, 2025
# Data format: BAM and FASTQ
# Data source: Oxford Nanopore
# Description: This folder contains unaligned BAM files 
from each plate and barcode-level FASTQ files that 
were generated from POD5 files using the Griffin-Pipeline
(https://github.com/Snyder-Institute/Griffin-Pipeline). 

Permissions and ownership

Shared directories

  • Use 750 (or 770 when appropriate) permissions
  • Allow group-level read access while preserving ownership clarity

Example commands:

# recursively set a directory tree to 750 (owner rwx, group r-x, others none)
chmod -R 750 FOLDER_NAME
# set a single file to 750
chmod 750 FILE_NAME

Default permissions

Add the following to your .bash_profile:

# 0027 removes group write and all "other" access from the defaults,
# so new directories are created as 750 and new files as 640
umask 0027

Group ownership

# set user and group ownership recursively
chown -R heewon.seo:bioinformatics_hub FOLDER_NAME
# or change only the group
chgrp -R bioinformatics_hub FOLDER_NAME

Immutable read-only data

Use symbolic links to reference shared, immutable data:

# create a symbolic link named DEST_FILE that points to SOURCE_FILE
ln -s SOURCE_FILE DEST_FILE
# remove the link itself; the underlying data is untouched
unlink DEST_FILE

Environment variables

Define shortcuts for frequently used directories:

Create ~/.bash_env:

# export makes the variable visible to scripts and Slurm jobs as well
export REFERENCES=/work/bioinformatics_hub/REFERENCES/

Source it in .bash_profile:

# load shared directory shortcuts, if the file exists
if [ -f ~/.bash_env ]; then
  source ~/.bash_env
fi
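
The shortcut is then available in new shells, for example (the Homo_sapiens subfolder follows the illustrative layout above):

cd "$REFERENCES"
ls "${REFERENCES}Homo_sapiens"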

Backup and archiving strategy

The backup system follows the 3-2-1 rule:

  • Maintain three copies of all critical data
  • Store data on two different types of media
  • Keep one copy offsite
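
One common way to produce the offsite copy is to bundle a category and mirror it to a remote host; this is only a sketch, and backup-host and /archive are placeholders rather than the lab's actual configuration:

# bundle raw data with a dated name, then mirror it offsite
tar -czf RAWDATA_$(date +%F).tar.gz /work/bioinformatics_hub/RAWDATA
rsync -av RAWDATA_*.tar.gz backup-host:/archive/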

Additional notes:

  • Backup failover storage is offline, not internet-connected, and inaccessible to users to maximize data security.
  • ResearchFS provides a scratch-style projects workspace that supports active analyses without disrupting existing workflows.