Filesystem


** UNDER DEVELOPMENT **

This page provides an overview of the storage setup for the Babel HPC cluster, including the filesystem layout and guidelines for using each path effectively.

Filesystem Layout

Our cluster's filesystem layout (excluding OS-related directories) is as follows:

  • /data/
 - datasets/ - Storage for large datasets used in training, validation, or testing of machine learning models.
 - models/ - Storage for model weights, checkpoints, and final models.
 - user_data/ - General-purpose storage for user-specific data that needs to persist across jobs.
  • /home/<username> - Home directories for individual users.
  • /scratch - A large, temporary storage volume (5.5T) for job-related data that may persist across multiple job runs or be shared between jobs.
  • /scratch/job_tmp - A dedicated temporary storage volume (1.5T) for job-specific data managed by the `job_container/tmpfs` plugin.
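
To get a quick picture of how much space is available on these volumes, you can query the mounts directly. This is a minimal sketch using standard coreutils; whether /data and its subdirectories are separate mounts is an assumption, so the output may group them differently.

<syntaxhighlight lang="bash">
# Show size, used, and available space for the main storage volumes
df -h /home /data /scratch /scratch/job_tmp
</syntaxhighlight>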

Storage Details

Job Storage and Results

/scratch

  • Purpose: `/scratch` is an XFS filesystem designed for temporary or intermediate data that may need to persist beyond a single job's lifecycle or be shared between jobs.
  • Usage: Ideal for large datasets, intermediate results, or outputs that need to be preserved for subsequent jobs or analysis.
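
For example, a common pattern is to stage data into a personal directory under /scratch before running jobs. This is a sketch only: the per-user directory /scratch/$USER follows the /scratch/<username> convention used in the example job script below, and <dataset_name> is a placeholder.

<syntaxhighlight lang="bash">
# Create a personal working directory on /scratch (naming convention is an assumption)
mkdir -p /scratch/$USER

# Stage a dataset from shared storage; <dataset_name> is a placeholder
cp -r /data/datasets/<dataset_name> /scratch/$USER/
</syntaxhighlight>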

/scratch/job_tmp

  • Purpose: `/scratch/job_tmp` is an XFS filesystem providing each job with an isolated, high-performance temporary space.
  • Usage: Best suited for job-specific temporary files, such as intermediate results, and for high-speed I/O during job execution. Data in `/scratch/job_tmp/$SLURM_JOB_ID` is automatically cleaned up when the job completes.
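
Inside a job, one way to take advantage of this space is to point temporary-file locations at the per-job directory. This is a minimal sketch: the $SLURM_JOB_ID/_tmp path follows the example job script below, and whether a given program honors TMPDIR is application-dependent.

<syntaxhighlight lang="bash">
# Point temporary files at the job's private, auto-cleaned directory
export TMPDIR=/scratch/job_tmp/$SLURM_JOB_ID/_tmp
mkdir -p "$TMPDIR"

# Tools that respect TMPDIR (e.g. mktemp) will now write here
mktemp
</syntaxhighlight>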

Best Practices

  • Monitor Quotas: Regularly check your quota usage on /scratch/job_tmp to avoid exceeding limits.
  • Clean Up Data: /scratch/job_tmp is cleaned up automatically, but /scratch is not; remove data from /scratch once it is no longer needed to free up space (see the sketch after this list).
  • Optimize I/O: For performance-critical operations, prioritize /scratch/job_tmp over /scratch due to its isolated, high-speed nature.
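
For example, a couple of quick checks for the points above (a sketch assuming your data lives under a per-user /scratch/<username> directory; the old_results path is purely illustrative):

<syntaxhighlight lang="bash">
# See how much space your data occupies on /scratch
du -sh /scratch/$USER

# Remove data you no longer need (directory name is hypothetical)
rm -rf /scratch/$USER/old_results
</syntaxhighlight>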

Quotas on /scratch/job_tmp

To manage disk usage effectively, quotas are enforced on `/scratch/job_tmp`. These quotas apply to both users and groups:

  • User Quotas:
 - Soft Limit: 500GB
 - Hard Limit: 1TB
 - Grace Period: 7 days
  • Group Quotas:
 - Soft Limit: set per group
 - Hard Limit: set per group
 - Grace Period: set per group

Users or groups exceeding the soft limit will receive a warning and have a 7-day grace period to bring their usage back below the soft limit; once the grace period expires (or the hard limit is reached), further writes are blocked. To check your quota usage:

xfs_quota -x -c 'report -u' /scratch/job_tmp  # For users
xfs_quota -x -c 'report -g' /scratch/job_tmp  # For groups
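
The expert-mode (-x) report above may require elevated privileges. If it does, you can usually still query your own usage and limits in xfs_quota's basic mode; this assumes user quota accounting on /scratch/job_tmp is readable by regular users.

<syntaxhighlight lang="bash">
# Show your own block usage, soft/hard limits, and remaining grace time
xfs_quota -c 'quota -u -h' /scratch/job_tmp
</syntaxhighlight>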

Usage Guidelines

General Workflow

Use /scratch for large, temporary data that may need to persist across jobs or be shared between jobs. Use /scratch/job_tmp (via job_container/tmpfs) for fast, job-specific temporary data that is automatically cleaned up after job completion.

Example Job Script

Here’s an example of how to use /scratch, /scratch/job_tmp, and /data/user_data/<username> in a job script:

<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --job-name=large_ml_job
#SBATCH --output=output_%j.txt

# Copy the large dataset to the job's fast temporary space
cp /scratch/<username>/large_dataset /scratch/job_tmp/$SLURM_JOB_ID/_tmp/

# Run the job using the fast local storage
./my_ml_program

# Copy results back to /scratch for temporary persistence
cp /scratch/job_tmp/$SLURM_JOB_ID/_tmp/results /scratch/<username>/results_$SLURM_JOB_ID

# Copy results to /data/user_data/<username> for long-term persistence
rsync -aAvPHxX /scratch/job_tmp/$SLURM_JOB_ID/_tmp/results /data/user_data/<username>/results_$SLURM_JOB_ID
</syntaxhighlight>
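
To run this, save it as a script and submit it with sbatch; the script name below is just an example. Once the job finishes, the results remain under /data/user_data/<username>.

<syntaxhighlight lang="bash">
# Submit the job script (the filename is arbitrary)
sbatch large_ml_job.sh

# Check on the job while it runs
squeue -u $USER
</syntaxhighlight>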