BABEL
Welcome to the Babel HPC Cluster!
Babel is a high-performance computing (HPC) cluster designed to deliver advanced computational capabilities for research and scientific tasks. This wiki page covers the cluster's architecture and specifications, along with resources to help you get started. To make the most of Babel, please review our documentation for job submission guidelines, available software, and best practices. If you have questions, don't hesitate to contact our support team or explore the user forums.
Happy computing!
Getting Started:
- Account Requests: Use this form to request a new account or changes to an existing one.
- Slack: We have a healthy and active Slack channel, #babel-babble, where you can ask questions, share insights, and stay updated on system announcements.
- Login: Securely connect to the cluster.
- Filesystem: Understand where to store your data, models, and personal directories.
- Submitting Jobs: How to schedule and manage your computational tasks using the resource scheduler.
- Getting Help: How to request access to resources, request account changes, and general help guidelines.
How to Login to the Cluster
From your SSH client, use the command:
ssh <username>@login.babel.cs.cmu.edu
- Credentials: Use your Andrew ID and password; SCS credentials have no power here.
There are 4 login nodes in a round-robin DNS setup. You'll be directed to one of these nodes randomly upon connection. If you land on a node other than where your tmux session is active (e.g., you're on login3 but your session is on login2), you can SSH directly to the correct node:
ssh login2
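If you connect frequently, an SSH host alias saves typing. Here is a minimal sketch of an ~/.ssh/config entry; the alias name "babel" is our own choice, not an official one:

```
Host babel
    HostName login.babel.cs.cmu.edu
    User <username>   # replace with your Andrew ID
```

With this in place, ssh babel is equivalent to the full command above.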
More about SSH
See also: Login_Node_Resource_Limits
Shell Configuration
If you would like to set or change your shell on all nodes of the cluster, including login and compute nodes, please submit your request via the Change Request Form.
If you require a shell that is not in the drop-down list, please add it to the notes and we'll do our best to accommodate.
Filesystem
Layout
Each user is provisioned with:
/home/<username>: 100GB capacity; mounted on all nodes
/data/user_data/<username>: 500GB capacity; available only on compute nodes with active jobs
Additional paths available on each compute node:
/data/datasets: Community datasets.
/data/models: Community models.
/scratch: Local SSD or NVMe storage; when it is more than 65% full, data older than 28 days is automatically expunged.
/compute/<node_name>: Each node's /scratch directory is exported to every other node.
Note: Home, user_data, datasets, and models are network data directories, mounted via AutoFS, an "on-demand" filesystem. You may need to stat the full path to trigger the mount mechanism. See Filesystem - Automount for more details.
For more details see: Filesystem#Filesystem Layout.
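For example, a job can stage intermediate files on its node's fast local scratch and read them back later from another node via the /compute export. This is a sketch only: the per-user subdirectory under /scratch and the node name are illustrative assumptions, not guaranteed paths.

```
# On the node running your job: stage data on fast local scratch
mkdir -p /scratch/$USER              # per-user subdirectory is an assumption
cp /data/user_data/$USER/input.bin /scratch/$USER/

# From any other node: the same files are reachable via the /compute export
ls /compute/<node_name>/$USER/       # substitute the actual node name
```

Remember that /scratch is purged automatically, so treat it as temporary space.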
Automount via AutoFS
Non-OS data directories are mounted with AutoFS, an "on-demand" filesystem. This means that if you do not reference a full path, the data directory will not mount and the higher-level directory may appear empty. For a more detailed explanation with examples, please read the Filesystem - Automount wiki entry located at AutoFS.
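A quick illustration of the on-demand behavior, using the /data/user_data path from the layout above:

```
# The parent directory may look empty before any mount is triggered:
ls /data/user_data

# Referencing the full path causes AutoFS to mount it on demand:
stat /data/user_data/$USER
ls /data/user_data/$USER
```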
Information about login nodes can be found here: Login Node Resource Limits
Submitting Jobs
Resource Scheduler
Slurm 20.11.9 is used for job scheduling.
There are 2 main ways to request resources:
- Interactive: Use srun for jobs where you need direct interaction with the running task, often after using salloc to allocate an interactive session.
- Batch: Use sbatch for jobs that can run without user interaction, typically longer or resource-intensive tasks, submitting them to the Slurm queue for scheduled execution.
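As a concrete sketch, a minimal batch script might look like the following. The job name, resource sizes, and workload line are illustrative; pick a partition from the list below and stay within its limits (check scontrol show part for the current configuration):

```
#!/bin/bash
#SBATCH --job-name=example         # illustrative name
#SBATCH --partition=general        # one of the partitions described below
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --cpus-per-task=8
#SBATCH --time=06:00:00            # must not exceed the partition's max time
#SBATCH --output=%x-%j.out         # job name + job ID

# Your workload goes here:
python train.py
```

Submit it with sbatch job.sh and monitor it with squeue -u $USER.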
Here's an overview of our main partitions:
- debug
Purpose: Quick, short jobs for testing and debugging.
Max Time: 12 hours
Default Time: 1 hour
Max GPUs: 2
Max CPUs: 64
QoS: debug_qos
Limitations: No array jobs
- general
Purpose: General, standard computing tasks.
Max Time: 2 days
Default Time: 6 hours
Max GPUs: 8
Max CPUs: 128
QoS: normal
Limitations: No interactive sessions | sbatch only
- preempt
Purpose: Long-running jobs that can be preempted for higher priority tasks.
Max Time: 31 days
Default Time: 3 hours
Max GPUs: 24
Max CPUs: 256
QoS: preempt_qos
Limitations: No interactive sessions | sbatch only
- cpu
Purpose: CPU-only computing tasks.
Max Time: 2 days
Default Time: 6 hours
Max GPUs: 0
Max CPUs: 128
QoS: cpu_qos
Limitations: No interactive sessions | sbatch only
- array
Purpose: Array jobs for parallel task execution.
Max Time: 12 days
Default Time: 6 hours
Max GPUs: 8
Max CPUs: 256
QoS: array_qos
Limitations: No interactive sessions | sbatch only
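Since debug appears to be the only partition above without the "no interactive sessions" limitation, an interactive GPU shell might be requested like this (resource sizes are illustrative):

```
# One GPU, four CPUs, one hour on the debug partition, with a pseudo-terminal:
srun --partition=debug --gres=gpu:1 --cpus-per-task=4 --time=01:00:00 --pty bash
```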
Partition Table
Name         MaxTRESPU    MaxJobsPU  MaxSubmitPU  MaxTRES  MinTRES     Preempt
-----------  -----------  ---------  -----------  -------  ----------  ---------------------
normal       gres/gpu=8   10         50           cpu=128  gres/gpu=1  array_qos,preempt_qos
preempt_qos  gres/gpu=24  24         100          cpu=256  gres/gpu=1
debug_qos    gres/gpu=2   10         12           cpu=64               preempt_qos
cpu_qos      gres/gpu=0   10         50           cpu=128              preempt_qos
array_qos    gres/gpu=8   100        10000        cpu=256              preempt_qos
Viewing Partition Details
To explore the full configuration of all partitions, use the scontrol command:
scontrol show part: Displays detailed information about all available partitions.
For specifics on a particular partition, include its name:
scontrol show part <partition_name>: Shows detailed settings for the specified partition (e.g., scontrol show part debug).
For detailed information on how to use these partitions, see our documentation here.
Viewing QoS Details
Each partition is associated with a specific QoS (e.g., debug_qos, normal, preempt_qos), which defines rules such as maximum resource usage and preemption behavior.
To view QoS information associated with your user account:
sacctmgr show user $USER withassoc format=User,Account,DefaultQOS,QOS%4
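You can also inspect the QoS definitions themselves; the format fields below mirror the columns of the partition table above:

```
# List each QoS with its resource and job limits:
sacctmgr show qos format=Name%12,MaxTRESPU%20,MaxJobsPU,MaxSubmitPU,MaxTRES%12,MinTRES%12,Preempt%22
```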
Getting Help
We focus on providing robust computing infrastructure rather than individual software support.
- Slack is our first line of support. Post issues in the #babel-babble channel; to get the admins' attention, tag us with @help-babel. If privacy is needed, send a direct message (DM).
- User Change Request: Use this form to request changes to your account, including additional groups, a different default shell, and increases in disk capacity.
What We Won't Do
- IDE Support: VSCode, PyCharm, Jupyter Notebook, et cetera.
- Troubleshoot / Debug Code: Your code is your own.
If you think there may be an issue with our environment that is preventing your code from running, let us know.
Additional Resources:
Tips and Tricks: Configs and best practices from our community.
Request Forms:
- User Change Request: Use this form to request changes to your account: additional groups, default shell, and increases in disk capacity.
- Dataset Request
Environment Modules: Modules to modify your user environment
Hugging Face Cache Configuration
FAQ
Upgrade History
- Birth: 2023-05-04 18:02:55.854769612 -0400
- Hydra9: 2024-10-15 16:35:37.128177207 -0400
- Migration to MSA: 2025-10-22 10:34:01.158642535 -0400