= Welcome to the Babel HPC Cluster! =
Babel is a high-performance computing (HPC) cluster designed to deliver advanced computational capabilities for research and scientific tasks. This wiki page describes the cluster’s architecture and specifications and links to resources to help you get started.

To make the most of Babel’s resources, please review our documentation for job submission guidelines, available software, and best practices. If you have questions, contact our support team or explore the user forums. Happy computing!

__TOC__

== Getting Started: ==
* '''[https://docs.google.com/forms/d/e/1FAIpQLSccfWvXdBltL8oxPYEZPGD-IWTnXXPqQS2bwcHr72wpRi1l6A/viewform Account Requests]:''' Use this form to request an account or changes to an existing account.
* '''[[BABEL#Getting_Help|Slack:]]''' We have a healthy and active Slack channel, #babel-babble, where you can ask questions, share insights, and stay updated on system announcements.
* '''[[#How to Login to the Cluster|Login:]]''' Securely connect to the cluster.
* '''[[#Filesystem|Filesystem:]]''' Understand where to store your data, models, and personal directories.
* '''[[#Submitting Jobs|Submitting Jobs:]]''' How to schedule and manage your computational tasks using the resource scheduler.
* '''[[#Getting Help|Getting Help:]]''' How to request access to resources or account changes, plus general help guidelines.

== How to Login to the Cluster ==
From your [[SSH]] client, use the command:
 ssh <username>@login.babel.cs.cmu.edu
* '''Credentials:''' Use your Andrew ID and password; SCS credentials have no power here.
There are 4 login nodes in a round-robin DNS setup. You'll be directed to one of these nodes randomly upon connection.
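
Because the login node is chosen at random, it can help to check where a connection landed before looking for an existing tmux session. The snippet below is a minimal sketch, not part of the cluster's tooling: <code>current_node</code> and <code>list_sessions</code> are hypothetical helper names, and it assumes only that <code>uname</code> is available (tmux is optional).

```shell
#!/usr/bin/env bash

# Print the short name of the node this shell is running on (e.g. login3).
current_node() {
    uname -n | cut -d. -f1
}

# List tmux sessions on this node, or report that there are none.
# Assumption: tmux may not be installed everywhere, so that case is handled too.
list_sessions() {
    if command -v tmux >/dev/null 2>&1 && tmux ls 2>/dev/null; then
        return 0
    fi
    echo "no tmux sessions on $(current_node)"
}

echo "connected to: $(current_node)"
list_sessions
```

If the session you want is not listed, it is probably running on one of the other login nodes.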
If you land on a node other than the one where your tmux session is active (e.g., you're on login3 but your session is on login2), you can SSH directly to the correct node:
 ssh login2
More about [[SSH]]

See also: [[Login_Node_Resource_Limits]]

== Shell Configuration ==
If you would like to set or change your shell on all nodes of the cluster, including login and compute nodes, please submit your request via the [https://docs.google.com/forms/d/e/1FAIpQLSccfWvXdBltL8oxPYEZPGD-IWTnXXPqQS2bwcHr72wpRi1l6A/viewform User Change Request Form]. If you require a shell that is not in the drop-down list, please add it to the notes and we'll do our best to accommodate.

==Filesystem==
=== Layout ===
Each user is provisioned with:
<pre>
/home/<username>: 100GB capacity; mounted on all nodes
/data/user_data/<username>: 500GB capacity; available only on compute nodes with active jobs
</pre>
Additional paths available on each compute node:
<pre>
/data/datasets: Community datasets.
/data/models: Community models.
/scratch: Local SSD or NVMe storage; when more than 65% full, data older than 28 days is automatically expunged.
/compute/<node_name>: Each node's /scratch directory is exported to every other node.
</pre>
'''Note:''' Home, user_data, datasets, and models are network data directories, mounted via [[Filesystem#Automount_via_AutoFS|AutoFS]], which is an “on-demand” filesystem. You may need to stat the full path to trigger the mount mechanism. See [[Filesystem#Automount_via_AutoFS|Filesystem - Automount]] for details on mounting and [[Filesystem#Filesystem Layout]] for the full layout.

=== Automount via AutoFS ===
Non-OS data directories are mounted with AutoFS, which is an "on-demand" filesystem. This means that a data directory is not mounted until you access it by its full path; until then, the higher-level directory may appear empty. For a more detailed explanation with examples, please read the Filesystem - Automount wiki entry located at [[Filesystem#Automount via AutoFS|AutoFS]].
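
The on-demand behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not an official tool: <code>trigger_mount</code> is a hypothetical helper name, and the target path simply follows the per-user layout listed in the Filesystem section.

```shell
#!/usr/bin/env bash

# Access a full path so that AutoFS mounts it on demand.
# Returns non-zero when the path cannot be resolved.
trigger_mount() {
    local path="$1"
    stat "$path" >/dev/null 2>&1
}

# Assumption: the per-user data directory from the Layout section.
# It is only available on compute nodes with active jobs.
target="/data/user_data/${USER:-$(id -un)}"

if trigger_mount "$target"; then
    echo "mounted: $target"
else
    echo "not available here: $target"
fi
```

Listing only the parent directory (for example <code>ls /data/user_data</code>) may show nothing until a full path underneath it has been accessed at least once.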

Information about login nodes can be found here: [[Login_Node_Resource_Limits|Login Node Resource Limits]]

==Submitting Jobs==
=== Resource Scheduler ===
[[Slurm]] 20.11.9 is used for job scheduling. There are '''2''' main ways to request resources:
* '''Interactive:''' Use <code>srun</code> for jobs where you need direct interaction with the running task, often after allocating resources with <code>salloc</code>.
* '''Batch:''' Use <code>sbatch</code> for jobs that can run without user interaction, typically longer or resource-intensive tasks; they are submitted to the Slurm queue for scheduled execution.
Here's an overview of our main partitions:
* '''debug'''
<pre>
Purpose: Quick, short jobs for testing and debugging.
Max Time: 12 hours
Default Time: 1 hour
Max GPUs: 2
Max CPUs: 64
QoS: debug_qos
Limitations: No array jobs
</pre>
* '''general'''
<pre>
Purpose: General, standard computing tasks.
Max Time: 2 days
Default Time: 6 hours
Max GPUs: 8
Max CPUs: 128
QoS: normal
Limitations: No interactive sessions | sbatch only
</pre>
* '''preempt'''
<pre>
Purpose: Long-running jobs that can be preempted for higher priority tasks.
Max Time: 31 days
Default Time: 3 hours
Max GPUs: 24
Max CPUs: 256
QoS: preempt_qos
Limitations: No interactive sessions | sbatch only
</pre>
* '''cpu'''
<pre>
Purpose: CPU-only computing tasks.
Max Time: 2 days
Default Time: 6 hours
Max GPUs: 0
Max CPUs: 128
QoS: cpu_qos
Limitations: No interactive sessions | sbatch only
</pre>
* '''array'''
<pre>
Purpose: Array jobs for parallel task execution.
Max Time: 12 days
Default Time: 6 hours
Max GPUs: 8
Max CPUs: 256
QoS: array_qos
Limitations: No interactive sessions | sbatch only
</pre>

=== Partition Table ===
<pre>
Name         MaxTRESPU    MaxJobsPU  MaxSubmitPU  MaxTRES   MinTRES      Preempt
-----------  -----------  ---------  -----------  --------  -----------  ----------------------
normal       gres/gpu=8   10         50           cpu=128   gres/gpu=1   array_qos,preempt_qos
preempt_qos  gres/gpu=24  24         100          cpu=256   gres/gpu=1
debug_qos    gres/gpu=2   10         12           cpu=64                 preempt_qos
cpu_qos      gres/gpu=0   10         50           cpu=128                preempt_qos
array_qos    gres/gpu=8   100        10000        cpu=256                preempt_qos
</pre>

=== Viewing Partition Details ===
To explore the full configuration of all partitions, use the <code>scontrol</code> command:
* <code>scontrol show part</code>: Displays detailed information about all available partitions.
For specifics on a particular partition, include its name:
* <code>scontrol show part <partition_name></code>: Shows detailed settings for the specified partition (e.g., <code>scontrol show part debug</code>).
For detailed information on how to use these partitions, see our documentation [[Slurm|here]].

=== Viewing QoS Details ===
Each partition is associated with a specific QoS (e.g., <code>debug_qos</code>, <code>normal</code>, <code>preempt_qos</code>), which defines rules such as maximum resource usage and preemption behavior.

To view QoS information associated with your user account:
 sacctmgr show user $USER withassoc format=User,Account,DefaultQOS,QOS%4

== Getting Help ==
We focus on providing robust computing infrastructure rather than individual software support.
* [https://lti-at-cmu.slack.com/archives/C05HRL547MX Slack] is our first line of support. Post issues in the '''#babel-babble''' channel; to get the admins' attention, tag us with @help-babel. If privacy is needed, send a direct message (DM).
* [https://docs.google.com/forms/d/e/1FAIpQLSccfWvXdBltL8oxPYEZPGD-IWTnXXPqQS2bwcHr72wpRi1l6A/viewform User Change Request]: Use this form to request changes to your account, including additional groups, a different default shell, and increases in disk capacity.

==== What We Won't Do ====
* IDE Support: [[VSCode]], PyCharm, Jupyter Notebook, et cetera.
* Troubleshoot / Debug Code: Your code is your own. If you think an issue with our environment is preventing your code from running, let us know.

=== Additional Resources: ===
[[#Tips and Tricks|Tips and Tricks:]] Configs and best practices from our community.

Request Forms:
* [https://docs.google.com/forms/d/e/1FAIpQLSccfWvXdBltL8oxPYEZPGD-IWTnXXPqQS2bwcHr72wpRi1l6A/viewform User Change Request:] Use this form to request changes to your account: additional groups, default shell, and increases in disk capacity.
* [link|Dataset Request:]

[[Environment Modules|Environment Modules:]] Modules to modify your user environment

* [[Tips_%26_Tricks|Tips and Tricks]]
* [[Hugging_Face|Hugging Face Cache Configuration]]

=== FAQ ===
* [[FAQ|FAQ]]

== Upgrade History ==
* Birth: '''2023-05-04 18:02:55.854769612 -0400'''
* Hydra9: '''2024-10-15 16:35:37.128177207 -0400'''
* Migration to MSA: '''2025-10-22 10:34:01.158642535 -0400'''