= Welcome to the Babel HPC Cluster! =
Babel is a high-performance computing (HPC) cluster designed to deliver advanced computational capabilities for research and scientific tasks. This wiki page offers details on the cluster’s architecture and specifications, alongside resources to help you get started. To make the most of Babel’s high-performance computing resources, please review our documentation for job submission guidelines, available software, and best practices. If you have questions, don’t hesitate to contact our support team or explore the user forums.

Happy computing! 

__TOC__
== Getting Started: ==
* '''[https://docs.google.com/forms/d/e/1FAIpQLSccfWvXdBltL8oxPYEZPGD-IWTnXXPqQS2bwcHr72wpRi1l6A/viewform Account Requests]:''' Use this form to request an account or changes to an existing account.
* '''[[BABEL#Getting_Help|Slack:]]''' We have a healthy and active Slack channel, #babel-babble, where you can ask questions, share insights, and stay updated on system announcements.
* '''[[#How to Login to the Cluster|Login:]]''' Securely connect to the cluster.
* '''[[#Filesystem|Filesystem:]]''' Understand where to store your data, models, and personal directories.
* '''[[#Submitting Jobs|Submitting Jobs:]]''' How to schedule and manage your computational tasks using the resource scheduler.
* '''[[#Getting Help|Getting Help:]]''' How to get access to resources, request account changes, and find general help guidelines.

== How to Login to the Cluster ==
From your [[SSH]] client, use the command:
  ssh <username>@login.babel.cs.cmu.edu

* '''Credentials:''' Use your Andrew ID and password; SCS credentials have no power here.

There are 4 login nodes in a round-robin DNS setup. You'll be directed to one of these nodes randomly upon connection. If you land on a node other than where your tmux session is active (e.g., you're on login3 but your session is on login2), you can SSH directly to the correct node:
  ssh login2
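
To check which login node you landed on and whether your tmux session lives there, something like the following may help. This is a minimal sketch: it only assumes that <code>uname</code> is available and that tmux, if installed, reports its sessions with <code>tmux ls</code>.

```shell
# Short hostname of the login node this shell is running on (e.g. login3).
current_node="$(uname -n)"
current_node="${current_node%%.*}"
echo "Current node: ${current_node}"

# Report whether a tmux server with sessions is running on this node.
if command -v tmux >/dev/null 2>&1 && tmux ls >/dev/null 2>&1; then
    echo "tmux sessions found here:"
    tmux ls
else
    echo "No tmux sessions on ${current_node}; try the other login nodes, e.g. ssh login2"
fi
```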

More about [[SSH]]

See also: [[Login_Node_Resource_Limits]]

== Shell Configuration ==

If you would like to set or change your shell on all nodes of the cluster, including login and compute nodes, please submit your request via the [https://docs.google.com/forms/d/e/1FAIpQLSccfWvXdBltL8oxPYEZPGD-IWTnXXPqQS2bwcHr72wpRi1l6A/viewform User Change Request Form].

If you require a shell that is not in the drop-down list, please add it to the notes and we'll do our best to accommodate.

==Filesystem==

=== Layout ===
Each user is provisioned with:
<pre>
          /home/<username>: 100GB capacity; mounted on all nodes
/data/user_data/<username>: 500GB capacity; available only on compute nodes with active jobs
</pre>

Additional paths available on each compute node:
<pre>
      /data/datasets: Community datasets.
        /data/models: Community models.
            /scratch: Local SSD or NVMe storage; when more than 65% full, data older than 28 days is automatically expunged.
/compute/<node_name>: Each node's /scratch directory is exported to every other node.
</pre>
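
As an illustration of the <code>/compute/<node_name></code> export, the helper below rewrites a path under <code>/scratch</code> into the path another node would use to reach it. It is a sketch that assumes <code>/compute/<node_name></code> corresponds to the root of that node's <code>/scratch</code>; the function name and the node name <code>example-node</code> are hypothetical.

```shell
# Translate a path under /scratch on a given node into the path that
# other nodes would use via /compute/<node_name> (assumed mapping).
scratch_to_compute() {
    node="$1"
    path="$2"
    case "$path" in
        /scratch|/scratch/*)
            printf '/compute/%s%s\n' "$node" "${path#/scratch}"
            ;;
        *)
            echo "not under /scratch: $path" >&2
            return 1
            ;;
    esac
}

# Hypothetical node name and file, for illustration only.
scratch_to_compute example-node /scratch/checkpoints/step_100.pt
```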

'''Note:''' Home, user_data, datasets, and models are network data directories, mounted via [[Filesystem#Automount_via_AutoFS|AutoFS]], which is an “on-demand” filesystem. You may need to stat the full path to trigger the mount mechanism. See Filesystem - Automount for more details.

For more details see: [[Filesystem#Filesystem Layout]].

=== Automount via AutoFS ===
Non-OS data directories are mounted with AutoFS, which is an "on-demand" filesystem. This means that if you do not reference the full path, the data directory will not mount and the parent directory may appear empty. For a more detailed explanation with examples, please read the Filesystem - Automount wiki entry located at [[Filesystem#Automount via AutoFS|AutoFS]].
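
A small helper can make the "stat the full path" step explicit. The sketch below is an assumption about how you might wrap it, not a site-provided tool; the function name <code>trigger_mount</code> is made up for illustration.

```shell
# Stat a full path so AutoFS has a chance to mount it, then report the result.
trigger_mount() {
    if stat "$1" >/dev/null 2>&1; then
        echo "available: $1"
    else
        echo "unavailable: $1" >&2
        return 1
    fi
}

# On a compute node with an active job, the full path would look like:
#   trigger_mount "/data/user_data/${USER}"
# Listing only the parent (ls /data/user_data) may show nothing until then.
trigger_mount "${HOME:-/}" || echo "home directory not reachable from this node"
```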

Information about login nodes can be found here: [[Login_Node_Resource_Limits|Login Node Resource Limits]]

==Submitting Jobs==
=== Resource Scheduler ===
[[Slurm]] 20.11.9 is used for job scheduling.

There are '''2''' main ways to request resources:

* '''Interactive:''' Use <code>srun</code> for jobs where you need direct interaction with the running task, often after using <code>salloc</code> for interactive sessions.
* '''Batch:''' Use <code>sbatch</code> for jobs that can run without user interaction, typically for longer or resource-intensive tasks, submitting them to the Slurm queue for scheduled execution.
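
The two request styles might look like the sketch below. It assumes <code>debug</code> accepts interactive sessions (it is the only partition on this page not marked "sbatch only") and that <code>general</code> is used for the batch job; the job name, file name, and resource numbers are placeholders rather than recommended settings.

```shell
# Interactive: request one GPU on the debug partition and open a shell.
# Kept as a string so this sketch is safe to paste anywhere.
interactive_cmd="srun --partition=debug --gres=gpu:1 --cpus-per-task=4 --time=01:00:00 --pty bash"
echo "Interactive: ${interactive_cmd}"

# Batch: write a small job script (hypothetical name), then hand it to sbatch.
job_script="hello_babel.sbatch"
cat > "${job_script}" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_babel
#SBATCH --partition=general
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=hello_babel_%j.out

echo "Running on $(hostname)"
EOF

# Submit only where Slurm is installed; elsewhere just report what was written.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "${job_script}"
else
    echo "sbatch not found; wrote ${job_script} without submitting"
fi
```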

Here's an overview of our main partitions:

* '''debug'''
<pre>
       Purpose: Quick, short jobs for testing and debugging.
      Max Time: 12 hours
  Default Time: 1 hour
      Max GPUs: 2
      Max CPUs: 64
           QoS: debug_qos
   Limitations: No array jobs
</pre>

* '''general'''
<pre>
       Purpose: General, standard computing tasks.
      Max Time: 2 days
  Default Time: 6 hours
      Max GPUs: 8
      Max CPUs: 128
           QoS: normal
   Limitations: No interactive sessions | sbatch only
</pre>

* '''preempt'''
<pre>
       Purpose: Long-running jobs that can be preempted for higher priority tasks.
      Max Time: 31 days
  Default Time: 3 hours
      Max GPUs: 24
      Max CPUs: 256
           QoS: preempt_qos
   Limitations: No interactive sessions | sbatch only
</pre>

* '''cpu'''
<pre>
       Purpose: CPU-only computing tasks.
      Max Time: 2 days
  Default Time: 6 hours
      Max GPUs: 0
      Max CPUs: 128
           QoS: cpu_qos
   Limitations: No interactive sessions | sbatch only
</pre>

* '''array'''
<pre>
       Purpose: Array jobs for parallel task execution.
      Max Time: 12 days
  Default Time: 6 hours
      Max GPUs: 8
      Max CPUs: 256
           QoS: array_qos
   Limitations: No interactive sessions | sbatch only
</pre>
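
For the <code>array</code> partition, a job script might look like the sketch below. Each task reads <code>SLURM_ARRAY_TASK_ID</code> (set by Slurm) and picks one value from a list. The job name, sweep values, and resource numbers are placeholders, and <code>pick_value</code> is a hypothetical helper.

```shell
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --partition=array
#SBATCH --array=0-2
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00
#SBATCH --output=array_demo_%A_%a.out

# Slurm sets SLURM_ARRAY_TASK_ID for each task; default to 0 so the
# script can also be run by hand outside of Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-0}"

# Hypothetical sweep values; each array task picks one by zero-based index.
values="0.1 0.01 0.001"

pick_value() {
    i=0
    for v in $values; do
        if [ "$i" -eq "$1" ]; then
            echo "$v"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

lr="$(pick_value "$task_id")"
echo "task ${task_id} uses lr=${lr}"
```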

=== Partition Table ===

<pre>
      Name    MaxTRESPU MaxJobsPU MaxSubmitPU  MaxTRES     MinTRES                Preempt
----------- ------------ --------- ----------- -------- ----------- ----------------------
     normal   gres/gpu=8        10          50  cpu=128  gres/gpu=1  array_qos,preempt_qos
preempt_qos  gres/gpu=24        24         100  cpu=256  gres/gpu=1
  debug_qos   gres/gpu=2        10          12   cpu=64                        preempt_qos
    cpu_qos   gres/gpu=0        10          50  cpu=128                        preempt_qos
  array_qos   gres/gpu=8       100       10000  cpu=256                        preempt_qos
</pre>

=== Viewing Partition Details ===

To explore the full configuration of all partitions, use the <code>scontrol</code> command:

* <code>scontrol show part</code>: Displays detailed information about all available partitions.

For specifics on a particular partition, include its name:

* <code>scontrol show part <partition_name></code>: Shows detailed settings for the specified partition (e.g., <code>scontrol show part debug</code>).
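
To pull out only the time limits and QoS from that output, a small filter can help. The sketch below assumes <code>scontrol</code> prints space-separated <code>Key=Value</code> fields (such as <code>PartitionName</code>, <code>DefaultTime</code>, <code>MaxTime</code>, and <code>QoS</code>); the function name <code>part_limits</code> is made up for illustration.

```shell
# Keep only selected Key=Value fields from `scontrol show part` output.
part_limits() {
    tr -s ' ' '\n' | grep -E '^(PartitionName|DefaultTime|MaxTime|QoS)='
}

# Query the live cluster only where scontrol is installed.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show part debug | part_limits
fi
```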

For detailed information on how to use these partitions, see our documentation [[Slurm|here]].

=== Viewing QoS Details ===
Each partition is associated with a specific QoS (e.g., <code>debug_qos</code>, <code>normal</code>, <code>preempt_qos</code>), which defines rules such as maximum resource usage and preemption behavior.

To view QoS information associated with your user account:
  
  sacctmgr show user $USER withassoc format=User,Account,DefaultQOS,QOS%4

== Getting Help ==
We focus on providing robust computing infrastructure rather than individual software support. 

* [https://lti-at-cmu.slack.com/archives/C05HRL547MX Slack] is our first line of support. Post issues in the '''#babel-babble''' channel; to get the admins' attention, tag us with @help-babel. If privacy is needed, send a direct message (DM).
* [https://docs.google.com/forms/d/e/1FAIpQLSccfWvXdBltL8oxPYEZPGD-IWTnXXPqQS2bwcHr72wpRi1l6A/viewform User Change Request]: Use this form for requesting changes to your account, including additional groups, changing your default shell, and requesting increases in disk capacity.

=== What We Won't Do ===
* IDE Support: [[VSCode]], PyCharm, Jupyter Notebook, et cetera.
* Troubleshoot / Debug Code: Your code is your own.

If you think an issue with our environment is preventing your code from running, let us know.

=== Additional Resources: ===

[[Tips_%26_Tricks|Tips and Tricks:]] Configs and best practices from our community.

Request Forms: 
* [https://docs.google.com/forms/d/e/1FAIpQLSccfWvXdBltL8oxPYEZPGD-IWTnXXPqQS2bwcHr72wpRi1l6A/viewform User Change Request:] Use this form to request changes to your account: additional groups, default shell, and increases in disk capacity.
* [link|Dataset Request:]

[[Environment Modules|Environment Modules:]] Modules to modify your user environment.

[[Hugging_Face|Hugging Face Cache Configuration]]

=== FAQ ===
* [[FAQ|FAQ]]

== Upgrade History ==
* Birth: '''2023-05-04 18:02:55.854769612 -0400'''
* Hydra9: '''2024-10-15 16:35:37.128177207 -0400'''
* Migration to MSA: '''2025-10-22 10:34:01.158642535 -0400'''