HPC Terminology
This article provides an overview of key terminology and concepts related to High-Performance Computing (HPC). It aims to familiarize users with essential terms and their meanings in the context of HPC. By understanding these terms, users can gain a solid foundation for exploring and engaging in the field of high-performance computing.
Cluster Architecture
The cluster comprises several key components:
- Operating System: Springdale 8 [Red Hat Enterprise Linux compliant]
- Kernel: x86_64 Linux 4.18.0-372.32.1.el8_6.x86_64
- Login Node: The login node is used for logging into the cluster, launching jobs, and connecting to compute nodes.
- Head Node: The control node serves as the primary management and coordination point for the cluster, and is responsible for several key functions:
  - Ansible Control Node: The control node is the primary Ansible node, responsible for managing and automating tasks across the entire system.
  - SLURM Controller: The control node manages the SLURM installation and configuration, and is responsible for scheduling and managing jobs on the compute nodes.
  - SLURM Database: The control node may also serve as the primary database node, storing and managing data related to the cluster's configuration, job scheduling, and system performance.
- Compute Nodes: The compute nodes provide CPU and GPU resources, local scratch space, and network-mounted storage for running compute-intensive tasks.
- NAS: The NAS provides network-attached storage for the cluster, allowing users to store and access data from anywhere on the network.
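For example, once logged in to the login node you can query the SLURM controller about the state of the cluster. This is a minimal sketch using standard SLURM client tools; the node name shown is only an illustration.

# List partitions and node states known to the SLURM controller
sinfo

# Show your own queued and running jobs
squeue -u $USER

# Inspect a single compute node (node name is an example)
scontrol show node babel-0-19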
Filesystem Layout
- Each user is provisioned with:
/home/<username>: 100GB
/data/user_data/<username>: 500GB
- NFS-mounted storage via autofs (i.e., it is not local disk on each compute node).
/home/: mounted on the login nodes and all compute nodes
/data/datasets: available only on the compute nodes
/compute/<node_name>: available only on the compute nodes
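A quick way to check how much of your provisioned space you are using is sketched below. The exact quota enforcement mechanism may differ, so treat this as a rough check rather than an authoritative report.

# Size and free space of the NFS mounts backing your directories
df -h /home/$USER /data/user_data/$USER

# Total size of what you have stored (can be slow on large trees)
du -sh /home/$USER /data/user_data/$USER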
AutoFS notes:
AutoFS directories are not always mounted because AutoFS is an "on-demand" filesystem. You may need to stat the full path to the files you are looking for. For example, the output of ls /compute/ might seem empty; however, if you run ls /compute/babel-0-23, the contents of that node's /scratch directory will be revealed.
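For instance, explicitly referencing the full path is enough to trigger the on-demand mount (babel-0-23 is just an example node name):

# The top-level directory may look empty until a node is referenced
ls /compute/

# stat (or ls) the full path to force autofs to mount that node's scratch
stat /compute/babel-0-23
ls /compute/babel-0-23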
Data Directories
Community datasets are placed at /data/datasets/<thing>. If you have a dataset or model that would be useful to everyone, let the administrator know by sending e-mail to help@cs.cmu.edu with the details of the request.
If you or your group requires additional space, have your sponsor send e-mail to help@cs.cmu.edu with the details of the request.
Current Community datasets are:
/data/datasets/llama/weights/meta_ai
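If you use a community dataset, reference the shared copy directly rather than duplicating it into your own space. A minimal sketch (the environment variable name is only illustrative):

# Check that the dataset is visible (remember: /data/datasets is only on compute nodes)
ls /data/datasets/llama/weights/meta_ai

# Point your own scripts at the shared location instead of copying it
export LLAMA_WEIGHTS_DIR=/data/datasets/llama/weights/meta_ai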
Local Scratch Partition
If you are frequently accessing large files, you should first move them to the /scratch directory of the local machine your job is running on. Read/write is much faster locally than it is over the network.
The /scratch directory of each node is exported via NFS to the other nodes in the cluster. The local disk of node babel-X-X can be accessed at /compute/babel-X-X from other nodes. This allows for faster access and reduces pressure on the NAS, and is highly recommended for large or frequently accessed files; see the staging sketch at the end of this section.
Notes:
- If you ls /compute/, it might seem empty. However, if you add the compute node's hostname to the path, autofs will mount the remote directory and you will be able to access its /scratch partition. For example:
[dvosler@babel-1-27 ~]# ls -la /compute/
total 4
drwxr-xr-x   2 root root    0 Jun 23 15:02 .
dr-xr-xr-x. 21 root root 4096 Jun 23 15:02 ..

[root@babel-1-27 ~]# ls -la /compute/babel-0-19
total 28
drwxrwxrwt 4 root    root     4096 May 29 10:25 .
drwxr-xr-x 3 root    root        0 Jun 28 16:17 ..
drwx------ 2 root    root    16384 May  5 15:02 lost+found
drwxrwxr-x 3 dvosler dvosler  4096 May 11 18:41 things
-rw-rw-r-- 1 dvosler dvosler     1 May 25 10:41 version.txt
- Compute node /scratch space should only hold temporary files: you are expected to clean up after yourself, but files are not deleted automatically (true as of 06/28/2023).
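As a concrete illustration of the staging pattern described above, here is a minimal sbatch sketch that copies input data to local /scratch, works on it there, copies results back to network storage, and cleans up. The partition name, paths, and file names are placeholders for illustration, not actual cluster settings.

#!/bin/bash
#SBATCH --job-name=scratch-staging-example
#SBATCH --nodes=1
# Partition name below is a placeholder; use whatever partitions `sinfo` reports.
#SBATCH --partition=general

# Per-job working directory on the node's local scratch disk (fast local I/O)
WORKDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$WORKDIR"

# Stage input from network storage to local scratch (paths are illustrative)
cp -r /data/user_data/$USER/my_input "$WORKDIR/"

# ... run your compute-intensive work against $WORKDIR/my_input here ...

# Copy results back to network storage, then clean up after yourself
cp -r "$WORKDIR/results" /data/user_data/$USER/
rm -rf "$WORKDIR"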