HPC Terminology

This article provides an overview of key terminology and concepts related to High-Performance Computing (HPC). It aims to familiarize users with essential terms and their meanings in the context of HPC. By understanding these terms, users can gain a solid foundation for exploring and engaging in the field of high-performance computing.

Cluster Architecture

The cluster consists of several key components:

  • Operating System: Springdale 8 [Red Hat Enterprise Linux compliant]
  • Kernel: x86_64 Linux 4.18.0-372.32.1.el8_6.x86_64
  • Login Node: The login node is used for logging into the cluster, launching jobs, and connecting to compute nodes.
  • Head Node: The head node (also called the control node) serves as the primary management and coordination point for the cluster and is responsible for several key functions:
    • Ansible Control Node: The control node is the primary Ansible node, responsible for managing and automating tasks across the entire system.
    • SLURM Controller: The control node manages the SLURM installation and configuration, and is responsible for scheduling and managing jobs on the compute nodes (see the job-submission sketch after this list).
    • SLURM Database: The control node may also serve as the primary database node, storing and managing data related to the cluster’s configuration, job scheduling, and system performance.
  • Compute Nodes: The compute nodes provide CPU and GPU resources, local scratch space, and network-mounted storage for running compute-intensive tasks.
  • NAS: The NAS provides network-attached storage for the cluster, allowing users to store and access data from anywhere on the network.
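
A minimal sketch of how a job flows through these components: you submit a batch script from the login node, the SLURM controller on the head node schedules it, and it runs on a compute node. The partition name and resource requests below are placeholders, so check the actual partitions with sinfo before reusing this.

   #!/bin/bash
   #SBATCH --job-name=demo          # name shown in squeue
   #SBATCH --partition=general      # placeholder partition; list real ones with sinfo
   #SBATCH --cpus-per-task=4
   #SBATCH --mem=16G
   #SBATCH --gres=gpu:1             # request one GPU; omit for CPU-only jobs
   #SBATCH --time=01:00:00

   hostname      # prints the compute node the controller selected
   nvidia-smi    # confirms the allocated GPU is visible

Submit the script with sbatch and monitor it with squeue -u $USER.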

Filesystem Layout

  • Each user is provisioned with the following storage quotas (see the usage check after this list):
    • /home/<username>: 100GB
    • /data/user_data/<username>: 500GB
  • NFS-mounted storage via autofs (i.e., it is not local disk on each compute node):
    • /home/: mounted on the login nodes and all compute nodes
    • /data/datasets: available only on the compute nodes
    • /compute/<node_name>: available only on the compute nodes
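
To see how much of those quotas you are using, a quick check from a login or compute node is du against your own directories (this only reports usage; it does not show how or where the limits are enforced):

   # Usage against the 100GB /home and 500GB /data/user_data allocations
   du -sh /home/$USER
   du -sh /data/user_data/$USER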

AutoFS notes:

AutoFS directories are not always mounted, because AutoFS is an "on-demand" filesystem. You may need to stat or ls the full path to the files you are looking for: the output of ls /compute/ can appear empty until you reference a specific node's directory, as in the sketch below.
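
A minimal illustration (babel-0-23 is just an example node; substitute whichever node you need):

   ls /compute/                 # may appear empty: autofs mounts on demand
   stat /compute/babel-0-23     # referencing the full path triggers the mount
   ls /compute/babel-0-23       # that node's /scratch contents are now visible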

Data Directories

Community datasets are placed at /data/datasets/<thing>. If you have a dataset or model that would be useful to everyone, let the administrators know by sending e-mail to help@cs.cmu.edu with the details of the request.

If you or your group requires additional space, have your sponsor make the request by sending e-mail to help@cs.cmu.edu with the details.

Current community datasets are:

  1. /data/datasets/llama/weights/meta_ai

Local Scratch Partition

When you are frequently accessing large files, you should first move them to the /scratch directory of the local machine your job is running on. Read/write is much faster locally than over the network.

The /scratch dir of each node is exported via NFS to the other nodes in the cluster. The local disk for node babel-X-X can be accessed at /compute/babel-X-X from other nodes. This allows faster access and reduces pressure on the NAS, and is highly recommended for large or frequently accessed files.
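
A rough sketch of that workflow inside a running job; only /scratch and /data/user_data come from this page, while the dataset, script, and output names are placeholders:

   # Stage input onto the node's fast local disk, work there, then copy results back.
   mkdir -p /scratch/$USER
   rsync -a /data/user_data/$USER/my_dataset/ /scratch/$USER/my_dataset/            # my_dataset is a placeholder
   python train.py --data /scratch/$USER/my_dataset --out /scratch/$USER/results    # placeholder workload
   rsync -a /scratch/$USER/results/ /data/user_data/$USER/results/
   rm -rf /scratch/$USER/my_dataset /scratch/$USER/results                          # clean up local scratch when done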

Notes:

  • If you ls /compute/, it might seem empty. However, if you add the compute node's hostname to the path, autofs will mount the remote directory and you will be able to access its scratch partition. For example:
   [dvosler@babel-1-27 ~]$ ls -la /compute/
   total 4
   drwxr-xr-x   2 root root    0 Jun 23 15:02 .
   dr-xr-xr-x. 21 root root 4096 Jun 23 15:02 ..
   [root@babel-1-27 ~]# ls -la /compute/babel-0-19
   total 28
   drwxrwxrwt 4 root     root      4096 May 29 10:25 .
   drwxr-xr-x 3 root     root         0 Jun 28 16:17 ..
   drwx------ 2 root     root     16384 May  5 15:02 lost+found
   drwxrwxr-x 3 dvosler  dvosler   4096 May 11 18:41 things
   -rw-rw-r-- 1 dvosler  dvosler      1 May 25 10:41 version.txt
  • Compute-node /scratch space should only hold temporary files, in the sense that you should clean up after yourself; files are not deleted automatically (true as of 06/28/2023).