HPC Terminology


This article provides an overview of key terminology and concepts related to High-Performance Computing (HPC). It aims to familiarize users with essential terms and their meanings in the context of HPC. By understanding these terms, users can gain a solid foundation for exploring and engaging in the field of high-performance computing.

Cluster Architecture

The cluster comprises several key components:

  • Login Node: The login node is used for logging into the cluster, launching jobs, and connecting to compute nodes.
  • Head Node: The head node (also called the control node) serves as the primary management and coordination point for the cluster and is responsible for several key functions:
    • Ansible Control Node: The control node is the primary Ansible node, responsible for managing and automating tasks across the entire system.
    • SLURM Controller: The control node manages the SLURM installation and configuration, and is responsible for scheduling and managing jobs on the compute nodes (a minimal job-submission sketch follows this list).
    • SLURM Database: The control node may also serve as the primary database node, storing and managing data related to the cluster’s configuration, job scheduling, and system performance.
  • Compute Nodes: The compute nodes provide CPU and GPU resources, local scratch space, and network-mounted storage for running compute-intensive tasks.
  • NAS: The NAS provides network-attached storage for the cluster, allowing users to store and access data from anywhere on the network.
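
To see how these pieces fit together, below is a minimal sketch of a batch job: you submit it from a login node, the SLURM controller on the head node queues it, and it runs on whichever compute node satisfies the request. The job name, time limit, and resource values are illustrative placeholders, not cluster defaults.

   #!/bin/bash
   # Minimal SLURM batch script (sketch). Submit from a login node with: sbatch hello.sh
   #SBATCH --job-name=hello
   #SBATCH --output=hello_%j.out   # %j expands to the SLURM job ID
   #SBATCH --time=00:05:00         # placeholder time limit
   #SBATCH --cpus-per-task=1
   #SBATCH --mem=1G

   hostname                        # prints the compute node the job landed on
   echo "Hello from SLURM job $SLURM_JOB_ID"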

Filesystem Layout

MORE INFO NEEDED

  • Each user is provisioned with:
    • /home/<username>
    • /compute/<node_name>


AutoFS notes:

AutoFS directories are not always mounted, because AutoFS is an “on-demand” filesystem. You may need to stat (or otherwise reference) the full path to the files you are looking for. For example, the output of ls /compute/ may sometimes appear empty; however, if you ls /compute/<node_name>, the contents of that node’s /scratch directory will be revealed.
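
A quick sketch of the access pattern (<node_name> is a placeholder for an actual compute node hostname):

   ls /compute/                 # may appear empty at first
   stat /compute/<node_name>    # touching the full path triggers the automount
   ls /compute/<node_name>      # the node's /scratch contents are now visible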

Data Directories

MORE INFO NEEDED


Local Scratch Partition

When you are frequently accessing large files, you should first move them to the /scratch directory of the local machine your job is running on. Read/write is much faster locally than it is over the network.

The /scratch directory of each node is exported via NFS to the other nodes on the cluster. The local disk of node <node_name> can be accessed at /compute/<node_name> from other nodes. This allows for faster access and reduces pressure on the NAS (highly recommended for large or frequently accessed files).
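
Below is a sketch of staging data into local scratch inside a job script. The per-user, per-job subdirectory layout and the /data source path are assumptions for illustration; adapt them to your own data locations.

   #!/bin/bash
   #SBATCH --job-name=stage-scratch
   # Sketch: copy a large input onto the node-local scratch disk once,
   # then do all read/write against the fast local copy.
   SCRATCH_DIR=/scratch/$USER/$SLURM_JOB_ID     # assumed layout, not a cluster convention
   mkdir -p "$SCRATCH_DIR"
   cp -r /data/<your_dataset> "$SCRATCH_DIR/"   # placeholder source path
   cd "$SCRATCH_DIR"
   # ... run the compute-intensive work against the local copy ...
   rm -rf "$SCRATCH_DIR"                        # clean up node-local scratch when done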

Notes:

  • If you ls /compute/, it might seem empty. However, if you add the compute node's hostname to the path, autofs will mount the remote directory and you will be able to access its scratch partition. For example:
   [dvosler@hpc-1-27 ~]# ls -la /compute/
   total 4
   drwxr-xr-x   2 root root    0 Jun 23 15:02 .
   dr-xr-xr-x. 21 root root 4096 Jun 23 15:02 ..
   [root@hpc-1-27 ~]# ls -la /compute/hpc-0-19
   total 28
   drwxrwxrwt 4 root     root      4096 May 29 10:25 .
   drwxr-xr-x 3 root     root         0 Jun 28 16:17 ..
   drwxrwxr-x 3 dvosler  dvosler   4096 May 11 18:41 things
   -rw-rw-r-- 3 dvosler  dvosler    420 May 11 18:41 stuff
  • Compute nodes and their scratch partitions should only hold temporary files: files are not deleted automatically (true as of 06/28/2023), so you are expected to clean up after yourself (see the cleanup sketch after these notes).
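
One way to make that cleanup reliable is a shell trap in the job script, so the per-job scratch directory is removed even if the script exits early (a sketch, assuming the same per-job layout as the staging example above):

   SCRATCH_DIR=/scratch/$USER/$SLURM_JOB_ID     # assumed layout
   mkdir -p "$SCRATCH_DIR"
   trap 'rm -rf "$SCRATCH_DIR"' EXIT            # runs on normal exit or early failure
   # ... job commands using $SCRATCH_DIR ...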

Node Types

Login Nodes

The login nodes for this cluster are set up in a 3-node round-robin DNS configuration. Connecting to babel.lti.cs.cmu.edu will land you on one of them.

Only your home directory is mounted on the login nodes. Other storage, such as the /data/ and /compute network mounts, is only available on the compute nodes.
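
Connecting is ordinary SSH to the round-robin name. A minimal ~/.ssh/config entry saves some typing (sketch; <username> is a placeholder for your cluster username):

   # ~/.ssh/config
   Host babel
       HostName babel.lti.cs.cmu.edu
       User <username>

   # then connect with:
   ssh babel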

Compute Nodes

The Babel cluster includes dedicated GPU nodes that offer accelerated computing capabilities. These GPU nodes are equipped with high-performance GPUs, providing significant computational power for GPU-intensive workloads.

GPU Types

Babel's GPU nodes feature a range of GPU types, ensuring compatibility with diverse computational needs. The specific GPU models available on the cluster may vary over time as the cluster hardware is upgraded.
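
Because the available GPU models and partition settings change as hardware is upgraded, check the current configuration (for example with sinfo or scontrol show node) rather than hard-coding assumptions. As a sketch, a generic GPU request through SLURM looks like this; the gres count and time limit are placeholders:

   # Interactive shell with one GPU (placeholder values)
   srun --gres=gpu:1 --time=01:00:00 --pty bash

   # Equivalent request inside a batch script
   #SBATCH --gres=gpu:1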

RAM

The GPU nodes in the Babel cluster are equipped with generous amounts of RAM to support memory-intensive applications.

CPU

In addition to GPUs, the GPU nodes in the Babel cluster also incorporate powerful CPUs to complement the computational capabilities. The CPUs provide the necessary processing power to handle non-GPU tasks and assist in managing parallel computations effectively.
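
CPU cores and host RAM are requested per job through SLURM, separately from GPU memory. A sketch with illustrative values (not node limits):

   #SBATCH --cpus-per-task=4    # CPU cores, e.g. for data loading and preprocessing
   #SBATCH --mem=32G            # host RAM; distinct from the GPU's on-board memory
   #SBATCH --gres=gpu:1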