<span id="hpc-terminology"></span>
= HPC Terminology =

This article provides an overview of key terminology and concepts related to High-Performance Computing (HPC). It aims to familiarize users with essential terms and their meanings in the context of HPC. By understanding these terms, users can gain a solid foundation for exploring and engaging in the field of high-performance computing.

<span id="cluster-architecture"></span>
= Cluster Architecture =
The cluster comprises several key components:


* <code>Operating System</code>: Springdale 8 [Red Hat Enterprise Linux compliant]
* <code>Kernel</code>: x86_64 Linux 4.18.0-372.32.1.el8_6.x86_64
* <code>Login Node</code>: The login node is used for logging into the cluster, launching jobs, and connecting to compute nodes.
* <code>Head Node</code>: The head node (also called the control node) serves as the primary management and coordination point for the cluster and is responsible for several key functions:
** <code>Ansible Control Node</code>: The control node is the primary Ansible node, responsible for managing and automating tasks across the entire system.
** <code>SLURM Controller</code>: The control node manages the SLURM installation and configuration, and is responsible for scheduling and managing jobs on the compute nodes.
** <code>SLURM Database</code>: The control node may also serve as the primary database node, storing and managing data related to the cluster's configuration, job scheduling, and system performance.
* <code>Compute Nodes</code>: The compute nodes provide CPU and GPU resources, local scratch space, and network-mounted storage for running compute-intensive tasks.
* <code>NAS</code>: The NAS provides network-attached storage for the cluster, allowing users to store and access data from anywhere on the network.
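
Because the head node runs the SLURM controller, you can inspect the cluster's state from a login node with the standard SLURM client commands. A minimal sketch (assuming the SLURM client tools are available in your environment):

 # List partitions and the state of their nodes, as reported by the SLURM controller
 sinfo
 # Show the current job queue
 squeue
 # Show only your own pending and running jobs
 squeue -u $USER
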
<span id="filesystem-layout"></span>
<span id="filesystem-layout"></span>
= Filesystem Layout =
= Filesystem Layout =
'''MORE INFO NEEDED'''


* Each user is provisioned with:
** <code>/home/&lt;username&gt;</code>: 100GB
** <code>/data/user_data/&lt;username&gt;</code>: 500GB
** <code>/compute/&lt;node_name&gt;</code>
* NFS-mounted storage via autofs (i.e. it is not local disk on each compute node):
** <code>/home/</code>: mounted on the login nodes and all compute nodes
** <code>/data/datasets</code>: available only on the compute nodes
** <code>/compute/&lt;node_name&gt;</code>: available only on the compute nodes
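
To check how much of your provisioned space you are using, standard tools such as <code>du</code> and <code>df</code> work on these NFS mounts. A minimal sketch (the <code>/data/user_data</code> path applies only if it has been provisioned for your account):

 # Summarize the size of your home directory
 du -sh /home/$USER
 # Summarize the size of your user data directory, if provisioned
 du -sh /data/user_data/$USER
 # Show the size and free space of the filesystem backing your home directory
 df -h /home/$USER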


<span id="autofs-notes"></span>
<span id="autofs-notes"></span>
=== AutoFS notes: ===
=== AutoFS notes: ===


AutoFS directories are not always mounted, because AutoFS is an “on-demand” filesystem. You may need to stat the full path to the files you are looking for. For example, the output of <code>ls /compute/</code> might seem empty; however, if you run <code>ls /compute/&lt;node_name&gt;</code>, the contents of that node's <code>/scratch</code> directory will be revealed.
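
As a quick illustration, explicitly referencing a node's full path is enough to trigger the automount (<code>babel-0-23</code> is just an example node name):

 # The top-level autofs directory may appear empty at first
 ls /compute/
 # Referencing the full path triggers the automount...
 stat /compute/babel-0-23
 # ...after which that node's scratch contents are visible
 ls /compute/babel-0-23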


<span id="data-directories"></span>
<span id="data-directories"></span>
== Data Directories ==
== Data Directories ==


Community datasets are placed at <code>/data/datasets/&lt;thing&gt;</code>. If you have a dataset or model that would be useful to everyone, let the administrators know by sending e-mail to [mailto:help@cs.cmu.edu help@cs.cmu.edu] with the details of the request.

If you or your group requires additional space, have your sponsor make the request by sending e-mail to [mailto:help@cs.cmu.edu help@cs.cmu.edu] with the details of the request.

Current community datasets are:

# <code>/data/dataset/llama/weights/meta_ai</code>
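
Since <code>/data/datasets</code> is only mounted on the compute nodes, list it from within a job rather than from a login node. A minimal sketch (depending on the SLURM configuration you may also need to request a partition or other resources):

 # Start an interactive shell on a compute node
 srun --pty bash
 # Then list the available community datasets
 ls /data/datasets/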


<span id="local-scratch-partition"></span>
<span id="local-scratch-partition"></span>
Line 52: Line 42:
When you are frequently accessing large files, you should first move them to the <code>/scratch</code> directory of the local machine your job is running on. Read/write is much faster locally than over the network.


The <code>/scratch</code> dir of each node is exported via NFS to the other nodes of the cluster. The local disk of node <code>&lt;node_name&gt;</code> can be accessed at <code>/compute/&lt;node_name&gt;</code> from other nodes. This allows for faster access and reduces pressure on the NAS. ''(highly recommended for large or frequently accessed files)''
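
For example, a batch job can stage its inputs to local scratch before computing and clean up when it finishes. A minimal sketch (the <code>/scratch/$USER/$SLURM_JOB_ID</code> layout and the data paths are placeholders, not confirmed cluster conventions):

 #!/bin/bash
 #SBATCH --job-name=stage-example
 # Create a private working area on this node's local scratch (placeholder layout)
 WORKDIR=/scratch/$USER/$SLURM_JOB_ID
 mkdir -p "$WORKDIR"
 # Stage frequently accessed input data from network storage to the local disk
 rsync -a /data/user_data/$USER/my_dataset/ "$WORKDIR/my_dataset/"
 # ... run your computation against the local copy, e.g.:
 # python train.py --data "$WORKDIR/my_dataset"
 # Clean up after yourself; scratch files are not deleted automatically
 rm -rf "$WORKDIR"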


'''Notes''':
* If you <code>ls /compute/</code>, it might seem empty. However, if you add the compute node's hostname to the path, autofs will mount the remote directory and you will be able to access its <code>/scratch</code> partition. For example:


     [dvosler@hpc-1-27 ~]# ls -la /compute/
     total 4
     drwxr-xr-x  2 root root    0 Jun 23 15:02 .
     dr-xr-xr-x. 21 root root 4096 Jun 23 15:02 ..

     [root@hpc-1-27 ~]# ls -la /compute/hpc-0-19
     total 28
     drwxrwxrwt 4 root    root      4096 May 29 10:25 .
     drwxr-xr-x 3 root    root        0 Jun 28 16:17 ..
     drwxrwxr-x 3 dvosler  dvosler  4096 May 11 18:41 things
     -rw-rw-r-- 3 dvosler  dvosler   420 May 11 18:41 stuff


* Compute nodes and scratch should hold only temporary files, in the sense that you should clean up after yourself; files are not deleted automatically (true as of 06/28/2023).
<span id="node-types"></span>
== Node Types ==
=== Login Nodes ===
The login nodes for this cluster are set up in a three-node round-robin DNS configuration. Connecting to <code>babel.lti.cs.cmu.edu</code> will land you on one of them.
Only your home directory is mounted on the login nodes. Other storage, such as the <code>/data/</code> and <code>/compute</code> network mounts, is only available on the compute nodes.
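
For example, to connect and see which of the three login nodes DNS handed you (<code>andrew_id</code> is a placeholder for your username):

 # Round-robin DNS picks one of the three login nodes; replace andrew_id with your username
 ssh andrew_id@babel.lti.cs.cmu.edu
 # Once logged in, check which login node you landed on
 hostname
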
=== Compute Nodes ===
The Babel cluster includes dedicated GPU nodes that offer accelerated computing capabilities. These GPU nodes are equipped with high-performance GPUs, providing significant computational power for GPU-intensive workloads.
==== GPU Types ====
Babel's GPU nodes feature a range of GPU types, ensuring compatibility with diverse computational needs. The specific GPU models available on the cluster may vary over time as the cluster hardware is upgraded.
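
Rather than relying on a static list, you can query the GPU types that SLURM currently advertises. A small sketch using a standard SLURM command:

 # Show the generic resources (GRES), typically GPU types and counts, configured on each node
 sinfo -o "%N %G"
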
==== RAM ====
The GPU nodes in the Babel cluster are equipped with generous amounts of RAM to support memory-intensive applications.
==== CPU ====
In addition to GPUs, the GPU nodes in the Babel cluster incorporate powerful CPUs to complement the GPU resources. The CPUs provide the processing power needed for non-GPU tasks and help manage parallel computations effectively.
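
Putting the GPU, RAM, and CPU pieces together, a job normally requests each resource explicitly through SLURM. A minimal sketch (the specific amounts are placeholders, not Babel-specific limits):

 #!/bin/bash
 #SBATCH --job-name=gpu-example
 #SBATCH --gres=gpu:1        # one GPU; a specific type can also be requested if you know its GRES name
 #SBATCH --mem=64G           # RAM for the job (placeholder amount)
 #SBATCH --cpus-per-task=8   # CPU cores for data loading and other non-GPU work (placeholder amount)
 # Confirm which GPU was allocated
 nvidia-smi
 # ... run your GPU workload here ...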
