FLAME

From CMU -- Language Technologies Institute -- HPC Wiki
Revision as of 09:20, 7 July 2024 by 174.181.5.47 (talk)

Multi-node Training

Launch Script

Here's an example of a launch script that starts a multi-node training job on three H100 nodes using DeepSpeed:

#!/bin/bash
#SBATCH --job-name=flame
#SBATCH --output=flame.out
#SBATCH --error=flame.err
#SBATCH --partition=flame
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=2000G
#SBATCH --gres=gpu:H100:8
#SBATCH --time=2-00:00:00
# Notes on the torch.distributed.run flags:
#   --nproc_per_node  number of GPUs per node
#   --nnodes          number of nodes
#   --rdzv_endpoint   master node and port
# Optional Weights & Biases logging: prefix the python command inside the
# quotes with WANDB__SERVICE_WAIT=500 WANDB_PROJECT=<wandb project>
#   WANDB_ENTITY=<wandb entity> WANDB_NAME=<wandb name>
# Everything after --node_rank is the training command itself; you may have
# your own command. The simple example below uses LLaMA-Factory.
srun --jobid $SLURM_JOB_ID bash -c 'source ~/.bashrc && conda activate <your conda env> && \
python -u -m torch.distributed.run \
    --nproc_per_node 8 \
    --nnodes 3 \
    --rdzv_endpoint flame-10-7:6000 \
    --rdzv_backend c10d \
    --max_restarts 0 \
    --tee 3 \
    --node_rank $SLURM_PROCID \
    LLaMA-Factory/src/train_bash.py \
    --deepspeed ds_config_3.json \
    --stage sft \
    --model_name_or_path <model path> \
    --do_train \
    --dataset <dataset> \
    --train_size <train size> \
    --dataset_dir <dataset directory> \
    --template llama2 \
    --finetuning_type full \
    --output_dir <output directory> \
    --per_device_train_batch_size <per device train batch size> \
    --per_device_eval_batch_size <per device eval batch size> \
    --gradient_accumulation_steps <gradient accumulation steps> \
    --gradient_checkpointing True \
    --lr_scheduler_type cosine \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 500 \
    --logging_steps 30 \
    --save_total_limit <save total limit> \
    --preprocessing_num_workers 16 \
    --learning_rate <lr> \
    --weight_decay <weight decay> \
    --warmup_ratio <warmup ratio> \
    --num_train_epochs <epochs> \
    --plot_loss \
    --bf16 True \
    --cutoff_len <cutoff length> \
    --flash_attn True \
    --report_to "wandb" \
    --seed <seed>' \
    > output.txt 2>&1
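If you'd rather not hard-code the master node (flame-10-7:6000 above), the rendezvous endpoint can be derived from the allocation at launch time. A minimal sketch, assuming `scontrol show hostnames` is available (standard SLURM); the `hostname` fallback and the port 6000 are only illustrative:

```shell
# Derive the rendezvous endpoint from the first node in the allocation.
# Fallback to the local hostname so the snippet also runs outside a SLURM job.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
else
    head_node=$(hostname)
fi
rdzv_endpoint="${head_node}:6000"   # port is arbitrary; any free port works
echo "rdzv_endpoint=${rdzv_endpoint}"
```

You would then pass `--rdzv_endpoint "$rdzv_endpoint"` instead of the literal node name, so the script survives being scheduled onto different nodes.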

Hostfile

Create a file named "hostfile" with the following content (one line per node; slots is the number of GPUs per node):

flame-10-7 slots=8
flame-10-15 slots=8
flame-10-23 slots=8
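The hostfile can also be generated from the allocation instead of typed by hand; slots=8 matches --gres=gpu:H100:8 in the launch script. A sketch, assuming `scontrol` is available (the hard-coded node names are just a fallback for running outside SLURM):

```shell
# Generate a DeepSpeed hostfile from the SLURM node list.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
else
    # Fallback: the example nodes from this page, for testing outside a job.
    nodes=$(printf 'flame-10-7\nflame-10-15\nflame-10-23\n')
fi
: > hostfile                       # truncate any existing hostfile
for n in $nodes; do
    echo "$n slots=8" >> hostfile  # slots = GPUs per node
done
cat hostfile
```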

Potential Errors and Fixes

  • To debug, first set NCCL logging to INFO: export NCCL_DEBUG=INFO
  • ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory
    • This error occurs when the max locked memory limit is too low. You can check your limit by running ulimit -a. On Babel, max locked memory is 64 on the login node and unlimited on compute nodes, so run the job from a compute node to avoid this problem.
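A quick preflight check along these lines can catch the limit before NCCL fails mid-launch (a sketch; the warning text is our own, not a SLURM or NCCL message):

```shell
# ibv_create_cq needs lockable memory, so warn unless the max locked
# memory limit is unlimited (as it is on Babel compute nodes).
locked=$(ulimit -l)
echo "max locked memory: ${locked}"
if [ "$locked" != "unlimited" ]; then
    echo "WARNING: max locked memory is '${locked}'; run from a compute node" >&2
fi
```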