= Submitting Jobs =

== Submitting Jobs with sbatch ==

To submit a batch job using <code>sbatch</code>, create a shell script (e.g., <code>job_script.sh</code>) containing the commands and resource requests for your job, then submit it with:

 $ sbatch job_script.sh

Slurm assigns your job a unique job ID and enqueues it for execution. You can monitor its status with commands such as <code>squeue</code> or <code>sacct</code>.

== Submitting Jobs with srun ==

For interactive or non-batch jobs, use the <code>srun</code> command, which executes commands directly on compute nodes. Here's an example:

 $ srun -n 4 ./my_program

The <code>-n</code> option specifies the number of tasks to run. The example above runs the <code>my_program</code> executable as four tasks.

== Basic sbatch Job Submission ==

An <code>sbatch</code> script typically combines directives that request resources with the commands that run your application. Here's a minimal example:

<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=preemptable_partition
#SBATCH --time=2:00:00          # Max job run time
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# Load any required modules
module load your_application_module

# Run the application
srun ./your_application
</syntaxhighlight>

Save this as <code>job_script.sh</code> and submit it with:

 $ sbatch job_script.sh

== Handling Preemption with Checkpointing ==

In clusters where preemption is set to <code>requeue</code>, jobs may be interrupted, requeued, and restarted from scratch. Checkpointing lets a job save its state at intervals so it can resume from the last checkpoint instead of starting over.

=== Implement Checkpointing in Your Application ===

Your application must support saving its state periodically (checkpoints). This often involves responding to a signal (such as <code>SIGUSR1</code>) by writing a checkpoint.

=== Set Up Checkpointing in Slurm ===

Specify the <code>--requeue</code> option so that the job is requeued after preemption rather than cancelled:

<syntaxhighlight lang="bash">
#SBATCH --requeue
</syntaxhighlight>

=== Use a Checkpoint Signal ===

Ask Slurm to send a checkpoint signal (e.g., <code>SIGUSR1</code>) shortly before the job ends:

<syntaxhighlight lang="bash">
#SBATCH --signal=B:USR1@60
</syntaxhighlight>

This directive delivers <code>SIGUSR1</code> to the batch shell (the <code>B:</code> prefix) 60 seconds before Slurm ends the job. Adjust the lead time so a checkpoint has enough room to complete.
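Before wiring checkpointing into a real application, it can be worth confirming that the signal actually arrives when expected. The following is a minimal, self-contained sketch (the partition name and workload are placeholders) that traps <code>USR1</code> and logs its arrival:

<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --job-name=signal_test
#SBATCH --partition=preemptable_partition   # placeholder partition name
#SBATCH --time=0:05:00
#SBATCH --output=signal_test-%j.log
#SBATCH --signal=B:USR1@60                  # ask for USR1 60s before the end time

# Record when (and whether) the signal arrives
trap 'echo "USR1 received at $(date)"' USR1

echo "Started at $(date)"

# Run the placeholder workload in the background and use 'wait':
# bash only runs a trap for a foreground child after that child exits,
# whereas 'wait' is interrupted as soon as the signal arrives.
sleep 600 &
wait

echo "Exiting at $(date)"
</syntaxhighlight>

Submit it with <code>sbatch</code>; if the directive is honored, the log should contain the ''USR1 received'' line roughly a minute before the time limit.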
=== Handle Checkpoint Signal in the Script ===

Set up a trap in your <code>sbatch</code> script so that a checkpoint function runs when <code>SIGUSR1</code> is received. Define the function before installing the trap:

<syntaxhighlight lang="bash">
checkpoint() {
    # Replace this with your application's checkpoint command
    ./your_application --checkpoint
}

trap 'checkpoint' USR1
</syntaxhighlight>

=== Resume from Checkpoint in the Job Script ===

When the job is requeued, have the script restart from the last saved checkpoint if one exists:

<syntaxhighlight lang="bash">
if [ -f checkpoint_file ]; then
    srun ./your_application --resume checkpoint_file
else
    srun ./your_application
fi
</syntaxhighlight>

== Final Example Script ==

Here is a complete example of an <code>sbatch</code> script that uses checkpointing:

<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --job-name=my_checkpointed_job
#SBATCH --partition=preemptable_partition
#SBATCH --time=2:00:00
#SBATCH --ntasks=1
#SBATCH --gpus=2
#SBATCH --output=/home/dvosler/logs/preemption_test-%j.log
#SBATCH --error=/home/dvosler/logs/error-%j.out
#SBATCH --mail-type=END
#SBATCH --mail-user=dvosler@cs.cmu.edu
#SBATCH --requeue                # Allow the job to be requeued after preemption
#SBATCH --signal=B:USR1@60       # Send USR1 to the batch shell 60s before the end time

# Define a checkpoint file with the job ID in the logs directory
CHECKPOINT_FILE="/home/dvosler/logs/checkpoint-${SLURM_JOB_ID}.txt"

# Load the checkpoint if it exists
if [[ -f $CHECKPOINT_FILE ]]; then
    i=$(cat "$CHECKPOINT_FILE")
    echo "Resuming from iteration $i"
else
    i=1
    echo "Starting fresh"
fi

hostname; date; nvidia-smi -L

# Simulate work with checkpointing
for (( ; i<=1000; i++ )); do
    echo "Iteration $i"
    sleep 5                      # Simulate work
    echo $i > "$CHECKPOINT_FILE" # Save progress after every iteration
done

echo "Job completed at $(date)"
rm -f "$CHECKPOINT_FILE"         # Clean up checkpoint file on completion
</syntaxhighlight>

This setup lets Slurm preempt and requeue the job without losing progress: the loop records its progress after every iteration, so a requeued job resumes from the last recorded iteration instead of starting over. (Because it checkpoints continuously, this example does not strictly need a <code>USR1</code> trap; the <code>--signal</code> directive shows where a trap-based checkpoint would hook in.) Adjust the checkpoint frequency and requeue timing based on your application's requirements.
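To check that requeueing behaves as expected, you can watch the job's state transitions with the monitoring commands mentioned above. A short sketch (the job ID is a placeholder):

<syntaxhighlight lang="bash">
# Show your queued and running jobs: job ID, partition, name,
# state, elapsed time, and the reason/node list
squeue -u $USER -o "%.10i %.12P %.20j %.4t %.10M %R"

# After the run, review the job's accounting history
sacct -j 123456 --format=JobID,JobName,State,Elapsed,ExitCode
</syntaxhighlight>

A preempted-and-requeued job keeps the same job ID across restarts; depending on the cluster's accounting configuration, earlier attempts may only be visible with <code>sacct --duplicates</code>.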