= Submitting Jobs =

== Submitting Jobs with sbatch ==

To submit a batch job using <code>sbatch</code>, create a shell script (e.g., <code>job_script.sh</code>) containing the commands and resource requests for your job, then submit it with:

 $ sbatch job_script.sh

Slurm assigns your job a unique job ID and enqueues it for execution. You can monitor its status with commands such as <code>squeue</code> or <code>sacct</code>.

== Submitting Jobs with srun ==

For interactive or non-batch jobs, use the <code>srun</code> command, which executes commands directly on compute nodes. Here's an example:

 $ srun -n 4 ./my_program

The <code>-n</code> option specifies the number of tasks to run. The example above runs the <code>my_program</code> executable as four tasks.

== Basic sbatch Job Submission ==

An <code>sbatch</code> script typically combines directives that request resources with the commands that run your application. Here's a minimal example:

<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=preemptable_partition
#SBATCH --time=2:00:00          # Max job run time
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# Load any required modules
module load your_application_module

# Run the application
srun ./your_application
</syntaxhighlight>

Save this as <code>job_script.sh</code> and submit it with:

 $ sbatch job_script.sh

== Handling Preemption with Checkpointing ==

In clusters where preemption is set to <code>requeue</code>, jobs may be interrupted, requeued, and restarted from scratch. Checkpointing lets a job save its state at intervals so it can resume from the last checkpoint instead of starting over.

=== Implement Checkpointing in Your Application ===

Your application must support saving its state periodically (checkpoints). This often involves responding to a signal (such as <code>SIGUSR1</code>) by writing a checkpoint.

=== Set Up Checkpointing in Slurm ===

Specify the <code>--requeue</code> option so that the job is requeued after preemption rather than cancelled:

<syntaxhighlight lang="bash">
#SBATCH --requeue
</syntaxhighlight>

=== Use a Checkpoint Signal ===

Ask Slurm to send a checkpoint signal (e.g., <code>SIGUSR1</code>) shortly before the job ends:

<syntaxhighlight lang="bash">
#SBATCH --signal=B:USR1@60
</syntaxhighlight>

This directive delivers <code>SIGUSR1</code> to the batch shell (the <code>B:</code> prefix) 60 seconds before Slurm ends the job. Adjust the lead time so a checkpoint has enough room to complete.
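Before wiring checkpointing into a real application, it can be worth confirming that the signal actually arrives when expected. The following is a minimal, self-contained sketch (the partition name and workload are placeholders) that traps <code>USR1</code> and logs its arrival:

<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --job-name=signal_test
#SBATCH --partition=preemptable_partition   # placeholder partition name
#SBATCH --time=0:05:00
#SBATCH --output=signal_test-%j.log
#SBATCH --signal=B:USR1@60                  # ask for USR1 60s before the end time

# Record when (and whether) the signal arrives
trap 'echo "USR1 received at $(date)"' USR1

echo "Started at $(date)"

# Run the placeholder workload in the background and use 'wait':
# bash only runs a trap for a foreground child after that child exits,
# whereas 'wait' is interrupted as soon as the signal arrives.
sleep 600 &
wait

echo "Exiting at $(date)"
</syntaxhighlight>

Submit it with <code>sbatch</code>; if the directive is honored, the log should contain the ''USR1 received'' line roughly a minute before the time limit.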
=== Handle Checkpoint Signal in the Script ===

Set up a trap in your <code>sbatch</code> script so that a checkpoint function runs when <code>SIGUSR1</code> is received. Define the function before installing the trap:

<syntaxhighlight lang="bash">
checkpoint() {
    # Replace this with your application's checkpoint command
    ./your_application --checkpoint
}

trap 'checkpoint' USR1
</syntaxhighlight>

=== Resume from Checkpoint in the Job Script ===

When the job is requeued, have the script restart from the last saved checkpoint if one exists:

<syntaxhighlight lang="bash">
if [ -f checkpoint_file ]; then
    srun ./your_application --resume checkpoint_file
else
    srun ./your_application
fi
</syntaxhighlight>

== Final Example Script ==

Here is a complete example of an <code>sbatch</code> script that uses checkpointing:

<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --job-name=my_checkpointed_job
#SBATCH --partition=preemptable_partition
#SBATCH --time=2:00:00
#SBATCH --ntasks=1
#SBATCH --gpus=2
#SBATCH --output=/home/dvosler/logs/preemption_test-%j.log
#SBATCH --error=/home/dvosler/logs/error-%j.out
#SBATCH --mail-type=END
#SBATCH --mail-user=dvosler@cs.cmu.edu
#SBATCH --requeue                # Allow the job to be requeued after preemption
#SBATCH --signal=B:USR1@60       # Send USR1 to the batch shell 60s before the end time

# Define a checkpoint file with the job ID in the logs directory
CHECKPOINT_FILE="/home/dvosler/logs/checkpoint-${SLURM_JOB_ID}.txt"

# Load the checkpoint if it exists
if [[ -f $CHECKPOINT_FILE ]]; then
    i=$(cat "$CHECKPOINT_FILE")
    echo "Resuming from iteration $i"
else
    i=1
    echo "Starting fresh"
fi

hostname; date; nvidia-smi -L

# Simulate work with checkpointing
for (( ; i<=1000; i++ )); do
    echo "Iteration $i"
    sleep 5                      # Simulate work
    echo $i > "$CHECKPOINT_FILE" # Save progress after every iteration
done

echo "Job completed at $(date)"
rm -f "$CHECKPOINT_FILE"         # Clean up checkpoint file on completion
</syntaxhighlight>

This setup lets Slurm preempt and requeue the job without losing progress: the loop records its progress after every iteration, so a requeued job resumes from the last recorded iteration instead of starting over. (Because it checkpoints continuously, this example does not strictly need a <code>USR1</code> trap; the <code>--signal</code> directive shows where a trap-based checkpoint would hook in.) Adjust the checkpoint frequency and requeue timing based on your application's requirements.
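To check that requeueing behaves as expected, you can watch the job's state transitions with the monitoring commands mentioned above. A short sketch (the job ID is a placeholder):

<syntaxhighlight lang="bash">
# Show your queued and running jobs: job ID, partition, name,
# state, elapsed time, and the reason/node list
squeue -u $USER -o "%.10i %.12P %.20j %.4t %.10M %R"

# After the run, review the job's accounting history
sacct -j 123456 --format=JobID,JobName,State,Elapsed,ExitCode
</syntaxhighlight>

A preempted-and-requeued job keeps the same job ID across restarts; depending on the cluster's accounting configuration, earlier attempts may only be visible with <code>sacct --duplicates</code>.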