Slurm Job Efficiency
Slurm Job Efficiency Script
The Slurm Job Efficiency Script is a Python script for analyzing resource utilization of completed Slurm jobs. It uses Slurm's sacct and seff commands to retrieve job data, providing insights into CPU and memory usage, along with optimization recommendations. The script supports filtering by users, job IDs, partitions, and time ranges, and generates detailed or summary reports.
Overview
This script assists in monitoring and optimizing resource allocation in Slurm-based HPC clusters. It calculates CPU and memory efficiency, identifies under- or over-utilized resources, and outputs a tabular report with a summary of average efficiencies and actionable insights.
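As a rough illustration of the CPU efficiency figure mentioned above, the sketch below uses the conventional definition (CPU time consumed divided by elapsed wall time multiplied by allocated cores). The function names are hypothetical and the script's own implementation may differ; the input strings are assumed to follow the duration format that sacct prints for fields such as Elapsed and TotalCPU.

```python
def parse_slurm_duration(text: str) -> float:
    """Convert a Slurm duration such as '1-02:03:04' or '05:30.250' to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [float(p) for p in text.split(":")]
    # Pad on the left so that MM:SS is treated as 0:MM:SS.
    while len(parts) < 3:
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(total_cpu: str, elapsed: str, alloc_cpus: int) -> float:
    """Return CPU efficiency in percent: CPU time used / (wall time * cores)."""
    core_seconds = parse_slurm_duration(elapsed) * alloc_cpus
    if core_seconds == 0:
        return 0.0
    return 100.0 * parse_slurm_duration(total_cpu) / core_seconds


# A 4-core job that ran for 1 hour and consumed 3h49m12s of CPU time.
print(f"{cpu_efficiency('03:49:12', '01:00:00', 4):.2f}%")  # prints 95.50%
```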
Example Usage
Below are examples of common use cases, assuming the script is saved as /opt/cluster_tools/babel_contrib/slurm_job_efficiency.py and run in a Slurm environment.
Analyze Jobs for a Specific User
To analyze completed jobs for user alice with the default time range (last 1 day):
python3 /opt/cluster_tools/babel_contrib/slurm_job_efficiency.py alice
To analyze jobs for user alice over the last 7 days or starting from April 1, 2025:
python3 /opt/cluster_tools/babel_contrib/slurm_job_efficiency.py alice -t 7d
python3 /opt/cluster_tools/babel_contrib/slurm_job_efficiency.py alice -t 2025-04-01
Output (for -t 7d):
Slurm Job Efficiency: User: alice
╒══════════╤═════════════════╤════════════╤════════════╤════════════════════════════════════════════════╕
│ JobID    │ JobName         │ CPU Eff    │ Mem Eff    │ Insights                                       │
╞══════════╪═════════════════╪════════════╪════════════╪════════════════════════════════════════════════╡
│ 12345    │ simulation      │ 95.50%     │ 80.20%     │ -                                              │
╞══════════╪═════════════════╪════════════╪════════════╪════════════════════════════════════════════════╡
│ 12346    │ data_proc       │ 45.30%     │ 20.10%     │ Underutilized CPU; Reduce CPU cores            │
│          │                 │            │            │ Underutilized memory; Reduce memory allocation │
╘══════════╧═════════════════╧════════════╧════════════╧════════════════════════════════════════════════╛
--------------------------------------
Total Jobs: 2
Avg CPU Eff: 70.40%, Avg Mem Eff: 50.15%
Summary:
- Underutilized CPU; Reduce requested resources or optimize job
- Underutilized memory; Reduce memory allocation
--------------------------------------
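The Insights column could be produced by simple threshold checks on the two efficiency values. The sketch below is only an illustration: the thresholds (80% for CPU, 70% and 90% for memory) are assumptions chosen to be consistent with the sample outputs on this page, not values taken from the script, and the function name is hypothetical.

```python
from typing import List

# Hypothetical thresholds in percent; the script's real cut-offs are not documented here.
CPU_LOW = 80.0
MEM_LOW = 70.0
MEM_HIGH = 90.0


def job_insights(cpu_eff: float, mem_eff: float) -> List[str]:
    """Return insight messages for one job; an empty list means nothing to report."""
    insights = []
    if cpu_eff < CPU_LOW:
        insights.append("Underutilized CPU; Reduce CPU cores")
    if mem_eff < MEM_LOW:
        insights.append("Underutilized memory; Reduce memory allocation")
    elif mem_eff >= MEM_HIGH:
        insights.append("High memory usage; Increase memory to avoid swapping")
    return insights


# Jobs 12345 and 12346 from the sample report.
for cpu, mem in [(95.50, 80.20), (45.30, 20.10)]:
    print("\n".join(job_insights(cpu, mem)) or "-")
```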
Show Detailed Resource Usage
To analyze jobs for user alice with detailed resource columns:
python3 /opt/cluster_tools/babel_contrib/slurm_job_efficiency.py alice -r
Output:
Slurm Job Efficiency: User: alice
╒══════════╤═════════════════╤════════════╤════════════╤═════════════╤══════════╤══════════╤══════════════════════════════════════════════════════╕
│ JobID    │ JobName         │ CPU Eff    │ Mem Eff    │ Req Cores   │ Req Mem  │ Max Mem  │ Insights                                             │
╞══════════╪═════════════════╪════════════╪════════════╪═════════════╪══════════╪══════════╪══════════════════════════════════════════════════════╡
│ 12348    │ compute_task    │ 85.00%     │ 95.00%     │ 4           │ 16GB     │ 15.20 GB │ High memory usage; Increase memory to avoid swapping │
╘══════════╧═════════════════╧════════════╧════════════╧═════════════╧══════════╧══════════╧══════════════════════════════════════════════════════╛
--------------------------------------
Total Jobs: 1
Avg CPU Eff: 85.00%, Avg Mem Eff: 95.00%
Summary:
- High memory usage; Increase memory to avoid swapping
--------------------------------------
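The Mem Eff value in this report is consistent with dividing Max Mem by Req Mem (15.20 GB / 16 GB = 95%). The sketch below shows one way to do that from the strings as they appear in the table. The helper names are hypothetical, and treating units as binary multiples (1 GB = 1024 MB) is an assumption.

```python
import re

# Conversion factors to gigabytes, assuming binary multiples.
_UNITS_TO_GB = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def mem_to_gb(text: str) -> float:
    """Convert strings like '16GB', '15.20 GB', or '4000M' to gigabytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([KMGT])B?\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized memory value: {text!r}")
    return float(match.group(1)) * _UNITS_TO_GB[match.group(2).upper()]


def mem_efficiency(max_mem: str, req_mem: str) -> float:
    """Return peak memory use as a percentage of the requested memory."""
    requested = mem_to_gb(req_mem)
    if requested == 0:
        return 0.0
    return 100.0 * mem_to_gb(max_mem) / requested


print(f"{mem_efficiency('15.20 GB', '16GB'):.2f}%")  # prints 95.00%
```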
Analyze Specific Job IDs
To analyze job IDs 12345 and 12346:
python3 /opt/cluster_tools/babel_contrib/slurm_job_efficiency.py -j 12345,12346
Output:
Slurm Job Efficiency: User: Any
╒══════════╤═════════════════╤════════════╤════════════╤════════════════════════════════════════════════╕
│ JobID    │ JobName         │ CPU Eff    │ Mem Eff    │ Insights                                       │
╞══════════╪═════════════════╪════════════╪════════════╪════════════════════════════════════════════════╡
│ 12345    │ simulation      │ 95.50%     │ 80.20%     │ -                                              │
╞══════════╪═════════════════╪════════════╪════════════╪════════════════════════════════════════════════╡
│ 12346    │ data_proc       │ 45.30%     │ 20.10%     │ Underutilized CPU; Reduce CPU cores            │
│          │                 │            │            │ Underutilized memory; Reduce memory allocation │
╘══════════╧═════════════════╧════════════╧════════════╧════════════════════════════════════════════════╛
--------------------------------------
Total Jobs: 2
Avg CPU Eff: 70.40%, Avg Mem Eff: 50.15%
Summary:
- Underutilized CPU; Reduce requested resources or optimize job
- Underutilized memory; Reduce memory allocation
--------------------------------------
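A comma-separated value such as the one passed to -j has to be split and checked before it can be handed to Slurm. The following is a minimal sketch of that step, assuming only plain numeric job IDs are accepted (array and step IDs are left out for brevity). The function name and the warning text are hypothetical.

```python
import sys
from typing import List


def parse_job_ids(arg: str) -> List[str]:
    """Split a comma-separated job ID argument, skipping invalid entries."""
    valid = []
    for token in arg.split(","):
        token = token.strip()
        if not token:
            continue  # ignore empty entries such as a trailing comma
        if token.isdigit():
            valid.append(token)
        else:
            # Report the problem on stderr and carry on with the remaining IDs.
            print(f"Warning: skipping invalid job ID {token!r}", file=sys.stderr)
    return valid


print(parse_job_ids("12345,12346"))  # prints ['12345', '12346']
```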
Analyze Jobs for a Partition
To analyze jobs for user alice on the general partition:
python3 /opt/cluster_tools/babel_contrib/slurm_job_efficiency.py alice -p general
Output:
Slurm Job Efficiency: User: alice (general)
╒══════════╤═════════════════╤════════════╤════════════╤══════════════════════════════════════════════════════╕
│ JobID    │ JobName         │ CPU Eff    │ Mem Eff    │ Insights                                             │
╞══════════╪═════════════════╪════════════╪════════════╪══════════════════════════════════════════════════════╡
│ 12348    │ compute_task    │ 85.00%     │ 95.00%     │ High memory usage; Increase memory to avoid swapping │
╘══════════╧═════════════════╧════════════╧════════════╧══════════════════════════════════════════════════════╛
--------------------------------------
Total Jobs: 1
Avg CPU Eff: 85.00%, Avg Mem Eff: 95.00%
Summary:
- High memory usage; Increase memory to avoid swapping
--------------------------------------
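Filters such as the user and partition shown here map naturally onto sacct options. The sketch below assembles a sacct command line from them. It uses standard sacct options and field names, but which fields and options the script actually passes is an assumption, as is the reading of "completed jobs" as the COMPLETED state. The function name is hypothetical.

```python
from typing import List, Optional, Sequence

# Standard sacct field names; the set the script really requests is an assumption.
SACCT_FIELDS = "JobID,JobName,User,Partition,State,AllocCPUS,Elapsed,TotalCPU,ReqMem,MaxRSS"


def build_sacct_command(
    user: Optional[str] = None,
    job_ids: Optional[Sequence[str]] = None,
    partitions: Optional[Sequence[str]] = None,
    start: Optional[str] = None,
    all_users: bool = False,
) -> List[str]:
    """Return an argv list for sacct restricted to jobs in the COMPLETED state."""
    cmd = ["sacct", "--parsable2", "--noheader", "--state=COMPLETED", f"--format={SACCT_FIELDS}"]
    if all_users:
        cmd.append("--allusers")
    elif user:
        cmd.append(f"--user={user}")
    if job_ids:
        cmd.append(f"--jobs={','.join(job_ids)}")
    if partitions:
        cmd.append(f"--partition={','.join(partitions)}")
    if start:
        cmd.append(f"--starttime={start}")
    return cmd


print(" ".join(build_sacct_command(user="alice", partitions=["general"], start="2025-04-01")))
```

Returning a list rather than a single string lets the result be passed to subprocess.run without shell quoting concerns.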
Summary-Only Report
To display only summary statistics for user alice over the last 1 day:
python3 /opt/cluster_tools/babel_contrib/slurm_job_efficiency.py alice -t 1d -s
Output:
Slurm Job Efficiency: User: alice
--------------------------------------
Total Jobs: 3
Avg CPU Eff: 75.33%, Avg Mem Eff: 68.67%
Summary:
- Underutilized CPU; Reduce requested resources or optimize job
- Underutilized memory; Reduce memory allocation
--------------------------------------
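The summary figures are plain arithmetic means of the per-job efficiencies. As a worked check, the sketch below reproduces the averages from the two-job report for jobs 12345 and 12346 shown earlier on this page. The function name is hypothetical, and the output format simply imitates the sample report.

```python
from statistics import mean
from typing import List, Tuple


def format_summary(jobs: List[Tuple[float, float]]) -> str:
    """Format job count and mean efficiencies from (cpu_eff, mem_eff) pairs in percent."""
    if not jobs:
        return "Total Jobs: 0"
    avg_cpu = mean(cpu for cpu, _ in jobs)
    avg_mem = mean(mem for _, mem in jobs)
    return (
        f"Total Jobs: {len(jobs)}\n"
        f"Avg CPU Eff: {avg_cpu:.2f}%, Avg Mem Eff: {avg_mem:.2f}%"
    )


# Jobs 12345 (95.50% / 80.20%) and 12346 (45.30% / 20.10%).
print(format_summary([(95.50, 80.20), (45.30, 20.10)]))
# Total Jobs: 2
# Avg CPU Eff: 70.40%, Avg Mem Eff: 50.15%
```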
Explanation of Options
- -a, --all: Analyzes all users with completed jobs.
- -j, --job-id: Specifies job IDs to analyze (e.g., -j 12345,12346).
- -t, --time: Sets time range (e.g., 7d for 7 days, 12h for 12 hours) or start date (e.g., 2025-04-01 for jobs starting from April 1, 2025). Defaults to 1d if not specified.
- -p, --partition: Filters by partitions (e.g., -p compute,gpu).
- -r, --show-resource-details: Includes columns for requested cores, memory, and max memory used.
- -S, --sort-by: Sorts table by jobid, cpu_eff, or mem_eff.
- -s, --summary-only: Shows only summary statistics.
- --no-user: Suppresses printing the username.
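For readers who want to see how such an interface can be declared, the sketch below expresses the options listed above with Python's argparse module. It is an illustration only: the flag names come from this page, while the help strings, destination names, and the use of argparse itself are assumptions about the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line options described on this page."""
    parser = argparse.ArgumentParser(description="Analyze efficiency of completed Slurm jobs")
    parser.add_argument("user", nargs="?", help="user whose jobs are analyzed")
    parser.add_argument("-a", "--all", action="store_true", help="analyze all users")
    parser.add_argument("-j", "--job-id", help="comma-separated job IDs")
    parser.add_argument("-t", "--time", default="1d", help="Nd, Nh, or YYYY-MM-DD")
    parser.add_argument("-p", "--partition", help="comma-separated partitions")
    parser.add_argument("-r", "--show-resource-details", action="store_true",
                        help="add requested cores, memory, and max memory columns")
    parser.add_argument("-S", "--sort-by", choices=["jobid", "cpu_eff", "mem_eff"],
                        help="column to sort the table by")
    parser.add_argument("-s", "--summary-only", action="store_true",
                        help="print only the summary statistics")
    parser.add_argument("--no-user", action="store_true", help="do not print the username")
    return parser


args = build_parser().parse_args(["alice", "-t", "7d", "-r"])
print(args.user, args.time, args.show_resource_details)  # prints: alice 7d True
```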
Notes
- Invalid job IDs or users trigger warnings or errors, and the script skips them.
- Time ranges can be specified as Nd (days), Nh (hours), or a date in YYYY-MM-DD format.
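The three accepted time formats can be turned into a start timestamp along the following lines. This is a minimal sketch under the assumption that a relative range counts back from the current time and that a date means midnight on that day. The function name and error message are hypothetical.

```python
import re
from datetime import datetime, timedelta
from typing import Optional


def parse_time_spec(spec: str, now: Optional[datetime] = None) -> datetime:
    """Return the start time for 'Nd', 'Nh', or 'YYYY-MM-DD'."""
    now = now or datetime.now()
    match = re.fullmatch(r"(\d+)([dh])", spec)
    if match:
        amount = int(match.group(1))
        delta = timedelta(days=amount) if match.group(2) == "d" else timedelta(hours=amount)
        return now - delta
    try:
        return datetime.strptime(spec, "%Y-%m-%d")
    except ValueError:
        raise ValueError(f"invalid time range: {spec!r}") from None


reference = datetime(2025, 4, 8, 12, 0)
print(parse_time_spec("7d", reference))   # prints 2025-04-01 12:00:00
print(parse_time_spec("2025-04-01"))      # prints 2025-04-01 00:00:00
```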