# SLURM Array Jobs Guide

Authors: Snit Sanghlao, Claude, and QWEN AI agents

## Quick Start

```bash
# Create the log directory first; SLURM will not create it for you
mkdir -p logs

# Submit 16 parallel FFT tasks
sbatch --array=0-15 slurm/fft-example.sh

# Check status
squeue -u $USER

# View results
cat logs/fft-*.log
```

## Why Array Jobs?

### 1. Parallel Execution

Run 16 experiments simultaneously instead of sequentially:

| Approach   | Time (16 tasks × 5 min) |
|------------|-------------------------|
| Sequential | 80 minutes              |
| Array job  | ~5 minutes              |
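The arithmetic behind the table is just ceiling division over the number of tasks that can run at once; a quick sketch using the task count and duration from the table:

```python
def wall_time_min(n_tasks: int, concurrency: int, task_min: int = 5) -> int:
    """Wall-clock minutes to finish n_tasks when `concurrency` run at once."""
    batches = -(-n_tasks // concurrency)  # ceiling division
    return batches * task_min

print(wall_time_min(16, 1))   # sequential: 80 minutes
print(wall_time_min(16, 16))  # array job: 5 minutes
```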

### 2. QOS Limit Workaround

Clusters limit concurrent jobs per user (typically 3-10). An array job counts as ONE job submission:

| Approach               | Job Count | Result                               |
|------------------------|-----------|--------------------------------------|
| 32 individual jobs     | 32        | Rejected after the limit is reached  |
| 1 array job (32 tasks) | 1         | All 32 tasks accepted and scheduled  |

### 3. Fairness & Resource Utilization

Array jobs and QOS limits together ensure fair cluster access:

| User | Individual Jobs   | Array Jobs |
|------|-------------------|------------|
| A    | 32 jobs (blocked) | 1 job ✓    |
| B    | 32 jobs (blocked) | 1 job ✓    |
| C    | 32 jobs (blocked) | 1 job ✓    |

Without array jobs, users would spam the queue with individual submissions. With array jobs, everyone gets an equal share of job slots.


## FFT Frequency Analysis Example

### Task Distribution

| Task | Signal             | Freq Band |
|------|--------------------|-----------|
| 0    | traffic_speed      | low       |
| 1    | traffic_volume     | low       |
| 2    | intersection_delay | low       |
| 3    | traffic_speed      | high      |
| 4    | traffic_volume     | high      |
| 5    | intersection_delay | high      |
| 6    | traffic_speed      | all       |
| 7    | traffic_volume     | all       |
| 8    | intersection_delay | all       |
| 9-15 | (cycle repeats from task 0) | … |
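The same index arithmetic used in `slurm/fft-example.sh` can be sketched in Python to sanity-check the table:

```python
SIGNALS = ["traffic_speed", "traffic_volume", "intersection_delay"]
FREQ_BANDS = ["low", "high", "all"]

def task_params(task_id: int):
    """Mirror of SIG_IDX=$((TASK_ID % 3)) and FREQ_IDX=$(((TASK_ID / 3) % 3))."""
    return SIGNALS[task_id % 3], FREQ_BANDS[(task_id // 3) % 3]

for tid in range(16):
    print(tid, *task_params(tid))
```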

### scripts/fft_analysis.py

```python
#!/usr/bin/env python3
"""FFT Frequency Analysis -- Array Job Worker"""

import argparse
import json
import os
from pathlib import Path

import numpy as np

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--signal",    type=str, default="traffic_speed")
    parser.add_argument("--freq-band", type=str, default="all")
    parser.add_argument("--output",    type=str, default="./results")
    args = parser.parse_args()

    task_id = int(os.getenv("SLURM_ARRAY_TASK_ID", "0"))

    # Generate sample signal (replace with real data)
    np.random.seed(task_id)
    t = np.linspace(0, 10, 1000)
    signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t) + np.random.randn(1000) * 0.1

    # Compute FFT; use the actual sample spacing of t, not a hard-coded value
    fft = np.fft.fft(signal)
    freqs = np.fft.fftfreq(len(signal), d=t[1] - t[0])

    # Extract dominant frequencies (positive half of the spectrum)
    magnitudes = np.abs(fft[:len(fft) // 2])
    top_idx = np.argsort(magnitudes)[-5:][::-1]
    top_freqs = [(freqs[i], magnitudes[i]) for i in top_idx]

    result = {
        "task_id": task_id,
        "signal": args.signal,
        "freq_band": args.freq_band,
        "dominant_freqs": [{"freq": float(f), "mag": float(m)} for f, m in top_freqs],
        "status": "complete",
    }

    out_dir = Path(args.output) / "fft" / f"task{task_id}"
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "result.json", "w") as f:
        json.dump(result, f, indent=2)

    print(f"Task {task_id}: Top freqs = {top_freqs[:3]}")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```
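As a demo, the script above accepts `--freq-band` but analyzes the full spectrum regardless. If you wire in real data, a minimal band filter might look like the sketch below; the cutoff of 8 Hz is purely illustrative, since the original script defines no band edges:

```python
import numpy as np

# Hypothetical band edges in Hz (not defined anywhere in the original script)
BAND_EDGES = {"low": (0.0, 8.0), "high": (8.0, np.inf), "all": (0.0, np.inf)}

def band_mask(freqs: np.ndarray, band: str) -> np.ndarray:
    """Boolean mask selecting non-negative frequencies inside the band."""
    lo, hi = BAND_EDGES[band]
    return (freqs >= lo) & (freqs < hi)

# Same spectrum shape as in fft_analysis.py: 1000 samples, 0.1 Hz resolution
freqs = np.fft.fftfreq(1000, d=0.01)
print(band_mask(freqs, "low").sum())  # number of bins between 0 and 8 Hz
```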

### slurm/fft-example.sh

```bash
#!/bin/bash
# FFT Array Job -- parallel frequency analysis
# Submit: sbatch --array=0-15 slurm/fft-example.sh

#SBATCH --job-name=fft-analysis
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=2G
#SBATCH --time=00:10:00
#SBATCH --partition=defq
#SBATCH --output=logs/fft-%A_%a.log

source /cm/shared/apps/anaconda3/etc/profile.d/conda.sh
conda activate traffic

TASK_ID=$SLURM_ARRAY_TASK_ID

# Signal configurations (3 signals x 3 bands = 9 combinations; 16 tasks cycle through them)
SIGNALS=(traffic_speed traffic_volume intersection_delay)
FREQ_BANDS=(low high all)

# Map task_id to parameters
SIG_IDX=$((TASK_ID % 3))
FREQ_IDX=$(((TASK_ID / 3) % 3))

SIGNAL=${SIGNALS[$SIG_IDX]}
FREQ_BAND=${FREQ_BANDS[$FREQ_IDX]}

echo "Task $TASK_ID: signal=$SIGNAL, freq_band=$FREQ_BAND"

python3 scripts/fft_analysis.py \
    --signal "$SIGNAL" \
    --freq-band "$FREQ_BAND" \
    --output "/scratch/$USER/fft-results"
```

## Smoke Test (No Real Data)

Validate the infrastructure before running full experiments:

### scripts/fft_smoke_test.py

```python
#!/usr/bin/env python3
"""FFT Smoke Test -- validates the FFT array job infrastructure"""

import argparse
import json
import os
from pathlib import Path

import numpy as np

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--signal",    type=str, default="test")
    parser.add_argument("--freq-band", type=str, default="all")
    parser.add_argument("--output",    type=str, default="./results")
    args = parser.parse_args()

    task_id = int(os.getenv("SLURM_ARRAY_TASK_ID", "0"))

    # Generate synthetic signal (no real data needed)
    np.random.seed(task_id)
    t = np.linspace(0, 10, 1000)
    signal = np.sin(2 * np.pi * 5 * t) + np.random.randn(1000) * 0.1

    # Compute FFT; use the actual sample spacing of t
    fft = np.fft.fft(signal)
    freqs = np.fft.fftfreq(len(signal), d=t[1] - t[0])

    # Extract top frequencies (positive half of the spectrum)
    magnitudes = np.abs(fft[:len(fft) // 2])
    top_idx = np.argsort(magnitudes)[-3:][::-1]
    top_freqs = [(freqs[i], magnitudes[i]) for i in top_idx]

    result = {
        "task_id": task_id,
        "signal": args.signal,
        "freq_band": args.freq_band,
        "dominant_freqs": [{"freq": float(f), "mag": float(m)} for f, m in top_freqs],
        "status": "complete",
        "smoke_test": True,
    }

    out_dir = Path(args.output) / "fft-smoke" / f"task{task_id}"
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "result.json", "w") as f:
        json.dump(result, f, indent=2)

    print(f"Task {task_id}: Top freq = {top_freqs[0]}")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```

### slurm/fft-smoke-test.sh

```bash
#!/bin/bash
# FFT Smoke Test -- 16 parallel tasks, no real data
# Submit: sbatch --array=0-15 slurm/fft-smoke-test.sh

#SBATCH --job-name=fft-smoke
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=2G
#SBATCH --time=00:05:00
#SBATCH --partition=defq
#SBATCH --output=logs/fft-smoke-%A_%a.log

source /cm/shared/apps/anaconda3/etc/profile.d/conda.sh
conda activate python3

TASK_ID=$SLURM_ARRAY_TASK_ID

# Synthetic configurations (4 signals x 3 bands = 12 combinations; 16 tasks cycle through them)
SIGNALS=(synthetic_1 synthetic_2 synthetic_3 synthetic_4)
FREQ_BANDS=(low high all)

SIG_IDX=$((TASK_ID % 4))
FREQ_IDX=$(((TASK_ID / 4) % 3))

SIGNAL=${SIGNALS[$SIG_IDX]}
FREQ_BAND=${FREQ_BANDS[$FREQ_IDX]}

echo "Task $TASK_ID: signal=$SIGNAL, freq_band=$FREQ_BAND"

python3 scripts/fft_smoke_test.py \
    --signal "$SIGNAL" \
    --freq-band "$FREQ_BAND" \
    --output "/scratch/$USER/fft-results"
```

### Run Smoke Test

```bash
sbatch --array=0-15 slurm/fft-smoke-test.sh
```
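Once the array finishes, a small helper can confirm that all 16 tasks wrote their result files (the example path mirrors the `--output` used above):

```python
import json
import os
from pathlib import Path

def count_complete(base: Path, expected: int = 16) -> int:
    """Count tasks under `base` that wrote a result.json with status 'complete'."""
    done = sum(
        1
        for f in base.glob("task*/result.json")
        if json.loads(f.read_text()).get("status") == "complete"
    )
    print(f"{done}/{expected} tasks complete")
    return done

# e.g. count_complete(Path("/scratch") / os.environ["USER"] / "fft-results" / "fft-smoke")
```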

## QOS Limits Explained

### What You See

```text
JOBID          STATE
339846_[0-2]   R    (running)
339846_[3-31]  PD   (QOSMaxJobsPerUserLimit)
```

### Meaning

- PD = Pending (waiting)
- QOSMaxJobsPerUserLimit = the cluster allows only 3 concurrently running jobs per user; the remaining tasks queue automatically and start as slots free up

### Why QOS Limits Exist

| Reason              | Explanation                            |
|---------------------|----------------------------------------|
| Fair share          | Prevent one user from monopolizing     |
| Resource protection | Protect login nodes and the scheduler  |
| Cost control        | Compute resources are expensive        |

### How Array Jobs Help

An array job counts as ONE entry against the submission limit (individual tasks may still queue behind the concurrent-run limit, as the squeue output above shows):

```bash
# Individual jobs (hit the QOS submission limit)
sbatch job1.sh    # Job 1 ✓
sbatch job2.sh    # Job 2 ✓
sbatch job3.sh    # Job 3 ✓
sbatch job4.sh    # REJECTED

# Array job (one submission, 32 tasks)
sbatch --array=0-31 exp.sh   # Job 1 ✓ (32 tasks under 1 entry)
```

### Check Your Cluster's QOS

```bash
# View QOS definitions
sacctmgr show qos

# Check your QOS limits
sacctmgr show qos format=Name,MaxJobs,MaxSubmitJobs

# Check current jobs (id, name, state)
squeue -u $USER --format="%i %j %T"
```

## Commands Reference

| Action               | Command                         |
|----------------------|---------------------------------|
| Submit array job     | `sbatch --array=0-15 script.sh` |
| Check status         | `squeue -u $USER`               |
| Check specific job   | `squeue -j <JOB_ID>`            |
| Cancel entire array  | `scancel <JOB_ID>`              |
| Cancel specific tasks| `scancel <JOB_ID>_[0,5,10]`     |
| View logs            | `cat logs/fft-*.log`            |
| View QOS limits      | `sacctmgr show qos`             |


## Resource Efficiency

### Task Sizing Guide

| Task Type           | CPUs | Memory | Time    |
|---------------------|------|--------|---------|
| FFT (light)         | 2    | 2 GB   | 5 min   |
| Traffic sim (medium)| 2    | 4 GB   | 1 hour  |
| ML training (heavy) | 8    | 32 GB  | 4 hours |

### CPU-Oriented Tasks

For compute-bound workloads (FFT, matrix ops, simulations):

| Scenario                  | CPUs Per Task | Reason                              |
|---------------------------|---------------|-------------------------------------|
| Single-threaded FFT       | 2             | 1 for compute, 1 for I/O            |
| Multi-threaded FFT        | 4-8           | Parallelize across cores            |
| Vectorized (numpy)        | 2-4           | Leverage SIMD, avoid oversubscription |
| Parallel (multiprocessing)| 8-16          | Spawn worker processes              |
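To actually stay within the `--cpus-per-task` allocation and avoid oversubscription, a worker can cap its BLAS/OpenMP thread pools before importing numpy. This is a sketch assuming common OpenMP/MKL/OpenBLAS backends; the environment-variable names depend on which backend your numpy build uses:

```python
import os

# SLURM exports SLURM_CPUS_PER_TASK inside the job; default to 1 elsewhere
n_threads = os.environ.get("SLURM_CPUS_PER_TASK", "1")
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ.setdefault(var, n_threads)

import numpy as np  # import AFTER setting the env vars so the backend sees them

print(os.environ["OMP_NUM_THREADS"])
```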

### Node Packing (CPU-Heavy)

Node: 64 CPUs, 256 GB RAM

| Task Type       | CPUs/Task | Tasks/Node | Total Time (32 tasks) |
|-----------------|-----------|------------|-----------------------|
| Light (2 CPUs)  | 2         | 32         | ~5 min                |
| Medium (4 CPUs) | 4         | 16         | ~10 min (2 batches)   |
| Heavy (8 CPUs)  | 8         | 8          | ~20 min (4 batches)   |
| Very heavy (16) | 16        | 4          | ~40 min (8 batches)   |
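The batch counts in the table follow from ceiling division over a 64-CPU node; a quick check:

```python
import math

def packing(n_tasks: int, cpus_per_task: int, node_cpus: int = 64, task_min: int = 5):
    """Return (tasks per node, batches, total minutes) for one node."""
    tasks_per_node = node_cpus // cpus_per_task
    batches = math.ceil(n_tasks / tasks_per_node)
    return tasks_per_node, batches, batches * task_min

for cpus in (2, 4, 8, 16):
    print(cpus, packing(32, cpus))
```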

### When to Increase CPUs Per Task

```bash
# Single task, more cores = faster
#SBATCH --cpus-per-task=8    # FFT with a parallel backend

# Many tasks, limit total cores
#SBATCH --cpus-per-task=2    # 32 tasks × 2 = 64 cores
```

### Memory-Bound vs CPU-Bound

| Type         | Bottleneck   | Solution                 |
|--------------|--------------|--------------------------|
| CPU-bound    | Compute      | Add more CPUs            |
| Memory-bound | RAM          | Add more memory          |
| I/O-bound    | Disk/Network | Reduce I/O, cache results|

### Right-Sizing Matters

```bash
# Over-allocated (waste)
#SBATCH --cpus-per-task=8    # Only uses 2
#SBATCH --mem=16G            # Only uses 2GB
# Result: 75% waste, fewer tasks per node

# Right-sized (efficient)
#SBATCH --cpus-per-task=2
#SBATCH --mem=2G
# Result: full utilization, more tasks per node

## Output Format

```json
{
  "task_id": 5,
  "signal": "traffic_volume",
  "freq_band": "high",
  "dominant_freqs": [
    {"freq": 12.0, "mag": 45.3},
    {"freq": 5.0, "mag": 32.1},
    {"freq": 0.5, "mag": 12.8}
  ],
  "status": "complete"
}
```

## Aggregating Results

```python
import json
import os
from pathlib import Path

# "$USER" does not expand inside a Python string; read it from the environment
results_dir = Path("/scratch") / os.environ["USER"] / "fft-results" / "fft"
results = []

for task_dir in sorted(results_dir.glob("task*")):
    result_file = task_dir / "result.json"
    if result_file.exists():
        results.append(json.loads(result_file.read_text()))

# Print summary, ordered by task id
for r in sorted(results, key=lambda r: r["task_id"]):
    print(f"Task {r['task_id']}: {r['signal']} | {r['freq_band']} | {r['status']}")
```

## Best Practices

| Practice            | Why                                              |
|---------------------|--------------------------------------------------|
| Use array jobs      | Count as 1 entry against QOS submission limits   |
| Right-size tasks    | Maximize node packing, reduce waste              |
| Run a smoke test    | Validate before the full experiment              |
| Monitor utilization | Check actual vs. allocated resources             |
| Batch similar tasks | Group by resource requirements                   |


## Summary

| Concept    | Key Point                                    |
|------------|----------------------------------------------|
| Array jobs | 1 job entry, N tasks                         |
| QOS limits | Fair share, prevent monopolization           |
| Fairness   | Array + QOS = equal access for all users     |
| Efficiency | Right-size tasks, pack nodes fully           |
| Smoke test | Validate infrastructure before real work     |