1. What is Slurm?

Slurm (Simple Linux Utility for Resource Management) is the most widely used job scheduler for HPC clusters. It manages three things: allocating access to compute resources, launching and monitoring work on those resources, and arbitrating contention through a queue of pending jobs.

[Diagram: Slurm cluster architecture. You submit jobs with sbatch from the login nodes (login1, login2); the Slurm controller (slurmctld) schedules them from its queue and allocates compute nodes (e.g. gpu001-003 with 4× A100 each, cpu001-002 with 128 cores each), all backed by shared Lustre storage (/scratch, /projappl).]
Users submit jobs from login nodes → Slurm schedules → Jobs run on compute nodes
Never Run on Login Nodes!

Login nodes are shared by all users and are meant for light work only: editing files, managing data, and submitting jobs. Running heavy computation there can get your account suspended. Always use sbatch or srun to run on compute nodes.

2. Slurm Architecture

Partitions & Nodes

Slurm organizes resources into partitions (also called queues). Each partition has its own time limits, node types, resource limits, and scheduling priority:

$ sinfo
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
gpu*         up     3-00:00:00     24  idle   gpu[001-024]
gpu-large    up     7-00:00:00      8  idle   gpu[101-108]
cpu          up     3-00:00:00    200  idle   cpu[001-200]
interactive  up     4:00:00        10  idle   int[001-010]
test         up     1:00:00         4  idle   test[001-004]

Resource Allocation

Key resources you request:

Slurm Resource Hierarchy

  • Node (--nodes=N) — a physical machine with CPUs, GPUs, and memory. Example: gpu001 = 128 CPU cores + 4× A100 GPUs + 512 GB RAM.
  • Task (--ntasks=N) — an MPI rank / process; typically 1 task per GPU for DDP.
  • CPUs per task (--cpus-per-task=N) — threads for data loading; usually 8-16 for a PyTorch DataLoader.
  • GPUs — --gpus=N requests N GPUs in total for the job; --gres=gpu:N requests N GPUs per node.
  • Memory (--mem=NG) — RAM per node (default: proportional to the CPUs requested).

Understanding the hierarchy: Nodes → Tasks → CPUs/GPUs → Memory
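To make the hierarchy concrete, here is a small arithmetic sketch (illustrative numbers only, mirroring the gpu001 example above) of what a request actually allocates:

```python
# Illustrative Slurm request arithmetic; numbers mirror the gpu001
# example above (4 GPUs per node, 10 CPUs per task).
nodes = 2            # --nodes=2
tasks_per_node = 4   # --ntasks-per-node=4 (1 task per GPU for DDP)
cpus_per_task = 10   # --cpus-per-task=10
gpus_per_node = 4    # --gres=gpu:4

total_tasks = nodes * tasks_per_node      # MPI ranks / processes
total_gpus = nodes * gpus_per_node        # world size for DDP
total_cpus = total_tasks * cpus_per_task  # cores reserved overall

print(total_tasks, total_gpus, total_cpus)  # 8 8 80
```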

3. Your First Slurm Job

Writing a Batch Script

A Slurm batch script starts with #SBATCH directives:

Bash my_first_job.sh
#!/bin/bash
#SBATCH --job-name=hello_gpu       # Job name
#SBATCH --account=project_2001234  # Billing account
#SBATCH --partition=gpu            # Queue/partition
#SBATCH --time=00:15:00            # Time limit (HH:MM:SS)
#SBATCH --nodes=1                  # Number of nodes
#SBATCH --ntasks=1                 # Number of tasks (processes)
#SBATCH --cpus-per-task=10         # CPUs per task
#SBATCH --gres=gpu:v100:1          # 1 V100 GPU
#SBATCH --mem=32G                  # Memory per node
#SBATCH --output=output_%j.txt     # Output file (%j = job ID)
#SBATCH --error=error_%j.txt       # Error file

# Load required modules
module load pytorch/2.1

# Print some info
echo "Job started on $(hostname)"
echo "CUDA devices: $CUDA_VISIBLE_DEVICES"

# Run your Python script
python train.py --epochs 10 --batch-size 64

echo "Job finished!"

Submit & Monitor

Bash commands.sh
# Submit the job
sbatch my_first_job.sh
# Output: Submitted batch job 12345678

# Check job status
squeue --me               # Your jobs only
squeue -u $USER           # Same thing
squeue -j 12345678        # Specific job

# Detailed job info
scontrol show job 12345678

# Cancel a job
scancel 12345678

# Cancel all your jobs
scancel -u $USER

# Check job efficiency after completion
seff 12345678

# View past jobs (accounting)
sacct -j 12345678 --format=JobID,JobName,Elapsed,MaxRSS,MaxVMSize,State
$ sbatch my_first_job.sh
Submitted batch job 12345678

$ squeue --me
JOBID     PARTITION  NAME       USER   STATE    TIME  NODES  NODELIST(REASON)
12345678  gpu        hello_gpu  user1  RUNNING  0:05      1  gpu003

$ seff 12345678
Job ID: 12345678
Cluster: mahti
User/Group: user1/users
State: COMPLETED (exit code 0)
Cores: 10
CPU Utilized: 00:08:23
CPU Efficiency: 83.83%
Memory Utilized: 24.5 GB
Memory Efficiency: 76.56% of 32.00 GB
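The efficiency figures seff reports are simple ratios. This sketch reproduces the numbers from the sample job above (the elapsed wall time, not shown in the output, is inferred as 60 s from the reported percentage):

```python
# CPU efficiency = CPU time used / (cores × wall-clock time).
cores = 10
cpu_utilized_s = 8 * 60 + 23     # 00:08:23 from seff
elapsed_s = 60                   # wall time, inferred from the 83.83% figure
cpu_eff = cpu_utilized_s / (cores * elapsed_s)

# Memory efficiency = peak usage / requested memory.
mem_used_gb = 24.5
mem_req_gb = 32.0
mem_eff = mem_used_gb / mem_req_gb

print(f"{cpu_eff:.2%} {mem_eff:.2%}")  # 83.83% 76.56%
```

If CPU efficiency is consistently low, request fewer cores; if memory efficiency is low, request less memory so your jobs schedule faster.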

4. Multi-GPU Jobs

Single Node Multi-GPU

For training on multiple GPUs within one node (most common for moderate-scale training):

Bash single_node_4gpu.sh
#!/bin/bash
#SBATCH --job-name=train_4gpu
#SBATCH --account=project_2001234
#SBATCH --partition=gpu
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1        # 1 task: torchrun spawns the per-GPU workers
#SBATCH --cpus-per-task=40         # 10 CPUs per GPU × 4 GPUs
#SBATCH --gres=gpu:a100:4          # 4 A100 GPUs
#SBATCH --mem=200G
#SBATCH --output=train_%j.out

module load pytorch/2.1

# Launch with torchrun (recommended for PyTorch DDP)
srun torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=4 \
    train_ddp.py \
    --batch-size 256 \
    --epochs 100
[Diagram: single-node multi-GPU setup on gpu001. Each of the 4 A100s holds a full model copy (ranks 0-3) and processes its own 64-sample slice of the 256-sample batch (batch[:64], batch[64:128], …); gradients are all-reduced over NVLink at ~600 GB/s.]
DDP: Each GPU has a full model copy; data is split across GPUs
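The per-GPU split described above can be sketched directly. This illustrative snippet shards a global batch of 256 sample indices into contiguous 64-sample chunks, one per rank (DistributedSampler itself interleaves indices rather than slicing contiguously, but the effect is the same: disjoint shards of equal size):

```python
# Shard a global batch of 256 sample indices across 4 DDP ranks.
world_size = 4
global_batch = list(range(256))
per_gpu = len(global_batch) // world_size  # 64 samples per GPU

shards = [global_batch[r * per_gpu:(r + 1) * per_gpu]
          for r in range(world_size)]

print(per_gpu, shards[1][0], shards[1][-1])  # 64 64 127 (rank 1's slice)
```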

Multi-Node Distributed Training

For large-scale training across multiple nodes:

Bash multi_node_8gpu.sh
#!/bin/bash
#SBATCH --job-name=train_multinode
#SBATCH --account=project_2001234
#SBATCH --partition=gpu
#SBATCH --time=24:00:00
#SBATCH --nodes=2                  # 2 nodes
#SBATCH --ntasks-per-node=1        # one torchrun per node; it spawns 4 workers
#SBATCH --cpus-per-task=40         # 10 CPUs per GPU × 4 GPUs
#SBATCH --gres=gpu:a100:4          # 4 GPUs per node
#SBATCH --mem=200G
#SBATCH --output=multinode_%j.out

module load pytorch/2.1

# Get master node address
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=29500

echo "Master node: $MASTER_ADDR"
echo "Total GPUs: $((SLURM_NNODES * 4))"

# Launch distributed training
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_ddp.py \
    --batch-size 512 \
    --epochs 100
[Diagram: multi-node distributed training, 2 nodes × 4 GPUs. Node 0 (gpu001, master) hosts global ranks 0-3 and node 1 (gpu002) hosts ranks 4-7; LOCAL_RANK runs 0-3 on each node. NCCL all-reduces over NVLink within a node and over HDR InfiniBand (200 Gb/s) across nodes.]
8 GPUs total: WORLD_SIZE=8, ranks 0-7 distributed across 2 nodes
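The rank layout above follows one formula: global rank = node index × GPUs per node + local rank. A small sketch (assuming the 2 × 4 setup of this example):

```python
# Derive global ranks from node index and LOCAL_RANK for 2 nodes × 4 GPUs.
nnodes, gpus_per_node = 2, 4
world_size = nnodes * gpus_per_node  # 8

layout = {
    (node, local): node * gpus_per_node + local
    for node in range(nnodes)
    for local in range(gpus_per_node)
}

print(world_size, layout[(1, 0)])  # 8 4 (first GPU on node 1 is global rank 4)
```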

5. PyTorch + Slurm Integration

Your PyTorch script needs to read Slurm environment variables:

Python train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    """Initialize distributed training from Slurm environment."""
    
    # torchrun sets these automatically
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    
    # Initialize process group
    dist.init_process_group(
        backend="nccl",  # Use NCCL for GPU
        init_method="env://",
        world_size=world_size,
        rank=rank
    )
    
    # Set device for this process
    torch.cuda.set_device(local_rank)
    
    print(f"Initialized rank {rank}/{world_size} on GPU {local_rank}")
    
    return rank, world_size, local_rank


def main():
    rank, world_size, local_rank = setup_distributed()
    device = torch.device(f"cuda:{local_rank}")
    
    # Create model and wrap with DDP
    model = MyModel().to(device)
    model = DDP(model, device_ids=[local_rank])
    
    # Create distributed sampler
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=world_size,
        rank=rank,
        shuffle=True
    )
    
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=64,            # Per-GPU batch size
        sampler=train_sampler,
        num_workers=10,            # Match cpus-per-task
        pin_memory=True
    )
    
    # Training loop
    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)  # Important for shuffling!
        
        for batch in train_loader:
            # ... training step ...
            pass
    
    # Cleanup
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
Key Environment Variables

RANK: Global rank (0 to WORLD_SIZE-1)
LOCAL_RANK: Rank within the node (0 to GPUs_per_node-1)
WORLD_SIZE: Total number of processes
MASTER_ADDR: IP of rank 0 node
MASTER_PORT: Port for communication
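If you launch with plain srun instead of torchrun, these variables are not set; srun exports Slurm equivalents (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID) that you can map yourself. The variable names below are real Slurm exports; the mapping function itself is an illustrative sketch:

```python
# Map srun's per-task environment to the names torch.distributed expects.
def slurm_to_torch_env(env):
    return {
        "RANK": env["SLURM_PROCID"],         # global rank of this task
        "WORLD_SIZE": env["SLURM_NTASKS"],   # total number of tasks
        "LOCAL_RANK": env["SLURM_LOCALID"],  # rank within this node
    }

# Simulated environment for task 5 of an 8-task job (2 nodes × 4 GPUs):
fake_env = {"SLURM_PROCID": "5", "SLURM_NTASKS": "8", "SLURM_LOCALID": "1"}
print(slurm_to_torch_env(fake_env))
# {'RANK': '5', 'WORLD_SIZE': '8', 'LOCAL_RANK': '1'}
```

In a real script you would pass os.environ and set MASTER_ADDR/MASTER_PORT yourself (e.g. from scontrol show hostnames as in the batch script above).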

6. CSC-Specific Tips (Puhti/Mahti/LUMI)

If you're using CSC's supercomputers in Finland, here are cluster-specific configurations:

🖥️ Puhti (GPU)

  • 80 GPU nodes
  • 4× V100 (32GB) per node
  • Partition: gpu, gputest
  • Max time: 3 days

🚀 Mahti (GPU)

  • 24 GPU nodes
  • 4× A100 (40GB) per node
  • Partition: gpusmall, gpumedium
  • NVLink + HDR InfiniBand

⚡ LUMI-G

  • 2,978 GPU nodes
  • 4× AMD MI250X per node
  • Use ROCm instead of CUDA
  • Partition: standard-g

Mahti GPU Job Example

Bash mahti_a100.sh
#!/bin/bash
#SBATCH --job-name=llm_train
#SBATCH --account=project_2001234
#SBATCH --partition=gpumedium       # 1-8 GPUs: gpusmall, 1-32: gpumedium
#SBATCH --time=24:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32          # Mahti has 128 cores/node
#SBATCH --gres=gpu:a100:4
#SBATCH --mem=0                     # Request all memory

# CSC-specific modules
module load pytorch/2.1

# Use local SSD for faster data loading
export TMPDIR=/local_scratch/$SLURM_JOB_ID
mkdir -p $TMPDIR
cp -r $SCRATCH/my_data $TMPDIR/

# Enable NCCL debugging (optional)
export NCCL_DEBUG=INFO

# Launch training
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$(hostname):29500 \
    train.py --data $TMPDIR/my_data
CSC Billing

CSC uses Billing Units (BUs). GPU jobs consume BUs faster than CPU jobs. Check your allocation with: csc-projects
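As a rough illustration of why GPU time dominates a project's budget, here is a sketch with hypothetical billing rates (these are NOT CSC's real numbers; check the CSC billing documentation for the actual per-resource rates):

```python
# Hypothetical billing-unit (BU) rates -- illustrative only, not CSC's.
BU_PER_CPU_CORE_HOUR = 1.0
BU_PER_GPU_HOUR = 60.0  # GPUs are billed far above CPU cores

def job_cost_bu(hours, cores=0, gpus=0):
    """Rough BU estimate for one job under the hypothetical rates above."""
    return hours * (cores * BU_PER_CPU_CORE_HOUR + gpus * BU_PER_GPU_HOUR)

# A 4-hour job on 40 cores + 4 GPUs vs the same cores CPU-only:
print(job_cost_bu(hours=4, cores=40, gpus=4))  # 1120.0
print(job_cost_bu(hours=4, cores=40))          # 160.0
```

Under these assumed rates the GPU job costs 7× the CPU-only one, which is why checking seff for idle GPUs matters.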

7. Common Issues & Debugging

Issue                 | Cause                   | Solution
Job stuck in PENDING  | Resources not available | squeue -j JOBID -o "%R" to see the reason
CUDA out of memory    | Batch size too large    | Reduce batch size or use gradient accumulation
DDP hangs at init     | Network/firewall issues | Check MASTER_ADDR, try a different port
NCCL timeout          | One rank crashed/slow   | export NCCL_DEBUG=INFO for details
Job killed (OOM)      | Exceeded memory limit   | Request more memory with --mem
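The gradient-accumulation fix for CUDA out-of-memory errors works by processing small micro-batches and stepping the optimizer only every N of them, so the effective batch size stays large while peak memory stays small. The counting logic is sketched below framework-free; the backward/step calls are shown as comments:

```python
# Reach an effective batch of 256 with micro-batches of 64 by stepping
# the optimizer every 4 micro-batches.
micro_batch = 64
accum_steps = 4
effective_batch = micro_batch * accum_steps  # 256

num_micro_batches = 100
optimizer_steps = 0
for i in range(num_micro_batches):
    # loss = model(batch); (loss / accum_steps).backward()  # grads accumulate
    if (i + 1) % accum_steps == 0:
        optimizer_steps += 1  # optimizer.step(); optimizer.zero_grad()

print(effective_batch, optimizer_steps)  # 256 25
```

Note the loss is divided by accum_steps so the accumulated gradient matches what one large batch would produce.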

Useful Debugging Commands

Bash debug_commands.sh
# Why is my job pending?
squeue -j $JOBID -o "%j %T %r"

# Check node health
sinfo -N -l

# Get detailed job info
scontrol show job $JOBID

# Check GPU usage on a node (if you have access)
ssh gpu001 nvidia-smi

# View job output in real-time
tail -f output_$JOBID.txt

# Cancel and resubmit with more resources
scancel $JOBID
sbatch --mem=64G my_job.sh  # Override memory

# Interactive GPU session for debugging
srun --account=project_2001234 --partition=gpu --gres=gpu:1 --time=01:00:00 --pty bash

Summary: Slurm Cheat Sheet

Essential Slurm Commands

  • Submit & Run: sbatch script.sh / srun ./program / salloc --gres=gpu:1 — submit a batch job / run interactively / allocate resources
  • Monitor: squeue --me / scontrol show job ID / seff ID — check status / job details / efficiency
  • Control & Info: scancel ID / sinfo -p gpu / sacct -j ID — cancel / partition info / accounting

Essential #SBATCH Flags

  • --nodes=N — number of nodes
  • --ntasks-per-node=N — tasks per node (1 per GPU for DDP)
  • --gres=gpu:TYPE:N — GPU type and count
  • --cpus-per-task=N — CPUs per task (for DataLoader workers)
  • --time=HH:MM:SS — time limit
  • --mem=NG — memory per node
Pro Tips

1. Start with --time=00:30:00 for testing, then increase for production.
2. Use seff JOBID after completion to check CPU/memory efficiency.
3. Set --cpus-per-task equal to your DataLoader num_workers.
4. Always use DistributedSampler with DDP to avoid duplicate batches.