1. What is Slurm?
Slurm (originally short for Simple Linux Utility for Resource Management) is the most widely used job scheduler on HPC clusters. It manages:
- Job queuing: Fair scheduling across many users
- Resource allocation: CPUs, GPUs, memory, time limits
- Job execution: Running your code on compute nodes
- Accounting: Tracking resource usage and billing
Login nodes are shared by all users and are meant for job submission, editing, and light file management only. Running heavy computation there can get your account suspended. Always use sbatch or srun to run work on compute nodes.
2. Slurm Architecture
Partitions & Nodes
Slurm organizes resources into partitions (also called queues). Each partition has different:
- Hardware (GPU vs CPU nodes)
- Time limits
- Priority levels
- Access restrictions
Resource Allocation
Key resources you request in every job:
- Tasks and CPUs (`--ntasks`, `--cpus-per-task`)
- GPUs (`--gres=gpu:TYPE:COUNT`)
- Memory (`--mem` or `--mem-per-cpu`)
- Wall-clock time (`--time`)
3. Your First Slurm Job
Writing a Batch Script
A Slurm batch script starts with #SBATCH directives:
```bash
#!/bin/bash
#SBATCH --job-name=hello_gpu          # Job name
#SBATCH --account=project_2001234     # Billing account
#SBATCH --partition=gpu               # Queue/partition
#SBATCH --time=00:15:00               # Time limit (HH:MM:SS)
#SBATCH --nodes=1                     # Number of nodes
#SBATCH --ntasks=1                    # Number of tasks (processes)
#SBATCH --cpus-per-task=10            # CPUs per task
#SBATCH --gres=gpu:v100:1             # 1 V100 GPU
#SBATCH --mem=32G                     # Memory per node
#SBATCH --output=output_%j.txt        # Output file (%j = job ID)
#SBATCH --error=error_%j.txt          # Error file

# Load required modules
module load pytorch/2.1

# Print some info
echo "Job started on $(hostname)"
echo "CUDA devices: $CUDA_VISIBLE_DEVICES"

# Run your Python script
python train.py --epochs 10 --batch-size 64

echo "Job finished!"
```
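Because the directive block is just text at the top of a shell script, it is easy to generate programmatically when you sweep over many configurations. A minimal sketch (the `sbatch_header` function and its defaults are my own, not part of Slurm):

```python
def sbatch_header(job_name, account, partition, time,
                  cpus_per_task=1, mem="8G", gpus=0, gpu_type=None):
    """Build the #SBATCH directive block for a batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --account={account}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={time}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --mem={mem}",
    ]
    if gpus:
        # --gres accepts gpu:<count> or gpu:<type>:<count>
        gres = f"gpu:{gpu_type}:{gpus}" if gpu_type else f"gpu:{gpus}"
        lines.append(f"#SBATCH --gres={gres}")
    return "\n".join(lines)

print(sbatch_header("hello_gpu", "project_2001234", "gpu",
                    "00:15:00", cpus_per_task=10, mem="32G",
                    gpus=1, gpu_type="v100"))
```

Write the result to a file, append your `module load` and `python` commands, and submit it with `sbatch`.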
Submit & Monitor
```bash
# Submit the job
sbatch my_first_job.sh
# Output: Submitted batch job 12345678

# Check job status
squeue --me          # Your jobs only
squeue -u $USER      # Same thing
squeue -j 12345678   # Specific job

# Detailed job info
scontrol show job 12345678

# Cancel a job
scancel 12345678

# Cancel all your jobs
scancel -u $USER

# Check job efficiency after completion
seff 12345678

# View past jobs (accounting)
sacct -j 12345678 --format=JobID,JobName,Elapsed,MaxRSS,MaxVMSize,State
```
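`sacct` and `squeue` report elapsed and limit times in Slurm's `[DD-]HH:MM:SS` format. When post-processing accounting output, a small helper (my own, not part of any Slurm tooling) converts these strings to seconds:

```python
import re

def slurm_time_to_seconds(t):
    """Convert a Slurm time string like '00:15:00' or '1-02:30:00' to seconds.

    Handles the [DD-]HH:MM:SS form used in sacct's Elapsed column; Slurm's
    --time option also accepts shorter forms not covered here.
    """
    m = re.fullmatch(r"(?:(\d+)-)?(\d+):(\d{2}):(\d{2})", t)
    if not m:
        raise ValueError(f"unrecognized Slurm time string: {t!r}")
    days, hours, minutes, seconds = (int(g or 0) for g in m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

print(slurm_time_to_seconds("00:15:00"))    # 900
print(slurm_time_to_seconds("1-02:00:00"))  # 93600
```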
4. Multi-GPU Jobs
Single Node Multi-GPU
For training on multiple GPUs within one node (most common for moderate-scale training):
```bash
#!/bin/bash
#SBATCH --job-name=train_4gpu
#SBATCH --account=project_2001234
#SBATCH --partition=gpu
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1        # One task: torchrun spawns the GPU workers
#SBATCH --cpus-per-task=40         # 10 CPUs per GPU worker
#SBATCH --gres=gpu:a100:4          # 4 A100 GPUs
#SBATCH --mem=200G
#SBATCH --output=train_%j.out

module load pytorch/2.1

# Launch with torchrun (recommended for PyTorch DDP).
# Note: request only one task per node. torchrun itself forks one worker
# per GPU; running torchrun as 4 Slurm tasks would start 4x4 = 16 workers.
srun torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=4 \
    train_ddp.py \
    --batch-size 256 \
    --epochs 100
```
Multi-Node Distributed Training
For large-scale training across multiple nodes:
```bash
#!/bin/bash
#SBATCH --job-name=train_multinode
#SBATCH --account=project_2001234
#SBATCH --partition=gpu
#SBATCH --time=24:00:00
#SBATCH --nodes=2                  # 2 nodes
#SBATCH --ntasks-per-node=1        # One torchrun per node; it spawns 4 workers
#SBATCH --cpus-per-task=40         # 10 CPUs per GPU worker
#SBATCH --gres=gpu:a100:4          # 4 GPUs per node
#SBATCH --mem=200G
#SBATCH --output=multinode_%j.out

module load pytorch/2.1

# Rendezvous point: the first node in the allocation
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500
echo "Master node: $MASTER_ADDR"
echo "Total GPUs: $((SLURM_NNODES * 4))"

# srun starts one torchrun per node; each torchrun spawns 4 GPU workers
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_ddp.py \
    --batch-size 512 \
    --epochs 100
```
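`scontrol show hostnames` is the canonical way to expand `$SLURM_JOB_NODELIST`. For illustration only, here is a stdlib sketch that handles the simple single-bracket case such as `gpu[001-003,007]` (real nodelists can contain multiple bracket groups and prefixes, which this does not handle):

```python
import re

def expand_nodelist(nodelist):
    """Expand a simple Slurm nodelist like 'gpu[001-003,007]' into hostnames.

    Handles only one bracketed range group; use `scontrol show hostnames`
    for the general case.
    """
    m = re.fullmatch(r"([^\[\]]+)\[([^\]]+)\]", nodelist)
    if not m:
        return [nodelist]  # plain hostname, nothing to expand
    prefix, ranges = m.groups()
    hosts = []
    for part in ranges.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # preserve zero-padding, e.g. 001
            hosts += [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]
        else:
            hosts.append(prefix + part)
    return hosts

print(expand_nodelist("gpu[001-003,007]"))
# ['gpu001', 'gpu002', 'gpu003', 'gpu007']
```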
5. PyTorch + Slurm Integration
When launched with torchrun, your PyTorch script reads the environment variables torchrun sets for each worker:
```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_distributed():
    """Initialize distributed training from the torchrun environment."""
    # torchrun sets these automatically
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # Initialize the process group (NCCL backend for GPUs)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        world_size=world_size,
        rank=rank,
    )

    # Bind this process to its GPU
    torch.cuda.set_device(local_rank)
    print(f"Initialized rank {rank}/{world_size} on GPU {local_rank}")
    return rank, world_size, local_rank


def main():
    rank, world_size, local_rank = setup_distributed()
    device = torch.device(f"cuda:{local_rank}")

    # Create model and wrap with DDP (MyModel/train_dataset defined elsewhere)
    model = MyModel().to(device)
    model = DDP(model, device_ids=[local_rank])

    # Shard the dataset across ranks
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=world_size,
        rank=rank,
        shuffle=True,
    )
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=64,        # Per-GPU batch size
        sampler=train_sampler,
        num_workers=10,       # Match --cpus-per-task
        pin_memory=True,
    )

    # Training loop
    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)  # Important: reshuffles each epoch
        for batch in train_loader:
            # ... training step ...
            pass

    # Cleanup
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
Key environment variables set for each process:
- `RANK`: global rank (0 to `WORLD_SIZE` - 1)
- `LOCAL_RANK`: rank within the node (0 to GPUs_per_node - 1)
- `WORLD_SIZE`: total number of processes
- `MASTER_ADDR`: address of the rank-0 node
- `MASTER_PORT`: port used for communication
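These variables are tied together by simple arithmetic: when every node runs the same number of workers, the global rank is the node index times the workers per node plus the local rank. A quick sanity check (the helper is illustrative, not part of PyTorch):

```python
def global_rank(node_index, local_rank, procs_per_node):
    """RANK = node_index * procs_per_node + LOCAL_RANK."""
    return node_index * procs_per_node + local_rank

# 2 nodes x 4 GPUs -> WORLD_SIZE = 8, global ranks 0-7
ranks = [global_rank(n, l, 4) for n in range(2) for l in range(4)]
print(ranks)  # [0, 1, 2, 3, 4, 5, 6, 7]
```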
6. CSC-Specific Tips (Puhti/Mahti/LUMI)
If you're using CSC's supercomputers in Finland, here are cluster-specific configurations:
🖥️ Puhti (GPU)
- 80 GPU nodes, 4× V100 (32 GB) per node
- Partitions: gpu, gputest
- Max time: 3 days
🚀 Mahti (GPU)
- 24 GPU nodes, 4× A100 (40 GB) per node
- Partitions: gpusmall, gpumedium
- NVLink + HDR InfiniBand
⚡ LUMI-G
- 2,978 GPU nodes, 4× AMD MI250X per node
- Use ROCm instead of CUDA
- Partition: standard-g
Mahti GPU Job Example
```bash
#!/bin/bash
#SBATCH --job-name=llm_train
#SBATCH --account=project_2001234
#SBATCH --partition=gpumedium      # gpumedium for multi-node jobs (see CSC docs for limits)
#SBATCH --time=24:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1        # One torchrun per node; it spawns 4 workers
#SBATCH --cpus-per-task=128        # Mahti has 128 cores/node
#SBATCH --gres=gpu:a100:4
#SBATCH --mem=0                    # Request all memory on the node

# CSC-specific modules
module load pytorch/2.1

# Use local SSD for faster data loading
export TMPDIR=/local_scratch/$SLURM_JOB_ID
mkdir -p $TMPDIR
cp -r $SCRATCH/my_data $TMPDIR/

# Enable NCCL debugging (optional)
export NCCL_DEBUG=INFO

# Launch training: srun starts one torchrun per node
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$(hostname):29500 \
    train.py --data $TMPDIR/my_data
```
CSC uses Billing Units (BUs). GPU jobs consume BUs much faster than CPU jobs. Check your remaining allocation with `csc-projects`.
7. Common Issues & Debugging
| Issue | Cause | Solution |
|---|---|---|
| Job stuck in PENDING | Resources not available | `squeue -j JOBID -o "%R"` to see reason |
| CUDA out of memory | Batch size too large | Reduce batch size or use gradient accumulation |
| DDP hangs at init | Network/firewall issues | Check MASTER_ADDR, try different port |
| NCCL timeout | One rank crashed/slow | `export NCCL_DEBUG=INFO` for details |
| Job killed (OOM) | Exceeded memory limit | Request more memory with `--mem` |
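The out-of-memory advice in the table rests on a simple identity: the batch size the optimizer effectively sees is the per-GPU micro-batch, times the gradient-accumulation steps, times the number of ranks. A quick check (the function name is my own):

```python
def effective_batch_size(micro_batch, accum_steps, world_size):
    """Global batch size per optimizer update."""
    return micro_batch * accum_steps * world_size

# Halving the per-GPU batch while doubling accumulation keeps the
# effective batch size (and thus the optimization setup) unchanged,
# at roughly half the activation memory per GPU.
print(effective_batch_size(64, 1, 8))  # 512
print(effective_batch_size(32, 2, 8))  # 512
```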
Useful Debugging Commands
```bash
# Why is my job pending?
squeue -j $JOBID -o "%j %T %r"

# Check node health
sinfo -N -l

# Get detailed job info
scontrol show job $JOBID

# Check GPU usage on a node (if you have access)
ssh gpu001 nvidia-smi

# View job output in real-time
tail -f output_$JOBID.txt

# Cancel and resubmit with more resources
scancel $JOBID
sbatch --mem=64G my_job.sh   # Command-line options override script directives

# Interactive GPU session for debugging
srun --partition=gpu --gres=gpu:1 --time=01:00:00 --pty bash
```
Summary: Slurm Cheat Sheet
1. Start with --time=00:30:00 for testing, then increase for production.
2. Use seff JOBID after completion to check CPU/memory efficiency.
3. Set --cpus-per-task equal to your DataLoader num_workers.
4. Always use DistributedSampler with DDP to avoid duplicate batches.
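To see why the last point matters, here is a stdlib sketch of the round-robin index sharding that DistributedSampler performs with shuffling disabled (the real sampler also pads so every rank gets an equal count). Without it, every rank would iterate the full dataset and train on duplicate batches:

```python
def shard_indices(dataset_len, rank, world_size):
    """Round-robin shard of dataset indices for one rank, mimicking
    DistributedSampler without shuffling or padding."""
    return list(range(dataset_len))[rank::world_size]

world_size = 4
shards = [shard_indices(10, r, world_size) for r in range(world_size)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]

# Every index lands on exactly one rank: disjoint shards that cover the dataset
assert sorted(i for s in shards for i in s) == list(range(10))
```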