1. What is Slurm?

Slurm (Simple Linux Utility for Resource Management) is the most widely used job scheduler for HPC clusters. It manages three things: allocating access to compute resources, launching and monitoring work on those resources, and arbitrating contention through a queue of pending jobs.

[Diagram: Slurm cluster architecture. You submit jobs with sbatch from the login nodes (login1, login2); the Slurm controller (slurmctld) schedules them from its queue and allocates compute nodes (e.g. gpu001-003 with 4× A100 each, cpu001-002 with 128 cores each), all backed by shared Lustre storage (/scratch, /projappl).]
Users submit jobs from login nodes → Slurm schedules → Jobs run on compute nodes
Never Run on Login Nodes!

Login nodes are shared by all users and are meant for light work only: editing files, managing data, and submitting jobs. Running heavy computation there can get your account suspended. Always use sbatch or srun to run on compute nodes.

2. Slurm Architecture

Partitions & Nodes

Slurm organizes resources into partitions (also called queues). Each partition has its own time limits, node types, resource limits, and scheduling priority:

$ sinfo
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
gpu*         up     3-00:00:00     24  idle   gpu[001-024]
gpu-large    up     7-00:00:00      8  idle   gpu[101-108]
cpu          up     3-00:00:00    200  idle   cpu[001-200]
interactive  up     4:00:00        10  idle   int[001-010]
test         up     1:00:00         4  idle   test[001-004]

Resource Allocation

Key resources you request:

Slurm Resource Hierarchy

  • Node (--nodes=N) — a physical machine with CPUs, GPUs, and memory. Example: gpu001 = 128 CPU cores + 4× A100 GPUs + 512 GB RAM.
  • Task (--ntasks=N) — an MPI rank / process; typically 1 task per GPU for DDP.
  • CPUs per task (--cpus-per-task=N) — threads for data loading; usually 8-16 for a PyTorch DataLoader.
  • GPUs — --gpus=N requests N GPUs in total for the job; --gres=gpu:N requests N GPUs per node.
  • Memory (--mem=NG) — RAM per node (default: proportional to the CPUs requested).

Understanding the hierarchy: Nodes → Tasks → CPUs/GPUs → Memory
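To make the hierarchy concrete, here is a small arithmetic sketch (illustrative numbers only, mirroring the gpu001 example above) of what a request actually allocates:

```python
# Illustrative Slurm request arithmetic; numbers mirror the gpu001
# example above (4 GPUs per node, 10 CPUs per task).
nodes = 2            # --nodes=2
tasks_per_node = 4   # --ntasks-per-node=4 (1 task per GPU for DDP)
cpus_per_task = 10   # --cpus-per-task=10
gpus_per_node = 4    # --gres=gpu:4

total_tasks = nodes * tasks_per_node      # MPI ranks / processes
total_gpus = nodes * gpus_per_node        # world size for DDP
total_cpus = total_tasks * cpus_per_task  # cores reserved overall

print(total_tasks, total_gpus, total_cpus)  # 8 8 80
```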

3. Your First Slurm Job

Writing a Batch Script

A Slurm batch script starts with #SBATCH directives:

Bash my_first_job.sh
#!/bin/bash
#SBATCH --job-name=hello_gpu       # Job name
#SBATCH --account=project_2001234  # Billing account
#SBATCH --partition=gpu            # Queue/partition
#SBATCH --time=00:15:00            # Time limit (HH:MM:SS)
#SBATCH --nodes=1                  # Number of nodes
#SBATCH --ntasks=1                 # Number of tasks (processes)
#SBATCH --cpus-per-task=10         # CPUs per task
#SBATCH --gres=gpu:v100:1          # 1 V100 GPU
#SBATCH --mem=32G                  # Memory per node
#SBATCH --output=output_%j.txt     # Output file (%j = job ID)
#SBATCH --error=error_%j.txt       # Error file

# Load required modules
module load pytorch/2.1

# Print some info
echo "Job started on $(hostname)"
echo "CUDA devices: $CUDA_VISIBLE_DEVICES"

# Run your Python script
python train.py --epochs 10 --batch-size 64

echo "Job finished!"

Submit & Monitor

Bash commands.sh
# Submit the job
sbatch my_first_job.sh
# Output: Submitted batch job 12345678

# Check job status
squeue --me               # Your jobs only
squeue -u $USER           # Same thing
squeue -j 12345678        # Specific job

# Detailed job info
scontrol show job 12345678

# Cancel a job
scancel 12345678

# Cancel all your jobs
scancel -u $USER

# Check job efficiency after completion
seff 12345678

# View past jobs (accounting)
sacct -j 12345678 --format=JobID,JobName,Elapsed,MaxRSS,MaxVMSize,State
$ sbatch my_first_job.sh
Submitted batch job 12345678

$ squeue --me
JOBID     PARTITION  NAME       USER   STATE    TIME  NODES  NODELIST(REASON)
12345678  gpu        hello_gpu  user1  RUNNING  0:05      1  gpu003

$ seff 12345678
Job ID: 12345678
Cluster: mahti
User/Group: user1/users
State: COMPLETED (exit code 0)
Cores: 10
CPU Utilized: 00:08:23
CPU Efficiency: 83.83%
Memory Utilized: 24.5 GB
Memory Efficiency: 76.56% of 32.00 GB
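The efficiency figures seff reports are simple ratios. This sketch reproduces the numbers from the sample job above (the elapsed wall time, not shown in the output, is inferred as 60 s from the reported percentage):

```python
# CPU efficiency = CPU time used / (cores × wall-clock time).
cores = 10
cpu_utilized_s = 8 * 60 + 23     # 00:08:23 from seff
elapsed_s = 60                   # wall time, inferred from the 83.83% figure
cpu_eff = cpu_utilized_s / (cores * elapsed_s)

# Memory efficiency = peak usage / requested memory.
mem_used_gb = 24.5
mem_req_gb = 32.0
mem_eff = mem_used_gb / mem_req_gb

print(f"{cpu_eff:.2%} {mem_eff:.2%}")  # 83.83% 76.56%
```

If CPU efficiency is consistently low, request fewer cores; if memory efficiency is low, request less memory so your jobs schedule faster.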

4. Multi-GPU Jobs

Single Node Multi-GPU

For training on multiple GPUs within one node (most common for moderate-scale training):

Bash single_node_4gpu.sh
#!/bin/bash
#SBATCH --job-name=train_4gpu
#SBATCH --account=project_2001234
#SBATCH --partition=gpu
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1        # 1 task: torchrun spawns the per-GPU workers
#SBATCH --cpus-per-task=40         # 10 CPUs per GPU × 4 GPUs
#SBATCH --gres=gpu:a100:4          # 4 A100 GPUs
#SBATCH --mem=200G
#SBATCH --output=train_%j.out

module load pytorch/2.1

# Launch with torchrun (recommended for PyTorch DDP)
srun torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=4 \
    train_ddp.py \
    --batch-size 256 \
    --epochs 100
[Diagram: single-node multi-GPU setup on gpu001. Each of the 4 A100s holds a full model copy (ranks 0-3) and processes its own 64-sample slice of the 256-sample batch (batch[:64], batch[64:128], …); gradients are all-reduced over NVLink at ~600 GB/s.]
DDP: Each GPU has a full model copy; data is split across GPUs
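The per-GPU split described above can be sketched directly. This illustrative snippet shards a global batch of 256 sample indices into contiguous 64-sample chunks, one per rank (DistributedSampler itself interleaves indices rather than slicing contiguously, but the effect is the same: disjoint shards of equal size):

```python
# Shard a global batch of 256 sample indices across 4 DDP ranks.
world_size = 4
global_batch = list(range(256))
per_gpu = len(global_batch) // world_size  # 64 samples per GPU

shards = [global_batch[r * per_gpu:(r + 1) * per_gpu]
          for r in range(world_size)]

print(per_gpu, shards[1][0], shards[1][-1])  # 64 64 127 (rank 1's slice)
```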

Multi-Node Distributed Training

For large-scale training across multiple nodes:

Bash multi_node_8gpu.sh
#!/bin/bash
#SBATCH --job-name=train_multinode
#SBATCH --account=project_2001234
#SBATCH --partition=gpu
#SBATCH --time=24:00:00
#SBATCH --nodes=2                  # 2 nodes
#SBATCH --ntasks-per-node=1        # one torchrun per node; it spawns 4 workers
#SBATCH --cpus-per-task=40         # 10 CPUs per GPU × 4 GPUs
#SBATCH --gres=gpu:a100:4          # 4 GPUs per node
#SBATCH --mem=200G
#SBATCH --output=multinode_%j.out

module load pytorch/2.1

# Get master node address
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=29500

echo "Master node: $MASTER_ADDR"
echo "Total GPUs: $((SLURM_NNODES * 4))"

# Launch distributed training
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_ddp.py \
    --batch-size 512 \
    --epochs 100
[Diagram: multi-node distributed training, 2 nodes × 4 GPUs. Node 0 (gpu001, master) hosts global ranks 0-3 and node 1 (gpu002) hosts ranks 4-7; LOCAL_RANK runs 0-3 on each node. NCCL all-reduces over NVLink within a node and over HDR InfiniBand (200 Gb/s) across nodes.]
8 GPUs total: WORLD_SIZE=8, ranks 0-7 distributed across 2 nodes
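The rank layout above follows one formula: global rank = node index × GPUs per node + local rank. A small sketch (assuming the 2 × 4 setup of this example):

```python
# Derive global ranks from node index and LOCAL_RANK for 2 nodes × 4 GPUs.
nnodes, gpus_per_node = 2, 4
world_size = nnodes * gpus_per_node  # 8

layout = {
    (node, local): node * gpus_per_node + local
    for node in range(nnodes)
    for local in range(gpus_per_node)
}

print(world_size, layout[(1, 0)])  # 8 4 (first GPU on node 1 is global rank 4)
```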

5. PyTorch + Slurm Integration

Your PyTorch script needs to read Slurm environment variables:

Python train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    """Initialize distributed training from Slurm environment."""
    
    # torchrun sets these automatically
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    
    # Initialize process group
    dist.init_process_group(
        backend="nccl",  # Use NCCL for GPU
        init_method="env://",
        world_size=world_size,
        rank=rank
    )
    
    # Set device for this process
    torch.cuda.set_device(local_rank)
    
    print(f"Initialized rank {rank}/{world_size} on GPU {local_rank}")
    
    return rank, world_size, local_rank


def main():
    rank, world_size, local_rank = setup_distributed()
    device = torch.device(f"cuda:{local_rank}")
    
    # Create model and wrap with DDP
    model = MyModel().to(device)
    model = DDP(model, device_ids=[local_rank])
    
    # Create distributed sampler
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=world_size,
        rank=rank,
        shuffle=True
    )
    
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=64,            # Per-GPU batch size
        sampler=train_sampler,
        num_workers=10,            # Match cpus-per-task
        pin_memory=True
    )
    
    # Training loop
    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)  # Important for shuffling!
        
        for batch in train_loader:
            # ... training step ...
            pass
    
    # Cleanup
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
Key Environment Variables

RANK: Global rank (0 to WORLD_SIZE-1)
LOCAL_RANK: Rank within the node (0 to GPUs_per_node-1)
WORLD_SIZE: Total number of processes
MASTER_ADDR: IP of rank 0 node
MASTER_PORT: Port for communication
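If you launch with plain srun instead of torchrun, these variables are not set; srun exports Slurm equivalents (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID) that you can map yourself. The variable names below are real Slurm exports; the mapping function itself is an illustrative sketch:

```python
# Map srun's per-task environment to the names torch.distributed expects.
def slurm_to_torch_env(env):
    return {
        "RANK": env["SLURM_PROCID"],         # global rank of this task
        "WORLD_SIZE": env["SLURM_NTASKS"],   # total number of tasks
        "LOCAL_RANK": env["SLURM_LOCALID"],  # rank within this node
    }

# Simulated environment for task 5 of an 8-task job (2 nodes × 4 GPUs):
fake_env = {"SLURM_PROCID": "5", "SLURM_NTASKS": "8", "SLURM_LOCALID": "1"}
print(slurm_to_torch_env(fake_env))
# {'RANK': '5', 'WORLD_SIZE': '8', 'LOCAL_RANK': '1'}
```

In a real script you would pass os.environ and set MASTER_ADDR/MASTER_PORT yourself (e.g. from scontrol show hostnames as in the batch script above).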

6. CSC-Specific Tips (Puhti/Mahti/LUMI)

If you're using CSC's supercomputers in Finland, here are cluster-specific configurations:

🖥️ Puhti (GPU)

  • 80 GPU nodes
  • 4× V100 (32GB) per node
  • Partition: gpu, gputest
  • Max time: 3 days

🚀 Mahti (GPU)

  • 24 GPU nodes
  • 4× A100 (40GB) per node
  • Partition: gpusmall, gpumedium
  • NVLink + HDR InfiniBand

⚡ LUMI-G

  • 2,978 GPU nodes
  • 4× AMD MI250X per node
  • Use ROCm instead of CUDA
  • Partition: standard-g

Mahti GPU Job Example

Bash mahti_a100.sh
#!/bin/bash
#SBATCH --job-name=llm_train
#SBATCH --account=project_2001234
#SBATCH --partition=gpumedium       # 1-8 GPUs: gpusmall, 1-32: gpumedium
#SBATCH --time=24:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32          # Mahti has 128 cores/node
#SBATCH --gres=gpu:a100:4
#SBATCH --mem=0                     # Request all memory

# CSC-specific modules
module load pytorch/2.1

# Use local SSD for faster data loading
export TMPDIR=/local_scratch/$SLURM_JOB_ID
mkdir -p $TMPDIR
cp -r $SCRATCH/my_data $TMPDIR/

# Enable NCCL debugging (optional)
export NCCL_DEBUG=INFO

# Launch training
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$(hostname):29500 \
    train.py --data $TMPDIR/my_data
CSC Billing

CSC uses Billing Units (BUs). GPU jobs consume BUs faster than CPU jobs. Check your allocation with: csc-projects
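As a rough illustration of why GPU time dominates a project's budget, here is a sketch with hypothetical billing rates (these are NOT CSC's real numbers; check the CSC billing documentation for the actual per-resource rates):

```python
# Hypothetical billing-unit (BU) rates -- illustrative only, not CSC's.
BU_PER_CPU_CORE_HOUR = 1.0
BU_PER_GPU_HOUR = 60.0  # GPUs are billed far above CPU cores

def job_cost_bu(hours, cores=0, gpus=0):
    """Rough BU estimate for one job under the hypothetical rates above."""
    return hours * (cores * BU_PER_CPU_CORE_HOUR + gpus * BU_PER_GPU_HOUR)

# A 4-hour job on 40 cores + 4 GPUs vs the same cores CPU-only:
print(job_cost_bu(hours=4, cores=40, gpus=4))  # 1120.0
print(job_cost_bu(hours=4, cores=40))          # 160.0
```

Under these assumed rates the GPU job costs 7× the CPU-only one, which is why checking seff for idle GPUs matters.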

7. Common Issues & Debugging

Issue                 | Cause                   | Solution
Job stuck in PENDING  | Resources not available | squeue -j JOBID -o "%R" to see the reason
CUDA out of memory    | Batch size too large    | Reduce batch size or use gradient accumulation
DDP hangs at init     | Network/firewall issues | Check MASTER_ADDR, try a different port
NCCL timeout          | One rank crashed/slow   | export NCCL_DEBUG=INFO for details
Job killed (OOM)      | Exceeded memory limit   | Request more memory with --mem
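The gradient-accumulation fix for CUDA out-of-memory errors works by processing small micro-batches and stepping the optimizer only every N of them, so the effective batch size stays large while peak memory stays small. The counting logic is sketched below framework-free; the backward/step calls are shown as comments:

```python
# Reach an effective batch of 256 with micro-batches of 64 by stepping
# the optimizer every 4 micro-batches.
micro_batch = 64
accum_steps = 4
effective_batch = micro_batch * accum_steps  # 256

num_micro_batches = 100
optimizer_steps = 0
for i in range(num_micro_batches):
    # loss = model(batch); (loss / accum_steps).backward()  # grads accumulate
    if (i + 1) % accum_steps == 0:
        optimizer_steps += 1  # optimizer.step(); optimizer.zero_grad()

print(effective_batch, optimizer_steps)  # 256 25
```

Note the loss is divided by accum_steps so the accumulated gradient matches what one large batch would produce.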

Useful Debugging Commands

Bash debug_commands.sh
# Why is my job pending?
squeue -j $JOBID -o "%j %T %r"

# Check node health
sinfo -N -l

# Get detailed job info
scontrol show job $JOBID

# Check GPU usage on a node (if you have access)
ssh gpu001 nvidia-smi

# View job output in real-time
tail -f output_$JOBID.txt

# Cancel and resubmit with more resources
scancel $JOBID
sbatch --mem=64G my_job.sh  # Override memory

# Interactive GPU session for debugging
srun --account=project_2001234 --partition=gpu --gres=gpu:1 --time=01:00:00 --pty bash

Summary: Slurm Cheat Sheet

Essential Slurm Commands

  • Submit & Run: sbatch script.sh / srun ./program / salloc --gres=gpu:1 — submit a batch job / run interactively / allocate resources
  • Monitor: squeue --me / scontrol show job ID / seff ID — check status / job details / efficiency
  • Control & Info: scancel ID / sinfo -p gpu / sacct -j ID — cancel / partition info / accounting

Essential #SBATCH Flags

  • --nodes=N — number of nodes
  • --ntasks-per-node=N — tasks per node (1 per GPU for DDP)
  • --gres=gpu:TYPE:N — GPU type and count
  • --cpus-per-task=N — CPUs per task (for DataLoader workers)
  • --time=HH:MM:SS — time limit
  • --mem=NG — memory per node
Pro Tips

1. Start with --time=00:30:00 for testing, then increase for production.
2. Use seff JOBID after completion to check CPU/memory efficiency.
3. Set --cpus-per-task equal to your DataLoader num_workers.
4. Always use DistributedSampler with DDP to avoid duplicate batches.