Understanding CUDA Memory Hierarchy: A Practical Guide
Deep dive into global, shared, and register memory with practical examples, diagrams, and performance benchmarks.
PyTorch DDP vs FSDP: When to Use Which
Comprehensive comparison of distributed training strategies with practical code examples and a decision guide.
ADMM for Distributed Machine Learning: From Theory to Practice
Step-by-step implementation of ADMM-based distributed optimization with convergence guarantees.
AI Cluster Architectures: A Deep Dive
Understanding GPU cluster hierarchies, NVLink, InfiniBand, and network topologies for large-scale AI training.
Getting Started with Slurm: Submit Your First Multi-GPU Job
Practical guide to Slurm job scheduling with examples from CSC's Puhti and Mahti supercomputers.
Nsight Systems & Compute: Profiling Your CUDA Kernels
Learn to identify bottlenecks and optimize GPU kernels using NVIDIA's profiling tools.
Efficient Inference: A Complete Guide to Pruning and Quantization
Deep dive into model compression: unstructured, structured, and N:M sparsity pruning; INT8/FP16 quantization; and hardware acceleration.
Writing Custom CUDA Kernels with Triton
Master GPU programming with OpenAI Triton's Python-like syntax, from vector addition to fused attention kernels.