1. Why Profile GPU Code?

Writing CUDA code that works is one thing. Writing CUDA code that achieves peak performance is another challenge entirely. Without profiling, you're optimizing blind.

Profiling commonly reveals issues such as low GPU utilization, memory-bound kernels, low occupancy, and unnecessary synchronization.

[Figure: The GPU Performance Gap. On an H100 SXM with a theoretical peak of ~100 TFLOPS, a naive kernel (memory bound, low occupancy) reaches ~20% of peak; after profiling-driven optimization (better memory access, higher occupancy) it reaches ~80%, a 4x speedup. "Without measurement, optimization is just guessing."]
Profiling often reveals 2-5x improvement opportunities in CUDA code
Rule of Thumb

If your GPU utilization is below 70%, there's almost certainly a bottleneck that profiling can identify. Even at 90%, there may be optimization opportunities.

2. Nsight Systems vs Nsight Compute

NVIDIA provides two complementary profiling tools, each designed for different levels of analysis:

Nsight Systems ("system-level profiler") shows:
• CPU-GPU timeline visualization
• Kernel launch patterns
• Memory transfers (H2D, D2H)
• CUDA API calls
• Multi-GPU / multi-process activity
Best for: finding where time goes (the big picture), identifying CPU-GPU bottlenecks, and understanding async/stream behavior.

Nsight Compute ("kernel-level profiler") shows:
• SM utilization & occupancy
• Memory throughput (L1, L2, DRAM)
• Warp stall reasons
• Instruction mix (FP32, FP64, INT)
• Roofline analysis
Best for: optimizing individual kernels, understanding memory access patterns, and identifying compute- vs memory-bound kernels.
Use Nsight Systems first (find WHERE), then Nsight Compute (understand WHY)

When to Use Each Tool

Question Tool
"Where is time being spent?" Nsight Systems
"Why is this kernel slow?" Nsight Compute
"Is my memory transfer overlapping with compute?" Nsight Systems
"What's my memory bandwidth utilization?" Nsight Compute
"Is my application CPU or GPU bound?" Nsight Systems

3. Nsight Systems Deep Dive

Nsight Systems captures a timeline of your entire application, showing how CPU and GPU activities interleave. It's the best starting point for any optimization effort.

Command-Line Profiling

Basic profiling with nsys:

Bash profile_app.sh
# Basic profile - generates .nsys-rep file
nsys profile --output=my_app_profile ./my_cuda_app

# Profile with specific options: trace CUDA, NVTX, and OS runtime calls,
# sample the CPU, and collect CUDA API backtraces.
# (Note: trailing comments after a backslash would break the line continuation.)
nsys profile \
  --trace=cuda,nvtx,osrt \
  --sample=cpu \
  --cudabacktrace=all \
  --output=detailed_profile \
  ./my_cuda_app

# Profile for specific duration
nsys profile --duration=10 ./my_cuda_app

# Profile with NVTX ranges (see below)
nsys profile --trace=cuda,nvtx ./my_cuda_app

# Generate stats summary
nsys stats my_app_profile.nsys-rep
$ nsys stats my_app_profile.nsys-rep

CUDA API Statistics:
 Time(%)  Total Time (ns)  Num Calls    Avg (ns)  Name
 -------  ---------------  ---------  ----------  ----------------------
    45.2    1,234,567,890      1,000   1,234,567  cudaLaunchKernel
    32.1      876,543,210        100   8,765,432  cudaMemcpy
    15.3      418,765,432         50   8,375,308  cudaDeviceSynchronize

CUDA Kernel Statistics:
 Time(%)  Total Time (ns)  Instances    Avg (ns)  Name
 -------  ---------------  ---------  ----------  ----------------------
    78.4    2,345,678,901        500   4,691,357  matmul_kernel
    15.2      456,789,012        500     913,578  relu_kernel
     6.4      191,234,567        500     382,469  add_bias_kernel

Reading the Timeline

The timeline view is the most powerful feature. Here's how to interpret it:

Nsight Systems Timeline View
[Figure: Nsight Systems timeline (0-200 ms). CPU row: data prep, H2D, D2H, processing. CUDA API row: blocking sync calls. GPU row: H2D transfer, kernel 1, kernel 2, D2H transfer.]

Common problems to look for:
• GPU idle time - gaps between kernels indicate CPU overhead
• Blocking syncs - too many cudaDeviceSynchronize() calls
• Serial transfers - H2D/D2H not overlapping with compute

What good looks like:
✓ GPU continuously busy
✓ Overlapped transfers
✓ Minimal sync gaps
Timeline shows CPU-GPU coordination - look for gaps and blocking operations
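The payoff from overlapping transfers with compute can be estimated with a simple pipeline model. The sketch below uses made-up per-stage times (not taken from any real profile) and assumes perfect double-buffered overlap, so treat it as a back-of-envelope bound, not a prediction:

```python
# Back-of-envelope model of serial vs overlapped transfer/compute.
# Stage times below are hypothetical, in milliseconds per batch.
h2d, kernel, d2h = 10.0, 30.0, 8.0
batches = 100

# Fully serial: every batch pays copy-in + compute + copy-out.
serial = batches * (h2d + kernel + d2h)

# Fully overlapped (double-buffered streams): after the first batch fills
# the pipeline, throughput is limited by the slowest stage only.
pipelined = (h2d + kernel + d2h) + (batches - 1) * max(h2d, kernel, d2h)

print(f"serial:    {serial:.0f} ms")
print(f"pipelined: {pipelined:.0f} ms ({serial / pipelined:.2f}x faster)")
```

With these numbers the kernel dominates, so overlap hides almost all transfer time; in the timeline this looks like a continuously busy GPU row.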

NVTX Annotations

Add custom markers to your code for better timeline navigation:

C++ nvtx_example.cu
#include <nvtx3/nvToolsExt.h>

void train_epoch() {
    // Create a named range for this epoch
    nvtxRangePush("Epoch");
    
    for (int batch = 0; batch < num_batches; batch++) {
        // Nested range for each phase
        nvtxRangePush("Forward Pass");
        forward<<<grid_dim, block_dim>>>(input, output);  // launch config assumed defined
        nvtxRangePop();
        
        nvtxRangePush("Backward Pass");
        backward<<<grid_dim, block_dim>>>(grad_output, grad_input);
        nvtxRangePop();
        
        nvtxRangePush("Weight Update");
        update_weights<<<grid_dim, block_dim>>>(weights, gradients, lr);
        nvtxRangePop();
    }
    
    nvtxRangePop();  // End Epoch range
}

The same annotations in PyTorch:

Python nvtx_example.py
import torch.cuda.nvtx as nvtx

with nvtx.range("training_loop"):
    for batch in dataloader:
        with nvtx.range("forward"):
            output = model(batch)
            loss = criterion(output, targets)  # loss fn/targets assumed defined
        with nvtx.range("backward"):
            loss.backward()
Pro Tip

Use color-coded NVTX ranges: call nvtxRangePushEx() with an nvtxEventAttributes_t that sets a different color per phase (plain nvtxRangePush/nvtxRangePushA create uncolored ranges). Colors make complex timelines much easier to read.

4. Nsight Compute Deep Dive

Once you've identified slow kernels with Nsight Systems, use Nsight Compute to understand why they're slow.

Kernel Profiling

Bash ncu_profile.sh
# Profile all kernels with the full metric set
ncu --set full -o kernel_report ./my_cuda_app

# Profile specific kernel by name
ncu --kernel-name "matmul_kernel" --set full ./my_cuda_app

# Profile specific launches: skip the first 10, then profile the next 5
ncu --launch-skip 10 --launch-count 5 ./my_cuda_app

# Collect specific metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
./my_cuda_app

# Generate CSV for analysis
ncu --csv --set full ./my_cuda_app > metrics.csv

# Compare two implementations (open both reports in the GUI and
# set one as the baseline)
ncu --set full -o baseline ./my_app_v1
ncu --set full -o optimized ./my_app_v2
ncu-ui baseline.ncu-rep optimized.ncu-rep
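The CSV output above can be post-processed with standard tools. This is a sketch using Python's csv module; the column names ("Kernel Name", "Metric Name", "Metric Value") are assumptions, since the exact header row varies across ncu versions, so check your own CSV first:

```python
import csv
import io

# Synthetic stand-in for `ncu --csv` output; real column names may differ.
sample = """Kernel Name,Metric Name,Metric Value
matmul_kernel,sm__throughput.avg.pct_of_peak_sustained_elapsed,18.3
matmul_kernel,dram__throughput.avg.pct_of_peak_sustained_elapsed,85.2
"""

# Parse rows and keep only throughput metrics.
rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    if "throughput" in row["Metric Name"]:
        print(f'{row["Metric Name"]}: {row["Metric Value"]}%')
```

The same pattern scales to filtering hundreds of kernels by name or thresholding metrics across a whole run.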

Key Metrics Explained

Critical Nsight Compute Metrics
Compute metrics:
• sm__throughput.avg.pct_of_peak - % of theoretical compute used (good: >60%, bad: <30%)
• sm__warps_active.avg.pct_of_peak - occupancy, i.e., warps in flight (good: >50%, bad: <25%)
• smsp__sass_thread_inst_executed.sum - total instructions executed

Memory metrics:
• dram__throughput.avg.pct_of_peak - % of memory bandwidth used (good: >70%, wasteful: <40%)
• l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum - sectors read by global loads at L1
• lts__t_sectors_srcunit_tex_op_read.sum - L2 read sectors (→ DRAM traffic)

Warp stall reasons (smsp__warp_issue_stalled_*), i.e., why warps wait:
• long_scoreboard - waiting for memory (the classic memory bottleneck)
• not_selected - other warps were prioritized
• math_pipe_throttle - FP units at capacity
• barrier - waiting at __syncthreads()

Quick diagnosis:
• High memory throughput + low compute → memory bound
• High compute throughput + low memory → compute bound
• Both low → latency bound (low occupancy)
Use these metrics to identify whether your kernel is compute, memory, or latency bound
Example readout: SM throughput 85% (good), occupancy 45% (needs work), memory bandwidth 20% (bottleneck!).
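A mediocre occupancy figure is often explained by register or shared memory pressure. The sketch below estimates theoretical occupancy from per-block resource usage; the SM limits are assumptions (roughly Ampere-class numbers), so query cudaGetDeviceProperties or use the official occupancy calculator for real values:

```python
# Assumed per-SM limits (Ampere-class ballpark; not authoritative).
MAX_WARPS_PER_SM  = 64
REGISTERS_PER_SM  = 65536
SHARED_MEM_PER_SM = 100 * 1024  # bytes

def occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Theoretical occupancy: active warps / max warps per SM."""
    warps_per_block = (threads_per_block + 31) // 32
    blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block
    blocks_by_regs  = REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    blocks_by_smem  = (SHARED_MEM_PER_SM // smem_per_block
                       if smem_per_block else blocks_by_warps)
    blocks = min(blocks_by_warps, blocks_by_regs, blocks_by_smem)
    return blocks * warps_per_block / MAX_WARPS_PER_SM

# 256 threads/block, 64 registers/thread, 8 KB shared memory per block:
# registers are the limiter here, capping occupancy at 50%.
print(f"{occupancy(256, 64, 8 * 1024):.0%}")
```

Under these assumptions, halving register usage per thread would double the resident blocks and lift occupancy back to 100%, which is exactly the kind of lever the "latency bound" diagnosis points at.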

Roofline Analysis

The roofline model visualizes whether a kernel is limited by compute or memory bandwidth:

Roofline Model for GPU Kernels
[Figure: Roofline model. X axis: arithmetic intensity (FLOP/byte, 0.1-100, log scale); Y axis: performance (GFLOP/s, 1-10000, log scale). A sloped memory-bandwidth roof meets the flat compute roof (peak FLOPS) at the ridge point. Kernel A (AI=0.5) sits far below the roof and is memory bound; Kernel B (AI=8) is near optimal, close to the ridge; Kernel C (AI=50) is compute bound, at the compute roof. Memory-bound kernels move toward higher AI via caching and tiling.]
Kernels below the roof have optimization potential. Move toward the ridge point for best performance.
Reading the Roofline

Arithmetic Intensity = FLOP / Bytes transferred. Low AI kernels (vector add, copy) are memory bound. High AI kernels (dense matmul) are compute bound. The ridge point is where both resources are fully utilized.
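This classification is easy to do by hand. The sketch below applies the definition above with assumed peak figures (order-of-magnitude H100-class: ~60 TFLOP/s FP32, ~3 TB/s HBM; not exact specs):

```python
# Assumed machine peaks (illustrative, not exact H100 specs).
PEAK_FLOPS = 60e12            # FLOP/s
PEAK_BW    = 3e12             # bytes/s
RIDGE      = PEAK_FLOPS / PEAK_BW   # FLOP/byte at the ridge point (20)

def bound(flops, bytes_moved):
    """Return (arithmetic intensity, roofline classification)."""
    ai = flops / bytes_moved
    return ai, ("memory bound" if ai < RIDGE else "compute bound")

# Vector add c = a + b over N floats: N FLOPs, 3*N*4 bytes moved.
n = 1 << 20
ai, label = bound(n, 3 * n * 4)
print(f"AI = {ai:.3f} FLOP/byte -> {label}")
```

With AI ≈ 0.08 against a ridge point of 20, vector add sits deep in the memory-bound region, which is why no amount of instruction tuning will speed it up.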

5. Optimization Workflow

A systematic approach to GPU optimization:

GPU Optimization Workflow
1. Profile with Nsight Systems - find hotspots
2. Profile the kernel with Nsight Compute - identify the bottleneck
3. Analyze roofline position - compute bound or memory bound?
4. Apply an optimization based on the diagnosis, then repeat until satisfied

Optimization strategies by bottleneck:

Memory bound:
• Coalesce memory accesses
• Use shared memory tiling
• Reduce global memory traffic

Compute bound:
• Use Tensor Cores (if applicable)
• Reduce instruction count
• Use faster math intrinsics (e.g., __fmaf_rn)

Latency bound:
• Increase occupancy
• Reduce register usage
• Use more threads per block
Iterative profiling → analysis → optimization cycle

6. Case Study: Optimizing Matrix Multiply

Let's walk through a real optimization using both tools.

Step 1: Baseline Implementation

CUDA C++ matmul_naive.cu
// Naive matrix multiply - each thread computes one element
__global__ void matmul_naive(float* A, float* B, float* C, 
                              int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += A[row * K + k] * B[k * N + col];  // Global memory!
        }
        C[row * N + col] = sum;
    }
}
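Before profiling, a pencil-and-paper estimate already hints at the problem. Each output element reads K elements of A and K of B straight from global memory with no reuse, so the kernel's arithmetic intensity is tiny:

```python
# Arithmetic intensity of the naive kernel above, counting global memory
# traffic as seen by the code (no caching assumed).
M = N = K = 4096
bytes_per_float = 4

flops       = 2 * M * N * K                            # one multiply-add = 2 FLOPs
naive_bytes = M * N * (2 * K + 1) * bytes_per_float    # K reads of A + K of B + 1 write
ai          = flops / naive_bytes

print(f"arithmetic intensity: {ai:.3f} FLOP/byte")     # ~0.25, deep in memory-bound territory
```

At ~0.25 FLOP/byte this is far below any modern GPU's ridge point, so the profilers should confirm a memory-bound kernel.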

Step 2: Profile with Nsight Systems

$ nsys profile --stats=true ./matmul_naive

CUDA Kernel Statistics:
 Time(%)  Total Time  Instances  Avg Time  Name
 -------  ----------  ---------  --------  ------------
   98.2%      4.23 s        100   42.3 ms  matmul_naive

GPU utilization: 45%  ← Low utilization!

Step 3: Profile with Nsight Compute

$ ncu --set full ./matmul_naive | grep -E "Throughput|Occupancy"

Metric                                      Value
----------------------------------------  -------
sm__throughput.avg.pct_of_peak_sustained    18.3%  ← Very low!
gpu__compute_memory_throughput.avg.pct      85.2%  ← Memory bound
sm__warps_active.avg.pct_of_peak            32.4%  ← Low occupancy

Warp Stall Reasons:
  long_scoreboard (memory):  68%  ← Waiting for memory!
  not_selected:              22%
  other:                     10%
Diagnosis

The kernel is memory bound: high memory throughput, low compute throughput, and 68% of time waiting for memory. Solution: use shared memory tiling.

Step 4: Optimized Implementation

CUDA C++ matmul_tiled.cu
#define TILE_SIZE 32

// Tiled matrix multiply using shared memory
__global__ void matmul_tiled(float* A, float* B, float* C,
                             int M, int N, int K) {
    // Shared memory tiles
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];
    
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    
    float sum = 0.0f;
    
    // Loop over tiles
    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Cooperative load into shared memory
        if (row < M && t * TILE_SIZE + threadIdx.x < K)
            As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
        else
            As[threadIdx.y][threadIdx.x] = 0.0f;
            
        if (col < N && t * TILE_SIZE + threadIdx.y < K)
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            Bs[threadIdx.y][threadIdx.x] = 0.0f;
        
        __syncthreads();  // Wait for tile to load
        
        // Compute on tile (from shared memory - fast!)
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        
        __syncthreads();  // Wait before loading next tile
    }
    
    if (row < M && col < N)
        C[row * N + col] = sum;
}
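The traffic saving behind this kernel is easy to quantify: each tile element loaded into shared memory is reused by TILE_SIZE threads, so global loads drop by a factor of TILE_SIZE. A quick check of that arithmetic:

```python
# Global-memory load counts, naive vs tiled (idealized: ignores caches
# and the boundary tiles that are partially padded with zeros).
M = N = K = 4096
TILE = 32

naive_loads = 2 * M * N * K          # per output element: K reads of A + K of B
tiled_loads = 2 * M * N * K // TILE  # each loaded element is reused by TILE threads

print(f"traffic reduction: {naive_loads / tiled_loads:.0f}x")  # -> 32x
```

A 32x cut in global traffic is what turns the 85% memory throughput / 18% compute profile above into a balanced one.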

Step 5: Verify Improvement

$ ncu --set full ./matmul_tiled | grep -E "Throughput|Occupancy"

Metric                                      Value
----------------------------------------  -------
sm__throughput.avg.pct_of_peak_sustained    72.8%  ← 4x improvement!
gpu__compute_memory_throughput.avg.pct      45.3%  ← More balanced
sm__warps_active.avg.pct_of_peak            68.2%  ← Better occupancy

Warp Stall Reasons:
  long_scoreboard (memory):  28%  ← Much less waiting!
  barrier:                   35%  ← Expected for tiled algorithms
  not_selected:              27%
Performance Comparison: Naive vs Tiled
Naive: 42.3 ms → Tiled (shared memory): 10.5 ms (4x faster) → cuBLAS: 8.2 ms (near optimal)
Shared memory tiling achieves 4x speedup, approaching cuBLAS performance

Summary & Best Practices

Tool Use For Key Commands
Nsight Systems Timeline, CPU-GPU interaction, finding hotspots nsys profile, nsys stats
Nsight Compute Kernel metrics, roofline, optimization guidance ncu --set full, ncu --metrics
Key Takeaways

1. Always start with Nsight Systems to understand the big picture.
2. Use Nsight Compute for deep kernel analysis.
3. Check roofline position to diagnose compute vs memory bound.
4. Look at warp stall reasons to understand why threads wait.
5. Iterate: profile → analyze → optimize → repeat.