1. Why Profile GPU Code?

Writing CUDA code that works is one thing. Writing CUDA code that achieves peak performance is another challenge entirely. Without profiling, you're optimizing blind.

Profiling commonly reveals issues such as low GPU utilization, memory-bound kernels, low occupancy, and unnecessary synchronization.

[Figure: The GPU Performance Gap. On an H100 SXM with a theoretical peak of ~100 TFLOPS, a naive kernel (memory bound, low occupancy) reaches ~20% of peak; after profiling-driven optimization (better memory access, higher occupancy) it reaches ~80%, a 4x speedup. "Without measurement, optimization is just guessing."]
Profiling often reveals 2-5x improvement opportunities in CUDA code
Rule of Thumb

If your GPU utilization is below 70%, there's almost certainly a bottleneck that profiling can identify. Even at 90%, there may be optimization opportunities.

2. Nsight Systems vs Nsight Compute

NVIDIA provides two complementary profiling tools, each designed for different levels of analysis:

Nsight Systems ("system-level profiler") shows:
• CPU-GPU timeline visualization
• Kernel launch patterns
• Memory transfers (H2D, D2H)
• CUDA API calls
• Multi-GPU / multi-process activity
Best for: finding where time goes (the big picture), identifying CPU-GPU bottlenecks, and understanding async/stream behavior.

Nsight Compute ("kernel-level profiler") shows:
• SM utilization & occupancy
• Memory throughput (L1, L2, DRAM)
• Warp stall reasons
• Instruction mix (FP32, FP64, INT)
• Roofline analysis
Best for: optimizing individual kernels, understanding memory access patterns, and identifying compute- vs memory-bound kernels.
Use Nsight Systems first (find WHERE), then Nsight Compute (understand WHY)

When to Use Each Tool

Question Tool
"Where is time being spent?" Nsight Systems
"Why is this kernel slow?" Nsight Compute
"Is my memory transfer overlapping with compute?" Nsight Systems
"What's my memory bandwidth utilization?" Nsight Compute
"Is my application CPU or GPU bound?" Nsight Systems

3. Nsight Systems Deep Dive

Nsight Systems captures a timeline of your entire application, showing how CPU and GPU activities interleave. It's the best starting point for any optimization effort.

Command-Line Profiling

Basic profiling with nsys:

Bash profile_app.sh
# Basic profile - generates .nsys-rep file
nsys profile --output=my_app_profile ./my_cuda_app

# Profile with specific options: trace CUDA, NVTX, and OS runtime calls,
# sample the CPU, and collect CUDA API backtraces.
# (Note: trailing comments after a backslash would break the line continuation.)
nsys profile \
  --trace=cuda,nvtx,osrt \
  --sample=cpu \
  --cudabacktrace=all \
  --output=detailed_profile \
  ./my_cuda_app

# Profile for specific duration
nsys profile --duration=10 ./my_cuda_app

# Profile with NVTX ranges (see below)
nsys profile --trace=cuda,nvtx ./my_cuda_app

# Generate stats summary
nsys stats my_app_profile.nsys-rep
$ nsys stats my_app_profile.nsys-rep

CUDA API Statistics:
 Time(%)  Total Time (ns)  Num Calls    Avg (ns)  Name
 -------  ---------------  ---------  ----------  ----------------------
    45.2    1,234,567,890      1,000   1,234,567  cudaLaunchKernel
    32.1      876,543,210        100   8,765,432  cudaMemcpy
    15.3      418,765,432         50   8,375,308  cudaDeviceSynchronize

CUDA Kernel Statistics:
 Time(%)  Total Time (ns)  Instances    Avg (ns)  Name
 -------  ---------------  ---------  ----------  ----------------------
    78.4    2,345,678,901        500   4,691,357  matmul_kernel
    15.2      456,789,012        500     913,578  relu_kernel
     6.4      191,234,567        500     382,469  add_bias_kernel

Reading the Timeline

The timeline view is the most powerful feature. Here's how to interpret it:

Nsight Systems Timeline View
[Figure: Nsight Systems timeline (0-200 ms). CPU row: data prep, H2D, D2H, processing. CUDA API row: blocking sync calls. GPU row: H2D transfer, kernel 1, kernel 2, D2H transfer.]

Common problems to look for:
• GPU idle time - gaps between kernels indicate CPU overhead
• Blocking syncs - too many cudaDeviceSynchronize() calls
• Serial transfers - H2D/D2H not overlapping with compute

What good looks like:
✓ GPU continuously busy
✓ Overlapped transfers
✓ Minimal sync gaps
Timeline shows CPU-GPU coordination - look for gaps and blocking operations
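The payoff from overlapping transfers with compute can be estimated with a simple pipeline model. The sketch below uses made-up per-stage times (not taken from any real profile) and assumes perfect double-buffered overlap, so treat it as a back-of-envelope bound, not a prediction:

```python
# Back-of-envelope model of serial vs overlapped transfer/compute.
# Stage times below are hypothetical, in milliseconds per batch.
h2d, kernel, d2h = 10.0, 30.0, 8.0
batches = 100

# Fully serial: every batch pays copy-in + compute + copy-out.
serial = batches * (h2d + kernel + d2h)

# Fully overlapped (double-buffered streams): after the first batch fills
# the pipeline, throughput is limited by the slowest stage only.
pipelined = (h2d + kernel + d2h) + (batches - 1) * max(h2d, kernel, d2h)

print(f"serial:    {serial:.0f} ms")
print(f"pipelined: {pipelined:.0f} ms ({serial / pipelined:.2f}x faster)")
```

With these numbers the kernel dominates, so overlap hides almost all transfer time; in the timeline this looks like a continuously busy GPU row.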

NVTX Annotations

Add custom markers to your code for better timeline navigation:

C++ nvtx_example.cu
#include <nvtx3/nvToolsExt.h>

void train_epoch() {
    // Create a named range for this epoch
    nvtxRangePush("Epoch");
    
    for (int batch = 0; batch < num_batches; batch++) {
        // Nested range for each phase
        nvtxRangePush("Forward Pass");
        forward<<<grid_dim, block_dim>>>(input, output);  // launch config assumed defined
        nvtxRangePop();
        
        nvtxRangePush("Backward Pass");
        backward<<<grid_dim, block_dim>>>(grad_output, grad_input);
        nvtxRangePop();
        
        nvtxRangePush("Weight Update");
        update_weights<<<grid_dim, block_dim>>>(weights, gradients, lr);
        nvtxRangePop();
    }
    
    nvtxRangePop();  // End Epoch range
}

The same annotations in PyTorch:

Python nvtx_example.py
import torch.cuda.nvtx as nvtx

with nvtx.range("training_loop"):
    for batch in dataloader:
        with nvtx.range("forward"):
            output = model(batch)
            loss = criterion(output, targets)  # loss fn/targets assumed defined
        with nvtx.range("backward"):
            loss.backward()
Pro Tip

Use color-coded NVTX ranges: call nvtxRangePushEx() with an nvtxEventAttributes_t that sets a different color per phase (plain nvtxRangePush/nvtxRangePushA create uncolored ranges). Colors make complex timelines much easier to read.

4. Nsight Compute Deep Dive

Once you've identified slow kernels with Nsight Systems, use Nsight Compute to understand why they're slow.

Kernel Profiling

Bash ncu_profile.sh
# Profile all kernels with the full metric set
ncu --set full -o kernel_report ./my_cuda_app

# Profile specific kernel by name
ncu --kernel-name "matmul_kernel" --set full ./my_cuda_app

# Profile specific launches: skip the first 10, then profile the next 5
ncu --launch-skip 10 --launch-count 5 ./my_cuda_app

# Collect specific metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
./my_cuda_app

# Generate CSV for analysis
ncu --csv --set full ./my_cuda_app > metrics.csv

# Compare two implementations (open both reports in the GUI and
# set one as the baseline)
ncu --set full -o baseline ./my_app_v1
ncu --set full -o optimized ./my_app_v2
ncu-ui baseline.ncu-rep optimized.ncu-rep
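The CSV output above can be post-processed with standard tools. This is a sketch using Python's csv module; the column names ("Kernel Name", "Metric Name", "Metric Value") are assumptions, since the exact header row varies across ncu versions, so check your own CSV first:

```python
import csv
import io

# Synthetic stand-in for `ncu --csv` output; real column names may differ.
sample = """Kernel Name,Metric Name,Metric Value
matmul_kernel,sm__throughput.avg.pct_of_peak_sustained_elapsed,18.3
matmul_kernel,dram__throughput.avg.pct_of_peak_sustained_elapsed,85.2
"""

# Parse rows and keep only throughput metrics.
rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    if "throughput" in row["Metric Name"]:
        print(f'{row["Metric Name"]}: {row["Metric Value"]}%')
```

The same pattern scales to filtering hundreds of kernels by name or thresholding metrics across a whole run.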

Key Metrics Explained

Critical Nsight Compute Metrics
Compute metrics:
• sm__throughput.avg.pct_of_peak - % of theoretical compute used (good: >60%, bad: <30%)
• sm__warps_active.avg.pct_of_peak - occupancy, i.e., warps in flight (good: >50%, bad: <25%)
• smsp__sass_thread_inst_executed.sum - total instructions executed

Memory metrics:
• dram__throughput.avg.pct_of_peak - % of memory bandwidth used (good: >70%, wasteful: <40%)
• l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum - sectors read by global loads at L1
• lts__t_sectors_srcunit_tex_op_read.sum - L2 read sectors (→ DRAM traffic)

Warp stall reasons (smsp__warp_issue_stalled_*), i.e., why warps wait:
• long_scoreboard - waiting for memory (the classic memory bottleneck)
• not_selected - other warps were prioritized
• math_pipe_throttle - FP units at capacity
• barrier - waiting at __syncthreads()

Quick diagnosis:
• High memory throughput + low compute → memory bound
• High compute throughput + low memory → compute bound
• Both low → latency bound (low occupancy)
Use these metrics to identify whether your kernel is compute, memory, or latency bound
Example readout: SM throughput 85% (good), occupancy 45% (needs work), memory bandwidth 20% (bottleneck!).
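A mediocre occupancy figure is often explained by register or shared memory pressure. The sketch below estimates theoretical occupancy from per-block resource usage; the SM limits are assumptions (roughly Ampere-class numbers), so query cudaGetDeviceProperties or use the official occupancy calculator for real values:

```python
# Assumed per-SM limits (Ampere-class ballpark; not authoritative).
MAX_WARPS_PER_SM  = 64
REGISTERS_PER_SM  = 65536
SHARED_MEM_PER_SM = 100 * 1024  # bytes

def occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Theoretical occupancy: active warps / max warps per SM."""
    warps_per_block = (threads_per_block + 31) // 32
    blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block
    blocks_by_regs  = REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    blocks_by_smem  = (SHARED_MEM_PER_SM // smem_per_block
                       if smem_per_block else blocks_by_warps)
    blocks = min(blocks_by_warps, blocks_by_regs, blocks_by_smem)
    return blocks * warps_per_block / MAX_WARPS_PER_SM

# 256 threads/block, 64 registers/thread, 8 KB shared memory per block:
# registers are the limiter here, capping occupancy at 50%.
print(f"{occupancy(256, 64, 8 * 1024):.0%}")
```

Under these assumptions, halving register usage per thread would double the resident blocks and lift occupancy back to 100%, which is exactly the kind of lever the "latency bound" diagnosis points at.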

Roofline Analysis

The roofline model visualizes whether a kernel is limited by compute or memory bandwidth:

Roofline Model for GPU Kernels
[Figure: Roofline model. X axis: arithmetic intensity (FLOP/byte, 0.1-100, log scale); Y axis: performance (GFLOP/s, 1-10000, log scale). A sloped memory-bandwidth roof meets the flat compute roof (peak FLOPS) at the ridge point. Kernel A (AI=0.5) sits far below the roof and is memory bound; Kernel B (AI=8) is near optimal, close to the ridge; Kernel C (AI=50) is compute bound, at the compute roof. Memory-bound kernels move toward higher AI via caching and tiling.]
Kernels below the roof have optimization potential. Move toward the ridge point for best performance.
Reading the Roofline

Arithmetic Intensity = FLOP / Bytes transferred. Low AI kernels (vector add, copy) are memory bound. High AI kernels (dense matmul) are compute bound. The ridge point is where both resources are fully utilized.
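This classification is easy to do by hand. The sketch below applies the definition above with assumed peak figures (order-of-magnitude H100-class: ~60 TFLOP/s FP32, ~3 TB/s HBM; not exact specs):

```python
# Assumed machine peaks (illustrative, not exact H100 specs).
PEAK_FLOPS = 60e12            # FLOP/s
PEAK_BW    = 3e12             # bytes/s
RIDGE      = PEAK_FLOPS / PEAK_BW   # FLOP/byte at the ridge point (20)

def bound(flops, bytes_moved):
    """Return (arithmetic intensity, roofline classification)."""
    ai = flops / bytes_moved
    return ai, ("memory bound" if ai < RIDGE else "compute bound")

# Vector add c = a + b over N floats: N FLOPs, 3*N*4 bytes moved.
n = 1 << 20
ai, label = bound(n, 3 * n * 4)
print(f"AI = {ai:.3f} FLOP/byte -> {label}")
```

With AI ≈ 0.08 against a ridge point of 20, vector add sits deep in the memory-bound region, which is why no amount of instruction tuning will speed it up.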

5. Optimization Workflow

A systematic approach to GPU optimization:

GPU Optimization Workflow
1. Profile with Nsight Systems - find hotspots
2. Profile the kernel with Nsight Compute - identify the bottleneck
3. Analyze roofline position - compute bound or memory bound?
4. Apply an optimization based on the diagnosis, then repeat until satisfied

Optimization strategies by bottleneck:

Memory bound:
• Coalesce memory accesses
• Use shared memory tiling
• Reduce global memory traffic

Compute bound:
• Use Tensor Cores (if applicable)
• Reduce instruction count
• Use faster math intrinsics (e.g., __fmaf_rn)

Latency bound:
• Increase occupancy
• Reduce register usage
• Use more threads per block
Iterative profiling → analysis → optimization cycle

6. Case Study: Optimizing Matrix Multiply

Let's walk through a real optimization using both tools.

Step 1: Baseline Implementation

CUDA C++ matmul_naive.cu
// Naive matrix multiply - each thread computes one element
__global__ void matmul_naive(float* A, float* B, float* C, 
                              int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += A[row * K + k] * B[k * N + col];  // Global memory!
        }
        C[row * N + col] = sum;
    }
}
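Before profiling, a pencil-and-paper estimate already hints at the problem. Each output element reads K elements of A and K of B straight from global memory with no reuse, so the kernel's arithmetic intensity is tiny:

```python
# Arithmetic intensity of the naive kernel above, counting global memory
# traffic as seen by the code (no caching assumed).
M = N = K = 4096
bytes_per_float = 4

flops       = 2 * M * N * K                            # one multiply-add = 2 FLOPs
naive_bytes = M * N * (2 * K + 1) * bytes_per_float    # K reads of A + K of B + 1 write
ai          = flops / naive_bytes

print(f"arithmetic intensity: {ai:.3f} FLOP/byte")     # ~0.25, deep in memory-bound territory
```

At ~0.25 FLOP/byte this is far below any modern GPU's ridge point, so the profilers should confirm a memory-bound kernel.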

Step 2: Profile with Nsight Systems

$ nsys profile --stats=true ./matmul_naive

CUDA Kernel Statistics:
 Time(%)  Total Time  Instances  Avg Time  Name
 -------  ----------  ---------  --------  ------------
   98.2%      4.23 s        100   42.3 ms  matmul_naive

GPU utilization: 45%  ← Low utilization!

Step 3: Profile with Nsight Compute

$ ncu --set full ./matmul_naive | grep -E "Throughput|Occupancy"

Metric                                      Value
----------------------------------------  -------
sm__throughput.avg.pct_of_peak_sustained    18.3%  ← Very low!
gpu__compute_memory_throughput.avg.pct      85.2%  ← Memory bound
sm__warps_active.avg.pct_of_peak            32.4%  ← Low occupancy

Warp Stall Reasons:
  long_scoreboard (memory):  68%  ← Waiting for memory!
  not_selected:              22%
  other:                     10%
Diagnosis

The kernel is memory bound: high memory throughput, low compute throughput, and 68% of time waiting for memory. Solution: use shared memory tiling.

Step 4: Optimized Implementation

CUDA C++ matmul_tiled.cu
#define TILE_SIZE 32

// Tiled matrix multiply using shared memory
__global__ void matmul_tiled(float* A, float* B, float* C,
                             int M, int N, int K) {
    // Shared memory tiles
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];
    
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    
    float sum = 0.0f;
    
    // Loop over tiles
    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Cooperative load into shared memory
        if (row < M && t * TILE_SIZE + threadIdx.x < K)
            As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
        else
            As[threadIdx.y][threadIdx.x] = 0.0f;
            
        if (col < N && t * TILE_SIZE + threadIdx.y < K)
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            Bs[threadIdx.y][threadIdx.x] = 0.0f;
        
        __syncthreads();  // Wait for tile to load
        
        // Compute on tile (from shared memory - fast!)
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        
        __syncthreads();  // Wait before loading next tile
    }
    
    if (row < M && col < N)
        C[row * N + col] = sum;
}
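The traffic saving behind this kernel is easy to quantify: each tile element loaded into shared memory is reused by TILE_SIZE threads, so global loads drop by a factor of TILE_SIZE. A quick check of that arithmetic:

```python
# Global-memory load counts, naive vs tiled (idealized: ignores caches
# and the boundary tiles that are partially padded with zeros).
M = N = K = 4096
TILE = 32

naive_loads = 2 * M * N * K          # per output element: K reads of A + K of B
tiled_loads = 2 * M * N * K // TILE  # each loaded element is reused by TILE threads

print(f"traffic reduction: {naive_loads / tiled_loads:.0f}x")  # -> 32x
```

A 32x cut in global traffic is what turns the 85% memory throughput / 18% compute profile above into a balanced one.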

Step 5: Verify Improvement

$ ncu --set full ./matmul_tiled | grep -E "Throughput|Occupancy"

Metric                                      Value
----------------------------------------  -------
sm__throughput.avg.pct_of_peak_sustained    72.8%  ← 4x improvement!
gpu__compute_memory_throughput.avg.pct      45.3%  ← More balanced
sm__warps_active.avg.pct_of_peak            68.2%  ← Better occupancy

Warp Stall Reasons:
  long_scoreboard (memory):  28%  ← Much less waiting!
  barrier:                   35%  ← Expected for tiled algorithms
  not_selected:              27%
Performance Comparison: Naive vs Tiled
Naive: 42.3 ms → Tiled (shared memory): 10.5 ms (4x faster) → cuBLAS: 8.2 ms (near optimal)
Shared memory tiling achieves 4x speedup, approaching cuBLAS performance

Summary & Best Practices

Tool Use For Key Commands
Nsight Systems Timeline, CPU-GPU interaction, finding hotspots nsys profile, nsys stats
Nsight Compute Kernel metrics, roofline, optimization guidance ncu --set full, ncu --metrics
Key Takeaways

1. Always start with Nsight Systems to understand the big picture.
2. Use Nsight Compute for deep kernel analysis.
3. Check roofline position to diagnose compute vs memory bound.
4. Look at warp stall reasons to understand why threads wait.
5. Iterate: profile → analyze → optimize → repeat.