1. Why Profile GPU Code?
Writing CUDA code that works is one thing. Writing CUDA code that achieves peak performance is another challenge entirely. Without profiling, you're optimizing blind.
Common performance issues that profiling reveals:
- Memory bottlenecks: Uncoalesced access patterns, bank conflicts
- Low occupancy: Not enough threads to hide latency
- Kernel launch overhead: Too many small kernels
- Host-device synchronization: Hidden blocking calls
- Compute underutilization: Arithmetic intensity too low
If your GPU utilization is below 70%, there's almost certainly a bottleneck that profiling can identify. Even at 90%, there may be optimization opportunities.
2. Nsight Systems vs Nsight Compute
NVIDIA provides two complementary profiling tools, each designed for different levels of analysis:
When to Use Each Tool
| Question | Tool |
|---|---|
| "Where is time being spent?" | Nsight Systems |
| "Why is this kernel slow?" | Nsight Compute |
| "Is my memory transfer overlapping with compute?" | Nsight Systems |
| "What's my memory bandwidth utilization?" | Nsight Compute |
| "Is my application CPU or GPU bound?" | Nsight Systems |
3. Nsight Systems Deep Dive
Nsight Systems captures a timeline of your entire application, showing how CPU and GPU activities interleave. It's the best starting point for any optimization effort.
Command-Line Profiling
Basic profiling with nsys:
# Basic profile - generates .nsys-rep file
nsys profile --output=my_app_profile ./my_cuda_app
# Profile with specific options:
#   --trace picks which APIs to trace, --sample enables CPU sampling,
#   --cudabacktrace collects backtraces for CUDA API calls
nsys profile \
  --trace=cuda,nvtx,osrt \
  --sample=cpu \
  --cudabacktrace=all \
  --output=detailed_profile \
  ./my_cuda_app
# Profile for specific duration
nsys profile --duration=10 ./my_cuda_app
# Profile with NVTX ranges (see below)
nsys profile --trace=cuda,nvtx ./my_cuda_app
# Generate stats summary
nsys stats my_app_profile.nsys-rep
Reading the Timeline
The timeline view is the most powerful feature. Key things to look for:
- Each CPU thread gets its own row, showing CUDA API calls and sampled call stacks.
- The GPU hardware rows show kernels and memory transfers as they actually execute on the device.
- Gaps in the GPU rows mean the GPU is idle, often the single most important finding.
- Check whether memory-transfer rows overlap kernel rows; if they don't, transfers are serializing your pipeline.
NVTX Annotations
Add custom markers to your code for better timeline navigation:
#include <nvtx3/nvToolsExt.h>
void train_epoch() {
    // Create a named range for this epoch
    nvtxRangePush("Epoch");
    for (int batch = 0; batch < num_batches; batch++) {
        // Nested range for each phase (grid/block configured elsewhere)
        nvtxRangePush("Forward Pass");
        forward<<<grid, block>>>(input, output);
        nvtxRangePop();

        nvtxRangePush("Backward Pass");
        backward<<<grid, block>>>(grad_output, grad_input);
        nvtxRangePop();

        nvtxRangePush("Weight Update");
        update_weights<<<grid, block>>>(weights, gradients, lr);
        nvtxRangePop();
    }
    nvtxRangePop(); // End Epoch range
}
The same annotations from Python with PyTorch:
import torch.cuda.nvtx as nvtx

with nvtx.range("training_loop"):
    for batch in dataloader:
        with nvtx.range("forward"):
            output = model(batch)
            loss = loss_fn(output, targets)  # loss_fn/targets stand in for your criterion
        with nvtx.range("backward"):
            loss.backward()
Tip: use color-coded NVTX ranges. Instead of nvtxRangePush, fill in an nvtxEventAttributes_t (set colorType to NVTX_COLOR_ARGB and a distinct color per phase) and call nvtxRangePushEx. Different colors for different phases make complex timelines much easier to read.
4. Nsight Compute Deep Dive
Once you've identified slow kernels with Nsight Systems, use Nsight Compute to understand why they're slow.
Kernel Profiling
# Profile all kernels (basic)
ncu --set full -o kernel_report ./my_cuda_app
# Profile specific kernel by name
ncu --kernel-name "matmul_kernel" --set full ./my_cuda_app
# Profile specific kernel launch (by index)
ncu --launch-skip 10 --launch-count 5 ./my_cuda_app
# Collect specific metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
./my_cuda_app
# Generate CSV for analysis
ncu --csv --set full ./my_cuda_app > metrics.csv
# Compare two implementations: record both, then open the reports in the
# Nsight Compute GUI and set one as the baseline for a side-by-side diff
ncu --set full -o baseline ./my_app_v1
ncu --set full -o optimized ./my_app_v2
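The CSV export lends itself to scripted analysis. Below is a minimal Python sketch; the sample rows are fabricated for illustration, and the exact column set can vary between ncu versions, so check the header row of your own export before relying on these names:

```python
import csv
import io

# Fabricated sample standing in for `ncu --csv` output (real exports
# have more columns; names can differ between ncu versions).
sample = """\
Kernel Name,Metric Name,Metric Unit,Metric Value
matmul_naive,sm__throughput.avg.pct_of_peak_sustained_elapsed,%,11.2
matmul_naive,dram__throughput.avg.pct_of_peak_sustained_elapsed,%,87.5
matmul_tiled,sm__throughput.avg.pct_of_peak_sustained_elapsed,%,64.8
matmul_tiled,dram__throughput.avg.pct_of_peak_sustained_elapsed,%,23.1
"""

# Group metric values by kernel name
metrics = {}
for row in csv.DictReader(io.StringIO(sample)):
    metrics.setdefault(row["Kernel Name"], {})[row["Metric Name"]] = \
        float(row["Metric Value"])

def likely_memory_bound(kernel):
    """Heuristic: DRAM throughput well above SM throughput suggests memory bound."""
    m = metrics[kernel]
    return (m["dram__throughput.avg.pct_of_peak_sustained_elapsed"]
            > m["sm__throughput.avg.pct_of_peak_sustained_elapsed"])

print(sorted(k for k in metrics if likely_memory_bound(k)))  # ['matmul_naive']
```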
Key Metrics Explained
| Metric | What it tells you |
|---|---|
| sm__throughput.avg.pct_of_peak_sustained_elapsed | Compute (SM) throughput as a percentage of peak |
| dram__throughput.avg.pct_of_peak_sustained_elapsed | DRAM bandwidth utilization as a percentage of peak |
| sm__warps_active.avg.pct_of_peak_sustained_active | Achieved occupancy: how full the warp slots are |
| Warp stall reasons (Warp State Statistics section) | Why resident warps are not issuing instructions |
Roofline Analysis
The roofline model visualizes whether a kernel is limited by compute or memory bandwidth:
Arithmetic Intensity = FLOP / Bytes transferred. Low AI kernels (vector add, copy) are memory bound. High AI kernels (dense matmul) are compute bound. The ridge point is where both resources are fully utilized.
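To make the ridge point concrete, here is a toy Python calculation; the peak FLOP/s and bandwidth figures are illustrative assumptions on the scale of a modern data-center GPU, not the specs of any particular card:

```python
# Toy roofline classifier. PEAK_FLOPS and PEAK_BW are illustrative
# assumptions; substitute your own card's specs.
PEAK_FLOPS = 19.5e12          # FP32 FLOP/s (assumed)
PEAK_BW = 1.555e12            # DRAM bytes/s (assumed)
RIDGE = PEAK_FLOPS / PEAK_BW  # FLOP/byte where the two roofs meet (~12.5)

def attainable(ai):
    """Roofline: attainable FLOP/s = min(compute roof, bandwidth * AI)."""
    return min(PEAK_FLOPS, PEAK_BW * ai)

def bound_by(ai):
    return "memory" if ai < RIDGE else "compute"

# Vector add c[i] = a[i] + b[i]: 1 FLOP per element, 12 bytes moved
ai_vector_add = 1 / 12
# Dense NxN matmul with ideal reuse: 2*N^3 FLOP over ~3*N^2*4 bytes
N = 4096
ai_matmul = 2 * N**3 / (3 * N * N * 4)

print(bound_by(ai_vector_add), bound_by(ai_matmul))  # memory compute
```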
5. Optimization Workflow
A systematic approach to GPU optimization:
1. Profile the whole application with Nsight Systems to see where time goes.
2. Fix pipeline-level problems first: idle GPU gaps, serialized transfers, excessive small launches.
3. Pick the most expensive kernel and profile it with Nsight Compute.
4. Use the roofline position and warp stall reasons to classify the bottleneck.
5. Apply one targeted change, then re-profile to confirm it actually helped.
6. Repeat until the kernel sits near its roofline or further effort stops paying off.
6. Case Study: Optimizing Matrix Multiply
Let's walk through a real optimization using both tools.
Step 1: Baseline Implementation
// Naive matrix multiply - each thread computes one element
__global__ void matmul_naive(float* A, float* B, float* C,
int M, int N, int K) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < M && col < N) {
float sum = 0.0f;
for (int k = 0; k < K; k++) {
sum += A[row * K + k] * B[k * N + col]; // Global memory!
}
C[row * N + col] = sum;
}
}
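A back-of-envelope estimate (a sketch, assuming zero cache reuse) already predicts the problem: every multiply-add fetches two fresh floats from global memory.

```python
# Arithmetic intensity of the naive kernel per output element, assuming no
# cache reuse: the inner loop does 2*K FLOP (one multiply + one add per
# iteration) and reads 2*K floats (one element each of A and B).
def naive_ai(K, bytes_per_float=4):
    flop = 2 * K
    bytes_moved = 2 * K * bytes_per_float
    return flop / bytes_moved

print(naive_ai(1024))  # 0.25 FLOP/byte, independent of K
```

At 0.25 FLOP/byte the kernel sits far below any GPU's ridge point, so it should be memory bound, which is exactly what profiling confirms below.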
Step 2: Profile with Nsight Systems
An nsys timeline of the baseline shows GPU time dominated by matmul_naive itself rather than by transfers or idle gaps, so the problem lives inside the kernel and we drill down with Nsight Compute.
Step 3: Profile with Nsight Compute
ncu --set full --kernel-name matmul_naive ./my_cuda_app
The report shows the kernel is memory bound: memory throughput is high, compute throughput is low, and 68% of the time is spent waiting for memory. The fix: reuse data through shared memory tiling.
Step 4: Optimized Implementation
#define TILE_SIZE 32

// Tiled matrix multiply using shared memory
__global__ void matmul_tiled(float* A, float* B, float* C,
                             int M, int N, int K) {
    // Shared memory tiles
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Loop over tiles
    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Cooperative load into shared memory
        if (row < M && t * TILE_SIZE + threadIdx.x < K)
            As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
        else
            As[threadIdx.y][threadIdx.x] = 0.0f;

        if (col < N && t * TILE_SIZE + threadIdx.y < K)
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            Bs[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads(); // Wait for the whole tile to load

        // Compute on the tile (from shared memory - fast!)
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads(); // Wait before loading the next tile
    }
    if (row < M && col < N)
        C[row * N + col] = sum;
}
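The same estimate, under the simplifying assumption that each tile element is fetched from global memory exactly once per block, shows why tiling pays off:

```python
TILE = 32

# Per tile step, a block loads two TILE x TILE tiles (A and B) from global
# memory and performs 2 * TILE^3 FLOP on them out of shared memory, so each
# loaded element is reused TILE times.
def tiled_ai(tile=TILE, bytes_per_float=4):
    flop = 2 * tile**3
    bytes_moved = 2 * tile * tile * bytes_per_float
    return flop / bytes_moved

print(tiled_ai(), tiled_ai() / 0.25)  # 8.0 FLOP/byte, 32x the naive kernel
```

Arithmetic intensity rises from 0.25 to TILE/4 = 8 FLOP/byte, which is why the tiled kernel shifts from the memory roof toward the compute roof.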
Step 5: Verify Improvement
Re-profile both versions with ncu and compare the reports in the GUI, setting the naive report as the baseline. The tiled kernel should show much lower DRAM throughput, higher SM throughput, and a shorter duration on the Nsight Systems timeline. Never assume an optimization worked: measure it.
Summary & Best Practices
| Tool | Use For | Key Commands |
|---|---|---|
| Nsight Systems | Timeline, CPU-GPU interaction, finding hotspots | nsys profile, nsys stats |
| Nsight Compute | Kernel metrics, roofline, optimization guidance | ncu --set full, ncu --metrics |
1. Always start with Nsight Systems to understand the big picture.
2. Use Nsight Compute for deep kernel analysis.
3. Check roofline position to diagnose compute vs memory bound.
4. Look at warp stall reasons to understand why threads wait.
5. Iterate: profile → analyze → optimize → repeat.