1. Introduction: Why Memory Matters
If you've ever wondered why your CUDA kernel isn't hitting the theoretical peak performance of your GPU, the answer is almost always memory. Modern GPUs have incredible computational throughput—an NVIDIA A100 can perform over 19 TFLOPS of FP32 operations—but feeding all those compute units with data is the real challenge.
Understanding the CUDA memory hierarchy is the single most important skill for writing efficient GPU code. In this guide, we'll explore each memory type, when to use it, and practical patterns for optimization.
The gap between memory bandwidth and compute throughput continues to grow. On modern GPUs, a global memory access costs hundreds of cycles, while a single arithmetic operation completes in just a few. This is why memory optimization is so critical.
2. The CUDA Memory Hierarchy Overview
CUDA provides a hierarchical memory system with different characteristics for speed, size, and scope. Here's a visualization of the complete hierarchy:
Figure 1: The CUDA memory hierarchy showing the trade-off between speed, size, and scope. Bandwidth values are approximate for NVIDIA A100.
Let's look at the key characteristics of each memory type:
| Memory Type | Scope | Size | Latency | Bandwidth |
|---|---|---|---|---|
| Registers | Thread | ~255 per thread | 1 cycle | ~19 TB/s |
| Shared Memory | Block | Up to 164KB | ~20-30 cycles | ~12 TB/s |
| L2 Cache | Device | 40-50MB | ~200 cycles | ~5 TB/s |
| Global Memory | Device | 16-80GB | ~400-800 cycles | ~2 TB/s |
| Constant Memory | Device (cached) | 64KB | ~100 cycles (cached) | Broadcast efficient |
3. Registers: The Fastest Memory
Registers are the fastest memory available on a GPU. Each thread has exclusive access to its own registers, and accessing them takes only a single clock cycle. However, registers are a precious resource—using too many per thread reduces the number of concurrent threads (occupancy).
How Registers Work
Local variables in your kernel code typically reside in registers. The compiler decides register allocation, but you can influence it through code structure.
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
// idx is stored in a register - very fast access
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
// temp_a and temp_b are register variables
float temp_a = a[idx]; // Load from global memory to register
float temp_b = b[idx]; // Load from global memory to register
// Computation happens entirely in registers
float result = temp_a + temp_b;
c[idx] = result; // Store from register to global memory
}
}
Use nvcc --ptxas-options=-v to see register usage per kernel. If a kernel uses more than 32 registers per thread, consider reducing complexity or using __launch_bounds__ to control allocation.
Register Spilling
When a kernel uses too many registers, the compiler "spills" some to local memory (which is actually in global memory, just thread-private). This can devastate performance.
// Bad: Large arrays cause register spilling
__global__ void badKernel() {
float data[64]; // Too large; dynamically indexed arrays like this spill to local memory
// ...
}
// Better: Use shared memory for larger working sets
__global__ void betterKernel() {
__shared__ float sharedData[64]; // Block-shared, no spilling
float myValue; // Single register variable
// ...
}
4. Shared Memory: The Programmable Cache
Shared memory is the secret weapon of GPU optimization. It's an on-chip memory that's shared among all threads in a block, with roughly 10-20x lower latency than global memory and much higher bandwidth.
Think of shared memory as a programmer-managed L1 cache. Unlike automatic caches, you decide exactly what data gets stored there and when.
Figure 2: Shared memory is organized into 32 banks. When threads access different banks, accesses happen in parallel. Conflicts serialize access.
Basic Shared Memory Usage
The most common use of shared memory is to cache data from global memory that will be accessed multiple times.
#define BLOCK_SIZE 256
__global__ void reductionSum(float* input, float* output, int n) {
// Declare shared memory for the block
__shared__ float sdata[BLOCK_SIZE];
unsigned int tid = threadIdx.x;
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Load from global memory to shared memory
sdata[tid] = (idx < n) ? input[idx] : 0.0f;
__syncthreads(); // Ensure all threads have loaded
// Parallel reduction in shared memory
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads(); // Sync after each reduction step
}
// Thread 0 writes the result for this block
if (tid == 0) {
output[blockIdx.x] = sdata[0];
}
}
Always use __syncthreads() after writing to shared memory and before reading from locations written by other threads. Missing synchronization causes race conditions that may work sometimes but fail unpredictably.
Bank Conflicts
Shared memory is divided into 32 banks (on modern GPUs). When multiple threads in a warp access different addresses in the same bank, accesses are serialized—this is called a bank conflict.
__shared__ float data[1024];
// Good: Stride-1 access, no bank conflicts
// Thread i accesses bank i % 32
float v1 = data[threadIdx.x]; // ✓ Conflict-free
// Bad: Stride-32 access, maximum bank conflicts!
// All 32 threads in the warp hit bank 0
float v2 = data[threadIdx.x * 32]; // ✗ 32-way conflict
// Solution: Pad the leading dimension to avoid conflicts
__shared__ float data_padded[32][33]; // 33 instead of 32!
// Column accesses now map each thread to a different bank
float v3 = data_padded[threadIdx.x][0]; // ✓ Conflict-free
5. Global Memory: The Main Workhorse
Global memory is where your data lives. It's the largest memory space (tens of GBs on modern GPUs) but also the slowest. The key to fast global memory access is coalescing—combining memory accesses from multiple threads into fewer, larger transactions.
Figure 3: Coalesced access combines thread requests into minimal memory transactions. Scattered access requires multiple transactions, wasting bandwidth.
Achieving Coalesced Access
// ✓ Good: Coalesced access - adjacent threads access adjacent memory
__global__ void coalescedAccess(float* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
float val = data[idx]; // Thread 0 → data[0], Thread 1 → data[1], ...
}
}
// ✗ Bad: Strided access - threads skip memory locations
__global__ void stridedAccess(float* data, int n, int stride) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx * stride < n) {
float val = data[idx * stride]; // Scattered access!
}
}
// Structure of Arrays (SoA) vs Array of Structures (AoS)
// ✗ AoS: Poor coalescing
struct ParticleAoS {
float x, y, z;
float vx, vy, vz;
};
// ParticleAoS particles[N]; → Accessing x values is strided
// ✓ SoA: Perfect coalescing
struct ParticlesSoA {
float* x; // All x values contiguous
float* y; // All y values contiguous
float* z;
float* vx;
float* vy;
float* vz;
};
// Access particles.x[idx] → Coalesced!
6. Constant & Texture Memory
CUDA provides two additional memory types optimized for specific access patterns:
Constant Memory
Constant memory is ideal for read-only data that's accessed uniformly by all threads. It has a dedicated cache that can broadcast a single value to all threads in a warp in one cycle.
// Declare constant memory (limited to 64KB)
__constant__ float filterKernel[25]; // 5x5 convolution kernel
__constant__ float transformMatrix[16]; // 4x4 matrix
// Copy to constant memory from host
void setupConstants(float* h_kernel, float* h_matrix) {
cudaMemcpyToSymbol(filterKernel, h_kernel, 25 * sizeof(float));
cudaMemcpyToSymbol(transformMatrix, h_matrix, 16 * sizeof(float));
}
__global__ void convolution(float* input, float* output, int width) {
// All threads read the same kernel value → broadcast is efficient
float sum = 0.0f;
for (int i = 0; i < 25; i++) {
sum += input[...] * filterKernel[i]; // Cached broadcast
}
output[...] = sum;
}
Texture Memory
Texture memory provides hardware interpolation, boundary handling, and is optimized for 2D spatial locality. It's particularly useful for image processing.
// Modern CUDA texture objects (preferred)
cudaTextureObject_t texObj;
void createTexture(float* d_data, int width, int height) {
// Create channel descriptor
cudaChannelFormatDesc channelDesc =
cudaCreateChannelDesc<float>();
// Create CUDA array
cudaArray_t cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);
cudaMemcpy2DToArray(cuArray, 0, 0, d_data,
width * sizeof(float),
width * sizeof(float), height,
cudaMemcpyDeviceToDevice);
// Setup texture parameters
cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeArray;
resDesc.res.array.array = cuArray;
cudaTextureDesc texDesc = {};
texDesc.addressMode[0] = cudaAddressModeClamp; // Clamp at edges
texDesc.addressMode[1] = cudaAddressModeClamp;
texDesc.filterMode = cudaFilterModeLinear; // Bilinear interpolation
texDesc.normalizedCoords = true;
cudaCreateTextureObject(&texObj, &resDesc, &texDesc, nullptr);
}
__global__ void sampleTexture(cudaTextureObject_t tex, float* output) {
float u = threadIdx.x / (float)blockDim.x;
float v = threadIdx.y / (float)blockDim.y;
// Hardware-accelerated bilinear interpolation!
output[...] = tex2D<float>(tex, u, v);
}
7. Practical Optimization Techniques
Let's put it all together with a real-world example: optimizing matrix multiplication.
Naive Implementation
// Naive matrix multiplication - poor memory access pattern
__global__ void matmulNaive(float* A, float* B, float* C, int N) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < N && col < N) {
float sum = 0.0f;
for (int k = 0; k < N; k++) {
// Each thread reads N elements from A and N from B
// A is accessed with stride N (non-coalesced)
// B is accessed with stride 1 (coalesced)
sum += A[row * N + k] * B[k * N + col];
}
C[row * N + col] = sum;
}
}
// Performance: ~200-400 GFLOPS on A100
Optimized with Shared Memory Tiling
#define TILE_SIZE 32
__global__ void matmulTiled(float* A, float* B, float* C, int N) {
// Shared memory tiles for A and B
__shared__ float As[TILE_SIZE][TILE_SIZE];
__shared__ float Bs[TILE_SIZE][TILE_SIZE];
int bx = blockIdx.x, by = blockIdx.y;
int tx = threadIdx.x, ty = threadIdx.y;
int row = by * TILE_SIZE + ty;
int col = bx * TILE_SIZE + tx;
float sum = 0.0f;
// Loop over tiles
for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
// Collaborative loading: each thread loads one element
int aCol = t * TILE_SIZE + tx;
int bRow = t * TILE_SIZE + ty;
As[ty][tx] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
Bs[ty][tx] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
__syncthreads();
// Compute partial dot product using shared memory
#pragma unroll
for (int k = 0; k < TILE_SIZE; k++) {
sum += As[ty][k] * Bs[k][tx]; // Fast shared memory access!
}
__syncthreads();
}
if (row < N && col < N) {
C[row * N + col] = sum;
}
}
// Performance: ~2000-5000 GFLOPS on A100 (10-20x improvement!)
The tiled version reduces global memory traffic by a factor of TILE_SIZE: instead of each output element streaming 2N values from global memory, each tile of A and B is loaded into shared memory once per block and reused TILE_SIZE times.
8. Summary & Best Practices
Here are the key takeaways for optimizing CUDA memory access:
- Minimize global memory access - Use shared memory to cache repeatedly accessed data
- Coalesce memory access - Ensure adjacent threads access adjacent memory locations
- Avoid bank conflicts - Use padding when accessing shared memory with strides
- Use Structure of Arrays (SoA) - Better coalescing than Array of Structures
- Leverage constant memory - For read-only data broadcast to all threads
- Profile with Nsight - Use tools to identify memory bottlenecks
- Watch register usage - Avoid spilling to local memory
Understanding memory hierarchy is fundamental to GPU programming. While modern GPUs have incredible compute power, the memory subsystem is almost always the bottleneck. By carefully managing data placement and access patterns, you can often achieve 10-100x speedups over naive implementations.
In the next post, we'll explore CUDA profiling with Nsight Compute and learn how to identify and fix memory bottlenecks in real-world applications.