1. Introduction: Why Memory Matters

If you've ever wondered why your CUDA kernel isn't hitting the theoretical peak performance of your GPU, the answer is almost always memory. Modern GPUs have incredible computational throughput—an NVIDIA A100 can perform over 19 TFLOPS of FP32 operations—but feeding all those compute units with data is the real challenge.

Understanding the CUDA memory hierarchy is the single most important skill for writing efficient GPU code. In this guide, we'll explore each memory type, when to use it, and practical patterns for optimization.

Key Insight

The gap between compute throughput and memory bandwidth continues to grow. On modern GPUs, an uncached global memory access costs hundreds of cycles while an arithmetic operation completes in a handful — a latency gap of roughly two orders of magnitude. This is why memory optimization is so critical.

2. The CUDA Memory Hierarchy Overview

CUDA provides a hierarchical memory system with different characteristics for speed, size, and scope. Here's a visualization of the complete hierarchy:

CUDA Memory Hierarchy
[Figure: stacked layers from registers (per-thread, ~256KB total per SM, ~1 cycle) through shared memory/L1 (per-block, up to 164KB per SM, ~20-30 cycles) and L2 cache (device-wide, 40-80MB, ~200 cycles) down to global memory (device-wide, 16-80GB, ~400-800 cycles), with bandwidth falling from ~19 TB/s to ~2 TB/s — faster and smaller at the top, slower and larger at the bottom]

Figure 1: The CUDA memory hierarchy showing the trade-off between speed, size, and scope. Bandwidth values are approximate for NVIDIA A100.

Let's look at the key characteristics of each memory type:

Memory Type     | Scope           | Size            | Latency              | Bandwidth
Registers       | Thread          | ~255 per thread | ~1 cycle             | ~19 TB/s
Shared Memory   | Block           | Up to 164KB/SM  | ~20-30 cycles        | ~12 TB/s
L2 Cache        | Device          | 40-80MB         | ~200 cycles          | ~5 TB/s
Global Memory   | Device          | 16-80GB         | ~400-800 cycles      | ~2 TB/s
Constant Memory | Device (cached) | 64KB            | ~100 cycles (cached) | Broadcast-efficient

3. Registers: The Fastest Memory

Registers are the fastest memory available on a GPU. Each thread has exclusive access to its own registers, and accessing them takes only a single clock cycle. However, registers are a precious resource—using too many per thread reduces the number of concurrent threads (occupancy).
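The register/occupancy trade-off is simple arithmetic. Here's a host-side C++ sketch, assuming an A100-class SM with 65,536 32-bit registers and a 2,048-thread residency cap (`maxThreadsByRegisters` is a hypothetical helper, and it ignores allocation granularity and the other occupancy limiters):

```cpp
#include <algorithm>

// Back-of-the-envelope occupancy bound from register pressure alone.
// Real occupancy also depends on shared memory usage, block size, and
// other SM limits; the SM numbers below assume an A100-class GPU.
int maxThreadsByRegisters(int regsPerThread,
                          int regsPerSM = 65536,
                          int maxThreadsPerSM = 2048) {
    return std::min(maxThreadsPerSM, regsPerSM / regsPerThread);
}
```

At 32 registers per thread the register file still supports the full 2,048 resident threads; at 128 registers per thread residency is capped at 512, a quarter of the maximum.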

How Registers Work

Local variables in your kernel code typically reside in registers. The compiler decides register allocation, but you can influence it through code structure.

CUDA C++ register_example.cu
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    // idx is stored in a register - very fast access
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (idx < n) {
        // temp_a and temp_b are register variables
        float temp_a = a[idx];  // Load from global memory to register
        float temp_b = b[idx];  // Load from global memory to register
        
        // Computation happens entirely in registers
        float result = temp_a + temp_b;
        
        c[idx] = result;  // Store from register to global memory
    }
}
Pro Tip

Use nvcc --ptxas-options=-v to see register usage per kernel. If a kernel uses more than 32 registers per thread, consider reducing complexity or using __launch_bounds__ to control allocation.

Register Spilling

When a kernel uses too many registers, the compiler "spills" some to local memory (which is actually in global memory, just thread-private). This can devastate performance.

CUDA C++ avoid_spilling.cu
// Bad: large per-thread arrays are prone to spilling,
// especially when indexed with values known only at runtime
__global__ void badKernel() {
    float data[64];  // Dynamically indexed → compiler places it in local (off-chip) memory
    // ...
}

// Better: Use shared memory for larger working sets
__global__ void betterKernel() {
    __shared__ float sharedData[64];  // Block-shared, no spilling
    float myValue;  // Single register variable
    // ...
}

4. Shared Memory: The Programmable Cache

Shared memory is the secret weapon of GPU optimization. It's an on-chip memory that's shared among all threads in a block, with roughly 10-20x lower latency than global memory and much higher bandwidth.

Think of shared memory as a programmer-managed L1 cache. Unlike automatic caches, you decide exactly what data gets stored there and when.

Shared Memory Architecture
[Figure: inside a Streaming Multiprocessor (SM), shared memory's 32 banks (B0-B31) and a warp's 32 threads (T0-T31); accesses to distinct banks proceed in parallel, while two threads hitting the same bank cause a bank conflict]

Figure 2: Shared memory is organized into 32 banks. When threads access different banks, accesses happen in parallel. Conflicts serialize access.

Basic Shared Memory Usage

The most common use of shared memory is to cache data from global memory that will be accessed multiple times.

CUDA C++ shared_memory_basic.cu
#define BLOCK_SIZE 256

__global__ void reductionSum(float* input, float* output, int n) {
    // Declare shared memory for the block
    __shared__ float sdata[BLOCK_SIZE];
    
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Load from global memory to shared memory
    sdata[tid] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();  // Ensure all threads have loaded
    
    // Parallel reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();  // Sync after each reduction step
    }
    
    // Thread 0 writes the result for this block
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}
Don't Forget __syncthreads()

Always use __syncthreads() after writing to shared memory and before reading from locations written by other threads. Missing synchronization causes race conditions that may work sometimes but fail unpredictably.
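To see why the reduction loop is correct, here is a plain C++ model of the same tree reduction run serially on the host. This is a sketch for reasoning only, not CUDA code — `blockReduceModel` is a hypothetical helper, and it assumes a power-of-two input length, just as the kernel assumes a power-of-two blockDim.x:

```cpp
#include <cstddef>
#include <vector>

// Serial model of the kernel's reduction loop: each step halves the
// active range and adds element tid+s into element tid, mirroring what
// happens between __syncthreads() barriers on the GPU.
float blockReduceModel(std::vector<float> sdata) {  // by value: scratch copy
    for (std::size_t s = sdata.size() / 2; s > 0; s >>= 1) {
        for (std::size_t tid = 0; tid < s; tid++) {  // "threads" with tid < s
            sdata[tid] += sdata[tid + s];
        }
        // the barrier between steps is implicit in this serial loop
    }
    return sdata[0];
}
```

Feeding it {1, 2, 3, 4} yields 10, matching a direct sum — and the step-by-step structure makes clear why every step needs a barrier before the next one reads.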

Bank Conflicts

Shared memory is divided into 32 banks (on modern GPUs). When multiple threads in a warp access different addresses in the same bank, accesses are serialized—this is called a bank conflict.

CUDA C++ bank_conflicts.cu
__shared__ float data[1024];

// Good: Stride-1 access, no bank conflicts
// Thread i accesses bank i % 32
float val = data[threadIdx.x];  // ✓ Conflict-free

// Bad: Stride-32 access, maximum bank conflicts!
// All threads access bank 0
float val = data[threadIdx.x * 32];  // ✗ 32-way conflict

// Solution: Add padding to avoid conflicts
__shared__ float data_padded[32][33];  // 33 instead of 32!
float val = data_padded[threadIdx.y][threadIdx.x];  // ✓ Conflict-free
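A small host-side model makes the bank arithmetic concrete. Assuming 4-byte-wide banks and 32 banks (the layout on current GPUs), a word's bank is its index modulo 32, and a warp's conflict degree is the largest number of distinct words that land in one bank (`conflictDegree` is a hypothetical helper, not a CUDA API):

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// bank = word_index % 32 (4-byte words, 32 banks). The conflict degree
// is the worst-case number of distinct word addresses in a single bank:
// 1 means conflict-free, 32 means fully serialized.
int conflictDegree(const std::vector<int>& wordIndices) {
    std::map<int, std::set<int>> perBank;  // bank -> distinct words touched
    for (int w : wordIndices) perBank[w % 32].insert(w);
    std::size_t worst = 1;
    for (const auto& [bank, words] : perBank)
        worst = std::max(worst, words.size());
    return static_cast<int>(worst);
}
```

Stride-1 indices give degree 1, stride-32 gives 32, and the padded stride-33 layout is back to 1 — which is exactly why the [32][33] padding trick works.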

5. Global Memory: The Main Workhorse

Global memory is where your data lives. It's the largest memory space (tens of GBs on modern GPUs) but also the slowest. The key to fast global memory access is coalescing—combining memory accesses from multiple threads into fewer, larger transactions.

Memory Coalescing
[Figure: eight threads reading one 128-byte memory segment complete in a single transaction at full bandwidth (~2000 GB/s); the same threads scattered across memory need 5+ transactions, dropping effective bandwidth 5-10x (~200-400 GB/s)]

Figure 3: Coalesced access combines thread requests into minimal memory transactions. Scattered access requires multiple transactions, wasting bandwidth.
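The transaction counts in Figure 3 can be estimated with a few lines of host code. As a simplification, assume one transaction per distinct 128-byte segment touched by the warp's loads (`segmentsTouched` is a hypothetical helper; real hardware issues 32-byte sectors, but the proportions come out the same):

```cpp
#include <set>
#include <vector>

// Count distinct 128-byte segments touched by a warp loading 4-byte
// floats - a simplified model in which each segment costs one transaction.
int segmentsTouched(const std::vector<long>& floatIndices) {
    std::set<long> segments;
    for (long i : floatIndices)
        segments.insert((i * 4L) / 128);  // byte address / segment size
    return static_cast<int>(segments.size());
}
```

Thirty-two consecutive floats fill exactly one segment (one transaction), while a stride of 32 floats touches 32 segments — a 32x traffic blowup for the same 128 bytes of useful data.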

Achieving Coalesced Access

CUDA C++ coalescing_patterns.cu
// ✓ Good: Coalesced access - adjacent threads access adjacent memory
__global__ void coalescedAccess(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float val = data[idx];  // Thread 0 → data[0], Thread 1 → data[1], ...
    }
}

// ✗ Bad: Strided access - threads skip memory locations
__global__ void stridedAccess(float* data, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx * stride < n) {
        float val = data[idx * stride];  // Scattered access!
    }
}

// Structure of Arrays (SoA) vs Array of Structures (AoS)

// ✗ AoS: Poor coalescing
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
};
// ParticleAoS particles[N]; → Accessing x values is strided

// ✓ SoA: Perfect coalescing
struct ParticlesSoA {
    float* x;   // All x values contiguous
    float* y;   // All y values contiguous
    float* z;
    float* vx;
    float* vy;
    float* vz;
};
// Access particles.x[idx] → Coalesced!
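On the host, converting an existing AoS buffer into SoA form is a straightforward transpose. The sketch below reuses the `ParticleAoS` struct above and a hypothetical vector-backed `ParticlesHostSoA` (the pointer-based `ParticlesSoA` would be filled the same way after device allocation):

```cpp
#include <vector>

struct ParticleAoS { float x, y, z, vx, vy, vz; };

// Host-side SoA mirror, backed by std::vector for simplicity
struct ParticlesHostSoA {
    std::vector<float> x, y, z, vx, vy, vz;
};

// Transpose AoS -> SoA: afterwards each field is contiguous in memory,
// so per-field accesses from a kernel coalesce.
ParticlesHostSoA toSoA(const std::vector<ParticleAoS>& aos) {
    ParticlesHostSoA soa;
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);   soa.y.push_back(p.y);   soa.z.push_back(p.z);
        soa.vx.push_back(p.vx); soa.vy.push_back(p.vy); soa.vz.push_back(p.vz);
    }
    return soa;
}
```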

6. Constant & Texture Memory

CUDA provides two additional memory types optimized for specific access patterns:

Constant Memory

Constant memory is ideal for read-only data that's accessed uniformly by all threads. It has a dedicated cache that can broadcast a single value to every thread in a warp in one transaction.

CUDA C++ constant_memory.cu
// Declare constant memory (limited to 64KB)
__constant__ float filterKernel[25];  // 5x5 convolution kernel
__constant__ float transformMatrix[16];  // 4x4 matrix

// Copy to constant memory from host
void setupConstants(float* h_kernel, float* h_matrix) {
    cudaMemcpyToSymbol(filterKernel, h_kernel, 25 * sizeof(float));
    cudaMemcpyToSymbol(transformMatrix, h_matrix, 16 * sizeof(float));
}

__global__ void convolution(float* input, float* output, int width) {
    // All threads read the same kernel value → broadcast is efficient
    float sum = 0.0f;
    for (int i = 0; i < 25; i++) {
        sum += input[...] * filterKernel[i];  // Cached broadcast
    }
    output[...] = sum;
}

Texture Memory

Texture memory provides hardware interpolation, boundary handling, and is optimized for 2D spatial locality. It's particularly useful for image processing.

CUDA C++ texture_memory.cu
// Modern CUDA texture objects (preferred)
cudaTextureObject_t texObj;

void createTexture(float* d_data, int width, int height) {
    // Create channel descriptor
    cudaChannelFormatDesc channelDesc = 
        cudaCreateChannelDesc<float>();
    
    // Create CUDA array
    cudaArray_t cuArray;
    cudaMallocArray(&cuArray, &channelDesc, width, height);
    cudaMemcpy2DToArray(cuArray, 0, 0, d_data, 
                         width * sizeof(float), 
                         width * sizeof(float), height,
                         cudaMemcpyDeviceToDevice);
    
    // Setup texture parameters
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;
    
    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;  // Clamp at edges
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModeLinear;     // Bilinear interpolation
    texDesc.normalizedCoords = true;
    
    cudaCreateTextureObject(&texObj, &resDesc, &texDesc, nullptr);
}

__global__ void sampleTexture(cudaTextureObject_t tex, float* output) {
    float u = threadIdx.x / (float)blockDim.x;
    float v = threadIdx.y / (float)blockDim.y;
    
    // Hardware-accelerated bilinear interpolation!
    output[...] = tex2D<float>(tex, u, v);
}
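What does cudaFilterModeLinear actually compute? Below is a simplified CPU model of bilinear filtering — an illustration only, using unnormalized texel coordinates and ignoring CUDA's half-texel center offset and 9-bit weight precision (`bilinearSample` is a hypothetical helper):

```cpp
#include <cmath>
#include <vector>

// Bilinear interpolation between the four nearest texels, with clamped
// addressing at the image border - roughly what linear-filtered tex2D
// returns for a float texture.
float bilinearSample(const std::vector<float>& img, int w, int h,
                     float x, float y) {
    auto clampI = [](int v, int lo, int hi) {
        return v < lo ? lo : (v > hi ? hi : v);
    };
    int x0 = clampI(static_cast<int>(std::floor(x)), 0, w - 1);
    int y0 = clampI(static_cast<int>(std::floor(y)), 0, h - 1);
    int x1 = clampI(x0 + 1, 0, w - 1);
    int y1 = clampI(y0 + 1, 0, h - 1);
    float fx = x - std::floor(x), fy = y - std::floor(y);
    float top = img[y0 * w + x0] * (1 - fx) + img[y0 * w + x1] * fx;
    float bot = img[y1 * w + x0] * (1 - fx) + img[y1 * w + x1] * fx;
    return top * (1 - fy) + bot * fy;
}
```

Sampling a 2x2 image {0, 1, 2, 3} at its center blends all four texels with equal weight — the hardware does this per sample at no extra cost.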

7. Practical Optimization Techniques

Let's put it all together with a real-world example: optimizing matrix multiplication.

Naive Implementation

CUDA C++ matmul_naive.cu
// Naive matrix multiplication - poor memory access pattern
__global__ void matmulNaive(float* A, float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            // Each thread reads N elements of A and N of B from global memory.
            // Across a warp, A[row*N+k] is a single broadcast address and
            // B[k*N+col] is coalesced - the real problem is that nothing is
            // kept on-chip, so each matrix element is re-fetched N times.
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
// Performance: ~200-400 GFLOPS on A100

Optimized with Shared Memory Tiling

CUDA C++ matmul_tiled.cu
#define TILE_SIZE 32

__global__ void matmulTiled(float* A, float* B, float* C, int N) {
    // Shared memory tiles for A and B
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];
    
    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    
    int row = by * TILE_SIZE + ty;
    int col = bx * TILE_SIZE + tx;
    
    float sum = 0.0f;
    
    // Loop over tiles
    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Collaborative loading: each thread loads one element
        int aCol = t * TILE_SIZE + tx;
        int bRow = t * TILE_SIZE + ty;
        
        As[ty][tx] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[ty][tx] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        
        __syncthreads();
        
        // Compute partial dot product using shared memory
        #pragma unroll
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += As[ty][k] * Bs[k][tx];  // Fast shared memory access!
        }
        
        __syncthreads();
    }
    
    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}
// Performance: ~2000-5000 GFLOPS on A100 (10-20x improvement!)
Performance Impact

The tiled version reduces global memory traffic by a factor of TILE_SIZE. Instead of reading 2N elements from global memory per output element, each thread loads just 2N/TILE_SIZE (two per tile), and every value staged in shared memory is reused TILE_SIZE times.
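The factor is easy to check with a little arithmetic. The hypothetical helpers below count only A/B element loads and ignore caches:

```cpp
// Global-memory loads per thread in the two kernels above,
// counting only A and B element reads (caches ignored).
long naiveLoadsPerThread(long N) {
    return 2 * N;  // N elements of A plus N of B
}

long tiledLoadsPerThread(long N, long tile) {
    long numTiles = (N + tile - 1) / tile;
    return 2 * numTiles;  // one element of As and one of Bs per tile
}
```

For N = 4096 and TILE_SIZE = 32 that is 8,192 versus 256 loads per thread — the promised 32x reduction in global memory traffic.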

8. Summary & Best Practices

Here are the key takeaways for optimizing CUDA memory access:

Memory Optimization Checklist

  • Minimize global memory access - Use shared memory to cache repeatedly accessed data
  • Coalesce memory access - Ensure adjacent threads access adjacent memory locations
  • Avoid bank conflicts - Use padding when accessing shared memory with strides
  • Use Structure of Arrays (SoA) - Better coalescing than Array of Structures
  • Leverage constant memory - For read-only data broadcast to all threads
  • Profile with Nsight - Use tools to identify memory bottlenecks
  • Watch register usage - Avoid spilling to local memory

Understanding memory hierarchy is fundamental to GPU programming. While modern GPUs have incredible compute power, the memory subsystem is almost always the bottleneck. By carefully managing data placement and access patterns, you can often achieve 10-100x speedups over naive implementations.

In the next post, we'll explore CUDA profiling with Nsight Compute and learn how to identify and fix memory bottlenecks in real-world applications.

Alireza Olama

Postdoctoral Researcher at Åbo Akademi University, specializing in distributed machine learning systems, GPU programming, and HPC optimization.