1. Introduction & History

The Roofline Model, introduced by Samuel Williams, Andrew Waterman, and David Patterson at UC Berkeley in 2009, provides a visual framework to understand the performance limits of a kernel on a given architecture.

The core insight: at any given arithmetic intensity, a kernel is limited by either compute throughput or memory bandwidth. The roofline diagram shows you which limit applies and by how much.

Why Roofline Matters

Before roofline, developers would optimize blindly - adding more compute optimizations to a memory-bound kernel, or vice versa. The roofline model tells you which optimization will actually help before you write any code.

The Classic Roofline Model
[Figure: log-log roofline with arithmetic intensity (FLOP/Byte, 1/16 to 256) on the X-axis and attainable performance (GFLOP/s, 1 to 10K) on the Y-axis; the sloped bandwidth ceiling meets the flat compute ceiling (peak FLOPS) at the ridge point, with SAXPY, GEMM, and Conv2D shown as example kernels in the memory-bound and compute-bound regions.]
Log-log plot: X-axis is arithmetic intensity, Y-axis is performance. The "roof" defines maximum achievable performance.

Key terminology: arithmetic intensity (FLOP/Byte), the bandwidth and compute ceilings, the ridge point where they meet, and the memory-bound vs compute-bound regimes on either side of it; each is defined in the next section.

2. The Roofline Model Fundamentals

Arithmetic Intensity

Arithmetic Intensity (AI) is the ratio of floating-point operations to bytes moved from memory:

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Transferred}} \quad \left[\frac{\text{FLOP}}{\text{Byte}}\right]$$

This single metric determines whether your kernel is memory-bound or compute-bound:

Computing Arithmetic Intensity
Vector addition (C = A + B): FLOPs = N additions; bytes = 2N reads + N writes = 3N × 4; AI = N / 12N = 1/12 ≈ 0.083.
Dot product (s = Σ aᵢbᵢ): FLOPs = N multiplies + N additions = 2N; bytes = 2N reads × 4 = 8N; AI = 2N / 8N = 0.25.
Matrix multiply (C = A × B): FLOPs = 2N³ (N² outputs × 2N ops each); bytes = 3N² × 4 (naive, no tiling); AI = 2N³ / 12N² = N/6.
On the arithmetic intensity scale this puts vector add at ~0.08, dot product at 0.25, and GEMM with N=1K at ~167, versus an A100 FP32 ridge point of ~9.6.
Low AI = memory-bound, High AI = compute-bound. Ridge point depends on hardware.
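The same bookkeeping takes only a few lines of Python; a minimal sketch using the FP32 counts from the figure above (the helper name and the element counts are illustrative):

Python ai_quick_check.py
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """AI = useful FLOPs divided by bytes moved to/from memory."""
    return flops / bytes_moved

n = 1_000_000  # element count, illustrative

# Vector add: 1 FLOP/element, 2 reads + 1 write of FP32
print(arithmetic_intensity(n, 3 * n * 4))            # ~0.083
# Dot product: 2 FLOPs/element, 2 reads of FP32
print(arithmetic_intensity(2 * n, 2 * n * 4))        # 0.25
# Square GEMM: 2N^3 FLOPs, 3N^2 FP32 values moved (naive, no tiling)
N = 1024
print(arithmetic_intensity(2 * N**3, 3 * N**2 * 4))  # ~170.7, i.e. N/6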

The Roofline Equation

The attainable performance is bounded by:

$$P_{\text{attainable}} = \min\left( P_{\text{peak}}, \quad \text{BW}_{\text{peak}} \times \text{AI} \right)$$

Where $P_{\text{peak}}$ is the peak compute throughput (GFLOP/s), $\text{BW}_{\text{peak}}$ is the peak memory bandwidth (GB/s), and AI is the kernel's arithmetic intensity (FLOP/Byte).

On a log-log plot, this creates two lines:

  1. Bandwidth ceiling: $\log(P) = \log(\text{BW}) + \log(\text{AI})$ → slope of 1
  2. Compute ceiling: $\log(P) = \log(P_{\text{peak}})$ → horizontal line
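Plugging the A100 numbers used later in this article into that min() makes the two regimes concrete; a minimal sketch (the two sample intensities, 0.25 and 100 FLOP/Byte, are just illustrative points on either side of the ridge):

Python roofline_bound.py
def attainable_gflops(ai: float, peak_gflops: float = 19_500, bw_gbs: float = 2_039) -> float:
    """Roofline bound: P_attainable = min(peak compute, bandwidth * AI)."""
    return min(peak_gflops, bw_gbs * ai)

print(attainable_gflops(0.25))   # 509.75 -> the bandwidth ceiling applies (memory-bound)
print(attainable_gflops(100.0))  # 19500  -> the compute ceiling applies (compute-bound)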

Compute-Bound vs Memory-Bound

The ridge point is where the two ceilings meet:

$$\text{AI}_{\text{ridge}} = \frac{P_{\text{peak}}}{\text{BW}_{\text{peak}}}$$
Example: NVIDIA A100

Peak FP32: 19.5 TFLOP/s = 19,500 GFLOP/s
HBM2e Bandwidth: 2,039 GB/s
Ridge Point: 19,500 / 2,039 = 9.6 FLOP/Byte

Kernels with AI < 9.6 are memory-bound, AI > 9.6 are compute-bound.

3. Computing Peak Performance

FLOPS Calculation

Peak theoretical FLOPS depends on:

$$P_{\text{peak}} = \text{Cores} \times \text{Clock} \times \text{FLOPs/cycle/core}$$

For GPUs, the calculation differs by unit type:

Python peak_flops.py
def calculate_peak_flops(gpu: str) -> dict:
    """Calculate theoretical peak FLOPS for common GPUs."""
    
    specs = {
        "A100_80GB": {
            "sm_count": 108,
            "fp32_cores_per_sm": 64,
            "fp64_cores_per_sm": 32,
            "tensor_cores_per_sm": 4,
            "clock_ghz": 1.41,  # Boost clock
        },
        "H100_SXM": {
            "sm_count": 132,
            "fp32_cores_per_sm": 128,
            "fp64_cores_per_sm": 64,
            "tensor_cores_per_sm": 4,
            "clock_ghz": 1.83,
        },
    }
    
    s = specs[gpu]
    
    # CUDA cores: 2 FLOP/cycle (FMA = 1 multiply + 1 add).
    # Clock is in GHz, so the product is directly in GFLOP/s.
    fp32_peak = s["sm_count"] * s["fp32_cores_per_sm"] * s["clock_ghz"] * 2
    fp64_peak = s["sm_count"] * s["fp64_cores_per_sm"] * s["clock_ghz"] * 2
    
    return {
        "FP32 (TFLOPS)": fp32_peak / 1000,
        "FP64 (TFLOPS)": fp64_peak / 1000,
    }

# A100 Example:
# FP32: 108 SM × 64 cores × 1.41 GHz × 2 FLOP/cycle = 19.5 TFLOPS
# FP64: 108 SM × 32 cores × 1.41 GHz × 2 FLOP/cycle = 9.7 TFLOPS

Precision Effects (FP64 → INT8)

Different precisions have vastly different peak throughputs:

| Precision | A100 (CUDA) | A100 (Tensor) | H100 (Tensor) | Ratio vs FP32 |
|---|---|---|---|---|
| FP64 | 9.7 TFLOPS | 19.5 TFLOPS | 67 TFLOPS | 0.5× |
| FP32 | 19.5 TFLOPS | 156 TFLOPS* | 989 TFLOPS* | 1× (baseline) |
| TF32 | | 156 TFLOPS | 989 TFLOPS | 8× |
| FP16/BF16 | 39 TFLOPS | 312 TFLOPS | 1,979 TFLOPS | 16× |
| INT8 | | 624 TOPS | 3,958 TOPS | 32× |

*TF32 uses FP32 inputs but computes with reduced mantissa precision on Tensor Cores

Tensor Cores Peak

Tensor Cores perform matrix multiply-accumulate on small tiles:

$$D = A \times B + C \quad \text{where } A, B, C, D \text{ are small matrices}$$
Python tensor_core_peak.py
def tensor_core_peak(gpu: str, precision: str) -> float:
    """
    Tensor Core peak calculation.
    
    A100 Tensor Core: 256 FP16 FMA ops/cycle per TC
    4 TCs per SM × 108 SMs × 1.41 GHz × 256 × 2 = 312 TFLOPS
    """
    
    tc_specs = {
        "A100": {
            "tcs_per_sm": 4,
            "sm_count": 108,
            "clock_ghz": 1.41,
            "fp16_ops_per_tc_per_cycle": 256,  # 16×16×16 MMA
        },
        "H100": {
            "tcs_per_sm": 4,
            "sm_count": 132,
            "clock_ghz": 1.83,
            "fp16_ops_per_tc_per_cycle": 512,  # Larger MMA
        },
    }
    
    s = tc_specs[gpu]
    
    # Base FP16 calculation
    fp16_tflops = (s["tcs_per_sm"] * s["sm_count"] * s["clock_ghz"] 
                   * s["fp16_ops_per_tc_per_cycle"] * 2 / 1000)
    
    # Precision multipliers
    multipliers = {
        "FP16": 1.0,
        "BF16": 1.0,
        "TF32": 0.5,
        "FP64": 0.0625,  # 1/16
        "INT8": 2.0,
    }
    
    return fp16_tflops * multipliers[precision]
A100 Peak Performance by Precision (Log Scale)
FP64: 9.7 TFLOPS · FP32: 19.5 TFLOPS · TF32 (TC): 156 TFLOPS · FP16 (TC): 312 TFLOPS · INT8 (TC): 624 TOPS. Tensor Cores deliver 16× the FP32 throughput and 32× the FP64 throughput.
Tensor Cores (TC) provide massive speedups for supported precisions

4. Computing Peak Bandwidth

Memory Hierarchy

Modern GPUs have multiple memory levels, each with different bandwidth:

GPU Memory Hierarchy (A100)
Registers: 256 KB/SM × 108 SMs (~27 MB total), ~20 TB/s, ~1 cycle latency.
L1 cache / shared memory: 192 KB/SM, configurable (~20 MB total), ~19 TB/s, ~28 cycle latency.
L2 cache: 40 MB shared across all SMs, ~5 TB/s, ~200 cycle latency.
HBM2e (global memory): 80 GB (5 stacks × 8 channels), 2,039 GB/s, ~400 cycle latency.
Higher = faster but smaller. Most data lives in HBM.

Theoretical Bandwidth

HBM bandwidth is calculated from memory specifications:

$$\text{BW}_{\text{theoretical}} = \text{Memory Clock} \times \text{Bus Width} \times \text{Data Rate}$$
Python bandwidth_calc.py
def calculate_hbm_bandwidth(gpu: str) -> float:
    """Calculate theoretical HBM bandwidth."""
    
    hbm_specs = {
        "A100_80GB": {
            # HBM2e specifications
            "memory_clock_mhz": 1593,       # Effective clock
            "bus_width_bits": 5120,         # 5 stacks × 1024 bits
            "data_rate": 2,                 # DDR = 2 transfers/cycle
        },
        "H100_SXM": {
            # HBM3 specifications
            "memory_clock_mhz": 2619,
            "bus_width_bits": 5120,
            "data_rate": 2,
        },
    }
    
    s = hbm_specs[gpu]
    
    # BW (GB/s) = Clock (MHz) × Width (bits) × DataRate / 8 / 1000
    bw_gbps = (s["memory_clock_mhz"] * s["bus_width_bits"] 
               * s["data_rate"] / 8 / 1000)
    
    return bw_gbps

# A100 80GB:
# 1593 MHz × 5120 bits × 2 / 8 / 1000 = 2,039 GB/s

# H100 SXM:
# 2619 MHz × 5120 bits × 2 / 8 / 1000 = 3,352 GB/s

A100 80GB: 2,039 GB/s HBM2e bandwidth
H100 SXM: 3,352 GB/s HBM3 bandwidth
A100 L2 cache: ~5,000 GB/s
A100 L1/shared memory: ~19,000 GB/s

Multi-Ceiling Rooflines

Since each memory level has different bandwidth, we can draw multiple rooflines:

Multi-Ceiling Roofline (A100)
[Figure: roofline with separate bandwidth ceilings for HBM (2,039 GB/s), L2 (~5 TB/s), and L1/shared (~19 TB/s), and compute ceilings for FP32 CUDA cores (19.5 TFLOPS), TF32 Tensor Cores (156 TFLOPS), and FP16 Tensor Cores (312 TFLOPS).]
Multiple memory bandwidth ceilings and compute ceilings show different performance bounds
Key Insight

A kernel that's memory-bound at the HBM level might become compute-bound if data fits in L2 cache. This is why tiling is so important - it moves your working set to faster memory levels.
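One way to quantify that insight is to compute a ridge point per memory level; a minimal sketch using the approximate A100 numbers quoted above:

Python per_level_ridge.py
# Ridge point per memory level = peak compute / bandwidth of that level.
peak_fp32_gflops = 19_500
level_bw_gbs = {"HBM": 2_039, "L2": 5_000, "L1/Shared": 19_000}

for level, bw in level_bw_gbs.items():
    print(f"{level:>10}: ridge = {peak_fp32_gflops / bw:.2f} FLOP/Byte")

# HBM: 9.56, L2: 3.90, L1/Shared: 1.03 -- the same kernel needs far less
# arithmetic intensity to become compute-bound once its data lives in cache.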

5. Building Your Roofline

To create a roofline for your GPU, you need:

  1. Peak Compute: From specs or microbenchmark
  2. Peak Bandwidth: From specs or STREAM benchmark
  3. Ridge Point: Compute / Bandwidth
Python build_roofline.py
import numpy as np
import matplotlib.pyplot as plt

def build_roofline(
    peak_gflops: float,
    peak_bw_gbs: float,
    ai_range: tuple = (0.01, 1000),
):
    """
    Build and plot a roofline model.
    
    Args:
        peak_gflops: Peak compute performance in GFLOP/s
        peak_bw_gbs: Peak memory bandwidth in GB/s
        ai_range: (min, max) arithmetic intensity range
    """
    # Ridge point
    ridge_ai = peak_gflops / peak_bw_gbs
    print(f"Ridge Point: {ridge_ai:.2f} FLOP/Byte")
    
    # Generate AI values (log scale)
    ai = np.logspace(np.log10(ai_range[0]), 
                      np.log10(ai_range[1]), 1000)
    
    # Roofline equation: P = min(peak, BW × AI)
    performance = np.minimum(peak_gflops, peak_bw_gbs * ai)
    
    # Plot
    plt.figure(figsize=(10, 7))
    plt.loglog(ai, performance, 'b-', linewidth=2.5, label='Roofline')
    
    # Mark ridge point
    plt.axvline(x=ridge_ai, color='purple', linestyle='--', alpha=0.7)
    plt.scatter([ridge_ai], [peak_gflops], color='purple', s=100, 
                zorder=5, label=f'Ridge ({ridge_ai:.1f} FLOP/B)')
    
    # Labels
    plt.xlabel('Arithmetic Intensity (FLOP/Byte)', fontsize=12)
    plt.ylabel('Performance (GFLOP/s)', fontsize=12)
    plt.title('Roofline Model', fontsize=14)
    
    # Add region annotations
    plt.text(ridge_ai / 10, peak_gflops / 5, 'Memory\nBound', 
            fontsize=12, color='blue', ha='center')
    plt.text(ridge_ai * 10, peak_gflops / 2, 'Compute\nBound', 
            fontsize=12, color='red', ha='center')
    
    plt.grid(True, which='both', alpha=0.3)
    plt.legend()
    plt.tight_layout()
    plt.show()
    
    return ridge_ai


# Example: A100 FP32
ridge = build_roofline(
    peak_gflops=19500,     # 19.5 TFLOPS
    peak_bw_gbs=2039,      # 2039 GB/s
)
# Output: Ridge Point: 9.56 FLOP/Byte

6. Worked Examples

Example 1: Vector Scaling (SAXPY)

SAXPY: $y = \alpha x + y$

Python saxpy_analysis.py
def analyze_saxpy(n: int, dtype_bytes: int = 4):
    """
    SAXPY: y = α * x + y
    
    For each element:
    - Read: x[i], y[i] → 2 reads
    - Compute: multiply, add → 2 FLOPs  
    - Write: y[i] → 1 write
    """
    
    # FLOPs: 2 per element (1 mul + 1 add)
    flops = 2 * n
    
    # Bytes: read x, read y, write y = 3 × n × sizeof(dtype)
    bytes_transferred = 3 * n * dtype_bytes
    
    # Arithmetic Intensity
    ai = flops / bytes_transferred
    
    print(f"SAXPY Analysis (n={n:,}, FP32)")
    print(f"  FLOPs: {flops:,}")
    print(f"  Bytes: {bytes_transferred:,}")
    print(f"  AI: {ai:.4f} FLOP/Byte")
    
    # A100 Analysis
    a100_peak = 19500  # GFLOP/s
    a100_bw = 2039     # GB/s
    a100_ridge = a100_peak / a100_bw
    
    print(f"\n  A100 Ridge: {a100_ridge:.2f} FLOP/Byte")
    print(f"  Status: {'Memory-bound' if ai < a100_ridge else 'Compute-bound'}")
    
    # Expected performance
    expected_gflops = min(a100_peak, a100_bw * ai)
    print(f"  Expected perf: {expected_gflops:.1f} GFLOP/s")
    print(f"  Efficiency: {100 * expected_gflops / a100_peak:.1f}% of peak compute")
    
    return ai

analyze_saxpy(100_000_000)
# Output:
# SAXPY Analysis (n=100,000,000, FP32)
#   FLOPs: 200,000,000
#   Bytes: 1,200,000,000
#   AI: 0.1667 FLOP/Byte
#
#   A100 Ridge: 9.56 FLOP/Byte
#   Status: Memory-bound
#   Expected perf: 339.8 GFLOP/s
#   Efficiency: 1.7% of peak compute
SAXPY Reality Check

SAXPY can only achieve 1.7% of the A100's compute capacity! It's so memory-bound that optimizing compute is pointless - you need to reduce memory traffic (e.g., kernel fusion).

Example 2: Matrix Multiplication (GEMM)

GEMM: $C = A \times B$ where $A$ is $M \times K$ and $B$ is $K \times N$

Python gemm_analysis.py
def analyze_gemm(M: int, N: int, K: int, dtype_bytes: int = 4):
    """
    GEMM: C = A × B
    
    A: M×K, B: K×N, C: M×N
    
    FLOPs: M × N × (2K - 1) ≈ 2MNK for large K
           (K multiplies + K-1 adds per output element)
    
    Bytes (naive, no tiling):
           Read A, Read B, Write C = (MK + KN + MN) × dtype_bytes
    """
    
    # FLOPs
    flops = 2 * M * N * K
    
    # Bytes (assuming matrices read once from HBM)
    bytes_transferred = (M*K + K*N + M*N) * dtype_bytes
    
    # Arithmetic Intensity
    ai = flops / bytes_transferred
    
    print(f"GEMM Analysis ({M}×{K} @ {K}×{N}, FP32)")
    print(f"  FLOPs: {flops:,}")
    print(f"  Bytes: {bytes_transferred:,}")
    print(f"  AI: {ai:.2f} FLOP/Byte")
    
    # A100 Analysis
    a100_ridge = 9.56
    status = 'Compute-bound' if ai > a100_ridge else 'Memory-bound'
    print(f"  Status: {status}")
    
    return ai

# Small matrices: lower AI, but already past the FP32 ridge
analyze_gemm(128, 128, 128)
# AI: 21.33 FLOP/Byte → Compute-bound

# Larger = higher AI (more reuse per byte loaded)
analyze_gemm(4096, 4096, 4096)
# AI: 682.67 FLOP/Byte → Very compute-bound

# For square FP32 matrices: AI ≈ N/6
# N=128: AI ≈ 21
# N=4096: AI ≈ 683

Example 3: Reduction (Sum)

Python reduction_analysis.py
def analyze_reduction(n: int, dtype_bytes: int = 4):
    """
    Reduction: s = sum(x)
    
    FLOPs: n-1 additions
    Bytes: n reads + 1 write (negligible)
    """
    
    flops = n - 1
    bytes_transferred = n * dtype_bytes  # Ignore single output write
    ai = flops / bytes_transferred
    
    print(f"Reduction (n={n:,})")
    print(f"  AI: {ai:.4f} FLOP/Byte")
    print(f"  Status: Extremely memory-bound")
    
    return ai

analyze_reduction(100_000_000)
# AI: 0.25 FLOP/Byte - Severely memory-bound!

Summary: Common Operations

| Operation | FLOPs | Bytes | AI (FP32) | Status on A100 |
|---|---|---|---|---|
| Vector Add | N | 12N | 0.083 | 🔵 Memory-bound |
| SAXPY | 2N | 12N | 0.167 | 🔵 Memory-bound |
| Dot Product | 2N | 8N | 0.25 | 🔵 Memory-bound |
| Sum Reduction | N | 4N | 0.25 | 🔵 Memory-bound |
| Softmax | 5N | 8N | 0.625 | 🔵 Memory-bound |
| LayerNorm | 8N | 8N | 1.0 | 🔵 Memory-bound |
| GEMM (N=128) | 2N³ | 12N² | ~21 | 🔴 Compute-bound |
| GEMM (N=4096) | 2N³ | 12N² | ~683 | 🔴 Compute-bound |

7. Achievable vs Theoretical Performance

Theoretical peak performance is a ceiling you'll never reach. Real-world achievable performance depends on many factors: memory access patterns, instruction mix, occupancy, and more.

Critical Distinction

Theoretical Peak: What the hardware could do if perfectly utilized (from specs).
Achievable Peak: What you can actually measure with optimized benchmarks.
The gap between them is typically 10-30% for bandwidth, 5-15% for compute.
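A rough sketch of how to fold those gaps into the model before drawing the roof (the 8% and 5% derating factors are illustrative, in line with the measured numbers later in this section):

Python derated_peaks.py
theoretical_bw_gbs = 2_039        # A100 HBM2e, from the spec sheet
theoretical_fp32_gflops = 19_500  # A100 FP32 CUDA cores

achievable_bw = theoretical_bw_gbs * (1 - 0.08)            # ~1,876 GB/s, STREAM-like
achievable_compute = theoretical_fp32_gflops * (1 - 0.05)  # ~18,525 GFLOP/s

print(f"Realistic FP32 ridge: {achievable_compute / achievable_bw:.1f} FLOP/Byte")
# ~9.9 FLOP/Byte, versus 9.6 from pure spec-sheet numbers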

STREAM Benchmark

The STREAM benchmark measures achievable memory bandwidth using simple vector operations:

| Kernel | Operation | Bytes/Element | FLOPs/Element |
|---|---|---|---|
| Copy | a[i] = b[i] | 16 (2 × 8 for FP64) | 0 |
| Scale | a[i] = q × b[i] | 16 | 1 |
| Add | a[i] = b[i] + c[i] | 24 | 1 |
| Triad | a[i] = b[i] + q × c[i] | 24 | 2 |
Python stream_benchmark.py
import torch
import time

def stream_triad(n: int, dtype=torch.float64, device="cuda", iterations=100):
    """STREAM Triad: a = b + scalar * c"""
    
    # Allocate arrays
    a = torch.zeros(n, dtype=dtype, device=device)
    b = torch.ones(n, dtype=dtype, device=device)
    c = torch.ones(n, dtype=dtype, device=device)
    scalar = 3.0
    
    # Warmup (write into the preallocated output so allocation isn't timed)
    for _ in range(10):
        torch.add(b, c, alpha=scalar, out=a)   # a = b + scalar * c
    
    torch.cuda.synchronize()
    
    # Timed iterations
    start = time.perf_counter()
    for _ in range(iterations):
        torch.add(b, c, alpha=scalar, out=a)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    
    # Calculate bandwidth
    bytes_per_elem = a.element_size()
    total_bytes = 3 * n * bytes_per_elem * iterations  # 2 read + 1 write
    bandwidth_gbs = total_bytes / elapsed / 1e9
    
    print(f"STREAM Triad (n={n:,}, {dtype})")
    print(f"  Measured bandwidth: {bandwidth_gbs:.1f} GB/s")
    
    return bandwidth_gbs

# Run on A100
bw = stream_triad(n=100_000_000)  # 100M elements
# Typical output: ~1800-1900 GB/s (vs 2039 theoretical)
# Efficiency: ~88-93%
STREAM Benchmark Results: Theoretical vs Achievable
A100 80GB: theoretical 2,039 GB/s vs STREAM Triad ~1,880 GB/s (≈92%). H100 SXM: theoretical 3,352 GB/s vs STREAM Triad ~3,040 GB/s (≈91%). Why the gap: ECC overhead (~6-7%), memory-controller inefficiency, address-translation overhead, refresh cycles, and bank conflicts. Use STREAM results for realistic rooflines.
STREAM Triad measures achievable memory bandwidth - use this for roofline analysis

GEMM Benchmark

For compute-bound operations, the GEMM benchmark measures achievable compute throughput:

Python gemm_benchmark.py
import torch
import time

def gemm_benchmark(
    M: int, N: int, K: int,
    dtype=torch.float16,
    device="cuda",
    iterations=100
):
    """Benchmark matrix multiplication throughput."""
    
    # Allocate matrices
    A = torch.randn(M, K, dtype=dtype, device=device)
    B = torch.randn(K, N, dtype=dtype, device=device)
    
    # Warmup (important for Tensor Cores!)
    for _ in range(20):
        C = torch.mm(A, B)
    
    torch.cuda.synchronize()
    
    # Timed iterations
    start = time.perf_counter()
    for _ in range(iterations):
        C = torch.mm(A, B)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    
    # Calculate FLOPS
    flops_per_gemm = 2 * M * N * K
    total_flops = flops_per_gemm * iterations
    tflops = total_flops / elapsed / 1e12
    
    print(f"GEMM Benchmark ({M}×{K} @ {K}×{N}, {dtype})")
    print(f"  Measured: {tflops:.1f} TFLOPS")
    
    return tflops

# Large square matrices for peak throughput
tflops = gemm_benchmark(8192, 8192, 8192, dtype=torch.float16)
# A100 FP16 TC: ~280 TFLOPS (vs 312 theoretical) = 90%
# H100 FP16 TC: ~1700 TFLOPS (vs 1979 theoretical) = 86%

# Shape matters for efficiency
gemm_benchmark(1024, 1024, 1024, dtype=torch.float16)  # ~200 TFLOPS
gemm_benchmark(256, 256, 256, dtype=torch.float16)    # ~50 TFLOPS (poor!)
GEMM Efficiency vs Matrix Size (A100 FP16 Tensor Cores)
Against the 312 TFLOPS peak, measured FP16 Tensor Core throughput grows with matrix size: N=128 → 8 TFLOPS, 256 → 42, 512 → 117, 1K → 217, 2K → 261, 4K → 277, 8K → 285, 16K → 292. Efficiency is roughly 13% at N=256, ~70% at N=1024, and ~90% for N ≥ 4096.
GEMM efficiency improves dramatically with matrix size - small matrices waste Tensor Core capacity

The Efficiency Gap

Multiple factors contribute to the gap between theoretical and achievable performance:

Sources of Performance Loss
Memory-bound operations (A100 HBM): theoretical peak 2,039 GB/s (100%) → after ECC ~1,905 GB/s (93%) → after TLB/paging overhead ~1,850 GB/s (91%) → after bank conflicts ~1,800 GB/s (88%) → achievable ~1,750 GB/s (86%). Compute-bound operations (FP16 Tensor Cores): theoretical peak 312 TFLOPS (100%) → instruction overhead ~294 TFLOPS (94%) → memory latency ~286 TFLOPS (92%) → occupancy limits ~280 TFLOPS (90%) → achievable ~275 TFLOPS (88%). Key insight: use measured (achievable) values for roofline analysis; theoretical peaks overestimate what's possible by 10-15%.
Performance loss cascade - each factor compounds to reduce achievable throughput
Python achievable_roofline.py
def build_realistic_roofline(gpu: str):
    """Build roofline with achievable (not theoretical) peaks."""
    
    # Achievable peaks from benchmarks
    achievable = {
        "A100_80GB": {
            # Memory bandwidth (STREAM Triad)
            "hbm_bw_gbs": 1880,    # vs 2039 theoretical
            "l2_bw_gbs": 4500,     # vs ~5000 theoretical
            "l1_bw_gbs": 17000,    # vs ~19000 theoretical
            
            # Compute throughput (GEMM benchmark)
            "fp32_tflops": 18.5,    # vs 19.5 theoretical
            "tf32_tflops": 140,     # vs 156 theoretical
            "fp16_tflops": 285,     # vs 312 theoretical
        },
        "H100_SXM": {
            "hbm_bw_gbs": 3040,     # vs 3352 theoretical
            "l2_bw_gbs": 10000,
            "l1_bw_gbs": 30000,
            "fp32_tflops": 55,
            "tf32_tflops": 850,
            "fp16_tflops": 1700,
        },
    }
    
    a = achievable[gpu]
    
    # Ridge points with achievable values
    ridge_fp32 = a["fp32_tflops"] * 1000 / a["hbm_bw_gbs"]
    ridge_fp16 = a["fp16_tflops"] * 1000 / a["hbm_bw_gbs"]
    
    print(f"{gpu} Achievable Roofline:")
    print(f"  FP32 Ridge: {ridge_fp32:.1f} FLOP/B")
    print(f"  FP16 Ridge: {ridge_fp16:.1f} FLOP/B")
    
    return a

build_realistic_roofline("A100_80GB")
# A100 FP32 Ridge: 9.8 FLOP/B (vs 9.6 theoretical)
# A100 FP16 Ridge: 151.6 FLOP/B (vs 153 theoretical)

8. Cache-Aware Roofline Analysis

The basic roofline assumes all data comes from main memory (HBM). In reality, caches dramatically change the picture.

Hierarchical Roofline Model

The Hierarchical Roofline Model (HRM) accounts for different memory levels:

$$P_{\text{attainable}} = \min\left( P_{\text{peak}}, \quad \min_{i}\left( \text{BW}_i \times \text{AI}_i \right) \right)$$

Where $i$ iterates over memory levels (L1, L2, HBM) and $\text{AI}_i$ is the arithmetic intensity relative to that level.
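A minimal sketch of that bound (the per-level intensities are measured inputs, e.g. from a profiler; the bandwidths are the achievable A100 values used elsewhere in this article):

Python hierarchical_bound.py
def hierarchical_bound(ai_per_level: dict, peak_gflops: float = 19_500) -> float:
    """Attainable GFLOP/s = min(compute peak, min over levels of BW_i * AI_i)."""
    bw_gbs = {"HBM": 1_880, "L2": 4_500, "L1": 17_000}  # achievable A100 values
    memory_ceilings = [bw_gbs[level] * ai for level, ai in ai_per_level.items()]
    return min([peak_gflops] + memory_ceilings)

# Hypothetical kernel: plenty of reuse relative to HBM, much less relative to L1
print(hierarchical_bound({"HBM": 50.0, "L2": 4.0, "L1": 1.0}))
# -> 17000.0: the HBM (94,000) and L2 (18,000) ceilings sit above the answer,
#    but the L1 ceiling (17,000) binds before the compute peak (19,500) does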

Hierarchical Roofline: Same Kernel, Different Cache Behavior
[Figure: hierarchical roofline with L1 (19 TB/s), L2 (5 TB/s), and HBM (1.88 TB/s) bandwidth ceilings under the FP16 Tensor Core ceiling (285 TFLOPS); the same kernel lands on a different roofline depending on whether its working set hits in L1, hits in L2, or streams from HBM (cold cache).]
The same kernel can land on different rooflines depending on cache behavior

Data Locality Effects

Data locality determines which bandwidth ceiling limits your kernel:

Python locality_analysis.py
def analyze_locality(
    working_set_bytes: int,
    l1_size: int = 192 * 1024,      # 192 KB per SM
    l2_size: int = 40 * 1024 * 1024,  # 40 MB total
):
    """Determine which memory level bounds a kernel."""
    
    if working_set_bytes <= l1_size:
        return "L1", 19000  # GB/s
    elif working_set_bytes <= l2_size:
        return "L2", 5000
    else:
        return "HBM", 1880

# Example: Matrix multiply tile sizes
for tile_size in [32, 64, 128, 256, 512]:
    # 3 tiles for A, B, C
    working_set = 3 * tile_size * tile_size * 4  # FP32
    level, bw = analyze_locality(working_set)
    print(f"Tile {tile_size}×{tile_size}: {working_set//1024} KB → {level} ({bw} GB/s)")

# Output:
# Tile 32×32: 12 KB → L1 (19000 GB/s)
# Tile 64×64: 48 KB → L1 (19000 GB/s)
# Tile 128×128: 192 KB → L1 (19000 GB/s)  ← Fits exactly!
# Tile 256×256: 768 KB → L2 (5000 GB/s)
# Tile 512×512: 3072 KB → L2 (5000 GB/s)

Impact of Tiling

Tiling (blocking) is the primary technique to improve data locality and move kernels to faster rooflines:

Effect of Tiling on GEMM Performance
Naive GEMM (no tiling): each output element re-reads a full row of A and a full column of B, so O(N³) memory traffic accompanies the O(N³) FLOPs; data streams from HBM, cache utilization is poor, and practical performance is roughly 60% of peak even though the operation is nominally compute-bound. Tiled GEMM (e.g. 128×128 tiles): tiles are loaded into shared memory and reused, cutting main-memory traffic by roughly the tile size; the ~192 KB working set (three FP32 tiles) lives in L1/shared memory at ~19 TB/s, and well-tuned implementations reach ~92% of peak.
Tiling moves data access from HBM to L1, dramatically improving performance
Tiling Rule of Thumb

Tile size should maximize L1/shared memory usage without spilling. For A100: aim for ~192KB working set per SM. This typically means 64×64 to 128×128 tiles for FP32, larger for FP16.
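A back-of-the-envelope sketch of why tiling matters so much, using standard traffic estimates for square FP32 GEMM (with no reuse every output element re-reads 2N inputs; with T×T tiles each input is streamed roughly N/T times, so these are approximations, not measurements):

Python tiling_traffic.py
def gemm_hbm_traffic_bytes(n, tile=None, dtype_bytes=4):
    """Approximate HBM traffic for a square N x N x N GEMM (FP32 by default)."""
    if tile is None:
        # No reuse: every output element re-reads a row of A and a column of B
        return (2 * n**3 + n**2) * dtype_bytes
    # With T x T tiling, A and B are each streamed from HBM ~N/T times in total
    return (2 * n**2 * (n / tile) + n**2) * dtype_bytes

n, flops = 4096, 2 * 4096**3
for tile in (None, 128):
    traffic = gemm_hbm_traffic_bytes(n, tile)
    label = "naive" if tile is None else f"tiled {tile}x{tile}"
    print(f"{label:>14}: {traffic / 1e9:7.1f} GB of HBM traffic, "
          f"AI = {flops / traffic:6.2f} FLOP/Byte")

# naive        :   549.8 GB, AI ~0.25 (memory-bound)
# tiled 128x128:     4.4 GB, AI ~31.5 (compute-bound on the A100 FP32 roofline)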

9. Operational vs Arithmetic Intensity

There's an important distinction between two related metrics:

| Metric | Definition | What It Measures | Use Case |
|---|---|---|---|
| Arithmetic Intensity | FLOPs / bytes from the algorithm spec | Inherent algorithm characteristic | Algorithm selection |
| Operational Intensity | FLOPs / bytes actually transferred | Real cache behavior | Implementation optimization |
Python intensity_comparison.py
def compare_intensities(M: int, N: int, K: int, tile_size: int):
    """Compare arithmetic vs operational intensity for GEMM."""
    
    flops = 2 * M * N * K
    
    #----------------------------------------
    # Arithmetic Intensity (from algorithm)
    #----------------------------------------
    bytes_algo = (M*K + K*N + M*N) * 4  # FP32
    ai = flops / bytes_algo
    
    #----------------------------------------
    # Operational Intensity (per memory level, with tiling)
    #----------------------------------------
    # HBM traffic: with good tiling, each matrix element is loaded once
    hbm_bytes = bytes_algo
    oi_hbm = flops / hbm_bytes  # Equals AI when tiling works perfectly
    
    # L1/shared traffic: every K-step reloads a pair of tiles, so the cache
    # level sees roughly K/tile_size times more bytes than HBM does
    num_tile_loads = (M // tile_size) * (N // tile_size) * (K // tile_size)
    l1_bytes = num_tile_loads * 3 * tile_size * tile_size * 4
    oi_l1 = flops / l1_bytes
    
    print(f"GEMM {M}×{K} @ {K}×{N}, tile={tile_size}")
    print(f"  Arithmetic Intensity: {ai:.1f} FLOP/B")
    print(f"  Operational Intensity (HBM): {oi_hbm:.1f} FLOP/B")
    print(f"  Operational Intensity (L1): {oi_l1:.1f} FLOP/B")
    print(f"  L1-to-HBM traffic ratio: {l1_bytes / hbm_bytes:.0f}×")

compare_intensities(4096, 4096, 4096, tile_size=128)
# Output:
# GEMM 4096×4096 @ 4096×4096, tile=128
#   Arithmetic Intensity: 682.7 FLOP/B
#   Operational Intensity (HBM): 682.7 FLOP/B
#   Operational Intensity (L1): 21.3 FLOP/B
#   L1-to-HBM traffic ratio: 32×

# With poor tiling (or none), redundant HBM loads drive the HBM OI far below AI
When They Differ

AI = OI: Perfect implementation, all cache benefits realized.
AI > OI: Poor implementation - cache misses and redundant loads inflate the bytes actually transferred.
AI < OI: Impossible at the DRAM/HBM level - you cannot transfer fewer bytes than the algorithm requires.

10. Profiling Tools for Roofline

Modern profiling tools can automatically generate roofline plots:

NVIDIA Nsight Compute

Nsight Compute provides built-in roofline analysis for GPU kernels:

Bash nsight_roofline.sh
# Profile with roofline metrics
ncu --set roofline -o my_kernel_roofline ./my_application

# Open in GUI
ncu-ui my_kernel_roofline.ncu-rep

# Key sections in report:
# - "Roofline Analysis" - visual plot
# - "Memory Workload Analysis" - bandwidth utilization
# - "Compute Workload Analysis" - FLOP utilization

Intel Advisor

For CPU analysis, Intel Advisor provides comprehensive roofline support:

Bash advisor_roofline.sh
# Collect roofline data
advisor --collect=roofline --project-dir=./advisor_results -- ./my_application

# Generate report
advisor --report=roofline --project-dir=./advisor_results

# Features:
# - Automatic AI calculation per loop
# - Cache-aware roofline (L1/L2/L3/DRAM)
# - Vectorization efficiency overlay

PyTorch Profiler

Python pytorch_profiling.py
import torch
from torch.profiler import profile, ProfilerActivity

def profile_for_roofline(model, input_data, device="cuda"):
    """Profile model for roofline analysis."""
    
    model = model.to(device)
    input_data = input_data.to(device)
    
    # Warmup
    for _ in range(10):
        _ = model(input_data)
    
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        with_flops=True,
        profile_memory=True,
        record_shapes=True,
    ) as prof:
        _ = model(input_data)
    
    # Print breakdown
    print(prof.key_averages().table(
        sort_by="cuda_time_total",
        row_limit=20
    ))
    
    # Export for visualization
    prof.export_chrome_trace("trace.json")
    
    # Calculate AI for each kernel
    for event in prof.key_averages():
        if event.flops > 0 and event.cuda_memory_usage > 0:
            ai = event.flops / event.cuda_memory_usage
            print(f"{event.key}: AI = {ai:.2f} FLOP/B")

# Example usage
model = torch.nn.Linear(4096, 4096)
input_data = torch.randn(32, 4096)
profile_for_roofline(model, input_data)
Roofline Profiling Tool Comparison
Nsight Compute (NVIDIA GPUs): automatic roofline plot, per-kernel analysis, cache-aware metrics, Tensor Core tracking, memory-level breakdown; drawback: overhead for a full profile. Intel Advisor (Intel CPUs): loop-level roofline, vectorization analysis, cache simulation, optimization hints, source correlation; drawback: Intel CPUs only. PyTorch Profiler (cross-platform): easy integration, FLOPs per op, memory tracking, timeline export; drawbacks: no built-in roofline, manual AI calculation, op-level granularity only.
Choose the right tool based on your hardware and analysis needs

11. BLAS Operations on the Roofline

The Basic Linear Algebra Subprograms (BLAS) are organized into three levels based on their arithmetic intensity:

Level 1: Vector Operations

Level 1 BLAS performs vector-vector operations: $O(n)$ data, $O(n)$ FLOPs → AI ≈ constant

| Operation | Definition | FLOPs | Bytes (FP32) | AI |
|---|---|---|---|---|
| SCOPY | y ← x | 0 | 8n | 0 |
| SSCAL | x ← αx | n | 8n | 0.125 |
| SAXPY | y ← αx + y | 2n | 12n | 0.167 |
| SDOT | s ← x·y | 2n | 8n | 0.25 |
| SNRM2 | s ← ‖x‖₂ | 2n | 4n | 0.5 |
| SASUM | s ← Σ\|xᵢ\| | n | 4n | 0.25 |
Level 1 Reality

All Level 1 BLAS operations are severely memory-bound. On an A100 (ridge point ~10 FLOP/B), they achieve only 1-5% of peak compute. Optimization focus: minimize memory traffic, maximize bandwidth utilization.
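To make the "only 1-5% of peak" claim concrete, a minimal sketch that evaluates the roofline bound for each Level 1 AI from the table above, using the A100 FP32 peak and HBM bandwidth as before:

Python level1_bounds.py
a100_peak_gflops, a100_bw_gbs = 19_500, 2_039

level1_ai = {"SCOPY": 0.0, "SSCAL": 0.125, "SAXPY": 0.167,
             "SDOT": 0.25, "SNRM2": 0.5, "SASUM": 0.25}

for op, ai in level1_ai.items():
    bound = min(a100_peak_gflops, a100_bw_gbs * ai)
    print(f"{op}: at most {bound:7.1f} GFLOP/s "
          f"({100 * bound / a100_peak_gflops:.1f}% of FP32 peak)")

# SAXPY tops out near ~340 GFLOP/s (1.7%), SDOT near ~510 (2.6%),
# SNRM2 near ~1,020 (5.2%) -- bandwidth, not compute, sets the limit.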

Level 2: Matrix-Vector Operations

Level 2 BLAS performs matrix-vector operations: $O(n^2)$ data, $O(n^2)$ FLOPs → AI ≈ constant

| Operation | Definition | FLOPs | Bytes (FP32) | AI |
|---|---|---|---|---|
| SGEMV | y ← αAx + βy | 2mn | 4(mn + m + n) | ≈ 0.5 |
| SSYMV | y ← αAx + βy (A symmetric) | 2n² | 4(n²/2 + 2n) | ≈ 1.0 |
| STRMV | x ← Ax (A triangular) | ≈ n² | 4(n²/2 + n) | ≈ 0.5 |
| SGER | A ← αxyᵀ + A | 2mn | 4(mn + m + n) | ≈ 0.5 |
Python level2_analysis.py
def analyze_gemv(m: int, n: int, dtype_bytes: int = 4):
    """GEMV: y = A @ x where A is m×n."""
    
    flops = 2 * m * n  # Each output: n muls + n adds
    
    # Bytes: read A (m×n), read x (n), write y (m)
    bytes_transferred = (m * n + n + m) * dtype_bytes
    
    ai = flops / bytes_transferred
    
    print(f"GEMV ({m}×{n})")
    print(f"  AI: {ai:.3f} FLOP/Byte")
    print(f"  Status: Memory-bound (ridge=10)")
    
    # For large matrices, AI → 2/(4+4/n+4/m) → 0.5
    return ai

# GEMV is always memory-bound regardless of size!
analyze_gemv(1024, 1024)    # AI: 0.499
analyze_gemv(4096, 4096)    # AI: 0.500
analyze_gemv(16384, 16384)  # AI: 0.500

Level 3: Matrix-Matrix Operations

Level 3 BLAS performs matrix-matrix operations: $O(n^2)$ data, $O(n^3)$ FLOPs → AI grows with n

| Operation | Definition | FLOPs | Bytes (FP32) | AI (square matrices) |
|---|---|---|---|---|
| SGEMM | C ← αAB + βC | 2mnk | 4(mk + kn + mn) | ≈ n/6 |
| SSYMM | C ← αAB + βC (A symmetric) | 2mn² | 4(n²/2 + mn + mn) | ≈ n/5 |
| SSYRK | C ← αAAᵀ + βC | n²k | 4(nk + n²/2) | ≈ n/6 |
| STRMM | B ← αAB (A triangular) | mn² | 4(n²/2 + mn) | ≈ n/6 |
BLAS Levels on the Roofline (A100)
[Figure: on an A100, Level 1 (AI ≈ 0.1-0.5) and Level 2 (AI ≈ 0.5-2) BLAS sit far left of the ridge (~10) and are always memory-bound; Level 3 (GEMM) AI grows with matrix size, so large enough GEMMs (N = 128 ... 8192) cross into the compute-bound region.]
Only Level 3 BLAS can effectively utilize modern GPU compute capacity

12. Deep Learning Operations

Deep learning workloads combine many operations with different characteristics. Understanding each operation's roofline position is critical for optimization.

Linear Layers & GEMM

Fully-connected layers are GEMMs: $Y = XW + b$

Python linear_analysis.py
def analyze_linear(
    batch: int, in_features: int, out_features: int,
    dtype_bytes: int = 2  # FP16
):
    """Linear layer: Y = X @ W + b"""
    
    # GEMM: (batch × in) @ (in × out) = (batch × out)
    flops = 2 * batch * in_features * out_features
    
    # Bytes: read X, read W, write Y (ignore bias - small)
    bytes_x = batch * in_features * dtype_bytes
    bytes_w = in_features * out_features * dtype_bytes
    bytes_y = batch * out_features * dtype_bytes
    bytes_total = bytes_x + bytes_w + bytes_y
    
    ai = flops / bytes_total
    
    # Classification
    a100_ridge = 152  # FP16 Tensor Core
    status = "Compute-bound" if ai > a100_ridge else "Memory-bound"
    
    print(f"Linear({in_features}→{out_features}), batch={batch}")
    print(f"  AI: {ai:.1f} FLOP/B ({status})")
    
    return ai

# Typical LLM dimensions
analyze_linear(1, 4096, 4096)      # AI: 1.0 - Memory-bound!
analyze_linear(32, 4096, 4096)     # AI: 31.5 - Memory-bound
analyze_linear(256, 4096, 4096)    # AI: 227.6 - Compute-bound!
analyze_linear(1024, 4096, 4096)   # AI: 682.7 - Compute-bound

# Key insight: batch size determines bound type!

Convolutions

Convolutions can be expressed as im2col + GEMM or direct computation:

Python conv_analysis.py
def analyze_conv2d(
    batch: int,
    in_channels: int, out_channels: int,
    h: int, w: int,
    kernel: int = 3,
    dtype_bytes: int = 2
):
    """Conv2D arithmetic intensity analysis."""
    
    out_h, out_w = h, w  # Assume same padding
    
    # FLOPs: each output pixel needs kernel² × in_ch MACs
    flops = 2 * batch * out_channels * out_h * out_w * in_channels * kernel * kernel
    
    # Bytes
    bytes_input = batch * in_channels * h * w * dtype_bytes
    bytes_kernel = out_channels * in_channels * kernel * kernel * dtype_bytes
    bytes_output = batch * out_channels * out_h * out_w * dtype_bytes
    bytes_total = bytes_input + bytes_kernel + bytes_output
    
    ai = flops / bytes_total
    
    print(f"Conv2D: {in_channels}→{out_channels}, {kernel}×{kernel}, input {h}×{w}, batch={batch}")
    print(f"  AI: {ai:.1f} FLOP/B")
    
    return ai

# ResNet-style convolutions (naive byte counts: inputs, weights, outputs read/written once)
analyze_conv2d(1, 64, 64, 56, 56)    # AI: 263.8
analyze_conv2d(32, 64, 64, 56, 56)   # AI: 287.2
analyze_conv2d(1, 512, 512, 7, 7)   # AI: 48.0 - weight-dominated, memory-bound on Tensor Cores
analyze_conv2d(32, 512, 512, 7, 7)  # AI: 933.1 - batching makes it compute-bound

Attention Mechanisms

Self-attention: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$

Attention Mechanism: Operation Breakdown
Q @ Kᵀ is a GEMM (O(B·H·S²·D), compute-bound), softmax is elementwise (O(B·H·S²), memory-bound), scores @ V is another GEMM (compute-bound), and the output projection is a GEMM whose bound depends on batch size. Standard attention materializes the S×S score matrix, costing O(B·H·S²) memory, multiple kernel launches, and a memory-bound softmax between the two GEMMs. Flash Attention computes the same result in tiles and never stores the S×S matrix: HBM traffic is only O(B·S·D), everything runs in a single fused kernel, and the softmax happens in on-chip SRAM.
Flash Attention eliminates memory-bound softmax by keeping data in SRAM
Python attention_analysis.py
def analyze_attention(
    batch: int, heads: int, seq_len: int, head_dim: int,
    use_flash: bool = False,
    dtype_bytes: int = 2
):
    """Analyze attention arithmetic intensity."""
    
    B, H, S, D = batch, heads, seq_len, head_dim
    
    # FLOPs: Q@K.T + softmax + scores@V
    flops_qk = 2 * B * H * S * S * D      # Q @ K.T
    flops_softmax = 5 * B * H * S * S      # exp, sum, div, etc.
    flops_sv = 2 * B * H * S * S * D       # scores @ V
    total_flops = flops_qk + flops_softmax + flops_sv
    
    if use_flash:
        # Flash: only Q, K, V, O traverse HBM once
        bytes_total = 4 * B * H * S * D * dtype_bytes
    else:
        # Standard: Q, K, V, O + attention scores matrix
        bytes_qkvo = 4 * B * H * S * D * dtype_bytes
        bytes_scores = B * H * S * S * dtype_bytes  # The killer!
        bytes_total = bytes_qkvo + 2 * bytes_scores  # Read + write
    
    ai = total_flops / bytes_total
    
    mode = "Flash" if use_flash else "Standard"
    print(f"{mode} Attention: B={B}, H={H}, S={S}, D={D}")
    print(f"  AI: {ai:.1f} FLOP/B")
    
    return ai

# GPT-3 style: batch=1, heads=96, seq=2048, dim=128
analyze_attention(1, 96, 2048, 128, use_flash=False)  # AI: 114.9 (aggregate; softmax still runs as separate memory-bound kernels)
analyze_attention(1, 96, 2048, 128, use_flash=True)   # AI: 1034.0 (~9× better)

# Longer sequences amplify the difference
analyze_attention(1, 96, 8192, 128, use_flash=False)  # AI: 125.3
analyze_attention(1, 96, 8192, 128, use_flash=True)   # AI: 4136.0 (~33× better)

Normalization Layers

LayerNorm, BatchNorm, RMSNorm - all are memory-bound due to low AI:

| Operation | Formula | FLOPs/elem | Bytes/elem (FP16) | AI |
|---|---|---|---|---|
| LayerNorm | $(x - \mu) / \sigma \cdot \gamma + \beta$ | ~8 | 8 | 1.0 |
| RMSNorm | $x / \text{RMS}(x) \cdot \gamma$ | ~5 | 8 | 0.625 |
| BatchNorm | $(x - \mu) / \sigma \cdot \gamma + \beta$ | ~8 | 8 | 1.0 |
Fusing Norms

Since norms are memory-bound, kernel fusion is critical. Fusing LayerNorm with the preceding or following operation avoids a full round-trip to HBM. This is why libraries like FlashAttention and Apex fused kernels exist.
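A minimal sketch of the fusion argument for a residual add followed by LayerNorm, using the per-element FLOP counts from the tables above (FP16; the unfused version writes the intermediate sum to HBM and reads it back, the fused version keeps it on chip):

Python fusion_ai.py
def fused_vs_unfused_layernorm(n_elements: int, dtype_bytes: int = 2):
    """Compare AI of residual-add + LayerNorm as two kernels vs one fused kernel."""
    flops = (1 + 8) * n_elements  # 1 FLOP/elem for the add, ~8 for LayerNorm

    # Unfused: add reads x and residual, writes tmp; LayerNorm reads tmp, writes y
    unfused_bytes = (3 + 2) * n_elements * dtype_bytes
    # Fused: read x and residual once, write y once -- no HBM round-trip for tmp
    fused_bytes = 3 * n_elements * dtype_bytes

    print(f"Unfused AI: {flops / unfused_bytes:.2f} FLOP/B, "
          f"fused AI: {flops / fused_bytes:.2f} FLOP/B "
          f"({unfused_bytes / fused_bytes:.2f}x less HBM traffic)")

fused_vs_unfused_layernorm(32 * 1024 * 768)
# Unfused AI: 0.90 FLOP/B, fused AI: 1.50 FLOP/B (1.67x less HBM traffic)
# Both are still memory-bound, so the ~1.67x traffic cut is ~1.67x less runtime.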

Activations & Elementwise Ops

All elementwise operations have very low AI:

| Operation | FLOPs/elem | Bytes/elem (FP16) | AI |
|---|---|---|---|
| ReLU | 1 (comparison) | 4 | 0.25 |
| GELU | ~12 | 4 | 3.0 |
| SiLU/Swish | ~4 | 4 | 1.0 |
| Dropout | 2 | 4 | 0.5 |
| Add (residual) | 1 | 6 | 0.17 |
Deep Learning Operations on the Roofline
[Figure: elementwise ops (ReLU, residual add, LayerNorm, GELU), softmax, and standard attention sit left of the FP16 ridge (~152) and are memory-bound; Flash Attention, Conv2D, and large-batch Linear layers and GEMMs sit at or beyond the ridge and can reach the compute ceiling.]
Most DL operations are memory-bound; only large GEMMs fully utilize GPU compute

13. Batch Size Effects

Batch size is the most important knob for moving operations from memory-bound to compute-bound:

Python batch_size_analysis.py
import numpy as np

def analyze_batch_effect(
    in_features: int,
    out_features: int,
    batch_sizes: list,
    dtype_bytes: int = 2  # FP16
):
    """Show how batch size affects arithmetic intensity."""
    
    # Weight bytes (constant regardless of batch)
    W_bytes = in_features * out_features * dtype_bytes
    
    print(f"Linear Layer: {in_features}→{out_features}")
    print(f"Weights: {W_bytes / 1e6:.1f} MB")
    print(f"\n{'Batch':>8} {'FLOPs':>12} {'Bytes':>12} {'AI':>10} {'Status':>15}")
    print("-" * 60)
    
    a100_ridge = 152  # FP16 TC
    
    for B in batch_sizes:
        # FLOPs: 2 * B * in * out
        flops = 2 * B * in_features * out_features
        
        # Bytes: X (B×in) + W (in×out) + Y (B×out)
        bytes_total = (B * in_features + in_features * out_features + 
                      B * out_features) * dtype_bytes
        
        ai = flops / bytes_total
        status = "✓ Compute" if ai > a100_ridge else "✗ Memory"
        
        print(f"{B:>8} {flops:>12,} {bytes_total:>12,} {ai:>10.1f} {status:>15}")

# LLM-style dimensions
analyze_batch_effect(
    in_features=4096,
    out_features=4096,
    batch_sizes=[1, 4, 16, 64, 128, 256, 512, 1024]
)

# Output:
# Linear Layer: 4096→4096
# Weights: 33.6 MB
#
#    Batch        FLOPs        Bytes         AI          Status
# ------------------------------------------------------------
#        1   33,554,432   33,570,816        1.0       ✗ Memory
#        4  134,217,728   33,619,968        4.0       ✗ Memory
#       16  536,870,912   33,816,576       15.9       ✗ Memory
#       64 2,147,483,648   34,603,008       62.1       ✗ Memory
#      128 4,294,967,296   35,651,584      120.5       ✗ Memory
#      256 8,589,934,592   37,748,736      227.6       ✓ Compute
#      512 17,179,869,184   41,943,040      409.6       ✓ Compute
#     1024 34,359,738,368   50,331,648      682.7       ✓ Compute
Batch Size Effect on AI (Linear 4096→4096)
[Figure: AI rises from 1.0 at batch 1 and crosses the FP16 Tensor Core ridge (~152 FLOP/B) between batch 128 and 256; for batches much smaller than the layer dimensions, AI ≈ B (FP16), before saturating for very large batches.]
Arithmetic intensity grows nearly linearly with batch size until activation traffic starts to rival the weight traffic.
LLM Inference Reality

Batch size 1 inference (autoregressive generation) is almost always memory-bound. This is why LLM inference is dominated by weight loading time, not compute. Optimization strategies: quantization (fewer bytes), speculative decoding (more FLOPs per weight load), continuous batching (larger effective batch).

Critical Batch Size

The batch size where an operation transitions from memory-bound to compute-bound:

$$B_{\text{critical}} \approx \frac{\text{AI}_{\text{ridge}} \cdot \text{in} \cdot \text{out} \cdot b}{2 \cdot \text{in} \cdot \text{out} - \text{AI}_{\text{ridge}} \cdot (\text{in} + \text{out}) \cdot b} \qquad (b = \text{bytes per element})$$
Python critical_batch.py
def critical_batch_size(
    in_features: int,
    out_features: int,
    ridge_ai: float,
    dtype_bytes: int = 2
) -> int:
    """Calculate critical batch size for compute-bound operation."""
    
    # From the AI equation:
    #   AI = 2*B*in*out / ((B*(in+out) + in*out) * dtype_bytes)
    # Solve for B when AI = ridge_ai:
    W = in_features * out_features
    
    B_crit = ridge_ai * W * dtype_bytes / (2 * W -
                                           ridge_ai * (in_features + out_features) * dtype_bytes)
    
    return max(1, int(B_crit))

# A100 FP16 Tensor Core ridge point
ridge = 152  # FLOP/Byte

# Common LLM dimensions
print("Critical batch sizes (A100 FP16):")
print(f"  Linear 4096→4096: B={critical_batch_size(4096, 4096, ridge)}")     # ~164
print(f"  Linear 4096→16384: B={critical_batch_size(4096, 16384, ridge)}")  # ~159
print(f"  Linear 16384→4096: B={critical_batch_size(16384, 4096, ridge)}")  # ~159
print(f"  Linear 768→768: B={critical_batch_size(768, 768, ridge)}")        # ~251
print(f"  Linear 768→3072: B={critical_batch_size(768, 3072, ridge)}")      # ~201

14. Precision Effects & Tensor Cores

Modern GPUs support multiple numerical precisions, each with different peak throughput. This creates multiple rooflines on the same diagram.

Peak Compute by Precision

NVIDIA A100 peak throughput by precision:

| Precision | Bits/Element | Peak TFLOPS | Relative to FP32 | Use Case |
|---|---|---|---|---|
| FP64 | 64 | 9.7 | 0.5× | Scientific computing |
| FP32 | 32 | 19.5 | 1× (baseline) | |
| TF32 (Tensor) | 19 (stored as 32) | 156 | 8× | DL training |
| FP16 (Tensor) | 16 | 312 | 16× | DL training/inference |
| BF16 (Tensor) | 16 | 312 | 16× | DL training |
| INT8 (Tensor) | 8 | 624 | 32× | Inference |
| INT4 (Tensor) | 4 | 1248 | 64× | Inference |

Memory Bandwidth Effects

Lower precision also means fewer bytes transferred, affecting arithmetic intensity:

Python precision_analysis.py
def analyze_precision_effect(
    batch: int,
    in_features: int,
    out_features: int,
):
    """Compare AI and performance across precisions."""
    
    # A100 specs
    peak_tflops = {
        'FP32': 19.5, 'TF32': 156, 'FP16': 312,
        'BF16': 312, 'INT8': 624, 'INT4': 1248
    }
    bytes_per_elem = {
        'FP32': 4, 'TF32': 4, 'FP16': 2,
        'BF16': 2, 'INT8': 1, 'INT4': 0.5
    }
    
    bw = 2039  # GB/s HBM
    
    print(f"Linear {in_features}→{out_features}, batch={batch}")
    print(f"{'Precision':>8} {'AI':>8} {'Ridge':>8} {'Bound':>12} {'TFLOPS':>10}")
    print("-" * 50)
    
    flops = 2 * batch * in_features * out_features
    
    for prec in ['FP32', 'TF32', 'FP16', 'INT8', 'INT4']:
        elem_size = bytes_per_elem[prec]
        bytes_total = (batch * in_features + in_features * out_features + 
                      batch * out_features) * elem_size
        
        ai = flops / bytes_total
        ridge = peak_tflops[prec] * 1000 / bw  # FLOP/Byte
        bound = "Compute" if ai > ridge else "Memory"
        
        # Attainable performance
        if ai > ridge:
            perf = peak_tflops[prec]
        else:
            perf = ai * bw / 1000  # TFLOPS
        
        print(f"{prec:>8} {ai:>8.1f} {ridge:>8.1f} {bound:>12} {perf:>10.1f}")

# Example: LLM inference (small batch)
analyze_precision_effect(1, 4096, 4096)

# Output:
# Linear 4096→4096, batch=1
# Precision       AI    Ridge        Bound     TFLOPS
# --------------------------------------------------
#     FP32      0.5      9.6       Memory        1.0
#     TF32      0.5     76.5       Memory        1.0
#     FP16      1.0    153.0       Memory        2.0
#     INT8      2.0    306.0       Memory        4.1
#     INT4      4.0    612.1       Memory        8.2
#
# Key insight: Lower precision → higher AI → closer to the compute roof!

Multi-Precision Roofline

Multi-Precision Roofline (A100)
[Figure: a single HBM bandwidth line (2,039 GB/s) intersecting compute ceilings at FP32 (19.5), TF32 (156), FP16 (312), INT8 (624), and INT4 (1,248) TFLOPS, giving ridge points of roughly 9.6, 76.5, 153, 306, and 612 FLOP/Byte.]
Lower precision pushes the ridge point right, making more operations compute-bound
Quantization as Optimization

Quantization isn't just about model size: it fundamentally changes your position on the roofline. Quantizing a memory-bound FP32 kernel to INT8 moves 4× fewer bytes and raises the available compute ceiling 32×, so the same operation lands much closer to the compute roof.

15. Extended Roofline Models

The classic roofline focuses on compute vs memory bandwidth. Extended models add other limiting factors.

Energy Roofline

For mobile/edge deployment, energy efficiency matters as much as raw performance:

$$\text{Energy Efficiency} = \frac{\text{FLOPs}}{\text{Joules}} = \frac{\text{Performance}}{\text{Power}}$$
Energy Roofline Model
[Figure: energy roofline with a sloped memory-energy region and a flat compute-energy limit meeting at an energy ridge of roughly 67 FLOP/Byte, using E_total = E_compute + E_memory with E_compute ≈ 0.3 pJ/FLOP (FP16) and E_memory ≈ 20 pJ/Byte (HBM).]
Memory operations cost ~60× more energy per equivalent compute
Python energy_analysis.py
def energy_roofline(flops: int, bytes_transferred: int):
    """Calculate energy efficiency using energy roofline."""
    
    # Energy costs (approximate for modern GPUs)
    E_FLOP = 0.3e-12    # Joules per FP16 FLOP
    E_BYTE = 20e-12     # Joules per byte (HBM)
    
    energy_compute = flops * E_FLOP
    energy_memory = bytes_transferred * E_BYTE
    total_energy = energy_compute + energy_memory
    
    efficiency = flops / total_energy / 1e9  # GFLOPS/Watt
    
    # Which bound dominates?
    ai = flops / bytes_transferred
    energy_ridge = E_BYTE / E_FLOP  # ~67 FLOP/Byte for these params
    
    if ai < energy_ridge:
        bound = "Memory-energy bound"
    else:
        bound = "Compute-energy bound"
    
    print(f"Energy Efficiency: {efficiency:.1f} GFLOPS/W ({bound})")
    return efficiency

# GEMM 4K×4K×4K FP16
flops = 2 * 4096**3  # 137B FLOPs
bytes_mem = 3 * 4096**2 * 2  # 100 MB
energy_roofline(flops, bytes_mem)  # ~3000 GFLOPS/W (compute-bound)

# Attention softmax
flops_softmax = 5 * 2048**2  # 21M FLOPs
bytes_softmax = 2 * 2048**2 * 2  # 16 MB
energy_roofline(flops_softmax, bytes_softmax)  # ~60 GFLOPS/W (memory-bound)

Communication Roofline

For distributed training, network bandwidth becomes another ceiling:

Communication Roofline (Multi-GPU)
[Figure: communication roofline with the 312 TFLOPS compute ceiling and bandwidth ceilings for NVLink (600 GB/s), PCIe (64 GB/s), and 100 GbE; communication intensity (FLOPs per byte communicated) shrinks as the per-GPU share of work shrinks, so adding GPUs pushes data-parallel training toward the communication ceilings.]
At scale, network bandwidth often becomes the bottleneck before GPU compute
Scaling Limits

Data parallelism becomes communication-bound when: $\frac{2 \cdot \text{FLOPs}_{\text{forward}}}{N \cdot \text{ParamBytes}} < \text{CI}_{\text{ridge}}$. This is why large-scale training uses model parallelism, pipeline parallelism, and gradient compression.
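A minimal sketch of that check for data-parallel training (the ~6 FLOPs per parameter per token for forward plus backward and the 2× gradient bytes for a ring all-reduce are common rule-of-thumb estimates; the model and batch sizes below are illustrative):

Python comm_intensity.py
def data_parallel_comm_intensity(params: float, tokens_per_gpu: int,
                                 grad_bytes_per_param: int = 2) -> float:
    """FLOPs per byte of gradient traffic for one data-parallel training step."""
    step_flops = 6 * params * tokens_per_gpu        # ~6 FLOPs/param/token, fwd + bwd
    comm_bytes = 2 * params * grad_bytes_per_param  # ring all-reduce, per GPU
    return step_flops / comm_bytes

# Illustrative 7B-parameter model with FP16 gradients
for tokens in (512, 4096, 32768):
    ci = data_parallel_comm_intensity(7e9, tokens)
    print(f"{tokens:>6} tokens/GPU: CI = {ci:,.0f} FLOP/Byte")

# CI = 1.5 * tokens_per_gpu: shrinking the per-GPU batch (i.e. adding GPUs at
# fixed global batch) pushes training toward the interconnect ceilings above.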

16. Optimization Trajectories

Different optimizations move you in different directions on the roofline:

Optimization Strategies on the Roofline
[Figure: a baseline kernel sitting below the roof, with arrows for five moves: (1) better bandwidth utilization (straight up), (2) kernel fusion (up and to the right), (3) algorithmic changes such as tiling (to the right), (4) quantization (up toward a higher ceiling), (5) compute optimizations (up to the compute roof).]
Choose optimizations based on your current position relative to the roofline
| Optimization | Direction | When to Use | Examples |
|---|---|---|---|
| Better BW Utilization | ↑ Vertical | Memory-bound, below the roof | Coalesced access, prefetching |
| Kernel Fusion | ↗ Diagonal | Memory-bound, multiple kernels | FlashAttention, fused LayerNorm |
| Algorithmic (Tiling) | → Horizontal | Memory-bound, data reuse possible | Blocked GEMM, im2col |
| Quantization | ↗ To a higher ceiling | Memory-bound, accuracy permits | INT8, INT4, FP8 |
| Compute Opts | ↑ To the ceiling | Compute-bound | Tensor Cores, loop unrolling |
| Parallelism | ↑ Higher roof | At the ceiling, need more | Multi-GPU, model parallelism |
Python optimization_decision.py
def recommend_optimization(
    current_perf: float,      # GFLOP/s
    arithmetic_intensity: float,  # FLOP/Byte
    peak_compute: float,     # GFLOP/s
    peak_bandwidth: float,   # GB/s
) -> str:
    """Recommend optimization based on roofline position."""
    
    ridge_point = peak_compute / peak_bandwidth
    roof_perf = min(peak_compute, arithmetic_intensity * peak_bandwidth)
    efficiency = current_perf / roof_perf
    
    if arithmetic_intensity < ridge_point:
        # Memory-bound region
        if efficiency < 0.5:
            return """Memory-bound, low efficiency:
  1. Fix memory access patterns (coalescing)
  2. Use faster memory tier (L2 cache)
  3. Profile for bank conflicts"""
        elif efficiency < 0.8:
            return """Memory-bound, moderate efficiency:
  1. Try kernel fusion to increase AI
  2. Use tiling for data reuse
  3. Consider quantization"""
        else:
            return """Memory-bound, high efficiency:
  1. Algorithm change needed for more AI
  2. Quantization is the main lever
  3. Accept memory-bound nature"""
    else:
        # Compute-bound region
        if efficiency < 0.7:
            return """Compute-bound, low efficiency:
  1. Use Tensor Cores (if not already)
  2. Increase occupancy
  3. Reduce register pressure"""
        else:
            return """Compute-bound, high efficiency:
  1. Already well-optimized!
  2. Scale with more GPUs
  3. Use lower precision if acceptable"""

# Example usage
print(recommend_optimization(
    current_perf=150,          # 150 GFLOP/s observed
    arithmetic_intensity=2.0, # 2 FLOP/Byte
    peak_compute=312000,       # A100 FP16
    peak_bandwidth=2039        # A100 HBM
))

17. Case Studies

ResNet-50 Inference

ResNet-50 Layer Analysis (Batch=1 vs Batch=64)
[Figure: ResNet-50 totals ~8.2 GFLOP per image; at batch=1 it runs in ~8 ms (~1.0 TFLOPS, ~0.3% compute utilization), at batch=64 in ~12 ms (~44 TFLOPS, ~14% utilization), as batching moves its layers up and to the right on the roofline.]
Batching transforms ResNet from memory-bound to compute-bound

Transformer Training

Python transformer_analysis.py
def analyze_transformer_block(
    batch: int,
    seq_len: int,
    hidden: int,
    heads: int,
    ffn_mult: int = 4,
    dtype_bytes: int = 2
):
    """Analyze a transformer block's operations."""
    
    head_dim = hidden // heads
    ffn_hidden = hidden * ffn_mult
    
    ops = []
    
    # QKV Projection
    qkv_flops = 2 * batch * seq_len * hidden * 3 * hidden
    qkv_bytes = (batch * seq_len * hidden + 3 * hidden * hidden + 
                 batch * seq_len * 3 * hidden) * dtype_bytes
    ops.append(('QKV Proj', qkv_flops, qkv_bytes))
    
    # Attention (with Flash)
    attn_flops = 4 * batch * heads * seq_len * seq_len * head_dim
    attn_bytes = 4 * batch * seq_len * hidden * dtype_bytes  # Flash: no S×S
    ops.append(('Attention', attn_flops, attn_bytes))
    
    # Output Projection
    out_flops = 2 * batch * seq_len * hidden * hidden
    out_bytes = (batch * seq_len * hidden + hidden * hidden + 
                batch * seq_len * hidden) * dtype_bytes
    ops.append(('Out Proj', out_flops, out_bytes))
    
    # LayerNorm (×2)
    ln_flops = 2 * 8 * batch * seq_len * hidden
    ln_bytes = 2 * 2 * batch * seq_len * hidden * dtype_bytes
    ops.append(('LayerNorm', ln_flops, ln_bytes))
    
    # FFN (up + down)
    ffn_flops = 2 * batch * seq_len * hidden * ffn_hidden * 2
    ffn_bytes = (batch * seq_len * hidden + hidden * ffn_hidden + 
                batch * seq_len * ffn_hidden + ffn_hidden * hidden + 
                batch * seq_len * hidden) * dtype_bytes
    ops.append(('FFN', ffn_flops, ffn_bytes))
    
    # Activations (GELU, residuals)
    act_flops = 15 * batch * seq_len * ffn_hidden
    act_bytes = 4 * batch * seq_len * ffn_hidden * dtype_bytes
    ops.append(('Activations', act_flops, act_bytes))
    
    print(f"Transformer Block: B={batch}, S={seq_len}, H={hidden}")
    print(f"{'Operation':>12} {'GFLOP':>10} {'GB':>8} {'AI':>8} {'Bound':>10}")
    print("-" * 52)
    
    ridge = 153  # A100 FP16 TC
    total_flops = 0
    total_bytes = 0
    
    for name, flops, bytes_mem in ops:
        ai = flops / bytes_mem
        bound = "Compute" if ai > ridge else "Memory"
        print(f"{name:>12} {flops/1e9:>10.1f} {bytes_mem/1e9:>8.3f} {ai:>8.1f} {bound:>10}")
        total_flops += flops
        total_bytes += bytes_mem
    
    print("-" * 52)
    total_ai = total_flops / total_bytes
    print(f"{'TOTAL':>12} {total_flops/1e9:>10.1f} {total_bytes/1e9:>8.3f} {total_ai:>8.1f}")

# GPT-2 style block
analyze_transformer_block(batch=32, seq_len=1024, hidden=768, heads=12)

# Output:
# Transformer Block: B=32, S=1024, H=768
#    Operation      GFLOP       GB       AI      Bound
# ----------------------------------------------------
#     QKV Proj      116.0    0.205    566.0    Compute
#    Attention      103.1    0.201    512.0    Compute
#     Out Proj       38.7    0.102    379.6    Compute
#    LayerNorm        0.4    0.201      2.0     Memory
#          FFN      309.2    0.311    993.0    Compute
#  Activations        1.5    0.805      1.8     Memory
# ----------------------------------------------------
#        TOTAL      568.8    1.826    311.5
Training vs Inference

Training has 3× the FLOPs (forward + backward + optimizer) but similar memory access per parameter. This naturally increases AI, making training more compute-bound than inference.

LLM Inference

LLM inference has two distinct phases with very different roofline characteristics:

LLM Inference: Prefill vs Decode
[Figure: decode sits at AI ≈ 1-2, far left of the FP16 ridge (~153) and firmly memory-bound; prefill sits at AI ≈ 100-500 and is compute-bound, because the whole prompt is processed in parallel while decode handles one token at a time.]
Decode phase dominates latency because it's severely memory-bound
Python llm_inference_analysis.py
def analyze_llm_inference(
    model_params_B: float,    # Billions of parameters
    prompt_tokens: int,
    gen_tokens: int,
    precision_bytes: int = 2,  # FP16
    peak_tflops: float = 312,  # A100
    bw_gb_s: float = 2039      # A100 HBM
):
    """Analyze LLM inference performance breakdown."""
    
    params = model_params_B * 1e9
    param_bytes = params * precision_bytes
    
    # ===== PREFILL PHASE =====
    # All tokens processed in parallel (like batch = prompt_tokens)
    prefill_flops = 2 * params * prompt_tokens  # 2 FLOPs per param per token
    prefill_bytes = param_bytes + prompt_tokens * 4096 * precision_bytes  # Weights + activations
    prefill_ai = prefill_flops / prefill_bytes
    
    ridge = peak_tflops * 1000 / bw_gb_s
    
    if prefill_ai > ridge:
        prefill_time = prefill_flops / (peak_tflops * 1e12)
    else:
        prefill_time = prefill_bytes / (bw_gb_s * 1e9)
    
    # ===== DECODE PHASE =====
    # One token at a time (batch = 1)
    decode_flops_per_token = 2 * params
    decode_bytes_per_token = param_bytes  # Must load ALL weights for each token!
    decode_ai = decode_flops_per_token / decode_bytes_per_token  # Always ~1!
    
    # Decode is always memory-bound
    decode_time_per_token = decode_bytes_per_token / (bw_gb_s * 1e9)
    total_decode_time = decode_time_per_token * gen_tokens
    
    tokens_per_sec = 1 / decode_time_per_token
    
    print(f"===== LLM Inference Analysis =====")
    print(f"Model: {model_params_B}B params ({param_bytes/1e9:.1f} GB)")
    print(f"Prompt: {prompt_tokens} tokens, Generate: {gen_tokens} tokens")
    print()
    print(f"PREFILL:")
    print(f"  AI: {prefill_ai:.0f} FLOP/B → {'Compute' if prefill_ai > ridge else 'Memory'}-bound")
    print(f"  Time: {prefill_time*1000:.1f} ms")
    print()
    print(f"DECODE:")
    print(f"  AI: {decode_ai:.1f} FLOP/B → Memory-bound (ALWAYS!)")
    print(f"  Per-token: {decode_time_per_token*1000:.1f} ms")
    print(f"  Total: {total_decode_time*1000:.0f} ms")
    print(f"  Throughput: {tokens_per_sec:.0f} tokens/sec")
    print()
    print(f"TOTAL: {(prefill_time + total_decode_time)*1000:.0f} ms")
    print(f"Decode dominates: {total_decode_time/(prefill_time+total_decode_time)*100:.0f}%")

# Llama-2 7B example
analyze_llm_inference(
    model_params_B=7,
    prompt_tokens=512,
    gen_tokens=256
)

# Output:
# ===== LLM Inference Analysis =====
# Model: 7B params (14.0 GB)
# Prompt: 512 tokens, Generate: 256 tokens
#
# PREFILL:
#   AI: 512 FLOP/B → Compute-bound
#   Time: 22.9 ms
#
# DECODE:
#   AI: 1.0 FLOP/B → Memory-bound (ALWAYS!)
#   Per-token: 6.9 ms
#   Total: 1759 ms
#   Throughput: 146 tokens/sec
#
# TOTAL: 1782 ms
# Decode dominates: 99%
LLM Inference Optimization

Decode is the bottleneck. Key optimizations target this phase: Quantization (fewer bytes to load), Speculative Decoding (more FLOPs per weight load), Continuous Batching (higher effective batch), KV Cache Compression (less memory), Model Parallelism (aggregate bandwidth).
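A minimal sketch of the first lever, quantization, applied to the decode bound from the analysis above: memory-bound decode throughput is roughly bandwidth divided by the bytes of weights loaded per token (KV-cache traffic ignored for simplicity; the 7B model is the same illustrative example as before):

Python decode_quantization.py
def decode_tokens_per_sec(params: float, bytes_per_param: float,
                          bw_gbs: float = 2_039) -> float:
    """Memory-bound decode rate: one full pass over the weights per token."""
    return bw_gbs * 1e9 / (params * bytes_per_param)

for name, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: ~{decode_tokens_per_sec(7e9, bytes_per_param):.0f} tokens/sec")

# FP16: ~146, INT8: ~291, INT4: ~583 -- halving the bytes per weight roughly
# doubles memory-bound decode throughput (model quality permitting).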

18. Conclusion & Key Takeaways

Essential Roofline Insights
  1. Every kernel has two possible bottlenecks: compute or memory bandwidth. The roofline tells you which one applies.
  2. Arithmetic Intensity (AI) determines your bound: AI < ridge → memory-bound, AI > ridge → compute-bound.
  3. Most DL operations are memory-bound: Only large GEMMs with sufficient batch size hit compute ceilings.
  4. Batch size is the primary lever: Increasing batch moves operations from memory-bound to compute-bound.
  5. Quantization helps both ways: Fewer bytes transferred AND higher compute throughput.
  6. Kernel fusion increases effective AI: FlashAttention eliminates the memory-bound softmax bottleneck.
  7. LLM decode is always memory-bound: ~1 FLOP/Byte regardless of model size. Speed = bandwidth.
  8. Profile before optimizing: Know your position on the roofline before choosing an optimization strategy.
Roofline Decision Framework
The framework: measure AI and achieved performance, then ask whether AI is above the ridge (~10 for FP32, ~150 for FP16 Tensor Cores). If not, the kernel is memory-bound; focus on bandwidth and on raising AI: increase batch size, fuse kernels, quantize to move fewer bytes, improve coalescing, exploit faster cache levels, tile for reuse. If AI is above the ridge, the kernel is compute-bound; focus on Tensor Cores and ILP: raise occupancy, use lower precision, unroll loops, scale to more GPUs, or switch to a better algorithm.
Use this framework to guide every GPU optimization effort

The roofline model transforms GPU optimization from guesswork into a systematic process. By understanding your workload's position on the roofline, you can:

  1. Predict the best-case performance of a kernel before writing it.
  2. Tell whether compute throughput or memory bandwidth is the binding limit.
  3. Pick the optimization (fusion, tiling, quantization, batching, Tensor Cores) that actually moves you toward the roof.
  4. Recognize when a kernel is already as fast as the hardware allows.

Master the roofline, and you'll write faster GPU code in less time.