1. Introduction & History

The Roofline Model, introduced by Samuel Williams, Andrew Waterman, and David Patterson at UC Berkeley in 2009, provides a visual framework to understand the performance limits of a kernel on a given architecture.

The core insight: at any given arithmetic intensity, a kernel is limited by either compute throughput or memory bandwidth. The roofline diagram shows you which limit applies and by how much.

Why Roofline Matters

Before roofline, developers would optimize blindly - adding more compute optimizations to a memory-bound kernel, or vice versa. The roofline model tells you which optimization will actually help before you write any code.

The Classic Roofline Model
[Figure: log-log roofline with arithmetic intensity (FLOP/Byte, 1/16 to 256) on the X-axis and attainable performance (GFLOP/s, 1 to 10K) on the Y-axis; the sloped bandwidth ceiling meets the flat compute ceiling (peak FLOPS) at the ridge point, with SAXPY, GEMM, and Conv2D shown as example kernels in the memory-bound and compute-bound regions.]
Log-log plot: X-axis is arithmetic intensity, Y-axis is performance. The "roof" defines maximum achievable performance.

Key terminology: arithmetic intensity (FLOP/Byte), the bandwidth and compute ceilings, the ridge point where they meet, and the memory-bound vs compute-bound regimes on either side of it; each is defined in the next section.

2. The Roofline Model Fundamentals

Arithmetic Intensity

Arithmetic Intensity (AI) is the ratio of floating-point operations to bytes moved from memory:

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Transferred}} \quad \left[\frac{\text{FLOP}}{\text{Byte}}\right]$$

This single metric determines whether your kernel is memory-bound or compute-bound:

Computing Arithmetic Intensity
Vector addition (C = A + B): FLOPs = N additions; bytes = 2N reads + N writes = 3N × 4; AI = N / 12N = 1/12 ≈ 0.083.
Dot product (s = Σ aᵢbᵢ): FLOPs = N multiplies + N additions = 2N; bytes = 2N reads × 4 = 8N; AI = 2N / 8N = 0.25.
Matrix multiply (C = A × B): FLOPs = 2N³ (N² outputs × 2N ops each); bytes = 3N² × 4 (naive, no tiling); AI = 2N³ / 12N² = N/6.
On the arithmetic intensity scale this puts vector add at ~0.08, dot product at 0.25, and GEMM with N=1K at ~167, versus an A100 FP32 ridge point of ~9.6.
Low AI = memory-bound, High AI = compute-bound. Ridge point depends on hardware.
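The same bookkeeping takes only a few lines of Python; a minimal sketch using the FP32 counts from the figure above (the helper name and the element counts are illustrative):

Python ai_quick_check.py
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """AI = useful FLOPs divided by bytes moved to/from memory."""
    return flops / bytes_moved

n = 1_000_000  # element count, illustrative

# Vector add: 1 FLOP/element, 2 reads + 1 write of FP32
print(arithmetic_intensity(n, 3 * n * 4))            # ~0.083
# Dot product: 2 FLOPs/element, 2 reads of FP32
print(arithmetic_intensity(2 * n, 2 * n * 4))        # 0.25
# Square GEMM: 2N^3 FLOPs, 3N^2 FP32 values moved (naive, no tiling)
N = 1024
print(arithmetic_intensity(2 * N**3, 3 * N**2 * 4))  # ~170.7, i.e. N/6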

The Roofline Equation

The attainable performance is bounded by:

$$P_{\text{attainable}} = \min\left( P_{\text{peak}}, \quad \text{BW}_{\text{peak}} \times \text{AI} \right)$$

Where $P_{\text{peak}}$ is the peak compute throughput (GFLOP/s), $\text{BW}_{\text{peak}}$ is the peak memory bandwidth (GB/s), and AI is the kernel's arithmetic intensity (FLOP/Byte).

On a log-log plot, this creates two lines:

  1. Bandwidth ceiling: $\log(P) = \log(\text{BW}) + \log(\text{AI})$ → slope of 1
  2. Compute ceiling: $\log(P) = \log(P_{\text{peak}})$ → horizontal line
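Plugging the A100 numbers used later in this article into that min() makes the two regimes concrete; a minimal sketch (the two sample intensities, 0.25 and 100 FLOP/Byte, are just illustrative points on either side of the ridge):

Python roofline_bound.py
def attainable_gflops(ai: float, peak_gflops: float = 19_500, bw_gbs: float = 2_039) -> float:
    """Roofline bound: P_attainable = min(peak compute, bandwidth * AI)."""
    return min(peak_gflops, bw_gbs * ai)

print(attainable_gflops(0.25))   # 509.75 -> the bandwidth ceiling applies (memory-bound)
print(attainable_gflops(100.0))  # 19500  -> the compute ceiling applies (compute-bound)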

Compute-Bound vs Memory-Bound

The ridge point is where the two ceilings meet:

$$\text{AI}_{\text{ridge}} = \frac{P_{\text{peak}}}{\text{BW}_{\text{peak}}}$$
Example: NVIDIA A100

Peak FP32: 19.5 TFLOP/s = 19,500 GFLOP/s
HBM2e Bandwidth: 2,039 GB/s
Ridge Point: 19,500 / 2,039 = 9.6 FLOP/Byte

Kernels with AI < 9.6 are memory-bound, AI > 9.6 are compute-bound.

3. Computing Peak Performance

FLOPS Calculation

Peak theoretical FLOPS depends on:

$$P_{\text{peak}} = \text{Cores} \times \text{Clock} \times \text{FLOPs/cycle/core}$$

For GPUs, the calculation differs by unit type:

Python peak_flops.py
def calculate_peak_flops(gpu: str) -> dict:
    """Calculate theoretical peak FLOPS for common GPUs."""
    
    specs = {
        "A100_80GB": {
            "sm_count": 108,
            "fp32_cores_per_sm": 64,
            "fp64_cores_per_sm": 32,
            "tensor_cores_per_sm": 4,
            "clock_ghz": 1.41,  # Boost clock
        },
        "H100_SXM": {
            "sm_count": 132,
            "fp32_cores_per_sm": 128,
            "fp64_cores_per_sm": 64,
            "tensor_cores_per_sm": 4,
            "clock_ghz": 1.83,
        },
    }
    
    s = specs[gpu]
    
    # CUDA cores: 2 FLOP/cycle (FMA = 1 multiply + 1 add).
    # Clock is in GHz, so the product is directly in GFLOP/s.
    fp32_peak = s["sm_count"] * s["fp32_cores_per_sm"] * s["clock_ghz"] * 2
    fp64_peak = s["sm_count"] * s["fp64_cores_per_sm"] * s["clock_ghz"] * 2
    
    return {
        "FP32 (TFLOPS)": fp32_peak / 1000,
        "FP64 (TFLOPS)": fp64_peak / 1000,
    }

# A100 Example:
# FP32: 108 SM × 64 cores × 1.41 GHz × 2 FLOP/cycle = 19.5 TFLOPS
# FP64: 108 SM × 32 cores × 1.41 GHz × 2 FLOP/cycle = 9.7 TFLOPS

Precision Effects (FP64 → INT8)

Different precisions have vastly different peak throughputs:

| Precision | A100 (CUDA) | A100 (Tensor) | H100 (Tensor) | Ratio vs FP32 |
|---|---|---|---|---|
| FP64 | 9.7 TFLOPS | 19.5 TFLOPS | 67 TFLOPS | 0.5× |
| FP32 | 19.5 TFLOPS | 156 TFLOPS* | 989 TFLOPS* | 1× (baseline) |
| TF32 | | 156 TFLOPS | 989 TFLOPS | 8× |
| FP16/BF16 | 39 TFLOPS | 312 TFLOPS | 1,979 TFLOPS | 16× |
| INT8 | | 624 TOPS | 3,958 TOPS | 32× |

*TF32 uses FP32 inputs but computes with reduced mantissa precision on Tensor Cores

Tensor Cores Peak

Tensor Cores perform matrix multiply-accumulate on small tiles:

$$D = A \times B + C \quad \text{where } A, B, C, D \text{ are small matrices}$$
Python tensor_core_peak.py
def tensor_core_peak(gpu: str, precision: str) -> float:
    """
    Tensor Core peak calculation.
    
    A100 Tensor Core: 256 FP16 FMA ops/cycle per TC
    4 TCs per SM × 108 SMs × 1.41 GHz × 256 × 2 = 312 TFLOPS
    """
    
    tc_specs = {
        "A100": {
            "tcs_per_sm": 4,
            "sm_count": 108,
            "clock_ghz": 1.41,
            "fp16_ops_per_tc_per_cycle": 256,  # 16×16×16 MMA
        },
        "H100": {
            "tcs_per_sm": 4,
            "sm_count": 132,
            "clock_ghz": 1.83,
            "fp16_ops_per_tc_per_cycle": 512,  # Larger MMA
        },
    }
    
    s = tc_specs[gpu]
    
    # Base FP16 calculation
    fp16_tflops = (s["tcs_per_sm"] * s["sm_count"] * s["clock_ghz"] 
                   * s["fp16_ops_per_tc_per_cycle"] * 2 / 1000)
    
    # Precision multipliers
    multipliers = {
        "FP16": 1.0,
        "BF16": 1.0,
        "TF32": 0.5,
        "FP64": 0.0625,  # 1/16
        "INT8": 2.0,
    }
    
    return fp16_tflops * multipliers[precision]
A100 Peak Performance by Precision (Log Scale)
FP64: 9.7 TFLOPS · FP32: 19.5 TFLOPS · TF32 (TC): 156 TFLOPS · FP16 (TC): 312 TFLOPS · INT8 (TC): 624 TOPS. Tensor Cores deliver 16× the FP32 throughput and 32× the FP64 throughput.
Tensor Cores (TC) provide massive speedups for supported precisions

4. Computing Peak Bandwidth

Memory Hierarchy

Modern GPUs have multiple memory levels, each with different bandwidth:

GPU Memory Hierarchy (A100)
Registers: 256 KB/SM × 108 SMs (~27 MB total), ~20 TB/s, ~1 cycle latency.
L1 cache / shared memory: 192 KB/SM, configurable (~20 MB total), ~19 TB/s, ~28 cycle latency.
L2 cache: 40 MB shared across all SMs, ~5 TB/s, ~200 cycle latency.
HBM2e (global memory): 80 GB (5 stacks × 8 channels), 2,039 GB/s, ~400 cycle latency.
Higher = faster but smaller. Most data lives in HBM.

Theoretical Bandwidth

HBM bandwidth is calculated from memory specifications:

$$\text{BW}_{\text{theoretical}} = \text{Memory Clock} \times \text{Bus Width} \times \text{Data Rate}$$
Python bandwidth_calc.py
def calculate_hbm_bandwidth(gpu: str) -> float:
    """Calculate theoretical HBM bandwidth."""
    
    hbm_specs = {
        "A100_80GB": {
            # HBM2e specifications
            "memory_clock_mhz": 1593,       # Effective clock
            "bus_width_bits": 5120,         # 5 stacks × 1024 bits
            "data_rate": 2,                 # DDR = 2 transfers/cycle
        },
        "H100_SXM": {
            # HBM3 specifications
            "memory_clock_mhz": 2619,
            "bus_width_bits": 5120,
            "data_rate": 2,
        },
    }
    
    s = hbm_specs[gpu]
    
    # BW (GB/s) = Clock (MHz) × Width (bits) × DataRate / 8 / 1000
    bw_gbps = (s["memory_clock_mhz"] * s["bus_width_bits"] 
               * s["data_rate"] / 8 / 1000)
    
    return bw_gbps

# A100 80GB:
# 1593 MHz × 5120 bits × 2 / 8 / 1000 = 2,039 GB/s

# H100 SXM:
# 2619 MHz × 5120 bits × 2 / 8 / 1000 = 3,352 GB/s

A100 80GB: 2,039 GB/s HBM2e bandwidth
H100 SXM: 3,352 GB/s HBM3 bandwidth
A100 L2 cache: ~5,000 GB/s
A100 L1/shared memory: ~19,000 GB/s

Multi-Ceiling Rooflines

Since each memory level has different bandwidth, we can draw multiple rooflines:

Multi-Ceiling Roofline (A100)
[Figure: roofline with separate bandwidth ceilings for HBM (2,039 GB/s), L2 (~5 TB/s), and L1/shared (~19 TB/s), and compute ceilings for FP32 CUDA cores (19.5 TFLOPS), TF32 Tensor Cores (156 TFLOPS), and FP16 Tensor Cores (312 TFLOPS).]
Multiple memory bandwidth ceilings and compute ceilings show different performance bounds
Key Insight

A kernel that's memory-bound at the HBM level might become compute-bound if data fits in L2 cache. This is why tiling is so important - it moves your working set to faster memory levels.
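One way to quantify that insight is to compute a ridge point per memory level; a minimal sketch using the approximate A100 numbers quoted above:

Python per_level_ridge.py
# Ridge point per memory level = peak compute / bandwidth of that level.
peak_fp32_gflops = 19_500
level_bw_gbs = {"HBM": 2_039, "L2": 5_000, "L1/Shared": 19_000}

for level, bw in level_bw_gbs.items():
    print(f"{level:>10}: ridge = {peak_fp32_gflops / bw:.2f} FLOP/Byte")

# HBM: 9.56, L2: 3.90, L1/Shared: 1.03 -- the same kernel needs far less
# arithmetic intensity to become compute-bound once its data lives in cache.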

5. Building Your Roofline

To create a roofline for your GPU, you need:

  1. Peak Compute: From specs or microbenchmark
  2. Peak Bandwidth: From specs or STREAM benchmark
  3. Ridge Point: Compute / Bandwidth
Python build_roofline.py
import numpy as np
import matplotlib.pyplot as plt

def build_roofline(
    peak_gflops: float,
    peak_bw_gbs: float,
    ai_range: tuple = (0.01, 1000),
):
    """
    Build and plot a roofline model.
    
    Args:
        peak_gflops: Peak compute performance in GFLOP/s
        peak_bw_gbs: Peak memory bandwidth in GB/s
        ai_range: (min, max) arithmetic intensity range
    """
    # Ridge point
    ridge_ai = peak_gflops / peak_bw_gbs
    print(f"Ridge Point: {ridge_ai:.2f} FLOP/Byte")
    
    # Generate AI values (log scale)
    ai = np.logspace(np.log10(ai_range[0]), 
                      np.log10(ai_range[1]), 1000)
    
    # Roofline equation: P = min(peak, BW × AI)
    performance = np.minimum(peak_gflops, peak_bw_gbs * ai)
    
    # Plot
    plt.figure(figsize=(10, 7))
    plt.loglog(ai, performance, 'b-', linewidth=2.5, label='Roofline')
    
    # Mark ridge point
    plt.axvline(x=ridge_ai, color='purple', linestyle='--', alpha=0.7)
    plt.scatter([ridge_ai], [peak_gflops], color='purple', s=100, 
                zorder=5, label=f'Ridge ({ridge_ai:.1f} FLOP/B)')
    
    # Labels
    plt.xlabel('Arithmetic Intensity (FLOP/Byte)', fontsize=12)
    plt.ylabel('Performance (GFLOP/s)', fontsize=12)
    plt.title('Roofline Model', fontsize=14)
    
    # Add region annotations
    plt.text(ridge_ai / 10, peak_gflops / 5, 'Memory\nBound', 
            fontsize=12, color='blue', ha='center')
    plt.text(ridge_ai * 10, peak_gflops / 2, 'Compute\nBound', 
            fontsize=12, color='red', ha='center')
    
    plt.grid(True, which='both', alpha=0.3)
    plt.legend()
    plt.tight_layout()
    plt.show()
    
    return ridge_ai


# Example: A100 FP32
ridge = build_roofline(
    peak_gflops=19500,     # 19.5 TFLOPS
    peak_bw_gbs=2039,      # 2039 GB/s
)
# Output: Ridge Point: 9.56 FLOP/Byte

6. Worked Examples

Example 1: Vector Scaling (SAXPY)

SAXPY: $y = \alpha x + y$

Python saxpy_analysis.py
def analyze_saxpy(n: int, dtype_bytes: int = 4):
    """
    SAXPY: y = α * x + y
    
    For each element:
    - Read: x[i], y[i] → 2 reads
    - Compute: multiply, add → 2 FLOPs  
    - Write: y[i] → 1 write
    """
    
    # FLOPs: 2 per element (1 mul + 1 add)
    flops = 2 * n
    
    # Bytes: read x, read y, write y = 3 × n × sizeof(dtype)
    bytes_transferred = 3 * n * dtype_bytes
    
    # Arithmetic Intensity
    ai = flops / bytes_transferred
    
    print(f"SAXPY Analysis (n={n:,}, FP32)")
    print(f"  FLOPs: {flops:,}")
    print(f"  Bytes: {bytes_transferred:,}")
    print(f"  AI: {ai:.4f} FLOP/Byte")
    
    # A100 Analysis
    a100_peak = 19500  # GFLOP/s
    a100_bw = 2039     # GB/s
    a100_ridge = a100_peak / a100_bw
    
    print(f"\n  A100 Ridge: {a100_ridge:.2f} FLOP/Byte")
    print(f"  Status: {'Memory-bound' if ai < a100_ridge else 'Compute-bound'}")
    
    # Expected performance
    expected_gflops = min(a100_peak, a100_bw * ai)
    print(f"  Expected perf: {expected_gflops:.1f} GFLOP/s")
    print(f"  Efficiency: {100 * expected_gflops / a100_peak:.1f}% of peak compute")
    
    return ai

analyze_saxpy(100_000_000)
# Output:
# SAXPY Analysis (n=100,000,000, FP32)
#   FLOPs: 200,000,000
#   Bytes: 1,200,000,000
#   AI: 0.1667 FLOP/Byte
#
#   A100 Ridge: 9.56 FLOP/Byte
#   Status: Memory-bound
#   Expected perf: 339.8 GFLOP/s
#   Efficiency: 1.7% of peak compute
SAXPY Reality Check

SAXPY can only achieve 1.7% of the A100's compute capacity! It's so memory-bound that optimizing compute is pointless - you need to reduce memory traffic (e.g., kernel fusion).

Example 2: Matrix Multiplication (GEMM)

GEMM: $C = A \times B$ where $A$ is $M \times K$ and $B$ is $K \times N$

Python gemm_analysis.py
def analyze_gemm(M: int, N: int, K: int, dtype_bytes: int = 4):
    """
    GEMM: C = A × B
    
    A: M×K, B: K×N, C: M×N
    
    FLOPs: M × N × (2K - 1) ≈ 2MNK for large K
           (K multiplies + K-1 adds per output element)
    
    Bytes (naive, no tiling):
           Read A, Read B, Write C = (MK + KN + MN) × dtype_bytes
    """
    
    # FLOPs
    flops = 2 * M * N * K
    
    # Bytes (assuming matrices read once from HBM)
    bytes_transferred = (M*K + K*N + M*N) * dtype_bytes
    
    # Arithmetic Intensity
    ai = flops / bytes_transferred
    
    print(f"GEMM Analysis ({M}×{K} @ {K}×{N}, FP32)")
    print(f"  FLOPs: {flops:,}")
    print(f"  Bytes: {bytes_transferred:,}")
    print(f"  AI: {ai:.2f} FLOP/Byte")
    
    # A100 Analysis
    a100_ridge = 9.56
    status = 'Compute-bound' if ai > a100_ridge else 'Memory-bound'
    print(f"  Status: {status}")
    
    return ai

# Small matrices: lower AI, but already past the FP32 ridge
analyze_gemm(128, 128, 128)
# AI: 21.33 FLOP/Byte → Compute-bound

# Larger = higher AI (more reuse per byte loaded)
analyze_gemm(4096, 4096, 4096)
# AI: 682.67 FLOP/Byte → Very compute-bound

# For square FP32 matrices: AI ≈ N/6
# N=128: AI ≈ 21
# N=4096: AI ≈ 683

Example 3: Reduction (Sum)

Python reduction_analysis.py
def analyze_reduction(n: int, dtype_bytes: int = 4):
    """
    Reduction: s = sum(x)
    
    FLOPs: n-1 additions
    Bytes: n reads + 1 write (negligible)
    """
    
    flops = n - 1
    bytes_transferred = n * dtype_bytes  # Ignore single output write
    ai = flops / bytes_transferred
    
    print(f"Reduction (n={n:,})")
    print(f"  AI: {ai:.4f} FLOP/Byte")
    print(f"  Status: Extremely memory-bound")
    
    return ai

analyze_reduction(100_000_000)
# AI: 0.25 FLOP/Byte - Severely memory-bound!

Summary: Common Operations

| Operation | FLOPs | Bytes | AI (FP32) | Status on A100 |
|---|---|---|---|---|
| Vector Add | N | 12N | 0.083 | 🔵 Memory-bound |
| SAXPY | 2N | 12N | 0.167 | 🔵 Memory-bound |
| Dot Product | 2N | 8N | 0.25 | 🔵 Memory-bound |
| Sum Reduction | N | 4N | 0.25 | 🔵 Memory-bound |
| Softmax | 5N | 8N | 0.625 | 🔵 Memory-bound |
| LayerNorm | 8N | 8N | 1.0 | 🔵 Memory-bound |
| GEMM (N=128) | 2N³ | 12N² | ~21 | 🔴 Compute-bound |
| GEMM (N=4096) | 2N³ | 12N² | ~683 | 🔴 Compute-bound |

7. Achievable vs Theoretical Performance

Theoretical peak performance is a ceiling you'll never reach. Real-world achievable performance depends on many factors: memory access patterns, instruction mix, occupancy, and more.

Critical Distinction

Theoretical Peak: What the hardware could do if perfectly utilized (from specs).
Achievable Peak: What you can actually measure with optimized benchmarks.
The gap between them is typically 10-30% for bandwidth, 5-15% for compute.
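A rough sketch of how to fold those gaps into the model before drawing the roof (the 8% and 5% derating factors are illustrative, in line with the measured numbers later in this section):

Python derated_peaks.py
theoretical_bw_gbs = 2_039        # A100 HBM2e, from the spec sheet
theoretical_fp32_gflops = 19_500  # A100 FP32 CUDA cores

achievable_bw = theoretical_bw_gbs * (1 - 0.08)            # ~1,876 GB/s, STREAM-like
achievable_compute = theoretical_fp32_gflops * (1 - 0.05)  # ~18,525 GFLOP/s

print(f"Realistic FP32 ridge: {achievable_compute / achievable_bw:.1f} FLOP/Byte")
# ~9.9 FLOP/Byte, versus 9.6 from pure spec-sheet numbers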

STREAM Benchmark

The STREAM benchmark measures achievable memory bandwidth using simple vector operations:

| Kernel | Operation | Bytes/Element | FLOPs/Element |
|---|---|---|---|
| Copy | a[i] = b[i] | 16 (2 × 8 for FP64) | 0 |
| Scale | a[i] = q × b[i] | 16 | 1 |
| Add | a[i] = b[i] + c[i] | 24 | 1 |
| Triad | a[i] = b[i] + q × c[i] | 24 | 2 |
Python stream_benchmark.py
import torch
import time

def stream_triad(n: int, dtype=torch.float64, device="cuda", iterations=100):
    """STREAM Triad: a = b + scalar * c"""
    
    # Allocate arrays
    a = torch.zeros(n, dtype=dtype, device=device)
    b = torch.ones(n, dtype=dtype, device=device)
    c = torch.ones(n, dtype=dtype, device=device)
    scalar = 3.0
    
    # Warmup (write into the preallocated output so allocation isn't timed)
    for _ in range(10):
        torch.add(b, c, alpha=scalar, out=a)   # a = b + scalar * c
    
    torch.cuda.synchronize()
    
    # Timed iterations
    start = time.perf_counter()
    for _ in range(iterations):
        torch.add(b, c, alpha=scalar, out=a)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    
    # Calculate bandwidth
    bytes_per_elem = a.element_size()
    total_bytes = 3 * n * bytes_per_elem * iterations  # 2 read + 1 write
    bandwidth_gbs = total_bytes / elapsed / 1e9
    
    print(f"STREAM Triad (n={n:,}, {dtype})")
    print(f"  Measured bandwidth: {bandwidth_gbs:.1f} GB/s")
    
    return bandwidth_gbs

# Run on A100
bw = stream_triad(n=100_000_000)  # 100M elements
# Typical output: ~1800-1900 GB/s (vs 2039 theoretical)
# Efficiency: ~88-93%
STREAM Benchmark Results: Theoretical vs Achievable
A100 80GB: theoretical 2,039 GB/s vs STREAM Triad ~1,880 GB/s (≈92%). H100 SXM: theoretical 3,352 GB/s vs STREAM Triad ~3,040 GB/s (≈91%). Why the gap: ECC overhead (~6-7%), memory-controller inefficiency, address-translation overhead, refresh cycles, and bank conflicts. Use STREAM results for realistic rooflines.
STREAM Triad measures achievable memory bandwidth - use this for roofline analysis

GEMM Benchmark

For compute-bound operations, the GEMM benchmark measures achievable compute throughput:

Python gemm_benchmark.py
import torch
import time

def gemm_benchmark(
    M: int, N: int, K: int,
    dtype=torch.float16,
    device="cuda",
    iterations=100
):
    """Benchmark matrix multiplication throughput."""
    
    # Allocate matrices
    A = torch.randn(M, K, dtype=dtype, device=device)
    B = torch.randn(K, N, dtype=dtype, device=device)
    
    # Warmup (important for Tensor Cores!)
    for _ in range(20):
        C = torch.mm(A, B)
    
    torch.cuda.synchronize()
    
    # Timed iterations
    start = time.perf_counter()
    for _ in range(iterations):
        C = torch.mm(A, B)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    
    # Calculate FLOPS
    flops_per_gemm = 2 * M * N * K
    total_flops = flops_per_gemm * iterations
    tflops = total_flops / elapsed / 1e12
    
    print(f"GEMM Benchmark ({M}×{K} @ {K}×{N}, {dtype})")
    print(f"  Measured: {tflops:.1f} TFLOPS")
    
    return tflops

# Large square matrices for peak throughput
tflops = gemm_benchmark(8192, 8192, 8192, dtype=torch.float16)
# A100 FP16 TC: ~280 TFLOPS (vs 312 theoretical) = 90%
# H100 FP16 TC: ~1700 TFLOPS (vs 1979 theoretical) = 86%

# Shape matters for efficiency
gemm_benchmark(1024, 1024, 1024, dtype=torch.float16)  # ~200 TFLOPS
gemm_benchmark(256, 256, 256, dtype=torch.float16)    # ~50 TFLOPS (poor!)
GEMM Efficiency vs Matrix Size (A100 FP16 Tensor Cores)
Against the 312 TFLOPS peak, measured FP16 Tensor Core throughput grows with matrix size: N=128 → 8 TFLOPS, 256 → 42, 512 → 117, 1K → 217, 2K → 261, 4K → 277, 8K → 285, 16K → 292. Efficiency is roughly 13% at N=256, ~70% at N=1024, and ~90% for N ≥ 4096.
GEMM efficiency improves dramatically with matrix size - small matrices waste Tensor Core capacity

The Efficiency Gap

Multiple factors contribute to the gap between theoretical and achievable performance:

Sources of Performance Loss
Memory-bound operations (A100 HBM): theoretical peak 2,039 GB/s (100%) → after ECC ~1,905 GB/s (93%) → after TLB/paging overhead ~1,850 GB/s (91%) → after bank conflicts ~1,800 GB/s (88%) → achievable ~1,750 GB/s (86%). Compute-bound operations (FP16 Tensor Cores): theoretical peak 312 TFLOPS (100%) → instruction overhead ~294 TFLOPS (94%) → memory latency ~286 TFLOPS (92%) → occupancy limits ~280 TFLOPS (90%) → achievable ~275 TFLOPS (88%). Key insight: use measured (achievable) values for roofline analysis; theoretical peaks overestimate what's possible by 10-15%.
Performance loss cascade - each factor compounds to reduce achievable throughput
Python achievable_roofline.py
def build_realistic_roofline(gpu: str):
    """Build roofline with achievable (not theoretical) peaks."""
    
    # Achievable peaks from benchmarks
    achievable = {
        "A100_80GB": {
            # Memory bandwidth (STREAM Triad)
            "hbm_bw_gbs": 1880,    # vs 2039 theoretical
            "l2_bw_gbs": 4500,     # vs ~5000 theoretical
            "l1_bw_gbs": 17000,    # vs ~19000 theoretical
            
            # Compute throughput (GEMM benchmark)
            "fp32_tflops": 18.5,    # vs 19.5 theoretical
            "tf32_tflops": 140,     # vs 156 theoretical
            "fp16_tflops": 285,     # vs 312 theoretical
        },
        "H100_SXM": {
            "hbm_bw_gbs": 3040,     # vs 3352 theoretical
            "l2_bw_gbs": 10000,
            "l1_bw_gbs": 30000,
            "fp32_tflops": 55,
            "tf32_tflops": 850,
            "fp16_tflops": 1700,
        },
    }
    
    a = achievable[gpu]
    
    # Ridge points with achievable values
    ridge_fp32 = a["fp32_tflops"] * 1000 / a["hbm_bw_gbs"]
    ridge_fp16 = a["fp16_tflops"] * 1000 / a["hbm_bw_gbs"]
    
    print(f"{gpu} Achievable Roofline:")
    print(f"  FP32 Ridge: {ridge_fp32:.1f} FLOP/B")
    print(f"  FP16 Ridge: {ridge_fp16:.1f} FLOP/B")
    
    return a

build_realistic_roofline("A100_80GB")
# A100 FP32 Ridge: 9.8 FLOP/B (vs 9.6 theoretical)
# A100 FP16 Ridge: 151.6 FLOP/B (vs 153 theoretical)

8. Cache-Aware Roofline Analysis

The basic roofline assumes all data comes from main memory (HBM). In reality, caches dramatically change the picture.

Hierarchical Roofline Model

The Hierarchical Roofline Model (HRM) accounts for different memory levels:

$$P_{\text{attainable}} = \min\left( P_{\text{peak}}, \quad \min_{i}\left( \text{BW}_i \times \text{AI}_i \right) \right)$$

Where $i$ iterates over memory levels (L1, L2, HBM) and $\text{AI}_i$ is the arithmetic intensity relative to that level.
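A minimal sketch of that bound (the per-level intensities are measured inputs, e.g. from a profiler; the bandwidths are the achievable A100 values used elsewhere in this article):

Python hierarchical_bound.py
def hierarchical_bound(ai_per_level: dict, peak_gflops: float = 19_500) -> float:
    """Attainable GFLOP/s = min(compute peak, min over levels of BW_i * AI_i)."""
    bw_gbs = {"HBM": 1_880, "L2": 4_500, "L1": 17_000}  # achievable A100 values
    memory_ceilings = [bw_gbs[level] * ai for level, ai in ai_per_level.items()]
    return min([peak_gflops] + memory_ceilings)

# Hypothetical kernel: plenty of reuse relative to HBM, much less relative to L1
print(hierarchical_bound({"HBM": 50.0, "L2": 4.0, "L1": 1.0}))
# -> 17000.0: the HBM (94,000) and L2 (18,000) ceilings sit above the answer,
#    but the L1 ceiling (17,000) binds before the compute peak (19,500) does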

Hierarchical Roofline: Same Kernel, Different Cache Behavior
[Figure: hierarchical roofline with L1 (19 TB/s), L2 (5 TB/s), and HBM (1.88 TB/s) bandwidth ceilings under the FP16 Tensor Core ceiling (285 TFLOPS); the same kernel lands on a different roofline depending on whether its working set hits in L1, hits in L2, or streams from HBM (cold cache).]
The same kernel can land on different rooflines depending on cache behavior

Data Locality Effects

Data locality determines which bandwidth ceiling limits your kernel:

Python locality_analysis.py
def analyze_locality(
    working_set_bytes: int,
    l1_size: int = 192 * 1024,      # 192 KB per SM
    l2_size: int = 40 * 1024 * 1024,  # 40 MB total
):
    """Determine which memory level bounds a kernel."""
    
    if working_set_bytes <= l1_size:
        return "L1", 19000  # GB/s
    elif working_set_bytes <= l2_size:
        return "L2", 5000
    else:
        return "HBM", 1880

# Example: Matrix multiply tile sizes
for tile_size in [32, 64, 128, 256, 512]:
    # 3 tiles for A, B, C
    working_set = 3 * tile_size * tile_size * 4  # FP32
    level, bw = analyze_locality(working_set)
    print(f"Tile {tile_size}×{tile_size}: {working_set//1024} KB → {level} ({bw} GB/s)")

# Output:
# Tile 32×32: 12 KB → L1 (19000 GB/s)
# Tile 64×64: 48 KB → L1 (19000 GB/s)
# Tile 128×128: 192 KB → L1 (19000 GB/s)  ← Fits exactly!
# Tile 256×256: 768 KB → L2 (5000 GB/s)
# Tile 512×512: 3072 KB → L2 (5000 GB/s)

Impact of Tiling

Tiling (blocking) is the primary technique to improve data locality and move kernels to faster rooflines:

Effect of Tiling on GEMM Performance
Naive GEMM (no tiling): each output element re-reads a full row of A and a full column of B, so O(N³) memory traffic accompanies the O(N³) FLOPs; data streams from HBM, cache utilization is poor, and practical performance is roughly 60% of peak even though the operation is nominally compute-bound. Tiled GEMM (e.g. 128×128 tiles): tiles are loaded into shared memory and reused, cutting main-memory traffic by roughly the tile size; the ~192 KB working set (three FP32 tiles) lives in L1/shared memory at ~19 TB/s, and well-tuned implementations reach ~92% of peak.
Tiling moves data access from HBM to L1, dramatically improving performance
Tiling Rule of Thumb

Tile size should maximize L1/shared memory usage without spilling. For A100: aim for ~192KB working set per SM. This typically means 64×64 to 128×128 tiles for FP32, larger for FP16.
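A back-of-the-envelope sketch of why tiling matters so much, using standard traffic estimates for square FP32 GEMM (with no reuse every output element re-reads 2N inputs; with T×T tiles each input is streamed roughly N/T times, so these are approximations, not measurements):

Python tiling_traffic.py
def gemm_hbm_traffic_bytes(n, tile=None, dtype_bytes=4):
    """Approximate HBM traffic for a square N x N x N GEMM (FP32 by default)."""
    if tile is None:
        # No reuse: every output element re-reads a row of A and a column of B
        return (2 * n**3 + n**2) * dtype_bytes
    # With T x T tiling, A and B are each streamed from HBM ~N/T times in total
    return (2 * n**2 * (n / tile) + n**2) * dtype_bytes

n, flops = 4096, 2 * 4096**3
for tile in (None, 128):
    traffic = gemm_hbm_traffic_bytes(n, tile)
    label = "naive" if tile is None else f"tiled {tile}x{tile}"
    print(f"{label:>14}: {traffic / 1e9:7.1f} GB of HBM traffic, "
          f"AI = {flops / traffic:6.2f} FLOP/Byte")

# naive        :   549.8 GB, AI ~0.25 (memory-bound)
# tiled 128x128:     4.4 GB, AI ~31.5 (compute-bound on the A100 FP32 roofline)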

9. Operational vs Arithmetic Intensity

There's an important distinction between two related metrics:

| Metric | Definition | What It Measures | Use Case |
|---|---|---|---|
| Arithmetic Intensity | FLOPs / bytes from the algorithm spec | Inherent algorithm characteristic | Algorithm selection |
| Operational Intensity | FLOPs / bytes actually transferred | Real cache behavior | Implementation optimization |
Python intensity_comparison.py
def compare_intensities(M: int, N: int, K: int, tile_size: int):
    """Compare arithmetic vs operational intensity for GEMM."""
    
    flops = 2 * M * N * K
    
    #----------------------------------------
    # Arithmetic Intensity (from algorithm)
    #----------------------------------------
    bytes_algo = (M*K + K*N + M*N) * 4  # FP32
    ai = flops / bytes_algo
    
    #----------------------------------------
    # Operational Intensity (per memory level, with tiling)
    #----------------------------------------
    # HBM traffic: with good tiling, each matrix element is loaded once
    hbm_bytes = bytes_algo
    oi_hbm = flops / hbm_bytes  # Equals AI when tiling works perfectly
    
    # L1/shared traffic: every K-step reloads a pair of tiles, so the cache
    # level sees roughly K/tile_size times more bytes than HBM does
    num_tile_loads = (M // tile_size) * (N // tile_size) * (K // tile_size)
    l1_bytes = num_tile_loads * 3 * tile_size * tile_size * 4
    oi_l1 = flops / l1_bytes
    
    print(f"GEMM {M}×{K} @ {K}×{N}, tile={tile_size}")
    print(f"  Arithmetic Intensity: {ai:.1f} FLOP/B")
    print(f"  Operational Intensity (HBM): {oi_hbm:.1f} FLOP/B")
    print(f"  Operational Intensity (L1): {oi_l1:.1f} FLOP/B")
    print(f"  L1-to-HBM traffic ratio: {l1_bytes / hbm_bytes:.0f}×")

compare_intensities(4096, 4096, 4096, tile_size=128)
# Output:
# GEMM 4096×4096 @ 4096×4096, tile=128
#   Arithmetic Intensity: 682.7 FLOP/B
#   Operational Intensity (HBM): 682.7 FLOP/B
#   Operational Intensity (L1): 21.3 FLOP/B
#   L1-to-HBM traffic ratio: 32×

# With poor tiling (or none), redundant HBM loads drive the HBM OI far below AI
When They Differ

AI = OI: Perfect implementation, all cache benefits realized.
AI > OI: Poor implementation - cache misses and redundant loads inflate the bytes actually transferred.
AI < OI: Impossible at the DRAM/HBM level - you cannot transfer fewer bytes than the algorithm requires.

10. Profiling Tools for Roofline

Modern profiling tools can automatically generate roofline plots:

NVIDIA Nsight Compute

Nsight Compute provides built-in roofline analysis for GPU kernels:

Bash nsight_roofline.sh
# Profile with roofline metrics
ncu --set roofline -o my_kernel_roofline ./my_application

# Open in GUI
ncu-ui my_kernel_roofline.ncu-rep

# Key sections in report:
# - "Roofline Analysis" - visual plot
# - "Memory Workload Analysis" - bandwidth utilization
# - "Compute Workload Analysis" - FLOP utilization

Intel Advisor

For CPU analysis, Intel Advisor provides comprehensive roofline support:

Bash advisor_roofline.sh
# Collect roofline data
advisor --collect=roofline --project-dir=./advisor_results -- ./my_application

# Generate report
advisor --report=roofline --project-dir=./advisor_results

# Features:
# - Automatic AI calculation per loop
# - Cache-aware roofline (L1/L2/L3/DRAM)
# - Vectorization efficiency overlay

PyTorch Profiler

Python pytorch_profiling.py
import torch
from torch.profiler import profile, ProfilerActivity

def profile_for_roofline(model, input_data, device="cuda"):
    """Profile model for roofline analysis."""
    
    model = model.to(device)
    input_data = input_data.to(device)
    
    # Warmup
    for _ in range(10):
        _ = model(input_data)
    
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        with_flops=True,
        profile_memory=True,
        record_shapes=True,
    ) as prof:
        _ = model(input_data)
    
    # Print breakdown
    print(prof.key_averages().table(
        sort_by="cuda_time_total",
        row_limit=20
    ))
    
    # Export for visualization
    prof.export_chrome_trace("trace.json")
    
    # Calculate AI for each kernel
    for event in prof.key_averages():
        if event.flops > 0 and event.cuda_memory_usage > 0:
            ai = event.flops / event.cuda_memory_usage
            print(f"{event.key}: AI = {ai:.2f} FLOP/B")

# Example usage
model = torch.nn.Linear(4096, 4096)
input_data = torch.randn(32, 4096)
profile_for_roofline(model, input_data)
Roofline Profiling Tool Comparison
Nsight Compute (NVIDIA GPUs): automatic roofline plot, per-kernel analysis, cache-aware metrics, Tensor Core tracking, memory-level breakdown; drawback: overhead for a full profile. Intel Advisor (Intel CPUs): loop-level roofline, vectorization analysis, cache simulation, optimization hints, source correlation; drawback: Intel CPUs only. PyTorch Profiler (cross-platform): easy integration, FLOPs per op, memory tracking, timeline export; drawbacks: no built-in roofline, manual AI calculation, op-level granularity only.
Choose the right tool based on your hardware and analysis needs

11. BLAS Operations on the Roofline

The Basic Linear Algebra Subprograms (BLAS) are organized into three levels based on their arithmetic intensity:

Level 1: Vector Operations

Level 1 BLAS performs vector-vector operations: $O(n)$ data, $O(n)$ FLOPs → AI ≈ constant

| Operation | Definition | FLOPs | Bytes (FP32) | AI |
|---|---|---|---|---|
| SCOPY | y ← x | 0 | 8n | 0 |
| SSCAL | x ← αx | n | 8n | 0.125 |
| SAXPY | y ← αx + y | 2n | 12n | 0.167 |
| SDOT | s ← x·y | 2n | 8n | 0.25 |
| SNRM2 | s ← ‖x‖₂ | 2n | 4n | 0.5 |
| SASUM | s ← Σ\|xᵢ\| | n | 4n | 0.25 |
Level 1 Reality

All Level 1 BLAS operations are severely memory-bound. On an A100 (ridge point ~10 FLOP/B), they achieve only 1-5% of peak compute. Optimization focus: minimize memory traffic, maximize bandwidth utilization.
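To make the "only 1-5% of peak" claim concrete, a minimal sketch that evaluates the roofline bound for each Level 1 AI from the table above, using the A100 FP32 peak and HBM bandwidth as before:

Python level1_bounds.py
a100_peak_gflops, a100_bw_gbs = 19_500, 2_039

level1_ai = {"SCOPY": 0.0, "SSCAL": 0.125, "SAXPY": 0.167,
             "SDOT": 0.25, "SNRM2": 0.5, "SASUM": 0.25}

for op, ai in level1_ai.items():
    bound = min(a100_peak_gflops, a100_bw_gbs * ai)
    print(f"{op}: at most {bound:7.1f} GFLOP/s "
          f"({100 * bound / a100_peak_gflops:.1f}% of FP32 peak)")

# SAXPY tops out near ~340 GFLOP/s (1.7%), SDOT near ~510 (2.6%),
# SNRM2 near ~1,020 (5.2%) -- bandwidth, not compute, sets the limit.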

Level 2: Matrix-Vector Operations

Level 2 BLAS performs matrix-vector operations: $O(n^2)$ data, $O(n^2)$ FLOPs → AI ≈ constant

| Operation | Definition | FLOPs | Bytes (FP32) | AI |
|---|---|---|---|---|
| SGEMV | y ← αAx + βy | 2mn | 4(mn + m + n) | ≈ 0.5 |
| SSYMV | y ← αAx + βy (A symmetric) | 2n² | 4(n²/2 + 2n) | ≈ 1.0 |
| STRMV | x ← Ax (A triangular) | ≈ n² | 4(n²/2 + n) | ≈ 0.5 |
| SGER | A ← αxyᵀ + A | 2mn | 4(mn + m + n) | ≈ 0.5 |
Python level2_analysis.py
def analyze_gemv(m: int, n: int, dtype_bytes: int = 4):
    """GEMV: y = A @ x where A is m×n."""
    
    flops = 2 * m * n  # Each output: n muls + n adds
    
    # Bytes: read A (m×n), read x (n), write y (m)
    bytes_transferred = (m * n + n + m) * dtype_bytes
    
    ai = flops / bytes_transferred
    
    print(f"GEMV ({m}×{n})")
    print(f"  AI: {ai:.3f} FLOP/Byte")
    print(f"  Status: Memory-bound (ridge=10)")
    
    # For large matrices, AI → 2/(4+4/n+4/m) → 0.5
    return ai

# GEMV is always memory-bound regardless of size!
analyze_gemv(1024, 1024)    # AI: 0.499
analyze_gemv(4096, 4096)    # AI: 0.500
analyze_gemv(16384, 16384)  # AI: 0.500

Level 3: Matrix-Matrix Operations

Level 3 BLAS performs matrix-matrix operations: $O(n^2)$ data, $O(n^3)$ FLOPs → AI grows with n

| Operation | Definition | FLOPs | Bytes (FP32) | AI (square matrices) |
|---|---|---|---|---|
| SGEMM | C ← αAB + βC | 2mnk | 4(mk + kn + mn) | ≈ n/6 |
| SSYMM | C ← αAB + βC (A symmetric) | 2mn² | 4(n²/2 + mn + mn) | ≈ n/5 |
| SSYRK | C ← αAAᵀ + βC | n²k | 4(nk + n²/2) | ≈ n/6 |
| STRMM | B ← αAB (A triangular) | mn² | 4(n²/2 + mn) | ≈ n/6 |
BLAS Levels on the Roofline (A100)
[Figure: on an A100, Level 1 (AI ≈ 0.1-0.5) and Level 2 (AI ≈ 0.5-2) BLAS sit far left of the ridge (~10) and are always memory-bound; Level 3 (GEMM) AI grows with matrix size, so large enough GEMMs (N = 128 ... 8192) cross into the compute-bound region.]
Only Level 3 BLAS can effectively utilize modern GPU compute capacity

12. Deep Learning Operations

Deep learning workloads combine many operations with different characteristics. Understanding each operation's roofline position is critical for optimization.

Linear Layers & GEMM

Fully-connected layers are GEMMs: $Y = XW + b$

Python linear_analysis.py
def analyze_linear(
    batch: int, in_features: int, out_features: int,
    dtype_bytes: int = 2  # FP16
):
    """Linear layer: Y = X @ W + b"""
    
    # GEMM: (batch × in) @ (in × out) = (batch × out)
    flops = 2 * batch * in_features * out_features
    
    # Bytes: read X, read W, write Y (ignore bias - small)
    bytes_x = batch * in_features * dtype_bytes
    bytes_w = in_features * out_features * dtype_bytes
    bytes_y = batch * out_features * dtype_bytes
    bytes_total = bytes_x + bytes_w + bytes_y
    
    ai = flops / bytes_total
    
    # Classification
    a100_ridge = 152  # FP16 Tensor Core
    status = "Compute-bound" if ai > a100_ridge else "Memory-bound"
    
    print(f"Linear({in_features}→{out_features}), batch={batch}")
    print(f"  AI: {ai:.1f} FLOP/B ({status})")
    
    return ai

# Typical LLM dimensions
analyze_linear(1, 4096, 4096)      # AI: 1.0 - Memory-bound!
analyze_linear(32, 4096, 4096)     # AI: 31.5 - Memory-bound
analyze_linear(256, 4096, 4096)    # AI: 227.6 - Compute-bound!
analyze_linear(1024, 4096, 4096)   # AI: 682.7 - Compute-bound

# Key insight: batch size determines bound type!

Convolutions

Convolutions can be expressed as im2col + GEMM or direct computation:

Python conv_analysis.py
def analyze_conv2d(
    batch: int,
    in_channels: int, out_channels: int,
    h: int, w: int,
    kernel: int = 3,
    dtype_bytes: int = 2
):
    """Conv2D arithmetic intensity analysis."""
    
    out_h, out_w = h, w  # Assume same padding
    
    # FLOPs: each output pixel needs kernel² × in_ch MACs
    flops = 2 * batch * out_channels * out_h * out_w * in_channels * kernel * kernel
    
    # Bytes
    bytes_input = batch * in_channels * h * w * dtype_bytes
    bytes_kernel = out_channels * in_channels * kernel * kernel * dtype_bytes
    bytes_output = batch * out_channels * out_h * out_w * dtype_bytes
    bytes_total = bytes_input + bytes_kernel + bytes_output
    
    ai = flops / bytes_total
    
    print(f"Conv2D: {in_channels}→{out_channels}, {kernel}×{kernel}, input {h}×{w}, batch={batch}")
    print(f"  AI: {ai:.1f} FLOP/B")
    
    return ai

# ResNet-style convolutions (naive byte counts: inputs, weights, outputs read/written once)
analyze_conv2d(1, 64, 64, 56, 56)    # AI: 263.8
analyze_conv2d(32, 64, 64, 56, 56)   # AI: 287.2
analyze_conv2d(1, 512, 512, 7, 7)   # AI: 48.0 - weight-dominated, memory-bound on Tensor Cores
analyze_conv2d(32, 512, 512, 7, 7)  # AI: 933.1 - batching makes it compute-bound

Attention Mechanisms

Self-attention: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$

Attention Mechanism: Operation Breakdown
Q @ Kᵀ is a GEMM (O(B·H·S²·D), compute-bound), softmax is elementwise (O(B·H·S²), memory-bound), scores @ V is another GEMM (compute-bound), and the output projection is a GEMM whose bound depends on batch size. Standard attention materializes the S×S score matrix, costing O(B·H·S²) memory, multiple kernel launches, and a memory-bound softmax between the two GEMMs. Flash Attention computes the same result in tiles and never stores the S×S matrix: HBM traffic is only O(B·S·D), everything runs in a single fused kernel, and the softmax happens in on-chip SRAM.
Flash Attention eliminates memory-bound softmax by keeping data in SRAM
Python attention_analysis.py
def analyze_attention(
    batch: int, heads: int, seq_len: int, head_dim: int,
    use_flash: bool = False,
    dtype_bytes: int = 2
):
    """Analyze attention arithmetic intensity."""
    
    B, H, S, D = batch, heads, seq_len, head_dim
    
    # FLOPs: Q@K.T + softmax + scores@V
    flops_qk = 2 * B * H * S * S * D      # Q @ K.T
    flops_softmax = 5 * B * H * S * S      # exp, sum, div, etc.
    flops_sv = 2 * B * H * S * S * D       # scores @ V
    total_flops = flops_qk + flops_softmax + flops_sv
    
    if use_flash:
        # Flash: only Q, K, V, O traverse HBM once
        bytes_total = 4 * B * H * S * D * dtype_bytes
    else:
        # Standard: Q, K, V, O + attention scores matrix
        bytes_qkvo = 4 * B * H * S * D * dtype_bytes
        bytes_scores = B * H * S * S * dtype_bytes  # The killer!
        bytes_total = bytes_qkvo + 2 * bytes_scores  # Read + write
    
    ai = total_flops / bytes_total
    
    mode = "Flash" if use_flash else "Standard"
    print(f"{mode} Attention: B={B}, H={H}, S={S}, D={D}")
    print(f"  AI: {ai:.1f} FLOP/B")
    
    return ai

# GPT-3 style: batch=1, heads=96, seq=2048, dim=128
analyze_attention(1, 96, 2048, 128, use_flash=False)  # AI: 114.9 (aggregate; softmax still runs as separate memory-bound kernels)
analyze_attention(1, 96, 2048, 128, use_flash=True)   # AI: 1034.0 (~9× better)

# Longer sequences amplify the difference
analyze_attention(1, 96, 8192, 128, use_flash=False)  # AI: 125.3
analyze_attention(1, 96, 8192, 128, use_flash=True)   # AI: 4136.0 (~33× better)

Normalization Layers

LayerNorm, BatchNorm, RMSNorm - all are memory-bound due to low AI:

| Operation | Formula | FLOPs/elem | Bytes/elem (FP16) | AI |
|---|---|---|---|---|
| LayerNorm | $(x - \mu) / \sigma \cdot \gamma + \beta$ | ~8 | 8 | 1.0 |
| RMSNorm | $x / \text{RMS}(x) \cdot \gamma$ | ~5 | 8 | 0.625 |
| BatchNorm | $(x - \mu) / \sigma \cdot \gamma + \beta$ | ~8 | 8 | 1.0 |
Fusing Norms

Since norms are memory-bound, kernel fusion is critical. Fusing LayerNorm with the preceding or following operation avoids a full round-trip to HBM. This is why libraries like FlashAttention and Apex fused kernels exist.
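A minimal sketch of the fusion argument for a residual add followed by LayerNorm, using the per-element FLOP counts from the tables above (FP16; the unfused version writes the intermediate sum to HBM and reads it back, the fused version keeps it on chip):

Python fusion_ai.py
def fused_vs_unfused_layernorm(n_elements: int, dtype_bytes: int = 2):
    """Compare AI of residual-add + LayerNorm as two kernels vs one fused kernel."""
    flops = (1 + 8) * n_elements  # 1 FLOP/elem for the add, ~8 for LayerNorm

    # Unfused: add reads x and residual, writes tmp; LayerNorm reads tmp, writes y
    unfused_bytes = (3 + 2) * n_elements * dtype_bytes
    # Fused: read x and residual once, write y once -- no HBM round-trip for tmp
    fused_bytes = 3 * n_elements * dtype_bytes

    print(f"Unfused AI: {flops / unfused_bytes:.2f} FLOP/B, "
          f"fused AI: {flops / fused_bytes:.2f} FLOP/B "
          f"({unfused_bytes / fused_bytes:.2f}x less HBM traffic)")

fused_vs_unfused_layernorm(32 * 1024 * 768)
# Unfused AI: 0.90 FLOP/B, fused AI: 1.50 FLOP/B (1.67x less HBM traffic)
# Both are still memory-bound, so the ~1.67x traffic cut is ~1.67x less runtime.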

Activations & Elementwise Ops

All elementwise operations have very low AI:

| Operation | FLOPs/elem | Bytes/elem (FP16) | AI |
|---|---|---|---|
| ReLU | 1 (comparison) | 4 | 0.25 |
| GELU | ~12 | 4 | 3.0 |
| SiLU/Swish | ~4 | 4 | 1.0 |
| Dropout | 2 | 4 | 0.5 |
| Add (residual) | 1 | 6 | 0.17 |
Deep Learning Operations on the Roofline
[Figure: elementwise ops (ReLU, residual add, LayerNorm, GELU), softmax, and standard attention sit left of the FP16 ridge (~152) and are memory-bound; Flash Attention, Conv2D, and large-batch Linear layers and GEMMs sit at or beyond the ridge and can reach the compute ceiling.]
Most DL operations are memory-bound; only large GEMMs fully utilize GPU compute

13. Batch Size Effects

Batch size is the most important knob for moving operations from memory-bound to compute-bound:

Python batch_size_analysis.py
import numpy as np

def analyze_batch_effect(
    in_features: int,
    out_features: int,
    batch_sizes: list,
    dtype_bytes: int = 2  # FP16
):
    """Show how batch size affects arithmetic intensity."""
    
    # Weight bytes (constant regardless of batch)
    W_bytes = in_features * out_features * dtype_bytes
    
    print(f"Linear Layer: {in_features}→{out_features}")
    print(f"Weights: {W_bytes / 1e6:.1f} MB")
    print(f"\n{'Batch':>8} {'FLOPs':>12} {'Bytes':>12} {'AI':>10} {'Status':>15}")
    print("-" * 60)
    
    a100_ridge = 152  # FP16 TC
    
    for B in batch_sizes:
        # FLOPs: 2 * B * in * out
        flops = 2 * B * in_features * out_features
        
        # Bytes: X (B×in) + W (in×out) + Y (B×out)
        bytes_total = (B * in_features + in_features * out_features + 
                      B * out_features) * dtype_bytes
        
        ai = flops / bytes_total
        status = "✓ Compute" if ai > a100_ridge else "✗ Memory"
        
        print(f"{B:>8} {flops:>12,} {bytes_total:>12,} {ai:>10.1f} {status:>15}")

# LLM-style dimensions
analyze_batch_effect(
    in_features=4096,
    out_features=4096,
    batch_sizes=[1, 4, 16, 64, 128, 256, 512, 1024]
)

# Output:
# Linear Layer: 4096→4096
# Weights: 33.6 MB
#
#    Batch        FLOPs        Bytes         AI          Status
# ------------------------------------------------------------
#        1   33,554,432   33,570,816        1.0       ✗ Memory
#        4  134,217,728   33,619,968        4.0       ✗ Memory
#       16  536,870,912   33,816,576       15.9       ✗ Memory
#       64 2,147,483,648   34,603,008       62.1       ✗ Memory
#      128 4,294,967,296   35,651,584      120.5       ✗ Memory
#      256 8,589,934,592   37,748,736      227.6       ✓ Compute
#      512 17,179,869,184   41,943,040      409.6       ✓ Compute
#     1024 34,359,738,368   50,331,648      682.7       ✓ Compute
Batch Size Effect on AI (Linear 4096→4096)
[Figure: AI rises from 1.0 at batch 1 and crosses the FP16 Tensor Core ridge (~152 FLOP/B) between batch 128 and 256; for batches much smaller than the layer dimensions, AI ≈ B (FP16), before saturating for very large batches.]
Arithmetic intensity grows nearly linearly with batch size until activation traffic starts to rival the weight traffic.
LLM Inference Reality

Batch size 1 inference (autoregressive generation) is almost always memory-bound. This is why LLM inference is dominated by weight loading time, not compute. Optimization strategies: quantization (fewer bytes), speculative decoding (more FLOPs per weight load), continuous batching (larger effective batch).

Critical Batch Size

The batch size where an operation transitions from memory-bound to compute-bound:

$$B_{\text{critical}} \approx \frac{\text{AI}_{\text{ridge}} \cdot \text{in} \cdot \text{out} \cdot b}{2 \cdot \text{in} \cdot \text{out} - \text{AI}_{\text{ridge}} \cdot (\text{in} + \text{out}) \cdot b} \qquad (b = \text{bytes per element})$$
Python critical_batch.py
def critical_batch_size(
    in_features: int,
    out_features: int,
    ridge_ai: float,
    dtype_bytes: int = 2
) -> int:
    """Calculate critical batch size for compute-bound operation."""
    
    # From the AI equation:
    #   AI = 2*B*in*out / ((B*(in+out) + in*out) * dtype_bytes)
    # Solve for B when AI = ridge_ai:
    W = in_features * out_features
    
    B_crit = ridge_ai * W * dtype_bytes / (2 * W -
                                           ridge_ai * (in_features + out_features) * dtype_bytes)
    
    return max(1, int(B_crit))

# A100 FP16 Tensor Core ridge point
ridge = 152  # FLOP/Byte

# Common LLM dimensions
print("Critical batch sizes (A100 FP16):")
print(f"  Linear 4096→4096: B={critical_batch_size(4096, 4096, ridge)}")     # ~164
print(f"  Linear 4096→16384: B={critical_batch_size(4096, 16384, ridge)}")  # ~159
print(f"  Linear 16384→4096: B={critical_batch_size(16384, 4096, ridge)}")  # ~159
print(f"  Linear 768→768: B={critical_batch_size(768, 768, ridge)}")        # ~251
print(f"  Linear 768→3072: B={critical_batch_size(768, 3072, ridge)}")      # ~201

14. Precision Effects & Tensor Cores

Modern GPUs support multiple numerical precisions, each with different peak throughput. This creates multiple rooflines on the same diagram.

Peak Compute by Precision

NVIDIA A100 peak throughput by precision:

| Precision | Bits/Element | Peak TFLOPS | Relative to FP32 | Use Case |
|---|---|---|---|---|
| FP64 | 64 | 9.7 | 0.5× | Scientific computing |
| FP32 | 32 | 19.5 | 1× (baseline) | |
| TF32 (Tensor) | 19 (stored as 32) | 156 | 8× | DL training |
| FP16 (Tensor) | 16 | 312 | 16× | DL training/inference |
| BF16 (Tensor) | 16 | 312 | 16× | DL training |
| INT8 (Tensor) | 8 | 624 | 32× | Inference |
| INT4 (Tensor) | 4 | 1248 | 64× | Inference |

Memory Bandwidth Effects

Lower precision also means fewer bytes transferred, affecting arithmetic intensity:

Python precision_analysis.py
def analyze_precision_effect(
    batch: int,
    in_features: int,
    out_features: int,
):
    """Compare AI and performance across precisions."""
    
    # A100 specs
    peak_tflops = {
        'FP32': 19.5, 'TF32': 156, 'FP16': 312,
        'BF16': 312, 'INT8': 624, 'INT4': 1248
    }
    bytes_per_elem = {
        'FP32': 4, 'TF32': 4, 'FP16': 2,
        'BF16': 2, 'INT8': 1, 'INT4': 0.5
    }
    
    bw = 2039  # GB/s HBM
    
    print(f"Linear {in_features}→{out_features}, batch={batch}")
    print(f"{'Precision':>8} {'AI':>8} {'Ridge':>8} {'Bound':>12} {'TFLOPS':>10}")
    print("-" * 50)
    
    flops = 2 * batch * in_features * out_features
    
    for prec in ['FP32', 'TF32', 'FP16', 'INT8', 'INT4']:
        elem_size = bytes_per_elem[prec]
        bytes_total = (batch * in_features + in_features * out_features + 
                      batch * out_features) * elem_size
        
        ai = flops / bytes_total
        ridge = peak_tflops[prec] * 1000 / bw  # FLOP/Byte
        bound = "Compute" if ai > ridge else "Memory"
        
        # Attainable performance
        if ai > ridge:
            perf = peak_tflops[prec]
        else:
            perf = ai * bw / 1000  # TFLOPS
        
        print(f"{prec:>8} {ai:>8.1f} {ridge:>8.1f} {bound:>12} {perf:>10.1f}")

# Example: LLM inference (small batch)
analyze_precision_effect(1, 4096, 4096)

# Output:
# Linear 4096→4096, batch=1
# Precision       AI    Ridge        Bound     TFLOPS
# --------------------------------------------------
#     FP32      0.5      9.6       Memory        1.0
#     TF32      0.5     76.5       Memory        1.0
#     FP16      1.0    153.0       Memory        2.0
#     INT8      2.0    306.0       Memory        4.1
#     INT4      4.0    612.1       Memory        8.2
#
# Key insight: Lower precision → higher AI → closer to the compute roof!

Multi-Precision Roofline

Multi-Precision Roofline (A100)
[Figure: a single HBM bandwidth line (2,039 GB/s) intersecting compute ceilings at FP32 (19.5), TF32 (156), FP16 (312), INT8 (624), and INT4 (1,248) TFLOPS, giving ridge points of roughly 9.6, 76.5, 153, 306, and 612 FLOP/Byte.]
Lower precision pushes the ridge point right, making more operations compute-bound
Quantization as Optimization

Quantization isn't just about model size: it fundamentally changes your position on the roofline. Quantizing a memory-bound FP32 kernel to INT8 moves 4× fewer bytes and raises the available compute ceiling 32×, so the same operation lands much closer to the compute roof.

15. Extended Roofline Models

The classic roofline focuses on compute vs memory bandwidth. Extended models add other limiting factors.

Energy Roofline

For mobile/edge deployment, energy efficiency matters as much as raw performance:

$$\text{Energy Efficiency} = \frac{\text{FLOPs}}{\text{Joules}} = \frac{\text{Performance}}{\text{Power}}$$
Energy Roofline Model
[Figure: energy roofline with a sloped memory-energy region and a flat compute-energy limit meeting at an energy ridge of roughly 67 FLOP/Byte, using E_total = E_compute + E_memory with E_compute ≈ 0.3 pJ/FLOP (FP16) and E_memory ≈ 20 pJ/Byte (HBM).]
Memory operations cost ~60× more energy per equivalent compute
Python energy_analysis.py
def energy_roofline(flops: int, bytes_transferred: int):
    """Calculate energy efficiency using energy roofline."""
    
    # Energy costs (approximate for modern GPUs)
    E_FLOP = 0.3e-12    # Joules per FP16 FLOP
    E_BYTE = 20e-12     # Joules per byte (HBM)
    
    energy_compute = flops * E_FLOP
    energy_memory = bytes_transferred * E_BYTE
    total_energy = energy_compute + energy_memory
    
    efficiency = flops / total_energy / 1e9  # GFLOPS/Watt
    
    # Which bound dominates?
    ai = flops / bytes_transferred
    energy_ridge = E_BYTE / E_FLOP  # ~67 FLOP/Byte for these params
    
    if ai < energy_ridge:
        bound = "Memory-energy bound"
    else:
        bound = "Compute-energy bound"
    
    print(f"Energy Efficiency: {efficiency:.1f} GFLOPS/W ({bound})")
    return efficiency

# GEMM 4K×4K×4K FP16
flops = 2 * 4096**3  # 137B FLOPs
bytes_mem = 3 * 4096**2 * 2  # 100 MB
energy_roofline(flops, bytes_mem)  # ~3000 GFLOPS/W (compute-bound)

# Attention softmax
flops_softmax = 5 * 2048**2  # 21M FLOPs
bytes_softmax = 2 * 2048**2 * 2  # 16 MB
energy_roofline(flops_softmax, bytes_softmax)  # ~60 GFLOPS/W (memory-bound)

Communication Roofline

For distributed training, network bandwidth becomes another ceiling:

Communication Roofline (Multi-GPU)
[Figure: communication roofline with the 312 TFLOPS compute ceiling and bandwidth ceilings for NVLink (600 GB/s), PCIe (64 GB/s), and 100 GbE; communication intensity (FLOPs per byte communicated) shrinks as the per-GPU share of work shrinks, so adding GPUs pushes data-parallel training toward the communication ceilings.]
At scale, network bandwidth often becomes the bottleneck before GPU compute
Scaling Limits

Data parallelism becomes communication-bound when: $\frac{2 \cdot \text{FLOPs}_{\text{forward}}}{N \cdot \text{ParamBytes}} < \text{CI}_{\text{ridge}}$. This is why large-scale training uses model parallelism, pipeline parallelism, and gradient compression.
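A minimal sketch of that check for data-parallel training (the ~6 FLOPs per parameter per token for forward plus backward and the 2× gradient bytes for a ring all-reduce are common rule-of-thumb estimates; the model and batch sizes below are illustrative):

Python comm_intensity.py
def data_parallel_comm_intensity(params: float, tokens_per_gpu: int,
                                 grad_bytes_per_param: int = 2) -> float:
    """FLOPs per byte of gradient traffic for one data-parallel training step."""
    step_flops = 6 * params * tokens_per_gpu        # ~6 FLOPs/param/token, fwd + bwd
    comm_bytes = 2 * params * grad_bytes_per_param  # ring all-reduce, per GPU
    return step_flops / comm_bytes

# Illustrative 7B-parameter model with FP16 gradients
for tokens in (512, 4096, 32768):
    ci = data_parallel_comm_intensity(7e9, tokens)
    print(f"{tokens:>6} tokens/GPU: CI = {ci:,.0f} FLOP/Byte")

# CI = 1.5 * tokens_per_gpu: shrinking the per-GPU batch (i.e. adding GPUs at
# fixed global batch) pushes training toward the interconnect ceilings above.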

16. Optimization Trajectories

Different optimizations move you in different directions on the roofline:

Optimization Strategies on the Roofline
[Figure: a baseline kernel sitting below the roof, with arrows for five moves: (1) better bandwidth utilization (straight up), (2) kernel fusion (up and to the right), (3) algorithmic changes such as tiling (to the right), (4) quantization (up toward a higher ceiling), (5) compute optimizations (up to the compute roof).]
Choose optimizations based on your current position relative to the roofline
| Optimization | Direction | When to Use | Examples |
|---|---|---|---|
| Better BW Utilization | ↑ Vertical | Memory-bound, below the roof | Coalesced access, prefetching |
| Kernel Fusion | ↗ Diagonal | Memory-bound, multiple kernels | FlashAttention, fused LayerNorm |
| Algorithmic (Tiling) | → Horizontal | Memory-bound, data reuse possible | Blocked GEMM, im2col |
| Quantization | ↗ To a higher ceiling | Memory-bound, accuracy permits | INT8, INT4, FP8 |
| Compute Opts | ↑ To the ceiling | Compute-bound | Tensor Cores, loop unrolling |
| Parallelism | ↑ Higher roof | At the ceiling, need more | Multi-GPU, model parallelism |
Python optimization_decision.py
def recommend_optimization(
    current_perf: float,      # GFLOP/s
    arithmetic_intensity: float,  # FLOP/Byte
    peak_compute: float,     # GFLOP/s
    peak_bandwidth: float,   # GB/s
) -> str:
    """Recommend optimization based on roofline position."""
    
    ridge_point = peak_compute / peak_bandwidth
    roof_perf = min(peak_compute, arithmetic_intensity * peak_bandwidth)
    efficiency = current_perf / roof_perf
    
    if arithmetic_intensity < ridge_point:
        # Memory-bound region
        if efficiency < 0.5:
            return """Memory-bound, low efficiency:
  1. Fix memory access patterns (coalescing)
  2. Use faster memory tier (L2 cache)
  3. Profile for bank conflicts"""
        elif efficiency < 0.8:
            return """Memory-bound, moderate efficiency:
  1. Try kernel fusion to increase AI
  2. Use tiling for data reuse
  3. Consider quantization"""
        else:
            return """Memory-bound, high efficiency:
  1. Algorithm change needed for more AI
  2. Quantization is the main lever
  3. Accept memory-bound nature"""
    else:
        # Compute-bound region
        if efficiency < 0.7:
            return """Compute-bound, low efficiency:
  1. Use Tensor Cores (if not already)
  2. Increase occupancy
  3. Reduce register pressure"""
        else:
            return """Compute-bound, high efficiency:
  1. Already well-optimized!
  2. Scale with more GPUs
  3. Use lower precision if acceptable"""

# Example usage
print(recommend_optimization(
    current_perf=150,          # 150 GFLOP/s observed
    arithmetic_intensity=2.0, # 2 FLOP/Byte
    peak_compute=312000,       # A100 FP16
    peak_bandwidth=2039        # A100 HBM
))

17. Case Studies

ResNet-50 Inference

ResNet-50 Layer Analysis (Batch=1 vs Batch=64)
[Figure: ResNet-50 totals ~8.2 GFLOP per image; at batch=1 it runs in ~8 ms (~1.0 TFLOPS, ~0.3% compute utilization), at batch=64 in ~12 ms (~44 TFLOPS, ~14% utilization), as batching moves its layers up and to the right on the roofline.]
Batching transforms ResNet from memory-bound to compute-bound

Transformer Training

Python transformer_analysis.py
def analyze_transformer_block(
    batch: int,
    seq_len: int,
    hidden: int,
    heads: int,
    ffn_mult: int = 4,
    dtype_bytes: int = 2
):
    """Analyze a transformer block's operations."""
    
    head_dim = hidden // heads
    ffn_hidden = hidden * ffn_mult
    
    ops = []
    
    # QKV Projection
    qkv_flops = 2 * batch * seq_len * hidden * 3 * hidden
    qkv_bytes = (batch * seq_len * hidden + 3 * hidden * hidden + 
                 batch * seq_len * 3 * hidden) * dtype_bytes
    ops.append(('QKV Proj', qkv_flops, qkv_bytes))
    
    # Attention (with Flash)
    attn_flops = 4 * batch * heads * seq_len * seq_len * head_dim
    attn_bytes = 4 * batch * seq_len * hidden * dtype_bytes  # Flash: no S×S
    ops.append(('Attention', attn_flops, attn_bytes))
    
    # Output Projection
    out_flops = 2 * batch * seq_len * hidden * hidden
    out_bytes = (batch * seq_len * hidden + hidden * hidden + 
                batch * seq_len * hidden) * dtype_bytes
    ops.append(('Out Proj', out_flops, out_bytes))
    
    # LayerNorm (×2)
    ln_flops = 2 * 8 * batch * seq_len * hidden
    ln_bytes = 2 * 2 * batch * seq_len * hidden * dtype_bytes
    ops.append(('LayerNorm', ln_flops, ln_bytes))
    
    # FFN (up + down)
    ffn_flops = 2 * batch * seq_len * hidden * ffn_hidden * 2
    ffn_bytes = (batch * seq_len * hidden + hidden * ffn_hidden + 
                batch * seq_len * ffn_hidden + ffn_hidden * hidden + 
                batch * seq_len * hidden) * dtype_bytes
    ops.append(('FFN', ffn_flops, ffn_bytes))
    
    # Activations (GELU, residuals)
    act_flops = 15 * batch * seq_len * ffn_hidden
    act_bytes = 4 * batch * seq_len * ffn_hidden * dtype_bytes
    ops.append(('Activations', act_flops, act_bytes))
    
    print(f"Transformer Block: B={batch}, S={seq_len}, H={hidden}")
    print(f"{'Operation':>12} {'GFLOP':>10} {'GB':>8} {'AI':>8} {'Bound':>10}")
    print("-" * 52)
    
    ridge = 153  # A100 FP16 TC
    total_flops = 0
    total_bytes = 0
    
    for name, flops, bytes_mem in ops:
        ai = flops / bytes_mem
        bound = "Compute" if ai > ridge else "Memory"
        print(f"{name:>12} {flops/1e9:>10.1f} {bytes_mem/1e9:>8.3f} {ai:>8.1f} {bound:>10}")
        total_flops += flops
        total_bytes += bytes_mem
    
    print("-" * 52)
    total_ai = total_flops / total_bytes
    print(f"{'TOTAL':>12} {total_flops/1e9:>10.1f} {total_bytes/1e9:>8.3f} {total_ai:>8.1f}")

# GPT-2 style block
analyze_transformer_block(batch=32, seq_len=1024, hidden=768, heads=12)

# Output:
# Transformer Block: B=32, S=1024, H=768
#    Operation      GFLOP       GB       AI      Bound
# ----------------------------------------------------
#     QKV Proj      116.0    0.205    566.0    Compute
#    Attention      103.1    0.201    512.0    Compute
#     Out Proj       38.7    0.102    379.6    Compute
#    LayerNorm        0.4    0.201      2.0     Memory
#          FFN      309.2    0.311    993.0    Compute
#  Activations        1.5    0.805      1.8     Memory
# ----------------------------------------------------
#        TOTAL      568.8    1.826    311.5
Training vs Inference

Training has 3× the FLOPs (forward + backward + optimizer) but similar memory access per parameter. This naturally increases AI, making training more compute-bound than inference.

LLM Inference

LLM inference has two distinct phases with very different roofline characteristics:

LLM Inference: Prefill vs Decode
[Figure: decode sits at AI ≈ 1-2, far left of the FP16 ridge (~153) and firmly memory-bound; prefill sits at AI ≈ 100-500 and is compute-bound, because the whole prompt is processed in parallel while decode handles one token at a time.]
Decode phase dominates latency because it's severely memory-bound
Python llm_inference_analysis.py
def analyze_llm_inference(
    model_params_B: float,    # Billions of parameters
    prompt_tokens: int,
    gen_tokens: int,
    precision_bytes: int = 2,  # FP16
    peak_tflops: float = 312,  # A100
    bw_gb_s: float = 2039      # A100 HBM
):
    """Analyze LLM inference performance breakdown."""
    
    params = model_params_B * 1e9
    param_bytes = params * precision_bytes
    
    # ===== PREFILL PHASE =====
    # All tokens processed in parallel (like batch = prompt_tokens)
    prefill_flops = 2 * params * prompt_tokens  # 2 FLOPs per param per token
    prefill_bytes = param_bytes + prompt_tokens * 4096 * precision_bytes  # Weights + activations
    prefill_ai = prefill_flops / prefill_bytes
    
    ridge = peak_tflops * 1000 / bw_gb_s
    
    if prefill_ai > ridge:
        prefill_time = prefill_flops / (peak_tflops * 1e12)
    else:
        prefill_time = prefill_bytes / (bw_gb_s * 1e9)
    
    # ===== DECODE PHASE =====
    # One token at a time (batch = 1)
    decode_flops_per_token = 2 * params
    decode_bytes_per_token = param_bytes  # Must load ALL weights for each token!
    decode_ai = decode_flops_per_token / decode_bytes_per_token  # Always ~1!
    
    # Decode is always memory-bound
    decode_time_per_token = decode_bytes_per_token / (bw_gb_s * 1e9)
    total_decode_time = decode_time_per_token * gen_tokens
    
    tokens_per_sec = 1 / decode_time_per_token
    
    print(f"===== LLM Inference Analysis =====")
    print(f"Model: {model_params_B}B params ({param_bytes/1e9:.1f} GB)")
    print(f"Prompt: {prompt_tokens} tokens, Generate: {gen_tokens} tokens")
    print()
    print(f"PREFILL:")
    print(f"  AI: {prefill_ai:.0f} FLOP/B → {'Compute' if prefill_ai > ridge else 'Memory'}-bound")
    print(f"  Time: {prefill_time*1000:.1f} ms")
    print()
    print(f"DECODE:")
    print(f"  AI: {decode_ai:.1f} FLOP/B → Memory-bound (ALWAYS!)")
    print(f"  Per-token: {decode_time_per_token*1000:.1f} ms")
    print(f"  Total: {total_decode_time*1000:.0f} ms")
    print(f"  Throughput: {tokens_per_sec:.0f} tokens/sec")
    print()
    print(f"TOTAL: {(prefill_time + total_decode_time)*1000:.0f} ms")
    print(f"Decode dominates: {total_decode_time/(prefill_time+total_decode_time)*100:.0f}%")

# Llama-2 7B example
analyze_llm_inference(
    model_params_B=7,
    prompt_tokens=512,
    gen_tokens=256
)

# Output:
# ===== LLM Inference Analysis =====
# Model: 7B params (14.0 GB)
# Prompt: 512 tokens, Generate: 256 tokens
#
# PREFILL:
#   AI: 512 FLOP/B → Compute-bound
#   Time: 22.9 ms
#
# DECODE:
#   AI: 1.0 FLOP/B → Memory-bound (ALWAYS!)
#   Per-token: 6.9 ms
#   Total: 1759 ms
#   Throughput: 146 tokens/sec
#
# TOTAL: 1782 ms
# Decode dominates: 99%
LLM Inference Optimization

Decode is the bottleneck. Key optimizations target this phase: Quantization (fewer bytes to load), Speculative Decoding (more FLOPs per weight load), Continuous Batching (higher effective batch), KV Cache Compression (less memory), Model Parallelism (aggregate bandwidth).
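A minimal sketch of the first lever, quantization, applied to the decode bound from the analysis above: memory-bound decode throughput is roughly bandwidth divided by the bytes of weights loaded per token (KV-cache traffic ignored for simplicity; the 7B model is the same illustrative example as before):

Python decode_quantization.py
def decode_tokens_per_sec(params: float, bytes_per_param: float,
                          bw_gbs: float = 2_039) -> float:
    """Memory-bound decode rate: one full pass over the weights per token."""
    return bw_gbs * 1e9 / (params * bytes_per_param)

for name, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: ~{decode_tokens_per_sec(7e9, bytes_per_param):.0f} tokens/sec")

# FP16: ~146, INT8: ~291, INT4: ~583 -- halving the bytes per weight roughly
# doubles memory-bound decode throughput (model quality permitting).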

18. Conclusion & Key Takeaways

Essential Roofline Insights
  1. Every kernel has two possible bottlenecks: compute or memory bandwidth. The roofline tells you which one applies.
  2. Arithmetic Intensity (AI) determines your bound: AI < ridge → memory-bound, AI > ridge → compute-bound.
  3. Most DL operations are memory-bound: Only large GEMMs with sufficient batch size hit compute ceilings.
  4. Batch size is the primary lever: Increasing batch moves operations from memory-bound to compute-bound.
  5. Quantization helps both ways: Fewer bytes transferred AND higher compute throughput.
  6. Kernel fusion increases effective AI: FlashAttention eliminates the memory-bound softmax bottleneck.
  7. LLM decode is always memory-bound: ~1 FLOP/Byte regardless of model size. Speed = bandwidth.
  8. Profile before optimizing: Know your position on the roofline before choosing an optimization strategy.
Roofline Decision Framework
The framework: measure AI and achieved performance, then ask whether AI is above the ridge (~10 for FP32, ~150 for FP16 Tensor Cores). If not, the kernel is memory-bound; focus on bandwidth and on raising AI: increase batch size, fuse kernels, quantize to move fewer bytes, improve coalescing, exploit faster cache levels, tile for reuse. If AI is above the ridge, the kernel is compute-bound; focus on Tensor Cores and ILP: raise occupancy, use lower precision, unroll loops, scale to more GPUs, or switch to a better algorithm.
Use this framework to guide every GPU optimization effort

The roofline model transforms GPU optimization from guesswork into a systematic process. By understanding your workload's position on the roofline, you can:

  1. Predict the best-case performance of a kernel before writing it.
  2. Tell whether compute throughput or memory bandwidth is the binding limit.
  3. Pick the optimization (fusion, tiling, quantization, batching, Tensor Cores) that actually moves you toward the roof.
  4. Recognize when a kernel is already as fast as the hardware allows.

Master the roofline, and you'll write faster GPU code in less time.