- Introduction & History
- The Roofline Model Fundamentals
- Computing Peak Performance
- Computing Peak Bandwidth
- Building Your Roofline
- Worked Examples
- Achievable vs Theoretical Performance
- Cache-Aware Roofline Analysis
- Operational vs Arithmetic Intensity
- Profiling Tools for Roofline
- BLAS Operations on the Roofline
- Deep Learning Operations
- Batch Size Effects
- Precision Effects & Tensor Cores
- Extended Roofline Models
- Optimization Trajectories
- Case Studies
- Conclusion & Key Takeaways
1. Introduction & History
The Roofline Model, introduced by Samuel Williams, Andrew Waterman, and David Patterson at UC Berkeley in 2009, provides a visual framework to understand the performance limits of a kernel on a given architecture.
The core insight: every computation is limited by either compute throughput or memory bandwidth - never both simultaneously. The roofline diagram shows you which limit applies and by how much.
Before roofline, developers would optimize blindly - adding more compute optimizations to a memory-bound kernel, or vice versa. The roofline model tells you which optimization will actually help before you write any code.
Key terminology:
- Roofline: The performance ceiling - you cannot exceed this line
- Ridge Point: Where memory-bound transitions to compute-bound
- Arithmetic Intensity (AI): FLOP/Byte - operations per byte transferred
2. The Roofline Model Fundamentals
Arithmetic Intensity
Arithmetic Intensity (AI) is the ratio of floating-point operations to bytes moved from memory:

$$\text{AI} = \frac{\text{FLOPs}}{\text{Bytes transferred}} \quad [\text{FLOP/Byte}]$$

This single metric determines whether your kernel is memory-bound or compute-bound: kernels with AI below the ridge point are memory-bound, kernels above it are compute-bound.
The Roofline Equation
The attainable performance is bounded by:

$$P = \min\left(P_{\text{peak}},\ \text{BW}_{\text{peak}} \times \text{AI}\right)$$
Where:
- $P_{\text{peak}}$ = Peak compute performance (GFLOP/s)
- $\text{BW}_{\text{peak}}$ = Peak memory bandwidth (GB/s)
- $\text{AI}$ = Arithmetic Intensity (FLOP/Byte)
On a log-log plot, this creates two lines:
- Bandwidth ceiling: $\log(P) = \log(\text{BW}) + \log(\text{AI})$ → slope of 1
- Compute ceiling: $\log(P) = \log(P_{\text{peak}})$ → horizontal line
Compute-Bound vs Memory-Bound
The ridge point is where the two ceilings meet:
For example, on an NVIDIA A100 80GB:
- Peak FP32: 19.5 TFLOP/s = 19,500 GFLOP/s
- HBM2e bandwidth: 2,039 GB/s
- Ridge point: 19,500 / 2,039 ≈ 9.6 FLOP/Byte
Kernels with AI < 9.6 are memory-bound, AI > 9.6 are compute-bound.
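This classification can be checked in a couple of lines; a minimal sketch using the A100 numbers above:
def roofline_bound(ai_flop_per_byte: float,
                   peak_gflops: float = 19_500,   # A100 FP32
                   peak_bw_gbs: float = 2_039):   # A100 HBM2e
    """Classify a kernel and cap its attainable performance (GFLOP/s)."""
    ridge = peak_gflops / peak_bw_gbs
    attainable = min(peak_gflops, peak_bw_gbs * ai_flop_per_byte)
    bound = "memory-bound" if ai_flop_per_byte < ridge else "compute-bound"
    return bound, attainable

print(roofline_bound(0.25))   # ('memory-bound', 509.75)
print(roofline_bound(100.0))  # ('compute-bound', 19500)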
3. Computing Peak Performance
FLOPS Calculation
Peak theoretical FLOPS depends on:
- Number of execution units (SMs × cores per SM)
- Clock frequency (boost clock)
- FLOPs per core per cycle (2 for a fused multiply-add)
For GPUs, the calculation differs by unit type:
def calculate_peak_flops(gpu: str) -> dict:
"""Calculate theoretical peak FLOPS for common GPUs."""
specs = {
"A100_80GB": {
"sm_count": 108,
"fp32_cores_per_sm": 64,
"fp64_cores_per_sm": 32,
"tensor_cores_per_sm": 4,
"clock_ghz": 1.41, # Boost clock
},
"H100_SXM": {
"sm_count": 132,
"fp32_cores_per_sm": 128,
"fp64_cores_per_sm": 64,
"tensor_cores_per_sm": 4,
"clock_ghz": 1.83,
},
}
s = specs[gpu]
# CUDA Cores: 2 FLOP/cycle (FMA = multiply + add)
fp32_peak = s["sm_count"] * s["fp32_cores_per_sm"] * s["clock_ghz"] * 2 * 1000
fp64_peak = s["sm_count"] * s["fp64_cores_per_sm"] * s["clock_ghz"] * 2 * 1000
return {
"FP32 (TFLOPS)": fp32_peak / 1000,
"FP64 (TFLOPS)": fp64_peak / 1000,
}
# A100 Example:
# FP32: 108 SM × 64 cores × 1.41 GHz × 2 FLOP/cycle = 19.5 TFLOPS
# FP64: 108 SM × 32 cores × 1.41 GHz × 2 FLOP/cycle = 9.7 TFLOPS
Precision Effects (FP64 → INT8)
Different precisions have vastly different peak throughputs:
| Precision | A100 (CUDA) | A100 (Tensor) | H100 (Tensor) | Ratio vs FP32 |
|---|---|---|---|---|
| FP64 | 9.7 TFLOPS | 19.5 TFLOPS | 67 TFLOPS | 0.5× |
| FP32 | 19.5 TFLOPS | 156 TFLOPS* | 989 TFLOPS* | 1× |
| TF32 | — | 156 TFLOPS | 989 TFLOPS | 8× |
| FP16/BF16 | 39 TFLOPS | 312 TFLOPS | 1,979 TFLOPS | 16× |
| INT8 | — | 624 TOPS | 3,958 TOPS | 32× |

*TF32 uses FP32 inputs but computes with reduced mantissa precision on Tensor Cores. Note: the H100 Tensor Core figures for TF32/FP16/INT8 are NVIDIA's quoted values with 2:4 structured sparsity; dense throughput is half of these (e.g., 989 TFLOPS FP16), which is what the Tensor Core calculation below produces.
Tensor Cores Peak
Tensor Cores perform matrix multiply-accumulate on small tiles:
def tensor_core_peak(gpu: str, precision: str) -> float:
"""
Tensor Core peak calculation.
A100 Tensor Core: 256 FP16 FMA ops/cycle per TC
4 TCs per SM × 108 SMs × 1.41 GHz × 256 × 2 = 312 TFLOPS
"""
tc_specs = {
"A100": {
"tcs_per_sm": 4,
"sm_count": 108,
"clock_ghz": 1.41,
"fp16_ops_per_tc_per_cycle": 256, # 16×16×16 MMA
},
"H100": {
"tcs_per_sm": 4,
"sm_count": 132,
"clock_ghz": 1.83,
"fp16_ops_per_tc_per_cycle": 512, # Larger MMA
},
}
s = tc_specs[gpu]
# Base FP16 calculation
fp16_tflops = (s["tcs_per_sm"] * s["sm_count"] * s["clock_ghz"]
* s["fp16_ops_per_tc_per_cycle"] * 2 / 1000)
# Precision multipliers
multipliers = {
"FP16": 1.0,
"BF16": 1.0,
"TF32": 0.5,
"FP64": 0.0625, # 1/16
"INT8": 2.0,
}
return fp16_tflops * multipliers[precision]
4. Computing Peak Bandwidth
Memory Hierarchy
Modern GPUs have multiple memory levels (L1/shared memory, L2 cache, HBM), each with different bandwidth; the levels are compared side by side after the HBM calculation below.
Theoretical Bandwidth
HBM bandwidth is calculated from memory specifications:
def calculate_hbm_bandwidth(gpu: str) -> float:
"""Calculate theoretical HBM bandwidth."""
hbm_specs = {
"A100_80GB": {
# HBM2e specifications
"memory_clock_mhz": 1593, # Effective clock
"bus_width_bits": 5120, # 5 stacks × 1024 bits
"data_rate": 2, # DDR = 2 transfers/cycle
},
"H100_SXM": {
# HBM3 specifications
"memory_clock_mhz": 2619,
"bus_width_bits": 5120,
"data_rate": 2,
},
}
s = hbm_specs[gpu]
# BW (GB/s) = Clock (MHz) × Width (bits) × DataRate / 8 / 1000
bw_gbps = (s["memory_clock_mhz"] * s["bus_width_bits"]
* s["data_rate"] / 8 / 1000)
return bw_gbps
# A100 80GB:
# 1593 MHz × 5120 bits × 2 / 8 / 1000 = 2,039 GB/s
# H100 SXM:
# 2619 MHz × 5120 bits × 2 / 8 / 1000 = 3,352 GB/s
Putting the levels side by side (cache bandwidths are approximate):

| Memory Level | A100 80GB | H100 SXM |
|---|---|---|
| HBM | 2,039 GB/s | 3,352 GB/s |
| L2 Cache | ~5,000 GB/s | ~10,000 GB/s |
| L1/Shared | ~19,000 GB/s | ~30,000 GB/s |
Multi-Ceiling Rooflines
Since each memory level has different bandwidth, we can draw multiple rooflines:
A kernel that's memory-bound at the HBM level might become compute-bound if data fits in L2 cache. This is why tiling is so important - it moves your working set to faster memory levels.
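A small sketch makes this concrete: each level gets its own ridge point (the cache bandwidths below are the approximate A100 figures used elsewhere in this article), so the same AI can fall on different sides of the ridge depending on where the working set lives.
# Approximate A100 bandwidths per memory level (GB/s); assumed figures
LEVEL_BW = {"HBM": 2_039, "L2": 5_000, "L1/Shared": 19_000}
PEAK_FP32_GFLOPS = 19_500

for level, bw in LEVEL_BW.items():
    ridge = PEAK_FP32_GFLOPS / bw
    print(f"{level:>10}: ridge = {ridge:.2f} FLOP/Byte")
# HBM: 9.56, L2: 3.90, L1/Shared: 1.03 -- a kernel with AI ≈ 5 is
# memory-bound when served from HBM but compute-bound once its tiles live in L2.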
5. Building Your Roofline
To create a roofline for your GPU, you need:
- Peak Compute: From specs or microbenchmark
- Peak Bandwidth: From specs or STREAM benchmark
- Ridge Point: Compute / Bandwidth
import numpy as np
import matplotlib.pyplot as plt
def build_roofline(
peak_gflops: float,
peak_bw_gbs: float,
ai_range: tuple = (0.01, 1000),
):
"""
Build and plot a roofline model.
Args:
peak_gflops: Peak compute performance in GFLOP/s
peak_bw_gbs: Peak memory bandwidth in GB/s
ai_range: (min, max) arithmetic intensity range
"""
# Ridge point
ridge_ai = peak_gflops / peak_bw_gbs
print(f"Ridge Point: {ridge_ai:.2f} FLOP/Byte")
# Generate AI values (log scale)
ai = np.logspace(np.log10(ai_range[0]),
np.log10(ai_range[1]), 1000)
# Roofline equation: P = min(peak, BW × AI)
performance = np.minimum(peak_gflops, peak_bw_gbs * ai)
# Plot
plt.figure(figsize=(10, 7))
plt.loglog(ai, performance, 'b-', linewidth=2.5, label='Roofline')
# Mark ridge point
plt.axvline(x=ridge_ai, color='purple', linestyle='--', alpha=0.7)
plt.scatter([ridge_ai], [peak_gflops], color='purple', s=100,
zorder=5, label=f'Ridge ({ridge_ai:.1f} FLOP/B)')
# Labels
plt.xlabel('Arithmetic Intensity (FLOP/Byte)', fontsize=12)
plt.ylabel('Performance (GFLOP/s)', fontsize=12)
plt.title('Roofline Model', fontsize=14)
# Add region annotations
plt.text(ridge_ai / 10, peak_gflops / 5, 'Memory\nBound',
fontsize=12, color='blue', ha='center')
plt.text(ridge_ai * 10, peak_gflops / 2, 'Compute\nBound',
fontsize=12, color='red', ha='center')
plt.grid(True, which='both', alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()
return ridge_ai
# Example: A100 FP32
ridge = build_roofline(
peak_gflops=19500, # 19.5 TFLOPS
peak_bw_gbs=2039, # 2039 GB/s
)
# Output: Ridge Point: 9.56 FLOP/Byte
6. Worked Examples
Example 1: Vector Scaling (SAXPY)
SAXPY: $y = \alpha x + y$
def analyze_saxpy(n: int, dtype_bytes: int = 4):
"""
SAXPY: y = α * x + y
For each element:
- Read: x[i], y[i] → 2 reads
- Compute: multiply, add → 2 FLOPs
- Write: y[i] → 1 write
"""
# FLOPs: 2 per element (1 mul + 1 add)
flops = 2 * n
# Bytes: read x, read y, write y = 3 × n × sizeof(dtype)
bytes_transferred = 3 * n * dtype_bytes
# Arithmetic Intensity
ai = flops / bytes_transferred
print(f"SAXPY Analysis (n={n:,}, FP32)")
print(f" FLOPs: {flops:,}")
print(f" Bytes: {bytes_transferred:,}")
print(f" AI: {ai:.4f} FLOP/Byte")
# A100 Analysis
a100_peak = 19500 # GFLOP/s
a100_bw = 2039 # GB/s
a100_ridge = a100_peak / a100_bw
print(f"\n A100 Ridge: {a100_ridge:.2f} FLOP/Byte")
print(f" Status: {'Memory-bound' if ai < a100_ridge else 'Compute-bound'}")
# Expected performance
expected_gflops = min(a100_peak, a100_bw * ai)
print(f" Expected perf: {expected_gflops:.1f} GFLOP/s")
print(f" Efficiency: {100 * expected_gflops / a100_peak:.1f}% of peak compute")
return ai
analyze_saxpy(100_000_000)
# Output:
# SAXPY Analysis (n=100,000,000, FP32)
# FLOPs: 200,000,000
# Bytes: 1,200,000,000
# AI: 0.1667 FLOP/Byte
#
# A100 Ridge: 9.56 FLOP/Byte
# Status: Memory-bound
# Expected perf: 339.8 GFLOP/s
# Efficiency: 1.7% of peak compute
SAXPY can only achieve 1.7% of the A100's compute capacity! It's so memory-bound that optimizing compute is pointless - you need to reduce memory traffic (e.g., kernel fusion).
Example 2: Matrix Multiplication (GEMM)
GEMM: $C = A \times B$ where $A$ is $M \times K$ and $B$ is $K \times N$
def analyze_gemm(M: int, N: int, K: int, dtype_bytes: int = 4):
"""
GEMM: C = A × B
A: M×K, B: K×N, C: M×N
FLOPs: M × N × (2K - 1) ≈ 2MNK for large K
(K multiplies + K-1 adds per output element)
Bytes (naive, no tiling):
Read A, Read B, Write C = (MK + KN + MN) × dtype_bytes
"""
# FLOPs
flops = 2 * M * N * K
# Bytes (assuming matrices read once from HBM)
bytes_transferred = (M*K + K*N + M*N) * dtype_bytes
# Arithmetic Intensity
ai = flops / bytes_transferred
print(f"GEMM Analysis ({M}×{K} @ {K}×{N}, FP32)")
print(f" FLOPs: {flops:,}")
print(f" Bytes: {bytes_transferred:,}")
print(f" AI: {ai:.2f} FLOP/Byte")
# A100 Analysis
a100_ridge = 9.56
status = 'Compute-bound' if ai > a100_ridge else 'Memory-bound'
print(f" Status: {status}")
return ai
# Small matrices (lower AI, less reuse)
analyze_gemm(128, 128, 128)
# AI: 21.33 FLOP/Byte → Compute-bound
# Larger = higher AI (more reuse)
analyze_gemm(4096, 4096, 4096)
# AI: 682.67 FLOP/Byte → Very compute-bound
# For square FP32 matrices: AI ≈ N/6
# N=128: AI ≈ 21
# N=4096: AI ≈ 683
Example 3: Reduction (Sum)
def analyze_reduction(n: int, dtype_bytes: int = 4):
"""
Reduction: s = sum(x)
FLOPs: n-1 additions
Bytes: n reads + 1 write (negligible)
"""
flops = n - 1
bytes_transferred = n * dtype_bytes # Ignore single output write
ai = flops / bytes_transferred
print(f"Reduction (n={n:,})")
print(f" AI: {ai:.4f} FLOP/Byte")
print(f" Status: Extremely memory-bound")
return ai
analyze_reduction(100_000_000)
# AI: 0.25 FLOP/Byte - Severely memory-bound!
Summary: Common Operations
| Operation | FLOPs | Bytes | AI (FP32) | Status on A100 |
|---|---|---|---|---|
| Vector Add | N | 12N | 0.083 | 🔵 Memory-bound |
| SAXPY | 2N | 12N | 0.167 | 🔵 Memory-bound |
| Dot Product | 2N | 8N | 0.25 | 🔵 Memory-bound |
| Sum Reduction | N | 4N | 0.25 | 🔵 Memory-bound |
| Softmax | 5N | 8N | 0.625 | 🔵 Memory-bound |
| LayerNorm | 8N | 8N | 1.0 | 🔵 Memory-bound |
| GEMM (N=128) | 2N³ | 12N² | ~21 | 🔴 Compute-bound |
| GEMM (N=4096) | 2N³ | 12N² | ~683 | 🔴 Compute-bound |
7. Achievable vs Theoretical Performance
Theoretical peak performance is a ceiling you'll never reach. Real-world achievable performance depends on many factors: memory access patterns, instruction mix, occupancy, and more.
Theoretical Peak: What the hardware could do if perfectly utilized (from specs).
Achievable Peak: What you can actually measure with optimized benchmarks.
The gap between them is typically 10-30% for bandwidth, 5-15% for compute.
STREAM Benchmark
The STREAM benchmark measures achievable memory bandwidth using simple vector operations:
| Kernel | Operation | Bytes/Element | FLOPs/Element |
|---|---|---|---|
| Copy | a[i] = b[i] | 16 (2 × 8 for FP64) | 0 |
| Scale | a[i] = q × b[i] | 16 | 1 |
| Add | a[i] = b[i] + c[i] | 24 | 1 |
| Triad | a[i] = b[i] + q × c[i] | 24 | 2 |
import torch
import time
def stream_triad(n: int, dtype=torch.float64, device="cuda", iterations=100):
"""STREAM Triad: a = b + scalar * c"""
# Allocate arrays
a = torch.zeros(n, dtype=dtype, device=device)
b = torch.ones(n, dtype=dtype, device=device)
c = torch.ones(n, dtype=dtype, device=device)
scalar = 3.0
# Warmup
for _ in range(10):
a = b + scalar * c
torch.cuda.synchronize()
# Timed iterations
start = time.perf_counter()
for _ in range(iterations):
a = b + scalar * c
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
# Calculate bandwidth
bytes_per_elem = a.element_size()
total_bytes = 3 * n * bytes_per_elem * iterations # 2 read + 1 write
bandwidth_gbs = total_bytes / elapsed / 1e9
print(f"STREAM Triad (n={n:,}, {dtype})")
print(f" Measured bandwidth: {bandwidth_gbs:.1f} GB/s")
return bandwidth_gbs
# Run on A100
bw = stream_triad(n=100_000_000) # 100M elements
# Typical output: ~1800-1900 GB/s (vs 2039 theoretical)
# Efficiency: ~88-93%
GEMM Benchmark
For compute-bound operations, the GEMM benchmark measures achievable compute throughput:
import torch
import time
def gemm_benchmark(
M: int, N: int, K: int,
dtype=torch.float16,
device="cuda",
iterations=100
):
"""Benchmark matrix multiplication throughput."""
# Allocate matrices
A = torch.randn(M, K, dtype=dtype, device=device)
B = torch.randn(K, N, dtype=dtype, device=device)
# Warmup (important for Tensor Cores!)
for _ in range(20):
C = torch.mm(A, B)
torch.cuda.synchronize()
# Timed iterations
start = time.perf_counter()
for _ in range(iterations):
C = torch.mm(A, B)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
# Calculate FLOPS
flops_per_gemm = 2 * M * N * K
total_flops = flops_per_gemm * iterations
tflops = total_flops / elapsed / 1e12
print(f"GEMM Benchmark ({M}×{K} @ {K}×{N}, {dtype})")
print(f" Measured: {tflops:.1f} TFLOPS")
return tflops
# Large square matrices for peak throughput
tflops = gemm_benchmark(8192, 8192, 8192, dtype=torch.float16)
# A100 FP16 TC: ~280 TFLOPS (vs 312 dense theoretical) ≈ 90%
# H100 FP16 TC: typically ~85-90% of the 989 TFLOPS dense peak
#   (the often-quoted 1,979 TFLOPS figure includes 2:4 structured sparsity)
# Shape matters for efficiency
gemm_benchmark(1024, 1024, 1024, dtype=torch.float16) # ~200 TFLOPS
gemm_benchmark(256, 256, 256, dtype=torch.float16) # ~50 TFLOPS (poor!)
The Efficiency Gap
Multiple factors contribute to the gap between theoretical and achievable performance: kernels rarely sustain the full boost clock under power and thermal limits, DRAM refresh and ECC consume bandwidth, and instruction issue overhead, synchronization, and imperfect overlap of memory with math all take their share. Build the roofline from measured ceilings instead:
def build_realistic_roofline(gpu: str):
"""Build roofline with achievable (not theoretical) peaks."""
# Achievable peaks from benchmarks
achievable = {
"A100_80GB": {
# Memory bandwidth (STREAM Triad)
"hbm_bw_gbs": 1880, # vs 2039 theoretical
"l2_bw_gbs": 4500, # vs ~5000 theoretical
"l1_bw_gbs": 17000, # vs ~19000 theoretical
# Compute throughput (GEMM benchmark)
"fp32_tflops": 18.5, # vs 19.5 theoretical
"tf32_tflops": 140, # vs 156 theoretical
"fp16_tflops": 285, # vs 312 theoretical
},
"H100_SXM": {
"hbm_bw_gbs": 3040, # vs 3352 theoretical
"l2_bw_gbs": 10000,
"l1_bw_gbs": 30000,
"fp32_tflops": 55,
"tf32_tflops": 850,
"fp16_tflops": 1700,
},
}
a = achievable[gpu]
# Ridge points with achievable values
ridge_fp32 = a["fp32_tflops"] * 1000 / a["hbm_bw_gbs"]
ridge_fp16 = a["fp16_tflops"] * 1000 / a["hbm_bw_gbs"]
print(f"{gpu} Achievable Roofline:")
print(f" FP32 Ridge: {ridge_fp32:.1f} FLOP/B")
print(f" FP16 Ridge: {ridge_fp16:.1f} FLOP/B")
return a
build_realistic_roofline("A100_80GB")
# A100 FP32 Ridge: 9.8 FLOP/B (vs 9.6 theoretical)
# A100 FP16 Ridge: 151.6 FLOP/B (vs 153 theoretical)
8. Cache-Aware Roofline Analysis
The basic roofline assumes all data comes from main memory (HBM). In reality, caches dramatically change the picture.
Hierarchical Roofline Model
The Hierarchical Roofline Model (HRM) accounts for different memory levels:

$$P = \min\left(P_{\text{peak}},\ \min_i \left(\text{BW}_i \times \text{AI}_i\right)\right)$$

Where $i$ iterates over memory levels (L1, L2, HBM) and $\text{AI}_i$ is the arithmetic intensity relative to that level.
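A minimal sketch of that bound, assuming you have measured or estimated the bytes your kernel moves at each level (the helper and its inputs are illustrative, not a library API):
def hierarchical_bound(flops: float, bytes_per_level: dict,
                       bw_per_level: dict, peak_gflops: float) -> float:
    """Attainable GFLOP/s = min over levels of BW_i * AI_i, capped by peak compute."""
    ceilings = [peak_gflops]
    for level, nbytes in bytes_per_level.items():
        ai_i = flops / nbytes                     # AI relative to this level
        ceilings.append(bw_per_level[level] * ai_i)
    return min(ceilings)

# Hypothetical kernel: 1e9 FLOPs, 50 MB from HBM, 400 MB through L2
print(hierarchical_bound(
    flops=1e9,
    bytes_per_level={"HBM": 50e6, "L2": 400e6},
    bw_per_level={"HBM": 2_039, "L2": 5_000},     # GB/s (A100, approximate)
    peak_gflops=19_500,
))  # 12500.0 GFLOP/s: the L2 ceiling binds, not HBM or peak compute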
Data Locality Effects
Data locality determines which bandwidth ceiling limits your kernel:
def analyze_locality(
working_set_bytes: int,
l1_size: int = 192 * 1024, # 192 KB per SM
l2_size: int = 40 * 1024 * 1024, # 40 MB total
):
"""Determine which memory level bounds a kernel."""
if working_set_bytes <= l1_size:
return "L1", 19000 # GB/s
elif working_set_bytes <= l2_size:
return "L2", 5000
else:
return "HBM", 1880
# Example: Matrix multiply tile sizes
for tile_size in [32, 64, 128, 256, 512]:
# 3 tiles for A, B, C
working_set = 3 * tile_size * tile_size * 4 # FP32
level, bw = analyze_locality(working_set)
print(f"Tile {tile_size}×{tile_size}: {working_set//1024} KB → {level} ({bw} GB/s)")
# Output:
# Tile 32×32: 12 KB → L1 (19000 GB/s)
# Tile 64×64: 48 KB → L1 (19000 GB/s)
# Tile 128×128: 192 KB → L1 (19000 GB/s) ← Fits exactly!
# Tile 256×256: 768 KB → L2 (5000 GB/s)
# Tile 512×512: 3072 KB → L2 (5000 GB/s)
Impact of Tiling
Tiling (blocking) is the primary technique to improve data locality and move kernels to faster rooflines:
Tile size should maximize L1/shared memory usage without spilling. For A100: aim for ~192KB working set per SM. This typically means 64×64 to 128×128 tiles for FP32, larger for FP16.
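A quick way to pick a starting tile size under that budget; a sketch assuming three square tiles (A, B, and the C accumulator) must fit in the 192 KB of combined L1/shared memory per SM:
def largest_square_tile(l1_bytes: int = 192 * 1024,
                        dtype_bytes: int = 4,
                        num_tiles: int = 3) -> int:
    """Largest power-of-two tile T with num_tiles * T * T * dtype_bytes <= l1_bytes."""
    t = 1
    while num_tiles * (2 * t) ** 2 * dtype_bytes <= l1_bytes:
        t *= 2
    return t

print(largest_square_tile(dtype_bytes=4))  # 128 for FP32 (3 * 128 * 128 * 4 = 192 KB)
print(largest_square_tile(dtype_bytes=2))  # also 128: the next power of two (256) would need 384 KB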
9. Operational vs Arithmetic Intensity
There's an important distinction between two related metrics:
| Metric | Definition | What It Measures | Use Case |
|---|---|---|---|
| Arithmetic Intensity | FLOPs / Bytes from algorithm spec | Inherent algorithm characteristic | Algorithm selection |
| Operational Intensity | FLOPs / Bytes actually transferred | Real cache behavior | Implementation optimization |
def compare_intensities(M: int, N: int, K: int, tile_size: int):
    """Compare arithmetic vs operational intensity for tiled FP32 GEMM."""
    dtype_bytes = 4
    flops = 2 * M * N * K
    #----------------------------------------
    # Arithmetic Intensity (from the algorithm)
    # Ideal data movement: each matrix crosses HBM exactly once.
    #----------------------------------------
    bytes_algo = (M*K + K*N + M*N) * dtype_bytes
    ai = flops / bytes_algo
    #----------------------------------------
    # Operational Intensity (what a tiled implementation actually moves)
    # For every tile_size×tile_size tile of C, the kernel streams a
    # tile_size×K panel of A and a K×tile_size panel of B from HBM,
    # so A and B are re-read many times.
    #----------------------------------------
    c_tiles = (M // tile_size) * (N // tile_size)
    hbm_bytes = (c_tiles * 2 * tile_size * K + M * N) * dtype_bytes
    oi_hbm = flops / hbm_bytes
    print(f"GEMM {M}×{K} @ {K}×{N}, tile={tile_size}")
    print(f"  Arithmetic Intensity:        {ai:.1f} FLOP/B")
    print(f"  Operational Intensity (HBM): {oi_hbm:.1f} FLOP/B")
    print(f"  Reuse lost vs. ideal:        {ai/oi_hbm:.1f}×")
compare_intensities(4096, 4096, 4096, tile_size=128)
# Output:
# GEMM 4096×4096 @ 4096×4096, tile=128
#   Arithmetic Intensity:        682.7 FLOP/B
#   Operational Intensity (HBM): 31.5 FLOP/B
#   Reuse lost vs. ideal:        21.7×
# Larger tiles, and keeping panels resident in L2/shared memory, push OI
# back toward AI; for small tiles or poor tiling, OI << AI
AI = OI: Perfect implementation, all cache benefits realized.
AI > OI: Poor implementation, cache misses, redundant loads.
AI < OI: Impossible (can't do fewer operations than algorithm requires).
10. Profiling Tools for Roofline
Modern profiling tools can automatically generate roofline plots:
NVIDIA Nsight Compute
Nsight Compute provides built-in roofline analysis for GPU kernels:
# Profile with roofline metrics
ncu --set roofline -o my_kernel_roofline ./my_application
# Open in GUI
ncu-ui my_kernel_roofline.ncu-rep
# Key sections in report:
# - "Roofline Analysis" - visual plot
# - "Memory Workload Analysis" - bandwidth utilization
# - "Compute Workload Analysis" - FLOP utilization
Intel Advisor
For CPU analysis, Intel Advisor provides comprehensive roofline support:
# Collect roofline data
advisor --collect=roofline --project-dir=./advisor_results -- ./my_application
# Generate report
advisor --report=roofline --project-dir=./advisor_results
# Features:
# - Automatic AI calculation per loop
# - Cache-aware roofline (L1/L2/L3/DRAM)
# - Vectorization efficiency overlay
PyTorch Profiler
import torch
from torch.profiler import profile, ProfilerActivity
def profile_for_roofline(model, input_data, device="cuda"):
"""Profile model for roofline analysis."""
model = model.to(device)
input_data = input_data.to(device)
# Warmup
for _ in range(10):
_ = model(input_data)
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
with_flops=True,
profile_memory=True,
record_shapes=True,
) as prof:
_ = model(input_data)
# Print breakdown
print(prof.key_averages().table(
sort_by="cuda_time_total",
row_limit=20
))
# Export for visualization
prof.export_chrome_trace("trace.json")
    # Rough per-kernel AI estimate: cuda_memory_usage counts allocated bytes,
    # not bytes actually moved over HBM, so treat the ratio only as a proxy
for event in prof.key_averages():
if event.flops > 0 and event.cuda_memory_usage > 0:
ai = event.flops / event.cuda_memory_usage
print(f"{event.key}: AI = {ai:.2f} FLOP/B")
# Example usage
model = torch.nn.Linear(4096, 4096)
input_data = torch.randn(32, 4096)
profile_for_roofline(model, input_data)
11. BLAS Operations on the Roofline
The Basic Linear Algebra Subprograms (BLAS) are organized into three levels based on their arithmetic intensity:
Level 1: Vector Operations
Level 1 BLAS performs vector-vector operations: $O(n)$ data, $O(n)$ FLOPs → AI ≈ constant
| Operation | Definition | FLOPs | Bytes (FP32) | AI |
|---|---|---|---|---|
| SCOPY | y ← x | 0 | 8n | 0 |
| SSCAL | x ← αx | n | 8n | 0.125 |
| SAXPY | y ← αx + y | 2n | 12n | 0.167 |
| SDOT | s ← x·y | 2n | 8n | 0.25 |
| SNRM2 | s ← ‖x‖₂ | 2n | 4n | 0.5 |
| SASUM | s ← Σ\|xᵢ\| | n | 4n | 0.25 |
All Level 1 BLAS operations are severely memory-bound. On an A100 (ridge point ~10 FLOP/B), they achieve only 1-5% of peak compute. Optimization focus: minimize memory traffic, maximize bandwidth utilization.
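Plugging the AI column into the roofline equation makes the 1-5% figure concrete; a sketch with the A100 FP32 numbers used earlier:
A100_PEAK_GFLOPS = 19_500
A100_BW_GBS = 2_039

level1_ai = {"SSCAL": 0.125, "SAXPY": 0.167, "SDOT": 0.25, "SNRM2": 0.5}
for op, ai in level1_ai.items():
    attainable = min(A100_PEAK_GFLOPS, A100_BW_GBS * ai)
    print(f"{op}: {attainable:7.1f} GFLOP/s "
          f"({100 * attainable / A100_PEAK_GFLOPS:.1f}% of peak)")
# SSCAL:  254.9 GFLOP/s (1.3%), SAXPY:  340.5 (1.7%),
# SDOT:   509.8 (2.6%),         SNRM2: 1019.5 (5.2%)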
Level 2: Matrix-Vector Operations
Level 2 BLAS performs matrix-vector operations: $O(n^2)$ data, $O(n^2)$ FLOPs → AI ≈ constant
| Operation | Definition | FLOPs | Bytes (FP32) | AI |
|---|---|---|---|---|
| SGEMV | y ← αAx + βy | 2mn | 4(mn + m + n) | ≈ 0.5 |
| SSYMV | y ← αAx + βy (A symmetric) | 2n² | 4(n²/2 + 2n) | ≈ 1.0 |
| STRMV | x ← Ax (A triangular) | n² | 4(n²/2 + n) | ≈ 0.5 |
| SGER | A ← αxyᵀ + A | 2mn | 4(2mn + m + n) | ≈ 0.25 |
def analyze_gemv(m: int, n: int, dtype_bytes: int = 4):
"""GEMV: y = A @ x where A is m×n."""
flops = 2 * m * n # Each output: n muls + n adds
# Bytes: read A (m×n), read x (n), write y (m)
bytes_transferred = (m * n + n + m) * dtype_bytes
ai = flops / bytes_transferred
print(f"GEMV ({m}×{n})")
print(f" AI: {ai:.3f} FLOP/Byte")
print(f" Status: Memory-bound (ridge=10)")
# For large matrices, AI → 2/(4+4/n+4/m) → 0.5
return ai
# GEMV is always memory-bound regardless of size!
analyze_gemv(1024, 1024) # AI: 0.499
analyze_gemv(4096, 4096) # AI: 0.500
analyze_gemv(16384, 16384) # AI: 0.500
Level 3: Matrix-Matrix Operations
Level 3 BLAS performs matrix-matrix operations: $O(n^2)$ data, $O(n^3)$ FLOPs → AI grows with n
| Operation | Definition | FLOPs | Bytes (FP32) | AI |
|---|---|---|---|---|
| SGEMM | C ← αAB + βC | 2mnk | 4(mk + kn + mn) | ≈ n/6 (square) |
| SSYMM | C ← αAB + βC (A symmetric) | 2mn² | 4(n²/2 + 2mn) | ≈ n/5 (square) |
| SSYRK | C ← αAAᵀ + βC | n²k | 4(nk + n²/2) | ≈ n/6 (square) |
| STRMM | B ← αAB (A triangular) | mn² | 4(n²/2 + mn) | ≈ n/6 (square) |
12. Deep Learning Operations
Deep learning workloads combine many operations with different characteristics. Understanding each operation's roofline position is critical for optimization.
Linear Layers & GEMM
Fully-connected layers are GEMMs: $Y = XW + b$
def analyze_linear(
batch: int, in_features: int, out_features: int,
dtype_bytes: int = 2 # FP16
):
"""Linear layer: Y = X @ W + b"""
# GEMM: (batch × in) @ (in × out) = (batch × out)
flops = 2 * batch * in_features * out_features
# Bytes: read X, read W, write Y (ignore bias - small)
bytes_x = batch * in_features * dtype_bytes
bytes_w = in_features * out_features * dtype_bytes
bytes_y = batch * out_features * dtype_bytes
bytes_total = bytes_x + bytes_w + bytes_y
ai = flops / bytes_total
# Classification
a100_ridge = 152 # FP16 Tensor Core
status = "Compute-bound" if ai > a100_ridge else "Memory-bound"
print(f"Linear({in_features}→{out_features}), batch={batch}")
print(f" AI: {ai:.1f} FLOP/B ({status})")
return ai
# Typical LLM dimensions
analyze_linear(1, 4096, 4096)     # AI: 1.0 - Memory-bound!
analyze_linear(32, 4096, 4096)    # AI: 31.5 - Memory-bound
analyze_linear(256, 4096, 4096)   # AI: 227.6 - Compute-bound!
analyze_linear(1024, 4096, 4096)  # AI: 682.7 - Compute-bound
# Key insight: batch size determines bound type!
Convolutions
Convolutions can be expressed as im2col + GEMM or direct computation:
def analyze_conv2d(
batch: int,
in_channels: int, out_channels: int,
h: int, w: int,
kernel: int = 3,
dtype_bytes: int = 2
):
"""Conv2D arithmetic intensity analysis."""
out_h, out_w = h, w # Assume same padding
# FLOPs: each output pixel needs kernel² × in_ch MACs
flops = 2 * batch * out_channels * out_h * out_w * in_channels * kernel * kernel
# Bytes
bytes_input = batch * in_channels * h * w * dtype_bytes
bytes_kernel = out_channels * in_channels * kernel * kernel * dtype_bytes
bytes_output = batch * out_channels * out_h * out_w * dtype_bytes
bytes_total = bytes_input + bytes_kernel + bytes_output
ai = flops / bytes_total
print(f"Conv2D: {in_channels}→{out_channels}, {kernel}×{kernel}, input {h}×{w}, batch={batch}")
print(f" AI: {ai:.1f} FLOP/B")
return ai
# ResNet-style convolutions
analyze_conv2d(1, 64, 64, 56, 56)    # AI: ~264 - compute-bound even at batch 1
analyze_conv2d(32, 64, 64, 56, 56)   # AI: ~287
analyze_conv2d(1, 512, 512, 7, 7)    # AI: ~48  - memory-bound: the 3×3×512×512 weights dominate
analyze_conv2d(32, 512, 512, 7, 7)   # AI: ~933 - batching amortizes the weight loads
Attention Mechanisms
Self-attention: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$
def analyze_attention(
batch: int, heads: int, seq_len: int, head_dim: int,
use_flash: bool = False,
dtype_bytes: int = 2
):
"""Analyze attention arithmetic intensity."""
B, H, S, D = batch, heads, seq_len, head_dim
# FLOPs: Q@K.T + softmax + scores@V
flops_qk = 2 * B * H * S * S * D # Q @ K.T
flops_softmax = 5 * B * H * S * S # exp, sum, div, etc.
flops_sv = 2 * B * H * S * S * D # scores @ V
total_flops = flops_qk + flops_softmax + flops_sv
if use_flash:
# Flash: only Q, K, V, O traverse HBM once
bytes_total = 4 * B * H * S * D * dtype_bytes
else:
# Standard: Q, K, V, O + attention scores matrix
bytes_qkvo = 4 * B * H * S * D * dtype_bytes
bytes_scores = B * H * S * S * dtype_bytes # The killer!
bytes_total = bytes_qkvo + 2 * bytes_scores # Read + write
ai = total_flops / bytes_total
mode = "Flash" if use_flash else "Standard"
print(f"{mode} Attention: B={B}, H={H}, S={S}, D={D}")
print(f" AI: {ai:.1f} FLOP/B")
return ai
# GPT-3 style: batch=1, heads=96, seq=2048, dim=128
analyze_attention(1, 96, 2048, 128, use_flash=False)  # AI: ~115
analyze_attention(1, 96, 2048, 128, use_flash=True)   # AI: ~1034 (9× higher)
# Longer sequences amplify the difference
analyze_attention(1, 96, 8192, 128, use_flash=False)  # AI: ~125
analyze_attention(1, 96, 8192, 128, use_flash=True)   # AI: ~4136 (33× higher)
# In practice unfused attention fares even worse: the S×S scores make extra
# round-trips for masking, softmax, and dropout that this count ignores
Normalization Layers
LayerNorm, BatchNorm, RMSNorm - all are memory-bound due to low AI:
| Operation | Formula | FLOPs/elem | Bytes/elem | AI |
|---|---|---|---|---|
| LayerNorm | $(x - \mu) / \sigma \cdot \gamma + \beta$ | ~8 | 8 (FP16) | 1.0 |
| RMSNorm | $x / \text{RMS}(x) \cdot \gamma$ | ~5 | 8 | 0.625 |
| BatchNorm | $(x - \mu) / \sigma \cdot \gamma + \beta$ | ~8 | 8 | 1.0 |
Since norms are memory-bound, kernel fusion is critical. Fusing LayerNorm with the preceding or following operation avoids a full round-trip to HBM. This is why libraries like FlashAttention and Apex fused kernels exist.
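A back-of-the-envelope sketch of the effect, comparing a residual add followed by LayerNorm as two separate kernels versus one fused kernel (FP16, elementwise byte counts as in the tables above; FLOP counts are rough):
def fused_vs_unfused_ai(n_elems: int, dtype_bytes: int = 2):
    """Residual add + LayerNorm over n_elems elements (rough FLOP counts)."""
    flops = (1 + 8) * n_elems                     # 1 add + ~8 LayerNorm FLOPs per element
    # Unfused: add reads 2 tensors + writes 1; LayerNorm reads 1 + writes 1
    unfused_bytes = (3 + 2) * n_elems * dtype_bytes
    # Fused: read the 2 inputs, write the normalized output; no intermediate tensor
    fused_bytes = 3 * n_elems * dtype_bytes
    return flops / unfused_bytes, flops / fused_bytes

unfused, fused = fused_vs_unfused_ai(32 * 1024 * 768)
print(f"unfused AI: {unfused:.2f}, fused AI: {fused:.2f}")  # 0.90 vs 1.50
# Both remain memory-bound, but the fused kernel moves 40% fewer bytes,
# so it finishes roughly 1.7× faster on a bandwidth-limited GPU.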
Activations & Elementwise Ops
All elementwise operations have very low AI:
| Operation | FLOPs/elem | Bytes/elem (FP16) | AI |
|---|---|---|---|
| ReLU | 1 (comparison) | 4 | 0.25 |
| GELU | ~12 | 4 | 3.0 |
| SiLU/Swish | ~4 | 4 | 1.0 |
| Dropout | 2 | 4 | 0.5 |
| Add (residual) | 1 | 6 | 0.17 |
13. Batch Size Effects
Batch size is the most important knob for moving operations from memory-bound to compute-bound:
import numpy as np
def analyze_batch_effect(
in_features: int,
out_features: int,
batch_sizes: list,
dtype_bytes: int = 2 # FP16
):
"""Show how batch size affects arithmetic intensity."""
# Weight bytes (constant regardless of batch)
W_bytes = in_features * out_features * dtype_bytes
print(f"Linear Layer: {in_features}→{out_features}")
print(f"Weights: {W_bytes / 1e6:.1f} MB")
print(f"\n{'Batch':>8} {'FLOPs':>12} {'Bytes':>12} {'AI':>10} {'Status':>15}")
print("-" * 60)
a100_ridge = 152 # FP16 TC
for B in batch_sizes:
# FLOPs: 2 * B * in * out
flops = 2 * B * in_features * out_features
# Bytes: X (B×in) + W (in×out) + Y (B×out)
bytes_total = (B * in_features + in_features * out_features +
B * out_features) * dtype_bytes
ai = flops / bytes_total
status = "✓ Compute" if ai > a100_ridge else "✗ Memory"
print(f"{B:>8} {flops:>12,} {bytes_total:>12,} {ai:>10.1f} {status:>15}")
# LLM-style dimensions
analyze_batch_effect(
in_features=4096,
out_features=4096,
batch_sizes=[1, 4, 16, 64, 128, 256, 512, 1024]
)
# Output:
# Linear Layer: 4096→4096
# Weights: 33.6 MB
#
#    Batch        FLOPs        Bytes         AI          Status
# ------------------------------------------------------------
#        1     33,554,432    33,570,816      1.0        ✗ Memory
#        4    134,217,728    33,619,968      4.0        ✗ Memory
#       16    536,870,912    33,816,576     15.9        ✗ Memory
#       64  2,147,483,648    34,603,008     62.1        ✗ Memory
#      128  4,294,967,296    35,651,584    120.5        ✗ Memory
#      256  8,589,934,592    37,748,736    227.6        ✓ Compute
#      512 17,179,869,184    41,943,040    409.6        ✓ Compute
#     1024 34,359,738,368    50,331,648    682.7        ✓ Compute
Batch size 1 inference (autoregressive generation) is almost always memory-bound. This is why LLM inference is dominated by weight loading time, not compute. Optimization strategies: quantization (fewer bytes), speculative decoding (more FLOPs per weight load), continuous batching (larger effective batch).
Critical Batch Size
The batch size where an operation transitions from memory-bound to compute-bound:
def critical_batch_size(
in_features: int,
out_features: int,
ridge_ai: float,
dtype_bytes: int = 2
) -> int:
"""Calculate critical batch size for compute-bound operation."""
# From the AI equation:
# AI = 2*B*in*out / (B*(in+out) + in*out) * dtype_bytes
# Solve for B when AI = ridge_ai
W = in_features * out_features
factor = 2 * dtype_bytes
    # To first order (for large in_features/out_features): B_crit ≈ ridge_ai * dtype_bytes / 2
B_crit = ridge_ai * W * dtype_bytes / (2 * in_features * out_features -
ridge_ai * (in_features + out_features) * dtype_bytes)
return max(1, int(B_crit))
# A100 FP16 Tensor Core ridge point
ridge = 152 # FLOP/Byte
# Common LLM dimensions
print("Critical batch sizes (A100 FP16):")
print(f" Linear 4096→4096: B={critical_batch_size(4096, 4096, ridge)}") # ~200
print(f" Linear 4096→16384: B={critical_batch_size(4096, 16384, ridge)}") # ~160
print(f" Linear 16384→4096: B={critical_batch_size(16384, 4096, ridge)}") # ~160
print(f" Linear 768→768: B={critical_batch_size(768, 768, ridge)}") # ~200
print(f" Linear 768→3072: B={critical_batch_size(768, 3072, ridge)}") # ~160
14. Precision Effects & Tensor Cores
Modern GPUs support multiple numerical precisions, each with different peak throughput. This creates multiple rooflines on the same diagram.
Peak Compute by Precision
NVIDIA A100 peak throughput by precision:
| Precision | Bits/Element | Peak TFLOPS | Relative to FP32 | Use Case |
|---|---|---|---|---|
| FP64 | 64 | 9.7 | 0.5× | Scientific computing |
| FP32 | 32 | 19.5 | 1× | Baseline |
| TF32 (Tensor) | 19 (stored as 32) | 156 | 8× | DL training |
| FP16 (Tensor) | 16 | 312 | 16× | DL training/inference |
| BF16 (Tensor) | 16 | 312 | 16× | DL training |
| INT8 (Tensor) | 8 | 624 | 32× | Inference |
| INT4 (Tensor) | 4 | 1248 | 64× | Inference |
Memory Bandwidth Effects
Lower precision also means fewer bytes transferred, affecting arithmetic intensity:
def analyze_precision_effect(
batch: int,
in_features: int,
out_features: int,
precisions: dict
):
"""Compare AI and performance across precisions."""
# A100 specs
peak_tflops = {
'FP32': 19.5, 'TF32': 156, 'FP16': 312,
'BF16': 312, 'INT8': 624, 'INT4': 1248
}
bytes_per_elem = {
'FP32': 4, 'TF32': 4, 'FP16': 2,
'BF16': 2, 'INT8': 1, 'INT4': 0.5
}
bw = 2039 # GB/s HBM
print(f"Linear {in_features}→{out_features}, batch={batch}")
print(f"{'Precision':>8} {'AI':>8} {'Ridge':>8} {'Bound':>12} {'Peak':>10}")
print("-" * 50)
flops = 2 * batch * in_features * out_features
for prec in ['FP32', 'TF32', 'FP16', 'INT8', 'INT4']:
elem_size = bytes_per_elem[prec]
bytes_total = (batch * in_features + in_features * out_features +
batch * out_features) * elem_size
ai = flops / bytes_total
ridge = peak_tflops[prec] * 1000 / bw # FLOP/Byte
bound = "Compute" if ai > ridge else "Memory"
# Attainable performance
if ai > ridge:
perf = peak_tflops[prec]
else:
perf = ai * bw / 1000 # TFLOPS
print(f"{prec:>8} {ai:>8.1f} {ridge:>8.1f} {bound:>12} {perf:>10.1f}")
# Example: LLM inference (small batch)
analyze_precision_effect(1, 4096, 4096, {})
# Output:
# Linear 4096→4096, batch=1
# Precision       AI    Ridge        Bound       Peak
# --------------------------------------------------
#     FP32      0.5      9.6       Memory        1.0
#     TF32      0.5     76.5       Memory        1.0
#     FP16      1.0    153.0       Memory        2.0
#     INT8      2.0    306.0       Memory        4.1
#     INT4      4.0    612.1       Memory        8.2
#
# Key insight: Lower precision → higher AI → closer to compute roof!
Multi-Precision Roofline
Quantization isn't just about model size: it fundamentally changes your position on the roofline. A memory-bound FP32 kernel quantized to INT4 moves 8× fewer bytes (so its arithmetic intensity rises 8×) and gains access to a compute ceiling roughly 64× higher (1,248 TOPS vs 19.5 TFLOPS on A100).
15. Extended Roofline Models
The classic roofline focuses on compute vs memory bandwidth. Extended models add other limiting factors.
Energy Roofline
For mobile/edge deployment, energy efficiency matters as much as raw performance:
def energy_roofline(flops: int, bytes_transferred: int):
"""Calculate energy efficiency using energy roofline."""
# Energy costs (approximate for modern GPUs)
E_FLOP = 0.3e-12 # Joules per FP16 FLOP
E_BYTE = 20e-12 # Joules per byte (HBM)
energy_compute = flops * E_FLOP
energy_memory = bytes_transferred * E_BYTE
total_energy = energy_compute + energy_memory
efficiency = flops / total_energy / 1e9 # GFLOPS/Watt
# Which bound dominates?
ai = flops / bytes_transferred
energy_ridge = E_BYTE / E_FLOP # ~67 FLOP/Byte for these params
if ai < energy_ridge:
bound = "Memory-energy bound"
else:
bound = "Compute-energy bound"
print(f"Energy Efficiency: {efficiency:.1f} GFLOPS/W ({bound})")
return efficiency
# GEMM 4K×4K×4K FP16
flops = 2 * 4096**3 # 137B FLOPs
bytes_mem = 3 * 4096**2 * 2 # 100 MB
energy_roofline(flops, bytes_mem) # ~3000 GFLOPS/W (compute-bound)
# Attention softmax
flops_softmax = 5 * 2048**2 # 21M FLOPs
bytes_softmax = 2 * 2048**2 * 2 # 16 MB
energy_roofline(flops_softmax, bytes_softmax) # ~60 GFLOPS/W (memory-bound)
Communication Roofline
For distributed training, network bandwidth becomes another ceiling. Defining communication intensity (CI) as FLOPs per byte sent over the interconnect, attainable performance is bounded by $P = \min(P_{\text{peak}},\ \text{BW}_{\text{net}} \times \text{CI})$ with a communication ridge point $\text{CI}_{\text{ridge}} = P_{\text{peak}} / \text{BW}_{\text{net}}$.
Data parallelism becomes communication-bound when $\frac{2 \cdot \text{FLOPs}_{\text{forward}}}{N \cdot \text{ParamBytes}} < \text{CI}_{\text{ridge}}$, i.e. when the compute per step (roughly forward plus backward) divided by the gradient bytes all-reduced across $N$ GPUs falls below the ridge. This is why large-scale training uses model parallelism, pipeline parallelism, and gradient compression.
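A rough sketch of that check for data parallelism. The modeling choices here are all assumptions: ~6 FLOPs per parameter per token for forward plus backward, a ring all-reduce that moves about 2× the FP16 gradient bytes, and an illustrative 100 GB/s of interconnect bandwidth per GPU:
def dp_communication_check(params: float, tokens_per_gpu: int,
                           net_bw_gbs: float = 100,     # assumed interconnect GB/s per GPU
                           peak_tflops: float = 312):   # A100 FP16 Tensor Core
    """Is a data-parallel training step compute-bound or communication-bound?"""
    step_flops = 6 * params * tokens_per_gpu    # ~2 fwd + ~4 bwd FLOPs per param per token
    comm_bytes = 2 * params * 2                 # ring all-reduce of FP16 gradients (~2× buffer)
    ci = step_flops / comm_bytes                # communication intensity, FLOP per byte sent
    ci_ridge = peak_tflops * 1e12 / (net_bw_gbs * 1e9)
    bound = "compute" if ci > ci_ridge else "communication"
    print(f"CI = {ci:,.0f} FLOP/B vs ridge {ci_ridge:,.0f} → {bound}-bound")

dp_communication_check(params=7e9, tokens_per_gpu=4096)  # large micro-batch → compute-bound
dp_communication_check(params=7e9, tokens_per_gpu=128)   # tiny micro-batch → communication-bound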
16. Optimization Trajectories
Different optimizations move you in different directions on the roofline:
| Optimization | Direction | When to Use | Examples |
|---|---|---|---|
| Better BW Utilization | ↑ Vertical | Memory-bound, below roof | Coalesced access, prefetching |
| Kernel Fusion | ↗ Diagonal | Memory-bound, multiple kernels | FlashAttention, fused LayerNorm |
| Algorithmic (Tiling) | → Horizontal | Memory-bound, data reuse possible | Blocked GEMM, im2col |
| Quantization | ↗ To ceiling | Memory-bound, accuracy permits | INT8, INT4, FP8 |
| Compute Opts | ↑ To ceiling | Compute-bound | Tensor Cores, loop unrolling |
| Parallelism | ↑ Higher roof | At ceiling, need more | Multi-GPU, model parallelism |
def recommend_optimization(
current_perf: float, # GFLOP/s
arithmetic_intensity: float, # FLOP/Byte
peak_compute: float, # GFLOP/s
peak_bandwidth: float, # GB/s
) -> str:
"""Recommend optimization based on roofline position."""
ridge_point = peak_compute / peak_bandwidth
roof_perf = min(peak_compute, arithmetic_intensity * peak_bandwidth)
efficiency = current_perf / roof_perf
if arithmetic_intensity < ridge_point:
# Memory-bound region
if efficiency < 0.5:
return """Memory-bound, low efficiency:
1. Fix memory access patterns (coalescing)
2. Use faster memory tier (L2 cache)
3. Profile for bank conflicts"""
elif efficiency < 0.8:
return """Memory-bound, moderate efficiency:
1. Try kernel fusion to increase AI
2. Use tiling for data reuse
3. Consider quantization"""
else:
return """Memory-bound, high efficiency:
1. Algorithm change needed for more AI
2. Quantization is the main lever
3. Accept memory-bound nature"""
else:
# Compute-bound region
if efficiency < 0.7:
return """Compute-bound, low efficiency:
1. Use Tensor Cores (if not already)
2. Increase occupancy
3. Reduce register pressure"""
else:
return """Compute-bound, high efficiency:
1. Already well-optimized!
2. Scale with more GPUs
3. Use lower precision if acceptable"""
# Example usage
print(recommend_optimization(
current_perf=150, # 150 GFLOP/s observed
arithmetic_intensity=2.0, # 2 FLOP/Byte
peak_compute=312000, # A100 FP16
peak_bandwidth=2039 # A100 HBM
))
17. Case Studies
ResNet-50 Inference
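As a sketch of this case study (standard ResNet-50 3×3 convolution shapes assumed, FP16, and the A100 Tensor Core ridge of ~153 FLOP/B), the same per-layer AI analysis used for convolutions in Section 12 shows where representative stages land:
def conv_ai(batch, c_in, c_out, h, w, k=3, dtype_bytes=2):
    """Arithmetic intensity of a same-padded k×k convolution (FP16 by default)."""
    flops = 2 * batch * c_out * h * w * c_in * k * k
    nbytes = (batch * c_in * h * w + c_out * c_in * k * k
              + batch * c_out * h * w) * dtype_bytes
    return flops / nbytes

RIDGE_FP16 = 153  # A100 FP16 Tensor Core ridge point (FLOP/Byte)
stages = [(64, 64, 56, 56), (128, 128, 28, 28),
          (256, 256, 14, 14), (512, 512, 7, 7)]
for batch in (1, 32):
    for c_in, c_out, h, w in stages:
        ai = conv_ai(batch, c_in, c_out, h, w)
        bound = "compute" if ai > RIDGE_FP16 else "memory"
        print(f"B={batch:2} {c_in:3}ch {h}×{w}: AI ≈ {ai:6.1f} ({bound}-bound)")
# Early, spatially large stages are compute-bound even at batch 1; the late
# 7×7 stage is weight-dominated and needs batching to leave the memory roof.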
Transformer Training
def analyze_transformer_block(
batch: int,
seq_len: int,
hidden: int,
heads: int,
ffn_mult: int = 4,
dtype_bytes: int = 2
):
"""Analyze a transformer block's operations."""
head_dim = hidden // heads
ffn_hidden = hidden * ffn_mult
ops = []
# QKV Projection
qkv_flops = 2 * batch * seq_len * hidden * 3 * hidden
qkv_bytes = (batch * seq_len * hidden + 3 * hidden * hidden +
batch * seq_len * 3 * hidden) * dtype_bytes
ops.append(('QKV Proj', qkv_flops, qkv_bytes))
# Attention (with Flash)
attn_flops = 4 * batch * heads * seq_len * seq_len * head_dim
attn_bytes = 4 * batch * seq_len * hidden * dtype_bytes # Flash: no S×S
ops.append(('Attention', attn_flops, attn_bytes))
# Output Projection
out_flops = 2 * batch * seq_len * hidden * hidden
out_bytes = (batch * seq_len * hidden + hidden * hidden +
batch * seq_len * hidden) * dtype_bytes
ops.append(('Out Proj', out_flops, out_bytes))
# LayerNorm (×2)
ln_flops = 2 * 8 * batch * seq_len * hidden
ln_bytes = 2 * 2 * batch * seq_len * hidden * dtype_bytes
ops.append(('LayerNorm', ln_flops, ln_bytes))
# FFN (up + down)
ffn_flops = 2 * batch * seq_len * hidden * ffn_hidden * 2
ffn_bytes = (batch * seq_len * hidden + hidden * ffn_hidden +
batch * seq_len * ffn_hidden + ffn_hidden * hidden +
batch * seq_len * hidden) * dtype_bytes
ops.append(('FFN', ffn_flops, ffn_bytes))
# Activations (GELU, residuals)
act_flops = 15 * batch * seq_len * ffn_hidden
act_bytes = 4 * batch * seq_len * ffn_hidden * dtype_bytes
ops.append(('Activations', act_flops, act_bytes))
print(f"Transformer Block: B={batch}, S={seq_len}, H={hidden}")
print(f"{'Operation':>12} {'GFLOP':>10} {'GB':>8} {'AI':>8} {'Bound':>10}")
print("-" * 52)
ridge = 153 # A100 FP16 TC
total_flops = 0
total_bytes = 0
for name, flops, bytes_mem in ops:
ai = flops / bytes_mem
bound = "Compute" if ai > ridge else "Memory"
print(f"{name:>12} {flops/1e9:>10.1f} {bytes_mem/1e9:>8.3f} {ai:>8.1f} {bound:>10}")
total_flops += flops
total_bytes += bytes_mem
print("-" * 52)
total_ai = total_flops / total_bytes
print(f"{'TOTAL':>12} {total_flops/1e9:>10.1f} {total_bytes/1e9:>8.3f} {total_ai:>8.1f}")
# GPT-2 style block
analyze_transformer_block(batch=32, seq_len=1024, hidden=768, heads=12)
# Output:
# Transformer Block: B=32, S=1024, H=768
#    Operation      GFLOP       GB       AI      Bound
# ----------------------------------------------------
#     QKV Proj      116.0    0.205    566.0    Compute
#    Attention      103.1    0.201    512.0    Compute
#     Out Proj       38.7    0.102    379.6    Compute
#    LayerNorm        0.4    0.201      2.0     Memory
#          FFN      309.2    0.311    993.0    Compute
#  Activations        1.5    0.805      1.9     Memory
# ----------------------------------------------------
#        TOTAL      568.8    1.826    311.5
Training has 3× the FLOPs (forward + backward + optimizer) but similar memory access per parameter. This naturally increases AI, making training more compute-bound than inference.
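A rough sketch of that effect for a single linear layer; the traffic counts below are assumptions (backward ≈ 2× the forward FLOPs, weights crossing HBM about twice more for the input and weight gradients, activations touched about three times in total):
def fwd_vs_training_ai(batch, in_f, out_f, dtype_bytes=2):
    """Rough AI of a linear layer: forward only vs. forward + backward."""
    act = batch * (in_f + out_f)       # input + output activation elements
    w = in_f * out_f                   # weight elements
    fwd_flops = 2 * batch * in_f * out_f
    fwd_bytes = (act + w) * dtype_bytes
    # Rough counts: 3× the FLOPs, ~2× the weight traffic, ~3× the activation traffic
    trn_flops = 3 * fwd_flops
    trn_bytes = (3 * act + 2 * w) * dtype_bytes
    return fwd_flops / fwd_bytes, trn_flops / trn_bytes

fwd_ai, trn_ai = fwd_vs_training_ai(batch=1, in_f=4096, out_f=4096)
print(f"forward AI: {fwd_ai:.1f}, training-step AI: {trn_ai:.1f}")  # 1.0 vs 1.5
# The 3× FLOPs lift AI somewhat, but the much bigger lever is that training
# runs at large batch sizes while autoregressive inference runs at batch 1.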
LLM Inference
LLM inference has two distinct phases with very different roofline characteristics:
def analyze_llm_inference(
model_params_B: float, # Billions of parameters
prompt_tokens: int,
gen_tokens: int,
precision_bytes: int = 2, # FP16
peak_tflops: float = 312, # A100
bw_gb_s: float = 2039 # A100 HBM
):
"""Analyze LLM inference performance breakdown."""
params = model_params_B * 1e9
param_bytes = params * precision_bytes
# ===== PREFILL PHASE =====
# All tokens processed in parallel (like batch = prompt_tokens)
prefill_flops = 2 * params * prompt_tokens # 2 FLOPs per param per token
prefill_bytes = param_bytes + prompt_tokens * 4096 * precision_bytes # Weights + activations
prefill_ai = prefill_flops / prefill_bytes
ridge = peak_tflops * 1000 / bw_gb_s
if prefill_ai > ridge:
prefill_time = prefill_flops / (peak_tflops * 1e12)
else:
prefill_time = prefill_bytes / (bw_gb_s * 1e9)
# ===== DECODE PHASE =====
# One token at a time (batch = 1)
decode_flops_per_token = 2 * params
decode_bytes_per_token = param_bytes # Must load ALL weights for each token!
decode_ai = decode_flops_per_token / decode_bytes_per_token # Always ~1!
# Decode is always memory-bound
decode_time_per_token = decode_bytes_per_token / (bw_gb_s * 1e9)
total_decode_time = decode_time_per_token * gen_tokens
tokens_per_sec = 1 / decode_time_per_token
print(f"===== LLM Inference Analysis =====")
print(f"Model: {model_params_B}B params ({param_bytes/1e9:.1f} GB)")
print(f"Prompt: {prompt_tokens} tokens, Generate: {gen_tokens} tokens")
print()
print(f"PREFILL:")
print(f" AI: {prefill_ai:.0f} FLOP/B → {'Compute' if prefill_ai > ridge else 'Memory'}-bound")
print(f" Time: {prefill_time*1000:.1f} ms")
print()
print(f"DECODE:")
print(f" AI: {decode_ai:.1f} FLOP/B → Memory-bound (ALWAYS!)")
print(f" Per-token: {decode_time_per_token*1000:.1f} ms")
print(f" Total: {total_decode_time*1000:.0f} ms")
print(f" Throughput: {tokens_per_sec:.0f} tokens/sec")
print()
print(f"TOTAL: {(prefill_time + total_decode_time)*1000:.0f} ms")
print(f"Decode dominates: {total_decode_time/(prefill_time+total_decode_time)*100:.0f}%")
# Llama-2 7B example
analyze_llm_inference(
model_params_B=7,
prompt_tokens=512,
gen_tokens=256
)
# Output:
# ===== LLM Inference Analysis =====
# Model: 7B params (14.0 GB)
# Prompt: 512 tokens, Generate: 256 tokens
#
# PREFILL:
# AI: 512 FLOP/B → Compute-bound
# Time: 22.9 ms
#
# DECODE:
# AI: 1.0 FLOP/B → Memory-bound (ALWAYS!)
# Per-token: 6.9 ms
# Total: 1759 ms
# Throughput: 146 tokens/sec
#
# TOTAL: 1782 ms
# Decode dominates: 99%
Decode is the bottleneck. Key optimizations target this phase: Quantization (fewer bytes to load), Speculative Decoding (more FLOPs per weight load), Continuous Batching (higher effective batch), KV Cache Compression (less memory), Model Parallelism (aggregate bandwidth).
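Because decode throughput is essentially bandwidth divided by the bytes each token must stream, the effect of quantization can be estimated in one line; a sketch that ignores the KV cache and any compute or overhead floor:
def decode_tokens_per_sec(params: float, bytes_per_param: float,
                          bw_gb_s: float = 2_039) -> float:
    """Upper bound on decode rate: every generated token streams all the weights."""
    return bw_gb_s * 1e9 / (params * bytes_per_param)

for name, b in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"7B model, {name}: ~{decode_tokens_per_sec(7e9, b):.0f} tokens/sec")
# FP16: ~146, INT8: ~291, INT4: ~583 - halving the bytes doubles the ceiling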
18. Conclusion & Key Takeaways
- Every kernel has two possible bottlenecks: compute or memory bandwidth. The roofline tells you which one applies.
- Arithmetic Intensity (AI) determines your bound: AI < ridge → memory-bound, AI > ridge → compute-bound.
- Most DL operations are memory-bound: Only large GEMMs with sufficient batch size hit compute ceilings.
- Batch size is the primary lever: Increasing batch moves operations from memory-bound to compute-bound.
- Quantization helps both ways: Fewer bytes transferred AND higher compute throughput.
- Kernel fusion increases effective AI: FlashAttention eliminates the memory-bound softmax bottleneck.
- LLM decode is always memory-bound: ~1 FLOP/Byte regardless of model size. Speed = bandwidth.
- Profile before optimizing: Know your position on the roofline before choosing an optimization strategy.
The roofline model transforms GPU optimization from guesswork into a systematic process. By understanding your workload's position on the roofline, you can:
- Prioritize correctly: Don't optimize compute for memory-bound kernels
- Set realistic targets: Know the theoretical maximum before chasing it
- Choose the right technique: Quantization, batching, fusion, or algorithmic change
- Know when to stop: Recognize when you've hit the ceiling
Master the roofline, and you'll write faster GPU code in less time.