1. Why Efficient Inference Matters

The gap between training and inference is often overlooked. During training, we optimize for:

  • Accuracy: squeezing out every fraction of a percent on the validation set
  • Throughput over large batches (32-256), with hours or days of GPU time available
  • Fitting activations and optimizer state into GPU memory

But in production inference, we optimize for:

  • Latency: milliseconds per request, often at batch size 1-8
  • Model size and memory footprint, especially on CPU and edge devices
  • Energy and cost per query

A typical state-of-the-art LLM inference scenario: a multi-billion-parameter model serving interactive requests under a strict latency budget, on hardware with far less memory than the training cluster.

This is why model compression (pruning + quantization) is critical: roughly 4-16x speedups and up to ~10x memory compression are routinely achievable with minimal accuracy loss.

Inference Optimization Challenge
Training phase (GPU): goal = accuracy, timescale = hours/days, constraint = GPU memory, batch size = large (32-256).
Inference phase (GPU/CPU/edge): goal = latency and throughput, timescale = milliseconds, constraint = model size, batch size = small (1-8).
Training optimizes for accuracy; inference optimizes for speed, latency, memory, and energy

2. Pruning Fundamentals: Removing Unnecessary Parameters

Pruning is the process of setting certain weights to zero, removing parameters that contribute little to the model's output. The key insight: not all parameters are equally important.

Why Pruning Works

Neural networks are heavily over-parameterized: after training, many weights are redundant or near zero, so removing them barely changes the function the network computes.

Pruning Concept: From Dense to Sparse
Dense network (100% of weights) → prune → sparse network (30% of weights): 70% fewer parameters, giving roughly 2-4x speedup and reduced memory.
Pruning removes 70% of connections while retaining ~95% accuracy

Unstructured Pruning: Individual Weight Removal

Unstructured pruning removes individual weights from a matrix based on their magnitude. The sparsity pattern is irregular—any weight can be removed.

How It Works

  1. Magnitude Ranking: Sort all weights by absolute value
  2. Threshold Setting: Choose a percentile (e.g., remove bottom 90% of weights)
  3. Masking: Multiply weight matrix element-wise with a binary mask
  4. Fine-tuning: Retrain the remaining weights to recover accuracy
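Steps 1-3 above can be sketched in a few lines of NumPy (a toy sketch only — real frameworks such as torch.nn.utils.prune manage this with masks and forward hooks, and `magnitude_prune` is an illustrative helper name, not a library API):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights.

    sparsity: fraction of weights to remove (e.g. 0.6 -> 60% zeros).
    """
    k = int(weights.size * sparsity)          # number of weights to prune
    if k == 0:
        return weights.copy()
    # Step 1+2: threshold = magnitude of the k-th smallest |w|
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    # Step 3: element-wise binary mask
    mask = np.abs(weights) > threshold
    return weights * mask

# The 3x3 example used in this section: prune the four smallest weights
W = np.array([[0.8, 0.01, 0.5],
              [0.02, 0.9, 0.03],
              [0.6, 0.01, 0.7]])
W_sparse = magnitude_prune(W, sparsity=4 / 9)
# W_sparse == [[0.8, 0, 0.5], [0, 0.9, 0], [0.6, 0, 0.7]]
```

Step 4 (fine-tuning) would then retrain the surviving weights with the mask held fixed.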
Unstructured Pruning Process
Step 1 - Original weights (100% density):
  W = [[0.8, 0.01, 0.5],
       [0.02, 0.9, 0.03],
       [0.6, 0.01, 0.7]]

Step 2 - Sort by magnitude and prune the smallest: keep 0.9, 0.8, 0.7, 0.6, 0.5; prune 0.03, 0.02, 0.01, 0.01.

Step 3 - Pruned matrix (~44% density, sparse):
  W' = [[0.8, 0, 0.5],
        [0, 0.9, 0],
        [0.6, 0, 0.7]]

Hardware implications:
  ✓ Fewer non-zero elements means less memory traffic
  ✗ Irregular memory patterns are hard to accelerate
  ✗ General sparse kernels lose data locality
  ✗ Most GPUs lack native unstructured-sparse support; benefits show up mainly on CPUs or specialized hardware
Unstructured pruning removes individual weights irregularly, reducing density and memory but challenging for hardware acceleration

Pros & Cons

  ✓ Highest sparsity at a given accuracy: any individual weight can be removed
  ✗ Irregular sparsity patterns are hard to accelerate on GPUs, so memory savings rarely translate into proportional speedups

Structured Pruning: Channel and Filter Removal

Structured pruning removes entire channels, filters, or blocks at once. The sparsity pattern is regular and GPU-friendly.
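As a toy illustration of the idea (not a full pipeline — the following layer's input channels would also need slicing to match, and `prune_filters` is a hypothetical helper, not a library API), filter pruning by L2 norm might look like:

```python
import numpy as np

def prune_filters(conv_w: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Drop entire output filters with the smallest L2 norms.

    conv_w: (out_channels, in_channels, kH, kW) convolution weights.
    Returns a physically smaller tensor (not just a masked one).
    """
    n_keep = max(1, int(conv_w.shape[0] * keep_ratio))
    # Importance score: L2 norm of each output filter
    norms = np.sqrt((conv_w ** 2).sum(axis=(1, 2, 3)))
    # Keep the strongest filters, preserving their original order
    keep = np.sort(np.argsort(norms)[-n_keep:])
    return conv_w[keep]

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32, 3, 3))
w_pruned = prune_filters(w, keep_ratio=0.5)   # shape (32, 32, 3, 3)
```

Because whole filters disappear, the result is a genuinely smaller dense layer that runs on standard dense kernels — no sparse support needed.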

Key Structures

  • Channels: entire input/output channels of a convolutional layer
  • Filters: whole convolutional filters (e.g., a complete 3×3 kernel stack)
  • Attention heads: complete heads in a Transformer's multi-head attention

Structured Pruning Types
Channel pruning: removes entire channels between input and output (e.g., 1 of 3 filters) → 2-3x speedup, ~50% memory savings.
Filter pruning (3×3 conv): removes whole 3×3 filters, leaving a regular, GPU-friendly pattern.
Head pruning (Transformers): removes low-importance attention heads (H1...H4), reducing dimensions while maintaining structure.
All three yield regular, hardware-friendly computations with high throughput.
Structured pruning removes complete channels, filters, or attention heads—GPU-accelerated operations

Pros & Cons

  ✓ Regular patterns run on standard dense kernels: real speedups on any hardware
  ✗ Coarser granularity: accuracy degrades faster at high sparsity than with unstructured pruning

N:M Sparsity: Hardware-Native Sparsity Patterns

N:M sparsity (also called fine-grained structured sparsity) is a compromise between unstructured and structured pruning. In each group of M consecutive elements, exactly N values are kept non-zero. The most common pattern is 2:4 sparsity (keep 2 out of every 4 weights).

2:4 Sparsity Example

Consider a weight matrix where we apply 2:4 sparsity:

2:4 Sparsity Pattern
Original weights (dense): [0.8, 0.1, 0.5, 0.2 | 0.3, 0.7, 0.4, 0.05] — two groups of four.
After 2:4 pruning: [0.8, 0, 0.5, 0 | 0, 0.7, 0.4, 0]

Key insight:
  • In every 4 consecutive weights, exactly 2 are non-zero
  • Sparsity: 50% (2 zeros out of every 4)
  • The fixed pattern lets the GPU process groups through specialized Tensor Core paths
  • Other N:M ratios offering more compression appear in research, but 2:4 is the pattern with broad hardware support

GPU support: NVIDIA Ampere (A100, RTX 30-series), Ada Lovelace (L40S, RTX 40-series), and Hopper (H100), all with sparse Tensor Core support.
2:4 sparsity: 50% structured density with GPU-native acceleration support
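Enforcing the 2:4 pattern on a weight tensor is straightforward: keep the two largest magnitudes in every group of four. A minimal sketch (`enforce_2_4` is an illustrative helper, not a library API):

```python
import numpy as np

def enforce_2_4(weights: np.ndarray) -> np.ndarray:
    """Keep the 2 largest-magnitude values in every group of 4 consecutive weights."""
    w = weights.reshape(-1, 4)                      # groups of 4
    # Positions of the 2 smallest |w| in each group -> zero them out
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    out = w.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out.reshape(weights.shape)

# The two groups from the figure
W = np.array([0.8, 0.1, 0.5, 0.2, 0.3, 0.7, 0.4, 0.05])
W24 = enforce_2_4(W)
# W24 == [0.8, 0, 0.5, 0, 0, 0.7, 0.4, 0] -- matches the figure
```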

Why 2:4?

2:4 hits a sweet spot: 50% sparsity is usually recoverable with brief fine-tuning, the metadata overhead is tiny (2 bits per non-zero), and the fixed group structure keeps memory access regular enough for Tensor Cores to exploit directly.

N:M Sparsity on NVIDIA GPUs

NVIDIA first introduced native 2:4 sparsity support in the Ampere architecture (A100, RTX A6000) and carried it forward in Ada Lovelace (L40S, RTX Ada) and Hopper (H100).

How It Works on GPU

2:4 Sparsity Execution on NVIDIA Tensor Cores
Dense GEMM (A × B = C): for each group of 4 weights, the Tensor Core loads 4 values and performs 4 multiply-accumulates in time T.
2:4 sparse GEMM: the Tensor Core loads only the 2 non-zero weights per group (e.g., indices 0,2 for group 1; indices 1,2 for group 2) and performs 2 multiply-accumulates, finishing the same work in roughly T/2.

How the GPU executes 2:4 sparsity:

1. Sparse format storage:
   • Dense values: [0.8, 0.5, 0.7, 0.4] (only non-zeros stored explicitly)
   • Metadata: [0, 2, 1, 2] (the within-group position of each non-zero, 2 bits each)
2. Tensor Core execution:
   • Read the metadata to know exactly which input elements pair with each stored weight
   • Execute 2 FMAs (fused multiply-adds) per group
   • Skip the zero multiplications entirely → up to 2x throughput versus dense
GPU executes sparse weight groups independently, achieving ~2x faster throughput
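The values-plus-metadata storage scheme described above can be sketched in NumPy (illustrative only — the real cuSPARSELt format packs the indices into 2-bit fields and tiles them for the Tensor Cores; `pack_2_4` and `unpack_2_4` are hypothetical helper names):

```python
import numpy as np

def pack_2_4(w: np.ndarray):
    """Compress a valid 2:4-sparse vector into (values, metadata).

    values:   the 2 non-zeros from each group of 4, in order
    metadata: the within-group position (0-3) of each non-zero;
              hardware stores this in just 2 bits per entry.
    Assumes every group of 4 has exactly 2 non-zeros.
    """
    groups = w.reshape(-1, 4)
    meta = np.array([np.flatnonzero(g) for g in groups])   # (n_groups, 2)
    values = groups[groups != 0].reshape(-1, 2)
    return values, meta

def unpack_2_4(values: np.ndarray, meta: np.ndarray, length: int) -> np.ndarray:
    """Reconstruct the dense vector from the compressed form."""
    out = np.zeros(length).reshape(-1, 4)
    np.put_along_axis(out, meta, values, axis=1)
    return out.reshape(length)

dense = np.array([0.8, 0.0, 0.5, 0.0, 0.0, 0.7, 0.4, 0.0])
values, meta = pack_2_4(dense)
# values == [[0.8, 0.5], [0.7, 0.4]], meta == [[0, 2], [1, 2]]
```

Note that the compressed form holds half the values plus 2 bits of metadata each — the source of the ~2x bandwidth and compute savings.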

Supported NVIDIA GPUs (2:4 Sparsity)

Important Note

2:4 sparsity acceleration requires:

  • cuSPARSELt library or TensorRT with sparsity support
  • Weights to be structured in 2:4 pattern
  • Not all frameworks support it natively (TensorRT does, PyTorch needs plugins)

Pruning CNNs vs Transformers

While pruning principles are universal, CNNs and Transformers respond differently to sparsity due to their architectural differences.

Pruning Characteristics: CNNs vs Transformers
CNNs
  Architecture: Conv → ReLU → Pool → Conv → ... with dense final classification layers
  Highly prunable:
    • Early/late conv layers: 80-90% sparsity achievable; FC layers 80-90% sparse
    • Channel reduction is effective; spatial locality matters (prune the widest 256-512ch stages)
    • Depthwise-separable convolutions: prune at filter/channel granularity
  Speedup achieved:
    • ResNet-50 at 70% sparsity → 4-5x speedup
    • MobileNet (already compact) → 2-3x

Transformers
  Architecture: (multi-head self-attention → FFN) × 12+ layers; all weights start dense
  Prunable targets:
    • Attention heads: 25-50% removable (low-attention heads)
    • FFN neurons: 40-70% sparsity
    • Entire layers: 20-30% removable; decisions are head/layer-level and less position-dependent than in CNNs
  Speedup achieved:
    • BERT (12 layers) at 50% sparsity → 1.5-2x speedup
    • Vision Transformer at 30-40% sparsity → ~2x
    • LLaMA-7B: harder; needs careful fine-tuning

  Aspect             CNNs                   Transformers
  Pruning method     Structured (channel)   Mixed (head + FFN)
  Max sparsity       70-80%                 40-50%
  Fine-tune needed   Yes (1-5 epochs)       Yes (5-20 epochs)
CNNs benefit from spatial structure pruning; Transformers from attention head and layer reduction

3. Software Stack & Tools for Sparsity

To apply sparsity in practice, you need libraries and frameworks that understand your target hardware:

Framework-Level Support

Inference Engines

Python pruning_example.py
import torch
import torch.nn.utils.prune as prune

# Load your model (placeholder for your own architecture)
model = YourModel()

# Apply structured magnitude pruning to all Conv2d layers
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        # Prune 30% of output channels, ranked by L2 norm (n=2)
        prune.ln_structured(
            module,
            name="weight",
            amount=0.3,
            n=2,
            dim=0,  # prune along output channels
        )

# Make pruning permanent (remove the mask from every pruned layer)
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.remove(module, "weight")

# Fine-tune the pruned model to recover accuracy
optimizer = torch.optim.Adam(model.parameters())
for epoch in range(10):
    for inputs, labels in train_loader:
        output = model(inputs)
        loss = loss_fn(output, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Pro Tip: Early Layer Pruning

Don't prune uniformly! Early layers (close to input) can be pruned more aggressively (70-80%), while later layers need more precision. This is called layer-adaptive pruning.
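A minimal sketch of such a layer-adaptive schedule — here just a linear decay from aggressive to conservative (real methods typically derive per-layer targets from a sensitivity analysis; `layerwise_sparsity` is an illustrative helper):

```python
import numpy as np

def layerwise_sparsity(n_layers: int, first: float = 0.8, last: float = 0.3):
    """Linearly decay the target sparsity from early layers to late layers.

    Early layers (close to the input) tolerate aggressive pruning,
    while later, more task-specific layers keep more of their weights.
    """
    return np.linspace(first, last, n_layers)

schedule = layerwise_sparsity(5)
# schedule == [0.8, 0.675, 0.55, 0.425, 0.3]
```

Each entry would then be passed as the per-layer `amount` to whatever pruning routine you use.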

Sparsity Acceleration Support Matrix
Hardware         Unstructured   Structured     2:4 Sparsity    Speedup
A100 (Ampere)    Limited        Full support   Excellent       ~2x
H100 (Hopper)    No             Excellent      Best-in-class   ~3-4x
V100 (Volta)     No             Limited        No              ~1.2x
CPU (AVX-512)    Excellent      Limited        No              ~3-5x

Legend: 🟢 Full support = hardware accelerated | 🔵 Limited = software fallback | 🔴 No = not available | Speedup = vs dense FP32
Sparsity hardware support across NVIDIA GPU generations and CPU

4. Quantization Fundamentals: Reducing Precision

Quantization reduces the numerical precision of weights and activations from floating-point to fixed-point or integer representations. Unlike pruning which removes weights, quantization represents the same weights using fewer bits.

Number Formats: From FP32 to INT4

Neural networks are typically trained in FP32 (32-bit floating-point). For inference, we can use lower-precision formats:

Floating-Point and Integer Number Formats
FP32 (32-bit float): 1 sign bit, 8 exponent bits, 23 mantissa bits. Range ±3.4×10³⁸; 4 bytes per value.
FP16 (16-bit float): 1 sign bit, 5 exponent bits, 10 mantissa bits. Range ±65,504; 2 bytes.
BF16 (16-bit bfloat): 1 sign bit, 8 exponent bits, 7 mantissa bits. Same range as FP32 (±3.4×10³⁸); 2 bytes.
INT8 (8-bit integer): signed −128 to 127 (or unsigned 0 to 255), mapped to real values by a per-layer calibrated scale and zero-point. 1 byte (4x compression vs FP32).

Rough memory/accuracy trade-off: FP32 = reference (100% accuracy); FP16/BF16 ≈ 99% accuracy at 2x savings; INT8 ≈ 96-98% at 4x; INT4 ≈ 90-94% at 8x.
Precision tradeoff: lower bits = less memory but more accuracy loss

Key Formats (For Inference)

Quantization Theory: Scale and Zero-Point

Integer quantization maps floating-point values to integers using a linear quantization scheme:

Linear Quantization: Mapping FP32 to INT8
Quantization formula:
  q = round(x / scale) + zero_point
where x is the original FP32 value, q the quantized integer, scale the step size of the mapping, and zero_point the integer offset.

Example: quantize the range [−2.5, 2.5] to INT8 [−127, 127] (symmetric, so zero_point = 0):
  scale = 2.5 / 127 ≈ 0.0197

Quantization examples:
   0.0 → q = round(0.0 / 0.0197) + 0 = 0
   1.0 → q = round(1.0 / 0.0197) + 0 = 51
  −2.0 → q = round(−2.0 / 0.0197) + 0 = −102
   2.5 → q = round(2.5 / 0.0197) + 0 = 127

De-quantization (at inference):
  x ≈ (q − zero_point) × scale
  q = 51 → x = (51 − 0) × 0.0197 ≈ 1.0 ✓
Symmetric quantization (zero_point = 0) vs asymmetric (non-zero zero_point for skewed ranges)
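The scale and round-trip math above is easy to express directly. A minimal symmetric-quantization sketch in NumPy (`quantize_symmetric` and `dequantize` are illustrative helpers, not a library API):

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, n_bits: int = 8):
    """Symmetric linear quantization: zero_point = 0, INT8 range [-127, 127]."""
    qmax = 2 ** (n_bits - 1) - 1                  # 127 for 8 bits
    scale = np.abs(x).max() / qmax                # map the largest |x| to qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values for inference."""
    return q.astype(np.float32) * scale

x = np.array([0.0, 1.0, -2.0, 2.5], dtype=np.float32)
q, scale = quantize_symmetric(x)   # scale = 2.5 / 127 ≈ 0.0197
x_hat = dequantize(q, scale)       # ≈ [0.0, 1.004, -2.008, 2.5]
```

The residual (e.g., 1.004 instead of 1.0) is the quantization error; it shrinks as the bit width grows and grows as the range widens.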

Symmetric vs Asymmetric

Symmetric quantization fixes zero_point = 0, which suits roughly zero-centered distributions like weights. Asymmetric quantization uses a non-zero zero_point so the integer range covers a skewed interval, which suits post-ReLU activations whose values are all non-negative.

Post-Training Quantization (PTQ)

PTQ is the simplest approach: train the model normally, then quantize weights/activations without retraining.

PTQ Workflow

Post-Training Quantization Workflow
Workflow: trained FP32 model (~100% accuracy) → 1. calibrate: run validation data through the model and record weight/activation ranges → 2. compute scales and zero-points, quantize to INT8 (~96-98% accuracy) → 3. deploy for ~4x faster inference.

Calibration strategies:
  • Min-Max (simple): scale = (max − min) / 255. Fast and straightforward, but sensitive to outliers; typically ~94-96% accuracy retained.
  • KL-Divergence: search for the clipping range that minimizes information loss. Better accuracy (~97-98%), slower to compute.
  • Entropy/Percentile: clip at e.g. the 99th percentile of the observed distribution. Best for LLMs (~97-99%), with balanced cost/accuracy.
PTQ: No retraining needed, but accuracy drop depends on calibration strategy
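The min-max calibration strategy amounts to a tiny observer that tracks activation ranges over calibration batches. A sketch (`MinMaxObserver` is an illustrative name; frameworks ship their own observer classes):

```python
import numpy as np

class MinMaxObserver:
    """Track the min/max of activations seen during calibration (min-max strategy)."""

    def __init__(self):
        self.lo, self.hi = np.inf, -np.inf

    def observe(self, batch: np.ndarray) -> None:
        # Expand the running range to cover this batch
        self.lo = min(self.lo, float(batch.min()))
        self.hi = max(self.hi, float(batch.max()))

    def qparams(self, n_levels: int = 255):
        """Asymmetric scale/zero-point targeting INT8 [-128, 127]."""
        scale = (self.hi - self.lo) / n_levels
        zero_point = int(round(-self.lo / scale)) - 128
        return scale, zero_point

obs = MinMaxObserver()
for batch in [np.array([-2.5, 0.3]), np.array([1.1, 2.5])]:
    obs.observe(batch)
scale, zp = obs.qparams()   # scale = 5/255 ≈ 0.0196; here zp = 0
```

Because a single outlier stretches `lo`/`hi`, this scheme wastes integer levels on rare values — exactly the weakness that the KL and percentile strategies address.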

Pros & Cons

  ✓ No retraining needed: quantization takes minutes
  ✓ Requires only a small calibration set, not the training pipeline
  ✗ Accuracy drop depends heavily on the calibration strategy
  ✗ Struggles with outlier-heavy activations (common in LLMs)

Quantization-Aware Training (QAT)

QAT simulates quantization during training, allowing the model to adapt to the reduced precision. The model learns to work with quantization instead of fighting it.

QAT Process

Quantization-Aware Training (QAT)
Fake quantization during training:

Forward pass:
  1. Compute the FP32 result: y = w × x
  2. Simulate quantization: y_q = dequant(quant(y))
  3. Use y_q for the loss computation

Backward pass (Straight-Through Estimator):
  1. Compute the gradient ∂L/∂y_q normally
  2. STE treats the quantizer as the identity, ∂L/∂y ≈ ∂L/∂y_q, so gradients flow through the FP32 values as if no rounding happened

QAT vs PTQ accuracy comparison:
  Model          FP32 baseline         PTQ (INT8)       QAT (INT8)
  ResNet-50      76.5%                 75.8% (−0.7%)    76.3% (−0.2%) ✓
  MobileNet-V2   72.0%                 70.2% (−1.8%)    71.5% (−0.5%) ✓
  BERT-large     93.2% (SQuAD v1.1)    91.1% (−2.1%)    92.9% (−0.3%) ✓
QAT requires 5-20 epochs of retraining but maintains 99%+ of baseline accuracy
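The fake-quantization forward pass and the STE backward pass can be sketched in NumPy (a conceptual sketch only — in PyTorch the fake-quant modules and autograd handle this automatically; both function names are illustrative):

```python
import numpy as np

def fake_quant(x: np.ndarray, scale: float, qmax: int = 127) -> np.ndarray:
    """Forward: quantize then immediately dequantize.
    The loss sees these rounded values, so the network learns
    weights that survive rounding."""
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def ste_grad(upstream: np.ndarray, x: np.ndarray,
             scale: float, qmax: int = 127) -> np.ndarray:
    """Backward with the Straight-Through Estimator: round() has zero
    gradient almost everywhere, so STE pretends it is the identity and
    passes gradients through unchanged -- except where the value was
    clipped, which gets zero gradient."""
    inside = np.abs(np.round(x / scale)) <= qmax
    return upstream * inside

x = np.array([1.0, 100.0])
y = fake_quant(x, scale=0.02)            # ≈ [1.0, 2.54]: 100.0 clips at 127 * 0.02
g = ste_grad(np.ones(2), x, scale=0.02)  # [1.0, 0.0]: the clipped value gets no gradient
```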

Pros & Cons

  ✓ Best accuracy: typically within ~0.5% of the FP32 baseline
  ✗ Requires 5-20 epochs of retraining and access to the full training pipeline

Python qat_example.py
import torch
import torchvision
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat, convert

# Load a pre-trained model
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.train()  # prepare_qat requires training mode

# Enable QAT (insert fake-quant observers)
model.qconfig = get_default_qat_qconfig("fbgemm")
model = prepare_qat(model)

# Fine-tune with QAT for 5 epochs
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)

for epoch in range(5):
    for images, labels in train_loader:
        output = model(images)
        loss = loss_fn(output, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Convert to a real INT8 model
model.eval()
quantized_model = convert(model)

# Save the quantized model for inference
example_input = torch.randn(1, 3, 224, 224)
torch.jit.save(torch.jit.trace(quantized_model, example_input), "model_int8.pt")

INT8 Tensor Cores & Hardware Acceleration

Modern GPUs (NVIDIA A100, H100) have dedicated INT8 Tensor Cores that can execute quantized operations in parallel.

INT8 Tensor Core Throughput vs FP32
Dense floating-point GEMM on a Tensor Core processes a tile of work in time T; the same tile in INT8 finishes in roughly a quarter of the time, because 8-bit operands pack four values into the register and memory space of one FP32 value.

Approximate peak Tensor Core throughput:
  A100: 312 TFLOPS (FP16) vs 624 TOPS (INT8), rising to 1248 TOPS with 2:4 sparsity
  H100: ~1000 TFLOPS (FP16) vs ~2000 TOPS (INT8), rising to ~4000 TOPS with 2:4 sparsity

Against plain FP32 CUDA cores (19.5 TFLOPS on A100), INT8 Tensor Cores deliver well over an order of magnitude more throughput.
INT8 inference achieves 4-8x higher throughput than FP32 on modern GPUs

Modern LLM Quantization Techniques

For large language models, quantization is more challenging because:

  • Activations contain extreme outliers concentrated in a few channels, which blow up per-tensor scales
  • The models are too large for full QAT: retraining billions of parameters is often impractical
  • Small per-token errors compound over long generated sequences

Advanced Techniques

Modern LLM Quantization Methods
Method            Bits   7B Size    Accuracy     Time/Cost
FP16 (baseline)   16     14 GB      100%         Baseline
INT8 (PTQ)         8      7 GB      95-97%       Minutes
INT8 (QAT)         8      7 GB      99-99.5%     12-48 hrs
GPTQ (4-bit)       4     ~3.5 GB    98-99.2%     ~1-4 GPU-hrs
AWQ (4-bit)        4     ~3.5 GB    98-99.5%     ~1-5 GPU-hrs ✓
SmoothQuant        8      7 GB      99-99.7%     30-60 mins ✓✓

Inference speed comparison (7B model, H100): FP16 ~15 tok/s → INT8 ~30 tok/s (2x) → GPTQ 4-bit ~40 tok/s (2.7x)
Modern quantization techniques balance speed, memory, and accuracy for LLM inference
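As a concrete example of these methods, the core SmoothQuant trick — migrating activation outliers into the weights with a per-channel scale so both become easy to quantize — fits in a few lines. A simplified sketch of the published per-channel formula, not the full method:

```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """SmoothQuant-style scale migration.

    Per input channel j:  s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    (X / s) @ (s * W) equals X @ W exactly, but the scaled activations
    have far milder outliers and quantize much better.
    """
    s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))
    return X / s, W * s[:, None]

# Channel 0 of the activations carries large outliers
X = np.array([[50.0, 0.2, -0.5],
              [-60.0, 0.1, 0.3]])
W = np.array([[0.5, -0.2],
              [1.0, 0.3],
              [-0.4, 0.8]])
Xs, Ws = smooth(X, W)
# The product is unchanged, but the outlier channel is tamed
```

The whole transformation is applied offline, which is why SmoothQuant costs minutes rather than GPU-days.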

Combining Pruning + Quantization: Extreme Compression

The real magic happens when you combine pruning and quantization:

Pruning + Quantization: Order-of-Magnitude Compression
Pipeline: FP32 dense (28 GB, 100% of parameters) → 50% pruning → FP32 sparse (14 GB of non-zero weights) → INT8 quantization → INT8 sparse (3.5 GB).

Detailed breakdown for a 7B-parameter LLM:

Original (FP32 dense):
  • 7 billion parameters × 4 bytes = 28 GB of weight storage (FP16 halves this to 14 GB)

After 50% pruning (FP32 sparse):
  • 3.5 billion non-zero weights × 4 bytes = 14 GB
  • Plus sparse index overhead (~1 GB) → ~15 GB total

After INT8 quantization (INT8 + 50% sparse):
  • 3.5 billion non-zero weights × 1 byte = 3.5 GB
  • Plus INT8 scale factors and indices (~0.5 GB) → ~4 GB final

Total compression: 28 GB → ~4 GB (~7-8x); dropping to 4-bit weights takes this to ~1.75 GB (~16x).
Combined pruning + quantization achieves 8-10x compression with 90-96% accuracy retention
Key Takeaway: Pruning + Quantization

A 7B parameter LLM can be compressed to 1.5-2 GB with pruning + INT8 quantization while retaining 95%+ accuracy. With 4-bit quantization + pruning, models fit on mobile devices!
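The storage arithmetic behind these numbers is simple enough to check directly (`model_size_gb` is an illustrative helper that ignores index and scale-factor overhead):

```python
def model_size_gb(n_params: float, bits: int, density: float = 1.0) -> float:
    """Approximate weight-storage size in GB.

    density: fraction of weights that survive pruning (1.0 = dense).
    """
    return n_params * density * bits / 8 / 1e9

# A 7B model at the stages discussed above:
fp32_dense = model_size_gb(7e9, 32)                # 28.0 GB
fp16_dense = model_size_gb(7e9, 16)                # 14.0 GB
int8_sparse = model_size_gb(7e9, 8, density=0.5)   # 3.5 GB
int4_sparse = model_size_gb(7e9, 4, density=0.5)   # 1.75 GB
```

Real deployments add roughly 0.5-1.5 GB of sparse indices and quantization scales on top of these figures.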

Summary: Choosing Your Compression Strategy

Scenario                   Recommended Approach                  Expected Speedup
Fast PTQ for inference     INT8 PTQ (calibrate on val data)      4x (+2x more with structured sparsity)
Maximum accuracy           INT8 QAT (5-20 epochs)                4x, with 99%+ accuracy retained
CNN inference (mobile)     Structured pruning 50% + INT8 PTQ     8-16x
LLM on edge (phone/RPi)    4-bit (GPTQ/AWQ) + pruning            8-10x smaller model
GPU cluster training       2:4 sparsity + FP16 mixed precision   2-3x throughput
Best Practice

Start with INT8 PTQ (it takes minutes and is simple). If accuracy drops by more than 1%, move to QAT. Reserve 4-bit methods (GPTQ/AWQ) for LLMs, where full QAT is impractical.