Single Instruction, Multiple Data
A comprehensive, from-scratch guide to SIMD — the hardware and software technique that makes modern processors process 16, 32, or 64 values in the same time it takes to process one. Covers x86 SSE/AVX/AVX-512, ARM NEON/SVE, intrinsics, memory alignment, masking, shuffle operations, vectorization, and real-world performance.
- 01 Fundamentals & Mental Model
- 02 Hardware Implementation
- 03 ISA Extensions History
- 04 Intel Intrinsics Deep Dive
- 05 Memory, Alignment & Loads
- 06 Masking & Predicates
- 07 Shuffle, Permute & Blend
- 08 Vectorized Algorithms
- 09 Compiler Auto-Vectorization
- 10 Pitfalls & Anti-Patterns
- 11 ARM NEON & SVE
- 12 Performance & Profiling
Fundamentals & Mental Model
Before writing a single intrinsic call, you need to deeply understand what problem SIMD solves, why it exists, and what the hardware is actually doing when you use it.
Flynn's Taxonomy — where SIMD lives
In 1966, Michael Flynn classified computing architectures along two axes: how many instruction streams exist simultaneously, and how many data streams they operate on. This gives four categories:
| Category | Instructions | Data Streams | Example |
|---|---|---|---|
| SISD | 1 | 1 | Classic scalar CPU — one add per instruction |
| SIMD | 1 | Many | SSE/AVX — one instruction adds 8 floats at once |
| MISD | Many | 1 | Theoretical (pipeline stages, some fault-tolerant systems) |
| MIMD | Many | Many | Multi-core CPU, GPU compute shaders |
SIMD sits in the sweet spot of hardware cost vs. throughput gain. Unlike MIMD (which requires full duplicated execution units, caches, instruction decoders, and branch predictors), SIMD reuses a single instruction decoder, a single program counter, and a single fetch/decode pipeline — only the execution units and register file are widened. You pay roughly linear hardware cost for linear throughput gain.
The scalar bottleneck — why SIMD was invented
By the early 1990s, CPU clock speeds were rising, but multimedia workloads — audio DSP, video codecs, image processing — were hitting a wall. These workloads have a distinct pattern: the same arithmetic operation is applied to thousands of independent values stored contiguously in memory. For example, adjusting the brightness of an image means adding a constant to every single pixel value. On scalar hardware:
processing eight pixels one at a time takes roughly 24 instructions — a load, an add, and a store per pixel. The SIMD version does the same work in 3 instructions — one wide load, one wide add, one wide store — an 8× throughput gain.
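To make the contrast concrete, here is a minimal sketch of both versions for 8-bit pixels, using SSE2 intrinsics that are introduced properly in later chapters (function names are illustrative):

#include <emmintrin.h>   // SSE2
#include <stddef.h>
#include <stdint.h>

/* Scalar: one pixel per iteration — a load, an add/clamp, a store each time. */
void brighten_scalar(uint8_t *px, size_t n, uint8_t delta) {
    for (size_t i = 0; i < n; i++) {
        int v = px[i] + delta;
        px[i] = (uint8_t)(v > 255 ? 255 : v);          // clamp to 255
    }
}

/* SIMD: 16 pixels per iteration with a saturating add. */
void brighten_sse2(uint8_t *px, size_t n, uint8_t delta) {
    __m128i d = _mm_set1_epi8((char)delta);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(px + i));  // load 16 pixels
        v = _mm_adds_epu8(v, d);                                  // saturating add
        _mm_storeu_si128((__m128i *)(px + i), v);                 // store 16 pixels
    }
    for (; i < n; i++) {                                          // scalar tail
        int v = px[i] + delta;
        px[i] = (uint8_t)(v > 255 ? 255 : v);
    }
}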
Key terminology you'll see everywhere
Vector register — a wide CPU register (64–512 bits) that holds multiple data elements simultaneously.
Lane / element — one logical slot within a vector register. A 256-bit register holding 32-bit floats has 8 lanes.
Width — the total bit-size of a vector register (128, 256, 512 bits).
Element type — the data type packed into each lane: int8, int16, int32, int64, float32 (single precision), float64 (double).
Vectorization — the process (manual or automatic) of transforming scalar loops into SIMD operations.
ISA extension — an optional addition to a CPU's instruction set. SSE, AVX, and AVX-512 are ISA extensions.
Intrinsic — a C/C++ function that maps one-to-one to a single assembly instruction. Used to write SIMD code in C without inline assembly.
Throughput vs latency — latency is how many cycles pass before an instruction's result is available; throughput is how many instances of that instruction can complete per cycle (reciprocal throughput = minimum cycles between issuing two independent instances of the same instruction).
CPUID — a CPU instruction that reports which ISA extensions the current processor supports at runtime.
The four operations every SIMD system must provide
Regardless of architecture — x86, ARM, RISC-V V extension, PowerPC AltiVec — every SIMD ISA must provide:
1. Arithmetic — element-wise add, subtract, multiply, divide, fused multiply-add (FMA), min, max, abs, negate. These are the workhorses.
2. Memory — loading a chunk of memory into a vector register (load), storing a register back to memory (store). Plus variants for unaligned access, non-temporal (streaming) stores that bypass the cache, and gather/scatter for non-contiguous access.
3. Shuffle/permute — rearranging elements within or between registers. This is the most powerful and most complex part of SIMD. Many algorithms that seem non-vectorizable become tractable once you understand shuffles.
4. Comparison and masking — comparing elements and producing a mask (a bitmask or predicate register) that controls which lanes participate in subsequent operations. Essential for handling boundary conditions, conditional code, and variable-length data.
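A minimal sketch touching each of the four families in one short function, using AVX2 intrinsics covered in detail in chapter 04 (the function name and constants are illustrative):

#include <immintrin.h>

void four_families(const float *in, float *out) {
    __m256 a    = _mm256_loadu_ps(in);                        // 2. memory: load
    __m256 b    = _mm256_set1_ps(2.0f);
    __m256 sum  = _mm256_add_ps(a, b);                        // 1. arithmetic
    __m256 rev  = _mm256_permutevar8x32_ps(sum,
                      _mm256_set_epi32(0,1,2,3,4,5,6,7));     // 3. shuffle/permute: reverse lanes
    __m256 mask = _mm256_cmp_ps(rev, b, _CMP_GT_OQ);          // 4. comparison → per-lane mask
    __m256 sel  = _mm256_blendv_ps(b, rev, mask);             //    keep rev where rev > 2
    _mm256_storeu_ps(out, sel);                               // 2. memory: store
}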
Hardware Implementation
Understanding the physical silicon helps you reason about why certain operations are fast, why others have latency penalties, and what the CPU is doing under the hood.
The vector execution unit
A modern out-of-order CPU like Intel's Golden Cove or AMD's Zen 4 contains multiple independent execution ports. Each port has attached execution units. Scalar integer operations typically have 4–6 ports feeding several ALUs. Vector operations have dedicated ports — typically 2–3 on modern designs — each feeding a full-width vector ALU.
The key insight: the hardware cost of a wide vector add is nearly identical to a narrow one. The transistor overhead of widening an adder from 128 to 256 bits is roughly proportional to the bit width, but the instruction fetch/decode overhead, the register renaming overhead, the scheduler overhead — all the expensive fixed costs — remain constant. This is why SIMD is so efficient: you amortize those fixed costs over more work.
Register files — what physically exists on die
When you use SSE, you use registers named xmm0–xmm15 (16 registers on x86-64, each 128 bits). When you use AVX/AVX2, you use ymm0–ymm15 (256 bits each). xmm registers are the lower 128 bits of the corresponding ymm register — they are the same physical register, just addressed at different widths.
AVX-512 extends this to zmm0–zmm31 (512 bits each, 32 registers total), doubling both the register count and the width. This additional register capacity reduces register pressure significantly — one of the biggest benefits of AVX-512 besides raw width.
The VEX/EVEX encoding prefix — why it matters
Legacy SSE instructions use the original x86 encoding: no prefix, or a 66/F2/F3 byte prefix. These instructions write only 128 bits and preserve the upper 128 bits of the ymm register. This sounds harmless but causes a major performance hazard: the CPU must preserve the upper bits, which means a dependency on the previous writer of that register. This is called the SSE/AVX transition penalty.
The VEX prefix (introduced with AVX in 2011) solves this. VEX-encoded instructions always zero-extend to the full register width — writing to xmm0 via a VEX instruction zeroes the upper 128 bits of ymm0/zmm0. No false dependency. This is why you should always use the VEX-encoded v variants: vaddps instead of addps, vmovups instead of movups.
AVX-512 goes further with EVEX: a 4-byte encoding that adds mask registers, broadcasting, and additional opcodes — all discussed in later chapters.
Frequency scaling — the SIMD penalty that most guides miss
On Intel server cores from Skylake-SP through Ice Lake (and their HEDT derivatives), executing 512-bit operations moves the CPU into a lower-frequency state called the AVX-512 license level, because the 512-bit units draw more power and generate more heat. The frequency reduction can be severe — sometimes 300–700 MHz on Skylake-SP. On many workloads, this makes AVX-512 slower than AVX2 unless the computation is highly arithmetic-intensive.
Ice Lake client and Rocket Lake largely eliminated this penalty, Alder Lake (12th gen) dropped AVX-512 from its performance cores entirely, and AMD Zen 4 runs AVX-512 on double-pumped 256-bit datapaths with no frequency penalty. Always benchmark on the target hardware; never assume wider = faster.
The theoretical peak throughput of SIMD (e.g., "8× faster for 256-bit float32") is almost never achieved in practice. Real speedup depends on: memory bandwidth saturation, instruction throughput vs latency, pipeline dependencies, and branch mispredictions around loop edges. Realistic speedups of 3–6× for well-vectorized code are common; anything above 6× is exceptional.
ISA Extensions History
x86 SIMD has evolved through a series of backward-compatible extensions. Each one built on the previous, adding width, new data types, or new operations.
- MMX (1997) — 64-bit packed integer operations aliased onto the x87 floating-point registers; code had to execute EMMS to switch back to FP. Now obsolete.
- SSE (1999) — 128-bit xmm registers and packed single-precision arithmetic: MOVAPS, ADDPS, MULPS, etc. Also introduced PREFETCH instructions for cache hints. The foundation everything else is built on.
- SSE2 (2000) — packed double-precision and full-width integer operations on xmm registers; part of the x86-64 baseline.
- SSE3 (2004) — horizontal adds/subtracts (HADD, HSUB) and LDDQU for unaligned loads. Also MOVSHDUP/MOVSLDUP for complex-number arithmetic.
- SSSE3 (2006) — PSHUFB, a byte-granularity shuffle controlled by a vector index. Unlocks enormous algorithmic flexibility. Also added PMULHRSW, PHADDW, and sign operations.
- SSE4.1 / SSE4.2 (2007–2008) — SSE4.1 added DPPS (dot product), BLENDPS (blend/select), PTEST, PMAXSD, PMULLD (32×32→32 multiply). SSE4.2 added string compare (PCMPISTRI) and CRC32.
- AVX (2011) — 256-bit ymm registers, 256-bit float arithmetic, and the non-destructive three-operand VEX encoding.
- AVX2 (2013) — 256-bit integer operations, gather (VGATHERDPS), broadcast (VPBROADCASTD), variable shifts (VPSLLVD), and permutes across the full 256-bit register. The current practical sweet spot.
- FMA3 (2013) — fused multiply-add: a*b+c in a single instruction with only one rounding step. Critical for throughput-bound code. VFMADD213PS, VFMADD231PS, etc. — the number encodes the order of operands.
- AVX-512 (2016+) — 512-bit zmm registers, 32 registers, opmask registers, and the EVEX encoding (covered throughout this guide).
- AVX-VNNI (2021) — VPDPBUSD computes 4-element dot products of int8 values, accumulating into int32. Critical for deep-learning inference on CPUs. Available without full AVX-512.
Runtime CPU feature detection
You cannot assume any extension beyond SSE2 is available. Code compiled with -mavx2 and run on a Sandy Bridge CPU will crash with an illegal instruction fault. The correct approach is runtime CPUID detection:
#include <stdint.h>
#include <stdbool.h>

static inline void cpuid(uint32_t leaf, uint32_t subleaf,
                         uint32_t *eax, uint32_t *ebx,
                         uint32_t *ecx, uint32_t *edx)
{
    __asm__ volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "0"(leaf), "2"(subleaf));
}

typedef struct {
    bool sse2, sse3, ssse3, sse41, sse42;
    bool avx, avx2, fma;
    bool avx512f, avx512bw, avx512vl, avx512vnni;
} CpuFeatures;

CpuFeatures detect_cpu_features(void)
{
    CpuFeatures f = {0};
    uint32_t eax, ebx, ecx, edx;

    /* Leaf 1: basic features */
    cpuid(1, 0, &eax, &ebx, &ecx, &edx);
    f.sse2  = (edx >> 26) & 1;   /* EDX bit 26 */
    f.sse3  = (ecx >>  0) & 1;   /* ECX bit 0  */
    f.ssse3 = (ecx >>  9) & 1;   /* ECX bit 9  */
    f.sse41 = (ecx >> 19) & 1;
    f.sse42 = (ecx >> 20) & 1;
    f.avx   = (ecx >> 28) & 1;
    f.fma   = (ecx >> 12) & 1;

    /* Leaf 7: extended features */
    cpuid(7, 0, &eax, &ebx, &ecx, &edx);
    f.avx2       = (ebx >>  5) & 1;
    f.avx512f    = (ebx >> 16) & 1;
    f.avx512bw   = (ebx >> 30) & 1;
    f.avx512vl   = (ebx >> 31) & 1;
    f.avx512vnni = (ecx >> 11) & 1;

    return f;
}
For most production code targeting general x86-64 deployments: SSE2 is guaranteed (part of x86-64 ABI). SSSE3 is safe for code targeting hardware from 2008+. AVX2+FMA is safe for code targeting hardware from 2015+. AVX-512 requires explicit deployment targeting (data centers, specific HPC hardware).
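A common pattern built on top of detect_cpu_features() is to choose an implementation once, at startup, through a function pointer. A minimal sketch — the saxpy_* variants are hypothetical and would each live in a translation unit compiled with the matching -m flags:

typedef void (*saxpy_fn)(float *y, const float *x, float a, int n);

void saxpy_scalar(float *y, const float *x, float a, int n);
void saxpy_avx2(float *y, const float *x, float a, int n);     /* built with -mavx2 -mfma   */
void saxpy_avx512(float *y, const float *x, float a, int n);   /* built with -mavx512f      */

saxpy_fn select_saxpy(void) {
    CpuFeatures f = detect_cpu_features();
    if (f.avx512f)        return saxpy_avx512;
    if (f.avx2 && f.fma)  return saxpy_avx2;
    return saxpy_scalar;                      /* SSE2 baseline always works on x86-64 */
}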
Intel Intrinsics Deep Dive
Intrinsics are the primary API for writing SIMD code in C/C++. They look like function calls but compile to single instructions. Understanding the naming convention is essential.
The naming convention — decoded
Every Intel SIMD intrinsic follows a systematic naming scheme. Once you internalize it, you can read any intrinsic name and know exactly what it does.
- Width prefix: _mm_ = 128-bit, _mm256_ = 256-bit, _mm512_ = 512-bit
- Operation: add, sub, mul, div, load, store, shuffle, cmp, …
- Element-type suffix: _ps = packed f32, _pd = packed f64, _epi32 = int32, _epu16 = uint16
Full example: _mm256_add_ps(a, b) means: 256-bit wide, add operation, packed single-precision floats. It compiles to a single VADDPS ymm, ymm, ymm instruction.
Element type suffixes — complete reference
| Suffix | Meaning | Lanes in 256b | Lanes in 128b |
|---|---|---|---|
_ps | Packed single-precision float (float32) | 8 | 4 |
_pd | Packed double-precision float (float64) | 4 | 2 |
_ss | Scalar single — operates on lowest lane only | — | 1 |
_sd | Scalar double — lowest lane only | — | 1 |
_epi8 | Packed signed 8-bit integers | 32 | 16 |
_epi16 | Packed signed 16-bit integers | 16 | 8 |
_epi32 | Packed signed 32-bit integers | 8 | 4 |
_epi64 | Packed signed 64-bit integers | 4 | 2 |
_epu8 | Packed unsigned 8-bit integers | 32 | 16 |
_epu16 | Packed unsigned 16-bit integers | 16 | 8 |
_epu32 | Packed unsigned 32-bit integers | 8 | 4 |
_si128 | Generic 128-bit integer (untyped) | — | 1 |
_si256 | Generic 256-bit integer (untyped) | 1 | — |
The type system — __m128, __m256, __m512
Intel intrinsics use opaque vector types. You don't access lanes directly — you use intrinsics to manipulate them. The types:
__m128 — 128-bit vector of float32 (used with _mm_*_ps)
__m128d — 128-bit vector of float64 (_mm_*_pd)
__m128i — 128-bit vector, any integer type (_mm_*_epiN)
__m256, __m256d, __m256i — 256-bit equivalents
__m512, __m512d, __m512i — 512-bit equivalents
These are not pointers. They are value types stored in registers. Passing them by value in function arguments is fine — the ABI passes vector types in xmm/ymm/zmm registers directly (System V AMD64 ABI passes up to 8 __m256 by value in ymm registers).
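Because the types are opaque, the portable way to read an individual lane is to store the vector to a temporary array first. A small sketch (the helper name get_lane is illustrative):

#include <immintrin.h>

// Read lane i (0..7) of a __m256. There is no portable direct lane indexing.
static inline float get_lane(__m256 v, int i) {
    float tmp[8];
    _mm256_storeu_ps(tmp, v);   // spill all 8 lanes to memory
    return tmp[i];
}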
Essential operations with code examples
#include <immintrin.h>   // all Intel SIMD intrinsics

/* ── Set / Initialize ── */
__m256 zeros  = _mm256_setzero_ps();             // all 8 lanes = 0.0f
__m256 ones   = _mm256_set1_ps(1.0f);            // broadcast scalar to all lanes
__m256 custom = _mm256_set_ps                    // set individual values
    (7.f, 6.f, 5.f, 4.f, 3.f, 2.f, 1.f, 0.f);
// CAUTION: _mm256_set_ps args are in REVERSE lane order (lane 7 first)

/* ── Arithmetic ── */
__m256 a, b, c;
c = _mm256_add_ps(a, b);       // c[i] = a[i] + b[i]
c = _mm256_mul_ps(a, b);       // c[i] = a[i] * b[i]
c = _mm256_div_ps(a, b);       // c[i] = a[i] / b[i]   (slow, ~11 cycles)
c = _mm256_sqrt_ps(a);         // c[i] = sqrt(a[i])    (slow, ~13 cycles)
c = _mm256_rcp_ps(a);          // c[i] ≈ 1/a[i]        (12-bit approx, fast)
c = _mm256_rsqrt_ps(a);        // c[i] ≈ 1/sqrt(a[i])  (12-bit approx)

/* ── Fused Multiply-Add (FMA) — requires FMA3 ── */
// a*b + c in one instruction; the compiler picks the 132/213/231 form (see below)
c = _mm256_fmadd_ps(a, b, c);    // c[i] =  a[i]*b[i] + c[i]
c = _mm256_fmsub_ps(a, b, c);    // c[i] =  a[i]*b[i] - c[i]
c = _mm256_fnmadd_ps(a, b, c);   // c[i] = -(a[i]*b[i]) + c[i]

/* ── Integer arithmetic (256-bit) ── */
__m256i ia, ib, ic;
ic = _mm256_add_epi32(ia, ib);     // 8×int32 add (wraps on overflow)
ic = _mm256_adds_epi16(ia, ib);    // 16×int16 saturating add (clamps)
ic = _mm256_mullo_epi32(ia, ib);   // low 32 bits of 32×32 multiply
ic = _mm256_mulhi_epi16(ia, ib);   // high 16 bits of 16×16 multiply

/* ── Comparison — returns mask vector ── */
__m256 mask = _mm256_cmp_ps(a, b, _CMP_LT_OQ);   // per-element: a[i] < b[i]
// mask[i] = 0xFFFFFFFF if true, 0x00000000 if false
c = _mm256_blendv_ps(a, b, mask);                // c[i] = mask[i] ? b[i] : a[i]

/* ── Bitwise ── */
c = _mm256_and_ps(a, b);       // bitwise AND (reinterprets floats as bits)
c = _mm256_or_ps(a, b);
c = _mm256_xor_ps(a, b);
c = _mm256_andnot_ps(a, b);    // (~a) & b

/* ── Horizontal ops (cross-lane, expensive) ── */
c = _mm256_hadd_ps(a, b);
// [a0+a1, a2+a3, b0+b1, b2+b3, a4+a5, a6+a7, b4+b5, b6+b7]
// HADD is often slow — prefer a sequence of vertical ops + shuffles
The FMA naming confusion — 132, 213, 231
FMA3 has three variants of each operation, differing only in which register is used as the addend/destination. The number encodes the operand order for the a*b+c operation:
VFMADD132PS dst, src1, src2 → dst = dst * src2 + src1
VFMADD213PS dst, src1, src2 → dst = src1 * dst + src2
VFMADD231PS dst, src1, src2 → dst = src1 * src2 + dst
The number (1, 2, 3) indicates which operand position is used as the multiplicand (1), multiplier (2), and addend (3). In practice, the compiler picks the variant that minimizes register moves. When writing intrinsics by hand, use _mm256_fmadd_ps which takes (a, b, c) and computes a*b+c.
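As a quick illustration of why FMA matters beyond dot products, here is a sketch of evaluating a cubic polynomial with Horner's rule — three dependent FMAs, each with a single rounding step (the helper name and coefficients are illustrative):

#include <immintrin.h>

// p(x) = c3*x^3 + c2*x^2 + c1*x + c0, evaluated for 8 x-values at once.
static inline __m256 poly3_ps(__m256 x, float c3, float c2, float c1, float c0) {
    __m256 r = _mm256_set1_ps(c3);
    r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c2));   // r = c3*x + c2
    r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c1));   // r = (c3*x + c2)*x + c1
    r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c0));   // r = (...)*x + c0
    return r;
}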
Memory, Alignment & Load/Store
Memory operations are where most SIMD performance is made or lost. Understanding alignment, streaming stores, and gather/scatter is essential.
Alignment — why it matters at the hardware level
A memory access is aligned when its address is a multiple of its size. A 256-bit (32-byte) load is aligned when its address is divisible by 32. The reason alignment matters: SIMD loads are implemented in hardware as single cache-line transfers. A cache line on all modern x86 processors is 64 bytes. An aligned 32-byte load always falls within a single cache line. An unaligned 32-byte load may straddle two cache lines, requiring two separate cache-line reads and a hardware merge — roughly a 1–3 cycle penalty.
On very old hardware (pre-Nehalem), misaligned SIMD loads could fault. On modern CPUs (Sandy Bridge+), misaligned loads are handled in hardware with only a small penalty. Still, always align your data when you control allocation.
#include <immintrin.h>
#include <stdlib.h>

/* ── Aligned allocation ── */
float *data;
posix_memalign((void**)&data, 32, n * sizeof(float));   // Linux
// or use C11:  aligned_alloc(32, n * sizeof(float))
// or C++17:    std::aligned_alloc

/* Hint the compiler about alignment: */
data = __builtin_assume_aligned(data, 32);   // GCC/Clang
// or MSVC: __assume(((uintptr_t)data & 31) == 0);

/* ── Load variants ── */
__m256 v;

// Aligned load — address MUST be 32-byte aligned (undefined behavior if not)
v = _mm256_load_ps(data);         // VMOVAPS — fastest

// Unaligned load — works at any address, tiny penalty on misaligned
v = _mm256_loadu_ps(data + 1);    // VMOVUPS — always safe

// Broadcast a single float to all 8 lanes
v = _mm256_broadcast_ss(data);    // VBROADCASTSS — [d[0],d[0],...,d[0]]

__m128 lane = _mm_load_ps(data);
__m256 rep  = _mm256_broadcast_ps(&lane);   // [xmm,xmm] — 128b broadcast to 256b

/* ── Store variants ── */
_mm256_store_ps(data, v);         // aligned store
_mm256_storeu_ps(data + 1, v);    // unaligned store

/* Non-temporal streaming store — bypasses cache, for write-once data */
_mm256_stream_ps(data, v);        // VMOVNTPS — MUST be aligned, weakly ordered
_mm_sfence();                     // required fence after streaming stores
// Use _mm256_stream_ps when writing large buffers you won't read back soon.
// Eliminates cache pollution: cache lines are not loaded before being overwritten.

/* ── Masked load/store (AVX) — loads only lanes where mask bit is set ── */
__m256i mask = _mm256_set_epi32(-1,-1,-1,-1, 0,0,0,0);   // upper 4 lanes active
v = _mm256_maskload_ps(data, mask);     // lanes with mask sign bit = 1 are loaded
_mm256_maskstore_ps(data, mask, v);     // stores only active lanes
Gather and scatter — non-contiguous memory access
Gather (introduced in AVX2) loads elements from arbitrary memory addresses given a vector of indices. This is essential for indirect memory access patterns — lookup tables, sparse arrays, permutations. However, gathers are significantly slower than contiguous loads because each lane may touch a different cache line.
Scatter (AVX-512) is the reverse: storing elements at arbitrary addresses given a vector of indices.
/* ── AVX2 Gather ── */
float base[] = { 10.f, 20.f, 30.f, 40.f, 50.f, 60.f, 70.f, 80.f };
__m256i indices = _mm256_set_epi32(7,5,3,1, 6,4,2,0);           // reversed lane order
__m256  all_ones = _mm256_castsi256_ps(_mm256_set1_epi32(-1));  // mask: all lanes active
__m256  src      = _mm256_setzero_ps();                         // fallback for inactive lanes

__m256 gathered = _mm256_mask_i32gather_ps(
    src,        // passthrough value for masked-off lanes
    base,       // base pointer
    indices,    // vector of int32 indices
    all_ones,   // mask (sign bit = active lane) — note: a float vector, not __m256i
    4           // scale: byte offset = index * scale. Valid values: 1, 2, 4, 8
);
// gathered = [base[0], base[2], base[4], base[6], base[1], base[3], base[5], base[7]]

/* ── AVX-512 Scatter ── */
__m512i zidx = _mm512_set_epi32(15,13,11,9, 7,5,3,1, 14,12,10,8, 6,4,2,0);
_mm512_i32scatter_ps(output, zidx, data_vec, 4);   // writes to scattered addresses
// Scatter latency is high (~20+ cycles) due to potential address conflicts.
// Watch out for duplicate scatter indices: stores happen in element order,
// so the highest overlapping lane wins — rarely what you want.
AVX2 gather is often slower than a scalar loop for small arrays that fit in L1 cache, because the gather instruction has high latency (≈20 cycles) and limited throughput. Only use gather when (1) the array doesn't fit in L1/L2, (2) the access pattern is highly irregular, AND (3) you can execute several independent gathers in parallel to hide latency. Benchmark before committing to gather.
Masking & Predicates
Masking is how SIMD handles conditional logic. Instead of branching, you compute results for all lanes and use a mask to select which results are kept.
Blend masking (SSE4.1 / AVX)
Before AVX-512, masking was done with blend operations. You compute the comparison result as a vector of 0s and 0xFFFFFF…s, then use blendv to select between two vectors element-wise. This is equivalent to a vectorized ternary operator.
/* ── Pattern: vectorized clamp (min/max) ── */
static inline __m256 clamp_ps(__m256 x, __m256 lo, __m256 hi) {
    x = _mm256_max_ps(x, lo);   // x = max(x, lo)
    x = _mm256_min_ps(x, hi);   // x = min(x, hi)
    return x;
}

/* ── Pattern: vectorized abs for float ── */
static inline __m256 abs_ps(__m256 x) {
    // Float sign is the MSB. Clear it by ANDing with 0x7FFFFFFF.
    __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x7FFFFFFF));
    return _mm256_and_ps(x, mask);   // strip sign bit
}

/* ── Pattern: vectorized negate ── */
static inline __m256 neg_ps(__m256 x) {
    __m256 sign_mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x80000000));
    return _mm256_xor_ps(x, sign_mask);   // flip sign bit
}

/* ── Pattern: conditional with two outcomes ── */
// Scalar: result[i] = (x[i] > 0) ? a[i] : b[i]
__m256 conditional_select(__m256 x, __m256 a, __m256 b) {
    __m256 mask = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OQ);
    return _mm256_blendv_ps(b, a, mask);   // a where mask=1, b where mask=0
}
AVX-512 opmask registers — a paradigm shift
AVX-512 introduced opmask registers: k0–k7. These are compact 8-bit to 64-bit integer registers where each bit corresponds to one lane of the vector operation. Instead of wasting a full vector register for masking, you now use a tiny dedicated mask register.
The payoff: almost every AVX-512 instruction has a built-in {k1} masking suffix. You don't need separate blend instructions — masking is embedded directly in the arithmetic instruction. You can also use zero-masking ({k1}{z}) where inactive lanes are zeroed, or merge-masking where they retain their previous value.
/* ── AVX-512 opmask examples ── */
__m512 a, b, c;
__mmask16 k;   // 16-bit mask for 16×f32 (512-bit register)

/* Compare to produce a mask (not a vector of floats!) */
k = _mm512_cmp_ps_mask(a, b, _CMP_LT_OQ);
// k is now a 16-bit integer where bit i = 1 if a[i] < b[i]

/* Masked add: only adds where mask bit = 1, merges src otherwise */
c = _mm512_mask_add_ps(
    c,      // src: pass-through for inactive lanes (merge masking)
    k,      // mask
    a, b    // operands
);

/* Zero-masked add: inactive lanes are set to 0 */
c = _mm512_maskz_add_ps(k, a, b);

/* Mask arithmetic (scalar bitwise ops on mask registers) */
__mmask16 k2     = _mm512_cmp_ps_mask(a, _mm512_setzero_ps(), _CMP_GT_OQ);
__mmask16 both   = k & k2;              // lanes where a[i] < b[i] AND a[i] > 0
__mmask16 either = k | k2;              // lanes where a[i] < b[i] OR  a[i] > 0
int n_active     = _mm_popcnt_u32(k);   // count active lanes

/* Convert mask to integer (scalar) and back */
uint32_t scalar_mask  = (uint32_t)k;
__mmask16 from_scalar = (__mmask16)scalar_mask;

/* Masked store: only write active lanes to memory (safe for loop tails!) */
_mm512_mask_storeu_ps(ptr, k, data);   // writes only where k bit = 1
One of the most powerful uses of AVX-512 masking is clean loop tail handling. Instead of scalar fallback code for the last <16 elements, create a tail mask: __mmask16 tail = (1u << remaining) - 1, then use masked load/compute/store with this mask. This eliminates all scalar fallback code and is the recommended AVX-512 idiom.
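A minimal sketch of that idiom for an element-wise add (the function name is illustrative; assumes AVX-512F):

#include <immintrin.h>
#include <stddef.h>

void add_arrays_avx512(float *y, const float *x, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {                        // full 16-lane iterations
        __m512 vy = _mm512_loadu_ps(y + i);
        __m512 vx = _mm512_loadu_ps(x + i);
        _mm512_storeu_ps(y + i, _mm512_add_ps(vy, vx));
    }
    size_t rem = n - i;                                   // 0..15 leftover elements
    if (rem) {
        __mmask16 tail = (__mmask16)((1u << rem) - 1);    // low 'rem' bits set
        __m512 vy = _mm512_maskz_loadu_ps(tail, y + i);   // inactive lanes read as 0
        __m512 vx = _mm512_maskz_loadu_ps(tail, x + i);
        _mm512_mask_storeu_ps(y + i, tail,                // write active lanes only
                              _mm512_add_ps(vy, vx));
    }
}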
Shuffle, Permute & Blend
Shuffles are the most powerful — and most confusing — part of SIMD. They let you rearrange data within and between registers, enabling algorithms that seem inherently sequential.
The shuffle taxonomy
SIMD shuffle instructions form a complex family. Understanding the distinctions is critical for choosing the right instruction and understanding performance:
In-lane vs. cross-lane — Some shuffle instructions only rearrange elements within a 128-bit lane of a 256-bit register. The 256-bit register is treated as two independent 128-bit halves. Cross-lane operations can move elements between the two halves but are typically slower.
Immediate vs. variable index — Some shuffles use a compile-time immediate (fixed at compile time): _mm_shuffle_ps(a, b, 0b11001001). Others use a runtime vector index: _mm256_permutevar8x32_ps(a, idx). Variable shuffles are more flexible but may have higher latency.
Within-register vs. two-register — Some shuffles pick elements from a single source. Others interleave or blend from two sources.
The 128-bit lane boundary in AVX2
A fundamental quirk of AVX2: most operations work within the 128-bit lane boundary. A 256-bit register is effectively treated as two independent 128-bit registers for most shuffle operations. VPSHUFB (byte shuffle) operates within each 128-bit lane independently, for example. To move data across the 128-bit boundary, you need VPERM2F128, VPERMPD, or VPERMPS.
/* ── PSHUFD: rearrange 32-bit integers (in-lane, immediate) ── */
// Immediate format: one 2-bit source selector per output lane (8-bit immediate)
// _MM_SHUFFLE(z, y, x, w) macro builds the immediate
__m128i a = _mm_set_epi32(3, 2, 1, 0);
__m128i r = _mm_shuffle_epi32(a, _MM_SHUFFLE(0,1,2,3));
// r lanes = [3,2,1,0] — reversed

/* ── SHUFPS: blend elements from two 128-bit float vectors ── */
__m128 fa = _mm_set_ps(3.f, 2.f, 1.f, 0.f);
__m128 fb = _mm_set_ps(7.f, 6.f, 5.f, 4.f);
__m128 fr = _mm_shuffle_ps(fa, fb, _MM_SHUFFLE(3,2,1,0));
// lower 2 lanes from fa, upper 2 from fb

/* ── PSHUFB (SSSE3): byte-granularity shuffle, in-lane, variable ── */
// Index byte 0..15: which source byte goes to each destination byte
// If index bit 7 = 1: output byte is zeroed
__m128i data = _mm_set_epi8(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
__m128i ctrl = _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
__m128i shuf = _mm_shuffle_epi8(data, ctrl);   // byte-reverse

/* ── AVX2 VPERMD: fully variable 32-bit element permute across 256b ── */
__m256i src = _mm256_set_epi32(7,6,5,4,3,2,1,0);
__m256i idx = _mm256_set_epi32(0,2,4,6,1,3,5,7);   // pick any src lane
__m256i out = _mm256_permutevar8x32_epi32(src, idx);
// Crosses the 128-bit lane boundary! Indices 0..7 select from any lane in src.
// Very useful — this is the "full permute" x86 lacked until Haswell's AVX2.

/* ── Unpack / interleave ── */
__m128i b  = _mm_set_epi32(7, 6, 5, 4);
__m128i lo = _mm_unpacklo_epi32(a, b);   // [a0,b0,a1,b1]
__m128i hi = _mm_unpackhi_epi32(a, b);   // [a2,b2,a3,b3]
// Unpack is the fundamental building block for transposing data (AoS ↔ SoA)

/* ── Blend (immediate) ── */
// BLENDPS: pick each float32 from a or b based on immediate bit
__m128 blended = _mm_blend_ps(fa, fb, 0b1010);
// bit 0=0: take fa[0], bit 1=1: take fb[1], etc.
// [fa[0], fb[1], fa[2], fb[3]]
AoS vs SoA — the layout problem that shuffles solve
One of the most important applications of SIMD shuffles is data layout transformation. Most code naturally uses Array of Structures (AoS) layout: [{x,y,z}, {x,y,z}, {x,y,z}, ...]. But SIMD prefers Structure of Arrays (SoA) layout: {[x,x,x,...], [y,y,y,...], [z,z,z,...]} where each component is contiguous in memory.
When your data is AoS (often unavoidable due to cache locality needs in scalar code), you must transpose it into SoA before SIMD processing and transpose back afterward. A 4×4 float transpose using unpack and blend operations is a classic example:
// Input: 4 rows of 4 floats (row-major)
// Output: transposed in-place using xmm registers
static inline void transpose4x4(__m128 *r0, __m128 *r1, __m128 *r2, __m128 *r3) {
    // Step 1: interleave pairs of rows
    __m128 t0 = _mm_unpacklo_ps(*r0, *r1);   // [r0[0],r1[0],r0[1],r1[1]]
    __m128 t1 = _mm_unpackhi_ps(*r0, *r1);   // [r0[2],r1[2],r0[3],r1[3]]
    __m128 t2 = _mm_unpacklo_ps(*r2, *r3);   // [r2[0],r3[0],r2[1],r3[1]]
    __m128 t3 = _mm_unpackhi_ps(*r2, *r3);   // [r2[2],r3[2],r2[3],r3[3]]

    // Step 2: interleave pairs of pairs
    *r0 = _mm_movelh_ps(t0, t2);   // [r0[0],r1[0],r2[0],r3[0]] — column 0
    *r1 = _mm_movehl_ps(t2, t0);   // [r0[1],r1[1],r2[1],r3[1]] — column 1
    *r2 = _mm_movelh_ps(t1, t3);   // [r0[2],r1[2],r2[2],r3[2]] — column 2
    *r3 = _mm_movehl_ps(t3, t1);   // [r0[3],r1[3],r2[3],r3[3]] — column 3
}
// 8 instructions total. The macro _MM_TRANSPOSE4_PS does exactly this.
Vectorized Algorithms
This chapter shows how real algorithms are vectorized — with the reasoning behind every decision.
Dot product (horizontal reduction)
The dot product of two vectors requires a reduction: summing all lane results into a single scalar. Reductions are expensive because SIMD is fundamentally parallel — combining across lanes requires shuffles. The pattern is to progressively halve the register, adding pairs:
float dot_product_avx(const float* a, const float* b, int n) {
    __m256 acc0 = _mm256_setzero_ps();   // 4 accumulators to break the FMA
    __m256 acc1 = _mm256_setzero_ps();   // dependency chain: each FMA waits ~4 cycles
    __m256 acc2 = _mm256_setzero_ps();   // for its accumulator; using 4 independent
    __m256 acc3 = _mm256_setzero_ps();   // ones hides that latency

    int i = 0;
    for (; i <= n - 32; i += 32) {       // process 32 floats per iteration
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i   ), _mm256_loadu_ps(b+i   ), acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+ 8), _mm256_loadu_ps(b+i+ 8), acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+16), _mm256_loadu_ps(b+i+16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+24), _mm256_loadu_ps(b+i+24), acc3);
    }

    /* Combine accumulators */
    acc0 = _mm256_add_ps(acc0, acc1);
    acc2 = _mm256_add_ps(acc2, acc3);
    acc0 = _mm256_add_ps(acc0, acc2);

    /* ── Horizontal reduction: 8 floats → 1 float ── */
    // Step 1: add upper 128b half to lower half
    __m128 hi  = _mm256_extractf128_ps(acc0, 1);   // lanes 4..7
    __m128 lo  = _mm256_castps256_ps128(acc0);     // lanes 0..3 (zero-cost cast)
    __m128 sum = _mm_add_ps(lo, hi);               // [0+4, 1+5, 2+6, 3+7]
    // Step 2: hadd pairs
    sum = _mm_hadd_ps(sum, sum);   // [(0+4)+(1+5), (2+6)+(3+7), ...]
    sum = _mm_hadd_ps(sum, sum);   // [total, ...]
    float result = _mm_cvtss_f32(sum);   // extract lowest lane as scalar

    /* Scalar tail for remaining elements */
    for (; i < n; i++)
        result += a[i] * b[i];
    return result;
}
String processing — SIMD strcmp
SSE4.2 introduced PCMPISTRI and PCMPISTRM — hardware string comparison instructions designed for pattern matching. They can find null terminators, compare character sets, and find substrings in a single instruction. PCMPISTRI returns an index; PCMPISTRM returns a mask. The control byte selects the comparison mode.
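As one concrete (and heavily simplified) sketch, the classic SSE4.2 strlen uses PCMPISTRI in "equal each" mode against a zero vector; the implicit-length rules make it return the index of the first null byte in each 16-byte chunk. The function name is illustrative and this is not a production-safe routine (see the caveat in the comments):

#include <immintrin.h>
#include <stddef.h>

size_t strlen_sse42(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        // EQUAL_EACH against an empty (all-invalid) string: the result is true
        // exactly at and after chunk's null terminator, so the returned index
        // is the position of the first null byte, or 16 if none is present.
        int idx = _mm_cmpistri(zero, chunk, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx < 16) return i + idx;   // terminator found in this chunk
        i += 16;
    }
}
// Caveat: this reads 16 bytes at a time and may read past the terminator;
// production code first aligns s to 16 bytes so the over-read never crosses a page.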
For general SIMD string processing without SSE4.2, you process 16 or 32 characters at a time using PSHUFB for table lookups:
#include <ctype.h>
#include <emmintrin.h>

// Convert ASCII uppercase to lowercase, 16 bytes at a time
void to_lower_sse(char* s, int n) {
    __m128i upper_A  = _mm_set1_epi8('A' - 1);   // 0x40
    __m128i upper_Z  = _mm_set1_epi8('Z' + 1);   // 0x5B
    __m128i lo_delta = _mm_set1_epi8(32);        // 'a' - 'A' = 32

    int i = 0;
    for (; i <= n - 16; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        // Byte is uppercase if 'A'-1 < byte < 'Z'+1 (i.e. 'A' <= byte <= 'Z').
        // There is no _mm_cmple_epi8, so both bounds use signed greater-than.
        __m128i above_A  = _mm_cmpgt_epi8(chunk, upper_A);
        __m128i below_Z  = _mm_cmpgt_epi8(upper_Z, chunk);
        __m128i is_upper = _mm_and_si128(above_A, below_Z);   // both conditions

        // Only add 32 to uppercase bytes (AND the mask with the delta)
        __m128i delta = _mm_and_si128(is_upper, lo_delta);
        chunk = _mm_add_epi8(chunk, delta);
        _mm_storeu_si128((__m128i *)(s + i), chunk);
    }
    for (; i < n; i++)
        s[i] = (char)tolower((unsigned char)s[i]);   // scalar tail
}
Compiler Auto-Vectorization
Modern compilers (GCC, Clang, MSVC) can automatically generate SIMD code from scalar loops. Understanding when and why this works — and when it fails — is essential.
What the vectorizer needs
Auto-vectorization is a loop transformation. The compiler analyzes a scalar loop and asks: can I replace N iterations of this loop with 1 iteration processing N elements at once? For this to be safe and profitable, the compiler needs to verify:
1. No loop-carried dependencies — Iteration N must not read a value written by iteration N-1. a[i] = a[i-1] + 1 cannot be vectorized because each element depends on the previous. a[i] = b[i] * 2 can be vectorized.
2. No pointer aliasing — If two pointers might point to overlapping memory, the compiler can't reorder the accesses. Use restrict to tell the compiler the pointers don't alias.
3. Known or analyzable trip count — The compiler needs to know the loop executes a reasonable number of times. A loop over N elements where N is known at compile time or can be propagated from a function parameter is easily vectorized.
4. Simple control flow — Branches inside the loop body break vectorization unless the compiler can convert them to predicated SIMD operations (blend/mask).
Compiler flags for vectorization
# GCC / Clang — enable auto-vectorization and target ISA
gcc   -O2 -march=native -ftree-vectorize myfile.c
clang -O2 -march=native -fvectorize      myfile.c
# -O2 enables basic vectorization; -O3 enables more aggressive
# -march=native:    use all features of the current CPU
# -march=x86-64-v3: target AVX2+FMA baseline (common server minimum)
# -march=x86-64-v4: target AVX-512

# Get vectorization report (what was/wasn't vectorized and why)
gcc   -O2 -march=native -fopt-info-vec-optimized -fopt-info-vec-missed myfile.c
clang -O2 -march=native -Rpass=loop-vectorize -Rpass-missed=loop-vectorize myfile.c

# Enable FMA and AVX2 explicitly
gcc -O2 -mavx2 -mfma myfile.c

# MSVC
cl /O2 /arch:AVX2 /Qvec-report:2 myfile.c
/* ── Use restrict to eliminate aliasing concerns ── */
void add_arrays(float *restrict c, const float *restrict a,
                const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];      // vectorizes cleanly
}

/* ── Alignment hints ── */
__attribute__((aligned(32))) float buf[1024];   // static aligned array
// or: void f(float *__restrict__ __attribute__((aligned(32))) p);

/* ── Loop that auto-vectorizes to AVX2 (with -mavx2 -mfma) ── */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  // → VFMADD231PS
}

/* ── Loop that does NOT auto-vectorize (loop-carried dependency) ── */
void prefix_sum(float *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] += a[i-1];          // NOT vectorizable as-is
    // Vectorizing prefix sums requires a different algorithm (parallel scan)
}

/* ── GCC pragmas to force/prevent vectorization ── */
#pragma GCC ivdep       // "ignore vector dependencies" — trust me, no deps
for (int i = 0; i < n; i++) a[i] = b[i] + c[i];

#pragma GCC unroll 4    // manually unroll loop (helps expose ILP to vectorizer)
for (int i = 0; i < n; i++) a[i] *= 2.f;

/* OpenMP SIMD (portable) */
#pragma omp simd
for (int i = 0; i < n; i++) c[i] = a[i] * b[i];
Always verify auto-vectorization by inspecting the generated assembly. Compiler Explorer (godbolt.org) lets you type C code and see the assembly output with color-coded mapping. Look for vmovups, vaddps, vfmadd — these confirm vectorization happened. A loop with only movss and addss is scalar and was not vectorized.
Pitfalls & Anti-Patterns
SIMD code has failure modes that don't exist in scalar code. These are the most common ways to write code that's slower or subtly wrong.
1. The SSE/AVX transition penalty
Mixing legacy SSE instructions (without the v prefix) with VEX-encoded AVX instructions in the same code path triggers a false dependency hazard: the CPU must preserve the upper 128 bits of the ymm registers when executing legacy SSE, which creates a stall when switching between SSE and AVX code. Always use the VEX-encoded variants — GCC and Clang do this automatically with -mavx. The compiler inserts vzeroupper at function boundaries to avoid the penalty; a vzeroupper in the middle of a hot loop is a sign that SSE and AVX code are being mixed.
2. Excessive horizontal operations
HADD, HSUB, and other horizontal reductions look appealing but are often slow. They are typically implemented as a shuffle followed by a vertical operation, and even CPUs that implement them directly give them higher latency than two separate vertical ops. Prefer accumulating in multiple vertical accumulators and doing a single reduction at the end, as in the dot-product example; a shuffle-based reduction helper is sketched below.
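A sketch of such a shuffle-based horizontal sum for a __m256 of floats (the helper name is illustrative):

#include <immintrin.h>

static inline float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);          // lanes 0..3
    __m128 hi = _mm256_extractf128_ps(v, 1);        // lanes 4..7
    __m128 s  = _mm_add_ps(lo, hi);                 // [0+4, 1+5, 2+6, 3+7]
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));         // lane0 += lane2, lane1 += lane3
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));     // lane0 += lane1
    return _mm_cvtss_f32(s);                        // final scalar sum
}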
3. Denormal floats — catastrophic slowdown
IEEE 754 denormal (subnormal) numbers are values very close to zero. Some CPUs handle denormals in microcode — 50–100× slower than normal float operations. This is a real-world catastrophe in audio DSP and ML. The fix: set the FPU to flush-to-zero mode:
#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* ── Flush denormals to zero — critical for audio/ML performance ── */
static inline void disable_denormals(void) {
    // MXCSR bits: bit 15 = FZ (Flush-to-Zero), bit 6 = DAZ (Denormals-Are-Zero)
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // outputs: flush to zero
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // inputs: treat as zero
    // Equivalent: _mm_setcsr(_mm_getcsr() | 0x8040);
}

/* Check if a float value is denormal */
static inline bool is_denormal(float x) {
    uint32_t bits;
    memcpy(&bits, &x, 4);
    return ((bits & 0x7F800000) == 0) && ((bits & 0x007FFFFF) != 0);
}
4. Type punning — use memcpy, not pointer casts
A common pattern is treating the bits of a float as an integer (e.g., for the abs trick above). Never use a pointer cast — that's undefined behavior in C and C++ due to strict aliasing rules. Use memcpy, which the compiler optimizes to a register move:
// WRONG: undefined behavior (strict aliasing violation)
float f = 1.0f;
uint32_t i = *(uint32_t*)&f;    // ← DO NOT DO THIS

// CORRECT: use memcpy
float f2 = 1.0f;
uint32_t i2;
memcpy(&i2, &f2, 4);            // compiles to a single register move

// ALSO CORRECT: C++ std::bit_cast (C++20)
uint32_t i3 = std::bit_cast<uint32_t>(f2);

// ALSO CORRECT: union (C only, not C++)
union { float f; uint32_t i; } pun;
pun.f = 1.0f;
uint32_t i4 = pun.i;            // valid in C99, undefined in C++ (but works in practice)
5. Assuming faster = always use widest register
Wider is not always faster. As mentioned, AVX-512 can cause frequency downclocking on older Intel parts. Even AVX2 vs. SSE: if the kernel is memory-bound, shuffles data across the 128-bit lane boundary frequently, or runs on a microarchitecture that splits 256-bit instructions into two 128-bit µops (early AMD Zen, pre-Haswell Intel), the 128-bit version can match or beat the 256-bit one. Profile before widening.
6. Not unrolling enough — hiding FMA latency
FMA has a latency of 4–5 cycles on most architectures but a throughput of 0.5 cycles (2/cycle on port 0+1). If your FMA loop has a data dependency chain — each iteration's FMA uses the previous iteration's output as its accumulator — you're bottlenecked on latency (1 FMA per 4–5 cycles) instead of throughput (2 FMAs per cycle). The fix is to use 4–8 independent accumulators, as shown in the dot product example above.
ARM NEON & SVE
ARM's SIMD ecosystem is different from x86 in important ways — particularly SVE's scalable vector model, which is a paradigm shift.
NEON — ARM's fixed-width SIMD
NEON (Advanced SIMD) is ARM's equivalent of SSE/AVX. It provides 128-bit vector registers (v0–v31, also accessible as 64-bit d0–d31 and 32-bit s0–s31). Unlike x86, NEON was designed into AArch64 from the start — it's mandatory on all AArch64 implementations, not an optional extension.
NEON intrinsics use a different naming convention: v<op><dtype>_<lane_type>. For example: vaddq_f32 (add, q=128-bit, f32 elements). The q suffix means 128-bit ("quad"); without it, it's 64-bit ("double").
#include <arm_neon.h>

/* ── NEON types (analogous to Intel __m128) ── */
float32x4_t a, b, c;    // 4×float32  (128-bit)
int16x8_t   ia, ib;     // 8×int16    (128-bit)
uint8x16_t  ua;         // 16×uint8   (128-bit)
float32x2_t da;         // 2×float32  (64-bit, "doubleword")

/* ── Load / Store ── */
a = vld1q_f32(ptr);           // 128-bit load (no alignment requirement)
vst1q_f32(ptr, a);            // store
a = vdupq_n_f32(1.0f);        // broadcast (dup = duplicate)
a = vld1q_dup_f32(ptr);       // load one float and broadcast

/* Multi-register load (de-interleave) — unique NEON feature */
float32x4x3_t rgb = vld3q_f32(ptr);   // loads 12 floats, de-interleaves RGB
// rgb.val[0] = R channel, rgb.val[1] = G, rgb.val[2] = B
// This is the NEON killer feature for AoS→SoA — no manual transpose needed!

/* ── Arithmetic ── */
c = vaddq_f32(a, b);          // c[i] = a[i] + b[i]
c = vmulq_f32(a, b);          // c[i] = a[i] * b[i]
c = vmlaq_f32(c, a, b);       // c[i] = c[i] + a[i]*b[i]  (mul-accumulate)
c = vmaxq_f32(a, b);          // c[i] = max(a[i], b[i])
c = vrecpeq_f32(a);           // 1/a[i] approximation

/* FMA (AArch64 only, not ARMv7 NEON) */
c = vfmaq_f32(c, a, b);            // c[i] = c[i] + a[i]*b[i]  (true FMA)
c = vfmaq_lane_f32(c, a, da, 0);   // multiply each a[i] by da[0], accumulate

/* ── Comparison ── */
uint32x4_t mask = vcgtq_f32(a, b);   // a > b, returns 0xFFFFFFFF or 0
c = vbslq_f32(mask, a, b);           // blend: c[i] = mask[i] ? a[i] : b[i]

/* ── Shuffle / permute ── */
uint8x16_t tbl = vqtbl1q_u8(src, idx);   // table lookup (like x86 PSHUFB)
c = vextq_f32(a, b, 2);                  // [a2,a3,b0,b1] — extract and shift
c = vrev64q_f32(a);                      // reverse pairs: [a1,a0,a3,a2]
c = vtrn1q_f32(a, b);                    // [a0,b0,a2,b2] — transpose even

/* Horizontal reduce */
float total = vaddvq_f32(a);   // sum all 4 lanes → scalar (AArch64)
SVE — Scalable Vector Extension — a revolution
SVE (and its successor SVE2) is ARM's answer to a fundamental problem: SIMD width keeps changing (128→256→512 bits), but code written for one width doesn't automatically benefit from wider hardware. SVE decouples the algorithm from the hardware width.
The core concept: SVE vector length is unknown at compile time. You query the hardware length at runtime using svcntw() (count of 32-bit elements). You write loops that process vl elements per iteration where vl is this hardware value. The same binary runs on SVE hardware with 128, 256, 512, or 2048-bit vectors without recompilation.
#include <arm_sve.h>

void saxpy_sve(float* y, const float* x, float a, uint64_t n) {
    // svcntw() = number of float32 lanes — determined at RUNTIME by the hardware
    //   on 128-bit SVE: svcntw() = 4
    //   on 512-bit SVE: svcntw() = 16
    // Your code doesn't change.
    svfloat32_t va = svdup_n_f32(a);              // broadcast a to all SVE lanes

    uint64_t i = 0;
    for (; i < n; i += svcntw()) {
        // svwhilelt_b32: create predicate mask for i .. i+vl-1, limited to n
        svbool_t pg = svwhilelt_b32(i, n);        // handles the loop tail automatically!
        svfloat32_t vx = svld1_f32(pg, x + i);    // masked load
        svfloat32_t vy = svld1_f32(pg, y + i);    // masked load
        vy = svmla_f32_z(pg, vy, va, vx);         // y = a*x + y (FMA), zeroing predication
        svst1_f32(pg, y + i, vy);                 // masked store (safe at the tail)
    }
    // No scalar tail needed! svwhilelt automatically limits the last iteration.
}
// The same binary runs at optimal width on any SVE implementation:
// Neoverse V1 / AWS Graviton3 (256b), Fujitsu A64FX (512b), Neoverse N2 (128b), etc.
SVE's length-agnostic model is where SIMD is heading. RISC-V's V extension uses the same philosophy (via vsetvl). It solves the "compile once, run on all future hardware widths" problem that has plagued x86 SIMD for decades. If you're writing new code targeting SVE-capable ARM servers (AWS Graviton3 and newer, other Neoverse V1/N2-based parts), SVE is the preferred approach.
Performance Analysis & Profiling
Understanding performance requires measurement. Here are the tools, metrics, and mental models for analyzing SIMD code.
Peak throughput calculation
To know whether your code is achieving its potential, first calculate the theoretical peak. For a 3.5 GHz CPU with AVX2:
Float32 throughput = 3.5 GHz × 2 FMA units × 8 floats/unit × 2 ops/FMA = 112 GFLOPS
If your vectorized kernel achieves 30 GFLOPS, you're at 27% of peak. That's normal — memory bandwidth limits most real workloads. The arithmetic intensity (FLOPS per byte loaded) determines whether you're compute-bound or memory-bound. Use the roofline model: plot compute intensity vs. bandwidth ceiling.
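A back-of-the-envelope roofline check takes only a few lines. In this sketch the 112 GFLOPS peak comes from the calculation above; the bandwidth figure is an assumed sustained value — substitute a measured one:

#include <stdio.h>

int main(void) {
    double peak_gflops = 112.0;   // compute roof, from the calculation above
    double bw_gbytes   = 50.0;    // ASSUMED sustained memory bandwidth (GB/s)
    double flops       = 2.0;     // SAXPY: one multiply + one add per element
    double bytes       = 12.0;    // 8 bytes read + 4 bytes written per element
    double intensity   = flops / bytes;           // arithmetic intensity (FLOP/byte)
    double attainable  = intensity * bw_gbytes;   // bandwidth roof
    if (attainable > peak_gflops) attainable = peak_gflops;
    printf("intensity = %.2f FLOP/byte, attainable = %.1f GFLOPS\n",
           intensity, attainable);
    return 0;
}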
Key profiling tools
| Tool | Platform | What it measures |
|---|---|---|
| Intel VTune Profiler | Intel CPUs | Hotspots, vectorization, port utilization, cache misses, memory bandwidth |
| AMD uProf / CodeAnalyst | AMD CPUs | PMU counters, instruction mix, IPC |
| perf stat | Linux x86/ARM | Instruction count, IPC, cache misses, branch mispredictions |
| LIKWID | Linux x86 | FLOP counters, memory bandwidth, SIMD vectorization ratio |
| IACA / LLVM-MCA | Any (static analysis) | Static throughput/latency estimation without running code |
| Intel SDE | x86 (emulator) | AVX-512 on non-AVX-512 hardware, instruction mix analysis |
| Arm Streamline | ARM | NEON utilization, IPC, memory subsystem |
| Godbolt + objdump | All | Assembly inspection, verifying vectorization happened |
Amdahl's Law — the fundamental limit
If only 50% of your code is vectorizable, the maximum speedup from perfect SIMD is 2×, regardless of how much you accelerate the vectorized portion. Profile first, find the true hotspot, then vectorize. SIMD is pointless on code that spends 5% of time in a loop.
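The formula is worth keeping at hand; a short sketch:

// Amdahl's Law: overall speedup when a fraction p of runtime is
// accelerated by a factor s.   speedup = 1 / ((1 - p) + p / s)
double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }
// amdahl(0.5, 8.0) ≈ 1.78 — even an 8× SIMD win on half the runtime is under 2× overall.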
The memory bandwidth ceiling
On a Zen 4 system with DDR5-5600 memory, peak bandwidth is ~90 GB/s. A dense float32 SAXPY (y = a*x + y) reads 8 bytes and writes 4 bytes per element: 12 bytes/element. At peak bandwidth that's 7.5 billion elements/second, or about 15 GFLOPS (2 ops per element). The CPU can do 300+ GFLOPS of float32, so SAXPY is fully memory-bound at roughly 5% of compute peak. No amount of SIMD optimization changes this — the bottleneck is the memory bus.
The rule: vectorize compute-bound kernels first. Identify whether your hotspot is memory-bound (bandwidth-limited) or compute-bound (FLOP-limited) before writing a single intrinsic. perf stat -e instructions,cycles,cache-misses and comparing the resulting IPC (instructions per cycle) to theoretical peak gives you an immediate answer.
Intel Intrinsics Guide — intrinsics.guide — searchable reference for every intrinsic with latency/throughput data.
Agner Fog's optimization manuals — agner.org — the definitive reference for x86 instruction tables and microarchitecture details.
ISPC compiler — Intel SPMD Program Compiler — high-level SIMD programming model that generates ISA-portable code.
Highway library — google/highway — portable SIMD C++ library abstracting x86/ARM/RISC-V.
simdjson / simdutf — production examples of aggressive SIMD in JSON parsing and UTF-8 validation.