Single Instruction, Multiple Data
A comprehensive, from-scratch guide to SIMD — the hardware and software technique that makes modern processors process 16, 32, or 64 values in the same time it takes to process one. Covers x86 SSE/AVX/AVX-512, ARM NEON/SVE, intrinsics, memory alignment, masking, shuffle operations, vectorization, and real-world performance.
- 01 Fundamentals & Mental Model
- 02 Hardware Implementation
- 03 ISA Extensions History
- 04 Intel Intrinsics Deep Dive
- 05 Memory, Alignment & Loads
- 06 Masking & Predicates
- 07 Shuffle, Permute & Blend
- 08 Vectorized Algorithms
- 09 Compiler Auto-Vectorization
- 10 Pitfalls & Anti-Patterns
- 11 ARM NEON & SVE
- 12 Performance & Profiling
Fundamentals & Mental Model
Before writing a single intrinsic call, you need to deeply understand what problem SIMD solves, why it exists, and what the hardware is actually doing when you use it.
Flynn's Taxonomy — where SIMD lives
In 1966, Michael Flynn classified computing architectures along two axes: how many instruction streams exist simultaneously, and how many data streams they operate on. This gives four categories:
| Category | Instructions | Data Streams | Example |
|---|---|---|---|
| SISD | 1 | 1 | Classic scalar CPU — one add per instruction |
| SIMD | 1 | Many | SSE/AVX — one instruction adds 8 floats at once |
| MISD | Many | 1 | Theoretical (pipeline stages, some fault-tolerant systems) |
| MIMD | Many | Many | Multi-core CPU, GPU compute shaders |
SIMD sits in the sweet spot of hardware cost vs. throughput gain. Unlike MIMD (which requires full duplicated execution units, caches, instruction decoders, and branch predictors), SIMD reuses a single instruction decoder, a single program counter, and a single fetch/decode pipeline — only the execution units and register file are widened. You pay roughly linear hardware cost for linear throughput gain.
The scalar bottleneck — why SIMD was invented
By the early 1990s, CPU clock speeds were rising, but multimedia workloads — audio DSP, video codecs, image processing — were hitting a wall. These workloads have a distinct pattern: the same arithmetic operation is applied to thousands of independent values stored contiguously in memory. For example, adjusting the brightness of an image means adding a constant to every single pixel value. On scalar hardware:
processing eight pixels one at a time takes roughly 24 instructions — a load, an add, and a store per pixel. The SIMD version does the same work in 3 instructions — one wide load, one wide add, one wide store — an 8× throughput gain.
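To make the contrast concrete, here is a minimal sketch of both versions for 8-bit pixels, using SSE2 intrinsics that are introduced properly in later chapters (function names are illustrative):

#include <emmintrin.h>   // SSE2
#include <stddef.h>
#include <stdint.h>

/* Scalar: one pixel per iteration — a load, an add/clamp, a store each time. */
void brighten_scalar(uint8_t *px, size_t n, uint8_t delta) {
    for (size_t i = 0; i < n; i++) {
        int v = px[i] + delta;
        px[i] = (uint8_t)(v > 255 ? 255 : v);          // clamp to 255
    }
}

/* SIMD: 16 pixels per iteration with a saturating add. */
void brighten_sse2(uint8_t *px, size_t n, uint8_t delta) {
    __m128i d = _mm_set1_epi8((char)delta);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(px + i));  // load 16 pixels
        v = _mm_adds_epu8(v, d);                                  // saturating add
        _mm_storeu_si128((__m128i *)(px + i), v);                 // store 16 pixels
    }
    for (; i < n; i++) {                                          // scalar tail
        int v = px[i] + delta;
        px[i] = (uint8_t)(v > 255 ? 255 : v);
    }
}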
Key terminology you'll see everywhere
Vector register — a wide CPU register (64–512 bits) that holds multiple data elements simultaneously.
Lane / element — one logical slot within a vector register. A 256-bit register holding 32-bit floats has 8 lanes.
Width — the total bit-size of a vector register (128, 256, 512 bits).
Element type — the data type packed into each lane: int8, int16, int32, int64, float32 (single precision), float64 (double).
Vectorization — the process (manual or automatic) of transforming scalar loops into SIMD operations.
ISA extension — an optional addition to a CPU's instruction set. SSE, AVX, and AVX-512 are ISA extensions.
Intrinsic — a C/C++ function that maps one-to-one to a single assembly instruction. Used to write SIMD code in C without inline assembly.
Throughput vs latency — latency is how many cycles pass before an instruction's result is available; throughput is how many instances of that instruction can complete per cycle (reciprocal throughput = minimum cycles between issuing two independent instances of the same instruction).
CPUID — a CPU instruction that reports which ISA extensions the current processor supports at runtime.
The four operations every SIMD system must provide
Regardless of architecture — x86, ARM, RISC-V V extension, PowerPC AltiVec — every SIMD ISA must provide:
1. Arithmetic — element-wise add, subtract, multiply, divide, fused multiply-add (FMA), min, max, abs, negate. These are the workhorses.
2. Memory — loading a chunk of memory into a vector register (load), storing a register back to memory (store). Plus variants for unaligned access, non-temporal (streaming) stores that bypass the cache, and gather/scatter for non-contiguous access.
3. Shuffle/permute — rearranging elements within or between registers. This is the most powerful and most complex part of SIMD. Many algorithms that seem non-vectorizable become tractable once you understand shuffles.
4. Comparison and masking — comparing elements and producing a mask (a bitmask or predicate register) that controls which lanes participate in subsequent operations. Essential for handling boundary conditions, conditional code, and variable-length data.
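A minimal sketch touching each of the four families in one short function, using AVX2 intrinsics covered in detail in chapter 04 (the function name and constants are illustrative):

#include <immintrin.h>

void four_families(const float *in, float *out) {
    __m256 a    = _mm256_loadu_ps(in);                        // 2. memory: load
    __m256 b    = _mm256_set1_ps(2.0f);
    __m256 sum  = _mm256_add_ps(a, b);                        // 1. arithmetic
    __m256 rev  = _mm256_permutevar8x32_ps(sum,
                      _mm256_set_epi32(0,1,2,3,4,5,6,7));     // 3. shuffle/permute: reverse lanes
    __m256 mask = _mm256_cmp_ps(rev, b, _CMP_GT_OQ);          // 4. comparison → per-lane mask
    __m256 sel  = _mm256_blendv_ps(b, rev, mask);             //    keep rev where rev > 2
    _mm256_storeu_ps(out, sel);                               // 2. memory: store
}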
Hardware Implementation
Understanding the physical silicon helps you reason about why certain operations are fast, why others have latency penalties, and what the CPU is doing under the hood.
The vector execution unit
A modern out-of-order CPU like Intel's Golden Cove or AMD's Zen 4 contains multiple independent execution ports. Each port has attached execution units. Scalar integer operations typically have 4–6 ports feeding several ALUs. Vector operations have dedicated ports — typically 2–3 on modern designs — each feeding a full-width vector ALU.
The key insight: the hardware cost of a wide vector add is nearly identical to a narrow one. The transistor overhead of widening an adder from 128 to 256 bits is roughly proportional to the bit width, but the instruction fetch/decode overhead, the register renaming overhead, the scheduler overhead — all the expensive fixed costs — remain constant. This is why SIMD is so efficient: you amortize those fixed costs over more work.
Register files — what physically exists on die
When you use SSE, you use registers named xmm0–xmm15 (16 registers on x86-64, each 128 bits). When you use AVX/AVX2, you use ymm0–ymm15 (256 bits each). xmm registers are the lower 128 bits of the corresponding ymm register — they are the same physical register, just addressed at different widths.
AVX-512 extends this to zmm0–zmm31 (512 bits each, 32 registers total), doubling both the register count and the width. This additional register capacity reduces register pressure significantly — one of the biggest benefits of AVX-512 besides raw width.
The VEX/EVEX encoding prefix — why it matters
Legacy SSE instructions use the original x86 encoding: no prefix, or a 66/F2/F3 byte prefix. These instructions write only 128 bits and preserve the upper 128 bits of the ymm register. This sounds harmless but causes a major performance hazard: the CPU must preserve the upper bits, which means a dependency on the previous writer of that register. This is called the SSE/AVX transition penalty.
The VEX prefix (introduced with AVX in 2011) solves this. VEX-encoded instructions always zero-extend to the full register width — writing to xmm0 via a VEX instruction zeroes the upper 128 bits of ymm0/zmm0. No false dependency. This is why you should always use the VEX-encoded v variants: vaddps instead of addps, vmovups instead of movups.
AVX-512 goes further with EVEX: a 4-byte encoding that adds mask registers, broadcasting, and additional opcodes — all discussed in later chapters.
Frequency scaling — the SIMD penalty that most guides miss
On Intel server cores from Skylake-SP through Ice Lake (and their HEDT derivatives), executing 512-bit operations moves the CPU into a lower-frequency state called the AVX-512 license level, because the 512-bit units draw more power and generate more heat. The frequency reduction can be severe — sometimes 300–700 MHz on Skylake-SP. On many workloads, this makes AVX-512 slower than AVX2 unless the computation is highly arithmetic-intensive.
Ice Lake client and Rocket Lake largely eliminated this penalty, Alder Lake (12th gen) dropped AVX-512 from its performance cores entirely, and AMD Zen 4 runs AVX-512 on double-pumped 256-bit datapaths with no frequency penalty. Always benchmark on the target hardware; never assume wider = faster.
The theoretical peak throughput of SIMD (e.g., "8× faster for 256-bit float32") is almost never achieved in practice. Real speedup depends on: memory bandwidth saturation, instruction throughput vs latency, pipeline dependencies, and branch mispredictions around loop edges. Realistic speedups of 3–6× for well-vectorized code are common; anything above 6× is exceptional.
ISA Extensions History
x86 SIMD has evolved through a series of backward-compatible extensions. Each one built on the previous, adding width, new data types, or new operations.
- MMX (1997) — 64-bit packed integer operations aliased onto the x87 floating-point registers; code had to execute EMMS to switch back to FP. Now obsolete.
- SSE (1999) — 128-bit xmm registers and packed single-precision arithmetic: MOVAPS, ADDPS, MULPS, etc. Also introduced PREFETCH instructions for cache hints. The foundation everything else is built on.
- SSE2 (2000) — packed double-precision and full-width integer operations on xmm registers; part of the x86-64 baseline.
- SSE3 (2004) — horizontal adds/subtracts (HADD, HSUB) and LDDQU for unaligned loads. Also MOVSHDUP/MOVSLDUP for complex-number arithmetic.
- SSSE3 (2006) — PSHUFB, a byte-granularity shuffle controlled by a vector index. Unlocks enormous algorithmic flexibility. Also added PMULHRSW, PHADDW, and sign operations.
- SSE4.1 / SSE4.2 (2007–2008) — SSE4.1 added DPPS (dot product), BLENDPS (blend/select), PTEST, PMAXSD, PMULLD (32×32→32 multiply). SSE4.2 added string compare (PCMPISTRI) and CRC32.
- AVX (2011) — 256-bit ymm registers, 256-bit float arithmetic, and the non-destructive three-operand VEX encoding.
- AVX2 (2013) — 256-bit integer operations, gather (VGATHERDPS), broadcast (VPBROADCASTD), variable shifts (VPSLLVD), and permutes across the full 256-bit register. The current practical sweet spot.
- FMA3 (2013) — fused multiply-add: a*b+c in a single instruction with only one rounding step. Critical for throughput-bound code. VFMADD213PS, VFMADD231PS, etc. — the number encodes the order of operands.
- AVX-512 (2016+) — 512-bit zmm registers, 32 registers, opmask registers, and the EVEX encoding (covered throughout this guide).
- AVX-VNNI (2021) — VPDPBUSD computes 4-element dot products of int8 values, accumulating into int32. Critical for deep-learning inference on CPUs. Available without full AVX-512.
Runtime CPU feature detection
You cannot assume any extension beyond SSE2 is available. Code compiled with -mavx2 and run on a Sandy Bridge CPU will crash with an illegal instruction fault. The correct approach is runtime CPUID detection:
#include <stdint.h>
#include <stdbool.h>

static inline void cpuid(uint32_t leaf, uint32_t subleaf,
                         uint32_t *eax, uint32_t *ebx,
                         uint32_t *ecx, uint32_t *edx)
{
    __asm__ volatile("cpuid"
                     : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
                     : "0"(leaf), "2"(subleaf));
}

typedef struct {
    bool sse2, sse3, ssse3, sse41, sse42;
    bool avx, avx2, fma;
    bool avx512f, avx512bw, avx512vl, avx512vnni;
} CpuFeatures;

CpuFeatures detect_cpu_features(void)
{
    CpuFeatures f = {0};
    uint32_t eax, ebx, ecx, edx;

    /* Leaf 1: basic features */
    cpuid(1, 0, &eax, &ebx, &ecx, &edx);
    f.sse2  = (edx >> 26) & 1;   /* EDX bit 26 */
    f.sse3  = (ecx >>  0) & 1;   /* ECX bit 0  */
    f.ssse3 = (ecx >>  9) & 1;   /* ECX bit 9  */
    f.sse41 = (ecx >> 19) & 1;
    f.sse42 = (ecx >> 20) & 1;
    f.avx   = (ecx >> 28) & 1;
    f.fma   = (ecx >> 12) & 1;

    /* Leaf 7: extended features */
    cpuid(7, 0, &eax, &ebx, &ecx, &edx);
    f.avx2       = (ebx >>  5) & 1;
    f.avx512f    = (ebx >> 16) & 1;
    f.avx512bw   = (ebx >> 30) & 1;
    f.avx512vl   = (ebx >> 31) & 1;
    f.avx512vnni = (ecx >> 11) & 1;

    return f;
}
For most production code targeting general x86-64 deployments: SSE2 is guaranteed (part of x86-64 ABI). SSSE3 is safe for code targeting hardware from 2008+. AVX2+FMA is safe for code targeting hardware from 2015+. AVX-512 requires explicit deployment targeting (data centers, specific HPC hardware).
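A common pattern built on top of detect_cpu_features() is to choose an implementation once, at startup, through a function pointer. A minimal sketch — the saxpy_* variants are hypothetical and would each live in a translation unit compiled with the matching -m flags:

typedef void (*saxpy_fn)(float *y, const float *x, float a, int n);

void saxpy_scalar(float *y, const float *x, float a, int n);
void saxpy_avx2(float *y, const float *x, float a, int n);     /* built with -mavx2 -mfma   */
void saxpy_avx512(float *y, const float *x, float a, int n);   /* built with -mavx512f      */

saxpy_fn select_saxpy(void) {
    CpuFeatures f = detect_cpu_features();
    if (f.avx512f)        return saxpy_avx512;
    if (f.avx2 && f.fma)  return saxpy_avx2;
    return saxpy_scalar;                      /* SSE2 baseline always works on x86-64 */
}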
Intel Intrinsics Deep Dive
Intrinsics are the primary API for writing SIMD code in C/C++. They look like function calls but compile to single instructions. Understanding the naming convention is essential.
The naming convention — decoded
Every Intel SIMD intrinsic follows a systematic naming scheme. Once you internalize it, you can read any intrinsic name and know exactly what it does.
- Width prefix: _mm_ = 128-bit, _mm256_ = 256-bit, _mm512_ = 512-bit
- Operation: add, sub, mul, div, load, store, shuffle, cmp, …
- Element-type suffix: _ps = packed f32, _pd = packed f64, _epi32 = int32, _epu16 = uint16
Full example: _mm256_add_ps(a, b) means: 256-bit wide, add operation, packed single-precision floats. It compiles to a single VADDPS ymm, ymm, ymm instruction.
Element type suffixes — complete reference
| Suffix | Meaning | Lanes in 256b | Lanes in 128b |
|---|---|---|---|
_ps | Packed single-precision float (float32) | 8 | 4 |
_pd | Packed double-precision float (float64) | 4 | 2 |
_ss | Scalar single — operates on lowest lane only | — | 1 |
_sd | Scalar double — lowest lane only | — | 1 |
_epi8 | Packed signed 8-bit integers | 32 | 16 |
_epi16 | Packed signed 16-bit integers | 16 | 8 |
_epi32 | Packed signed 32-bit integers | 8 | 4 |
_epi64 | Packed signed 64-bit integers | 4 | 2 |
_epu8 | Packed unsigned 8-bit integers | 32 | 16 |
_epu16 | Packed unsigned 16-bit integers | 16 | 8 |
_epu32 | Packed unsigned 32-bit integers | 8 | 4 |
_si128 | Generic 128-bit integer (untyped) | — | 1 |
_si256 | Generic 256-bit integer (untyped) | 1 | — |
The type system — __m128, __m256, __m512
Intel intrinsics use opaque vector types. You don't access lanes directly — you use intrinsics to manipulate them. The types:
__m128 — 128-bit vector of float32 (used with _mm_*_ps)
__m128d — 128-bit vector of float64 (_mm_*_pd)
__m128i — 128-bit vector, any integer type (_mm_*_epiN)
__m256, __m256d, __m256i — 256-bit equivalents
__m512, __m512d, __m512i — 512-bit equivalents
These are not pointers. They are value types stored in registers. Passing them by value in function arguments is fine — the ABI passes vector types in xmm/ymm/zmm registers directly (System V AMD64 ABI passes up to 8 __m256 by value in ymm registers).
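Because the types are opaque, the portable way to read an individual lane is to store the vector to a temporary array first. A small sketch (the helper name get_lane is illustrative):

#include <immintrin.h>

// Read lane i (0..7) of a __m256. There is no portable direct lane indexing.
static inline float get_lane(__m256 v, int i) {
    float tmp[8];
    _mm256_storeu_ps(tmp, v);   // spill all 8 lanes to memory
    return tmp[i];
}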
Essential operations with code examples
#include <immintrin.h>   // all Intel SIMD intrinsics

/* ── Set / Initialize ── */
__m256 zeros  = _mm256_setzero_ps();             // all 8 lanes = 0.0f
__m256 ones   = _mm256_set1_ps(1.0f);            // broadcast scalar to all lanes
__m256 custom = _mm256_set_ps                    // set individual values
    (7.f, 6.f, 5.f, 4.f, 3.f, 2.f, 1.f, 0.f);
// CAUTION: _mm256_set_ps args are in REVERSE lane order (lane 7 first)

/* ── Arithmetic ── */
__m256 a, b, c;
c = _mm256_add_ps(a, b);       // c[i] = a[i] + b[i]
c = _mm256_mul_ps(a, b);       // c[i] = a[i] * b[i]
c = _mm256_div_ps(a, b);       // c[i] = a[i] / b[i]   (slow, ~11 cycles)
c = _mm256_sqrt_ps(a);         // c[i] = sqrt(a[i])    (slow, ~13 cycles)
c = _mm256_rcp_ps(a);          // c[i] ≈ 1/a[i]        (12-bit approx, fast)
c = _mm256_rsqrt_ps(a);        // c[i] ≈ 1/sqrt(a[i])  (12-bit approx)

/* ── Fused Multiply-Add (FMA) — requires FMA3 ── */
// a*b + c in one instruction; the compiler picks the 132/213/231 form (see below)
c = _mm256_fmadd_ps(a, b, c);    // c[i] =  a[i]*b[i] + c[i]
c = _mm256_fmsub_ps(a, b, c);    // c[i] =  a[i]*b[i] - c[i]
c = _mm256_fnmadd_ps(a, b, c);   // c[i] = -(a[i]*b[i]) + c[i]

/* ── Integer arithmetic (256-bit) ── */
__m256i ia, ib, ic;
ic = _mm256_add_epi32(ia, ib);     // 8×int32 add (wraps on overflow)
ic = _mm256_adds_epi16(ia, ib);    // 16×int16 saturating add (clamps)
ic = _mm256_mullo_epi32(ia, ib);   // low 32 bits of 32×32 multiply
ic = _mm256_mulhi_epi16(ia, ib);   // high 16 bits of 16×16 multiply

/* ── Comparison — returns mask vector ── */
__m256 mask = _mm256_cmp_ps(a, b, _CMP_LT_OQ);   // per-element: a[i] < b[i]
// mask[i] = 0xFFFFFFFF if true, 0x00000000 if false
c = _mm256_blendv_ps(a, b, mask);                // c[i] = mask[i] ? b[i] : a[i]

/* ── Bitwise ── */
c = _mm256_and_ps(a, b);       // bitwise AND (reinterprets floats as bits)
c = _mm256_or_ps(a, b);
c = _mm256_xor_ps(a, b);
c = _mm256_andnot_ps(a, b);    // (~a) & b

/* ── Horizontal ops (cross-lane, expensive) ── */
c = _mm256_hadd_ps(a, b);
// [a0+a1, a2+a3, b0+b1, b2+b3, a4+a5, a6+a7, b4+b5, b6+b7]
// HADD is often slow — prefer a sequence of vertical ops + shuffles
The FMA naming confusion — 132, 213, 231
FMA3 has three variants of each operation, differing only in which register is used as the addend/destination. The number encodes the operand order for the a*b+c operation:
VFMADD132PS dst, src1, src2 → dst = dst * src2 + src1
VFMADD213PS dst, src1, src2 → dst = src1 * dst + src2
VFMADD231PS dst, src1, src2 → dst = src1 * src2 + dst
The number (1, 2, 3) indicates which operand position is used as the multiplicand (1), multiplier (2), and addend (3). In practice, the compiler picks the variant that minimizes register moves. When writing intrinsics by hand, use _mm256_fmadd_ps which takes (a, b, c) and computes a*b+c.
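As a quick illustration of why FMA matters beyond dot products, here is a sketch of evaluating a cubic polynomial with Horner's rule — three dependent FMAs, each with a single rounding step (the helper name and coefficients are illustrative):

#include <immintrin.h>

// p(x) = c3*x^3 + c2*x^2 + c1*x + c0, evaluated for 8 x-values at once.
static inline __m256 poly3_ps(__m256 x, float c3, float c2, float c1, float c0) {
    __m256 r = _mm256_set1_ps(c3);
    r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c2));   // r = c3*x + c2
    r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c1));   // r = (c3*x + c2)*x + c1
    r = _mm256_fmadd_ps(r, x, _mm256_set1_ps(c0));   // r = (...)*x + c0
    return r;
}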
Memory, Alignment & Load/Store
Memory operations are where most SIMD performance is made or lost. Understanding alignment, streaming stores, and gather/scatter is essential.
Alignment — why it matters at the hardware level
A memory access is aligned when its address is a multiple of its size. A 256-bit (32-byte) load is aligned when its address is divisible by 32. The reason alignment matters: SIMD loads are implemented in hardware as single cache-line transfers. A cache line on all modern x86 processors is 64 bytes. An aligned 32-byte load always falls within a single cache line. An unaligned 32-byte load may straddle two cache lines, requiring two separate cache-line reads and a hardware merge — roughly a 1–3 cycle penalty.
On very old hardware (pre-Nehalem), misaligned SIMD loads could fault. On modern CPUs (Sandy Bridge+), misaligned loads are handled in hardware with only a small penalty. Still, always align your data when you control allocation.
#include <immintrin.h>
#include <stdlib.h>

/* ── Aligned allocation ── */
float *data;
posix_memalign((void**)&data, 32, n * sizeof(float));   // Linux
// or use C11:  aligned_alloc(32, n * sizeof(float))
// or C++17:    std::aligned_alloc

/* Hint the compiler about alignment: */
data = __builtin_assume_aligned(data, 32);   // GCC/Clang
// or MSVC: __assume(((uintptr_t)data & 31) == 0);

/* ── Load variants ── */
__m256 v;

// Aligned load — address MUST be 32-byte aligned (undefined behavior if not)
v = _mm256_load_ps(data);         // VMOVAPS — fastest

// Unaligned load — works at any address, tiny penalty on misaligned
v = _mm256_loadu_ps(data + 1);    // VMOVUPS — always safe

// Broadcast a single float to all 8 lanes
v = _mm256_broadcast_ss(data);    // VBROADCASTSS — [d[0],d[0],...,d[0]]

__m128 lane = _mm_load_ps(data);
__m256 rep  = _mm256_broadcast_ps(&lane);   // [xmm,xmm] — 128b broadcast to 256b

/* ── Store variants ── */
_mm256_store_ps(data, v);         // aligned store
_mm256_storeu_ps(data + 1, v);    // unaligned store

/* Non-temporal streaming store — bypasses cache, for write-once data */
_mm256_stream_ps(data, v);        // VMOVNTPS — MUST be aligned, weakly ordered
_mm_sfence();                     // required fence after streaming stores
// Use _mm256_stream_ps when writing large buffers you won't read back soon.
// Eliminates cache pollution: cache lines are not loaded before being overwritten.

/* ── Masked load/store (AVX) — loads only lanes where mask bit is set ── */
__m256i mask = _mm256_set_epi32(-1,-1,-1,-1, 0,0,0,0);   // upper 4 lanes active
v = _mm256_maskload_ps(data, mask);     // lanes with mask sign bit = 1 are loaded
_mm256_maskstore_ps(data, mask, v);     // stores only active lanes
Gather and scatter — non-contiguous memory access
Gather (introduced in AVX2) loads elements from arbitrary memory addresses given a vector of indices. This is essential for indirect memory access patterns — lookup tables, sparse arrays, permutations. However, gathers are significantly slower than contiguous loads because each lane may touch a different cache line.
Scatter (AVX-512) is the reverse: storing elements at arbitrary addresses given a vector of indices.
/* ── AVX2 Gather ── */
float base[] = { 10.f, 20.f, 30.f, 40.f, 50.f, 60.f, 70.f, 80.f };
__m256i indices = _mm256_set_epi32(7,5,3,1, 6,4,2,0);           // reversed lane order
__m256  all_ones = _mm256_castsi256_ps(_mm256_set1_epi32(-1));  // mask: all lanes active
__m256  src      = _mm256_setzero_ps();                         // fallback for inactive lanes

__m256 gathered = _mm256_mask_i32gather_ps(
    src,        // passthrough value for masked-off lanes
    base,       // base pointer
    indices,    // vector of int32 indices
    all_ones,   // mask (sign bit = active lane) — note: a float vector, not __m256i
    4           // scale: byte offset = index * scale. Valid values: 1, 2, 4, 8
);
// gathered = [base[0], base[2], base[4], base[6], base[1], base[3], base[5], base[7]]

/* ── AVX-512 Scatter ── */
__m512i zidx = _mm512_set_epi32(15,13,11,9, 7,5,3,1, 14,12,10,8, 6,4,2,0);
_mm512_i32scatter_ps(output, zidx, data_vec, 4);   // writes to scattered addresses
// Scatter latency is high (~20+ cycles) due to potential address conflicts.
// Watch out for duplicate scatter indices: stores happen in element order,
// so the highest overlapping lane wins — rarely what you want.
AVX2 gather is often slower than a scalar loop for small arrays that fit in L1 cache, because the gather instruction has high latency (≈20 cycles) and limited throughput. Only use gather when (1) the array doesn't fit in L1/L2, (2) the access pattern is highly irregular, AND (3) you can execute several independent gathers in parallel to hide latency. Benchmark before committing to gather.
Masking & Predicates
Masking is how SIMD handles conditional logic. Instead of branching, you compute results for all lanes and use a mask to select which results are kept.
Blend masking (SSE4.1 / AVX)
Before AVX-512, masking was done with blend operations. You compute the comparison result as a vector of 0s and 0xFFFFFF…s, then use blendv to select between two vectors element-wise. This is equivalent to a vectorized ternary operator.
/* ── Pattern: vectorized clamp (min/max) ── */
static inline __m256 clamp_ps(__m256 x, __m256 lo, __m256 hi) {
    x = _mm256_max_ps(x, lo);   // x = max(x, lo)
    x = _mm256_min_ps(x, hi);   // x = min(x, hi)
    return x;
}

/* ── Pattern: vectorized abs for float ── */
static inline __m256 abs_ps(__m256 x) {
    // Float sign is the MSB. Clear it by ANDing with 0x7FFFFFFF.
    __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x7FFFFFFF));
    return _mm256_and_ps(x, mask);   // strip sign bit
}

/* ── Pattern: vectorized negate ── */
static inline __m256 neg_ps(__m256 x) {
    __m256 sign_mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x80000000));
    return _mm256_xor_ps(x, sign_mask);   // flip sign bit
}

/* ── Pattern: conditional with two outcomes ── */
// Scalar: result[i] = (x[i] > 0) ? a[i] : b[i]
__m256 conditional_select(__m256 x, __m256 a, __m256 b) {
    __m256 mask = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OQ);
    return _mm256_blendv_ps(b, a, mask);   // a where mask=1, b where mask=0
}
AVX-512 opmask registers — a paradigm shift
AVX-512 introduced opmask registers: k0–k7. These are compact 8-bit to 64-bit integer registers where each bit corresponds to one lane of the vector operation. Instead of wasting a full vector register for masking, you now use a tiny dedicated mask register.
The payoff: almost every AVX-512 instruction has a built-in {k1} masking suffix. You don't need separate blend instructions — masking is embedded directly in the arithmetic instruction. You can also use zero-masking ({k1}{z}) where inactive lanes are zeroed, or merge-masking where they retain their previous value.
/* ── AVX-512 opmask examples ── */
__m512 a, b, c;
__mmask16 k;   // 16-bit mask for 16×f32 (512-bit register)

/* Compare to produce a mask (not a vector of floats!) */
k = _mm512_cmp_ps_mask(a, b, _CMP_LT_OQ);
// k is now a 16-bit integer where bit i = 1 if a[i] < b[i]

/* Masked add: only adds where mask bit = 1, merges src otherwise */
c = _mm512_mask_add_ps(
    c,      // src: pass-through for inactive lanes (merge masking)
    k,      // mask
    a, b    // operands
);

/* Zero-masked add: inactive lanes are set to 0 */
c = _mm512_maskz_add_ps(k, a, b);

/* Mask arithmetic (scalar bitwise ops on mask registers) */
__mmask16 k2     = _mm512_cmp_ps_mask(a, _mm512_setzero_ps(), _CMP_GT_OQ);
__mmask16 both   = k & k2;              // lanes where a[i] < b[i] AND a[i] > 0
__mmask16 either = k | k2;              // lanes where a[i] < b[i] OR  a[i] > 0
int n_active     = _mm_popcnt_u32(k);   // count active lanes

/* Convert mask to integer (scalar) and back */
uint32_t scalar_mask  = (uint32_t)k;
__mmask16 from_scalar = (__mmask16)scalar_mask;

/* Masked store: only write active lanes to memory (safe for loop tails!) */
_mm512_mask_storeu_ps(ptr, k, data);   // writes only where k bit = 1
One of the most powerful uses of AVX-512 masking is clean loop tail handling. Instead of scalar fallback code for the last <16 elements, create a tail mask: __mmask16 tail = (1u << remaining) - 1, then use masked load/compute/store with this mask. This eliminates all scalar fallback code and is the recommended AVX-512 idiom.
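A minimal sketch of that idiom for an element-wise add (the function name is illustrative; assumes AVX-512F):

#include <immintrin.h>
#include <stddef.h>

void add_arrays_avx512(float *y, const float *x, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {                        // full 16-lane iterations
        __m512 vy = _mm512_loadu_ps(y + i);
        __m512 vx = _mm512_loadu_ps(x + i);
        _mm512_storeu_ps(y + i, _mm512_add_ps(vy, vx));
    }
    size_t rem = n - i;                                   // 0..15 leftover elements
    if (rem) {
        __mmask16 tail = (__mmask16)((1u << rem) - 1);    // low 'rem' bits set
        __m512 vy = _mm512_maskz_loadu_ps(tail, y + i);   // inactive lanes read as 0
        __m512 vx = _mm512_maskz_loadu_ps(tail, x + i);
        _mm512_mask_storeu_ps(y + i, tail,                // write active lanes only
                              _mm512_add_ps(vy, vx));
    }
}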
Shuffle, Permute & Blend
Shuffles are the most powerful — and most confusing — part of SIMD. They let you rearrange data within and between registers, enabling algorithms that seem inherently sequential.
The shuffle taxonomy
SIMD shuffle instructions form a complex family. Understanding the distinctions is critical for choosing the right instruction and understanding performance:
In-lane vs. cross-lane — Some shuffle instructions only rearrange elements within a 128-bit lane of a 256-bit register. The 256-bit register is treated as two independent 128-bit halves. Cross-lane operations can move elements between the two halves but are typically slower.
Immediate vs. variable index — Some shuffles use a compile-time immediate (fixed at compile time): _mm_shuffle_ps(a, b, 0b11001001). Others use a runtime vector index: _mm256_permutevar8x32_ps(a, idx). Variable shuffles are more flexible but may have higher latency.
Within-register vs. two-register — Some shuffles pick elements from a single source. Others interleave or blend from two sources.
The 128-bit lane boundary in AVX2
A fundamental quirk of AVX2: most operations work within the 128-bit lane boundary. A 256-bit register is effectively treated as two independent 128-bit registers for most shuffle operations. VPSHUFB (byte shuffle) operates within each 128-bit lane independently, for example. To move data across the 128-bit boundary, you need VPERM2F128, VPERMPD, or VPERMPS.
/* ── PSHUFD: rearrange 32-bit integers (in-lane, immediate) ── */
// Immediate format: one 2-bit source selector per output lane (8-bit immediate)
// _MM_SHUFFLE(z, y, x, w) macro builds the immediate
__m128i a = _mm_set_epi32(3, 2, 1, 0);
__m128i r = _mm_shuffle_epi32(a, _MM_SHUFFLE(0,1,2,3));
// r lanes = [3,2,1,0] — reversed

/* ── SHUFPS: blend elements from two 128-bit float vectors ── */
__m128 fa = _mm_set_ps(3.f, 2.f, 1.f, 0.f);
__m128 fb = _mm_set_ps(7.f, 6.f, 5.f, 4.f);
__m128 fr = _mm_shuffle_ps(fa, fb, _MM_SHUFFLE(3,2,1,0));
// lower 2 lanes from fa, upper 2 from fb

/* ── PSHUFB (SSSE3): byte-granularity shuffle, in-lane, variable ── */
// Index byte 0..15: which source byte goes to each destination byte
// If index bit 7 = 1: output byte is zeroed
__m128i data = _mm_set_epi8(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
__m128i ctrl = _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
__m128i shuf = _mm_shuffle_epi8(data, ctrl);   // byte-reverse

/* ── AVX2 VPERMD: fully variable 32-bit element permute across 256b ── */
__m256i src = _mm256_set_epi32(7,6,5,4,3,2,1,0);
__m256i idx = _mm256_set_epi32(0,2,4,6,1,3,5,7);   // pick any src lane
__m256i out = _mm256_permutevar8x32_epi32(src, idx);
// Crosses the 128-bit lane boundary! Indices 0..7 select from any lane in src.
// Very useful — this is the "full permute" x86 lacked until Haswell's AVX2.

/* ── Unpack / interleave ── */
__m128i b  = _mm_set_epi32(7, 6, 5, 4);
__m128i lo = _mm_unpacklo_epi32(a, b);   // [a0,b0,a1,b1]
__m128i hi = _mm_unpackhi_epi32(a, b);   // [a2,b2,a3,b3]
// Unpack is the fundamental building block for transposing data (AoS ↔ SoA)

/* ── Blend (immediate) ── */
// BLENDPS: pick each float32 from a or b based on immediate bit
__m128 blended = _mm_blend_ps(fa, fb, 0b1010);
// bit 0=0: take fa[0], bit 1=1: take fb[1], etc.
// [fa[0], fb[1], fa[2], fb[3]]
AoS vs SoA — the layout problem that shuffles solve
One of the most important applications of SIMD shuffles is data layout transformation. Most code naturally uses Array of Structures (AoS) layout: [{x,y,z}, {x,y,z}, {x,y,z}, ...]. But SIMD prefers Structure of Arrays (SoA) layout: {[x,x,x,...], [y,y,y,...], [z,z,z,...]} where each component is contiguous in memory.
When your data is AoS (often unavoidable due to cache locality needs in scalar code), you must transpose it into SoA before SIMD processing and transpose back afterward. A 4×4 float transpose using unpack and blend operations is a classic example:
// Input: 4 rows of 4 floats (row-major)
// Output: transposed in-place using xmm registers
static inline void transpose4x4(__m128 *r0, __m128 *r1, __m128 *r2, __m128 *r3) {
    // Step 1: interleave pairs of rows
    __m128 t0 = _mm_unpacklo_ps(*r0, *r1);   // [r0[0],r1[0],r0[1],r1[1]]
    __m128 t1 = _mm_unpackhi_ps(*r0, *r1);   // [r0[2],r1[2],r0[3],r1[3]]
    __m128 t2 = _mm_unpacklo_ps(*r2, *r3);   // [r2[0],r3[0],r2[1],r3[1]]
    __m128 t3 = _mm_unpackhi_ps(*r2, *r3);   // [r2[2],r3[2],r2[3],r3[3]]

    // Step 2: interleave pairs of pairs
    *r0 = _mm_movelh_ps(t0, t2);   // [r0[0],r1[0],r2[0],r3[0]] — column 0
    *r1 = _mm_movehl_ps(t2, t0);   // [r0[1],r1[1],r2[1],r3[1]] — column 1
    *r2 = _mm_movelh_ps(t1, t3);   // [r0[2],r1[2],r2[2],r3[2]] — column 2
    *r3 = _mm_movehl_ps(t3, t1);   // [r0[3],r1[3],r2[3],r3[3]] — column 3
}
// 8 instructions total. The macro _MM_TRANSPOSE4_PS does exactly this.
Vectorized Algorithms
This chapter shows how real algorithms are vectorized — with the reasoning behind every decision.
Dot product (horizontal reduction)
The dot product of two vectors requires a reduction: summing all lane results into a single scalar. Reductions are expensive because SIMD is fundamentally parallel — combining across lanes requires shuffles. The pattern is to progressively halve the register, adding pairs:
float dot_product_avx(const float* a, const float* b, int n) {
    __m256 acc0 = _mm256_setzero_ps();   // 4 accumulators to break the FMA
    __m256 acc1 = _mm256_setzero_ps();   // dependency chain: each FMA waits ~4 cycles
    __m256 acc2 = _mm256_setzero_ps();   // for its accumulator; using 4 independent
    __m256 acc3 = _mm256_setzero_ps();   // ones hides that latency

    int i = 0;
    for (; i <= n - 32; i += 32) {       // process 32 floats per iteration
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i   ), _mm256_loadu_ps(b+i   ), acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+ 8), _mm256_loadu_ps(b+i+ 8), acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+16), _mm256_loadu_ps(b+i+16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+24), _mm256_loadu_ps(b+i+24), acc3);
    }

    /* Combine accumulators */
    acc0 = _mm256_add_ps(acc0, acc1);
    acc2 = _mm256_add_ps(acc2, acc3);
    acc0 = _mm256_add_ps(acc0, acc2);

    /* ── Horizontal reduction: 8 floats → 1 float ── */
    // Step 1: add upper 128b half to lower half
    __m128 hi  = _mm256_extractf128_ps(acc0, 1);   // lanes 4..7
    __m128 lo  = _mm256_castps256_ps128(acc0);     // lanes 0..3 (zero-cost cast)
    __m128 sum = _mm_add_ps(lo, hi);               // [0+4, 1+5, 2+6, 3+7]
    // Step 2: hadd pairs
    sum = _mm_hadd_ps(sum, sum);   // [(0+4)+(1+5), (2+6)+(3+7), ...]
    sum = _mm_hadd_ps(sum, sum);   // [total, ...]
    float result = _mm_cvtss_f32(sum);   // extract lowest lane as scalar

    /* Scalar tail for remaining elements */
    for (; i < n; i++)
        result += a[i] * b[i];
    return result;
}
String processing — SIMD strcmp
SSE4.2 introduced PCMPISTRI and PCMPISTRM — hardware string comparison instructions designed for pattern matching. They can find null terminators, compare character sets, and find substrings in a single instruction. PCMPISTRI returns an index; PCMPISTRM returns a mask. The control byte selects the comparison mode.
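As one concrete (and heavily simplified) sketch, the classic SSE4.2 strlen uses PCMPISTRI in "equal each" mode against a zero vector; the implicit-length rules make it return the index of the first null byte in each 16-byte chunk. The function name is illustrative and this is not a production-safe routine (see the caveat in the comments):

#include <immintrin.h>
#include <stddef.h>

size_t strlen_sse42(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        // EQUAL_EACH against an empty (all-invalid) string: the result is true
        // exactly at and after chunk's null terminator, so the returned index
        // is the position of the first null byte, or 16 if none is present.
        int idx = _mm_cmpistri(zero, chunk, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx < 16) return i + idx;   // terminator found in this chunk
        i += 16;
    }
}
// Caveat: this reads 16 bytes at a time and may read past the terminator;
// production code first aligns s to 16 bytes so the over-read never crosses a page.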
For general SIMD string processing without SSE4.2, you process 16 or 32 characters at a time using PSHUFB for table lookups:
#include <ctype.h>
#include <emmintrin.h>

// Convert ASCII uppercase to lowercase, 16 bytes at a time
void to_lower_sse(char* s, int n) {
    __m128i upper_A  = _mm_set1_epi8('A' - 1);   // 0x40
    __m128i upper_Z  = _mm_set1_epi8('Z' + 1);   // 0x5B
    __m128i lo_delta = _mm_set1_epi8(32);        // 'a' - 'A' = 32

    int i = 0;
    for (; i <= n - 16; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        // Byte is uppercase if 'A'-1 < byte < 'Z'+1 (i.e. 'A' <= byte <= 'Z').
        // There is no _mm_cmple_epi8, so both bounds use signed greater-than.
        __m128i above_A  = _mm_cmpgt_epi8(chunk, upper_A);
        __m128i below_Z  = _mm_cmpgt_epi8(upper_Z, chunk);
        __m128i is_upper = _mm_and_si128(above_A, below_Z);   // both conditions

        // Only add 32 to uppercase bytes (AND the mask with the delta)
        __m128i delta = _mm_and_si128(is_upper, lo_delta);
        chunk = _mm_add_epi8(chunk, delta);
        _mm_storeu_si128((__m128i *)(s + i), chunk);
    }
    for (; i < n; i++)
        s[i] = (char)tolower((unsigned char)s[i]);   // scalar tail
}
Compiler Auto-Vectorization
Modern compilers (GCC, Clang, MSVC) can automatically generate SIMD code from scalar loops. Understanding when and why this works — and when it fails — is essential.
What the vectorizer needs
Auto-vectorization is a loop transformation. The compiler analyzes a scalar loop and asks: can I replace N iterations of this loop with 1 iteration processing N elements at once? For this to be safe and profitable, the compiler needs to verify:
1. No loop-carried dependencies — Iteration N must not read a value written by iteration N-1. a[i] = a[i-1] + 1 cannot be vectorized because each element depends on the previous. a[i] = b[i] * 2 can be vectorized.
2. No pointer aliasing — If two pointers might point to overlapping memory, the compiler can't reorder the accesses. Use restrict to tell the compiler the pointers don't alias.
3. Known or analyzable trip count — The compiler needs to know the loop executes a reasonable number of times. A loop over N elements where N is known at compile time or can be propagated from a function parameter is easily vectorized.
4. Simple control flow — Branches inside the loop body break vectorization unless the compiler can convert them to predicated SIMD operations (blend/mask).
Compiler flags for vectorization
# GCC / Clang — enable auto-vectorization and target ISA
gcc   -O2 -march=native -ftree-vectorize myfile.c
clang -O2 -march=native -fvectorize      myfile.c
# -O2 enables basic vectorization; -O3 enables more aggressive
# -march=native:    use all features of the current CPU
# -march=x86-64-v3: target AVX2+FMA baseline (common server minimum)
# -march=x86-64-v4: target AVX-512

# Get vectorization report (what was/wasn't vectorized and why)
gcc   -O2 -march=native -fopt-info-vec-optimized -fopt-info-vec-missed myfile.c
clang -O2 -march=native -Rpass=loop-vectorize -Rpass-missed=loop-vectorize myfile.c

# Enable FMA and AVX2 explicitly
gcc -O2 -mavx2 -mfma myfile.c

# MSVC
cl /O2 /arch:AVX2 /Qvec-report:2 myfile.c
/* ── Use restrict to eliminate aliasing concerns ── */
void add_arrays(float *restrict c, const float *restrict a,
                const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];      // vectorizes cleanly
}

/* ── Alignment hints ── */
__attribute__((aligned(32))) float buf[1024];   // static aligned array
// or: void f(float *__restrict__ __attribute__((aligned(32))) p);

/* ── Loop that auto-vectorizes to AVX2 (with -mavx2 -mfma) ── */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  // → VFMADD231PS
}

/* ── Loop that does NOT auto-vectorize (loop-carried dependency) ── */
void prefix_sum(float *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] += a[i-1];          // NOT vectorizable as-is
    // Vectorizing prefix sums requires a different algorithm (parallel scan)
}

/* ── GCC pragmas to force/prevent vectorization ── */
#pragma GCC ivdep       // "ignore vector dependencies" — trust me, no deps
for (int i = 0; i < n; i++) a[i] = b[i] + c[i];

#pragma GCC unroll 4    // manually unroll loop (helps expose ILP to vectorizer)
for (int i = 0; i < n; i++) a[i] *= 2.f;

/* OpenMP SIMD (portable) */
#pragma omp simd
for (int i = 0; i < n; i++) c[i] = a[i] * b[i];
Always verify auto-vectorization by inspecting the generated assembly. Compiler Explorer (godbolt.org) lets you type C code and see the assembly output with color-coded mapping. Look for vmovups, vaddps, vfmadd — these confirm vectorization happened. A loop with only movss and addss is scalar and was not vectorized.
Pitfalls & Anti-Patterns
SIMD code has failure modes that don't exist in scalar code. These are the most common ways to write code that's slower or subtly wrong.
1. The SSE/AVX transition penalty
Mixing legacy SSE instructions (without the v prefix) with VEX-encoded AVX instructions in the same code path triggers a false dependency hazard: the CPU must preserve the upper 128 bits of the ymm registers when executing legacy SSE, which creates a stall when switching between SSE and AVX code. Always use the VEX-encoded variants — GCC and Clang do this automatically with -mavx. The compiler inserts vzeroupper at function boundaries to avoid the penalty; a vzeroupper in the middle of a hot loop is a sign that SSE and AVX code are being mixed.
2. Excessive horizontal operations
HADD, HSUB, and other horizontal reductions look appealing but are often slow. They are typically implemented as a shuffle followed by a vertical operation, and even CPUs that implement them directly give them higher latency than two separate vertical ops. Prefer accumulating in multiple vertical accumulators and doing a single reduction at the end, as in the dot-product example; a shuffle-based reduction helper is sketched below.
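A sketch of such a shuffle-based horizontal sum for a __m256 of floats (the helper name is illustrative):

#include <immintrin.h>

static inline float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);          // lanes 0..3
    __m128 hi = _mm256_extractf128_ps(v, 1);        // lanes 4..7
    __m128 s  = _mm_add_ps(lo, hi);                 // [0+4, 1+5, 2+6, 3+7]
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));         // lane0 += lane2, lane1 += lane3
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));     // lane0 += lane1
    return _mm_cvtss_f32(s);                        // final scalar sum
}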
3. Denormal floats — catastrophic slowdown
IEEE 754 denormal (subnormal) numbers are values very close to zero. Some CPUs handle denormals in microcode — 50–100× slower than normal float operations. This is a real-world catastrophe in audio DSP and ML. The fix: set the FPU to flush-to-zero mode:
#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* ── Flush denormals to zero — critical for audio/ML performance ── */
static inline void disable_denormals(void) {
    // MXCSR bits: bit 15 = FZ (Flush-to-Zero), bit 6 = DAZ (Denormals-Are-Zero)
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // outputs: flush to zero
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // inputs: treat as zero
    // Equivalent: _mm_setcsr(_mm_getcsr() | 0x8040);
}

/* Check if a float value is denormal */
static inline bool is_denormal(float x) {
    uint32_t bits;
    memcpy(&bits, &x, 4);
    return ((bits & 0x7F800000) == 0) && ((bits & 0x007FFFFF) != 0);
}
4. Type punning — use memcpy, not pointer casts
A common pattern is treating the bits of a float as an integer (e.g., for the abs trick above). Never use a pointer cast — that's undefined behavior in C and C++ due to strict aliasing rules. Use memcpy, which the compiler optimizes to a register move:
// WRONG: undefined behavior (strict aliasing violation)
float f = 1.0f;
uint32_t i = *(uint32_t*)&f;    // ← DO NOT DO THIS

// CORRECT: use memcpy
float f2 = 1.0f;
uint32_t i2;
memcpy(&i2, &f2, 4);            // compiles to a single register move

// ALSO CORRECT: C++ std::bit_cast (C++20)
uint32_t i3 = std::bit_cast<uint32_t>(f2);

// ALSO CORRECT: union (C only, not C++)
union { float f; uint32_t i; } pun;
pun.f = 1.0f;
uint32_t i4 = pun.i;            // valid in C99, undefined in C++ (but works in practice)
5. Assuming faster = always use widest register
Wider is not always faster. As mentioned, AVX-512 can cause frequency downclocking on older Intel parts. Even AVX2 vs. SSE: if the kernel is memory-bound, shuffles data across the 128-bit lane boundary frequently, or runs on a microarchitecture that splits 256-bit instructions into two 128-bit µops (early AMD Zen, pre-Haswell Intel), the 128-bit version can match or beat the 256-bit one. Profile before widening.
6. Not unrolling enough — hiding FMA latency
FMA has a latency of 4–5 cycles on most architectures but a throughput of 0.5 cycles (2/cycle on port 0+1). If your FMA loop has a data dependency chain — each iteration's FMA uses the previous iteration's output as its accumulator — you're bottlenecked on latency (1 FMA per 4–5 cycles) instead of throughput (2 FMAs per cycle). The fix is to use 4–8 independent accumulators, as shown in the dot product example above.
ARM NEON & SVE
ARM's SIMD ecosystem is different from x86 in important ways — particularly SVE's scalable vector model, which is a paradigm shift.
NEON — ARM's fixed-width SIMD
NEON (Advanced SIMD) is ARM's equivalent of SSE/AVX. It provides 128-bit vector registers (v0–v31, also accessible as 64-bit d0–d31 and 32-bit s0–s31). Unlike x86, NEON was designed into AArch64 from the start — it's mandatory on all AArch64 implementations, not an optional extension.
NEON intrinsics use a different naming convention: v<op><dtype>_<lane_type>. For example: vaddq_f32 (add, q=128-bit, f32 elements). The q suffix means 128-bit ("quad"); without it, it's 64-bit ("double").
#include <arm_neon.h>

/* ── NEON types (analogous to Intel __m128) ── */
float32x4_t a, b, c;    // 4×float32  (128-bit)
int16x8_t   ia, ib;     // 8×int16    (128-bit)
uint8x16_t  ua;         // 16×uint8   (128-bit)
float32x2_t da;         // 2×float32  (64-bit, "doubleword")

/* ── Load / Store ── */
a = vld1q_f32(ptr);           // 128-bit load (no alignment requirement)
vst1q_f32(ptr, a);            // store
a = vdupq_n_f32(1.0f);        // broadcast (dup = duplicate)
a = vld1q_dup_f32(ptr);       // load one float and broadcast

/* Multi-register load (de-interleave) — unique NEON feature */
float32x4x3_t rgb = vld3q_f32(ptr);   // loads 12 floats, de-interleaves RGB
// rgb.val[0] = R channel, rgb.val[1] = G, rgb.val[2] = B
// This is the NEON killer feature for AoS→SoA — no manual transpose needed!

/* ── Arithmetic ── */
c = vaddq_f32(a, b);          // c[i] = a[i] + b[i]
c = vmulq_f32(a, b);          // c[i] = a[i] * b[i]
c = vmlaq_f32(c, a, b);       // c[i] = c[i] + a[i]*b[i]  (mul-accumulate)
c = vmaxq_f32(a, b);          // c[i] = max(a[i], b[i])
c = vrecpeq_f32(a);           // 1/a[i] approximation

/* FMA (AArch64 only, not ARMv7 NEON) */
c = vfmaq_f32(c, a, b);            // c[i] = c[i] + a[i]*b[i]  (true FMA)
c = vfmaq_lane_f32(c, a, da, 0);   // multiply each a[i] by da[0], accumulate

/* ── Comparison ── */
uint32x4_t mask = vcgtq_f32(a, b);   // a > b, returns 0xFFFFFFFF or 0
c = vbslq_f32(mask, a, b);           // blend: c[i] = mask[i] ? a[i] : b[i]

/* ── Shuffle / permute ── */
uint8x16_t tbl = vqtbl1q_u8(src, idx);   // table lookup (like x86 PSHUFB)
c = vextq_f32(a, b, 2);                  // [a2,a3,b0,b1] — extract and shift
c = vrev64q_f32(a);                      // reverse pairs: [a1,a0,a3,a2]
c = vtrn1q_f32(a, b);                    // [a0,b0,a2,b2] — transpose even

/* Horizontal reduce */
float total = vaddvq_f32(a);   // sum all 4 lanes → scalar (AArch64)
SVE — Scalable Vector Extension — a revolution
SVE (and its successor SVE2) is ARM's answer to a fundamental problem: SIMD width keeps changing (128→256→512 bits), but code written for one width doesn't automatically benefit from wider hardware. SVE decouples the algorithm from the hardware width.
The core concept: SVE vector length is unknown at compile time. You query the hardware length at runtime using svcntw() (count of 32-bit elements). You write loops that process vl elements per iteration where vl is this hardware value. The same binary runs on SVE hardware with 128, 256, 512, or 2048-bit vectors without recompilation.
#include <arm_sve.h>

void saxpy_sve(float* y, const float* x, float a, uint64_t n) {
    // svcntw() = number of float32 lanes — determined at RUNTIME by the hardware
    //   on 128-bit SVE: svcntw() = 4
    //   on 512-bit SVE: svcntw() = 16
    // Your code doesn't change.
    svfloat32_t va = svdup_n_f32(a);              // broadcast a to all SVE lanes

    uint64_t i = 0;
    for (; i < n; i += svcntw()) {
        // svwhilelt_b32: create predicate mask for i .. i+vl-1, limited to n
        svbool_t pg = svwhilelt_b32(i, n);        // handles the loop tail automatically!
        svfloat32_t vx = svld1_f32(pg, x + i);    // masked load
        svfloat32_t vy = svld1_f32(pg, y + i);    // masked load
        vy = svmla_f32_z(pg, vy, va, vx);         // y = a*x + y (FMA), zeroing predication
        svst1_f32(pg, y + i, vy);                 // masked store (safe at the tail)
    }
    // No scalar tail needed! svwhilelt automatically limits the last iteration.
}
// The same binary runs at optimal width on any SVE implementation:
// Neoverse V1 / AWS Graviton3 (256b), Fujitsu A64FX (512b), Neoverse N2 (128b), etc.
SVE's length-agnostic model is where SIMD is heading. RISC-V's V extension uses the same philosophy (via vsetvl). It solves the "compile once, run on all future hardware widths" problem that has plagued x86 SIMD for decades. If you're writing new code targeting SVE-capable ARM servers (AWS Graviton3 and newer, other Neoverse V1/N2-based parts), SVE is the preferred approach.
Performance Analysis & Profiling
Understanding performance requires measurement. Here are the tools, metrics, and mental models for analyzing SIMD code.
Peak throughput calculation
To know whether your code is achieving its potential, first calculate the theoretical peak. For a 3.5 GHz CPU with AVX2:
Float32 throughput = 3.5 GHz × 2 FMA units × 8 floats/unit × 2 ops/FMA = 112 GFLOPS
If your vectorized kernel achieves 30 GFLOPS, you're at 27% of peak. That's normal — memory bandwidth limits most real workloads. The arithmetic intensity (FLOPS per byte loaded) determines whether you're compute-bound or memory-bound. Use the roofline model: plot compute intensity vs. bandwidth ceiling.
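A back-of-the-envelope roofline check takes only a few lines. In this sketch the 112 GFLOPS peak comes from the calculation above; the bandwidth figure is an assumed sustained value — substitute a measured one:

#include <stdio.h>

int main(void) {
    double peak_gflops = 112.0;   // compute roof, from the calculation above
    double bw_gbytes   = 50.0;    // ASSUMED sustained memory bandwidth (GB/s)
    double flops       = 2.0;     // SAXPY: one multiply + one add per element
    double bytes       = 12.0;    // 8 bytes read + 4 bytes written per element
    double intensity   = flops / bytes;           // arithmetic intensity (FLOP/byte)
    double attainable  = intensity * bw_gbytes;   // bandwidth roof
    if (attainable > peak_gflops) attainable = peak_gflops;
    printf("intensity = %.2f FLOP/byte, attainable = %.1f GFLOPS\n",
           intensity, attainable);
    return 0;
}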
Key profiling tools
| Tool | Platform | What it measures |
|---|---|---|
| Intel VTune Profiler | Intel CPUs | Hotspots, vectorization, port utilization, cache misses, memory bandwidth |
| AMD uProf / CodeAnalyst | AMD CPUs | PMU counters, instruction mix, IPC |
| perf stat | Linux x86/ARM | Instruction count, IPC, cache misses, branch mispredictions |
| LIKWID | Linux x86 | FLOP counters, memory bandwidth, SIMD vectorization ratio |
| IACA / LLVM-MCA | Any (static analysis) | Static throughput/latency estimation without running code |
| Intel SDE | x86 (emulator) | AVX-512 on non-AVX-512 hardware, instruction mix analysis |
| Arm Streamline | ARM | NEON utilization, IPC, memory subsystem |
| Godbolt + objdump | All | Assembly inspection, verifying vectorization happened |
Amdahl's Law — the fundamental limit
If only 50% of your code is vectorizable, the maximum speedup from perfect SIMD is 2×, regardless of how much you accelerate the vectorized portion. Profile first, find the true hotspot, then vectorize. SIMD is pointless on code that spends 5% of time in a loop.
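The formula is worth keeping at hand; a short sketch:

// Amdahl's Law: overall speedup when a fraction p of runtime is
// accelerated by a factor s.   speedup = 1 / ((1 - p) + p / s)
double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }
// amdahl(0.5, 8.0) ≈ 1.78 — even an 8× SIMD win on half the runtime is under 2× overall.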
The memory bandwidth ceiling
On a Zen 4 system with DDR5-5600 memory, peak bandwidth is ~90 GB/s. A dense float32 SAXPY (y = a*x + y) reads 8 bytes and writes 4 bytes per element: 12 bytes/element. At peak bandwidth that's 7.5 billion elements/second, or about 15 GFLOPS (2 ops per element). The CPU can do 300+ GFLOPS of float32, so SAXPY is fully memory-bound at roughly 5% of compute peak. No amount of SIMD optimization changes this — the bottleneck is the memory bus.
The rule: vectorize compute-bound kernels first. Identify whether your hotspot is memory-bound (bandwidth-limited) or compute-bound (FLOP-limited) before writing a single intrinsic. perf stat -e instructions,cycles,cache-misses and comparing the resulting IPC (instructions per cycle) to theoretical peak gives you an immediate answer.
Intel Intrinsics Guide — intrinsics.guide — searchable reference for every intrinsic with latency/throughput data.
Agner Fog's optimization manuals — agner.org — the definitive reference for x86 instruction tables and microarchitecture details.
ISPC compiler — Intel SPMD Program Compiler — high-level SIMD programming model that generates ISA-portable code.
Highway library — google/highway — portable SIMD C++ library abstracting x86/ARM/RISC-V.
simdjson / simdutf — production examples of aggressive SIMD in JSON parsing and UTF-8 validation.