Complete Reference · ISPC 1.30 · Zero Background Assumed

Intel® ISPC Programming
From Scratch to Advanced

This guide explains not just what to write, but why it works — what happens inside the CPU, why certain patterns are fast or slow, and how every ISPC concept maps to real hardware. No prior knowledge of SIMD, vectorization, or parallel programming is assumed.

ISPC v1.30 · AVX2 / AVX-512 · ARM NEON · Intel Xe GPU · LLVM backend

01 Why ISPC Exists — The Problem It Solves

Modern CPUs are surprisingly capable of doing the same arithmetic operation on many pieces of data at once — not by running multiple threads, but by a feature built into the very arithmetic units themselves. This feature is called SIMD (Single Instruction, Multiple Data). But reaching that capability from regular C or C++ has historically been a nightmare. ISPC exists to fix that.

The problem with regular C

Imagine you want to multiply a million floats by 2. In C you write:

for (int i = 0; i < N; i++) {
    output[i] = input[i] * 2.0f;
}

This looks perfectly fine. But it does one multiplication per instruction. Your CPU, if it supports AVX2, could have done eight multiplications per instruction instead, in roughly the same time and at roughly the same energy cost. You're leaving 7 out of 8 operations on the table, every single iteration.

The compiler might auto-vectorize this loop (turn it into SIMD automatically), but auto-vectorization is notoriously fragile. Add a pointer that might alias, a branch inside the loop, an indirect array access, or a function call, and the compiler gives up and falls back to scalar code — silently, with no warning.

The problem with intrinsics

The "professional" way to use SIMD is via intrinsics — special C functions that map directly to SIMD assembly instructions. Here is the same multiply-by-two, written with AVX2 intrinsics:

#include <immintrin.h>

const __m256 two = _mm256_set1_ps(2.0f);               // broadcast 2.0 to all 8 lanes
for (int i = 0; i < N; i += 8) {                       // assumes N is a multiple of 8
    __m256 chunk  = _mm256_loadu_ps(&input[i]);        // load 8 floats
    __m256 result = _mm256_mul_ps(chunk, two);         // 8 multiplies at once
    _mm256_storeu_ps(&output[i], result);              // store 8 results
}

This is completely unreadable. You've replaced the obvious intent (multiply by 2) with loads of cryptic function names. If you want to add a branch, you now need to understand mask registers and blend instructions. If you want to target AVX-512 instead of AVX2, you rewrite everything. And there are over a thousand intrinsics to learn.

What ISPC gives you

ISPC lets you write something that looks almost exactly like the original scalar C — and compiles it into SIMD code that's as good as, or better than, handwritten intrinsics:

export void double_array(uniform float input[], uniform float output[], uniform int N) {
    foreach (i = 0 ... N) {
        output[i] = input[i] * 2.0f;
    }
}

That's it. One new keyword (foreach), and two qualifiers you'll learn shortly (export and uniform). The compiler handles the rest — loading 8 floats at once, doing 8 multiplications, storing 8 results, handling the tail if N isn't divisible by 8, and adapting to whatever SIMD width your target CPU supports.

💡 The Core Promise of ISPC

You write code that looks like it operates on one element at a time. ISPC compiles it so that it actually runs on 4, 8, or 16 elements simultaneously — transparently.

The speedup is real and large. On a CPU with 8-wide AVX2, you get roughly 5–6× faster code compared to scalar C for compute-bound loops. On AVX-512 with 16-wide vectors, you can approach 10–12×. These aren't theoretical — they are consistently achieved in production code at Intel, Pixar, DreamWorks, and game studios worldwide.

02 What SIMD Is — Inside the CPU

To understand ISPC deeply, you need to understand what's actually happening in the CPU's arithmetic units. This is the foundation that explains why certain ISPC patterns are fast and others are slow.

Scalar arithmetic: one operation at a time

In a normal scalar CPU, when you write a + b for two floats, the CPU loads a into a 32-bit register, loads b into another 32-bit register, runs the addition, and stores the result. One instruction, one result.

SIMD registers: wide containers for multiple values

SIMD CPUs have special wide registers. An AVX2 CPU, for example, has 256-bit registers (named ymm0 through ymm15). You can think of each register as a container that holds eight 32-bit floats side by side:

Scalar float (the low 32 bits of xmm0):
┌──────────┐
│ 3.14159… │   → holds 1 float
└──────────┘
   32 bits

AVX2 256-bit register (ymm0):
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ 1.0 │ 2.0 │ 3.0 │ 4.0 │ 5.0 │ 6.0 │ 7.0 │ 8.0 │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
lane:   0     1     2     3     4     5     6     7
   256 bits = 8 × 32-bit floats

Each of the 8 slots is called a lane. Now here's the key insight: when you execute one SIMD add instruction on two such registers, you get eight additions for the price of one instruction:

  ymm0: │ 1.0 │ 2.0 │ 3.0 │ 4.0 │ 5.0 │ 6.0 │ 7.0 │ 8.0 │
+ ymm1: │ 0.5 │ 0.5 │ 0.5 │ 0.5 │ 0.5 │ 0.5 │ 0.5 │ 0.5 │
= ymm2: │ 1.5 │ 2.5 │ 3.5 │ 4.5 │ 5.5 │ 6.5 │ 7.5 │ 8.5 │

1 instruction → 8 additions. All happen in parallel, in the same clock cycle.

Different CPUs, different widths

Not all CPUs have the same width. This is crucial because ISPC has to target whatever hardware is available:

ISA        | Register width | Floats per register | Available on
-----------|----------------|---------------------|----------------------------------
SSE4       | 128 bits       | 4 floats            | Any x86 from ~2007+
AVX / AVX2 | 256 bits       | 8 floats            | Intel Haswell+ (2013+), AMD Zen+
AVX-512    | 512 bits       | 16 floats           | Skylake-X, Ice Lake, Zen 4+
ARM NEON   | 128 bits       | 4 floats            | All ARM Cortex-A / Apple Silicon

ISPC abstracts over this width. When you write foreach, it generates the right instructions for whichever target you compile for. The same source file works on SSE4, AVX2, and AVX-512.

🧠 Analogy

Think of SIMD like a cashier with a multi-item barcode scanner vs. a regular single-item scanner. The multi-item scanner (SIMD) can scan 8 items in the same time it takes the regular scanner to scan 1. The cashier (CPU core) does the same physical action — but processes far more items per second. ISPC is the system that decides how to pack items onto the scanner optimally.

Why compilers struggle with auto-vectorization

Compilers try to auto-vectorize, but they must be conservative. They can only vectorize when they can prove it's safe. Common blockers:

  • Pointer aliasing — if two pointers might point to overlapping memory, the compiler can't reorder or batch reads and writes
  • Data-dependent branches — if the loop body has an if that depends on the data, the compiler often gives up
  • Function calls — opaque functions break vectorization unless the compiler can inline them
  • Indirect accesses — reading from a[b[i]] (a random location per element) requires special gather instructions the compiler rarely emits
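Aliasing, the first blocker above, has a standard C-side mitigation: the `restrict` qualifier promises the compiler that two pointers never overlap, which removes the blocker for simple loops. A minimal sketch (the function name is illustrative, not from ISPC):

```c
#include <stddef.h>

/* Hypothetical example: 'restrict' promises that out and in never
 * overlap, so the compiler is free to batch the loads and stores. */
void scale_restrict(float *restrict out, const float *restrict in,
                    size_t n, float factor) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * factor;
}
```

This helps only with aliasing; the other blockers (divergent branches, opaque calls, gathers) still defeat auto-vectorization, which is why ISPC's explicit model exists.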

ISPC sidesteps this by using a programming model where the vectorization is explicit in the language semantics, not inferred from scalar code.

03 The SPMD Programming Model

ISPC is built on a programming model called SPMD — Single Program, Multiple Data. It's the same model used by GPU shaders (GLSL, HLSL), CUDA, and OpenCL. Understanding it is the mental key to everything in ISPC.

What SPMD means in plain English

You write one program. It looks like it's doing one thing. But behind the scenes, the hardware runs multiple copies of that program simultaneously, each working on different data.

Think of it like a restaurant kitchen. You give the same recipe (the program) to four cooks (the program instances). Each cook follows the same steps, but they're working on different plates of food (the data). They all do the same operations, in the same order — just on different ingredients.

This is fundamentally different from multi-threading, where different threads might run completely different code at the same time.

How this maps to SIMD hardware

ISPC maps each "copy" of your program (called a program instance) to one lane of a SIMD register. If you compile for AVX2 (8-wide), your program runs as 8 simultaneous instances, each in its own lane:

You write:                        ISPC runs it as:

float v = data[i];           →    lane0: v = data[0]
float result = v * 2.0f;          lane1: v = data[1]   ← all happen simultaneously
output[i] = result;               lane2: v = data[2]     in one SIMD instruction
                                  lane3: v = data[3]
                                  lane4: v = data[4]
                                  lane5: v = data[5]
                                  lane6: v = data[6]
                                  lane7: v = data[7]

Each program instance has a different value of i (its position in the data), so data[i] reads from different memory locations. But the multiplication operation is the same for all of them — so the CPU does all 8 multiplications with one instruction.

The key insight: you think scalar, the CPU runs SIMD

You don't write "multiply these 8 floats." You write "multiply this float." ISPC figures out how to run 8 copies of your scalar-looking program as actual SIMD instructions. This is the "Implicit" in "Intel Implicit SPMD Program Compiler."

Compare to explicit SIMD (intrinsics), where you write the 8-at-a-time version directly. Or auto-vectorization, where the compiler tries to infer it. SPMD is the middle ground: you declare the parallelism implicitly through the language model, and the compiler makes it concrete.
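As a mental model only, the SPMD scheme can be sketched in plain scalar C: an outer loop steps one gang at a time, and an inner loop plays the role of the lanes. In real ISPC the inner loop disappears into SIMD instructions; the name `spmd_double` and the fixed gang width are illustrative.

```c
#define GANG 8   /* pretend gang width, like the avx2-i32x8 target */

/* Illustrative scalar model of SPMD execution: the outer loop advances
 * one gang at a time; the inner loop plays the 8 program instances.
 * Real ISPC replaces the inner loop with SIMD instructions. */
void spmd_double(const float *data, float *out, int n) {
    for (int base = 0; base < n; base += GANG) {     /* one gang step */
        for (int lane = 0; lane < GANG; lane++) {    /* the instances */
            int i = base + lane;                     /* varying i     */
            if (i < n)                               /* tail masking  */
                out[i] = data[i] * 2.0f;
        }
    }
}
```

Note that the `i < n` guard models how ISPC masks off lanes past the end of the data when n isn't a multiple of the gang width.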

04 Installation

Linux

# Snap (always gets the latest release automatically)
sudo snap install ispc

# Ubuntu / Debian apt
sudo apt install ispc   # may be an older version

# Manual install from GitHub (always up-to-date)
wget https://github.com/ispc/ispc/releases/latest/download/ispc-v1.30.0-linux.tar.gz
tar xzf ispc-v1.30.0-linux.tar.gz
export PATH="$PWD/ispc-v1.30.0-linux/bin:$PATH"

macOS

brew install ispc

Windows

# winget
winget install Intel.ISPC

# Or download the zip from GitHub releases, extract, add bin/ to PATH
# Also install Visual C++ Redistributable (required for the runtime)

Verify the installation

ispc --version
# Should print: Intel(r) Implicit SPMD Program Compiler (Intel(r) ISPC), 1.30.0 ...

ispc --help       # lists all flags
ispc --help-dev   # more internal/debug flags

You'll also need a C++ compiler (GCC, Clang, or MSVC) to compile the C++ host code that calls your ISPC kernels.

05 Your First ISPC Program — Explained Line by Line

Let's build a complete, working ISPC program from scratch and understand exactly what every piece does and why it's needed.

The problem we're solving

We want to compute the square root of a million numbers as fast as possible. Scalar C would do this with a loop. ISPC will do 8 square roots per instruction (on AVX2).

The ISPC kernel — sqrt_kernel.ispc

// This is the ISPC kernel file. It compiles separately from your C++ code.

export void fast_sqrt(
        uniform float input[],    // pointer to input array (shared by all instances)
        uniform float output[],   // pointer to output array
        uniform int   count) {    // how many elements to process

    foreach (i = 0 ... count) {
        output[i] = sqrt(input[i]);
    }
}

Let's go through each new thing:

1
export — This makes the function callable from C or C++. Without it, the function is internal to ISPC and can't be called from outside. Think of it like declaring a public API. Every function that C++ will call must be exported.
2
uniform float input[] — The word uniform means this value is the same across all program instances. The pointer itself doesn't change — all 8 instances (lanes) share the same pointer. It's still an array; each instance will read from a different offset within that array. We'll explain uniform vs varying in detail in Chapter 8.
3
foreach (i = 0 ... count) — This is ISPC's main parallel loop construct. It distributes the range 0 to count-1 across the program instances automatically. On an 8-wide target, the first "iteration" processes elements 0–7 simultaneously, the next processes 8–15, and so on. If count isn't a multiple of 8, ISPC handles the leftover tail automatically.
4
sqrt(input[i]) — Inside a foreach, each instance has a different value of i. So input[i] reads from a different element per instance. The ISPC sqrt function compiles to a SIMD square root instruction — all 8 square roots happen in one instruction.

The C++ host code — main.cpp

#include <cstdlib>
#include <cstdio>
#include <cmath>
#include "sqrt_kernel_ispc.h"  // auto-generated by ISPC, declares fast_sqrt

int main() {
    const int N = 1024 * 1024;  // 1 million elements

    // Allocate 64-byte aligned memory (important for SIMD performance)
    float* input  = (float*)aligned_alloc(64, N * sizeof(float));
    float* output = (float*)aligned_alloc(64, N * sizeof(float));

    for (int i = 0; i < N; i++) input[i] = (float)i;

    // Call the ISPC kernel — this runs SIMD code under the hood
    ispc::fast_sqrt(input, output, N);

    // Verify a few results
    for (int i = 0; i < 5; i++)
        printf("sqrt(%d) = %f\n", i, output[i]);

    free(input); free(output);
}

The call ispc::fast_sqrt(input, output, N) looks completely normal. The ispc namespace is added automatically by the generated header. Inside, the function is running 8-wide SIMD instructions. From C++'s perspective, it's just a function call.

What the generated header looks like

When you compile with -h sqrt_kernel_ispc.h, ISPC generates this header for you:

#pragma once
#include <stdint.h>

namespace ispc {
    extern "C" {
        // ISPC 'uniform float' maps to C 'float'
        // ISPC 'uniform int' maps to C 'int32_t'
        void fast_sqrt(float* input, float* output, int32_t count);
    }
}

This is how the type mapping works: ISPC uniform types map directly to their C equivalents. The varying types (which we'll learn later) are SIMD registers and can't be passed across the C boundary directly.

06 Compilation — The Build Process

Unlike C++ where one compiler handles everything, ISPC uses a two-step process: first compile the ISPC kernel, then link it with your C++ code.

Step 1: Compile the ISPC kernel

# Basic compilation
ispc sqrt_kernel.ispc \
    -o sqrt_kernel.o \
    -h sqrt_kernel_ispc.h \
    --target=avx2-i32x8 \
    -O2

# -o       : output object file (linked later)
# -h       : output C++ header (included by main.cpp)
# --target : compile for 8-wide AVX2
# -O2      : optimization level (similar to GCC's -O2)

Step 2: Compile the C++ host and link

# Compile C++ host code
g++ -O2 -c main.cpp -o main.o

# Link everything together
g++ main.o sqrt_kernel.o -o my_program -lpthread

# Run
./my_program

The -lpthread is there for ISPC's task system (for multi-core parallelism): the task-system implementation you compile in (for example the tasksys.cpp shipped with ISPC's examples) typically uses pthreads. A kernel that doesn't use tasks generally doesn't need it.

What the --target flag does

The target tells ISPC two things: which instruction set to use, and how many program instances (gang width) to run simultaneously. The format is <isa>-<element-type>x<width>:

# Common targets
--target=sse4-i32x4       # SSE4.1, 4 int32 lanes → 4-wide gang
--target=avx2-i32x8       # AVX2, 8 int32 lanes → 8-wide gang
--target=avx2-i32x16      # AVX2, double-pumped → 16-wide gang
--target=avx512skx-x16    # AVX-512 Skylake-X, 16 lanes
--target=neon-i32x4       # ARM NEON, 4 lanes

# Auto-detect best on current machine
--target=host             # use whatever the host CPU supports

CMake integration

For real projects, you don't want to type these commands manually. CMake (3.19 and newer) supports ISPC as a first-class language:

# CMakeLists.txt
cmake_minimum_required(VERSION 3.19)
project(MyISPCProject CXX ISPC)

# This compiles the .ispc file and adds it as a library
add_library(sqrt_lib sqrt_kernel.ispc)
set_target_properties(sqrt_lib PROPERTIES
    ISPC_INSTRUCTION_SETS avx2-i32x8)   # SIMD target

add_executable(my_program main.cpp)
target_link_libraries(my_program sqrt_lib)

07 Gangs, Program Instances, and programIndex

Now let's go deeper into the execution model. These concepts are the foundation for understanding everything else in ISPC.

The gang: a group of program instances running together

When your C++ code calls an ISPC export function, ISPC doesn't run your code once — it runs a gang of program instances. Each instance executes the same instructions, but on different data. The number of instances in a gang is called the gang width, and it equals the SIMD width of your target.

On AVX2 (8-wide), a gang contains 8 program instances. They all start at the beginning of your function at the same time and run together — like a group of soldiers doing the same drill in formation.

programIndex: each instance knows its lane number

Each program instance has a built-in read-only variable called programIndex. It's different in each lane — lane 0 has programIndex == 0, lane 1 has programIndex == 1, and so on up to programCount - 1.

This is what lets each instance work on different data:

export void show_lanes(uniform float output[]) {
    // programIndex is different for every lane
    // This runs ONE gang, so it writes programCount values
    output[programIndex] = (float)programIndex;
}
// On AVX2 (8-wide): output = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

programCount is also a built-in — it's a compile-time constant equal to the gang width. On AVX2-i32x8, it's always 8.

How foreach uses this internally

The foreach loop is really a shorthand for a pattern using programIndex and programCount. This is what happens "under the hood":

// What you write:
foreach (i = 0 ... 1000) {
    output[i] = input[i] * 2.0f;
}

// What ISPC conceptually does (simplified):
for (uniform int base = 0; base < 1000; base += programCount) {
    // Each instance gets a different value of i
    int i = base + programIndex;   // varying: different per lane
    if (i < 1000) {               // handle tail (if 1000 not divisible by 8)
        output[i] = input[i] * 2.0f;
    }
}

In the first "iteration" of the outer loop: base=0, so i is [0,1,2,3,4,5,6,7] across the 8 lanes. All 8 loads, multiplies, and stores happen simultaneously. In the second "iteration": base=8, so i is [8,9,10,11,12,13,14,15]. And so on.

ℹ️ Always prefer foreach over manual loops

The manual version above has a bug risk (the tail handling is easy to get wrong) and may not optimize as well. Always use foreach for data-parallel loops. The manual version is shown here only to explain what foreach is doing.

The "gang" crosses the whole function call

The gang doesn't just exist for one loop — it's alive for the entire duration of the ISPC function call. If your ISPC function calls a helper function, the gang carries through. Local variables are stored per-lane (each lane has its own copy of varying variables). This is why ISPC programs look serial but actually have hidden parallelism everywhere.

08 Uniform vs Varying — The Most Important Distinction

This is the single most impactful concept in ISPC. Getting it wrong is the most common source of bugs and performance problems. Let's build a complete mental model from scratch.

The core question: "Is this value the same across all lanes?"

Every value in ISPC is either:

  • uniform: One copy exists, shared by all program instances (all lanes see the same value)
  • varying: One copy per lane — each of the 8 instances has its own independent value

The default in ISPC is varying. If you don't write a qualifier, the value is varying.

What they look like in hardware

uniform float u = 3.14f;
┌──────┐
│ 3.14 │   → 1 scalar register or immediate
└──────┘
All 8 lanes see the same 3.14.

varying float v;   (each lane has a different value)
┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
│  1.0  │  2.5  │ -0.3  │  7.1  │  0.0  │  4.4  │  9.9  │  3.3  │
└───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
lane:     0       1       2       3       4       5       6       7
→ 1 SIMD register (ymm0 on AVX2) holds all 8 values at once

A concrete example that shows why this matters

export void scale_array(uniform float data[],    // pointer = same for all lanes
                          uniform int   count,     // length = same for all lanes
                          uniform float factor) {  // scale = same for all lanes

    foreach (i = 0 ... count) {
        // 'i' here is varying: lane0 has i=0, lane1 has i=1, etc.
        // 'data[i]' is varying: each lane reads from a different address
        float val = data[i];  // 'val' is varying (different per lane)

        // 'factor' is uniform: one value, broadcast to all 8 lanes
        // 'val * factor' = SIMD multiply: ymm_val * broadcast(factor)
        data[i] = val * factor;
    }
}

Here, factor is uniform because it's a scale factor — the same number applies to every element. The CPU has one copy of it and broadcasts it across the SIMD multiply. val is varying because each lane is processing a different array element.

Why uniform matters for performance

Operations on uniform values use scalar instructions. Operations on varying values use SIMD instructions. The issue is not just instruction count — it's that certain things are much cheaper (or only possible) with uniform values:

Operation            | If the condition/value is uniform           | If the condition/value is varying
---------------------|---------------------------------------------|----------------------------------------------------
if (cond) { ... }    | Simple branch; only one path executes       | Both branches may run, with masking; up to 2× work
while (cond)         | Exits as soon as the scalar cond is false   | Runs until ALL lanes are false; slow lanes dominate
Function call target | One function call                           | Up to 8 different calls (one per lane), serialized
Array index          | One load + broadcast; fast                  | 8 independent loads (a gather); slow — see Chapter 12

The rules for uniform vs varying

Rule 1: Parameters from C++ are always uniform. C++ doesn't know about lanes — it hands you one value, which is the same for all lanes.

Rule 2: Inside foreach, the loop variable i is automatically varying — that's the whole point; each lane gets a different i.

Rule 3: Any arithmetic involving a varying value produces a varying result. It's "contagious" — once you mix a uniform and a varying, the result is varying.

Rule 4: You can convert uniform → varying implicitly (just broadcast). You cannot convert varying → uniform implicitly — you need an explicit reduction function like reduce_add().

uniform float u = 5.0f;
float          v = data[programIndex];  // varying

float result1 = u + v;    // OK: uniform + varying = varying (broadcast u first)
float result2 = v * v;    // OK: varying * varying = varying

uniform float bad = v;    // ERROR: can't put 8 different values into 1 uniform

uniform float total = reduce_add(v);  // OK: sum all 8 lane values → one uniform
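As a scalar sketch of what `reduce_add()` means (not ISPC's actual implementation, which uses SIMD horizontal adds), the per-lane copies of a varying value are collapsed into a single uniform result. The array argument stands in for the lanes of one SIMD register:

```c
#define GANG 8   /* pretend gang width */

/* Scalar model of reduce_add(): 'lanes' stands in for the per-lane
 * copies of a varying float held in one SIMD register. */
float reduce_add_sketch(const float lanes[GANG]) {
    float total = 0.0f;
    for (int l = 0; l < GANG; l++)
        total += lanes[l];
    return total;   /* one uniform value */
}
```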

When to mark things uniform

The compiler reports an error if you try to implicitly assign a varying value where a uniform is required, so you won't silently get wrong results. But for performance, the rule is: mark something uniform whenever you know it's the same across all lanes. Common candidates:

  • Array lengths, counts, iteration limits
  • Scale factors, thresholds, configuration values
  • Pointers to data that is read (not indexed per-lane)
  • Any value computed from uniform inputs only
  • Loop counters in outer (non-foreach) loops

09 The Execution Mask — How Branching Works in SIMD

This is one of the trickiest parts of SIMD programming. When not all lanes should execute the same code (because of an if statement, for example), the CPU uses a mask to selectively apply results. ISPC manages this mask automatically, but understanding it explains many performance characteristics.

The problem: SIMD can't truly branch

In scalar code, an if statement causes the CPU to either execute the true branch or the false branch — never both. That's what a branch instruction does.

In SIMD, you have 8 lanes running together. They can't physically take different branches — they share a single instruction pointer. So what happens when 4 lanes need the true branch and 4 need the false branch?

The answer is: both branches run, but results from the "wrong" branch are thrown away using a mask.

What the mask looks like

The execution mask is a bitmask — one bit per lane. When a bit is 1, that lane is "active" (its results are written). When a bit is 0, that lane is "masked off" (its results are discarded, or it avoids certain operations like memory writes).

int x = data[programIndex];     // x = [3, -1, 7, -2, 5, -4, 8, -6]

if (x > 0) {
    // Which lanes have x > 0?  → [1, 0, 1, 0, 1, 0, 1, 0]
    //                              ↑
    //                              this is the execution mask
    //                              1 = active, 0 = masked off
    result[programIndex] = x * 2;
    // Only lanes 0, 2, 4, 6 write. Lanes 1, 3, 5, 7 are masked.
    // But all 8 lanes run the multiply instruction!
    // The masked lanes just don't commit their results.
}

Predication: how masked writes work

Modern SIMD instructions support predicated writes (masked stores). When a lane is masked off, the store instruction knows not to write that lane's result to memory. The arithmetic still runs, but the result is silently discarded. This is safe but not free — the CPU still spends time on the instruction even for masked lanes.

The "both branches run" consequence

foreach (i = 0 ... N) {
    float val = data[i];

    if (val > 0.0f) {
        // Lanes where val > 0 are active here
        // Lanes where val ≤ 0 are masked
        output[i] = sqrt(val);   // SIMD sqrt runs for all 8 lanes
                                   // but only active lanes write results
    } else {
        // Now the other lanes are active
        output[i] = 0.0f;
    }
    // After the if-else: all lanes are active again
}

The consequence: if you have an expensive operation (like a transcendental function) inside an if that only applies to some lanes, it still runs for all lanes. The masked lanes just don't write their results. So divergent branches can be costly.
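The cost model can be sketched in scalar C: every lane computes both branch results, and the mask merely selects which one is kept. (Illustrative code; a reciprocal stands in for the expensive operation, guarded so inactive lanes don't divide by zero.)

```c
#define GANG 8   /* pretend gang width */

/* Scalar model of a varying if/else: BOTH branch results are computed
 * for every lane; the mask only selects which result is kept. */
void recip_or_zero(const float in[GANG], float out[GANG]) {
    for (int l = 0; l < GANG; l++) {
        int active  = in[l] > 0.0f;                   /* per-lane mask bit  */
        float t_res = 1.0f / (active ? in[l] : 1.0f); /* "true" branch work
                                                         runs on all lanes */
        float f_res = 0.0f;                           /* "false" branch too */
        out[l] = active ? t_res : f_res;              /* blend by mask      */
    }
}
```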

Gang convergence: ISPC's implicit guarantee

ISPC guarantees that all program instances in a gang will converge at the end of every structured control flow statement (end of if, end of loop body, etc.). This means:

  • You never have to worry about lanes getting "stuck" in a branch forever
  • After an if-else, all lanes are active again
  • This is called the implicit reconvergence guarantee, and it's what makes ISPC's model "implicit" SPMD
⚠️ Performance Implication

Because both sides of a varying branch run with masking, the cost of a branch is roughly: cost(true branch) + cost(false branch) rather than max(cost(true), cost(false)) as in scalar code. When both branches are expensive, consider restructuring your algorithm to reduce divergence.

10 Control Flow in Depth

Regular if: works, but costs both branches

foreach (i = 0 ... N) {
    float v = data[i];
    if (v < 0.0f) {    // varying condition → masking
        v = -v;          // negation runs for all lanes, masked writes
    }
    output[i] = v;
}

Coherent if (cif): checks the mask first

Sometimes you know that in practice, all lanes in a gang will almost always agree on a branch. For example, if you're processing particles and most of them are active vs. inactive, in most gangs either all 8 are active or all 8 are inactive (rarely 4 and 4). In that case, you can use cif (coherent if):

foreach (i = 0 ... N) {
    float v = data[i];
    cif (v < 0.0f) {    // coherent if: checks if ALL lanes agree
        v = -v;           // if all 8 lanes are negative: run unmasked (full speed)
                          // if some lanes differ: falls back to masked execution
    }
    output[i] = v;
}

cif generates a runtime check: "are all active lanes going the same direction?" If yes, it skips the masking overhead entirely. If not, it falls back to the normal masked path. The overhead of the check is tiny (a few instructions to examine the mask register), so cif is almost always at least as fast as regular if, and often much faster.

Similarly, cfor, cwhile, and cdo are coherent versions of loops. They check whether the loop condition is uniform across all lanes on each iteration.
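A scalar sketch of the check that coherent control flow adds, using absolute value as the branch body (illustrative; the real check examines the SIMD mask register in a few instructions):

```c
#define GANG 8   /* pretend gang width */

/* Scalar model of the extra check a coherent if performs. */
void abs_coherent(float v[GANG]) {
    int all_neg = 1, any_neg = 0;
    for (int l = 0; l < GANG; l++) {      /* cheap mask inspection */
        if (v[l] < 0.0f) any_neg = 1;
        else             all_neg = 0;
    }
    if (all_neg) {                        /* all lanes agree: unmasked fast path */
        for (int l = 0; l < GANG; l++) v[l] = -v[l];
    } else if (any_neg) {                 /* lanes diverge: masked execution */
        for (int l = 0; l < GANG; l++)
            if (v[l] < 0.0f) v[l] = -v[l];
    }
    /* no negative lanes at all: body skipped entirely */
}
```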

foreach_active: serialize over lanes

foreach_active is a special construct that runs its body once per active lane, but serially (one lane at a time). It "un-vectorizes" for a specific section of code. This is useful when you need to do something inherently serial — like inserting into a hash table — but only for the lanes where some condition is true.

foreach (i = 0 ... N) {
    float val = data[i];

    if (val > threshold) {
        // Some lanes are active here. We want to add each active lane's
        // value to a hash table — but that can't be done in parallel.
        foreach_active (lane) {
            // 'lane' is the index of the currently executing instance (uniform)
            // This body runs once for lane 0, then lane 2, then lane 5, etc.
            // (whichever are active in the outer if)
            // Inside here, everything is UNIFORM — scalar code.
            // 'val' is still a varying; extract() pulls this lane's copy
            // out as a uniform value.
            hashtable_insert(keys[lane], extract(val, lane));
        }
    }
}
ℹ️ When to use foreach_active

Use it when you have a data-dependent operation that can't be parallelized — like random writes to a shared data structure, hash table inserts, linked list manipulations. The key insight: even though the body runs serially, you've still gained efficiency because the outer foreach loaded and processed data 8-at-a-time; only the "special case" work is serialized.
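A scalar sketch of the serialization `foreach_active` performs, with a compacting append standing in for an inherently serial operation like a hash-table insert (all names illustrative):

```c
#define GANG 8   /* pretend gang width */

/* Scalar model of foreach_active: visit the active lanes one at a time
 * so that inherently serial work stays correct.  Returns how many
 * lanes were active (how many times the body ran). */
int collect_active(const int mask[GANG], const int vals[GANG], int out[]) {
    int n = 0;
    for (int lane = 0; lane < GANG; lane++)  /* serial, one lane at a time */
        if (mask[lane])                      /* only active lanes execute  */
            out[n++] = vals[lane];           /* scalar (uniform) work      */
    return n;
}
```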

foreach_unique: de-duplicate varying values

foreach_unique is a more specialized version. It iterates over all the distinct values of a varying variable, running the body once per unique value with all lanes sharing that value active:

foreach (i = 0 ... N) {
    // Suppose each element belongs to one of a few buckets:
    int bucket = data[i] / bucket_size;  // varying: lane0=2, lane1=2, lane2=5, lane3=2...

    // Process all elements that share the same bucket value together
    foreach_unique (b in bucket) {
        // 'b' is uniform (the current unique bucket value being processed)
        // All lanes where bucket==b are active
        // This block runs once for b=2 (with lanes 0,1,3 active),
        // then once for b=5 (with lane 2 active), etc.
        update_bucket_stats(b, ...);
    }
}

Loops with varying conditions

When a loop condition is varying, the loop continues until all active lanes are done — even if some lanes satisfied the exit condition many iterations ago:

foreach (i = 0 ... N) {
    float z = start[i];   // varying starting point per lane

    // This loop keeps running until ALL lanes have z >= threshold
    // Lane 0 might finish in 3 iterations, lane 5 in 20 iterations
    // The loop runs 20 times total — lanes that finished early just get masked
    while (z < threshold) {
        z = z * z + 0.1f;
    }

    output[i] = z;
}
// This is actually correct for Mandelbrot-like computations!
// But if you know most lanes finish at similar times, cwhile is faster.
// Replace the inner while loop with:
//
//     cwhile (z < threshold) {
//         z = z * z + 0.1f;
//     }
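The "slowest lane wins" behavior can be sketched in scalar C: the function below keeps iterating until every lane has met the exit condition and returns how many gang iterations that took (illustrative names).

```c
#define GANG 8   /* pretend gang width */

/* Scalar model of a while loop with a varying condition: the gang keeps
 * iterating until EVERY lane has met the exit condition; lanes that
 * finish early are masked off.  Returns the gang iteration count. */
int iterate_until(float z[GANG], float threshold) {
    int iters = 0;
    for (;;) {
        int any_active = 0;                    /* is any lane still going? */
        for (int l = 0; l < GANG; l++)
            if (z[l] < threshold) any_active = 1;
        if (!any_active) break;                /* exits only when ALL lanes
                                                  are done */
        for (int l = 0; l < GANG; l++)
            if (z[l] < threshold)              /* finished lanes: masked */
                z[l] = z[l] * z[l] + 0.1f;
        iters++;
    }
    return iters;
}
```

If one lane starts far behind the others, the whole gang pays for its extra iterations; the other lanes just sit masked.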

11 How Memory Loads Work in SIMD

Memory access patterns have an enormous impact on SIMD performance — often more than the arithmetic itself. Understanding this is essential to writing fast ISPC code.

The ideal case: sequential (contiguous) access

The most efficient SIMD memory access is when all 8 lanes read from consecutive memory locations. This is called a sequential or contiguous load. The CPU can handle this with a single instruction that loads a whole cache line into a SIMD register.

Memory:  [1.0][2.0][3.0][4.0][5.0][6.0][7.0][8.0][9.0][10.0]...
Lane:      0    1    2    3    4    5    6    7
Reads:     ↑    ↑    ↑    ↑    ↑    ↑    ↑    ↑
           └────┴────┴────┴────┴────┴────┴────┘
1 instruction: vmovaps ymm0, [ptr] loads all 8 values at once. FAST.

This is what happens when you access data[i] inside a foreach loop where i is a simple counter. Lane 0 reads data[0], lane 1 reads data[1], etc. — eight consecutive addresses, one load instruction.

What determines whether a load is sequential

The key is whether the addresses computed by each lane are consecutive. If they are, ISPC emits a fast vector load. The load type is determined by analyzing the address expression:

  • data[programIndex] — lane 0 reads index 0, lane 1 reads index 1, ..., sequential ✅
  • data[i] inside foreach (i = 0 ... N) — same thing, sequential ✅
  • data[i * 2] — stride 2, lane 0 reads index 0, lane 1 reads index 2, ... strided (less efficient but sometimes vectorizable)
  • data[indices[programIndex]] — each lane reads from a random location → gather ❌ (expensive)

12 Gather and Scatter — What They Are and Why They're Slow

Gather and scatter are the SIMD equivalents of reading from and writing to non-consecutive memory locations. They are the most important performance concept to understand in ISPC, and the most common source of unexpected slowdowns.

What is a gather?

A gather happens when each lane in a gang reads from a different, non-consecutive memory address. For example, if you have an array data[] and an array of indices idx[], and you do data[idx[i]] inside a foreach, each lane is computing a different index into data — and those indices can be anywhere in memory.

data:  [0.1][0.2][0.3][0.4][0.5][0.6][0.7][0.8][0.9][1.0][1.1][1.2]
        [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9] [10] [11]

idx (varying): [3, 8, 1, 10, 5, 0, 7, 11]   ← different per lane

Gather: each lane reads from its own arbitrary address:
    lane 0 → data[3]  = 0.4
    lane 1 → data[8]  = 0.9   ← these are scattered around in memory
    lane 2 → data[1]  = 0.2   ← no cache locality, potential cache misses
    lane 3 → data[10] = 1.1
    lane 4 → data[5]  = 0.6
    lane 5 → data[0]  = 0.1
    lane 6 → data[7]  = 0.8
    lane 7 → data[11] = 1.2

Result SIMD register: [0.4, 0.9, 0.2, 1.1, 0.6, 0.1, 0.8, 1.2]
Cost: potentially 8 separate cache misses. 4-10× slower than sequential.

Hardware does have instructions for gathers (e.g., vgatherdps on AVX2) — these can do all 8 reads in one instruction — but they're still much slower than a single contiguous load because the data is scattered in memory, causing more cache misses.

What is a scatter?

A scatter is the write equivalent of a gather. It happens when each lane tries to write to a different, non-consecutive memory address. It has the same performance implications as a gather, plus it must be careful about lanes writing to the same address (a race condition).

Writing to arbitrary locations (scatter):

    result[idx[programIndex]] = val;

    lane 0 writes val0 → result[3]
    lane 1 writes val1 → result[8]   ← these stores are scattered
    lane 2 writes val2 → result[1]
    ...

If two lanes write to the same index, the result is undefined!
This is a data race — ISPC does not protect against this.

How to recognize gather/scatter in your code

The rule is simple: if the index used to access an array is varying (different per lane), you get a gather (for reads) or scatter (for writes). If the index is uniform or a simple linear function of programIndex, you get a fast sequential or strided access.

uniform float data[1000];
uniform int   indices[1000];

foreach (i = 0 ... 1000) {
    // ✅ FAST: i increases by 1 per lane → one sequential vector load
    float v1 = data[i];

    // ❌ SLOW GATHER: indices[i] is varying with unpredictable values
    // Each lane computes a different index, memory locations are random
    float v2 = data[indices[i]];

    // ❌ SLOW GATHER: programIndex is varying and data access is indirect
    float v3 = data[indices[programIndex]];

    // ✅ FAST: uniform index → all lanes read the same value (broadcast)
    uniform int j = 42;
    float v4 = data[j];   // one load, broadcast to all 8 lanes
}ispc

ISPC will still compile code with gathers — it's not an error

ISPC silently generates gather/scatter instructions when your access patterns require them. You won't get an error or warning by default. This is by design — gathers are sometimes unavoidable. But they can be 4–10× slower than sequential accesses for large data. You can find them by looking at the assembly output (--emit-asm) and searching for vgather instructions.

How to avoid gathers

The primary strategy is to restructure your data so that elements accessed together in the same gang are also contiguous in memory. This leads us directly to the AOS vs SOA discussion in the next chapter.

When you truly need random access, a gather is fine — just be aware of its cost and don't try to avoid it by hoisting it out of the loop if it's actually needed per-element.

13 Array of Structures vs Structure of Arrays

This is the most impactful data layout choice in SIMD programming. It determines whether your memory accesses are sequential (fast) or gathered (slow).

Array of Structures (AOS) — the natural C way

The way you'd naturally define a particle in C:

struct Particle {
    float x, y, z;     // position
    float vx, vy, vz;  // velocity
    float mass;
};

Particle particles[N];C

In memory, this looks like:

Memory layout (AOS):

[x0][y0][z0][vx0][vy0][vz0][m0] [x1][y1][z1][vx1][vy1][vz1][m1] [x2]...
└──────── particle 0 ─────────┘ └──────── particle 1 ─────────┘ └────

Now imagine you want to update all X positions. In a SIMD loop, you'd want to load x0, x1, x2, x3, x4, x5, x6, x7 (eight X values) into a SIMD register. But look at the memory layout: x0, x1, x2, x3... are not consecutive! Between x0 and x1, there are y0, z0, vx0, vy0, vz0, m0 — 6 other floats. So the addresses you need are: ptr+0, ptr+7, ptr+14, ptr+21, ... (stride 7). That's a gather!

Structure of Arrays (SOA) — the SIMD-friendly way

The SOA layout separates each field into its own array:

struct Particles {   // Note: singular struct, plural name
    float x[N];
    float y[N];
    float z[N];
    float vx[N];
    float vy[N];
    float vz[N];
    float mass[N];
};

Particles particles;C

In memory:

Memory layout (SOA):

[x0][x1][x2][x3][x4][x5][x6][x7][x8]...[xN]
[y0][y1][y2][y3][y4][y5][y6][y7][y8]...[yN]
[z0][z1][z2][z3]...
...

Now when you want to load 8 X values, they're at consecutive addresses: x0, x1, x2, x3, x4, x5, x6, x7 — perfect for a single SIMD vector load. Zero gathers needed.

ISPC's soa keyword: the best of both worlds

Writing SOA manually is tedious because you lose the natural particle.x syntax. ISPC's soa<N> qualifier lets you declare a struct that looks like AOS but is stored as SOA:

// Declare the struct as SOA with a tile size matching your gang width (8 for AVX2)
soa<8> struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
};

// Memory layout with soa<8>:
// [x0..x7][y0..y7][z0..z7][vx0..vx7]...[x8..x15][y8..y15]...
// Grouped in tiles of 8

export void integrate(uniform soa<8> Particle particles[], uniform int N,
                        uniform float dt) {
    foreach (i = 0 ... N) {
        // Despite looking like field access, these are sequential loads!
        // particles[i].x loads x0,x1,...,x7 from consecutive addresses
        particles[i].x += particles[i].vx * dt;
        particles[i].y += particles[i].vy * dt;
        particles[i].z += particles[i].vz * dt;
    }
}ispc

The tile size must match the gang width

The 8 in soa<8> is the tile size — how many elements are grouped together in SOA layout. For best results, this should match your programCount. If you compile for AVX2 (gang width 8), use soa<8>. For AVX-512 (gang width 16), use soa<16>.

AOS-to-SOA conversion at runtime

Sometimes your data comes from C++ in AOS format (because C++ code uses natural struct arrays). ISPC provides built-in functions to convert:

// You have AOS data: [x0,y0,z0,w0, x1,y1,z1,w1, x2,y2,z2,w2, ...]
// You want the x, y, z, w values separated out

// aos_to_soa4() de-interleaves one gang's worth at a time: it reads
// 4*programCount consecutive floats and produces programCount x, y, z,
// and w values as four varying variables.

uniform float aos[N*4];  // input: interleaved

// (assumes N is a multiple of programCount)
for (uniform int i = 0; i < N; i += programCount) {
    float x, y, z, w;    // varying: programCount values of each field
    aos_to_soa4(&aos[4*i], &x, &y, &z, &w);

    // ... process x, y, z, w in SIMD ...

    // Interleave back into the AOS array if needed
    soa_to_aos4(x, y, z, w, &aos[4*i]);
}ispc

14 Data Alignment — Why It Matters

SIMD load instructions have alignment requirements. Understanding why alignment matters helps you avoid subtle performance bugs.

What alignment means

A value is "N-byte aligned" if its memory address is a multiple of N. For example, an array of 8 floats (32 bytes) is 32-byte aligned if it starts at an address like 0x100, 0x120, 0x140 — any multiple of 32. It's unaligned if it starts at 0x104 or 0x11A.

SIMD vector loads are most efficient when the data starts at an address aligned to the vector width:

  • SSE4 (16 bytes): needs 16-byte alignment
  • AVX2 (32 bytes): needs 32-byte alignment for aligned load instructions
  • AVX-512 (64 bytes): needs 64-byte alignment for aligned loads

Modern CPUs can handle unaligned loads, but they may be slower (especially when the data crosses a cache line boundary).

How to allocate aligned memory

// C11 aligned_alloc (also usable from C++17 via <cstdlib>)
float* data = (float*)aligned_alloc(64, N * sizeof(float));
// 64 = cache line size = also the AVX-512 alignment. Safe for all SIMD widths.
// (Note: C11 requires the size to be a multiple of the alignment.)

// C++ standard (C++17)
float* data = new (std::align_val_t(64)) float[N];

// Don't forget to use the matching free:
free(data);      // for aligned_alloc
operator delete[] (data, std::align_val_t(64));  // for aligned newC++

Telling ISPC about alignment

// In ISPC, declare aligned local arrays with __attribute__
uniform float __attribute__((aligned(64))) scratch[64];

// Or use the aligned attribute (v1.27+)
__attribute__((aligned(64))) uniform float buf[64];ispc
💡 Practical Advice

Always allocate ISPC-processed data with 64-byte alignment (one cache line). This works for all targets from SSE4 to AVX-512, ensures no cache-line splits, and costs nothing in practice since malloc already aligns to 16 bytes — you're just being more careful about larger SIMD widths.

15 Types in Depth

Primitive types

ISPC Type              | Size   | C Equivalent        | Notes
bool                   | 8-bit  | bool                | true=1, false=0 (matches C ABI since v1.23)
int8 / uint8           | 8-bit  | int8_t / uint8_t    | Byte-wide integers
int16 / uint16         | 16-bit | int16_t / uint16_t  |
int / int32 / uint32   | 32-bit | int32_t / uint32_t  | int is always 32-bit in ISPC (unlike C)
int64 / uint64         | 64-bit | int64_t / uint64_t  |
float16                | 16-bit | (no direct C equiv) | Half precision; requires AVX2+ or conversion
float                  | 32-bit | float               | IEEE 754 single precision
double                 | 64-bit | double              | IEEE 754 double precision

An important difference from C: in ISPC, int is always 32 bits. There is no long ambiguity. Use int64 when you need 64-bit integers explicitly.

Short vector types: fixed-size mathematical vectors

ISPC provides a family of fixed-size vector types written as type<N>. These are completely independent of the SIMD gang width — they're just convenient containers for 2, 3, or 4-component math vectors (like positions, colors, quaternions).

float<3> position;      // 3-component float vector (x, y, z)
float<4> color;         // RGBA
int<2>   pixel_coord;   // 2D pixel position

// Access by name or index
position.x = 1.0f;
position.y = 2.0f;
position.z = 3.0f;
float w = color[3];   // alpha channel

// These can be uniform or varying:
uniform float<3> world_up = {0.0f, 1.0f, 0.0f}; // one vec3 for all lanes
float<3>         per_lane_pos;  // varying: each lane has its own vec3ispc

Short vectors and the standard library (v1.28+)

As of v1.28, short vector functions moved to a dedicated header that you must include:

#include "short_vec.isph"   // needed for short vector operations in v1.28+

uniform float<3> a = {1.0f, 2.0f, 3.0f};
uniform float<3> b = {4.0f, 5.0f, 6.0f};

uniform float<3> sum    = a + b;         // {5, 7, 9}
uniform float<3> scaled = a * 2.0f;      // {2, 4, 6}
uniform float<3> m      = max(a, b);     // {4, 5, 6} (element-wise)
uniform float<3> rooted = sqrt(a);      // {1.0, 1.414, 1.732}

// Dot product (manual, no built-in dot for short vecs)
uniform float dot = a.x*b.x + a.y*b.y + a.z*b.z;  // 32.0ispc

Structs and operator overloading

struct Color {
    float r, g, b, a;
};

// Operator overloading (ISPC 1.28+ supports +,-,*,/,==,!=,<,>,<=,>=,etc.)
inline Color operator+(Color x, Color y) {
    Color result;
    result.r = x.r + y.r;
    result.g = x.g + y.g;
    result.b = x.b + y.b;
    result.a = x.a + y.a;
    return result;
}

inline Color operator*(Color c, float scale) {
    Color result;
    result.r = c.r * scale;
    result.g = c.g * scale;
    result.b = c.b * scale;
    result.a = c.a * scale;
    return result;
}

// Now you can write:
Color blended = (color_a + color_b) * 0.5f;ispc

16 foreach and Iteration Patterns

We've touched on foreach already, but let's go through all its forms and understand the tradeoffs.

Basic 1D foreach

This is the workhorse. The range 0 ... N is inclusive of 0, exclusive of N. The loop variable is automatically varying — each lane sees a different value.

foreach (i = 0 ... N) {
    // i is varying. Lane 0: i=0, lane 1: i=1, ..., lane 7: i=7
    // Next iteration: lane 0: i=8, lane 1: i=9, ...etc.
    output[i] = sin(input[i]);
}ispc

Multi-dimensional foreach: processing 2D and 3D data

When working with image or volume data, you often want to iterate over two or three indices at once. ISPC's foreach supports multiple dimensions:

export void process_image(uniform float pixels[], uniform int W, uniform int H) {
    foreach (row = 0 ... H, col = 0 ... W) {
        // 'row' and 'col' are both varying
        // ISPC distributes the 2D iteration space across the gang
        // So adjacent lanes might be at (row=0,col=0), (row=0,col=1), etc.
        int idx = row * W + col;
        pixels[idx] = gamma_correct(pixels[idx]);
    }
}

// 3D volume data
export void process_volume(uniform float vol[], uniform int X, uniform int Y, uniform int Z) {
    foreach (z = 0 ... Z, y = 0 ... Y, x = 0 ... X) {
        int idx = (z * Y + y) * X + x;
        vol[idx] = process_voxel(vol[idx], x, y, z);
    }
}ispc

foreach_tiled: different distribution for better cache behavior

Regular foreach distributes elements in a linear (row-major) order. For some 2D algorithms, a tiled distribution — where a gang processes a small 2D tile rather than a row-wide strip — can improve cache locality because all 8 lanes access nearby pixels.

// Regular foreach: lane assignment for 4-wide gang on a 4×4 image
// iter 1: pixels (0,0) (0,1) (0,2) (0,3)  ← same row, sequential in memory
// iter 2: pixels (1,0) (1,1) (1,2) (1,3)  ← next row

// foreach_tiled: tiles the space differently
// iter 1: pixels (0,0) (0,1) (1,0) (1,1)  ← 2×2 tile, spatially local
// iter 2: pixels (0,2) (0,3) (1,2) (1,3)  ← next tile
foreach_tiled (row = 0 ... H, col = 0 ... W) {
    // Same code, different lane layout for cache efficiency
    pixels[row * W + col] = process_pixel(row, col);
}ispc

Use foreach_tiled when your kernel reads data from a 2D neighborhood (e.g., blur filters, convolutions) — the tiled distribution means nearby lanes access overlapping data, which stays hot in cache.

17 Functions — Declarations, Calls, and Interop

export, task, and plain functions

ISPC has three kinds of function declarations, each with a different purpose:

Declaration                 | Callable from        | Uniform/varying state               | Use case
export void foo()           | C++ and ISPC         | Enters SPMD mode: varying defaults  | Public API called by C++
void foo() (no qualifier)   | ISPC only            | Stays in current SPMD mode          | Internal helpers, can be inlined
task void foo()             | Via launch statement | Starts fresh SPMD mode              | Multi-core task functions (Chapter 18)
// Internal helper: inline if possible, no C++ visibility
inline float fast_sigmoid(float x) {
    return 1.0f / (1.0f + exp(-x));
}

// Public kernel: C++ calls this, it's in the namespace ispc::
export void apply_sigmoid(uniform float data[], uniform int N) {
    foreach (i = 0 ... N) {
        data[i] = fast_sigmoid(data[i]);  // calls helper with varying float
    }
}ispc

Calling C functions from ISPC

Sometimes you need to call an existing C function from within ISPC. This works, but there's an important constraint: C functions are scalar — they take one value and return one value. If you call a C function with a varying argument, ISPC must call it once per active lane, serially. This destroys SIMD parallelism for that call.

// Declare the C function
extern "C" uniform float my_c_function(uniform float x);

export void call_c_example(uniform float data[], uniform int N) {
    // GOOD: uniform argument → one call, fast
    uniform float global_factor = my_c_function(data[0]);

    foreach (i = 0 ... N) {
        // SLOW: varying argument → called 8 times serially (once per lane)
        // Avoid this pattern in hot loops!
        float per_lane_result = my_c_function(data[i]);

        data[i] *= global_factor;
    }
}ispc

Function overloading

ISPC supports function overloading exactly like C++. You can define the same function name for different types:

// Note: stdlib functions like min/max/clamp are themselves overloaded per
// type; pick names that don't collide with the stdlib for your own overloads
float midpoint(float a, float b) { return (a + b) * 0.5f; }
int   midpoint(int   a, int   b) { return (a + b) / 2; }

// The compiler picks the right version based on argument types
float mf = midpoint(0.0f, 1.0f);   // picks float version
int   mi = midpoint(0,    255);    // picks int versionispc

The external_only attribute (v1.25+)

By default, export functions generate two versions: one callable from C++ (external), and one callable from other ISPC functions (internal). The internal version can increase code size significantly. If you only call the function from C++, use external_only to suppress the internal version:

// This function is ONLY ever called from C++, not from other ISPC code
__attribute__((external_only))
export void my_kernel(uniform float data[], uniform int N) {
    foreach (i = 0 ... N) {
        data[i] *= 2.0f;
    }
}ispc

18 Task Parallelism — Multi-Core with ISPC

So far, everything we've discussed runs on a single CPU core using SIMD. ISPC also has a system for distributing work across multiple CPU cores simultaneously. This uses a different mechanism: tasks.

The two levels of parallelism

Without tasks (single core):
    Core 0: [gang][gang][gang][gang][gang]...   → all work on one core, SIMD only

With tasks (multi-core):
    Core 0: [task 0 - gang][gang][gang]...
    Core 1: [task 1 - gang][gang][gang]...      → 4 cores × 8 SIMD = 32× parallelism
    Core 2: [task 2 - gang][gang][gang]...
    Core 3: [task 3 - gang][gang][gang]...

Tasks are like threads, but they go through ISPC's lightweight task scheduler rather than creating OS threads. Each task is itself an ISPC SPMD function — it runs a whole gang per "iteration" internally.

Defining a task function

A task function uses the task keyword instead of export. Inside, it has access to two special built-in variables: taskIndex (which task am I?) and taskCount (how many tasks total?). Both are uniform.

// A task processes a chunk of the total work
task void compute_chunk(uniform float input[], uniform float output[],
                          uniform int totalN) {

    // Each task handles a contiguous slice of the array
    // taskIndex and taskCount divide up the work evenly
    uniform int start = (taskIndex * totalN) / taskCount;
    uniform int end   = ((taskIndex + 1) * totalN) / taskCount;

    // Inside the task, we use SIMD foreach as normal
    foreach (i = start ... end) {
        output[i] = sqrt(input[i]) * log(input[i] + 1.0f);
    }
}ispc

Launching tasks and waiting for them

export void compute_parallel(uniform float input[], uniform float output[],
                               uniform int N) {

    // Choose task count: typically 1-2× the number of CPU cores
    // Too few: underutilizes cores. Too many: task overhead dominates.
    uniform int numTasks = 8;   // tune to your machine

    // Launch all tasks simultaneously — they run in parallel on different cores
    launch[numTasks] compute_chunk(input, output, N);

    // 'sync' blocks until ALL launched tasks have completed
    // You must sync before accessing any data that tasks wrote to
    sync;

    // Now all output[] values are ready
}ispc

The task runtime requirement

ISPC tasks need a runtime library to manage thread pools and scheduling. ISPC ships with one in its examples directory. You need to compile and link it with your program:

# Compile the task system (once, or add to your build)
g++ -O2 -c ispc/examples/common/tasksys.cpp -o tasksys.o

# Link it with your program
g++ main.o mykernel.o tasksys.o -o my_program -lpthreadbash

Alternatively, you can use Intel TBB (Threading Building Blocks) as the task backend, or implement your own task system — ISPC has a clean interface for this.

💡 When to use tasks

Tasks are worth the setup cost only for large workloads. For arrays of fewer than ~100,000 elements, the task launch overhead may outweigh the speedup. Use tasks when your total work takes more than a few milliseconds. For small kernels called in a tight loop, SIMD-only (no tasks) is often faster.

19 The Standard Library — Math Functions

ISPC's standard library provides vectorized versions of all standard math functions. Unlike C's math.h, which operates on one value at a time, ISPC's versions automatically operate on an entire gang (8 values at once on AVX2) using hardware SIMD instructions.

Basic math

float v = data[programIndex];   // a varying float

// All of these operate on the entire gang simultaneously:
float a = abs(v);              // absolute value
float s = sqrt(v);             // square root
float r = rsqrt(v);            // reciprocal sqrt: 1/sqrt(v)
float f = rsqrt_fast(v);       // fast approximation of 1/sqrt(v) (lower precision)
float lo = floor(v);           // round down
float hi = ceil(v);            // round up
float ro = round(v);           // round to nearest
float tr = trunc(v);           // truncate toward zero
float sg = sign(v);            // -1.0, 0.0, or +1.0
float mn = min(v, 0.0f);       // element-wise minimum
float mx = max(v, 1.0f);       // element-wise maximum
float cl = clamp(v, 0.0f, 1.0f); // clamp to [0, 1]ispc

Transcendental functions

float sinv  = sin(v);
float cosv  = cos(v);
float tanv  = tan(v);
float asinv = asin(v);         // inverse trig
float acosv = acos(v);
float atanv = atan(v);
float a2    = atan2(v, w);     // 2-argument arctangent
float expv  = exp(v);          // e^v
float logv  = log(v);          // natural log
float powv  = pow(v, 2.3f);   // v^2.3
float cbv   = cbrt(v);         // cube root (v1.27+)
float fmv   = fmod(v, 1.0f);  // floating-point remainderispc

Floating-point predicates

bool nan_v     = isnan(v);     // is this value NaN?
bool inf_v     = isinf(v);     // is this value ±infinity?
bool finite_v  = isfinite(v);  // is this a normal finite number?

// These return varying bool — one answer per lane
// Lane 0 might have a NaN while lane 3 has a normal numberispc

Saturating arithmetic

Regular integer arithmetic wraps around on overflow (255 + 1 = 0 for uint8). Saturating arithmetic clamps instead (255 + 1 = 255 for uint8). This is useful for image processing:

uint8 a = 200, b = 100;
uint8 wrapped   = a + b;            // 44 — WRONG for image add
uint8 saturated = saturating_add(a, b);  // 255 — clampedispc

20 Reductions — Aggregating Values Across Lanes

A reduction takes one value per lane (a varying value) and combines all of them into a single uniform value. This is how you compute things like "the sum of all elements" or "is any element greater than zero" from within a SIMD kernel.

Why you can't just add varying values to get a scalar

If v is a varying float (8 different values across 8 lanes), you can't write:

uniform float total = v;  // ERROR: which of the 8 values do you mean?ispc

You need an explicit reduction that says how to combine the 8 values. The reduce_add function adds all 8 lane values together into one scalar result. The CPU does this with a short sequence of SIMD instructions (horizontal additions) — still much faster than doing 8 scalar adds.

Available reductions

float v = data[programIndex];   // varying: 8 different values

// Sum all 8 lane values
uniform float total = reduce_add(v);
// e.g. v = [1,2,3,4,5,6,7,8] → total = 36

// Maximum across all lanes
uniform float maxval = reduce_max(v);
// e.g. v = [3, 8, 1, 5, 2, 7, 4, 6] → maxval = 8

// Minimum across all lanes
uniform float minval = reduce_min(v);

// Logical: is ANY lane's condition true?
uniform bool has_negative = any(v < 0.0f);
// Returns true if at least one lane has v < 0

// Logical: are ALL lanes' conditions true?
uniform bool all_positive = all(v > 0.0f);
// Returns true only if every lane has v > 0

// Logical: are NO lanes' conditions true?
uniform bool no_nans = none(isnan(v));ispc

Using reductions in practice: dot product

export uniform float dot_product(uniform float a[], uniform float b[],
                                    uniform int N) {
    // 'partial' accumulates sums within each lane independently
    float partial = 0.0f;   // varying: each lane has its own running total

    foreach (i = 0 ... N) {
        partial += a[i] * b[i];  // SIMD: 8 multiply-adds per instruction
    }

    // Now partial has 8 partial sums (one per lane)
    // Sum them into one uniform result
    return reduce_add(partial);
    // For N=16 on 8-wide: we did 2 SIMD iterations → 2 fused multiply-adds
    // Plus one 8-way horizontal sum at the end. Very fast!
}ispc

Reductions for early exit

export uniform bool contains_nan(uniform float data[], uniform int N) {
    // return (like break) is not allowed inside a foreach loop, so walk
    // the array in gang-sized chunks with a regular for loop instead
    uniform int base;
    for (base = 0; base + programCount <= N; base += programCount) {
        // any() gives a uniform bool: if ANY lane found a NaN,
        // we can exit immediately without scanning the rest
        if (any(isnan(data[base + programIndex])))
            return true;
    }
    // remaining elements (fewer than a full gang)
    bool tail_nan = false;   // varying: one flag per lane
    foreach (i = base ... N) {
        tail_nan = tail_nan || isnan(data[i]);
    }
    return any(tail_nan);
}ispc

21 Cross-Lane Operations — Shuffles, Broadcasts, Rotates

Sometimes you need to share data between lanes within the same gang. For example, lane 3 might need to know what lane 0's value is. These are called cross-lane operations, and they use SIMD shuffle instructions under the hood.

broadcast: send one lane's value to all lanes

broadcast(v, lane) takes the value of v in a specific lane and returns it to all lanes:

float v = data[programIndex];
// Suppose v = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]

float b = broadcast(v, 0);
// b = [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]
// All lanes now have lane 0's value

float c = broadcast(v, 5);
// c = [9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 9.0]ispc

rotate: shift values around the ring of lanes

rotate(v, delta) shifts each lane's value by delta positions, wrapping around:

float v = data[programIndex];
// v = [A, B, C, D, E, F, G, H]

float r = rotate(v, 1);
// r = [B, C, D, E, F, G, H, A]  — lane i takes lane i+1's value
//                                  (its right neighbor); A wraps around

float s = rotate(v, -1);
// s = [H, A, B, C, D, E, F, G]  — lane i takes lane i-1's value; H wraps aroundispc

When is rotate useful? For stencil computations (finite differences, convolutions) where each element's result depends on its neighbors. Instead of loading the data multiple times with offsets, you can load once and use rotate to access neighbors:

foreach (i = 1 ... N-1) {
    float center = data[i];
    float right  = rotate(center, 1);   // neighbor to the right
    float left   = rotate(center, -1);  // neighbor to the left

    // Finite difference: central difference approximation of derivative
    result[i] = (right - left) * 0.5f;
    // Note: rotate wraps within the gang, so the first and last lane of
    // EACH gang see a wrapped (wrong) neighbor — handle those separately
}ispc

shuffle: arbitrary reordering

shuffle(v, perm) reorders lane values according to a varying index perm. Lane N gets the value from lane perm[N]:

float v    = data[programIndex];
int   perm = (programCount - 1) - programIndex;  // [7,6,5,4,3,2,1,0]
float rev  = shuffle(v, perm);  // reverses the lane orderispc
⚠️ Cross-lane operations are relatively expensive

Shuffle, rotate, and broadcast all require SIMD permute instructions, which can take 3–5 cycles and may require specific hardware support. Use them when necessary, but don't use them in inner loops if a simpler data layout would avoid them. For most algorithms, good data layout (SOA) eliminates the need for cross-lane operations entirely.

22 Atomic Operations

Atomic operations let multiple lanes (or threads) safely modify shared memory without data races. They're "atomic" in the sense that each operation appears to happen instantaneously — no other operation can see a partially-completed result.

The problem without atomics

Suppose you want to count how many elements pass a filter. The naive approach has a race condition:

uniform int count = 0;

foreach (i = 0 ... N) {
    if (data[i] > threshold) {
        count++;  // BUG: race condition — multiple lanes increment simultaneously
                  // The result is unpredictable
    }
}ispc

The correct approach: atomic or reduction

There are two ways to fix this. The reduction approach is usually faster:

// Option 1: atomic increment (correct but slower)
uniform int count = 0;
foreach (i = 0 ... N) {
    if (data[i] > threshold) {
        // atomic_add_global: safely adds 1 to count from any lane/thread
        atomic_add_global(&count, 1);
    }
}

// Option 2: reduction (faster — no atomic hardware needed)
int local_count = 0;   // varying: each lane tracks its own count
foreach (i = 0 ... N) {
    if (data[i] > threshold) {
        local_count++;   // no race: each lane increments its own copy
    }
}
uniform int total = reduce_add(local_count);  // combine at the endispc

Available atomic functions

// All variants: _global (between tasks/threads), _local (within a task)

// Integer atomics
int old = atomic_add_global(&shared_int, 1);    // returns old value
int old = atomic_subtract_global(&counter, 1);
int old = atomic_min_global(&min_val, x);
int old = atomic_max_global(&max_val, x);
int old = atomic_and_global(&flags, mask);
int old = atomic_or_global(&flags, bit);
int old = atomic_xor_global(&flags, bit);
int old = atomic_swap_global(&val, new_val);

// Compare-and-swap: atomically: if *ptr == cmp, set *ptr = new_val
int old = atomic_compare_exchange_global(&val, cmp, new_val);

// Float atomics (v1.30+)
float old = atomic_add_global(&shared_float, delta_f);ispc

foreach_active for atomic operations

A common pattern: use foreach_active to serialize the atomic operations, which is faster than issuing gated atomics from all lanes:

foreach (i = 0 ... N) {
    if (passes_filter(data[i])) {
        foreach_active (lane) {
            // Serial: each active lane does its atomic one at a time.
            // This avoids hardware contention on the atomic unit.
            uniform int slot = atomic_add_global(&write_pos, 1);
            // i and data[i] are still varying here — use extract() to
            // pull out the current lane's value as a uniform
            out[slot] = extract(data[i], lane);
        }
    }
}ispc

23 C/C++ Interoperability in Depth

Complete type mapping

When you pass data between C++ and ISPC, the types map according to this table. The key rule: only uniform types can cross the boundary — varying types are SIMD registers and have no C equivalent.

ISPC type                   | C/C++ type           | Notes
uniform bool                | bool                 | true=1 (v1.23+ matches C ABI)
uniform int8                | int8_t               |
uniform int16               | int16_t              |
uniform int / uniform int32 | int32_t              |
uniform int64               | int64_t              |
uniform float               | float                |
uniform double              | double               |
uniform T*                  | T*                   | Pointer to arrays processed by ISPC
varying anything            | (not representable)  | Cannot cross the boundary

Sharing struct definitions between C++ and ISPC

The cleanest way to share data structures is to define them in a C++ header and include that header in your ISPC file. The same struct layout rules apply (C ABI, no hidden padding):

// shared_types.h — included by both C++ and ISPC
#pragma once
#include <stdint.h>

struct Ray {
    float ox, oy, oz;  // origin
    float dx, dy, dz;  // direction (normalized)
    float tmin, tmax;  // valid interval
};C++
// ray_kernel.ispc
#include "shared_types.h"

export void trace_rays(uniform Ray rays[], uniform int N,
                         uniform float hit_t[]) {
    foreach (i = 0 ... N) {
        // Access AOS struct fields — they become gathers (see Chapter 13)
        float ox = rays[i].ox;   // each lane reads from a different Ray
        float dx = rays[i].dx;
        hit_t[i] = intersect_sphere(ox, dx, ...);
    }
}ispc

Calling ISPC from C++: the complete pattern

// main.cpp
#include <cstdlib>             // aligned_alloc, free
#include "shared_types.h"
#include "ray_kernel_ispc.h"   // generated by ISPC

int main() {
    const int N = 65536;
    Ray*   rays  = (Ray*)  aligned_alloc(64, N * sizeof(Ray));
    float* hit_t = (float*)aligned_alloc(64, N * sizeof(float));

    fill_rays(rays, N);  // your C++ setup code

    // Call the ISPC kernel — no special syntax, just a namespace prefix
    ispc::trace_rays(rays, N, hit_t);

    process_results(hit_t, N);

    free(rays); free(hit_t);
}C++

Using ISPC as a runtime-linked library (v1.28+)

ISPC 1.28 introduced libispc, which lets you embed the ISPC compiler in your application and compile kernels at runtime (JIT compilation). This is useful for shader systems, scripting, and dynamic code generation:

#include <ispc/ispc.h>

const char* kernel_source = R"(
    export void scale(uniform float data[], uniform int n, uniform float factor) {
        foreach (i = 0 ... n) { data[i] *= factor; }
    }
)";

// Compile at runtime for the current CPU
ISPCEngine engine;
engine.Initialize();
auto module = engine.CompileString(kernel_source, "--target=host");

// Get a function pointer to the compiled kernel
typedef void (*ScaleFn)(float*, int, float);
ScaleFn scale_fn = (ScaleFn)module.GetFunctionPtr("scale");
scale_fn(my_data, N, 2.0f);   // runs the JIT-compiled SIMD kernelC++

24 Multi-Target Compilation and Auto-Dispatch

A major practical challenge: you compile your program on your development machine with AVX-512, but users might run it on machines with only SSE4. How do you ship one binary that uses the best available SIMD on every machine?

ISPC's solution: compile for multiple targets at once and include auto-dispatch code that detects the CPU at runtime and calls the right version.

How to compile for multiple targets

# Compile once for three ISA levels simultaneously
ispc my_kernel.ispc \
    --target=sse4-i32x4,avx2-i32x8,avx512skx-x16 \
    -o my_kernel.o \
    -h my_kernel_ispc.h \
    -O2bash

What happens behind the scenes:

  1. ISPC compiles three versions of your kernel — one for each target ISA
  2. It emits all three into my_kernel.o, with mangled names (kernel_sse4, kernel_avx2, kernel_avx512)
  3. It also generates a dispatcher function with the original name (kernel) that runs CPUID at first call, checks which ISA the current CPU supports, and calls the best available version
  4. The generated header my_kernel_ispc.h declares only the original name — your C++ code calls it normally and gets the best version automatically
C++ calls: ispc::my_kernel(data, N);
                    ↓
Dispatcher (generated):
┌─────────────────────────────────────────┐
│ if CPU supports AVX-512: call avx512    │
│ else if AVX2:            call avx2      │
│ else:                    call sse4      │
└─────────────────────────────────────────┘
      ↓              ↓              ↓
[avx512 code]   [avx2 code]    [sse4 code]
  (16-wide)       (8-wide)       (4-wide)
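The dispatcher's decision is just a priority check over CPU features. A minimal Python sketch of the selection rule (our own function for illustration, not ISPC's actual generated code):

```python
def select_target(has_avx512: bool, has_avx2: bool) -> str:
    """Pick the widest ISA the running CPU supports, checked best-first,
    mirroring the logic of the generated dispatcher."""
    if has_avx512:
        return "avx512skx-x16"   # 16-wide
    if has_avx2:
        return "avx2-i32x8"      # 8-wide
    return "sse4-i32x4"          # 4-wide baseline, ~any x86 since 2007

print(select_target(False, True))   # avx2-i32x8 (a typical desktop CPU)
```

The real dispatcher caches this decision after the first call, so the feature check costs nothing on subsequent calls.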

All available targets in ISPC 1.30

# x86 targets
sse4-i32x4          # SSE4.1, 4-wide, ~any x86 CPU since 2007
avx2-i32x8          # AVX2 8-wide, Intel Haswell+ or AMD Zen+ (most common target)
avx2-i32x16         # AVX2 double-pumped to 16-wide
avx512skx-x16       # AVX-512, Skylake-X servers
avx512icl-x16       # AVX-512, Ice Lake client
avx512spr-x16       # AVX-512 Sapphire Rapids (includes VNNI)
avx512gnr-x16       # AVX-512 Granite Rapids (AMX-FP16), v1.29+
avx10.2dmr-x16      # AVX10.2, v1.29+ (formerly avx10.2)

# ARM targets
neon-i32x4          # ARM NEON 4-wide (all ARM Cortex-A)
neon-i16x16         # ARM NEON 16-wide double-pumped (v1.26+)
neon-i8x32          # ARM NEON 32-wide double-pumped (v1.26+)

# Intel GPU targets (XE family)
xe-x8               # Intel Xe integrated GPU
xehpc-x16           # Intel Xe HPC (Ponte Vecchio)
xe2lpg-x16          # Intel Xe2 Lunar Lake GPU (v1.25+)
xe2hpg-x16          # Intel Xe2 Battlemage GPU (v1.25+)

# Auto
host                # whatever ISA the build machine supportsbash

25 Templates

ISPC supports function templates closely modeled on C++ templates. They let you write one function that works for multiple types, with the type resolved at compile time.

Type templates

// Without templates, you'd need this for every type:
float  my_abs_f(float  v) { return v < 0 ? -v : v; }
int    my_abs_i(int    v) { return v < 0 ? -v : v; }
double my_abs_d(double v) { return v < 0 ? -v : v; }

// With templates, write once, use for any type:
template<typename T>
T my_abs(T v) { return v < (T)0 ? -v : v; }

float  a = my_abs<float> (-3.14f);   // 3.14f
int    b = my_abs<int>   (-42);       // 42
double c = my_abs<double>(-2.71);    // 2.71ispc

Non-type templates: fixed-size algorithms

Non-type template parameters let you specialize a function for a compile-time constant, like a vector size:

// Dot product of two N-element arrays, N known at compile time
template<typename T, int N>
T dot(uniform T a[], uniform T b[]) {
    T sum = (T)0;
    // Since N is a compile-time constant, the compiler can fully unroll this
    for (uniform int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

uniform float a3[3] = {1, 2, 3};
uniform float b3[3] = {4, 5, 6};
uniform float d3 = dot<float, 3>(a3, b3);  // 1*4 + 2*5 + 3*6 = 32ispc

Templates with short vector types (v1.25+)

// A generic linear interpolation for any short vector type
template<typename T, int N>
T<N> lerp(T<N> a, T<N> b, uniform T t) {
    return a + (b - a) * t;
}

uniform float<3> p0 = {0.0f, 0.0f, 0.0f};
uniform float<3> p1 = {10.0f, 5.0f, 2.0f};
uniform float<3> mid = lerp<float, 3>(p0, p1, 0.5f);
// mid = {5.0, 2.5, 1.0}ispc

26 Performance Guide — Thinking Like the Compiler

Here we tie together everything we've learned into a systematic approach to writing fast ISPC code. Each guideline comes with a why — not just a rule to memorize.

1. Use foreach for all data-parallel loops — never manually scatter work

foreach guarantees correct tail handling (when N isn't divisible by the gang width), generates optimal loop bounds, and is the loop form the compiler understands best. Manual loops with programIndex are fragile and harder to optimize.
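What "correct tail handling" means can be modeled in a few lines of Python (an illustrative model, not ISPC internals): full gangs run unmasked, and a single masked tail covers the remainder when N isn't a multiple of the gang width.

```python
def foreach_schedule(n, gang=8):
    """Model of foreach's iteration plan over n elements with a gang of 8:
    full gangs need no mask; one masked tail handles the leftover lanes."""
    plan, i = [], 0
    while i + gang <= n:
        plan.append(("full", i))       # 8 elements starting at i, unmasked
        i += gang
    if i < n:
        mask = [lane < n for lane in range(i, i + gang)]
        plan.append(("masked", mask))  # only lanes with index < n are active
    return plan

print(foreach_schedule(20))   # 2 full gangs, then a masked tail of 4 lanes
```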

2. Mark everything uniform that you can — it's "free" information

Every uniform annotation tells the compiler "this value is the same in all 8 lanes." That information unlocks optimizations:

  • Uniform conditions → simple branches instead of masked execution
  • Uniform indices → sequential loads instead of gathers
  • Uniform loop bounds → the compiler knows the exact trip count

The compiler enforces this: assigning a varying value to a uniform variable is a compile-time error, so over-annotating can never silently break correctness. The worst case is a build error that tells you exactly where the assumption fails.

3. Structure data as SOA (or soa<N>) for array-of-struct access

This is the single most impactful structural change you can make. It converts gathers (up to 8 scattered memory accesses, each a potential cache miss) into sequential loads (one instruction). A well-tuned SOA layout often provides a 2–4× speedup over AOS even when everything else is optimized.
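The address arithmetic shows why. Assuming a 12-byte struct of three floats {x, y, z}, here is a Python sketch of the byte offsets a gang touches when reading field x for 8 consecutive elements:

```python
def aos_x_offsets(elem_bytes=12, field_off=0, lanes=8):
    """AOS: each lane's x is one struct (12 bytes) past the previous lane's."""
    return [i * elem_bytes + field_off for i in range(lanes)]   # strided: gather

def soa_x_offsets(base=0, lanes=8):
    """SOA: all x values are packed together, 4 bytes apart."""
    return [base + 4 * i for i in range(lanes)]                 # contiguous: one load

print(aos_x_offsets())   # stride-12 addresses: 8 separate accesses
print(soa_x_offsets())   # consecutive floats: one 32-byte SIMD load
```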

4. Use cif / cfor / cwhile when divergence is unlikely

If you're writing a particle system and 99% of particles are active, the condition "is this particle alive?" will be true for all 8 lanes in most gangs. Use cif — at the cost of a single mask-check instruction, you get fully unmasked execution for the common case.
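The mechanism behind cif can be sketched as a small Python model (ours, for intuition only): one cheap test on the whole mask picks an unmasked fast path when all lanes agree; only a genuinely divergent gang pays for running both sides with per-lane blending.

```python
def coherent_if(mask, then_fn, else_fn):
    """Model of cif over one gang: check for a coherent mask first."""
    n = len(mask)
    if all(mask):                 # one mask test: everyone takes 'then'
        return [then_fn(i) for i in range(n)]      # unmasked execution
    if not any(mask):             # everyone takes 'else'
        return [else_fn(i) for i in range(n)]      # unmasked execution
    # Divergent gang: run both sides, blend per lane (what plain `if` does)
    return [then_fn(i) if m else else_fn(i) for i, m in enumerate(mask)]

print(coherent_if([True] * 8, lambda i: i * 2, lambda i: -1))
```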

5. Minimize cross-lane operations in hot paths

Shuffle, rotate, and broadcast require special SIMD instructions that have higher latency than simple arithmetic. In inner loops, redesign your algorithm to avoid needing neighbor data across lanes, or restructure the data so neighbors are in sequential lanes (and therefore load together naturally).

6. Use packed_store_active for compaction (not a scatter)

A common pattern: "write only elements that pass a filter to a compact output buffer." The wrong way is a scatter (output[varying_idx] = val). The right way uses packed_store_active:

uniform int out_idx = 0;   // uniform write cursor

foreach (i = 0 ... N) {
    float val = data[i];
    if (val > threshold) {
        // packed_store_active writes only the active lanes' values,
        // compacted contiguously into output[out_idx...]
        // Returns the number of values written
        uniform int n_written = packed_store_active(&output[out_idx], val);
        out_idx += n_written;
    }
}
// output[0 .. out_idx-1] now holds the filtered values, no gapsispc

7. Reduce redundant loads — load once, use many times

// BAD: loads data[i] twice from memory
foreach (i = 0 ... N) {
    output1[i] = data[i] * 2.0f;
    output2[i] = data[i] + 1.0f;
}

// GOOD: load once, use twice
foreach (i = 0 ... N) {
    float v = data[i];    // one SIMD load
    output1[i] = v * 2.0f;
    output2[i] = v + 1.0f;
}ispc

8. Profile before optimizing

Use ISPC's --emit-asm flag to inspect the generated assembly. Look for:

  • vgatherdps / vscatterdps — gather/scatter instructions (potentially slow)
  • vmovaps / vmovups — aligned/unaligned sequential SIMD loads (fast)
  • Masked operations (instructions with k1 mask register in AVX-512) — check if they're in hot paths
ispc mykernel.ispc --target=avx2-i32x8 --emit-asm -o mykernel.s
grep -n "gather\|scatter\|vmovaps" mykernel.sbash

27 New Features in ISPC 1.28 – 1.30

ISPC 1.30: Intel AMX support

AMX (Advanced Matrix Extensions) is a hardware feature in Sapphire Rapids, Granite Rapids, and later Intel CPUs that provides dedicated matrix multiply-accumulate hardware. Think of it like a GPU tensor core built into a CPU core. ISPC 1.30 exposes it through a standard library header:

#include "amx.isph"   // only included when target supports AMX

// AMX works with "tiles" — small 2D register files
// Each tile can hold a 16×64 byte matrix (e.g., 16×16 BF16 values)

// Configure the tile dimensions
amx_tile_config cfg;
amx_tile_config_set_dims(&cfg, /*rows*/16, /*cols_bytes*/64);
amx_ldtilecfg(&cfg);

// Load two input matrices into tile registers
amx_tile_loadd_t0(matA_ptr, matA_stride);
amx_tile_loadd_t1(matB_ptr, matB_stride);

// Compute: tile_result += tile_A × tile_B  (BF16 inputs → float32 output)
amx_tdpbf16ps(TILE_RESULT, TILE_A, TILE_B);

// Store result
amx_tile_stored_t2(result_ptr, result_stride);ispc

AMX is only available on AVX-512 SPR, GNR, and AVX10.2DMR targets. For other targets, this code won't compile — use preprocessor guards:

#ifdef ISPC_TARGET_AVX512SPR
    amx_tdpbf16ps(TILE_C, TILE_A, TILE_B);  // Hardware AMX
#else
    fallback_matmul(C, A, B);               // Scalar fallback
#endifispc

ISPC 1.29: Profile-guided optimization

PGO uses runtime data (actual program execution) to guide the compiler's optimization decisions. Hot paths get more aggressive optimization; cold paths get smaller code. In v1.29:

# Step 1: compile with debug info for profiling
ispc mykernel.ispc --sample-profiling-debug-info -o mykernel_prof.o

# Step 2: run the program and collect profile data using Linux perf
perf record -e cycles:u -g ./my_program
perf script | create_llvm_prof --binary=my_program --out=profile.afdo

# Step 3: recompile using the profile — can give 5-60% speedup
ispc mykernel.ispc \
    --profile-sample-use=profile.afdo \
    -O3 \
    --target=avx2-i32x8 \
    -o mykernel_optimized.obash

ISPC 1.29: Stack Smash Protection

Stack canaries detect buffer overflows — useful for security-sensitive ISPC code:

ispc mykernel.ispc --stack-protector=strong -o mykernel.o
# --stack-protector       = protects functions with detectable vulnerabilities
# --stack-protector=strong = protects functions with any array or address-taken local
# --stack-protector=all   = protects every functionbash

ISPC 1.28: Python bindings via nanobind

ISPC 1.28 can auto-generate Python bindings, letting you call ISPC kernels directly from Python with NumPy arrays:

# Compile with Python wrapper generation
ispc mykernel.ispc \
    --nanobind-wrapper=mykernel_nb.cpp \
    --target=avx2-i32x8 \
    -o mykernel.obash
# setup.py (simplified)
import nanobind
from setuptools import setup
from nanobind.setuptools_helpers import NanobindExtension

setup(name="mykernel", ext_modules=[
    NanobindExtension("mykernel", sources=["mykernel_nb.cpp"],
                      extra_objects=["mykernel.o"])
])Python
# Python usage
import mykernel
import numpy as np

data = np.random.rand(1024).astype(np.float32)
result = np.zeros_like(data)
mykernel.fast_sqrt(data, result, 1024)  # calls ISPC SIMD kernel from PythonPython

ISPC 1.28: Struct operator overloading extended

v1.28 greatly extended operator overloading to cover all standard C++ operators, including unary, compound-assignment, and comparison operators:

struct Vec3 { float x, y, z; };

// Unary
inline Vec3 operator-(Vec3 a) { Vec3 r; r.x=-a.x; r.y=-a.y; r.z=-a.z; return r; }

// Compound assignment
inline Vec3& operator+=(Vec3& a, Vec3 b) { a.x+=b.x; a.y+=b.y; a.z+=b.z; return a; }

// Equality
inline bool operator==(Vec3 a, Vec3 b) {
    return a.x==b.x && a.y==b.y && a.z==b.z;
}

// Usage
Vec3 v = {1, 2, 3};
Vec3 neg = -v;            // {-1, -2, -3}
v += {10, 10, 10};       // v = {11, 12, 13}ispc

28 Full Example — Mandelbrot Set with Tasks

Let's build a complete, optimized Mandelbrot set renderer that uses both SIMD parallelism (via gangs) and multi-core parallelism (via tasks). This example demonstrates divergent control flow — different pixels need different iteration counts — which is where ISPC's masked execution shines.

Why Mandelbrot is a great ISPC example

The Mandelbrot set is defined by: starting from z=0, repeatedly apply z = z² + c, and count how many iterations before |z| > 2. If it never escapes, the point is "in" the set.

The challenge: different pixels escape at completely different iteration counts. Some escape after 2 iterations, some take 1000. This means the "while" loop has a highly varying exit condition — different lanes are done at different times. In scalar code, this is fine. In SIMD, lanes that finish early stay masked until all 8 are done. ISPC handles this automatically and efficiently.
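For reference, the per-pixel escape-time loop in plain scalar Python, starting from z = 0 as in the definition above. (The ISPC kernel below folds the first iteration by starting at z = c, so its counts may differ by one.)

```python
def escape_iters(cr, ci, max_iter=1000):
    """Scalar reference: iterate z = z^2 + c from z = 0 and count
    iterations until |z| > 2 (equivalently |z|^2 > 4)."""
    zr = zi = 0.0
    it = 0
    while it < max_iter and zr * zr + zi * zi < 4.0:
        zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
        it += 1
    return it

print(escape_iters(0.0, 0.0))   # 1000: the origin never escapes
print(escape_iters(2.0, 0.0))   # 1: escapes almost immediately
```

This is exactly the loop whose iteration count varies wildly between neighboring pixels, which is what makes the SIMD version interesting.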

mandelbrot.ispc

// Task function: each task renders a horizontal band of the image
task void mandelbrot_band(
    uniform float  x0,     // left  edge of complex plane view
    uniform float  y0,     // top   edge
    uniform float  dx,     // pixels per unit in x
    uniform float  dy,     // pixels per unit in y
    uniform int    width,
    uniform int    height,
    uniform int    maxIter,
    uniform int    output[])  // flat array: output[row * width + col]
{
    // Divide rows evenly among tasks
    uniform int row_start = (taskIndex * height) / taskCount;
    uniform int row_end   = ((taskIndex+1) * height) / taskCount;

    foreach (row = row_start ... row_end, col = 0 ... width) {
        // Convert pixel (col, row) to complex number (cr, ci)
        float cr = x0 + col * dx;
        float ci = y0 + row * dy;

        // Mandelbrot iteration
        float zr = cr, zi = ci;
        int   iter = 0;

        // Loop condition is varying: different lanes exit at different times
        // ISPC automatically manages the execution mask
        // cwhile: likely most lanes agree when far from the boundary
        cwhile (iter < maxIter && zr*zr + zi*zi < 4.0f) {
            float new_zr = zr*zr - zi*zi + cr;
            zi = 2.0f * zr * zi + ci;
            zr = new_zr;
            iter++;
        }

        // Each lane writes its own iteration count
        output[row * width + col] = iter;
    }
}

// Entry point from C++
export void render_mandelbrot(
    uniform float x0, uniform float y0,
    uniform float x1, uniform float y1,
    uniform int   width, uniform int height,
    uniform int   maxIter,
    uniform int   output[])
{
    uniform float dx = (x1 - x0) / width;
    uniform float dy = (y1 - y0) / height;

    // Launch a fixed number of tasks to spread bands across cores.
    // (A few more tasks than cores can improve load balancing, since
    // bands near the fractal boundary cost more than others.)
    uniform int numTasks = 8;
    launch[numTasks] mandelbrot_band(x0, y0, dx, dy, width, height, maxIter, output);
    sync;
}ispc
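The integer band split in mandelbrot_band is worth sanity-checking: it tiles the image exactly, with no gaps or overlaps, even when height isn't divisible by taskCount. A quick Python model of the same arithmetic:

```python
def band(task, num_tasks, height):
    """Row range for one task, mirroring the kernel:
    row_start = (taskIndex*height)/taskCount, row_end likewise."""
    return (task * height) // num_tasks, ((task + 1) * height) // num_tasks

# 1080 rows over 7 tasks: each band's end is the next band's start
rows = [r for t in range(7) for r in range(*band(t, 7, 1080))]
print(rows == list(range(1080)))   # True
```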

main.cpp

#include <cstdlib>
#include <cstdio>
#include "mandelbrot_ispc.h"

void write_ppm(const char* filename, const int* data, int W, int H, int maxIter);

int main() {
    const int W = 1920, H = 1080;
    const int maxIter = 1000;

    int* output = (int*)aligned_alloc(64, W * H * sizeof(int));

    // View: classic Mandelbrot region
    ispc::render_mandelbrot(
        -2.5f, -1.0f,   // x0, y0 (top-left of view)
         1.0f,  1.0f,   // x1, y1 (bottom-right)
        W, H, maxIter, output
    );

    write_ppm("mandelbrot.ppm", output, W, H, maxIter);
    printf("Rendered %dx%d Mandelbrot to mandelbrot.ppm\n", W, H);

    free(output);
}C++

Compile and run

# Compile ISPC kernel for AVX2 (multi-target for distribution)
ispc mandelbrot.ispc \
    --target=sse4-i32x4,avx2-i32x8,avx512skx-x16 \
    -o mandelbrot_ispc.o \
    -h mandelbrot_ispc.h \
    -O2

# Compile C++ and link with tasksys
g++ -O2 main.cpp mandelbrot_ispc.o ispc/examples/common/tasksys.cpp \
    -o mandelbrot -lpthread

# Run
./mandelbrot
# Expected speedup vs scalar C: ~15-20× on AVX2 + 8 coresbash

Understanding the performance

On a machine with 8 cores and AVX2 (8-wide SIMD), the theoretical speedup over scalar single-core is 8 (cores) × 8 (SIMD lanes) = 64×. In practice, the Mandelbrot set has high divergence near the boundary (different pixels take very different iteration counts), so SIMD efficiency is maybe 50–70% there. But pixels away from the boundary are either all inside (all reach maxIter) or all outside (all escape quickly), giving 100% SIMD efficiency. Overall, 15–25× real-world speedup is typical.
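The arithmetic behind those numbers, with the efficiency factors stated explicitly (both are assumptions chosen to illustrate the range above, not measurements):

```python
cores, lanes = 8, 8
theoretical = cores * lanes     # 64x upper bound over scalar single-core

simd_eff   = 0.6   # assumed blend: 50-70% near the boundary, ~100% elsewhere
thread_eff = 0.5   # assumed: task scheduling + memory-bandwidth losses

realistic = theoretical * simd_eff * thread_eff
print(theoretical, realistic)   # 64 19.2, inside the typical 15-25x band
```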

💡 Next steps

From here, explore the ISPC examples directory in the GitHub repository — it includes ray tracers, volume renderers, options pricing, and image processing kernels, all with full source. The ISPC user guide at ispc.github.io/ispc.html has the authoritative reference for every language feature.

Official resources: ISPC User's Guide · GitHub Releases · Examples on GitHub · Community Forum

Based on ISPC v1.30.0 · LLVM 19 backend · Compiled March 2026