Basic Concepts
Understanding Blitz’s execution model and core concepts.
Work Stealing
Blitz uses a work-stealing scheduler, the same approach used by Rayon, Intel TBB, and other high-performance parallel runtimes.
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Worker 0   │     │  Worker 1   │     │  Worker 2   │
│ ┌─────────┐ │     │ ┌─────────┐ │     │ ┌─────────┐ │
│ │  Deque  │ │     │ │         │ │     │ │         │ │
│ │ ┌─────┐ │ │     │ │         │ │     │ │         │ │
│ │ │Job A│◄┼─┼─────┼─┼─STEAL───┼─┼─────┼─┼─STEAL   │ │
│ │ ├─────┤ │ │     │ │         │ │     │ │         │ │
│ │ │Job B│ │ │     │ │         │ │     │ │         │ │
│ │ └──▲──┘ │ │     │ └─────────┘ │     │ └─────────┘ │
│ └────┼────┘ │     │             │     │             │
│   push/pop  │     │             │     │             │
└─────────────┘     └─────────────┘     └─────────────┘
```
How It Works
- Each worker has a deque (double-ended queue)
- Workers push/pop from the bottom (LIFO - keeps cache hot)
- Idle workers steal from the top (FIFO - takes oldest work)
- No central queue - work is distributed automatically
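A hedged sketch of how this plays out in practice: each recursive fork pushes a stealable subtask onto the current worker's deque while the local half keeps running. Only `blitz.join` comes from this page; `sumRange` and the 1024 cutoff are illustrative assumptions, not Blitz API.

```zig
// Sketch: recursive splitting keeps every worker's deque supplied with
// stealable subtasks. `sumRange` is a hypothetical helper.
fn sumRange(data: []const f64) f64 {
    // Small ranges run sequentially; fork/join overhead would dominate.
    if (data.len <= 1024) {
        var total: f64 = 0;
        for (data) |x| total += x;
        return total;
    }
    const mid = data.len / 2;
    // Fork: the right half becomes a stealable task on this worker's
    // deque; the left half runs locally (cache-hot, LIFO pop).
    const result = blitz.join(.{
        .left = .{ sumRange, data[0..mid] },
        .right = .{ sumRange, data[mid..] },
    });
    return result.left + result.right;
}
```

Idle workers steal the oldest (largest) pending halves from the top of the deque, which is why recursive splitting balances load without a central queue.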
Why Work Stealing?
| Approach | Pros | Cons |
|---|---|---|
| Central queue | Simple | Contention bottleneck |
| Static partitioning | No overhead | Load imbalance |
| Work stealing | Dynamic balance, low contention | Slightly complex |
Fork-Join Model
Blitz follows the fork-join execution model:
```
 fork(B)              join()
    │                    │
    ▼                    ▼
┌──────────┐       ┌──────────┐
│  Task B  │       │ Wait for │
│ (stolen) │       │ B result │
└──────────┘       └──────────┘
      │                  │
      └──────────────────┘
               │
         Both complete
```
- Fork: Create a subtask that may run in parallel
- Execute: Do local work while subtask runs
- Join: Wait for subtask and combine results
```zig
// Fork-join example
const result = blitz.join(.{
    .left = .{ computeLeft, leftData },
    .right = .{ computeRight, rightData },
});
const total = result.left + result.right;
```
Grain Size
Grain size controls the minimum chunk size before parallelization:
```zig
// Default: automatic grain size
blitz.parallelFor(n, ctx_type, ctx, bodyFn);

// Custom grain size (1000 elements per chunk)
blitz.parallelForWithGrain(n, ctx_type, ctx, bodyFn, 1000);
```
Choosing Grain Size
| Grain Size | Effect |
|---|---|
| Too small | Overhead dominates, slower than sequential |
| Too large | Poor load balancing, some cores idle |
| Just right | Amortizes overhead, good balance |
Rule of thumb: Start with defaults. Only tune if profiling shows issues.
Sequential Threshold
Blitz automatically avoids parallelization for small data:
```zig
// Simple size check against default grain size (65536)
if (data.len >= blitz.DEFAULT_GRAIN_SIZE) {
    // Parallel path
} else {
    // Sequential path (less overhead)
}
```
The threshold depends on:
- Operation type: Memory-bound ops need more data
- Worker count: More workers = higher threshold
- Data size: Must amortize fork/join overhead
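These factors can be folded into a wrapper that checks the threshold before forking. A minimal sketch, assuming only the `blitz.parallelFor` and `blitz.DEFAULT_GRAIN_SIZE` names shown above; `Ctx` and `scaleAll` are hypothetical:

```zig
// Sketch: fall back to a plain loop below the threshold.
const Ctx = struct { data: []f64, scale: f64 };

fn scaleAll(data: []f64, scale: f64) void {
    if (data.len >= blitz.DEFAULT_GRAIN_SIZE) {
        // Enough work to amortize fork/join overhead.
        blitz.parallelFor(data.len, Ctx, .{ .data = data, .scale = scale }, struct {
            fn body(c: Ctx, start: usize, end: usize) void {
                for (c.data[start..end]) |*x| x.* *= c.scale;
            }
        }.body);
    } else {
        // Too small: a sequential loop avoids scheduling overhead entirely.
        for (data) |*x| x.* *= scale;
    }
}
```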
Context Pattern
Blitz uses a context struct to pass data to parallel bodies:
```zig
// Define what data the parallel body needs
const Context = struct {
    input: []const f64,
    output: []f64,
    scale: f64,
};

// Create context instance
const ctx = Context{
    .input = input_data,
    .output = output_data,
    .scale = 2.5,
};

// Pass to parallel operation
blitz.parallelFor(input_data.len, Context, ctx, struct {
    fn body(c: Context, start: usize, end: usize) void {
        for (c.input[start..end], c.output[start..end]) |in, *out| {
            out.* = in * c.scale;
        }
    }
}.body);
```
Why Context?
- No closures in Zig - Can’t capture variables
- Explicit data flow - Clear what’s shared
- Comptime optimization - Struct access is fast
Thread Pool Lifecycle
```zig
// 1. Initialize (spawns worker threads)
try blitz.init();

// 2. Use parallel operations (any number of times)
blitz.parallelFor(...);
blitz.parallelReduce(...);
const sum = blitz.iter(data).sum();

// 3. Cleanup (joins worker threads)
blitz.deinit();
```
Important:
- `init()` returns `error.AlreadyInitialized` if called twice
- Always pair with `deinit()` using `defer`
- Worker threads are reused across all operations
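Putting the lifecycle together, a minimal sketch of a complete program; the `@import("blitz")` module name and the doubling body are assumptions for illustration:

```zig
const blitz = @import("blitz"); // assumed module name

pub fn main() !void {
    try blitz.init();
    defer blitz.deinit(); // guaranteed cleanup, even on early return

    var data = [_]f64{ 1, 2, 3, 4 };
    const Ctx = struct { data: []f64 };
    blitz.parallelFor(data.len, Ctx, .{ .data = &data }, struct {
        fn body(c: Ctx, start: usize, end: usize) void {
            for (c.data[start..end]) |*x| x.* *= 2;
        }
    }.body);
}
```

`defer` runs `deinit()` on every exit path, so the worker threads are always joined exactly once.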