v1.0.0-zig0.15.2

Performance Tuning

Practical guidance for getting the best performance out of Blitz’s parallel runtime.

Parallelism has overhead: forking tasks, work-stealing, and joining results. The work per element must be large enough to amortize this cost.

| Data Size | Per-Element Cost | Parallelize? |
| --- | --- | --- |
| < 1,000 | Any | No; overhead dominates |
| 1K - 10K | Cheap (add, compare) | Probably not |
| 1K - 10K | Moderate (sqrt, trig) | Maybe; benchmark it |
| 1K - 10K | Expensive (> 1us) | Yes |
| > 10K | Cheap | Yes |
| > 100K | Any | Definitely |

Use a size-based threshold:

if (data.len >= blitz.DEFAULT_GRAIN_SIZE) {
    blitz.parallelFor(data.len, ctx_type, ctx, bodyFn);
} else {
    for (data) |*v| v.* = transform(v.*);
}

The grain size controls the minimum number of elements per parallel task. It is the single most important tuning knob.

Blitz uses a default grain size of 65,536. This works well for lightweight per-element operations.

// Check current grain size
const current = blitz.getGrainSize();
// Set a custom grain size
blitz.setGrainSize(1000);
// Reset to default
blitz.setGrainSize(0);

Most parallel operations have a WithGrain variant for one-off tuning:

// Global grain size is unchanged
blitz.parallelForWithGrain(n, ctx_type, ctx, bodyFn, 500);
blitz.parallelReduceWithGrain(T, n, identity, ctx_type, ctx, mapFn, combineFn, 4096);
blitz.parallelCollectWithGrain(T, U, input, output, ctx_type, ctx, mapFn, 100);

Suggested starting points by per-element cost:

| Per-Element Work | Suggested Grain | Rationale |
| --- | --- | --- |
| Trivial (increment, assign) | 50,000 - 100,000 | Fork/join overhead is significant relative to work |
| Light (arithmetic, comparison) | 10,000 - 50,000 | Default range works well |
| Moderate (sqrt, exp, string ops) | 1,000 - 10,000 | More parallelism justified |
| Heavy (complex math, I/O, alloc) | 100 - 1,000 | Each element is expensive enough |
| Very heavy (> 1ms per element) | 1 - 100 | Maximum parallelism |

Rule of thumb: Each chunk should represent at least 10-100 microseconds of work. If a chunk finishes in nanoseconds, the grain is too small.
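
The rule of thumb can be turned into a starting grain size by dividing a target chunk duration by the measured per-element cost. The sketch below is only arithmetic: `suggestGrain` is a hypothetical helper, not part of Blitz, and the per-element cost is assumed to come from your own sequential measurement.

```zig
const std = @import("std");

/// Hypothetical helper (not a Blitz API): estimate how many elements one
/// chunk needs so that it represents about `target_chunk_ns` of work.
fn suggestGrain(per_element_ns: u64, target_chunk_ns: u64) u64 {
    // Treat a cost that rounds down to 0 ns as 1 ns to avoid dividing by zero.
    const cost: u64 = if (per_element_ns == 0) 1 else per_element_ns;
    const grain = target_chunk_ns / cost;
    // A chunk always holds at least one element.
    return if (grain == 0) 1 else grain;
}

pub fn main() void {
    // Assumed costs for illustration: 10 ns and 1 ms per element,
    // with a 50 microsecond target per chunk.
    std.debug.print("light work: {d} elements per chunk\n", .{suggestGrain(10, 50_000)});
    std.debug.print("very heavy work: {d} elements per chunk\n", .{suggestGrain(1_000_000, 50_000)});
}
```

Treat the result as a first guess to pass to a WithGrain variant, then confirm it by benchmarking several grain sizes.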

const std = @import("std");
const blitz = @import("blitz");

fn benchmarkGrain(data: []f64, grain: usize) u64 {
    const start = std.time.nanoTimestamp();
    blitz.parallelForWithGrain(data.len, []f64, data, struct {
        fn body(d: []f64, s: usize, e: usize) void {
            for (d[s..e]) |*v| v.* = @exp(v.*);
        }
    }.body, grain);
    return @intCast(std.time.nanoTimestamp() - start);
}

pub fn main() !void {
    try blitz.init();
    defer blitz.deinit();

    var data: [1_000_000]f64 = undefined;
    for (&data, 0..) |*v, i| v.* = @as(f64, @floatFromInt(i)) / 1_000_000.0;

    // Try different grain sizes
    const grains = [_]usize{ 100, 1_000, 10_000, 50_000, 100_000 };
    for (grains) |g| {
        @memset(&data, 0.5); // Reset
        const ns = benchmarkGrain(&data, g);
        std.debug.print("Grain {d:>7}: {d:>6.2} ms\n", .{
            g, @as(f64, @floatFromInt(ns)) / 1_000_000.0,
        });
    }
}

Compute-bound operations are limited by CPU throughput (math, logic, branching). These scale well with cores.

// Compute-bound: scales nearly linearly
blitz.parallelFor(n, Context, ctx, struct {
    fn body(c: Context, start: usize, end: usize) void {
        for (c.data[start..end]) |*v| {
            v.* = @exp(v.*) * @sin(v.*) + @cos(v.*); // Heavy math
        }
    }
}.body);

Memory-bound operations are limited by memory bandwidth (simple reads/writes over large arrays). Adding more cores does not help once bandwidth is saturated.

// Memory-bound: may not scale beyond 2-4 cores
blitz.parallelFor(n, Context, ctx, struct {
    fn body(c: Context, start: usize, end: usize) void {
        for (c.data[start..end]) |*v| {
            v.* += 1; // Trivial compute, bottleneck is memory
        }
    }
}.body);

Signs of memory-bound behavior:

  • Speedup plateaus at 2-4x regardless of core count
  • Performance varies with array size (cache effects)
  • Same throughput as @memcpy for similar data sizes

Strategies for memory-bound code:

  • Use larger grain sizes to reduce overhead
  • Process data in cache-friendly order (sequential access patterns)
  • Consider whether parallelism is needed at all
  • Combine multiple passes into one (kernel fusion)
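
As an illustration of the last strategy, the sketch below contrasts two separate passes over an array with a single fused pass. It is plain sequential Zig with hypothetical function names; the same idea applies to the body of a parallel loop, where fusing means each element is loaded and stored once instead of twice.

```zig
const std = @import("std");

/// Two passes: every element is loaded and stored twice.
fn scaleThenOffsetTwoPasses(data: []f64, scale: f64, offset: f64) void {
    for (data) |*v| v.* *= scale;
    for (data) |*v| v.* += offset;
}

/// Fused pass: same arithmetic, but each element is touched once.
fn scaleThenOffsetFused(data: []f64, scale: f64, offset: f64) void {
    for (data) |*v| v.* = v.* * scale + offset;
}

pub fn main() void {
    var a = [_]f64{ 1.0, 2.0, 3.0 };
    var b = [_]f64{ 1.0, 2.0, 3.0 };
    scaleThenOffsetTwoPasses(&a, 2.0, 0.5);
    scaleThenOffsetFused(&b, 2.0, 0.5);
    std.debug.print("two passes: {any}\nfused:      {any}\n", .{ a, b });
}
```

For arrays that do not fit in cache, the fused version roughly halves the memory traffic, which is what matters once bandwidth is the bottleneck.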

False sharing occurs when threads write to different variables that share the same cache line (typically 64 bytes). This causes cache lines to bounce between cores, destroying performance.

// BAD: Adjacent counters share cache lines
var counters: [8]u64 = .{0} ** 8; // 64 bytes total = 1 cache line!
blitz.parallelFor(n, *[8]u64, &counters, struct {
    fn body(c: *[8]u64, start: usize, end: usize) void {
        const worker = start % 8;
        for (start..end) |_| {
            c[worker] += 1; // All 8 workers thrash the same cache line
        }
    }
}.body);

Pad structures to cache line boundaries:

// GOOD: Each counter on its own cache line
const PaddedCounter = struct {
    value: u64 = 0,
    _padding: [7]u64 = undefined, // Pad to 64 bytes
};
var counters: [8]PaddedCounter = .{.{}} ** 8;
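
An alternative to manual padding is to let the compiler enforce the alignment. The sketch below is one possible variant, not something taken from the Blitz sources: it uses the standard library constant `std.atomic.cache_line`, which can be larger than 64 bytes on some targets, and the name `AlignedCounter` is hypothetical.

```zig
const std = @import("std");

/// Hypothetical variant of the padded counter: aligning the field to the
/// target's cache line size makes the compiler insert the padding.
const AlignedCounter = struct {
    value: u64 align(std.atomic.cache_line) = 0,
};

comptime {
    // Each counter starts on a cache line boundary and fills whole lines.
    std.debug.assert(@alignOf(AlignedCounter) == std.atomic.cache_line);
    std.debug.assert(@sizeOf(AlignedCounter) % std.atomic.cache_line == 0);
}

pub fn main() void {
    var counters = [_]AlignedCounter{.{}} ** 8;
    counters[3].value += 1;
    std.debug.print("cache line: {d} bytes, counter size: {d} bytes, counters[3] = {d}\n", .{
        std.atomic.cache_line, @sizeOf(AlignedCounter), counters[3].value,
    });
}
```

The trade-off is memory: eight counters now occupy eight full cache lines, which is usually acceptable for per-worker state but not for large arrays.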

Or better yet, avoid shared mutable state entirely. Use parallel reduction:

// BEST: No shared state at all
const total = blitz.parallelReduce(
    u64, n, 0,
    Context, ctx,
    struct { fn map(c: Context, i: usize) u64 { return c.data[i]; } }.map,
    struct { fn add(a: u64, b: u64) u64 { return a + b; } }.add,
);

Use blitz.getStats() to inspect runtime behavior:

try blitz.init();
defer blitz.deinit();
blitz.resetStats();
// Run your workload
blitz.parallelFor(n, ctx_type, ctx, bodyFn);
const stats = blitz.getStats();
std.debug.print("Tasks executed: {d}\n", .{stats.executed});
std.debug.print("Tasks stolen: {d}\n", .{stats.stolen});

What each counter reports:

| Metric | Meaning |
| --- | --- |
| executed | Total tasks run across all workers |
| stolen | Tasks taken from another worker’s deque |

What to look for:

  • High stolen / executed ratio (> 30%): Good load balancing, work-stealing is active
  • Low stolen ratio (< 5%): Work is evenly distributed (or grain is too large)
  • Zero stolen: Either single-threaded or grain so large that each worker gets one chunk
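
The thresholds above can be applied mechanically. The following sketch assumes the two counters are plain unsigned integers and passes them in explicitly, so it does not depend on the exact type returned by `blitz.getStats()`; `stealRatio`, `classify`, and `StealLevel` are hypothetical names.

```zig
const std = @import("std");

const StealLevel = enum { none, low, moderate, high };

/// Fraction of executed tasks that were stolen from another worker.
fn stealRatio(executed: u64, stolen: u64) f64 {
    if (executed == 0) return 0.0;
    return @as(f64, @floatFromInt(stolen)) / @as(f64, @floatFromInt(executed));
}

/// Classify using the thresholds from the list above (< 5% low, > 30% high).
fn classify(executed: u64, stolen: u64) StealLevel {
    if (stolen == 0) return .none;
    const ratio = stealRatio(executed, stolen);
    if (ratio < 0.05) return .low;
    if (ratio > 0.30) return .high;
    return .moderate;
}

pub fn main() void {
    // Illustrative numbers only, not measurements.
    const executed: u64 = 1000;
    const stolen: u64 = 400;
    std.debug.print("steal ratio {d:.2} -> {s}\n", .{
        stealRatio(executed, stolen), @tagName(classify(executed, stolen)),
    });
}
```

In a real program the two arguments would presumably be `stats.executed` and `stats.stolen`, cast to `u64` if the fields use a different integer type.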

Always measure sequential performance first:

// Sequential baseline
const seq_start = std.time.nanoTimestamp();
for (data) |*v| v.* = transform(v.*);
const seq_time = std.time.nanoTimestamp() - seq_start;
// Parallel
const par_start = std.time.nanoTimestamp();
blitz.parallelFor(data.len, ctx_type, ctx, bodyFn);
const par_time = std.time.nanoTimestamp() - par_start;
const speedup = @as(f64, @floatFromInt(seq_time)) / @as(f64, @floatFromInt(par_time));
std.debug.print("Speedup: {d:.1}x\n", .{speedup});

The first parallel call initializes threads and populates caches. Always discard the first iteration:

// Warm up (discard)
blitz.parallelFor(data.len, ctx_type, ctx, bodyFn);
// Actual measurement
const start = std.time.nanoTimestamp();
for (0..iterations) |_| {
blitz.parallelFor(data.len, ctx_type, ctx, bodyFn);
}
const elapsed = std.time.nanoTimestamp() - start;

For fast operations, run thousands of iterations to get stable measurements. A single run can be noisy due to OS scheduling and cache state.
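
One way to tame that noise is to time each iteration separately and report the median instead of a single total. The sketch below uses only the Zig standard library; `work` is a stand-in for the operation being measured (in practice, a Blitz call), and `medianNs` is a hypothetical helper.

```zig
const std = @import("std");

/// Sorts the samples in place and returns the middle one.
fn medianNs(samples: []u64) u64 {
    std.debug.assert(samples.len > 0);
    std.mem.sort(u64, samples, {}, std.sort.asc(u64));
    return samples[samples.len / 2];
}

/// Stand-in workload; replace with the operation under test.
fn work(data: []f64) void {
    for (data) |*v| v.* = v.* * 1.0001 + 0.5;
}

pub fn main() !void {
    var data = [_]f64{1.0} ** 4096;
    var samples: [1000]u64 = undefined;
    var timer = try std.time.Timer.start();

    work(&data); // Warm up (discard)
    for (&samples) |*s| {
        timer.reset();
        work(&data);
        s.* = timer.read();
    }
    std.mem.doNotOptimizeAway(data[0]);
    std.debug.print("median: {d} ns over {d} runs\n", .{ medianNs(&samples), samples.len });
}
```

The median is less sensitive than the mean to the occasional run that gets descheduled by the OS.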

std.debug.print("Workers: {d}\n", .{blitz.numWorkers()});

Maximum theoretical speedup is bounded by the number of workers. If you see 4x speedup on an 8-core machine, investigate whether the workload is memory-bound or the grain is suboptimal.
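
A convenient way to express this is parallel efficiency: measured speedup divided by the worker count. The helper below is a hypothetical sketch; in a real program the worker count would presumably come from `blitz.numWorkers()`.

```zig
const std = @import("std");

/// Speedup divided by worker count; 1.0 means perfect linear scaling.
fn parallelEfficiency(speedup: f64, workers: usize) f64 {
    if (workers == 0) return 0.0;
    return speedup / @as(f64, @floatFromInt(workers));
}

pub fn main() void {
    // The case from the text: 4x speedup with 8 workers.
    std.debug.print("efficiency: {d:.2}\n", .{parallelEfficiency(4.0, 8)});
}
```

An efficiency well below 1.0 on a compute-heavy workload is a hint to revisit the grain size; on a simple read/write workload it more likely reflects the memory bandwidth ceiling.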

On macOS:

# Sample CPU usage during benchmark
sample <pid> 5 -file output.txt
# Instruments (Time Profiler)
xcrun xctrace record --template "Time Profiler" --launch -- ./benchmark

On Linux:

# perf stat for hardware counters
perf stat ./benchmark
# Flame graph
perf record -g ./benchmark && perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

1. Parallelizing Too Little Data

// BAD: 100 elements is too few
blitz.parallelFor(100, ctx_type, ctx, bodyFn);
// GOOD: Check size first
if (data.len > 10_000) {
    blitz.parallelFor(data.len, ctx_type, ctx, bodyFn);
} else {
    for (data) |*v| v.* = process(v.*);
}

2. Excessive Allocations in Parallel Bodies

// BAD: Each chunk allocates
fn body(ctx: Context, start: usize, end: usize) void {
    var list: std.ArrayList(i64) = .empty; // Growing it hits the shared allocator: contention!
    defer list.deinit(allocator);
    ...
}
// GOOD: Pre-allocate or use stack buffers
fn body(ctx: Context, start: usize, end: usize) void {
    var buf: [1024]i64 = undefined;
    ...
}

3. Forgetting to Initialize

// Without init(), everything runs sequentially (no crash, just slow)
blitz.parallelFor(n, ctx_type, ctx, bodyFn);
// Always initialize
try blitz.init();
defer blitz.deinit();

Quick reference:

| Situation | Action |
| --- | --- |
| Too slow, CPU idle | Decrease grain size |
| Too slow, high overhead | Increase grain size |
| Doesn’t scale past 4x | Likely memory-bound; increase grain or fuse passes |
| Slower than sequential | Data too small or grain too small |
| Inconsistent timings | Warm up, run more iterations |
| High stolen ratio | Work is being rebalanced (usually good) |
| Need per-worker data | Use broadcast() for init, array indexed by worker |