v1.0.0-zig0.15.2

Benchmarking

# Run benchmarks (builds with ReleaseFast automatically)
zig build bench
# Run comparative benchmark (Blitz vs Rayon)
zig build compare

The main benchmark compares Blitz against sequential baselines:

======================================================================
BLITZ vs RAYON COMPARISON BENCHMARKS
======================================================================
Initialized with 10 workers
+-----------------------+------------+------------+----------+
| Benchmark             | Blitz      | Sequential | Speedup  |
+-----------------------+------------+------------+----------+
| Parallel Sum 10M      | 1.2 ms     | 35.0 ms    | 29.2x    |
| Parallel Sort 10M     | 430 ms     | 4644 ms    | 10.8x    |
| Fork-Join fib(40)     | 93 ms      | 635 ms     | 6.8x     |
| Find First 10M        | 3.3 ms     | 28.0 ms    | 8.5x     |
+-----------------------+------------+------------+----------+

The build.zig defines two benchmark steps:

Command              Description
zig build bench      Run Blitz benchmarks (benchmarks/rayon_compare.zig)
zig build compare    Run comparative Blitz vs Rayon benchmark (benchmarks/compare.zig)

Both are built with ReleaseFast optimization.
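
For orientation, a benchmark step pinned to ReleaseFast could be declared roughly as follows. This is a sketch under assumptions, not the project's actual build.zig: the executable name is hypothetical, and the wiring that makes the blitz module importable is left out.

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});

    // Optimization is hard-coded instead of read from -Doptimize,
    // so the benchmark always runs as a ReleaseFast build.
    const bench_exe = b.addExecutable(.{
        .name = "bench", // hypothetical name
        .root_module = b.createModule(.{
            .root_source_file = b.path("benchmarks/rayon_compare.zig"),
            .target = target,
            .optimize = .ReleaseFast,
        }),
    });
    // Assumption: the real build.zig also adds the blitz module as an
    // import of this root module; that step is omitted here.

    const run_bench = b.addRunArtifact(bench_exe);
    const bench_step = b.step("bench", "Run Blitz benchmarks");
    bench_step.dependOn(&run_bench.step);
}
```

Hard-coding the optimize mode is what lets `zig build bench` produce comparable numbers without extra flags.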

const std = @import("std");
const blitz = @import("blitz");

fn benchmarkSum() void {
    const allocator = std.heap.page_allocator;
    const n: usize = 10_000_000;
    const data = allocator.alloc(i64, n) catch unreachable;
    defer allocator.free(data);

    // Initialize
    for (data, 0..) |*v, i| v.* = @intCast(i);

    // Warmup
    _ = blitz.iter(i64, data).sum();

    // Benchmark (nanoTimestamp returns i128, so accumulate in i128)
    const iterations = 10;
    var total_ns: i128 = 0;
    for (0..iterations) |_| {
        const start = std.time.nanoTimestamp();
        _ = blitz.iter(i64, data).sum();
        total_ns += std.time.nanoTimestamp() - start;
    }

    const avg_ms = @as(f64, @floatFromInt(total_ns)) / @as(f64, iterations) / 1_000_000.0;
    std.debug.print("Parallel sum: {d:.2} ms\n", .{avg_ms});
}

fn benchmarkComparison() void {
    // generateData is not shown here; it is expected to return a []i64
    const data = generateData(10_000_000);

    // Sequential baseline
    const seq_start = std.time.nanoTimestamp();
    var seq_sum: i64 = 0;
    for (data) |v| seq_sum += v;
    const seq_time = std.time.nanoTimestamp() - seq_start;

    // Parallel
    const par_start = std.time.nanoTimestamp();
    const par_sum = blitz.iter(i64, data).sum();
    const par_time = std.time.nanoTimestamp() - par_start;

    // Verify correctness
    std.debug.assert(seq_sum == par_sum);

    // Report
    const speedup = @as(f64, @floatFromInt(seq_time)) /
        @as(f64, @floatFromInt(par_time));
    std.debug.print("Speedup: {d:.1}x\n", .{speedup});
}
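
The timing pattern in these functions can be factored into a small reusable helper. The sketch below is an illustration that uses only the standard library: `measure`, `Timing`, and `sequentialSum` are hypothetical names, and the workload is a plain sequential loop instead of a Blitz call. It uses `std.time.Timer`, which is monotonic, whereas `std.time.nanoTimestamp` reads the wall clock.

```zig
const std = @import("std");

/// Fastest and mean wall-clock time of a measured run, in nanoseconds.
const Timing = struct {
    min_ns: u64,
    mean_ns: u64,
};

/// Calls `func` once as warmup, then `iterations` more times under a
/// monotonic timer, and returns the minimum and mean elapsed time.
fn measure(comptime func: fn () u64, iterations: u32) !Timing {
    std.debug.assert(iterations > 0);

    // Warmup run; keep the result alive so the call is not removed.
    std.mem.doNotOptimizeAway(func());

    var timer = try std.time.Timer.start();
    var min_ns: u64 = std.math.maxInt(u64);
    var total_ns: u64 = 0;
    for (0..iterations) |_| {
        timer.reset();
        const result = func();
        const elapsed = timer.read();
        std.mem.doNotOptimizeAway(result);
        min_ns = @min(min_ns, elapsed);
        total_ns += elapsed;
    }
    return .{ .min_ns = min_ns, .mean_ns = total_ns / iterations };
}

/// Hypothetical workload: sequential sum of 0..1_000_000.
fn sequentialSum() u64 {
    var sum: u64 = 0;
    for (0..1_000_000) |i| sum += i;
    return sum;
}

pub fn main() !void {
    const t = try measure(sequentialSum, 10);
    std.debug.print("min: {d} ns, mean: {d} ns\n", .{ t.min_ns, t.mean_ns });
}
```

Swapping `sequentialSum` for a function that wraps a Blitz call would give the parallel measurement under the same harness.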

Benchmarks in build.zig are always built with ReleaseFast. If building manually:

zig build bench # Already uses ReleaseFast

Warm up before measuring:

// Run once to warm caches
_ = functionToBenchmark();
// Then measure
const start = std.time.nanoTimestamp();
_ = functionToBenchmark();
const elapsed = std.time.nanoTimestamp() - start;

Run multiple iterations and report the minimum:

const iterations = 100;
var min_time: i128 = std.math.maxInt(i128); // nanoTimestamp returns i128
for (0..iterations) |_| {
    const start = std.time.nanoTimestamp();
    _ = functionToBenchmark();
    const elapsed = std.time.nanoTimestamp() - start;
    min_time = @min(min_time, elapsed);
}
// Report minimum (least affected by noise)

Keep the result alive so the compiler cannot remove the call:

// Use doNotOptimizeAway to prevent dead code elimination
const result = functionToBenchmark();
std.mem.doNotOptimizeAway(&result);

Keep allocation out of the timed region:

// Allocate OUTSIDE the timing loop
const data = try allocator.alloc(i64, n);
defer allocator.free(data);
// Only measure the operation
const start = std.time.nanoTimestamp();
processData(data);
const elapsed = std.time.nanoTimestamp() - start;

Throughput (elements processed per second):

const elements_per_sec = @as(f64, @floatFromInt(n)) /
    (@as(f64, @floatFromInt(elapsed_ns)) / 1_000_000_000.0);
std.debug.print("Throughput: {d:.2} billion elements/sec\n", .{
    elements_per_sec / 1_000_000_000.0,
});

Speedup (sequential time divided by parallel time):

const speedup = @as(f64, @floatFromInt(sequential_time)) /
    @as(f64, @floatFromInt(parallel_time));
std.debug.print("Speedup: {d:.1}x over sequential\n", .{speedup});

Parallel efficiency (speedup per worker, as a percentage):

const efficiency = speedup / @as(f64, @floatFromInt(num_workers)) * 100.0;
std.debug.print("Parallel efficiency: {d:.1}%\n", .{efficiency});
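
As a worked example, the three metrics can be applied to rows of the results table shown earlier. The helper names below are hypothetical; the inputs are the Fork-Join fib(40) row (635 ms sequential, 93 ms on 10 workers) and the Parallel Sum row (10M elements in 1.2 ms).

```zig
const std = @import("std");

/// Elements processed per second for `n` elements in `elapsed_ns` nanoseconds.
fn throughput(n: usize, elapsed_ns: u64) f64 {
    const seconds = @as(f64, @floatFromInt(elapsed_ns)) / 1_000_000_000.0;
    return @as(f64, @floatFromInt(n)) / seconds;
}

/// Ratio of sequential time to parallel time (same unit for both).
fn speedup(sequential_ms: f64, parallel_ms: f64) f64 {
    return sequential_ms / parallel_ms;
}

/// Speedup per worker, expressed as a percentage.
fn efficiencyPercent(speedup_x: f64, num_workers: u32) f64 {
    return speedup_x / @as(f64, @floatFromInt(num_workers)) * 100.0;
}

pub fn main() void {
    // Fork-Join fib(40): 635 ms sequential, 93 ms with 10 workers.
    const s = speedup(635.0, 93.0);
    const e = efficiencyPercent(s, 10);
    std.debug.print("Speedup: {d:.1}x, efficiency: {d:.1}%\n", .{ s, e });

    // Parallel Sum: 10M elements in 1.2 ms (1_200_000 ns).
    const t = throughput(10_000_000, 1_200_000);
    std.debug.print("Throughput: {d:.2} billion elements/sec\n", .{t / 1_000_000_000.0});
}
```

For fib(40) this gives a speedup of about 6.8x and an efficiency of about 68% on 10 workers. The Parallel Sum row works out to roughly 8.3 billion elements per second.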

Test with various data patterns:

fn benchmarkSort() void {
    const patterns = .{
        .{ "Random", generateRandom },
        .{ "Sorted", generateSorted },
        .{ "Reverse", generateReverse },
        .{ "Equal", generateEqual },
        .{ "Few unique", generateFewUnique },
    };
    inline for (patterns) |pattern| {
        const name = pattern[0];
        const generator = pattern[1];
        const data = generator(10_000_000);
        const start = std.time.nanoTimestamp();
        blitz.sortAsc(i64, data);
        const elapsed = std.time.nanoTimestamp() - start;
        std.debug.print("{s}: {d:.2} ms\n", .{
            name,
            @as(f64, @floatFromInt(elapsed)) / 1_000_000.0,
        });
    }
}
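
The generator functions are not shown above. One possible way to produce the five patterns is sketched here, assuming a single hypothetical `fillPattern` helper that fills a caller-provided buffer (the generators in the example above instead return a slice). The fixed seed, the constant 42, and the choice of 8 distinct values for "few unique" are arbitrary assumptions.

```zig
const std = @import("std");

const Pattern = enum { random, sorted, reverse, equal, few_unique };

/// Fills `data` in place according to `pattern`.
/// A fixed `seed` makes the random patterns repeatable between runs.
fn fillPattern(data: []i64, pattern: Pattern, seed: u64) void {
    var prng = std.Random.DefaultPrng.init(seed);
    const random = prng.random();
    for (data, 0..) |*v, i| {
        v.* = switch (pattern) {
            .random => random.int(i64),
            .sorted => @as(i64, @intCast(i)),
            .reverse => @as(i64, @intCast(data.len - 1 - i)),
            .equal => 42,
            // Values drawn from 0..7 only, so duplicates dominate.
            .few_unique => random.intRangeLessThan(i64, 0, 8),
        };
    }
}

pub fn main() void {
    var buf: [8]i64 = undefined;
    fillPattern(&buf, .reverse, 0);
    std.debug.print("{any}\n", .{buf});
}
```

Filling a reusable buffer keeps allocation out of the timed region, and the same buffer can be refilled before each sort.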

Measure how performance scales with cores:

fn benchmarkScaling() !void {
    const core_counts = [_]u32{ 1, 2, 4, 8, 16 };
    for (core_counts) |cores| {
        // Re-initialize the pool with the requested number of workers
        blitz.deinit();
        try blitz.initWithConfig(.{ .background_worker_count = cores - 1 });
        const time = measureOperation();
        std.debug.print("{d} cores: {d:.2} ms\n", .{ cores, time });
    }
}

When benchmarking parallel workloads, track CPU efficiency metrics:

const posix = std.posix;

fn getResourceUsage() struct { ctx_switches: i64, peak_memory: i64 } {
    const ru = posix.getrusage(posix.rusage.SELF);
    return .{
        .ctx_switches = ru.nivcsw, // Involuntary context switches
        .peak_memory = ru.maxrss, // Peak RSS in bytes (macOS) or KB (Linux)
    };
}

Metric                      Good                     Concerning
CPU time / wall time        < 2x (for N threads)     > 3x indicates contention
Involuntary ctx switches    < 1000                   > 10000 indicates spinning
Instructions per cycle      > 1.5                    < 0.5 indicates stalls
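
The thresholds in the table can be turned into a simple check that a benchmark runs after measuring. This is a sketch with hypothetical names; the `borderline` verdict for values between the two thresholds is an assumption, since the table does not name that range.

```zig
const std = @import("std");

const Verdict = enum { good, borderline, concerning };

/// Involuntary context switches: < 1000 is good, > 10000 is concerning.
fn classifyCtxSwitches(count: u64) Verdict {
    if (count < 1_000) return .good;
    if (count > 10_000) return .concerning;
    return .borderline;
}

/// Instructions per cycle: > 1.5 is good, < 0.5 is concerning.
fn classifyIpc(ipc: f64) Verdict {
    if (ipc > 1.5) return .good;
    if (ipc < 0.5) return .concerning;
    return .borderline;
}

pub fn main() void {
    std.debug.print("ctx switches 250: {s}\n", .{@tagName(classifyCtxSwitches(250))});
    std.debug.print("IPC 0.3: {s}\n", .{@tagName(classifyIpc(0.3))});
}
```

Context-switch counts from getrusage are cumulative for the process, so the value to classify is the difference between a reading taken before the benchmark and one taken after.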

Blitz uses progressive sleep with tunable constants across multiple files:

SpinWait (Latch.zig):
    SPIN_LIMIT = 10            -- Spin iterations before yielding
    YIELD_LIMIT = 20           -- Total iterations before sleeping
Worker sleep (Pool.zig):
    ROUNDS_UNTIL_SLEEPY = 32   -- Steal rounds before getting sleepy
    ROUNDS_UNTIL_SLEEPING = 33 -- Rounds before actual sleep
Future wait (Future.zig):
    SPIN_LIMIT = 5             -- Spins while waiting for stolen job

Trade-offs:

  • Higher values: Better latency for continuous workloads, more CPU usage
  • Lower values: Better CPU efficiency for bursty workloads, higher wake latency
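
To make the trade-off concrete, here is a minimal sketch of the decision a progressive backoff makes on each wait iteration, using the SpinWait values listed above. It is an illustration of the general spin, yield, sleep pattern, not Blitz's actual implementation. `Action` and `nextAction` are hypothetical names, and reading the two limits as half-open iteration ranges is an assumption.

```zig
const std = @import("std");

// Values as listed for SpinWait above.
const SPIN_LIMIT: u32 = 10; // spin iterations before yielding
const YIELD_LIMIT: u32 = 20; // total iterations before sleeping

const Action = enum { spin, yield_thread, sleep };

/// Picks the waiting strategy for the given iteration (0-based).
/// Iterations 0..9 spin, 10..19 yield, 20 and above sleep.
fn nextAction(iteration: u32) Action {
    if (iteration < SPIN_LIMIT) return .spin;
    if (iteration < YIELD_LIMIT) return .yield_thread;
    return .sleep;
}

pub fn main() void {
    // Count how often each action is chosen over 25 wait iterations.
    var counts = [_]u32{ 0, 0, 0 };
    for (0..25) |i| {
        const action = nextAction(@intCast(i));
        counts[@intFromEnum(action)] += 1;
    }
    std.debug.print("spin={d} yield={d} sleep={d}\n", .{ counts[0], counts[1], counts[2] });
}
```

Raising either limit keeps a waiter in the cheap-to-wake spin or yield phases for longer, which is where the extra CPU usage comes from. Lowering them reaches the sleep phase sooner and pays the wake-up latency instead.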

If you see high CPU time relative to wall time:

  1. Check context switches: High involuntary switches indicate thread contention
  2. Profile with perf: Look for time spent in futex_wait/spinLoopHint
  3. Measure efficiency: speedup / num_threads should be > 50%

Profile where CPU time is spent:

# Linux perf
perf record ./benchmark
perf report
# macOS Instruments
xcrun xctrace record --template "Time Profiler" --launch -- ./benchmark

Measure cache behavior:

# Linux
perf stat -e cache-misses,cache-references ./benchmark
# Valgrind cachegrind
valgrind --tool=cachegrind ./benchmark