Benchmarking
Quick Start
```sh
# Run benchmarks (builds with ReleaseFast automatically)
zig build bench

# Run comparative benchmark (Blitz vs Rayon)
zig build compare
```

Benchmark Suite
The main benchmark compares Blitz against sequential baselines:
```
======================================================================
 BLITZ vs RAYON COMPARISON BENCHMARKS
======================================================================

Initialized with 10 workers

+-----------------------+------------+------------+----------+
| Benchmark             | Blitz      | Sequential | Speedup  |
+-----------------------+------------+------------+----------+
| Parallel Sum 10M      | 1.2 ms     | 35.0 ms    | 29.2x    |
| Parallel Sort 10M     | 430 ms     | 4644 ms    | 10.8x    |
| Fork-Join fib(40)     | 93 ms      | 635 ms     | 6.8x     |
| Find First 10M        | 3.3 ms     | 28.0 ms    | 8.5x     |
+-----------------------+------------+------------+----------+
```

Build Steps
The build.zig defines two benchmark steps:
| Command | Description |
|---|---|
| zig build bench | Run Blitz benchmarks (benchmarks/rayon_compare.zig) |
| zig build compare | Run comparative Blitz vs Rayon benchmark (benchmarks/compare.zig) |
Both are built with ReleaseFast optimization.
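For reference, this is roughly how a ReleaseFast benchmark step can be declared in build.zig. It is a sketch only: the real build.zig may wire up modules and dependencies differently, and the std.Build calls shown assume a recent Zig release.

```zig
// Sketch of a "bench" step pinned to ReleaseFast (details may differ from the real build.zig).
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});

    const bench_exe = b.addExecutable(.{
        .name = "bench",
        .root_source_file = b.path("benchmarks/rayon_compare.zig"),
        .target = target,
        .optimize = .ReleaseFast, // benchmarks are always built with ReleaseFast
    });

    const run_bench = b.addRunArtifact(bench_exe);
    b.step("bench", "Run Blitz benchmarks").dependOn(&run_bench.step);
}
```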
Writing Benchmarks
Basic Benchmark Pattern
```zig
fn benchmarkSum() void {
    const allocator = std.heap.page_allocator;
    const n: usize = 10_000_000;

    const data = allocator.alloc(i64, n) catch unreachable;
    defer allocator.free(data);

    // Initialize
    for (data, 0..) |*v, i| v.* = @intCast(i);

    // Warmup
    _ = blitz.iter(i64, data).sum();

    // Benchmark
    const iterations = 10;
    var total_ns: i128 = 0;

    for (0..iterations) |_| {
        const start = std.time.nanoTimestamp();
        _ = blitz.iter(i64, data).sum();
        total_ns += std.time.nanoTimestamp() - start;
    }

    const avg_ms = @as(f64, @floatFromInt(total_ns)) / @as(f64, iterations) / 1_000_000.0;
    std.debug.print("Parallel sum: {d:.2} ms\n", .{avg_ms});
}
```

Comparing Implementations
```zig
fn benchmarkComparison() void {
    const data = generateData(10_000_000);

    // Sequential baseline
    const seq_start = std.time.nanoTimestamp();
    var seq_sum: i64 = 0;
    for (data) |v| seq_sum += v;
    const seq_time = std.time.nanoTimestamp() - seq_start;

    // Parallel
    const par_start = std.time.nanoTimestamp();
    const par_sum = blitz.iter(i64, data).sum();
    const par_time = std.time.nanoTimestamp() - par_start;

    // Verify correctness
    std.debug.assert(seq_sum == par_sum);

    // Report
    const speedup = @as(f64, @floatFromInt(seq_time)) / @as(f64, @floatFromInt(par_time));
    std.debug.print("Speedup: {d:.1}x\n", .{speedup});
}
```

Benchmarking Best Practices
1. Use Release Mode

Benchmarks in build.zig are always built with ReleaseFast, so there is no need to pass an optimization flag when invoking them:

```sh
zig build bench  # Already uses ReleaseFast
```

2. Warmup
```zig
// Run once to warm caches
_ = functionToBenchmark();

// Then measure
const start = std.time.nanoTimestamp();
_ = functionToBenchmark();
const elapsed = std.time.nanoTimestamp() - start;
```

3. Multiple Iterations
```zig
const iterations = 100;
var min_time: i128 = std.math.maxInt(i128);

for (0..iterations) |_| {
    const start = std.time.nanoTimestamp();
    _ = functionToBenchmark();
    const elapsed = std.time.nanoTimestamp() - start;
    min_time = @min(min_time, elapsed);
}

// Report minimum (least affected by noise)
```

4. Prevent Optimization
```zig
// Use doNotOptimizeAway to prevent dead code elimination
const result = functionToBenchmark();
std.mem.doNotOptimizeAway(&result);
```

5. Isolate Measurements
```zig
// Allocate OUTSIDE the timing loop
const data = try allocator.alloc(i64, n);
defer allocator.free(data);

// Only measure the operation
const start = std.time.nanoTimestamp();
processData(data);
const elapsed = std.time.nanoTimestamp() - start;
```

Metrics to Track
Throughput

```zig
const elements_per_sec = @as(f64, @floatFromInt(n)) /
    (@as(f64, @floatFromInt(elapsed_ns)) / 1_000_000_000.0);

std.debug.print("Throughput: {d:.2} billion elements/sec\n", .{
    elements_per_sec / 1_000_000_000.0,
});
```

Speedup

```zig
const speedup = @as(f64, @floatFromInt(sequential_time)) /
    @as(f64, @floatFromInt(parallel_time));

std.debug.print("Speedup: {d:.1}x over sequential\n", .{speedup});
```

Efficiency

```zig
const efficiency = speedup / @as(f64, @floatFromInt(num_workers)) * 100.0;
std.debug.print("Parallel efficiency: {d:.1}%\n", .{efficiency});
```

Input Patterns
Test with various data patterns:
```zig
fn benchmarkSort() void {
    const patterns = .{
        .{ "Random", generateRandom },
        .{ "Sorted", generateSorted },
        .{ "Reverse", generateReverse },
        .{ "Equal", generateEqual },
        .{ "Few unique", generateFewUnique },
    };

    inline for (patterns) |pattern| {
        const name = pattern[0];
        const generator = pattern[1];
        const data = generator(10_000_000);

        const start = std.time.nanoTimestamp();
        blitz.sortAsc(i64, data);
        const elapsed = std.time.nanoTimestamp() - start;

        std.debug.print("{s}: {d:.2} ms\n", .{
            name,
            @as(f64, @floatFromInt(elapsed)) / 1_000_000.0,
        });
    }
}
```
Scaling Analysis

Measure how performance scales with cores:
```zig
fn benchmarkScaling() !void {
    const core_counts = [_]u32{ 1, 2, 4, 8, 16 };

    for (core_counts) |cores| {
        blitz.deinit();
        try blitz.initWithConfig(.{ .background_worker_count = cores - 1 });

        const time = measureOperation();
        std.debug.print("{} cores: {d:.2} ms\n", .{ cores, time });
    }
}
```

Thread Pool Efficiency

When benchmarking parallel workloads, track CPU efficiency metrics:
Resource Usage

```zig
const posix = std.posix;

fn getResourceUsage() struct { ctx_switches: i64, peak_memory: i64 } {
    const ru = posix.getrusage(posix.RUSAGE.SELF);
    return .{
        .ctx_switches = ru.nivcsw, // Involuntary context switches
        .peak_memory = ru.maxrss, // Peak RSS in bytes (macOS) or KB (Linux)
    };
}
```

Key Metrics
| Metric | Good | Concerning |
|---|---|---|
| CPU time / wall time | < 2x (for N threads) | > 3x indicates contention |
| Involuntary ctx switches | < 1000 | > 10000 indicates spinning |
| Instructions per cycle | > 1.5 | < 0.5 indicates stalls |
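As a quick way to apply these thresholds, the sketch below compares context switches before and after a run. It assumes the getResourceUsage() helper defined above and a placeholder functionToBenchmark(); the 10,000-switch threshold is the rough guideline from the table, not a hard rule.

```zig
// Sketch: compare involuntary context switches before/after a run against the
// rough thresholds above. getResourceUsage() is the helper defined earlier;
// functionToBenchmark() stands in for the workload under test.
const before = getResourceUsage();
const start = std.time.nanoTimestamp();
_ = functionToBenchmark();
const wall_ms = @as(f64, @floatFromInt(std.time.nanoTimestamp() - start)) / 1_000_000.0;
const after = getResourceUsage();

const ctx_switches = after.ctx_switches - before.ctx_switches;
if (ctx_switches > 10_000) {
    std.debug.print("warning: {} involuntary context switches in {d:.1} ms (possible spinning)\n", .{ ctx_switches, wall_ms });
}
```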
Sleep Protocol Tuning

Blitz uses progressive sleep with tunable constants across multiple files (a simplified spin-wait sketch follows the trade-offs below):

```
SpinWait (Latch.zig):
  SPIN_LIMIT = 10             -- Spin iterations before yielding
  YIELD_LIMIT = 20            -- Total iterations before sleeping

Worker sleep (Pool.zig):
  ROUNDS_UNTIL_SLEEPY = 32    -- Steal rounds before getting sleepy
  ROUNDS_UNTIL_SLEEPING = 33  -- Rounds before actual sleep

Future wait (Future.zig):
  SPIN_LIMIT = 5              -- Spins while waiting for stolen job
```

Trade-offs:
- Higher values: Better latency for continuous workloads, more CPU usage
- Lower values: Better CPU efficiency for bursty workloads, higher wake latency
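To make the constants concrete, here is a simplified spin-wait sketch. It is illustrative only and does not mirror the real SpinWait in Latch.zig; in the actual protocol the final stage parks the worker on a futex/event rather than continuing to yield.

```zig
const std = @import("std");

const SPIN_LIMIT = 10; // busy-spin iterations before yielding (cf. Latch.zig)
const YIELD_LIMIT = 20; // total iterations before falling back to sleeping

// Illustrative only: spin briefly, then yield, then (in real code) park the thread.
fn spinWait(comptime done: fn () bool) void {
    var i: usize = 0;
    while (!done()) : (i += 1) {
        if (i < SPIN_LIMIT) {
            std.atomic.spinLoopHint(); // cheap CPU pause, stays on-core
        } else if (i < YIELD_LIMIT) {
            std.Thread.yield() catch {}; // give up the timeslice
        } else {
            // Real code would block here (futex/event wait) instead of yielding forever.
            std.Thread.yield() catch {};
        }
    }
}
```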
Diagnosing Contention

If you see high CPU time relative to wall time:

- Check context switches: high involuntary switches indicate thread contention
- Profile with perf: look for time spent in futex_wait or spinLoopHint
- Measure efficiency: speedup / num_threads should be above 50% (see the check below)
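A minimal sketch of that efficiency check, assuming speedup was measured as in the Speedup snippet and a pool of 10 workers as in the sample output above:

```zig
// Sketch: flag poor scaling. `speedup` comes from the Speedup snippet;
// num_workers is the pool size used for this run (10 in the sample output).
const num_workers: f64 = 10.0;
const efficiency = speedup / num_workers; // fraction of ideal linear scaling
if (efficiency < 0.5) {
    std.debug.print("warning: parallel efficiency {d:.0}% - investigate contention\n", .{efficiency * 100.0});
}
```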
Profiling
Section titled “Profiling”CPU Profiling
```sh
# Linux perf
perf record ./benchmark
perf report

# macOS Instruments
xcrun xctrace record --template "Time Profiler" --launch ./benchmark
```

Cache Analysis
```sh
# Linux
perf stat -e cache-misses,cache-references ./benchmark

# Valgrind cachegrind
valgrind --tool=cachegrind ./benchmark
```