Skip to content
v1.0.0-zig0.15.2

Parallel Reduction

Parallel reduction combines elements using an associative operation, achieving O(log n) depth with n processors.

data: [1, 2, 3, 4, 5, 6, 7, 8]
sum = 0
for each element:
sum += element
Time: O(n)
Step 1: Parallel local reductions (8 workers)
┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
│ 1 │ │ 2 │ │ 3 │ │ 4 │ │ 5 │ │ 6 │ │ 7 │ │ 8 │
└─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘
Step 2: Combine pairs
3 7 11 15
└───┬───┘ └───┬───┘
10 26
Step 3: Final combine
10 + 26 = 36
Depth: O(log n)
Work: O(n)
pub fn parallelReduce(
comptime T: type,
n: usize,
identity: T,
comptime Context: type,
ctx: Context,
comptime map: fn(Context, usize) T,
comptime combine: fn(T, T) T,
) T {
if (n == 0) return identity;
// Determine chunk size
const num_workers = blitz.numWorkers();
const chunk_size = @max(n / (num_workers * 4), MIN_CHUNK_SIZE);
// Allocate partial results
var partials: [MAX_WORKERS]T = undefined;
var num_partials: usize = 0;
// Parallel map phase
blitz.parallelFor(n, struct {
ctx: Context,
partials: []T,
chunk_size: usize,
}, .{
.ctx = ctx,
.partials = &partials,
.chunk_size = chunk_size,
}, struct {
fn body(c: @This(), start: usize, end: usize) void {
var local_result = identity;
for (start..end) |i| {
local_result = combine(local_result, map(c.ctx, i));
}
// Store partial result
const chunk_idx = start / c.chunk_size;
c.partials[chunk_idx] = local_result;
}
}.body);
// Sequential combine phase (small number of partials)
var result = identity;
for (partials[0..num_partials]) |partial| {
result = combine(result, partial);
}
return result;
}

Process chunks directly for better efficiency:

pub fn parallelReduceChunked(
comptime T: type,
comptime E: type,
data: []const E,
identity: T,
comptime chunkFn: fn([]const E) T,
comptime combine: fn(T, T) T,
) T {
if (data.len == 0) return identity;
const num_workers = blitz.numWorkers();
const chunk_size = @max(data.len / num_workers, 1024);
const num_chunks = (data.len + chunk_size - 1) / chunk_size;
var partials: [MAX_WORKERS]T = undefined;
// Process chunks in parallel
blitz.parallelFor(num_chunks, struct {
data: []const E,
chunk_size: usize,
partials: *[MAX_WORKERS]T,
}, .{
.data = data,
.chunk_size = chunk_size,
.partials = &partials,
}, struct {
fn body(c: @This(), chunk_start: usize, chunk_end: usize) void {
for (chunk_start..chunk_end) |chunk_idx| {
const start = chunk_idx * c.chunk_size;
const end = @min(start + c.chunk_size, c.data.len);
c.partials[chunk_idx] = chunkFn(c.data[start..end]);
}
}
}.body);
// Combine partials
var result = identity;
for (partials[0..num_chunks]) |p| {
result = combine(result, p);
}
return result;
}

The combine function MUST be associative:

combine(combine(a, b), c) == combine(a, combine(b, c))

Why? Parallel execution groups elements differently than sequential:

Sequential: ((((1+2)+3)+4)+5)
Parallel: (1+2) + (3+4) + 5
If + is associative, both equal 15.
If we used subtraction: ((1-2)-3) != (1-(2-3)) // Wrong!
OperationIdentityAssociative?
Addition0Yes
Multiplication1Yes
MinmaxIntYes
MaxminIntYes
Bitwise ANDall 1sYes
Bitwise OR0Yes
Bitwise XOR0Yes
String concat""Yes
OperationWhy Not?
Subtraction(a-b)-c != a-(b-c)
Division(a/b)/c != a/(b/c)
AverageNot associative

Average isn’t directly associative, but we can work around it:

// Reduce to (sum, count), then divide
const SumCount = struct { sum: f64, count: usize };
const result = parallelReduce(
SumCount, data.len, .{ .sum = 0, .count = 0 },
data,
struct {
fn map(d: []const f64, i: usize) SumCount {
return .{ .sum = d[i], .count = 1 };
}
}.map,
struct {
fn combine(a: SumCount, b: SumCount) SumCount {
return .{ .sum = a.sum + b.sum, .count = a.count + b.count };
}
}.combine,
);
const average = result.sum / @as(f64, @floatFromInt(result.count));
n elements, p processors:
Parallel depth: O(n/p + log p)
- n/p: local reduction within each chunk
- log p: combining partial results
For n=1M, p=8:
- Local: 125,000 ops per worker
- Combine: 3 levels (log2 8)

Parallel reduction is work-efficient:

Sequential work: n-1 operations
Parallel work: n-1 operations (same!)
Parallel depth: O(log n)
Efficiency = Sequential work / Parallel work = 1 (optimal)

Work stealing handles uneven chunks:

Without work stealing:
Worker 0: [==========] done
Worker 1: [==========] done
Worker 2: [====================] still working...
Worker 3: [==========] done
With work stealing:
Worker 0: [==========][steal]
Worker 1: [==========][steal]
Worker 2: [============]
Worker 3: [==========][steal]
Benchmark: Sum 10M i64 elements, 10 workers
Implementation Time Speedup
------------------------------------------
Sequential 35.2 ms 1.0x
Parallel reduce 4.1 ms 8.6x
Data SizeRecommendation
< 1,000Sequential (overhead too high)
> 1,000Parallel reduce

Use a size threshold:

if (data.len >= blitz.DEFAULT_GRAIN_SIZE) {
return parallelReduce(...);
} else {
return sequentialReduce(...);
}