Architecture Overview
System Architecture
Section titled “System Architecture”+-----------------------------------------------------------------------------+| USER CODE || blitz.iter(data).sum() | blitz.join(.{...}) | blitz.sortAsc() |+-----------------------------------------------------------------------------+ | v+-----------------------------------------------------------------------------+| API LAYER (api.zig) || - High-level parallel primitives || - Automatic grain size calculation || - Sequential/parallel threshold decisions || - Iterator combinators and transformations || - Thread-local task context management |+-----------------------------------------------------------------------------+ | v+-----------------------------------------------------------------------------+| FORK-JOIN LAYER (Future.zig) || - Stack-allocated Future(Input, Output) for fork-join || - Embedded OnceLatch for completion signaling || - Hybrid join: latch-first for stolen, pop-first for local || - Active work-stealing during join wait |+-----------------------------------------------------------------------------+ | v+-----------------------------------------------------------------------------+| SCHEDULER (Pool.zig) || - ThreadPool with Rayon-style JEC (Jobs Event Counter) protocol || - AtomicCounters: packed u64 with sleeping/inactive/JEC counters || - CoreLatch: 4-state protocol (UNSET->SLEEPY->SLEEPING->SET) || - Progressive sleep: 32 yields -> announce sleepy -> sleep || - Background workers with Chase-Lev deques |+-----------------------------------------------------------------------------+ | v+-----------------------------------------------------------------------------+| LOCK-FREE PRIMITIVES || Deque.zig - Chase-Lev work-stealing deque (wait-free push/pop) || Latch.zig - 4-state OnceLatch, CountLatch, SpinWait || Pool.zig - Job struct (8 bytes: handler pointer) + ThreadPool || Sync.zig - SyncPtr for lock-free parallel writes |+-----------------------------------------------------------------------------+Component Details
Section titled “Component Details”ThreadPool (Pool.zig)
Section titled “ThreadPool (Pool.zig)”The thread pool manages worker threads and coordinates work distribution using Rayon-style sleep protocol.
ThreadPool+-- allocator: Allocator // For worker/deque allocation+-- workers: []Worker // Background workers+-- main_worker: Worker // For external call() invocations+-- num_workers: usize // Worker count+-- threads: []Thread // OS thread handles|+-- sleep: Sleep // Sleep manager with JEC protocol| +-- counters: AtomicCounters // Packed u64: sleeping|inactive|JEC| +-- worker_sleep_states: []WorkerSleepState|+-- injected_jobs_head: *Node // Treiber stack for external jobs+-- stopping: Atomic(bool) // Shutdown flag+-- workers_ready: Semaphore // Startup synchronizationKey Operations:
-
Job Injection (external threads calling
pool.call()):External thread Worker thread| |+-> Push to Treiber stack ---------++-> wakeOne() ---------------------++-> Wait on done event |+-> Pop from Treiber stack+-> Execute job+-> Signal done event -
Work Stealing:
Worker A (owner) Worker B (thief)| |+-> push(job) to deque || (bottom++) || +-> steal() from A's deque| | (top++, CAS)+-> pop() from deque || (bottom--, CAS if last) |
Worker (Pool.zig)
Section titled “Worker (Pool.zig)”Each worker thread maintains local state for efficient work distribution.
Worker (per thread)+-- pool: *ThreadPool // Back-reference to pool+-- id: u32 // Worker index+-- deque: ?Deque(*Job) // Chase-Lev deque (256 slots)+-- rng: XorShift64Star // For randomized victim selection+-- latch: CoreLatch // 4-state sleep coordination+-- stats: WorkerStats // jobs_executed, jobs_stolenWorker Loop State Machine (Rayon-style with JEC):
+------------------+ | ACTIVE | | (executing) | +--------+---------+ | no work found v +------------------+ | YIELD PHASE | | rounds 0-31 | | yield() each | | inactive_count++| +--------+---------+ | still no work v +------------------+ | ANNOUNCE SLEEPY | | round 32 | | JEC even->odd | | save snapshot | +--------+---------+ | one more yield v +------------------+ | SLEEPING | | round 34+ | | CoreLatch: | | UNSET->SLEEPY | | SLEEPY->SLEEPING| | sleeping_count++| | condvar.wait() | +--------+---------+ | wake signal | (waker decrements sleeping) v +------------------+ | ACTIVE | | rounds = 0 | +------------------+Future (Future.zig)
Section titled “Future (Future.zig)”Stack-allocated futures enable zero-allocation fork-join parallelism.
Future(Input, Output)+-- job: Job // Handler pointer (8 bytes)+-- latch: OnceLatch // Completion signal (4 bytes)+-- input: Input // Captured input value+-- result: Output // Storage for result
Total size: ~32-64 bytes (on stack)Fork-Join Flow:
1. fork(task, func, input): +-------------------------------------------------------------+ | self.input = input | | self.job.handler = Handler.handle // comptime specialized | | self.latch = OnceLatch.init() | | task.worker.pushAndWake(&self.job) // visible to thieves | +-------------------------------------------------------------+
2. join(task) - Hybrid Strategy: +-------------------------------------------------------------+ | // Fast path: check if stolen job already completed | | if (latch.probe()) return self.result; | | | | // Try pop first: catches local jobs at deep recursion | | if (worker.pop()) |job| { | | if (job == &self.job) return null; // execute locally | | job.execute(); | | } | | | | // Work-stealing loop while waiting | | while (!latch.probe()) { | | pop() or steal() and execute | | spin, then yield | | } | | return self.result; | +-------------------------------------------------------------+Chase-Lev Deque (Deque.zig)
Section titled “Chase-Lev Deque (Deque.zig)”Lock-free double-ended queue optimized for work-stealing.
Deque(T)+-- top: Atomic(isize) // Thieves' end (cache-line aligned)+-- _padding: [56]u8 // Prevent false sharing+-- bottom: Atomic(isize) // Owner's end (cache-line aligned)+-- _padding2: [56]u8+-- buffer: []T // Circular array (power of 2)+-- mask: usize // buffer.len - 1 for fast moduloMemory Ordering:
| Operation | Ordering | Rationale |
|---|---|---|
| push (write bottom) | release | Make item visible before index |
| pop (read/write bottom) | seq_cst | Synchronize with steal |
| pop (CAS top) | seq_cst | Race with thieves |
| steal (read top) | acquire | See owner’s pushes |
| steal (read bottom) | acquire | See owner’s pops |
| steal (CAS top) | seq_cst | Race with owner and other thieves |
Steal with Backoff:
pub fn stealLoop(self: *Self) ?T { var backoff: u32 = 0; while (true) { switch (self.steal().result) { .success => return item, .empty => return null, .retry => { // Exponential backoff: 1, 2, 4, 8, 16, 32 spins // Then yield to OS backoff = @min(backoff + 1, 6); spin(1 << backoff); } } }}OnceLatch (Latch.zig)
Section titled “OnceLatch (Latch.zig)”4-state latch that prevents missed wakes.
States: UNSET (0) --------------> SET (3) | ^ v | SLEEPY (1) ---------------> SET (3) | ^ v | SLEEPING (2) -------------> SET (3) | v UNSET (0) [spurious wake]The “Tickle-Then-Get-Sleepy” Pattern:
This pattern prevents a race where Thread A misses a wake:
Without SLEEPY state (race condition): Thread A: if (!done) { Thread B: done = true; Thread A: Thread B: wake(); // A not sleeping yet! Thread A: sleep(); // MISSED WAKE!
With SLEEPY state (safe): Thread A: state = SLEEPY; Thread A: if (!done) { Thread B: done = true; Thread A: if (CAS SLEEPY->SLEEPING) { Thread A: sleep(); Thread B: if (prev == SLEEPING) wake(); Thread A: } // B sees A is SLEEPING, will wake }Data Flow Examples
Section titled “Data Flow Examples”parallelFor
Section titled “parallelFor”1. User calls blitz.parallelFor(1000000, ctx, body) | v2. Calculate grain size: max(1000000 / (workers * 4), MIN_GRAIN) | v3. If n < threshold: execute sequentially Else: recursive parallel split | v4. Split in half: [0..500000) and [500000..1000000) Fork right half, recurse left | +-> Worker 0: process [0..500000) | +-- Further splits as needed | +-> Worker 1 steals: [500000..1000000) +-- Further splits as needed | v5. Join waits, actively stealing other work | v6. All ranges processed, returnparallelReduce
Section titled “parallelReduce”1. Divide into chunks based on grain size | v2. Parallel map phase: +---------+---------+---------+---------+ | Chunk 0 | Chunk 1 | Chunk 2 | Chunk 3 | | map() | map() | map() | map() | | reduce | reduce | reduce | reduce | +----+----+----+----+----+----+----+----+ | | | | v v v v part0 part1 part2 part3 | v3. Tree reduction (parallel): part0 -----+ part1 ----+|---> combined01 ----+ part2 ----+| | part3 ---++--> combined23 -----+---> final |4. Return final resultjoin (2-task)
Section titled “join (2-task)”1. blitz.join(.{ .a = taskA, .b = taskB }) | v2. Fork task B (push to deque, wake if sleepers) | +-> Task B visible to thieves | v3. Execute task A inline (no fork needed) | +-> If A does nested join, repeat recursively | v4. Join task B: - If B was stolen and done: return result - If B in deque: pop and execute locally - If B stolen but not done: steal other work while waiting | v5. Return .{ .a = resultA, .b = resultB }Key Design Decisions
Section titled “Key Design Decisions”1. Rayon-Style JEC Protocol
Section titled “1. Rayon-Style JEC Protocol”Prevents missed wakes through Jobs Event Counter toggling:
// When posting workfn newJobs(self: *Sleep, ...) void { // Toggle JEC even->odd to signal "new work available" const old = self.counters.incrementJecIfSleepy();
// Only wake sleeping workers (idle workers will find work naturally) const sleeping = extractSleeping(old); if (sleeping > 0) { self.wakeThreads(@min(num_jobs, sleeping)); }}Workers check JEC snapshot before sleeping to detect work posted during transition.
2. Stack-Allocated Futures
Section titled “2. Stack-Allocated Futures”Jobs are embedded in the caller’s stack frame:
fn parallelRecurse(...) void { var future: Future(Input, Output) = undefined; // On stack! future.fork(&task, func, input); // ... local work ... const result = future.join(&task);}Benefits:
- Zero heap allocation for fork-join
- Cache-friendly (future near call site)
- Automatic cleanup on function return
3. Hybrid Join Strategy
Section titled “3. Hybrid Join Strategy”Optimized for both shallow and deep recursion:
- Shallow recursion (high steal rate): Check latch first, stolen jobs often complete quickly
- Deep recursion (low steal rate): Try pop first, jobs usually still local
// Fast path for stolenif (latch.probe()) return result;
// Fast path for localif (worker.pop()) |job| { if (job == &self.job) return null; execute(job);}
// Full work-stealing loopwhile (!latch.probe()) { ... }4. Packed AtomicCounters
Section titled “4. Packed AtomicCounters”All sleep state in a single u64 for atomic multi-field operations:
AtomicCounters (u64):+-- Bits 0-15: sleeping_threads (blocked on condvar)+-- Bits 16-31: inactive_threads (not executing work)+-- Bits 32-63: JEC (even=sleepy, odd=active)fn newJobs(self: *Sleep, num_jobs: u32, queue_was_empty: bool) void { const old = self.counters.incrementJecIfSleepy(); const sleeping = extractSleeping(old); const inactive = extractInactive(old); const awake_but_idle = inactive -| sleeping;
// Only wake if not enough idle workers to grab the work if (queue_was_empty and awake_but_idle < num_jobs) { self.wakeThreads(@min(num_jobs - awake_but_idle, sleeping)); }}5. Comptime Specialization
Section titled “5. Comptime Specialization”All function pointers resolved at compile time:
pub fn parallelFor( n: usize, comptime Context: type, // Compile-time ctx: Context, comptime body: fn(...) void, // Compile-time) voidBenefits:
- No virtual dispatch
- Full inlining possible
- Type-specific optimizations
Performance Characteristics
Section titled “Performance Characteristics”| Operation | Time | Notes |
|---|---|---|
| Fork (push) | ~3 ns | Wait-free |
| Join (pop, local) | ~5 ns | Wait-free, no latch check needed |
| Join (stolen, done) | ~3 ns | Single latch.probe() |
| Steal | ~10-50 ns | Lock-free, CAS contention |
| Wake | ~5-10 ns | Futex |
| Full fork-join cycle | ~10-20 ns | Amortized over work |
Thread Safety Summary
Section titled “Thread Safety Summary”| Component | Safety | Notes |
|---|---|---|
| ThreadPool | Thread-safe | Singleton, all operations atomic |
| Worker | Single-owner | Only owning thread accesses |
| Deque.push/pop | Single-owner | Called by owner thread only |
| Deque.steal | Multi-reader | Multiple thieves can race (CAS) |
| Future | Single-owner | Lives on owner’s stack |
| OnceLatch | Thread-safe | Multiple threads can wait/set |
| Job handler | User responsibility | User ensures data safety |
Memory Layout Optimizations
Section titled “Memory Layout Optimizations”Cache-Line Alignment
Section titled “Cache-Line Alignment”Deque memory layout (prevents false sharing):+------------------------------------------------------------+| Cache Line 0: top (8 bytes) + padding (56 bytes) |+------------------------------------------------------------+| Cache Line 1: bottom (8 bytes) + padding (56 bytes) |+------------------------------------------------------------+| Cache Line 2+: buffer pointer, mask, allocator |+------------------------------------------------------------+This ensures:
- Owner (bottom) and thieves (top) don’t cause cache invalidation
- Buffer metadata rarely changes after initialization
Future Size Optimization
Section titled “Future Size Optimization”Future(i32, i64) layout:+-- job: Job (8 bytes: handler pointer)+-- latch: OnceLatch (4 bytes: state atomic)+-- padding (4 bytes: alignment)+-- input: i32 (4 bytes)+-- padding (4 bytes: alignment)+-- result: i64 (8 bytes)Total: 32 bytes (fits in half a cache line)