

concurrency

Parallel CPU work: rayon::par_iter vs goroutines + WaitGroup

GET /products/{slug}/recommendations scores every other product against the target with a CPU-bound similarity hash. Go fans out one goroutine per candidate, each writing to its own slot in a slice. Rust hands the collection to rayon's into_par_iter() and lets the work-stealing pool spread it across threads.

Go (chi · sqlc · pgx)
goroutine-per-candidate, sync.WaitGroup join, results[idx] = ... (no mutex)
go/internal/httpserver/recommendations.go
Rust (axum · sqlx · tokio)
into_par_iter().map(score).collect() inside spawn_blocking
rust/src/routes/recommendations.rs

What to take away

Both implementations compute the same FNV-style hash over each candidate's description (1500 iterations × ~200 bytes), so the workload is identical. The comparison is about the parallelism scaffolding, not the math.
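To make the workload concrete, here is a minimal sketch of what an FNV-style scoring loop could look like. The real scorer is not shown on this page, so everything here is an assumption: the function names hashRounds and similarity, the use of 32-bit FNV-1a constants, and the choice to count matching hash bits as the score are all hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Standard 32-bit FNV-1a parameters (assumed; the page only says "FNV-style").
const (
	fnvOffset32 uint32 = 2166136261
	fnvPrime32  uint32 = 16777619
)

// hashRounds runs FNV-1a over data `rounds` times, carrying the hash state
// from one pass into the next so the loop stays CPU-bound.
func hashRounds(data []byte, rounds int) uint32 {
	h := fnvOffset32
	for r := 0; r < rounds; r++ {
		for _, b := range data {
			h ^= uint32(b)
			h *= fnvPrime32 // wraps on overflow, as FNV expects
		}
	}
	return h
}

// similarity is a hypothetical score: the number of bit positions (0..32)
// where the two descriptions' hashes agree.
func similarity(target, candidate string) int {
	a := hashRounds([]byte(target), 1500)
	b := hashRounds([]byte(candidate), 1500)
	return 32 - bits.OnesCount32(a^b)
}

func main() {
	fmt.Println(similarity("red wool scarf", "red wool scarf")) // 32: identical inputs agree on every bit
	fmt.Println(similarity("red wool scarf", "blue cotton hat"))
}
```

The important property for the comparison is that similarity is a pure function of its inputs, which is what lets either backend run it in parallel without coordination.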

Go: one goroutine per candidate. Each goroutine writes to results[idx] — its own pre-allocated slot. No mutex because no two goroutines touch the same index. The "no shared mutation" property is a code-review invariant: get it wrong (e.g. append instead of indexed assign) and go test -race catches it eventually, or production catches it. sync.WaitGroup joins. Idiomatic, ~30 lines including the helpers.
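The pattern described above can be sketched roughly as follows. This is not the repository's recommendations.go, which is not shown on this page; scoreAll and the toy scoring function are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// scoreAll runs score on every candidate in its own goroutine.
// Each goroutine writes only to results[idx], so no mutex is needed.
func scoreAll(candidates []string, score func(string) int) []int {
	results := make([]int, len(candidates)) // pre-allocated: one slot per candidate
	var wg sync.WaitGroup
	for idx, c := range candidates {
		wg.Add(1)
		// Pass idx and c as arguments so each goroutine has its own copies.
		go func(idx int, c string) {
			defer wg.Done()
			results[idx] = score(c) // indexed assign, never append
		}(idx, c)
	}
	wg.Wait() // join: every slot is written before we return
	return results
}

func main() {
	got := scoreAll([]string{"a", "bb", "ccc"}, func(s string) int { return len(s) })
	fmt.Println(got) // [1 2 3]
}
```

Replacing the indexed assignment with an append to a shared slice would compile without complaint and introduce a data race, which is why the invariant has to be defended in review or with the race detector.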

Rust: candidates.into_par_iter().map(score).collect(). That's the entire parallelism. Rayon owns the work-stealing pool; we describe the pure function and Rust monomorphizes the call. Rayon requires the closure passed to map to be Fn + Send + Sync, so the compiler rejects any unsynchronized shared mutation inside it. The "no mutex needed" invariant isn't a code-review property; it's a build-time guarantee.

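The wrinkle on the Rust side is the runtime boundary. Rayon is synchronous; the handler is async. Calling collect() directly in the handler would block a tokio worker thread until the whole batch finishes. The fix is one line: wrap the rayon call in tokio::task::spawn_blocking(move || …).await? so the blocking wait runs on tokio's blocking pool and the async workers stay free to serve other requests.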

Practical note: for ~30 candidates and a tight CPU loop, both versions finish well under 50 ms on a modest machine. Crank the iterations up by 10× and the parallel speedup is visible to the eye on either side. The code difference, though, is permanent — and that's what this page wants you to see.
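One way to try the 10× experiment yourself is a small timing harness like the sketch below. It is a stand-in, not the project's benchmark: the work function, the synthetic 200-byte descriptions, and the helper names sequential and parallel are all assumptions, and the timings it prints will vary by machine and core count.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

// work is a stand-in CPU-bound loop in the FNV-1a style (assumed workload).
func work(data string, rounds int) uint32 {
	h := uint32(2166136261)
	for r := 0; r < rounds; r++ {
		for i := 0; i < len(data); i++ {
			h ^= uint32(data[i])
			h *= 16777619
		}
	}
	return h
}

// sequential scores every item on the calling goroutine.
func sequential(items []string, rounds int) []uint32 {
	out := make([]uint32, len(items))
	for i, s := range items {
		out[i] = work(s, rounds)
	}
	return out
}

// parallel scores every item in its own goroutine, one slot per item.
func parallel(items []string, rounds int) []uint32 {
	out := make([]uint32, len(items))
	var wg sync.WaitGroup
	for i, s := range items {
		wg.Add(1)
		go func(i int, s string) {
			defer wg.Done()
			out[i] = work(s, rounds)
		}(i, s)
	}
	wg.Wait()
	return out
}

func main() {
	// 30 synthetic candidates of 200 bytes each, mirroring the sizes quoted above.
	items := make([]string, 30)
	for i := range items {
		items[i] = strings.Repeat(string(rune('a'+i%26)), 200)
	}
	for _, rounds := range []int{1500, 15000} {
		start := time.Now()
		sequential(items, rounds)
		seq := time.Since(start)

		start = time.Now()
		parallel(items, rounds)
		par := time.Since(start)

		fmt.Printf("rounds=%d sequential=%v parallel=%v\n", rounds, seq, par)
	}
}
```

A single wall-clock measurement like this is noisy; for anything more than eyeballing the difference, Go's testing.B benchmarks would be the more reliable tool.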