Different scale (I run 8, not 69) but the same fundamental challenge: what happens when one of those agents fails mid-task?

I hit this with my own multi-agent system, 8 agents sharing a JSONL knowledge graph. Parallel writes caused a race condition: two agents read the same file state, both write back the full graph plus their own additions, and the second write silently overwrites the first. Sometimes the file also ends up with a half-written JSON line that breaks every subsequent read.
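To make the lost update concrete, here is a minimal Python sketch of the read-modify-write pattern. It is not my actual agent code: `read_graph`, `write_graph`, `unsafe_add` and the record shape are made up for illustration, and the `asyncio.sleep(0)` stands in for whatever await sits between the read and the write in a real agent.

```python
import asyncio
import json
import tempfile
from pathlib import Path


def read_graph(path: Path) -> list:
    """Load every JSONL record; a missing file counts as an empty graph."""
    if not path.exists():
        return []
    return [json.loads(line) for line in path.read_text().splitlines() if line]


def write_graph(path: Path, records: list) -> None:
    """Rewrite the whole file in place (not atomic, which is part of the problem)."""
    path.write_text("".join(json.dumps(r) + "\n" for r in records))


async def unsafe_add(path: Path, record: dict) -> None:
    graph = read_graph(path)             # both agents read the same snapshot
    await asyncio.sleep(0)               # yield to the other agent before writing
    write_graph(path, graph + [record])  # last writer wins, earlier write is lost


async def demo_lost_update() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "graph.jsonl"
        await asyncio.gather(
            unsafe_add(path, {"agent": "a", "fact": "x"}),
            unsafe_add(path, {"agent": "b", "fact": "y"}),
        )
        return read_graph(path)


if __name__ == "__main__":
    # Two agents each added one record, but only one record survives.
    print(asyncio.run(demo_lost_update()))
```

Neither agent sees an error. The file is valid JSONL, it is just missing a record.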

I patched it with an async mutex and atomic writes, but the real lesson was that I was fighting my runtime. On the BEAM, a GenServer processes messages from its mailbox one at a time, so there's no concurrent access to serialize: the execution model is already sequential per process. Supervision trees handle the crash case: a child process dies, the supervisor restarts it in microseconds, auto-repair runs on init, and agents resume with clean state. Each process starts at only a few KB with its own heap, so you can run thousands with full isolation.
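For reference, the shape of the mutex-plus-atomic-write patch looks roughly like this in Python. Treat it as a sketch under assumptions rather than my real code: `GraphStore` and its methods are hypothetical names, and it assumes all agents live in one process (an `asyncio.Lock` does nothing across processes).

```python
import asyncio
import json
import os
import tempfile
from pathlib import Path


class GraphStore:
    """Hypothetical wrapper that serializes writers and replaces the file atomically."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self._lock = asyncio.Lock()  # create inside the running event loop

    def read(self) -> list:
        if not self.path.exists():
            return []
        return [json.loads(l) for l in self.path.read_text().splitlines() if l]

    def _write_atomic(self, records: list) -> None:
        # Write to a temp file in the same directory, then rename over the target.
        # Readers see either the old file or the new one, never a partial line.
        fd, tmp_name = tempfile.mkstemp(dir=str(self.path.parent), suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as tmp:
                for record in records:
                    tmp.write(json.dumps(record) + "\n")
                tmp.flush()
                os.fsync(tmp.fileno())
            os.replace(tmp_name, self.path)
        except BaseException:
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)
            raise

    async def add(self, record: dict) -> None:
        async with self._lock:      # only one read-modify-write cycle at a time
            graph = self.read()
            await asyncio.sleep(0)  # yielding here is now harmless
            self._write_atomic(graph + [record])


async def demo_safe_update() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        store = GraphStore(Path(tmp) / "graph.jsonl")
        await asyncio.gather(
            store.add({"agent": "a", "fact": "x"}),
            store.add({"agent": "b", "fact": "y"}),
        )
        return store.read()


if __name__ == "__main__":
    print(asyncio.run(demo_safe_update()))  # both records survive
```

The lock is doing by hand what a GenServer mailbox gives you for free: every update goes through one sequential path. The difference is that here every caller has to remember to take the lock.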

At 69 agents the coordination and fault tolerance problems get significantly harder. The BEAM was designed for exactly this: Ericsson built telephone switches around massive numbers of concurrent processes using these patterns decades ago, and the often-cited nine nines of availability figure comes from that work.

Wrote up the full story (race condition mechanics, supervision strategies, real code): https://dev.to/setas/why-erlangs-supervision-trees-are-the-m...