Chapter 22 — Three Coordination Modes: Centralized, Swarm, Evolutionary
"Order is not pressure which is imposed on society from without, but an equilibrium which is set up from within." -- Jose Ortega y Gasset
📖 25 min read | 👤 All personas | 🏷️ Part VI: Orchestration
What you'll learn:
- Three fundamentally different ways to coordinate multi-agent systems
- The tradeoffs of centralized RACI, swarm stigmergy, and evolutionary optimization
- When to use each mode and why one size does not fit all
- DataSims evidence: swarm convergence, deadlock rates, evolutionary fitness
The Problem: One Size Does Not Fit All
Marcus, the data scientist, has two very different projects on his desk.
Project A is a regulatory churn model for the banking division. Every step must be auditable. Every decision must be traceable to a requirement. The regulator will ask "why did you choose this feature?" and Marcus needs a chain of evidence from business requirement to feature selection to model coefficient.
Project B is an exploratory analysis of a new market segment. Nobody knows what the right questions are yet. The data is messy, the hypothesis is vague, and the goal is to find interesting patterns as fast as possible. Auditability is nice but speed is essential.
Project A needs centralized, deterministic coordination. Project B needs something more fluid. The Neam agent stack supports both -- and a third mode for when you want the system to optimize its own coordination topology.
Mode 1: Centralized RACI
Centralized RACI is the default coordination mode. The DIO acts as a central dispatcher, assigning tasks to specialist agents according to the RACI matrix.
Characteristics
| Property | Value |
|---|---|
| Determinism | High -- same inputs produce same execution order |
| Auditability | Complete -- every dispatch recorded with RACI |
| Bottleneck | DIO is a single point of coordination |
| Parallelism | Limited -- sequential phase gates |
| Best for | Regulated workflows, compliance-critical projects |
How It Works
- DIO receives the task specification
- DIO decomposes the task into phases (requirements, engineering, modeling, testing, deployment, monitoring)
- For each phase, DIO selects the R agent and dispatches the task
- R agent executes, consulting C agents as needed
- DIO validates the output against quality gates
- If passed, DIO advances to the next phase
- If failed, DIO retries or escalates
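The dispatch steps above can be sketched as a small loop. This is a minimal illustration, not the Neam implementation: the `Dispatcher` class, the callable-per-phase model of an R agent, the single `gate_threshold`, and the `max_retries` budget are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

PHASES = ["requirements", "engineering", "modeling",
          "testing", "deployment", "monitoring"]


@dataclass
class Dispatcher:
    # Hypothetical model: each phase maps to its R agent, represented as a
    # callable that takes the task and returns a quality score in [0, 1].
    responsible: dict[str, Callable[[str], float]]
    gate_threshold: float = 0.9   # assumed single threshold for every gate
    max_retries: int = 2          # assumed retry budget per phase
    audit_log: list[tuple[str, int, float, str]] = field(default_factory=list)

    def run(self, task: str) -> bool:
        for phase in PHASES:
            for attempt in range(1, self.max_retries + 2):
                score = self.responsible[phase](task)   # dispatch to R agent
                passed = score >= self.gate_threshold   # quality gate
                self.audit_log.append(
                    (phase, attempt, score, "pass" if passed else "fail"))
                if passed:
                    break                               # advance to next phase
            else:
                # Retries exhausted: record the escalation and stop.
                self.audit_log.append((phase, attempt, score, "escalate"))
                return False
        return True


good = Dispatcher({p: (lambda task: 0.95) for p in PHASES})
completed = good.run("churn model")

flaky = Dispatcher({**good.responsible, "modeling": lambda task: 0.5})
stalled = flaky.run("churn model")
```

Every dispatch lands in `audit_log`, which is the property that makes this mode auditable. The `flaky` run also shows the bottleneck: once one phase fails its gate, nothing downstream starts.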
Strengths
- Full traceability: Every decision is logged with RACI assignment
- Predictable execution: Phase ordering is deterministic
- Quality enforcement: No phase proceeds without DIO validation
- Regulatory compliance: Audit trails satisfy most governance requirements
Weaknesses
- Central bottleneck: DIO must process every inter-agent communication
- Sequential overhead: Phases that could run in parallel are serialized
- Fragility: If the DIO's LLM call fails, the entire pipeline stalls
💡 When to use centralized RACI: Any project where auditability matters more than speed. Regulatory models, production deployments, anything that a compliance team will audit.
Mode 2: Swarm Stigmergy
Swarm mode draws inspiration from biological swarm intelligence. Instead of a central dispatcher, agents coordinate through stigmergy -- indirect coordination via shared artifacts in the environment.
flowchart TB
subgraph SWARM["SWARM STIGMERGY MODE"]
BA["BA"] --> SPACE
DS["DS"] --> SPACE
Test["Test"] --> SPACE
subgraph SPACE["SHARED ARTIFACT SPACE"]
direction LR
BRD["BRD"]
Features["Features"]
Model["Model"]
Tests["Tests"]
Deploy["Deploy"]
end
SPACE --> Causal["Causal"]
SPACE --> MLOps["MLOps"]
SPACE --> DIO["DIO (watch)"]
end
style SPACE fill:#f9f9f9,stroke:#333
Agents deposit artifacts, consume others' artifacts, and react to changes
How Stigmergy Works
In biological swarms, ants deposit pheromones that other ants follow. In the Neam swarm mode, agents deposit artifacts (documents, models, test results) into a shared space. Other agents consume these artifacts and produce new ones.
The key insight: no agent tells another agent what to do. Agents react to the state of the shared environment.
- SENSE: Check shared space for new/changed artifacts
- DECIDE: Can I contribute based on my specialty?
- ACT: Produce new artifact, deposit in shared space
- SIGNAL: Artifact publication notifies interested agents
- REPEAT: Continue until task converges
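The sense/decide/act cycle above can be sketched with a set standing in for the shared artifact space. The `SwarmAgent` type, the `run_swarm` function, and the mapping of agents to the artifacts they need and produce are hypothetical, chosen only to mirror the names in the diagram. The sketch also stops as soon as a cycle yields no new artifact, which is a simplification of real convergence detection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwarmAgent:
    name: str
    needs: frozenset[str]   # artifact types this agent consumes
    produces: str           # artifact type this agent deposits


def run_swarm(agents, seed_artifacts=(), max_iterations=50):
    space = set(seed_artifacts)          # the shared artifact space
    history = []                         # (iteration, new artifact count)
    for iteration in range(1, max_iterations + 1):
        snapshot = frozenset(space)      # SENSE: all agents see the same state
        new = {a.produces for a in agents             # DECIDE: can I contribute?
               if a.needs <= snapshot and a.produces not in snapshot}
        space |= new                     # ACT + SIGNAL: deposit into the space
        history.append((iteration, len(new)))
        if not new:                      # nothing left to contribute
            break
    return space, history


# Assumed needs/produces mapping, for illustration only.
agents = [
    SwarmAgent("BA", frozenset(), "BRD"),
    SwarmAgent("DS", frozenset({"BRD"}), "Features"),
    SwarmAgent("Causal", frozenset({"Features"}), "Model"),
    SwarmAgent("Test", frozenset({"Model"}), "Tests"),
    SwarmAgent("MLOps", frozenset({"Model", "Tests"}), "Deploy"),
]
space, history = run_swarm(agents)
```

No agent calls another here. Each one only inspects `snapshot`, so ordering emerges from artifact dependencies rather than from a dispatcher.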
Convergence Detection
The swarm converges when no agent has pending work:
| Iteration | Active Agents | New Artifacts | Status |
|---|---|---|---|
| 1 | 3 | 3 | Exploring |
| 5 | 4 | 2 | Building |
| 10 | 3 | 1 | Refining |
| 15 | 2 | 1 | Converging |
| 20 | 1 | 0 | Validating |
| 23 | 0 | 0 | CONVERGED |
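As a small sketch of the rule the table illustrates, the check below treats a cycle as converged only when both the active agent count and the new artifact count are zero. The `Cycle` type and `first_converged` function are hypothetical names; the trace values are the rows of the table above.

```python
from typing import NamedTuple, Optional


class Cycle(NamedTuple):
    iteration: int
    active_agents: int
    new_artifacts: int


def first_converged(cycles: list[Cycle]) -> Optional[int]:
    """Return the first iteration with no active agents and no new artifacts."""
    for c in cycles:
        if c.active_agents == 0 and c.new_artifacts == 0:
            return c.iteration
    return None


trace = [
    Cycle(1, 3, 3), Cycle(5, 4, 2), Cycle(10, 3, 1),
    Cycle(15, 2, 1), Cycle(20, 1, 0), Cycle(23, 0, 0),
]
converged_at = first_converged(trace)
```

Iteration 20 is the instructive row: it produced no new artifacts, but one agent was still validating, so the swarm had not yet converged.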
Deadlock Prevention
Swarms can deadlock when agents wait for artifacts that no agent will produce. The Neam swarm mode includes three deadlock prevention mechanisms:
- Timeout watchdog: If no new artifact appears within a configurable window, the DIO (in observer mode) injects a stimulus
- Dependency analysis: Before launching the swarm, the DIO verifies that every required artifact type has at least one capable producer
- Recovery injection: If deadlock is detected, the DIO can temporarily take over as a centralized dispatcher for the stuck phase
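The first two mechanisms lend themselves to a short sketch. The function names, the capability map, and the use of plain numeric timestamps are assumptions for illustration, not part of the Neam API.

```python
def missing_producers(required, capabilities):
    """Dependency analysis: list required artifact types no agent can produce.

    `capabilities` maps an agent name to the set of artifact types it can
    produce (a hypothetical structure for this sketch).
    """
    producible = set().union(*capabilities.values())
    return sorted(set(required) - producible)


def is_stalled(last_artifact_at, now, window_seconds):
    """Timeout watchdog: True if no artifact appeared within the window."""
    return (now - last_artifact_at) > window_seconds


required = ["BRD", "Features", "Model", "Tests", "Deploy"]
capabilities = {
    "BA": {"BRD"},
    "DS": {"Features", "Model"},
    "Test": {"Tests"},
}
gaps = missing_producers(required, capabilities)
```

In this example `gaps` contains `"Deploy"`, so the swarm should not be launched until an agent capable of producing that artifact is added.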
DataSims Evidence: Swarm Performance
From the DataSims evaluation (evaluation/results/swarm_mode.json):
| Metric | Centralized | Swarm | Delta |
|---|---|---|---|
| Convergence | 7 phases (serial) | 23 iterations | Different measurement |
| Deadlock rate | 0% (by design) | 2% | Expected in decentralized |
| Recovery rate | N/A | 98% | Near-complete self-healing |
| AUC-ROC | 0.847 | 0.847 | Equivalent quality |
| CES | 0.925 | 0.925 | Equivalent effectiveness |
| Quality Gate | passed | passed | No degradation |
Key findings:
- 23 iterations to convergence: The swarm took 23 iteration cycles (not sequential phases) to reach a stable state where all artifacts were complete and validated.
- 2% deadlock rate: In 2% of iteration cycles, agents experienced temporary deadlock. This is expected in stigmergic systems and is within acceptable bounds.
- 98% recovery rate: Of the deadlocks that occurred, 98% were automatically resolved by the recovery mechanisms. Only the remaining 2% of that 2% (0.02 x 0.02 = 0.04% of iteration cycles) required DIO intervention.
⚠️ The 2% deadlock rate is a design tradeoff, not a defect. Centralized RACI has 0% deadlock because the DIO prevents it by construction. Swarm mode accepts a small deadlock probability in exchange for eliminating the central bottleneck and enabling parallel execution.
| Dimension | Centralized RACI | Swarm Stigmergy |
|---|---|---|
| Deadlock Risk | Low (0%) | Higher (2%) |
| Parallelism | Low (sequential) | High (concurrent) |
| Auditability | High (full RACI) | Moderate (artifact-based) |
💡 When to use swarm mode: Exploratory analysis, research projects, situations where you want agents to discover emergent patterns rather than follow a predetermined plan.
Mode 3: Evolutionary Optimization
Evolutionary mode uses a genetic algorithm to optimize the agent coordination topology itself. Instead of using a fixed coordination strategy (centralized or swarm), the system evolves the best topology for the specific task.
flowchart TB
subgraph GEN1["Generation 1: Random Topologies"]
direction LR
T1["T1\n0.45"]
T2["T2\n0.62"]
T3["T3\n0.51"]
T4["T4\n0.73"]
T5["T5\n0.38"]
end
GEN1 --> SEL["Selection: Top 2 by fitness"]
SEL --> T4S["T4 (0.73)"]
SEL --> T2S["T2 (0.62)"]
T4S --> CROSS["Crossover"]
T2S --> CROSS
CROSS --> GEN2
subgraph GEN2["Generation 2: Evolved Topologies"]
direction LR
T4E["T4\n0.73"]
T4P["T4'\n0.78"]
T2P["T2'\n0.69"]
T6["T6\n0.71"]
T7["T7\n0.55"]
end
GEN2 --> REPEAT["... repeat for N generations ..."]
REPEAT --> FINAL
subgraph FINAL["Generation 67: Converged"]
BEST["Best Topo\nFitness = 0.91"]
end
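The selection and crossover steps in the diagram can be sketched on a toy problem. Everything here is an assumption for illustration: genomes are plain lists of floats, fitness is their mean, the best parent is carried forward unchanged (as T4 is in the diagram), and crossover is one-point. The chapter's real genome and fitness function are richer.

```python
import random
from statistics import mean


def next_generation(population, fitness_fn, rng, n_parents=2):
    # Selection: keep the top n_parents genomes by fitness.
    ranked = sorted(population, key=fitness_fn, reverse=True)
    parents = ranked[:n_parents]
    # Elitism: the best genome survives unchanged.
    children = [list(parents[0])]
    # Crossover: fill the rest of the population with one-point recombinations.
    while len(children) < len(population):
        a, b = rng.sample(parents, 2)
        cut = rng.randrange(1, len(a))
        children.append(a[:cut] + b[cut:])
    return children


rng = random.Random(0)
population = [[rng.random() for _ in range(4)] for _ in range(5)]
best_before = max(map(mean, population))
evolved = next_generation(population, mean, rng)
best_after = max(map(mean, evolved))
```

Because of elitism, `best_after` can never fall below `best_before`, which is why a convergence curve for this kind of GA is non-decreasing in its best fitness.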
Genome Representation
Each "topology" is a genome that encodes:
Chromosome = [
agent_order: [BA, DS, Causal, Test, MLOps] // execution sequence
parallelism_flags: [0, 1, 1, 0, 0] // which phases run in parallel
consultation_edges: [(DS, Causal), (BA, Test)] // C relationships
gate_thresholds: [0.9, 0.85, 0.95, 0.90] // quality gate strictness
retry_limits: [3, 2, 3, 1, 2] // per-agent retry budgets
]
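One way to hold this chromosome in code is an immutable record with a light consistency check. The `Topology` class and its `validate` method are hypothetical; the field names and example values are taken from the listing above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    agent_order: tuple[str, ...]                    # execution sequence
    parallelism_flags: tuple[int, ...]              # 1 = phase runs in parallel
    consultation_edges: frozenset[tuple[str, str]]  # C relationships
    gate_thresholds: tuple[float, ...]              # quality gate strictness
    retry_limits: tuple[int, ...]                   # per-agent retry budgets

    def validate(self) -> list[str]:
        """Return a list of consistency problems (empty if none were found)."""
        problems = []
        n = len(self.agent_order)
        if len(set(self.agent_order)) != n:
            problems.append("duplicate agent in agent_order")
        if len(self.parallelism_flags) != n:
            problems.append("parallelism_flags length mismatch")
        if len(self.retry_limits) != n:
            problems.append("retry_limits length mismatch")
        # The source lists four thresholds for five agents, so the length of
        # gate_thresholds is deliberately not constrained here.
        if any(not 0.0 <= t <= 1.0 for t in self.gate_thresholds):
            problems.append("gate threshold outside [0, 1]")
        for a, b in self.consultation_edges:
            if a not in self.agent_order or b not in self.agent_order:
                problems.append(f"edge ({a}, {b}) references unknown agent")
        return problems


example = Topology(
    agent_order=("BA", "DS", "Causal", "Test", "MLOps"),
    parallelism_flags=(0, 1, 1, 0, 0),
    consultation_edges=frozenset({("DS", "Causal"), ("BA", "Test")}),
    gate_thresholds=(0.9, 0.85, 0.95, 0.90),
    retry_limits=(3, 2, 3, 1, 2),
)
issues = example.validate()
```

Keeping the record frozen means mutation and crossover must return new genomes rather than edit parents in place, which avoids accidental aliasing across a population.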
Fitness Function
The fitness function evaluates each topology on a composite score:
Fitness(topology) =
0.25 * quality_score // model AUC, F1, etc.
+ 0.20 * speed_score // time to completion
+ 0.15 * reliability_score // error detection, recovery
+ 0.15 * traceability_score // RACI completeness
+ 0.10 * documentation_score // BRD, specs generated
+ 0.10 * cost_efficiency_score // LLM token cost
+ 0.05 * adaptability_score // response to quality issues
This is the same 7-dimension proficiency scoring used in the DataSims evaluation framework.
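The weighted sum translates directly into code. The weights below are the ones given above; the `fitness` function name, the dictionary keys, and the assumption that every dimension score is already normalized to [0, 1] are illustrative choices.

```python
WEIGHTS = {
    "quality": 0.25,          # model AUC, F1, etc.
    "speed": 0.20,            # time to completion
    "reliability": 0.15,      # error detection, recovery
    "traceability": 0.15,     # RACI completeness
    "documentation": 0.10,    # BRD, specs generated
    "cost_efficiency": 0.10,  # LLM token cost
    "adaptability": 0.05,     # response to quality issues
}


def fitness(scores: dict[str, float]) -> float:
    """Composite fitness, assuming each dimension score lies in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing dimension scores: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


weight_total = sum(WEIGHTS.values())
perfect = fitness({name: 1.0 for name in WEIGHTS})
halfway = fitness({name: 0.5 for name in WEIGHTS})
```

Since the seven weights sum to 1.0, the composite stays in [0, 1] whenever the inputs do, which is what lets a fitness of 0.91 be read as "out of 1.0".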
Mutation Operators
Three mutation operators introduce variation:
- Swap mutation: Exchange two agents' positions in the execution order
- Gate mutation: Adjust a quality gate threshold by +/- 10%
- Edge mutation: Add or remove a consultation edge between two agents
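A minimal sketch of the three operators follows. It makes two assumptions the text does not settle: "+/- 10%" is read as a relative change (multiply by 0.9 or 1.1), and thresholds are capped at 1.0. The function names are hypothetical.

```python
import random


def swap_mutation(order, rng):
    """Exchange two agents' positions in the execution order."""
    i, j = rng.sample(range(len(order)), 2)
    out = list(order)
    out[i], out[j] = out[j], out[i]
    return out


def gate_mutation(thresholds, rng):
    """Scale one threshold by 0.9 or 1.1 (assumed relative), capped at 1.0."""
    out = list(thresholds)
    i = rng.randrange(len(out))
    factor = rng.choice([0.9, 1.1])
    out[i] = min(1.0, round(out[i] * factor, 4))
    return out


def edge_mutation(edges, agents, rng):
    """Toggle one consultation edge: remove it if present, add it otherwise."""
    a, b = rng.sample(agents, 2)
    return set(edges) ^ {(a, b)}


rng = random.Random(42)
order = ["BA", "DS", "Causal", "Test", "MLOps"]
gates = [0.9, 0.85, 0.95, 0.90]
edges = {("DS", "Causal"), ("BA", "Test")}

swapped = swap_mutation(order, rng)
new_gates = gate_mutation(gates, rng)
new_edges = edge_mutation(edges, order, rng)
```

Each operator returns a new object and leaves its input untouched, so a parent genome can be mutated several times to produce distinct offspring.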
DataSims Evidence: Evolutionary Performance
From the DataSims evaluation (evaluation/results/evolutionary_mode.json):
| Metric | Centralized | Evolutionary | Delta |
|---|---|---|---|
| Best fitness | 0.925 (CES) | 0.91 | -1.6% |
| Convergence | N/A | Generation 67 | — |
| AUC-ROC | 0.847 | 0.847 | Equivalent |
| CES | 0.925 | 0.925 | Equivalent |
| Quality Gate | passed | passed | No degradation |
Key findings:
- 0.91 best fitness at generation 67: The GA converged to a topology with fitness 0.91 (out of 1.0) after 67 generations of evolution. This is a strong result given the search space size.
- Equivalent CES: The evolved topology achieved the same CES as the hand-designed centralized RACI, suggesting that the default coordination strategy is already near-optimal for the churn prediction task.
- Generation 67 convergence: Early generations explored widely (fitness 0.4-0.7). By generation 30, the population clustered around 0.85. Final convergence at generation 67 indicates the GA found a stable optimum.
xychart-beta
title "Evolutionary Convergence Curve"
x-axis "Generation" [0, 10, 20, 30, 40, 50, 60, 67]
y-axis "Fitness" 0.4 --> 1.0
line [0.4, 0.62, 0.78, 0.85, 0.88, 0.9, 0.91, 0.91]
🎯 When to use evolutionary mode: When you are unsure of the optimal coordination strategy for a novel task type. The GA explores the topology space and converges on a good strategy. For well-understood tasks (like churn prediction), the default centralized RACI is already near-optimal.
Comparison: When to Use Which
| Criterion | Centralized | Swarm | Evolutionary |
|---|---|---|---|
| Determinism | HIGH | LOW | LOW |
| Auditability | FULL | PARTIAL | FULL (best) |
| Speed | MODERATE | FAST | SLOW (setup) |
| Parallelism | LOW | HIGH | VARIES |
| Deadlock risk | NONE | 2% | NONE |
| Setup cost | LOW | LOW | HIGH (GA) |
| Novelty adapt. | LOW | MODERATE | HIGH |
| Best for | Regulated workflows | Exploratory analysis | Novel tasks or topology optimization |
Decision Framework
Use this decision tree to select the right mode:
flowchart TD
Q1["Is auditability required by regulation?"]
Q1 -->|YES| RACI["Centralized RACI"]
Q1 -->|NO| Q2["Is the task type well-understood?"]
Q2 -->|YES| RACI2["Centralized RACI\n(proven, lowest overhead)"]
Q2 -->|NO| Q3["Do you need speed over optimality?"]
Q3 -->|YES| SWARM["Swarm Stigmergy"]
Q3 -->|NO| EVO["Evolutionary GA"]
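The same decision tree can be written as a small function, which is handy when the choice is made from project metadata. The function name, parameter names, and return strings are illustrative assumptions.

```python
def choose_mode(regulated: bool, well_understood: bool,
                speed_over_optimality: bool) -> str:
    """Walk the decision tree above and return the suggested mode."""
    if regulated:
        return "centralized"      # auditability required by regulation
    if well_understood:
        return "centralized"      # proven, lowest overhead
    return "swarm" if speed_over_optimality else "evolutionary"


# Marcus's two projects from the start of the chapter.
project_a = choose_mode(regulated=True, well_understood=True,
                        speed_over_optimality=False)
project_b = choose_mode(regulated=False, well_understood=False,
                        speed_over_optimality=True)
```

Project A resolves to centralized RACI at the first question, while Project B falls through to swarm stigmergy because it is neither regulated nor well understood and favors speed.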
💡 In practice, most production deployments use centralized RACI. Swarm and evolutionary modes are valuable for research, exploration, and topology optimization -- but when a model goes to production, the compliance team wants deterministic, auditable execution.
Hybrid Approaches
The three modes are not mutually exclusive. Common hybrid patterns:
- Evolutionary discovery + Centralized execution: Use the GA to find the optimal topology offline, then deploy it as a centralized RACI configuration in production.
- Centralized with swarm phases: Use centralized RACI for the overall lifecycle, but allow swarm behavior within specific phases (e.g., feature engineering, where multiple data exploration agents can work in parallel).
- Swarm with centralized gates: Let agents coordinate via stigmergy, but require DIO-validated quality gates between major phases.
Key Takeaways
- Centralized RACI is deterministic and auditable but creates a coordination bottleneck -- best for regulated workflows
- Swarm stigmergy enables parallel execution through shared artifacts -- best for exploratory analysis (23 iterations to convergence, 2% deadlock, 98% recovery)
- Evolutionary GA optimizes the coordination topology itself -- best for novel tasks (0.91 fitness, convergence at generation 67)
- All three modes achieved equivalent model quality (AUC=0.847) and CES (0.925) on the churn prediction task
- The choice depends on auditability requirements, task novelty, and speed needs
- Hybrid approaches combine the strengths of multiple modes
For Further Exploration
- DataSims Repository -- Swarm and evolutionary results in evaluation/results/
- Chapter 21 -- RACI matrix architecture in detail
- Chapter 23 -- How each coordination mode handles errors differently