Chapter 22 — Three Coordination Modes: Centralized, Swarm, Evolutionary #

"Order is not a pressure which is imposed on society from without, but an equilibrium which is set up from within." -- José Ortega y Gasset

📖 25 min read | 👤 All personas | 🏷️ Part VI: Orchestration

What you'll learn:

Three fundamentally different ways to coordinate multi-agent systems
The tradeoffs of centralized RACI, swarm stigmergy, and evolutionary optimization
When to use each mode and why one size does not fit all
DataSims evidence: swarm convergence, deadlock rates, evolutionary fitness

The Problem: One Size Does Not Fit All #

Marcus, the data scientist, has two very different projects on his desk.

Project A is a regulatory churn model for the banking division. Every step must be auditable. Every decision must be traceable to a requirement. The regulator will ask "why did you choose this feature?" and Marcus needs a chain of evidence from business requirement to feature selection to model coefficient.

Project B is an exploratory analysis of a new market segment. Nobody knows what the right questions are yet. The data is messy, the hypothesis is vague, and the goal is to find interesting patterns as fast as possible. Auditability is nice but speed is essential.

Project A needs centralized, deterministic coordination. Project B needs something more fluid. The Neam agent stack supports both -- and a third mode for when you want the system to optimize its own coordination topology.

Mode 1: Centralized RACI #

Centralized RACI is the default coordination mode. The DIO acts as a central dispatcher, assigning tasks to specialist agents according to the RACI matrix.

Characteristics #

Property	Value
Determinism	High -- same inputs produce same execution order
Auditability	Complete -- every dispatch recorded with RACI
Bottleneck	DIO is a single point of coordination
Parallelism	Limited -- sequential phase gates
Best for	Regulated workflows, compliance-critical projects

How It Works #

DIO receives the task specification
DIO decomposes the task into phases (such as requirements, engineering, modeling, testing, deployment, and monitoring)
For each phase, DIO selects the R agent and dispatches the task
R agent executes, consulting C agents as needed
DIO validates the output against quality gates
If passed, DIO advances to the next phase
If failed, DIO retries or escalates
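
The dispatch loop above can be sketched in a few lines of Python. This is a minimal illustration, not the Neam implementation: `CentralDispatcher`, `DispatchRecord`, the single `gate_threshold`, and the idea that an agent returns a numeric quality score are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DispatchRecord:
    """One audit-log entry per dispatch (hypothetical structure)."""
    phase: str
    agent: str
    attempt: int
    passed: bool


@dataclass
class CentralDispatcher:
    # phase name -> name of the Responsible (R) agent for that phase
    responsible: dict[str, str]
    # agent name -> callable that runs the phase and returns a quality score
    agents: dict[str, Callable[[str], float]]
    gate_threshold: float = 0.9   # assumed single gate for all phases
    max_retries: int = 2          # retries after the first attempt
    audit_log: list[DispatchRecord] = field(default_factory=list)

    def run(self, phases: list[str]) -> bool:
        """Run phases in order; return False if a phase must be escalated."""
        for phase in phases:
            agent = self.responsible[phase]
            for attempt in range(1, self.max_retries + 2):
                score = self.agents[agent](phase)
                passed = score >= self.gate_threshold
                self.audit_log.append(DispatchRecord(phase, agent, attempt, passed))
                if passed:
                    break          # gate passed: advance to the next phase
            else:
                return False       # retries exhausted: escalate
        return True
```

Because every attempt is appended to `audit_log` before the gate decision is acted on, the log doubles as the trace that makes this mode auditable. The strictly sequential outer loop is also where the bottleneck comes from.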

Strengths #

Full traceability: Every decision is logged with RACI assignment
Predictable execution: Phase ordering is deterministic
Quality enforcement: No phase proceeds without DIO validation
Regulatory compliance: Audit trails satisfy most governance requirements

Weaknesses #

Central bottleneck: DIO must process every inter-agent communication
Sequential overhead: Phases that could run in parallel are serialized
Fragility: If the DIO's LLM call fails, the entire pipeline stalls

💡 When to use centralized RACI: Any project where auditability matters more than speed, such as regulatory models, production deployments, or anything else a compliance team will audit.

Mode 2: Swarm Stigmergy #

Swarm mode draws inspiration from biological swarm intelligence. Instead of a central dispatcher, agents coordinate through stigmergy -- indirect coordination via shared artifacts in the environment.

flowchart TB
  subgraph SWARM["SWARM STIGMERGY MODE"]
    BA["BA"] --> SPACE
    DS["DS"] --> SPACE
    Test["Test"] --> SPACE
    subgraph SPACE["SHARED ARTIFACT SPACE"]
      direction LR
      BRD["BRD"]
      Features["Features"]
      Model["Model"]
      Tests["Tests"]
      Deploy["Deploy"]
    end
    SPACE --> Causal["Causal"]
    SPACE --> MLOps["MLOps"]
    SPACE --> DIO["DIO (watch)"]
  end
  style SPACE fill:#f9f9f9,stroke:#333

Agents deposit artifacts, consume others' artifacts, and react to changes

How Stigmergy Works #

In biological swarms, ants deposit pheromones that other ants follow. In the Neam swarm mode, agents deposit artifacts (documents, models, test results) into a shared space. Other agents consume these artifacts and produce new ones.

The key insight: no agent tells another agent what to do. Agents react to the state of the shared environment.

Agent Behavior Loop (each agent independently)

SENSE: Check shared space for new/changed artifacts
DECIDE: Can I contribute based on my specialty?
ACT: Produce new artifact, deposit in shared space
SIGNAL: Artifact publication notifies interested agents
REPEAT: Continue until task converges
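
As a rough sketch of this loop, the following Python models the shared space as a set of artifact type names and each agent as a rule of the form "if these artifacts exist, I can produce that one". `SwarmAgent`, `run_swarm`, and the artifact names are hypothetical. Real agents would do actual work in the ACT step rather than just adding a label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwarmAgent:
    name: str
    needs: frozenset[str]   # artifact types this agent consumes
    produces: str           # artifact type this agent deposits


def run_swarm(agents, seed_artifacts, max_iterations=50):
    """Return (final space, productive iterations, reached quiescence)."""
    space = set(seed_artifacts)                  # the shared artifact space
    for iteration in range(max_iterations):
        # SENSE + DECIDE: every agent inspects the same snapshot of the space
        ready = [a for a in agents
                 if a.needs <= space and a.produces not in space]
        if not ready:
            return space, iteration, True        # no agent has pending work
        # ACT + SIGNAL: all ready agents deposit their artifacts
        for agent in ready:
            space.add(agent.produces)
    return space, max_iterations, False


agents = [
    SwarmAgent("BA", frozenset({"task"}), "BRD"),
    SwarmAgent("DS", frozenset({"BRD"}), "Model"),
    SwarmAgent("Test", frozenset({"Model"}), "Tests"),
]
space, iterations, quiescent = run_swarm(agents, {"task"})
```

No agent is ever told what to do here. Each one only checks whether the space already contains what it needs. Note that quiescence alone does not distinguish "finished" from "stuck", which is why deadlock handling is a separate concern.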

Convergence Detection #

The swarm converges when no agent has pending work:

Iteration	Active Agents	New Artifacts	Status
1	3	3	Exploring
5	4	2	Building
10	3	1	Refining
15	2	1	Converging
20	1	0	Validating
23	0	0	CONVERGED

Deadlock Prevention #

Swarms can deadlock when agents wait for artifacts that no agent will produce. The Neam swarm mode includes three deadlock prevention mechanisms:

Timeout watchdog: If no new artifact appears within a configurable window, the DIO (in observer mode) injects a stimulus
Dependency analysis: Before launching the swarm, the DIO verifies that every required artifact type has at least one capable producer
Recovery injection: If deadlock is detected, the DIO can temporarily take over as a centralized dispatcher for the stuck phase
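
Two of these mechanisms are simple enough to sketch directly. The code below is illustrative only: `missing_producers` and `TimeoutWatchdog` are hypothetical names, and the injectable `clock` parameter is an assumption added so the watchdog can be exercised without real waiting.

```python
import time


def missing_producers(required: set[str],
                      capabilities: dict[str, set[str]]) -> set[str]:
    """Dependency analysis: artifact types that no agent can produce."""
    producible = set().union(*capabilities.values())
    return required - producible


class TimeoutWatchdog:
    """Flags a stall when no artifact has appeared within the window."""

    def __init__(self, window_seconds: float, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self.last_artifact_at = clock()

    def artifact_published(self) -> None:
        # Call whenever any agent deposits an artifact.
        self.last_artifact_at = self.clock()

    def should_inject_stimulus(self) -> bool:
        return self.clock() - self.last_artifact_at > self.window


# Pre-launch check: "Tests" is required but nobody can produce it.
gaps = missing_producers(
    {"BRD", "Model", "Tests"},
    {"BA": {"BRD"}, "DS": {"Model"}},
)
```

A non-empty result from `missing_producers` means the swarm would be guaranteed to stall, so it is cheaper to refuse the launch than to rely on the watchdog later.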

DataSims Evidence: Swarm Performance #

From the DataSims evaluation (evaluation/results/swarm_mode.json):

Metric	Centralized	Swarm	Delta
Convergence	7 phases (serial)	23 iterations	Different measurement
Deadlock rate	0% (by design)	2%	Expected in decentralized coordination
Recovery rate	N/A	98%	Near-complete self-healing
AUC-ROC	0.847	0.847	Equivalent quality
CES	0.925	0.925	Equivalent effectiveness
Quality Gate	passed	passed	No degradation

Key findings:

23 iterations to convergence: The swarm took 23 iteration cycles (not sequential phases) to reach a stable state where all artifacts were complete and validated.
2% deadlock rate: In 2% of iteration cycles, agents experienced temporary deadlock. This is expected in stigmergic systems and is within acceptable bounds.
98% recovery rate: Of the deadlocks that occurred, 98% were automatically resolved by the recovery mechanisms. The remaining 2% of deadlocks, which is 2% of 2% = 0.04% of all iteration cycles, required DIO intervention.
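
The deadlock and recovery figures compose multiplicatively. The short calculation below works this through, both as a rate and as counts over a hypothetical run of 10,000 iteration cycles (the run length is chosen only to make the numbers round).

```python
deadlock_rate = 0.02    # share of iteration cycles that deadlock
recovery_rate = 0.98    # share of deadlocks resolved automatically

# Share of all iteration cycles that end in an unrecovered deadlock.
unresolved_share = deadlock_rate * (1 - recovery_rate)   # about 0.0004, i.e. 0.04%

# The same figures as counts over a hypothetical 10,000-cycle run.
cycles = 10_000
deadlocked = round(cycles * deadlock_rate)       # 200 cycles deadlock
recovered = round(deadlocked * recovery_rate)    # 196 recover automatically
not_recovered = deadlocked - recovered           # 4 need further handling
```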

⚠️ The 2% deadlock rate is a design tradeoff, not a defect. Centralized RACI has 0% deadlock because the DIO prevents it by construction. Swarm mode accepts a small deadlock probability in exchange for eliminating the central bottleneck and enabling parallel execution.

Swarm vs. Centralized Tradeoff

Dimension	Centralized RACI	Swarm Stigmergy
Deadlock Risk	None (0%)	Higher (2%)
Parallelism	Low (sequential)	High (concurrent)
Auditability	High (full RACI)	Moderate (artifact-based)

💡 When to use swarm mode: Exploratory analysis, research projects, situations where you want agents to discover emergent patterns rather than follow a predetermined plan.

Mode 3: Evolutionary Optimization #

Evolutionary mode uses a genetic algorithm to optimize the agent coordination topology itself. Instead of using a fixed coordination strategy (centralized or swarm), the system evolves the best topology for the specific task.

flowchart TB
  subgraph GEN1["Generation 1: Random Topologies"]
    direction LR
    T1["T1\n0.45"]
    T2["T2\n0.62"]
    T3["T3\n0.51"]
    T4["T4\n0.73"]
    T5["T5\n0.38"]
  end
  GEN1 --> SEL["Selection: Top 2 by fitness"]
  SEL --> T4S["T4 (0.73)"]
  SEL --> T2S["T2 (0.62)"]
  T4S --> CROSS["Crossover"]
  T2S --> CROSS
  CROSS --> GEN2
  subgraph GEN2["Generation 2: Evolved Topologies"]
    direction LR
    T4E["T4\n0.73"]
    T4P["T4'\n0.78"]
    T2P["T2'\n0.69"]
    T6["T6\n0.71"]
    T7["T7\n0.55"]
  end
  GEN2 --> REPEAT["... repeat for N generations ..."]
  REPEAT --> FINAL
  subgraph FINAL["Generation 67: Converged"]
    BEST["Best Topo\nFitness = 0.91"]
  end
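
The generational loop in the diagram (rank by fitness, keep the top two as parents, carry the best forward, fill the rest with mutated offspring) can be sketched generically. This is an assumption-laden illustration: `evolve` and its parameters are hypothetical, and the demo evolves toy bitstrings rather than real coordination topologies so that the example stays self-contained.

```python
import random


def evolve(population, fitness, crossover, mutate, generations, rng):
    """Top-2 selection with elitism; population must have at least 2 members."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        best, runner_up = ranked[0], ranked[1]
        children = [mutate(crossover(best, runner_up, rng), rng)
                    for _ in range(len(population) - 1)]
        population = [best] + children     # elitism: the best survives unchanged
    return max(population, key=fitness)


# Toy problem: genomes are bit lists, fitness is the fraction of ones.
def ones_fraction(genome):
    return sum(genome) / len(genome)


def one_point_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def flip_one_bit(genome, rng):
    i = rng.randrange(len(genome))
    return genome[:i] + [1 - genome[i]] + genome[i + 1:]


rng = random.Random(0)
initial = [[rng.randint(0, 1) for _ in range(12)] for _ in range(5)]
best = evolve(initial, ones_fraction, one_point_crossover, flip_one_bit,
              generations=20, rng=rng)
```

Because the best individual is always copied into the next generation, the best fitness can never decrease from one generation to the next. That is consistent with T4 reappearing unchanged in Generation 2 of the diagram.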

Genome Representation #

Each "topology" is a genome that encodes:


  Chromosome = [
    agent_order:        [BA, DS, Causal, Test, MLOps]  // execution sequence
    parallelism_flags:  [0, 1, 1, 0, 0]                // which phases run in parallel
    consultation_edges: [(DS, Causal), (BA, Test)]      // C relationships
    gate_thresholds:    [0.9, 0.85, 0.95, 0.90]        // quality gate strictness
    retry_limits:       [3, 2, 3, 1, 2]                 // per-agent retry budgets
  ]
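
One way to hold this genome in code is a small dataclass with a consistency check. The `Topology` class and its `validate` method are hypothetical. The check deliberately does not constrain the number of gate thresholds, since the example above lists four thresholds for five agents and the source does not say how gates map to agents.

```python
from dataclasses import dataclass


@dataclass
class Topology:
    agent_order: list[str]                      # execution sequence
    parallelism_flags: list[int]                # 1 = phase may run in parallel
    consultation_edges: set[tuple[str, str]]    # C relationships
    gate_thresholds: list[float]                # quality gate strictness
    retry_limits: list[int]                     # per-agent retry budgets

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the genome is consistent."""
        problems = []
        n = len(self.agent_order)
        if len(set(self.agent_order)) != n:
            problems.append("agent_order contains duplicates")
        if len(self.parallelism_flags) != n:
            problems.append("parallelism_flags length differs from agent_order")
        if len(self.retry_limits) != n:
            problems.append("retry_limits length differs from agent_order")
        if any(flag not in (0, 1) for flag in self.parallelism_flags):
            problems.append("parallelism_flags must be 0 or 1")
        if any(not 0.0 <= t <= 1.0 for t in self.gate_thresholds):
            problems.append("gate_thresholds must lie in [0, 1]")
        known = set(self.agent_order)
        for a, b in self.consultation_edges:
            if a not in known or b not in known:
                problems.append(f"edge ({a}, {b}) references an unknown agent")
        return problems


example = Topology(
    agent_order=["BA", "DS", "Causal", "Test", "MLOps"],
    parallelism_flags=[0, 1, 1, 0, 0],
    consultation_edges={("DS", "Causal"), ("BA", "Test")},
    gate_thresholds=[0.9, 0.85, 0.95, 0.90],
    retry_limits=[3, 2, 3, 1, 2],
)
```

Validating after every crossover or mutation is a cheap way to keep malformed topologies out of the population.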

Fitness Function #

The fitness function evaluates each topology on a composite score:


  Fitness(topology) =
      0.25 * quality_score          // model AUC, F1, etc.
    + 0.20 * speed_score            // time to completion
    + 0.15 * reliability_score      // error detection, recovery
    + 0.15 * traceability_score     // RACI completeness
    + 0.10 * documentation_score    // BRD, specs generated
    + 0.10 * cost_efficiency_score  // LLM token cost
    + 0.05 * adaptability_score     // response to quality issues

This is the same 7-dimension proficiency scoring used in the DataSims evaluation framework.
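
The weighted sum translates directly into Python. The weights below are taken from the formula above. The dictionary keys, the `fitness` function name, and the assumption that each component score is already normalized to the range 0 to 1 are illustrative choices.

```python
WEIGHTS = {
    "quality": 0.25,          # model AUC, F1, etc.
    "speed": 0.20,            # time to completion
    "reliability": 0.15,      # error detection, recovery
    "traceability": 0.15,     # RACI completeness
    "documentation": 0.10,    # BRD, specs generated
    "cost_efficiency": 0.10,  # LLM token cost
    "adaptability": 0.05,     # response to quality issues
}


def fitness(scores: dict[str, float]) -> float:
    """Composite fitness; each component score is assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


perfect = fitness({name: 1.0 for name in WEIGHTS})
```

The weights sum to 1.0, so a topology that scores 1.0 on every dimension has a fitness of 1.0, which is the ceiling implied by "out of 1.0" in the results.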

Mutation Operators #

Three mutation operators introduce variation:

Swap mutation: Exchange two agents' positions in the execution order
Gate mutation: Adjust a quality gate threshold by +/- 10%
Edge mutation: Add or remove a consultation edge between two agents
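
The three operators might look like the following. This is a sketch under assumptions: the function names are hypothetical, "+/- 10%" is read as a relative change to the current threshold, thresholds are clamped to the range 0 to 1, and edge mutation toggles a randomly chosen directed edge.

```python
import random


def swap_mutation(order, rng):
    """Exchange two agents' positions in the execution order."""
    i, j = rng.sample(range(len(order)), 2)
    mutated = list(order)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


def gate_mutation(thresholds, rng, step=0.10):
    """Scale one gate threshold up or down by `step`, clamped to [0, 1]."""
    i = rng.randrange(len(thresholds))
    factor = 1 + rng.choice((-step, step))
    mutated = list(thresholds)
    mutated[i] = min(1.0, max(0.0, mutated[i] * factor))
    return mutated


def edge_mutation(edges, agents, rng):
    """Toggle one consultation edge: add it if absent, remove it if present."""
    a, b = rng.sample(agents, 2)
    return set(edges) ^ {(a, b)}


rng = random.Random(1)
order = ["BA", "DS", "Causal", "Test", "MLOps"]
new_order = swap_mutation(order, rng)
new_gates = gate_mutation([0.9, 0.85, 0.95, 0.90], rng)
new_edges = edge_mutation({("DS", "Causal"), ("BA", "Test")}, order, rng)
```

Each operator returns a new object instead of mutating its input, so a parent topology can safely be reused to produce several different children.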

DataSims Evidence: Evolutionary Performance #

From the DataSims evaluation (evaluation/results/evolutionary_mode.json):

Metric	Centralized	Evolutionary	Delta
Best fitness	0.925 (CES)	0.91	-1.6%
Convergence	N/A	Generation 67	—
AUC-ROC	0.847	0.847	Equivalent
CES	0.925	0.925	Equivalent
Quality Gate	passed	passed	No degradation

Key findings:

0.91 best fitness at generation 67: The GA converged to a topology with fitness 0.91 (out of 1.0) after 67 generations of evolution. Starting from randomly initialized topologies, it came within 1.6% of the 0.925 scored by the hand-designed centralized configuration.
Equivalent CES: The evolved topology achieved the same CES as the hand-designed centralized RACI, suggesting that the default coordination strategy is already near-optimal for the churn prediction task.
Generation 67 convergence: Early generations explored widely (fitness 0.4-0.7). By generation 30, the population clustered around 0.85. Final convergence at generation 67 suggests the GA settled on a stable optimum, though a GA cannot guarantee that this optimum is global.

xychart-beta
  title "Evolutionary Convergence Curve"
  x-axis "Generation" [0, 10, 20, 30, 40, 50, 60, 67]
  y-axis "Fitness" 0.4 --> 1.0
  line [0.4, 0.62, 0.78, 0.85, 0.88, 0.9, 0.91, 0.91]

🎯 When to use evolutionary mode: When you are unsure of the optimal coordination strategy for a novel task type. The GA explores the topology space and converges on a good strategy. For well-understood tasks (like churn prediction), the default centralized RACI is already near-optimal.

Comparison: When to Use Which #

Criterion	Centralized	Swarm	Evolutionary
Determinism	HIGH	LOW	LOW
Auditability	FULL	PARTIAL	FULL (best)
Speed	MODERATE	FAST	SLOW (setup)
Parallelism	LOW	HIGH	VARIES
Deadlock risk	NONE	2%	NONE
Setup cost	LOW	LOW	HIGH (GA)
Novelty adaptation	LOW	MODERATE	HIGH
Best for	Regulated workflows	Exploratory analysis	Novel tasks or topology optimization

Decision Framework #

Use this decision tree to select the right mode:

flowchart TD
  Q1["Is auditability required by regulation?"]
  Q1 -->|YES| RACI["Centralized RACI"]
  Q1 -->|NO| Q2["Is the task type well-understood?"]
  Q2 -->|YES| RACI2["Centralized RACI\n(proven, lowest overhead)"]
  Q2 -->|NO| Q3["Do you need speed over optimality?"]
  Q3 -->|YES| SWARM["Swarm Stigmergy"]
  Q3 -->|NO| EVO["Evolutionary GA"]
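
The same decision tree can be written as a small function, which is handy if mode selection needs to be scripted. `Mode` and `select_mode` are hypothetical names. The three boolean inputs correspond to the three questions in the flowchart.

```python
from enum import Enum


class Mode(Enum):
    CENTRALIZED = "Centralized RACI"
    SWARM = "Swarm Stigmergy"
    EVOLUTIONARY = "Evolutionary GA"


def select_mode(audit_required: bool,
                task_well_understood: bool,
                speed_over_optimality: bool) -> Mode:
    """Walk the three questions of the decision tree in order."""
    if audit_required:
        return Mode.CENTRALIZED
    if task_well_understood:
        return Mode.CENTRALIZED      # proven, lowest overhead
    return Mode.SWARM if speed_over_optimality else Mode.EVOLUTIONARY


# Marcus's two projects from the start of the chapter.
project_a = select_mode(audit_required=True,
                        task_well_understood=True,
                        speed_over_optimality=False)
project_b = select_mode(audit_required=False,
                        task_well_understood=False,
                        speed_over_optimality=True)
```

Applied to the opening scenario, the regulated churn model lands on centralized RACI and the exploratory market analysis lands on swarm stigmergy.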

💡 In practice, most production deployments use centralized RACI. Swarm and evolutionary modes are valuable for research, exploration, and topology optimization -- but when a model goes to production, the compliance team wants deterministic, auditable execution.

Hybrid Approaches #

The three modes are not mutually exclusive. Common hybrid patterns:

Evolutionary discovery + Centralized execution: Use the GA to find the optimal topology offline, then deploy it as a centralized RACI configuration in production.

Centralized with swarm phases: Use centralized RACI for the overall lifecycle, but allow swarm behavior within specific phases (e.g., feature engineering, where multiple data exploration agents can work in parallel).

Swarm with centralized gates: Let agents coordinate via stigmergy, but require DIO-validated quality gates between major phases.

Key Takeaways #

Centralized RACI is deterministic and auditable but creates a coordination bottleneck -- best for regulated workflows
Swarm stigmergy enables parallel execution through shared artifacts -- best for exploratory analysis (23 iterations to convergence, 2% deadlock, 98% recovery)
Evolutionary GA optimizes the coordination topology itself -- best for novel tasks (0.91 fitness, convergence at generation 67)
All three modes achieved equivalent model quality (AUC=0.847) and CES (0.925) on the churn prediction task
The choice depends on auditability requirements, task novelty, and speed needs
Hybrid approaches combine the strengths of multiple modes

For Further Exploration #

DataSims Repository -- Swarm and evolutionary results in evaluation/results/
Chapter 21 -- RACI matrix architecture in detail
Chapter 23 -- How each coordination mode handles errors differently