Chapter 5: Meet the 14 Agents #

"The strength of the team is each individual member. The strength of each member is the team." -- Phil Jackson


30 min read | All personas | Part II: The Architecture

What you'll learn:


The Problem #

Marcus is a data scientist at a mid-size e-commerce company. He has been asked to build a churn prediction model. He knows how to train an XGBoost classifier. He knows how to compute SHAP values. What he does not know is where the customer data lives, whether it has PII that needs masking, what the business definition of "churn" actually is, how to build the feature pipeline, how to validate data quality, how to deploy the model to production, or how to monitor it for drift after launch.

Marcus is not incompetent. He is specialized. And that is the fundamental problem with modern data organizations: the work requires 7-10 distinct specializations, but most teams have 3-4 people who each cover 2-3 areas with varying depth. The gaps between specializations are where projects fail.

The Intelligent Data Organization does not replace Marcus. It gives him 13 colleagues who are always available, never context-switch to other projects, and communicate through structured artifacts rather than Slack messages that disappear into the scroll.

Here are those 13 colleagues, plus the orchestrator who coordinates them all.


Agent Profile Cards #

Each agent below is presented as a profile card covering its essential characteristics. These are not theoretical descriptions -- they correspond directly to agent type declarations in the Neam language and their C++ runtime implementations.

Architecture Diagram

Agent 1: Data Agent #

Data Agent — Layer 1 | Source & Schema Manager

Role: Source & Schema Manager  |  Neam type: data agent { ... }
Personality: Meticulous, contract-oriented, defensive
Authority: Controls all data ingestion points

Key Capabilities:

  • Declare typed schema contracts with version tracking
  • Register sources (PostgreSQL, S3, Kafka, REST APIs)
  • Configure sinks with write modes and batching
  • Define quality gates (freshness, completeness, uniqueness)
  • Route computation to appropriate engines (Spark, DuckDB)
  • Track lineage from source to destination

Activated When: New data source onboarding · Schema contract definition or update · Quality gate configuration

Produces: Schema contracts, source registrations, quality gate configs, lineage metadata
Consumes: Infrastructure profiles, governance policies
Traits: DataProducer, QualityGatekeeper
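
The quality-gate idea can be sketched in plain Python. Everything below (the `QualityGate` class, `evaluate_gates`, the thresholds) is illustrative, not the actual Neam runtime API:

```python
from dataclasses import dataclass

# Hypothetical sketch of quality gates: completeness and uniqueness checks
# over a batch of rows, with blocking vs non-blocking gates.
@dataclass
class QualityGate:
    name: str
    blocking: bool
    check: callable

def completeness(rows, column, threshold=0.95):
    """Fraction of non-null values in `column` must meet `threshold`."""
    non_null = sum(1 for r in rows if r.get(column) is not None)
    return non_null / len(rows) >= threshold

def uniqueness(rows, column):
    """No duplicate values allowed in `column`."""
    values = [r[column] for r in rows if column in r]
    return len(values) == len(set(values))

def evaluate_gates(rows, gates):
    """Return (passed, failures); any failed blocking gate rejects the batch."""
    failures = [g.name for g in gates if not g.check(rows)]
    blocked = any(g.blocking and g.name in failures for g in gates)
    return not blocked, failures

rows = [
    {"customer_id": 1, "email": "a@x.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 2, "email": "c@x.com"},  # duplicate customer_id
]
gates = [
    QualityGate("email_completeness", blocking=False,
                check=lambda r: completeness(r, "email")),
    QualityGate("customer_id_unique", blocking=True,
                check=lambda r: uniqueness(r, "customer_id")),
]
passed, failures = evaluate_gates(rows, gates)
```

The duplicate key trips the blocking gate, so the batch is rejected; the completeness miss alone would only have produced a warning.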


Agent 2: ETL Agent #

ETL Agent — Layer 1 | SQL-First Warehouse Builder

Role: SQL-First Warehouse Builder  |  Neam type: etl agent { ... }
Personality: Methodical, SQL-native, transformation-focused
Authority: Controls warehouse schema and data loading

Key Capabilities:

  • Dimensional modeling (Kimball star, Inmon, Data Vault)
  • SQL-first transformations with multi-dialect transpilation
  • SCD Type 1/2/3 handling for slowly changing dimensions
  • Feature engineering from warehouse tables
  • Semantic layer definitions
  • Self-healing pipeline recovery
  • Automatic lineage tracking at column level

Activated When: Staging → warehouse transformation · Feature table engineering · Schema change pipeline updates

Produces: Dimension tables, fact tables, feature tables, aggregate tables, SQL transformation scripts
Consumes: Schema contracts (Data Agent), BRDs (Data-BA), governance policies, infrastructure profiles
Traits: DataProducer, DataConsumer
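
SCD Type 2, the trickiest of the three modes listed above, preserves history by closing out the current row and appending a new one. A minimal Python sketch with assumed field names (`valid_from`, `valid_to`, `is_current`); real implementations do this in SQL MERGE statements:

```python
from datetime import date

# Illustrative SCD Type 2 merge: on a tracked-attribute change, the current
# row is closed out and a new current row is appended.
def scd2_merge(dim_rows, incoming, key, tracked, today):
    """Apply one incoming record to a Type 2 dimension (list of dicts)."""
    current = next((r for r in dim_rows
                    if r[key] == incoming[key] and r["is_current"]), None)
    if current and all(current[c] == incoming[c] for c in tracked):
        return dim_rows  # no change, nothing to do
    if current:
        current["is_current"] = False
        current["valid_to"] = today
    dim_rows.append({**incoming, "valid_from": today,
                     "valid_to": None, "is_current": True})
    return dim_rows

dim = [{"customer_id": 7, "tier": "silver",
        "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True}]
dim = scd2_merge(dim, {"customer_id": 7, "tier": "gold"},
                 key="customer_id", tracked=["tier"], today=date(2024, 6, 1))
```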


Agent 3: Migration Agent #

Migration Agent — Layer 1 | Zero-Downtime Platform Mover

Role: Zero-Downtime Platform Mover  |  Neam type: migration agent { ... }
Personality: Cautious, methodical, rollback-ready
Authority: Controls platform migration execution

Key Capabilities:

  • Wave planning (prioritize tables by dependency graph)
  • Schema translation across platforms (Oracle → Snowflake)
  • Data type mapping with precision preservation
  • Reconciliation (row counts, checksums, sampling)
  • Cutover strategies (big-bang, trickle, dual-write)
  • Rollback plans for every migration wave

Activated When: Moving between data platforms · Legacy decommissioning · Cloud migration (on-prem to cloud)

Produces: Migration plans, schema translation scripts, reconciliation reports, cutover runbooks
Consumes: Source schemas, target infrastructure profiles, governance constraints
Traits: DataProducer, DataConsumer
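
The reconciliation step can be illustrated with row counts plus an order-independent checksum, so that source and target extracts compare equal regardless of row ordering. A hedged sketch; `table_checksum` and `reconcile` are hypothetical names, not the Migration Agent's API:

```python
import hashlib

# Order-independent checksum: hash each row, XOR the digests together.
def table_checksum(rows):
    acc = 0
    for row in rows:
        digest = hashlib.sha256(repr(sorted(row.items())).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return acc

def reconcile(source_rows, target_rows):
    """Compare counts and checksums; both must match for the wave to pass."""
    report = {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "counts_match": len(source_rows) == len(target_rows),
        "checksums_match": table_checksum(source_rows) == table_checksum(target_rows),
    }
    report["passed"] = report["counts_match"] and report["checksums_match"]
    return report

src = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
tgt = [{"id": 2, "amt": 20}, {"id": 1, "amt": 10}]  # same data, different order
report = reconcile(src, tgt)
```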


Agent 4: DataOps Agent #

DataOps Agent — Layer 2 | SRE for Data

Role: SRE for Data  |  Neam type: dataops agent { ... }
Personality: Vigilant, proactive, escalation-aware
Authority: Can restart, retry, and skip pipeline stages

Key Capabilities:

  • Pipeline monitoring (latency, volume, error rates)
  • Anomaly detection (statistical + ML-based)
  • Cross-source correlation (identify cascading failures)
  • Automated triage and root cause classification
  • Guardrailed auto-heal (restart, retry, skip with limits)
  • SLA tracking and reporting
  • FinOps: cost tracking per pipeline, per agent

Activated When: Pipeline metrics exceed thresholds · SLA breach imminent or occurred · Cost anomaly in compute · DIO requests health status

Produces: Health reports, anomaly alerts, triage reports, SLA dashboards, cost breakdowns
Consumes: Pipeline metrics, infrastructure state, event bus alerts
Traits: QualityGatekeeper
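
The statistical half of anomaly detection can be as simple as a sigma rule over recent pipeline metrics. A toy sketch; the 3-sigma threshold is a common convention, not a DataOps Agent constant:

```python
import statistics

# Flag the latest metric value if it falls more than `sigmas` standard
# deviations from the historical mean.
def is_anomalous(history, latest, sigmas=3.0):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(latest - mean) > sigmas * stdev

latencies = [102, 98, 101, 99, 100, 103, 97, 100]  # seconds per pipeline run
assert not is_anomalous(latencies, 105)  # within 3 sigma
assert is_anomalous(latencies, 160)      # clear outlier
```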


Agent 5: Governance Agent #

Governance Agent — Layer 2 | Compliance Officer

Role: Compliance Officer  |  Neam type: governance agent { ... }
Personality: Strict, policy-driven, audit-minded
Authority: Can block data flows that violate policies

Key Capabilities:

  • Data classification (PII, PHI, financial, public)
  • Access policy enforcement (RBAC, ABAC)
  • Column-level lineage tracking
  • Regulatory compliance validation (GDPR, CCPA, DORA)
  • Audit trail generation (every access logged)
  • Data quality scoring per domain
  • External tool connectors (Collibra, Atlas, Purview)

Activated When: New source contains PII/PHI · Regulatory audit requested · Access policy creation or validation · Cross-border data movement

Produces: Classification reports, access policies, lineage graphs, audit trails, compliance certificates
Consumes: Schema metadata, data catalogs, regulatory requirements, infrastructure profiles
Traits: QualityGatekeeper
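
Data classification is often bootstrapped with pattern matching over sampled column values. A deliberately simplified sketch; real classifiers also use column-name heuristics and ML models, and these regexes are toy examples:

```python
import re

# Simplified PII detection patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{7,}$"),
}

def classify_column(values, threshold=0.8):
    """Label a column PII if most sampled values match a PII pattern."""
    for label, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(str(v)))
        if hits / len(values) >= threshold:
            return {"classification": "PII", "detected_type": label}
    return {"classification": "public", "detected_type": None}

emails = classify_column(["a@x.com", "b@y.org", "c@z.net"])
ids = classify_column(["1001", "1002", "1003"])
```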


Agent 6: Modeling Agent #

Modeling Agent — Layer 2 | Data Architect

Role: Data Architect  |  Neam type: modeling agent { ... }
Personality: Analytical, pattern-seeking, standards-aware
Authority: Proposes schema changes (requires approval)

Key Capabilities:

  • Schema reverse-engineering from existing databases
  • ER model generation and visualization
  • Normalization analysis (1NF through BCNF)
  • Dimensional design (star schema, snowflake, vault)
  • Schema amendment proposals with impact analysis
  • Cross-schema dependency mapping

Activated When: New database needs architectural analysis · Schema change impact assessment · Normalization/denormalization consideration · Data model documentation

Produces: ER diagrams, normalization reports, dimensional designs, amendment proposals, dependency maps
Consumes: Schema metadata, data catalogs, existing models, business requirements
Traits: DataConsumer


Agent 7: Analyst Agent #

Analyst Agent — Layer 2 | NL-to-SQL Query Engine

Role: NL-to-SQL Query Engine  |  Neam type: analyst agent { ... }
Personality: Responsive, dialect-aware, insight-oriented
Authority: Read-only access through governed channels

Key Capabilities:

  • Natural language to SQL translation
  • 9 SQL dialects (Postgres, Snowflake, BigQuery, Databricks SQL, Redshift, Oracle, Teradata, Trino, DuckDB)
  • Platform-specific query optimization
  • Governed execution (respects access policies)
  • Multi-format output (Excel, PDF, HTML, Slack, JSON)
  • Insight discovery and anomaly highlighting

Activated When: Business user needs ad-hoc analysis · DIO needs data exploration · Causal Agent needs observational data

Produces: Query results, formatted reports, data summaries, insight annotations
Consumes: Schema metadata, SQL connections, governance policies, natural language queries
Traits: DataConsumer
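
Dialect awareness matters even for something as basic as row limiting. A minimal illustration of the idea; production transpilers such as SQLGlot cover far more than this, and the function name is an assumption:

```python
# The same logical query renders differently per platform: LIMIT for
# Postgres-family dialects, FETCH FIRST for Oracle, TOP for Teradata.
def render_top_n(table, n, dialect):
    if dialect in ("postgres", "snowflake", "duckdb", "trino"):
        return f"SELECT * FROM {table} LIMIT {n}"
    if dialect == "oracle":
        return f"SELECT * FROM {table} FETCH FIRST {n} ROWS ONLY"
    if dialect == "teradata":
        return f"SELECT TOP {n} * FROM {table}"
    raise ValueError(f"unsupported dialect: {dialect}")

q = render_top_n("orders", 10, "teradata")
```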


Agent 8: Data-BA Agent #

Data-BA Agent — Layer 3 | Requirements Intelligence Analyst

Role: Requirements Intelligence Analyst  |  Neam type: databa agent { ... }
Personality: Inquisitive, structured, traceability-obsessed
Authority: Defines what should be built (not how)

Key Capabilities:

  • LLM-assisted requirements elicitation
  • Business Requirements Document (BRD) generation
  • Given/When/Then acceptance criteria formulation
  • Traceability matrix (requirement → implementation → test)
  • Impact analysis for requirement changes
  • BABOK v3 aligned elicitation techniques
  • Stakeholder communication templates

Activated When: New data project initiated · Business requirements need formalization · Requirement change impact assessment · Acceptance criteria definition

Produces: BRDs, acceptance criteria, traceability matrices, impact analysis reports, stakeholder summaries
Consumes: Business context, Agent.MD domain knowledge, existing project documentation
Traits: DataProducer

Insight

- The Data-BA Agent operates at "day minus one" of the data lifecycle. Before any pipeline is built, any model is trained, or any query is written, the Data-BA produces the BRD that defines what success looks like. This is the single most impactful architectural decision in the system: requirements before engineering.
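
Traceability can be modeled as links from each requirement to its implementations and tests; any requirement lacking either link is a gap. A sketch with assumed field names (the chapter does not show the real BRD schema):

```python
from dataclasses import dataclass, field

# Illustrative traceability structures, not the Data-BA Agent's format.
@dataclass
class Requirement:
    req_id: str
    statement: str
    acceptance: str            # Given/When/Then text
    implemented_by: list = field(default_factory=list)
    tested_by: list = field(default_factory=list)

def coverage_gaps(requirements):
    """Requirements lacking an implementation or a test are traceability gaps."""
    return [r.req_id for r in requirements
            if not r.implemented_by or not r.tested_by]

reqs = [
    Requirement("BR-001", "Predict churn 30 days ahead",
                "Given a scored customer, When risk > 0.7, Then flag for outreach",
                implemented_by=["churn_model_v1"], tested_by=["test_auc_threshold"]),
    Requirement("BR-002", "Mask PII in feature tables",
                "Given a feature table, When it contains email, Then hash it",
                implemented_by=["mask_pii_step"]),  # no test yet: a gap
]
gaps = coverage_gaps(reqs)
```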


Agent 9: DataScientist Agent #

DataScientist Agent — Layer 3 | ML/AI Modeler

Role: ML/AI Modeler  |  Neam type: datascientist agent { ... }
Personality: Experimental, hypothesis-driven, explainability-focused
Authority: Trains models, registers in MLflow

Key Capabilities:

  • Problem framing (classification, regression, clustering)
  • Hypothesis testing with statistical rigor
  • EDA-driven technique selection
  • Volume-aware compute routing: <100K rows → scikit-learn · 100K–10M → XGBoost/LightGBM · >10M → Spark ML
  • Feature engineering and selection
  • ML, DL, and NLP pipeline construction
  • SHAP-based explainability for every model

Activated When: Prediction task defined in BRD · Feature table ready · Model retraining triggered by MLOps

Produces: Trained models (MLflow), SHAP values, EDA reports, feature importance rankings
Consumes: Feature tables, BRD acceptance criteria, compute profiles, Agent.MD preferences
Traits: DataConsumer, DataProducer
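
The volume-aware routing rule from the capability list reduces to a threshold function. The cutoffs come from the card above; the function name is an assumption:

```python
# Route model training to a backend by dataset size, per the thresholds
# stated in the profile card.
def select_training_backend(row_count):
    if row_count < 100_000:
        return "scikit-learn"
    if row_count <= 10_000_000:
        return "xgboost-or-lightgbm"
    return "spark-ml"

assert select_training_backend(50_000) == "scikit-learn"
assert select_training_backend(2_000_000) == "xgboost-or-lightgbm"
assert select_training_backend(50_000_000) == "spark-ml"
```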


Agent 10: Causal Agent #

Causal Agent — Layer 3 | Causal Reasoning Engine

Role: Causal Reasoning Engine  |  Neam type: causal agent { ... }
Personality: Skeptical, rigorous, "correlation is not causation" embodied
Authority: Validates causal claims, proposes interventions

Key Capabilities:

  • Pearl's Ladder of Causation:
    • Rung 1: Association (observational, P(Y|X))
    • Rung 2: Intervention (do-calculus, P(Y|do(X)))
    • Rung 3: Counterfactual (P(Y_x|X',Y'))
  • Structural Causal Model (SCM) construction
  • Bayesian inference via PyMC
  • Causal discovery via DoWhy
  • Average Treatment Effect (ATE) estimation
  • Counterfactual scenario generation

Activated When: Root cause analysis needed · Intervention impact estimation · "What if" scenarios · Revenue anomaly causal explanation

Produces: Causal DAGs, ATE estimates, counterfactual reports, intervention recommendations, SCM specifications
Consumes: Feature tables, model outputs, SHAP values, domain knowledge from Agent.MD
Traits: CausalReasoner, DataConsumer

Anti-Pattern

- Using SHAP values as causal evidence. SHAP tells you which features were important to the model's prediction. It does not tell you which features cause the outcome. A model might weight days_since_last_order highly for churn prediction, but the Causal Agent reveals that support_ticket_resolution_time is the actual causal driver. Reducing resolution time causes reduced churn. Reducing days since last order is just chasing a symptom.
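
Rung 2 estimation can be illustrated with the simplest form of backdoor adjustment: stratify on a confounder and average the per-stratum treatment effects, weighted by stratum size. A toy sketch on synthetic data (the actual agent uses DoWhy and PyMC, not this hand-rolled estimator):

```python
from collections import defaultdict

# Backdoor-adjustment sketch: estimate the ATE of a binary treatment `t`
# on a binary outcome `y`, adjusting for a confounder `z`.
def ate_backdoor(records):
    strata = defaultdict(list)
    for r in records:
        strata[r["z"]].append(r)
    n = len(records)
    ate = 0.0
    for rows in strata.values():
        treated = [r["y"] for r in rows if r["t"] == 1]
        control = [r["y"] for r in rows if r["t"] == 0]
        effect = sum(treated) / len(treated) - sum(control) / len(control)
        ate += (len(rows) / n) * effect  # weight by P(Z=z)
    return ate

# Synthetic data: within each stratum the treatment raises P(y=1) by 0.5.
data = (
    [{"t": 1, "y": 1, "z": 0}] * 3 + [{"t": 1, "y": 0, "z": 0}] * 1 +
    [{"t": 0, "y": 1, "z": 0}] * 1 + [{"t": 0, "y": 0, "z": 0}] * 3 +
    [{"t": 1, "y": 1, "z": 1}] * 2 + [{"t": 1, "y": 0, "z": 1}] * 2 +
    [{"t": 0, "y": 0, "z": 1}] * 4
)
ate = ate_backdoor(data)
```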


Agent 11: DataTest Agent #

DataTest Agent — Layer 3 | Independent Quality Validator

Role: Independent Quality Validator  |  Neam type: datatest agent { ... }
Personality: Skeptical, adversarial, never rubber-stamps
Authority: Can block deployments via quality gate failures

Key Capabilities:

  • Test generation from BRD acceptance criteria
  • ETL validation (row counts, schema, referential integrity)
  • Data warehouse consistency checks
  • ML model validation (AUC, precision, recall thresholds)
  • API endpoint testing
  • Quality gates: blocking (must pass) vs advisory (warning)
  • Test coverage reporting with traceability to requirements

Activated When: Artifact validation needed · Quality gate checkpoint reached · Deployment approval requested

Produces: Test reports, quality gate verdicts (PASS/FAIL), coverage metrics, defect lists
Consumes: BRD acceptance criteria, feature tables, trained models, API endpoints, ETL outputs
Traits: QualityGatekeeper

Insight

- The DataTest Agent is architecturally separated from all builder agents. The agent that trains the model cannot be the agent that validates it. This is not a code organization choice; it is a trust boundary. In the Neam runtime, the DataTest Agent has read-only access to artifacts produced by other agents. It cannot modify them, only judge them.
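
The blocking-versus-advisory distinction from the capability list can be sketched as follows; the gate names and thresholds here are invented for illustration:

```python
# Blocking failures produce a FAIL verdict; advisory failures only warn.
def run_quality_gates(metrics, gates):
    """gates: list of (name, predicate, blocking). Returns verdict + lists."""
    failed_blocking, warnings = [], []
    for name, predicate, blocking in gates:
        if not predicate(metrics):
            (failed_blocking if blocking else warnings).append(name)
    verdict = "FAIL" if failed_blocking else "PASS"
    return verdict, failed_blocking, warnings

gates = [
    ("auc_floor",      lambda m: m["auc"] >= 0.80, True),
    ("recall_floor",   lambda m: m["recall"] >= 0.70, True),
    ("latency_budget", lambda m: m["p95_ms"] <= 200, False),
]
verdict, blocked, warns = run_quality_gates(
    {"auc": 0.84, "recall": 0.75, "p95_ms": 230}, gates)
```

Here the model clears both blocking gates, so the verdict is PASS, but the latency miss surfaces as a warning rather than silently disappearing.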


Agent 12: MLOps Agent #

MLOps Agent — Layer 3 | Production ML Guardian

Role: Production ML Guardian  |  Neam type: mlops agent { ... }
Personality: Operationally cautious, metrics-obsessed
Authority: Controls model deployment and rollback

Key Capabilities:

  • 6 types of drift detection:
    • Data drift (feature distribution shift)
    • Concept drift (target relationship change)
    • Prediction drift (output distribution change)
    • Feature drift (individual feature shifts)
    • Label drift (ground truth distribution change)
    • Upstream drift (source data pattern change)
  • Deployment strategies (canary, shadow, blue-green, A/B)
  • Champion-challenger model management
  • Automated retraining triggers
  • Serving infrastructure (Flask, FastAPI, SageMaker)

Activated When: Model passes quality gates · Drift thresholds exceeded · Champion underperforming challenger · Scheduled retraining window

Produces: Deployment configs, drift reports, model serving endpoints, retraining triggers, A/B test results
Consumes: Validated models, quality gate results, prediction logs, monitoring metrics
Traits: QualityGatekeeper, DataConsumer
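
Data drift (the first type above) is commonly measured with the Population Stability Index over binned feature distributions. A sketch; the 0.1 "watch" and 0.25 "alert" thresholds are industry rules of thumb, not Neam constants:

```python
import math

# PSI: compare a feature's baseline and live bin proportions. Values near 0
# mean no shift; above ~0.25 is conventionally treated as significant drift.
def psi(expected, actual, eps=1e-6):
    """expected/actual: bin proportions that each sum to 1."""
    total = 0.0
    for p, q in zip(expected, actual):
        p, q = max(p, eps), max(q, eps)
        total += (q - p) * math.log(q / p)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]
stable   = [0.24, 0.26, 0.25, 0.25]   # near-identical distribution
shifted  = [0.05, 0.15, 0.30, 0.50]   # mass moved to the upper bins
```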


Agent 13: DIO (Data Intelligent Orchestrator) #

DIO (Data Intelligent Orchestrator) — Layer 4 | Multi-Agent Coordinator

Role: Multi-Agent Coordinator  |  Neam type: dio agent { ... }
Personality: Strategic, delegation-focused, accountability-driven
Authority: Can activate any agent, assign RACI roles, allocate budgets

Key Capabilities:

  • Task understanding (intent classification + decomposition)
  • Crew formation (scored selection of agent subsets)
  • Pattern selection (8 auto-patterns for common workflows)
  • RACI delegation (Responsible, Accountable, Consulted, Informed for every sub-task)
  • Execution management (sequential, parallel, conditional)
  • State machine with checkpoint/rewind
  • Error recovery (retry → fallback → escalation)
  • Result synthesis (combine all agent outputs)

Activated When: Any data task is submitted · Always active as the entry point for all orchestrated work

Produces: Execution plans, RACI matrices, crew assignments, synthesized results, activity logs
Consumes: Task descriptions, Agent.MD domain knowledge, infrastructure profiles, agent status reports
Traits: (Orchestrator — unique role, no data traits)

The DIO is covered in depth in Chapter 6.


Agent 14: Deploy Agent #

Deploy Agent — Cross-Layer | Infrastructure Deployment Manager

Role: Infrastructure Deployment Manager  |  Neam type: deploy { ... }
Personality: DevOps-native, infrastructure-as-code oriented
Authority: Provisions and tears down compute resources

Key Capabilities:

  • Container deployment (Docker, Kubernetes)
  • Serverless deployment (Lambda, Cloud Run, Azure Functions)
  • Infrastructure-as-Code generation (Terraform, CloudFormation)
  • Multi-cloud targeting from single Neam program
  • Health checks and readiness probes
  • Rolling updates with automatic rollback

Activated When: Model needs production serving infrastructure · Pipeline needs scheduled compute · Infrastructure changes for scaling

Produces: Deployment manifests, Terraform plans, container configs, health check endpoints
Consumes: Infrastructure profiles, model artifacts, deployment strategies from MLOps Agent
Traits: DataProducer


The Agent Capability Matrix #

With all 14 agents introduced, here is how their capabilities map across the data lifecycle:

| Agent | Requirements | Ingest Data | Transform | Model/Train | Governance | Monitor/Test | Deploy/Serve |
|---|---|---|---|---|---|---|---|
| Data Agent | | Primary | | | Supporting | | |
| ETL Agent | | Supporting | Primary | | | | |
| Migration Agent | | Primary | Primary | | | | |
| DataOps Agent | | | Supporting | | | Primary | |
| Governance Agent | | | | | Primary | | |
| Modeling Agent | | | Supporting | | | | |
| Analyst Agent | | | | Supporting | | | |
| Data-BA Agent | Primary | | | | | | |
| DataScientist Agent | | | Supporting | Primary | | | |
| Causal Agent | | | | Primary | | | |
| DataTest Agent | | | | | | Primary | |
| MLOps Agent | | | | | | Primary | Primary |
| Deploy Agent | | | | | | | Primary |
| DIO | Supporting | Supporting | Supporting | Supporting | Supporting | Supporting | Supporting |

The Trait-Based Capability System #

Agents are not categorized by arbitrary labels. They implement traits -- composable capability markers that define what an agent can do in the system. Four traits form the foundation:

DataProducer #

Agents with the DataProducer trait create new data artifacts: tables, files, models, reports. The Data Agent produces schema registrations. The ETL Agent produces dimension and fact tables. The DataScientist Agent produces trained models. The Data-BA Agent produces BRDs.

DataConsumer #

Agents with the DataConsumer trait read data artifacts produced by others. The ETL Agent consumes schema contracts from the Data Agent. The DataScientist Agent consumes feature tables from the ETL Agent. The Causal Agent consumes model outputs and SHAP values from the DataScientist Agent.

CausalReasoner #

Only the Causal Agent holds this trait. It marks the ability to perform causal inference -- constructing SCMs, applying do-calculus, generating counterfactuals. This is not just another form of data analysis; it operates on a fundamentally different rung of Pearl's Ladder of Causation.

QualityGatekeeper #

Agents with the QualityGatekeeper trait can block downstream progress. The Data Agent blocks ingestion if quality gates fail. The Governance Agent blocks data flows that violate compliance policies. The DataTest Agent blocks deployment if tests fail. The DataOps Agent blocks operations if SLA breaches are detected. The MLOps Agent blocks serving if drift exceeds thresholds.
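
A rough Python analogue of this trait system uses mixin marker classes. The trait and agent names match the chapter; the mechanism itself is illustrative, not the Neam runtime:

```python
# Traits as composable marker classes; agents declare capabilities by
# inheriting the relevant markers.
class DataProducer: ...
class DataConsumer: ...
class CausalReasoner: ...
class QualityGatekeeper: ...

class DataScientistAgent(DataProducer, DataConsumer): ...
class CausalAgent(CausalReasoner, DataConsumer): ...
class DataTestAgent(QualityGatekeeper): ...

def can_block(agent) -> bool:
    """Only QualityGatekeeper agents may block downstream progress."""
    return isinstance(agent, QualityGatekeeper)

assert can_block(DataTestAgent())
assert not can_block(CausalAgent())
```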

| Agent | DataProducer | DataConsumer | CausalReasoner | QualityGatekeeper |
|---|---|---|---|---|
| Data Agent | X | | | X |
| ETL Agent | X | X | | |
| Migration Agent | X | X | | |
| DataOps Agent | | | | X |
| Governance Agent | | | | X |
| Modeling Agent | | X | | |
| Analyst Agent | | X | | |
| Data-BA Agent | X | | | |
| DataScientist | X | X | | |
| Causal Agent | | X | X | |
| DataTest Agent | | | | X |
| MLOps Agent | | X | | X |
| Deploy Agent | X | | | |
| DIO | Orchestrator — coordinates all traits | | | |

Insight

- Traits are not mutually exclusive. The DataScientist Agent is both a DataProducer (it creates models) and a DataConsumer (it reads feature tables). The MLOps Agent is both a DataConsumer (it reads model outputs) and a QualityGatekeeper (it blocks deployment on drift). This composability is what makes agents flexible enough to participate in different crew configurations.


The Composable Agent Pattern #

Not every task needs all 14 agents. The DIO selects a subset -- a "crew" -- based on the task requirements. Here are common crew compositions:

| Task | Crew | Size |
|---|---|---|
| Churn Prediction (Full Lifecycle) | Data-BA, ETL, DataScientist, Causal, DataTest, MLOps, Governance | 7 + DIO |
| Ad-Hoc Business Analysis | Analyst, Governance | 2 + DIO |
| Data Quality Audit | DataOps, DataTest, Governance | 3 + DIO |
| Platform Migration | Migration, Data Agent, Modeling, DataTest, Governance | 5 + DIO |
| Revenue Anomaly Investigation | Analyst, Causal, DataScientist | 3 + DIO |
| GDPR Compliance Audit | Governance, DataTest, Data-BA | 3 + DIO |
| Pipeline Failure Investigation | DataOps, Causal, ETL | 3 + DIO |
| Model Retraining | DataScientist, DataTest, MLOps | 3 + DIO |

The crew formation algorithm scores each agent on four dimensions: capability match (40%), cost efficiency (20%), infrastructure compatibility (20%), and historical performance (20%). Chapter 6 details this scoring system.
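
Those weights turn crew formation into a weighted sum. A hypothetical sketch with made-up per-agent scores; only the 40/20/20/20 split comes from the text:

```python
# Weights from the crew-formation description: capability match 40%,
# cost efficiency 20%, infrastructure compatibility 20%, history 20%.
WEIGHTS = {"capability": 0.4, "cost": 0.2, "infra": 0.2, "history": 0.2}

def crew_score(agent_features):
    """agent_features: dict of dimension -> score in [0, 1]."""
    return sum(WEIGHTS[d] * agent_features[d] for d in WEIGHTS)

candidates = {
    "DataScientist": {"capability": 0.9, "cost": 0.6, "infra": 0.8, "history": 0.85},
    "Analyst":       {"capability": 0.3, "cost": 0.9, "infra": 0.9, "history": 0.80},
}
ranked = sorted(candidates, key=lambda a: crew_score(candidates[a]), reverse=True)
```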

Try It

- Using the DataSims environment, run the GDPR compliance audit (Problem Statement 5). Notice how the DIO forms a crew of only 3 agents, skipping the DataScientist, MLOps, and ETL agents entirely. The crew is task-appropriate, not task-maximal.


Agent Interaction Patterns #

Agents do not interact freely. The DIO mediates all interactions through three patterns:

Pattern 1: Sequential Handoff

Used when output of one agent is input to the next (e.g., Requirements → Engineering → Modeling).

Diagram: Sequential Handoff Pattern

```mermaid
flowchart LR
  BA["Data-BA"] -->|"BRD"| ETL["ETL Agent"]
  ETL -->|"features"| DS["DataScientist"]
  DS -->|"model"| OUT["Output"]
```

Pattern 2: Parallel Execution

Used when tasks are independent and can run concurrently (e.g., Predictive + Causal + Descriptive analysis).

Diagram: Parallel Execution Pattern

```mermaid
flowchart LR
  DIO["DIO"] --> DS["DataScientist"]
  DIO --> CA["Causal Agent"]
  DIO --> AN["Analyst Agent"]
  DS --> M["model"]
  CA --> D["DAG"]
  AN --> R["report"]
```

Pattern 3: Gate-Blocked Progression

Used when a quality gate must pass before the next stage (e.g., Model validation before deployment).

Diagram: Gate-Blocked Progression Pattern

```mermaid
flowchart LR
  DS["DataScientist"] --> DT["DataTest"]
  DT -->|"PASS"| ML["MLOps"]
  DT -->|"FAIL"| DS
```
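
The gate-blocked pattern above amounts to a bounded retry loop: train, validate, deploy on PASS, escalate after too many failures. A toy sketch with invented stand-in callables:

```python
# Gate-blocked progression: only a PASS from the validator lets the
# artifact advance; repeated failures escalate instead of looping forever.
def gate_blocked_progress(train, validate, deploy, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        model = train(attempt)
        if validate(model):
            return deploy(model)
    return "escalated-to-human"

# Toy stand-ins: the model only clears the AUC gate on the second attempt.
aucs = [0.62, 0.81, 0.90]
outcome = gate_blocked_progress(
    train=lambda attempt: {"auc": aucs[attempt - 1]},
    validate=lambda m: m["auc"] >= 0.80,
    deploy=lambda m: f"deployed (auc={m['auc']:.2f})",
)
```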

Industry Perspective #

The 14-agent taxonomy maps to real organizational roles in data teams. A typical enterprise data organization has these roles, often filled by the same person wearing multiple hats:

| Agent | Traditional Role | Typical Headcount |
|---|---|---|
| Data-BA | Business Analyst | 1-2 per project |
| Data Agent + ETL | Data Engineer | 2-4 per team |
| Migration | Platform Engineer | 0-1 (project-based) |
| DataOps | DataOps / SRE | 1-2 per organization |
| Governance | Data Steward / DPO | 1-2 per organization |
| Modeling | Data Architect | 0-1 per organization |
| Analyst | Data Analyst | 2-5 per team |
| DataScientist | Data Scientist | 1-3 per team |
| Causal | (rarely exists) | 0 in most orgs |
| DataTest | QA Engineer | 0-1 per team |
| MLOps | ML Engineer | 1-2 per team |
| Deploy | DevOps | 1-2 shared |
| DIO | Project Manager | 1 per project |

Total headcount for a full-lifecycle data project: 12-25 people across an organization. The Intelligent Data Organization does not eliminate these roles -- it augments them. A team of 3-4 people can leverage 14 agents to cover the entire lifecycle without gaps.

The Causal Agent fills a notable gap. In the table above, most organizations have zero people dedicated to causal reasoning. Correlational analysis is the default. The Causal Agent brings Pearl's framework to every project, whether or not the organization has a causal inference specialist.


The Evidence #

DataSims ablation experiments systematically removed individual agents to measure their impact:

| Ablation | Agent(s) Removed | Impact on Full System |
|---|---|---|
| A1 | DIO (orchestrator) | Task completion: 100% → 45% |
| A2 | Data-BA (requirements) | Traceability: 95% → 22% |
| A3 | DataScientist (modeling) | No prediction capability |
| A4 | Causal (why analysis) | Root cause: "support_quality" → "unknown" |
| A5 | DataTest (validation) | Test coverage: 94% → 0%, silent failures |
| A6 | Agent.MD (domain knowledge) | AUC: 0.847 → 0.782 (-7.7%, p<0.01) |
| A7 | MLOps (production ops) | No drift detection, no deployment |
| A8 | RACI (accountability) | Traceability loss: 80% |

Every agent matters. Removing any single agent degrades the system measurably. This is the empirical foundation for the 14-agent architecture -- not theoretical elegance, but measured necessity.


Key Takeaways #

For Further Exploration #