Chapter 16 -- The Data-BA Agent: Requirements Intelligence #
"If I had an hour to solve a problem, I'd spend 55 minutes thinking about the problem and 5 minutes thinking about solutions." -- attributed to Albert Einstein
25 min read | Raj (BA), David (VP), Priya (DE), Marcus (DS) | Part V: Analytical Intelligence
What you'll learn:
- Why the Data-BA Agent is always the first agent invoked -- "Day -1" of every project
- How LLM-assisted requirements elicitation eliminates the telephone game between stakeholders
- BRD generation that produces machine-readable specifications, not shelf-ware documents
- Given/When/Then acceptance criteria that downstream agents (DataTest, DataScientist) consume directly
- Source-to-target data mapping and traceability matrices
- Impact analysis aligned with BABOK v3 knowledge areas
- DataSims proof: removing Data-BA drops documentation score to 0.12 (88% decrease)
The Problem: Requirements That Nobody Reads #
Raj has been a business analyst for twelve years. He has written hundreds of Business Requirements Documents. He estimates that fewer than 20% of them were ever read in full by the engineering teams that implemented them.
This is not because Raj writes poor documents. It is because the traditional BA workflow is fundamentally disconnected from the implementation workflow. Raj writes a 40-page BRD in Confluence. Engineers skim the executive summary, open a Jira epic, and start building what they think was requested. Six weeks later, the UAT session reveals that the model predicts customer lifetime value when the business actually asked for churn propensity. The requirements were clear -- they just never made it into the code.
The gap is not a people problem. It is a format problem. Requirements written in natural language for human consumption cannot be directly consumed by automated systems. Every translation step -- BRD to Jira ticket, Jira ticket to data model, data model to pipeline code -- is a lossy compression of intent.
The Data-BA Agent eliminates the gap by producing structured, machine-readable specifications that downstream agents consume directly. No translation. No interpretation. No loss.
Why "Day -1" Matters #
Every project has a Day 0 -- the day engineering begins. Most projects also have a Day 2 -- when operations take over. But the most important day is Day -1: the day before building starts, when someone decides what to build and why.
```mermaid
flowchart LR
    A["Day -1<br/>Data-BA Agent<br/>WHAT & WHY"] -->|"specs"| B["Day 0<br/>DS / ETL / Causal / Data<br/>HOW"]
    B -->|"models"| C["Day 1<br/>DataTest Agent<br/>VALIDATE"]
    C -->|"gate"| D["Day 2+<br/>MLOps / DataOps<br/>KEEP HEALTHY"]
    D -->|"production metrics,<br/>drift alerts,<br/>change requests"| A
```
Skip Day -1 and you build faster but build the wrong thing. The Data-BA Agent ensures that Day -1 is not skipped, not abbreviated, and not performed in an unstructured ad hoc conversation over Slack.
Key Insight: The Data-BA Agent is the only agent in the Neam ecosystem that is always invoked first. The DIO orchestrator will not proceed to Day 0 phases without validated requirements output from the Data-BA Agent.
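The gating behavior can be sketched in a few lines of Python. This is an illustration only -- the function name, artifact keys, and error handling below are hypothetical, not the actual DIO orchestrator API:

```python
# Illustrative "Day -1" gate: the orchestrator refuses to start Day 0
# engineering phases until a complete, validated Data-BA bundle exists.
# All names here are hypothetical stand-ins for the real DIO interface.

REQUIRED_ARTIFACTS = ("brd", "acceptance_criteria", "traceability")

def day_minus_one_gate(requirements):
    """Raise if the Data-BA output is missing, incomplete, or unvalidated."""
    if requirements is None:
        raise RuntimeError("Day 0 blocked: no Data-BA requirements output")
    missing = [k for k in REQUIRED_ARTIFACTS if not requirements.get(k)]
    if missing:
        raise RuntimeError(f"Day 0 blocked: missing artifacts {missing}")
    if not requirements.get("validated"):
        raise RuntimeError("Day 0 blocked: requirements not validated")

# Only a complete, validated bundle passes the gate.
ok = {"brd": "...", "acceptance_criteria": ["AC-001"],
      "traceability": ["REQ-001"], "validated": True}
day_minus_one_gate(ok)  # passes silently
```

The point of the sketch is that the gate is structural, not advisory: downstream phases simply cannot start without the Day -1 artifacts.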
LLM-Assisted Requirements Elicitation #
Traditional elicitation methods -- interviews, workshops, document analysis -- are effective but slow. The Data-BA Agent augments them with LLM-assisted interview generation, gap detection, and conflict identification.
// ═══ Requirements Elicitation ═══
requirements_elicitation ChurnElicitation {
stakeholders: {
primary: [
{ name: "VP Customer Success", role: "business_sponsor",
domain: "retention" },
{ name: "Data Engineering Lead", role: "technical_lead",
domain: "data_pipelines" }
],
secondary: ["Product Manager", "Compliance Officer"]
},
methods: {
interviews: {
model: "gpt-4o",
question_generation: "auto",
gap_detection: true
},
document_analysis: {
sources: [
"./docs/business_case.pdf",
"./docs/current_process.md"
]
},
existing_system_analysis: {
agents: ["CustomerAnalyst", "ChurnDS"]
}
},
output: {
format: "structured_requirements",
include_gaps: true,
review_required: true
}
}
The gap_detection: true flag tells the LLM to identify questions that should have been asked but were not. In our SimShop experiment, the elicitation phase identified 3 gaps that the original problem statement omitted: GDPR constraints on PII columns, SLA requirements for the scoring API, and the need for per-customer explanations (SHAP).
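A deterministic stand-in for gap detection can make the idea concrete. The real agent uses an LLM to reason about coverage; the checklist and topic names below are purely illustrative:

```python
# Checklist-style stand-in for gap_detection: compare the topics the
# elicitation actually covered against a required-coverage checklist.
# (The real agent reasons with an LLM; this checklist is illustrative.)

REQUIRED_TOPICS = {
    "success_metric", "prediction_window", "pii_constraints",
    "serving_sla", "explainability", "budget",
}

def detect_gaps(covered_topics):
    """Return checklist topics no stakeholder question touched."""
    return REQUIRED_TOPICS - set(covered_topics)

covered = {"success_metric", "prediction_window", "budget"}
gaps = detect_gaps(covered)
# The three missing topics (pii_constraints, serving_sla, explainability)
# mirror the kinds of omissions the SimShop run surfaced: GDPR, SLA, SHAP.
```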
BRD Generation #
The brd_generator declaration produces a structured Business Requirements Document that is simultaneously human-readable and machine-parseable.
brd_generator ChurnBRD {
project: {
name: "Enterprise Customer Churn Prediction",
id: "PROJ-2026-042"
},
objectives: {
primary: "Reduce enterprise churn from 14% to 8% within 6 months",
secondary: [
"Identify top 5 churn drivers per customer",
"Enable proactive retention outreach"
],
kpis: ["churn_rate", "net_revenue_retention", "intervention_roi"]
},
scope: {
in_scope: [
"Enterprise customers (>1000 employees)",
"Binary churn prediction model (AUC > 0.80)",
"Weekly batch + real-time scoring API (<200ms p99)",
"CRM integration for score delivery"
],
out_of_scope: [
"SMB customers", "Self-service portal UI"
]
},
constraints: [
"GDPR/SOC2 compliance required",
"Monthly compute budget: $500",
"8-week MVP timeline"
],
risks: [
{ risk: "Data quality in legacy CRM",
impact: "high",
mitigation: "Data profiling + imputation pipeline" },
{ risk: "Model accuracy below target",
impact: "medium",
mitigation: "AutoML + champion-challenger evaluation" }
],
output: {
format: "markdown",
sections: ["executive_summary", "objectives", "scope",
"benefits", "constraints", "assumptions", "risks"]
}
}
Critical: The BRD is not a final deliverable that gathers dust. It is a living input consumed by the DataScientist agent (for problem framing), the DataTest agent (for acceptance criteria), and the DIO orchestrator (for phase planning). When the BRD changes, downstream artifacts automatically regenerate.
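One minimal way to implement "downstream artifacts regenerate when the BRD changes" is content fingerprinting. The function names below are hypothetical; the hashing approach is a sketch of the mechanism, not the actual implementation:

```python
# Sketch: fingerprint the BRD content and treat any downstream artifact
# built from a different fingerprint as stale and due for regeneration.
import hashlib

def brd_fingerprint(brd_text):
    return hashlib.sha256(brd_text.encode("utf-8")).hexdigest()

def needs_regeneration(brd_text, artifact_built_from):
    """True when a downstream artifact was built from a stale BRD."""
    return brd_fingerprint(brd_text) != artifact_built_from

v1 = "objective: reduce enterprise churn from 14% to 8%"
stamp = brd_fingerprint(v1)            # recorded when artifacts were built
assert not needs_regeneration(v1, stamp)             # artifacts current
assert needs_regeneration(v1 + "\nadd SLA", stamp)   # BRD changed
```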
Given/When/Then Acceptance Criteria #
The acceptance_criteria declaration generates testable conditions in Gherkin-style Given/When/Then format. These criteria become the input to the DataTest Agent (Chapter 19).
acceptance_criteria ChurnAcceptance {
source: ChurnBRD,
criteria: [
{
id: "AC-001",
category: "model_performance",
given: "A holdout test set of 20,000 customers",
when: "The churn model generates predictions",
then: "AUC-ROC >= 0.80 AND false_positive_rate < 0.30"
},
{
id: "AC-002",
category: "api_performance",
given: "The scoring API is deployed to production",
when: "100 concurrent requests are submitted",
then: "p99 latency < 200ms AND availability >= 99.9%"
},
{
id: "AC-003",
category: "explainability",
given: "Any individual customer prediction",
when: "SHAP values are computed",
then: "Top 5 contributing features are returned with direction and magnitude"
},
{
id: "AC-004",
category: "compliance",
given: "The feature pipeline is executed",
when: "Features are generated from raw data",
then: "No PII columns (email, phone, DOB, name) appear as direct features"
},
{
id: "AC-005",
category: "traceability",
given: "Any model prediction in production",
when: "An auditor requests lineage",
then: "Full trace from business requirement to test result is available"
}
]
}
The 12 acceptance criteria generated in the full DataSims experiment covered: model performance (3), API performance (2), explainability (2), compliance (2), data quality (2), and monitoring (1).
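Because the `then` clauses are structured expressions rather than free prose, a DataTest-style consumer can evaluate them mechanically. The mini-grammar assumed below (metric, comparison operator, threshold, joined by `AND`) is an illustration of the idea, not the actual DataTest parser:

```python
# Sketch: evaluate a "then" clause such as
# "AUC-ROC >= 0.80 AND false_positive_rate < 0.30"
# against measured metrics. Grammar assumed: <metric> <op> <number>[%],
# conditions joined by " AND ". Illustrative only.
import operator
import re

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def check_then(clause, metrics):
    for cond in clause.split(" AND "):
        m = re.fullmatch(r"(\S+)\s*(>=|<=|>|<)\s*([\d.]+%?)", cond.strip())
        if not m:
            raise ValueError(f"unparseable condition: {cond!r}")
        name, op, raw = m.groups()
        # "99.9%" is normalized to the fraction 0.999
        value = float(raw.rstrip("%")) / (100 if raw.endswith("%") else 1)
        if not OPS[op](metrics[name], value):
            return False
    return True

metrics = {"AUC-ROC": 0.847, "false_positive_rate": 0.22}
assert check_then("AUC-ROC >= 0.80 AND false_positive_rate < 0.30", metrics)
```

This is what "machine-readable" buys: AC-001 is not a sentence an engineer interprets, it is a predicate a test harness executes.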
Source-to-Target Data Mapping #
The data_requirements declaration maps business concepts to physical data sources. This is the bridge between "we need churn prediction" and "which tables, which columns, which joins."
data_requirements ChurnDataMap {
source: ChurnBRD,
mappings: [
{
business_concept: "customer_tenure",
source_table: "simshop_oltp.customers",
source_column: "created_at",
transformation: "DATEDIFF(day, created_at, CURRENT_DATE)",
target_table: "ml_features.customer_360",
target_column: "tenure_days",
data_type: "integer"
},
{
business_concept: "recent_support_load",
source_table: "simshop_oltp.support_tickets",
source_column: "created_at, status",
transformation: "COUNT(*) WHERE created_at > NOW() - INTERVAL '30 days'",
target_table: "ml_features.customer_360",
target_column: "support_tickets_30d",
data_type: "integer"
},
{
business_concept: "purchase_recency",
source_table: "simshop_oltp.orders",
source_column: "order_date",
transformation: "DATEDIFF(day, MAX(order_date), CURRENT_DATE)",
target_table: "ml_features.customer_360",
target_column: "days_since_last_order",
data_type: "integer"
}
],
data_sources_identified: 4,
schemas: ["simshop_oltp", "simshop_staging", "simshop_dw", "ml_features"]
}
BABOK v3 Alignment: Source-to-target mapping corresponds to BABOK Knowledge Area 5 (Requirements Analysis and Design Definition) and Technique 10.15 (Data Modelling). The Data-BA Agent automates what traditionally requires weeks of analyst effort and multiple review cycles.
Traceability Matrix #
The traceability matrix is the spine of accountability. It connects every business need to its implementation and its test.
| Business Need | Requirement | Implementation | Test |
|---|---|---|---|
| Predict churn (90-day window) | REQ-001 | ChurnDS.ml_experiment, XGBoost classifier | AC-001 (AUC), AC-002 (API) |
| Explain drivers (per customer) | REQ-002 | SHAP explainer, feature importance | AC-003 (top 5) |
| GDPR compliance | REQ-003 | PII exclusion filter, governance policy | AC-004 (no PII) |
| Production monitoring (drift detection) | REQ-004 | drift_monitor config, hourly checks | AC-005 (trace) |
traceability_matrix ChurnTraceability {
source_brd: ChurnBRD,
entries: [
{
business_need: "Predict customer churn within 90-day window",
requirement_id: "REQ-001",
implementation: "ChurnDS.ml_experiment(XGBoost)",
test_ids: ["AC-001", "AC-002"],
status: "implemented"
},
{
business_need: "Explain churn drivers per customer",
requirement_id: "REQ-002",
implementation: "ChurnDS.explainability(SHAP)",
test_ids: ["AC-003"],
status: "implemented"
},
{
business_need: "GDPR compliance for feature pipeline",
requirement_id: "REQ-003",
implementation: "GovernanceAgent.pii_filter",
test_ids: ["AC-004"],
status: "implemented"
},
{
business_need: "Production drift monitoring",
requirement_id: "REQ-004",
implementation: "ChurnMLOps.drift_monitor",
test_ids: ["AC-005"],
status: "implemented"
}
]
}
Impact Analysis #
When requirements change -- and they always change -- the Data-BA Agent performs upstream and downstream impact analysis to identify every artifact affected.
impact_analysis ChurnImpact {
change: "Add real-time scoring (was batch-only)",
upstream_impacts: [
{ artifact: "ETL pipeline", impact: "Must add streaming ingestion",
severity: "high" },
{ artifact: "Feature store", impact: "Must support online features",
severity: "high" }
],
downstream_impacts: [
{ artifact: "API deployment", impact: "New serving infrastructure",
severity: "high" },
{ artifact: "Monitoring", impact: "Latency SLA tracking added",
severity: "medium" },
{ artifact: "Test suite", impact: "Add load testing scenarios",
severity: "medium" }
],
affected_agents: ["ETLAgent", "DataScientist", "MLOps", "DataTest"],
estimated_effort: "2 additional sprints"
}
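Under the hood, impact analysis is reachability over an artifact dependency graph: start at the changed artifact and walk to everything downstream. The graph below is an illustrative subset of the churn project, not the agent's actual data model:

```python
# Sketch: impact analysis as breadth-first reachability. Edges point
# from an artifact to the artifacts that consume it (illustrative subset).
from collections import deque

DEPENDS_ON_ME = {
    "scoring_requirement": ["etl_pipeline", "feature_store"],
    "etl_pipeline": ["feature_store"],
    "feature_store": ["api_deployment"],
    "api_deployment": ["monitoring", "test_suite"],
}

def affected_artifacts(changed):
    """Walk to every artifact downstream of a changed artifact."""
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for consumer in DEPENDS_ON_ME.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

affected = affected_artifacts("scoring_requirement")
# {'etl_pipeline', 'feature_store', 'api_deployment', 'monitoring', 'test_suite'}
```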
BABOK v3 Knowledge Areas Covered: The Data-BA Agent maps to 5 of 6 BABOK knowledge areas: Business Analysis Planning and Monitoring (KA1), Elicitation and Collaboration (KA2), Requirements Life Cycle Management (KA3), Requirements Analysis and Design Definition (KA5), and Solution Evaluation (KA6). Only Strategy Analysis (KA4) remains a human-only activity.
The Complete Data-BA Agent Declaration #
// ═══ BUDGET ═══
budget BABudget { cost: 50.00, tokens: 500000 }
// ═══ DATA-BA AGENT ═══
databa agent ChurnBA {
provider: "openai",
model: "gpt-4o",
temperature: 0.3,
agent_md: "./agents/simshop_ba.agent.md",
budget: BABudget
}
// ═══ INVOKE ═══
let requirements = ba_elicit(ChurnBA, ChurnElicitation)
let brd = ba_generate_brd(ChurnBA, ChurnBRD)
let criteria = ba_generate_criteria(ChurnBA, ChurnAcceptance)
let mapping = ba_map_sources(ChurnBA, ChurnDataMap)
let matrix = ba_trace(ChurnBA, ChurnTraceability)
// Pass structured output to downstream agents
print(requirements)
print(brd)
print(criteria)
Industry Perspective #
The role of business analyst has been under pressure for a decade. Agile methodologies reduced the perceived need for detailed upfront requirements, and "move fast and break things" culture deprioritized specification work. The result, according to the Standish Group's CHAOS Report (2023), is that 31% of projects are canceled before completion and 52% are "challenged," running an average of 189% of their original cost estimates -- with "incomplete requirements" cited as the number one cause of failure.
The Data-BA Agent does not replace the business analyst. It amplifies them. Raj still makes the judgment calls -- which stakeholders to involve, which constraints to prioritize, which risks to accept. But the Data-BA Agent handles the mechanical work: generating structured documents, detecting gaps, maintaining traceability, and producing output in a format that downstream agents can consume without human translation.
Organizations adopting BABOK v3 practices report 40% fewer rework cycles (IIBA, 2024). The Data-BA Agent encodes those practices as executable specifications.
Evidence: DataSims Experimental Proof #
Experiment: Ablation A1 -- System Without Data-BA Agent #
Setup: The full SimShop churn prediction workflow was run 5 times with the Data-BA Agent disabled (ablation no_data_ba). All other agents remained active.
Results:
| Metric | Full System | Without Data-BA | Delta |
|---|---|---|---|
| BRD Generated | true | false | Lost |
| Acceptance Criteria | 12 | 0 | -100% |
| Documentation Score | 1.00 | 0.12 | -88% |
| Model AUC | 0.847 | 0.847 | No change |
| Test Coverage | 0.94 | 0.94 | No change |
| Quality Gate | passed | passed | No change |
Analysis: The model still trains successfully because the DataScientist agent can operate autonomously on the problem statement. But the organizational outputs collapse:
- Zero acceptance criteria means the DataTest Agent has no formal specification to test against. Tests still run (because they are pre-configured), but there is no traceability from business need to test result.
- No BRD means no formal scope definition, no risk register, no constraint documentation. In a regulated environment, this is an audit failure.
- Documentation score drops to 0.12 (88% decrease), indicating that downstream agents receive almost no structured context about why the work is being done.
```mermaid
flowchart LR
    subgraph WITHOUT["WITHOUT DATA-BA"]
        direction LR
        A1["Business Need"] -.->|"no spec<br/>no mapping"| B1["Implementation"]
        B1 -.->|"no criteria<br/>no traceability"| C1["Test"]
    end
    subgraph WITH["WITH DATA-BA"]
        direction LR
        A2["Business Need"] --> B2["Requirement"]
        B2 --> C2["Implementation"]
        C2 --> D2["Test"]
        A2 --- TM["Traceability Matrix"]
        B2 --- TM
        C2 --- TM
        D2 --- TM
    end
```
Key Finding: The Data-BA Agent does not improve model accuracy. It improves organizational readiness. Without it, the system produces a good model that nobody can audit, nobody can trace, and nobody can formally validate against business intent.
Reproducibility: 5/5 runs succeeded. Results are deterministic. Full data available at github.com/neam-lang/Data-Sims in evaluation/results/ablation_no_ba.json.
Key Takeaways #
- The Data-BA Agent operates at "Day -1" -- before any engineering begins. It is always the first agent the DIO orchestrator invokes
- Requirements elicitation uses LLM-assisted gap detection to identify questions that stakeholders did not think to ask
- BRD generation produces structured, machine-readable specifications that downstream agents consume directly -- no lossy human translation
- Given/When/Then acceptance criteria flow directly to the DataTest Agent for automated validation
- Source-to-target data mapping bridges the gap between business concepts and physical data sources
- Traceability matrices connect every business need to its implementation and test, enabling full auditability
- BABOK v3 alignment covers 5 of 6 knowledge areas, encoding industry best practices as executable specifications
- DataSims ablation A1 proves that removing the Data-BA Agent drops documentation score by 88% and eliminates all acceptance criteria, while model accuracy is unaffected -- confirming that the agent's value is organizational, not algorithmic