Chapter 19 -- The DataTest Agent: The Independent Critic #
"Quality is not an act, it is a habit." -- Aristotle
25 min read | Sarah (MLOps), Priya (DE), Marcus (DS), David (VP) | Part V: Analytical Intelligence
What you'll learn:
- Why the DataTest Agent is architecturally independent -- it never validates its own work
- How test generation derives directly from acceptance criteria (Data-BA output)
- ETL testing: data quality checks, row count reconciliation, referential integrity
- Data warehouse testing: dimension/fact validation, SCD correctness, grain verification
- ML testing: model metrics, fairness audits, explainability validation
- API testing: latency benchmarks, throughput validation, error handling
- Quality gates: BLOCKING vs advisory thresholds and their deployment implications
- DataSims proof: ablation A3 drops test count to 0 and quality gate to "skipped"; ablation A7 shows gate "bypassed"
The Problem: Quis Custodiet Ipsos Custodes? #
"Who watches the watchmen?" This ancient question is the central design challenge of any testing system. If the data scientist writes the model and the tests, who catches the test that was accidentally written to always pass? If the ETL engineer builds the pipeline and the validation checks, who notices that the check tests for row count > 0 when it should test for row count within 5% of the source?
In traditional data teams, the answer is "code review" -- a human-powered process that scales poorly and catches errors inconsistently. A 2023 study by Thoughtworks found that code review catches only 60% of data quality issues and only 25% of logic errors in test assertions.
The DataTest Agent solves this with a principle borrowed from financial auditing: the entity being audited never audits itself. The DataTest Agent is architecturally independent. It receives acceptance criteria from the Data-BA Agent and artifacts (pipelines, models, APIs) from the DataScientist, ETL, and MLOps agents, and it generates tests that it -- and only it -- executes.
flowchart LR
A["Data-BA Agent\n(Day -1)"] -- "criteria" --> DT["DataTest Agent\n(Independent Critic)"]
DS["DS / ETL /\nMLOps Agents\n(Day 0-2)"] -- "artifacts" --> DT
DT -. "NEVER validates\nits own work" .-x DS
DT --> QG["Quality Gate\nPASS / FAIL"]
Key Insight: The DataTest Agent does not write the code it tests. It does not train the models it evaluates. It does not build the pipelines it validates. This separation of concerns is not a convenience -- it is a structural guarantee of objectivity.
Test Generation from Acceptance Criteria #
The DataTest Agent's primary input is the acceptance_criteria declaration produced by the Data-BA Agent (Chapter 16). Each acceptance criterion becomes one or more executable tests.
// Input: Acceptance criteria from Data-BA Agent
// Output: Executable test suite
datatest agent ChurnTester {
provider: "openai",
model: "gpt-4o",
budget: TestBudget
}
test_suite ChurnTests {
source_criteria: ChurnAcceptance, // from Data-BA Agent
test_categories: [
"etl_quality",
"dw_validation",
"ml_metrics",
"api_performance",
"compliance",
"explainability"
],
generation: {
method: "criteria_driven", // tests derive FROM criteria
coverage_target: 0.95,
edge_cases: true,
negative_tests: true // test what SHOULD fail
}
}
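The declaration above is purely declarative; the agent turns each criterion into one or more executable checks. A minimal Python sketch of that criteria-to-test mapping (all class and field names here are hypothetical, not part of the DSL) might look like:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical in-memory stand-ins for the Data-BA acceptance criteria
# and the executable tests the DataTest Agent derives from them.
@dataclass
class Criterion:
    id: str           # e.g. "AC-001"
    metric: str       # e.g. "auc_roc"
    threshold: float  # minimum acceptable value

@dataclass
class GeneratedTest:
    id: str
    traces_to: str                      # provenance back to the criterion
    assertion: Callable[[float], bool]  # executable pass/fail check

def generate_tests(criteria: list[Criterion]) -> list[GeneratedTest]:
    """Derive one executable test per acceptance criterion,
    preserving traceability via traces_to."""
    tests = []
    for i, c in enumerate(criteria, start=1):
        tests.append(GeneratedTest(
            id=f"ML-{i:03d}",
            traces_to=c.id,
            # bind the threshold now so each test is self-contained
            assertion=lambda value, t=c.threshold: value >= t,
        ))
    return tests

criteria = [Criterion("AC-001", "auc_roc", 0.80),
            Criterion("AC-001", "f1", 0.65)]
tests = generate_tests(criteria)
print(tests[0].traces_to)         # AC-001
print(tests[0].assertion(0.847))  # True
```

The important property is the `traces_to` field: every generated test carries a pointer back to the business criterion that justified it, which is what makes the audit trail from Chapter 16 to the quality gate unbroken.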
ETL Testing #
ETL tests validate that data moves correctly from source to target with expected transformations applied.
etl_tests PipelineValidation {
suite: ChurnTests,
tests: [
{
id: "ETL-001",
name: "Row count reconciliation",
type: "reconciliation",
source: { table: "simshop_oltp.customers",
query: "SELECT COUNT(*) FROM customers WHERE active = true" },
target: { table: "ml_features.customer_360",
query: "SELECT COUNT(*) FROM customer_360" },
assertion: "abs(source_count - target_count) / source_count < 0.02",
severity: "blocking"
},
{
id: "ETL-002",
name: "Null rate check",
type: "data_quality",
table: "ml_features.customer_360",
columns: ["tenure_days", "days_since_last_order",
"support_tickets_30d"],
assertion: "null_rate < 0.05",
severity: "blocking"
},
{
id: "ETL-003",
name: "Referential integrity",
type: "referential_integrity",
parent: { table: "simshop_oltp.customers", key: "customer_id" },
child: { table: "ml_features.customer_360", key: "customer_id" },
assertion: "orphan_count == 0",
severity: "blocking"
},
{
id: "ETL-004",
name: "Value range validation",
type: "data_quality",
table: "ml_features.customer_360",
checks: [
{ column: "tenure_days", assertion: "value >= 0" },
{ column: "support_tickets_30d", assertion: "value >= 0" },
{ column: "churn_probability",
assertion: "value >= 0.0 AND value <= 1.0" }
],
severity: "blocking"
},
{
id: "ETL-005",
name: "Freshness check",
type: "freshness",
table: "ml_features.customer_360",
column: "updated_at",
assertion: "max_age_hours < 24",
severity: "advisory"
}
]
}
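To make the semantics of the ETL-001 and ETL-002 assertions concrete, here is a minimal sketch against an in-memory SQLite stand-in for the real warehouse. Table and column names follow the suite above; the tolerances match the declared assertions (2% row drift, 5% null rate). This is an illustration of the check logic, not the agent's actual implementation.

```python
import sqlite3

# Tiny in-memory stand-in for the source OLTP table and the feature table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, active BOOLEAN);
    CREATE TABLE customer_360 (customer_id INTEGER, tenure_days INTEGER);
    INSERT INTO customers VALUES (1, 1), (2, 1), (3, 0);
    INSERT INTO customer_360 VALUES (1, 120), (2, NULL);
""")

def row_count_reconciliation(conn, tolerance=0.02):
    """ETL-001: target row count within 2% of active source rows."""
    src = conn.execute(
        "SELECT COUNT(*) FROM customers WHERE active = 1").fetchone()[0]
    tgt = conn.execute("SELECT COUNT(*) FROM customer_360").fetchone()[0]
    return abs(src - tgt) / src < tolerance

def null_rate_ok(conn, column, max_rate=0.05):
    """ETL-002: null rate for a feature column below 5%."""
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM({column} IS NULL) FROM customer_360"
    ).fetchone()
    return nulls / total < max_rate

print(row_count_reconciliation(conn))      # True  (2 active rows, 2 landed)
print(null_rate_ok(conn, "tenure_days"))   # False (1 of 2 rows is NULL)
```

Note that the null-rate check fails here by design: one NULL out of two rows is a 50% null rate, far above the 5% threshold, so ETL-002 would report a blocking failure.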
Data Warehouse Testing #
DW tests validate structural correctness: dimension/fact relationships, SCD (Slowly Changing Dimension) behavior, and grain consistency.
dw_tests WarehouseValidation {
suite: ChurnTests,
tests: [
{
id: "DW-001",
name: "Dimension uniqueness",
type: "dimension_validation",
table: "simshop_dw.dim_customer",
primary_key: "customer_sk",
natural_key: "customer_id",
assertion: "no_duplicate_natural_keys_for_current_records",
severity: "blocking"
},
{
id: "DW-002",
name: "SCD Type 2 correctness",
type: "scd_validation",
table: "simshop_dw.dim_customer",
scd_type: 2,
effective_date: "effective_from",
expiry_date: "effective_to",
current_flag: "is_current",
assertions: [
"exactly_one_current_record_per_natural_key",
"no_gaps_in_date_ranges",
"no_overlapping_date_ranges"
],
severity: "blocking"
},
{
id: "DW-003",
name: "Fact table grain",
type: "grain_validation",
table: "simshop_dw.fact_customer_activity",
grain: ["customer_sk", "activity_date"],
assertion: "no_duplicate_grain",
severity: "blocking"
},
{
id: "DW-004",
name: "Fact-dimension join integrity",
type: "referential_integrity",
fact: "simshop_dw.fact_customer_activity",
dimension: "simshop_dw.dim_customer",
join_key: "customer_sk",
assertion: "all_fact_keys_exist_in_dimension",
severity: "blocking"
}
]
}
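The DW-002 assertions are the subtlest in the suite, so a sketch helps. The following Python illustrates two of them, "exactly one current record per natural key" and "no overlapping date ranges", on a toy dimension table; column ordering and names are illustrative, mirroring dim_customer.

```python
from datetime import date

# Toy SCD Type 2 dimension: (customer_id, effective_from, effective_to, is_current)
dim_customer = [
    (101, date(2023, 1, 1), date(2024, 3, 1), False),
    (101, date(2024, 3, 1), date(9999, 12, 31), True),
    (102, date(2023, 6, 1), date(9999, 12, 31), True),
]

def one_current_per_key(rows):
    """DW-002: exactly one is_current record per natural key."""
    counts = {}
    for key, _, _, current in rows:
        counts[key] = counts.get(key, 0) + (1 if current else 0)
    return all(c == 1 for c in counts.values())

def no_overlaps(rows):
    """DW-002: effective-date ranges for a key never overlap."""
    by_key = {}
    for key, start, end, _ in rows:
        by_key.setdefault(key, []).append((start, end))
    for spans in by_key.values():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start < prev_end:   # next range starts before prior ends
                return False
    return True

print(one_current_per_key(dim_customer))  # True
print(no_overlaps(dim_customer))          # True
```

The "no gaps" assertion is the mirror image of `no_overlaps`: each range's start must equal the previous range's end exactly, not merely not precede it.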
ML Testing #
ML tests go beyond accuracy metrics. They validate fairness, explainability, and robustness.
ml_tests ModelValidation {
suite: ChurnTests,
tests: [
{
id: "ML-001",
name: "AUC-ROC threshold",
type: "metric_threshold",
model: "ChurnModel",
metric: "auc_roc",
assertion: "value >= 0.80",
severity: "blocking",
traces_to: "AC-001" // links to acceptance criteria
},
{
id: "ML-002",
name: "F1 Score threshold",
type: "metric_threshold",
model: "ChurnModel",
metric: "f1",
assertion: "value >= 0.65",
severity: "blocking",
traces_to: "AC-001"
},
{
id: "ML-003",
name: "Fairness audit",
type: "fairness",
model: "ChurnModel",
protected_attributes: ["gender", "age_group"],
metric: "demographic_parity_ratio",
assertion: "value >= 0.80",
severity: "blocking",
traces_to: "AC-004"
},
{
id: "ML-004",
name: "SHAP explainability",
type: "explainability",
model: "ChurnModel",
method: "shap",
assertion: "top_k_features >= 5 AND all_values_computed",
severity: "blocking",
traces_to: "AC-003"
},
{
id: "ML-005",
name: "PII exclusion verification",
type: "compliance",
model: "ChurnModel",
forbidden_features: ["email", "phone", "date_of_birth",
"first_name", "last_name"],
assertion: "none_present_in_feature_set",
severity: "blocking",
traces_to: "AC-004"
},
{
id: "ML-006",
name: "Model stability",
type: "stability",
model: "ChurnModel",
method: "bootstrap",
n_iterations: 100,
assertion: "auc_std < 0.03",
severity: "advisory"
}
]
}
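The demographic parity ratio used in ML-003 is simple to state: compute the positive-prediction rate for each protected group, then take the ratio of the lowest rate to the highest. A minimal sketch (the data here is invented for illustration):

```python
import numpy as np

def demographic_parity_ratio(predictions, groups):
    """min(group positive rate) / max(group positive rate)."""
    rates = [predictions[groups == g].mean() for g in np.unique(groups)]
    return min(rates) / max(rates)

preds = np.array([1, 0, 1, 1, 0, 1, 0, 1])            # model decisions
gender = np.array(["f", "f", "f", "f", "m", "m", "m", "m"])

ratio = demographic_parity_ratio(preds, gender)
print(round(ratio, 3))   # f rate 0.75, m rate 0.50 -> 0.667
print(ratio >= 0.80)     # False: the blocking test would FAIL
```

A ratio of 1.0 means the groups receive positive predictions at identical rates; the 0.80 threshold in ML-003 follows the widely used "four-fifths rule". In this toy example the model flags 75% of one group and 50% of the other, so the gate would block deployment.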
API Testing #
API tests validate the serving infrastructure before production deployment.
api_tests APIValidation {
suite: ChurnTests,
tests: [
{
id: "API-001",
name: "p99 latency",
type: "latency",
endpoint: "/v1/churn/predict",
method: "POST",
payload: { customer_id: "test_customer_001" },
concurrency: 100,
duration_seconds: 60,
assertion: "p99_ms < 200",
severity: "blocking",
traces_to: "AC-002"
},
{
id: "API-002",
name: "Throughput",
type: "throughput",
endpoint: "/v1/churn/predict",
assertion: "requests_per_second >= 500",
severity: "advisory"
},
{
id: "API-003",
name: "Error handling",
type: "error_handling",
scenarios: [
{ input: { customer_id: "nonexistent" },
expected_status: 404 },
{ input: { customer_id: null },
expected_status: 400 },
{ input: {},
expected_status: 400 }
],
severity: "blocking"
}
]
}
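The p99 assertion in API-001 reduces to a percentile computation over the per-request latencies collected during the load test. A sketch using the nearest-rank method, with deterministic stand-in latencies (a real run would collect these from the load-test harness):

```python
import math

# Stand-in latencies in ms for 6000 requests (100 concurrent x 60 s):
# uniform over 100..199 ms, deterministic for illustration.
latencies_ms = [100 + i % 100 for i in range(6000)]

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of
    observations at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

p99 = percentile(latencies_ms, 99)
print(p99)         # 198
print(p99 < 200)   # True: the blocking assertion p99_ms < 200 passes
```

The reason API-001 asserts on p99 rather than the mean is that tail latency is what users of a synchronous scoring endpoint actually experience under load; a healthy average can hide a pathological tail.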
Quality Gates: BLOCKING vs Advisory #
The quality_gate declaration is the enforcement mechanism. It determines whether the pipeline proceeds to deployment or halts.
quality_gate ChurnGate {
suite: ChurnTests,
blocking_rules: {
// ALL blocking tests must pass for the gate to open
required_pass_rate: 1.0,
categories: ["etl_quality", "ml_metrics", "compliance"]
},
advisory_rules: {
// Advisory failures are logged but do not block
min_pass_rate: 0.80,
categories: ["api_performance", "stability"]
},
actions: {
on_pass: {
notify: ["data-team@company.com"],
proceed_to: "deployment",
log: "Quality gate PASSED — proceeding to canary deployment"
},
on_fail: {
notify: ["data-team@company.com", "oncall@company.com"],
block_deployment: true,
create_incident: true,
log: "Quality gate FAILED — deployment BLOCKED"
},
on_advisory_fail: {
notify: ["data-team@company.com"],
proceed_to: "deployment",
log: "Quality gate PASSED with advisory warnings"
}
}
}
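The evaluation semantics declared above can be summarized in a few lines of Python: any blocking failure closes the gate; advisory failures only downgrade a clean pass to a pass-with-warnings. The status strings and field names in this sketch are illustrative, not the DSL's actual return values.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    test_id: str
    passed: bool
    severity: str  # "blocking" or "advisory"

def evaluate_gate(results):
    """Gate semantics: blocking failures close the gate outright;
    advisory failures are logged but do not block."""
    blocking_failures = [r for r in results
                         if r.severity == "blocking" and not r.passed]
    advisory_failures = [r for r in results
                         if r.severity == "advisory" and not r.passed]
    if blocking_failures:
        return "failed"              # block deploy, create incident
    if advisory_failures:
        return "passed_with_warnings"
    return "passed"

results = [
    TestResult("ML-001", True, "blocking"),
    TestResult("ETL-001", True, "blocking"),
    TestResult("ML-006", False, "advisory"),
]
print(evaluate_gate(results))  # passed_with_warnings
```

The asymmetry is deliberate: one blocking failure outweighs any number of advisory passes, which is exactly the required_pass_rate: 1.0 rule in the ChurnGate declaration.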
flowchart TD
A["Execute ALL tests in ChurnTests suite"] --> B{"Any BLOCKING\ntest failed?"}
B -- "Yes" --> C["GATE: FAILED\nBlock deploy\nCreate incident\nNotify oncall"]
B -- "No" --> D{"Any ADVISORY\ntest failed?"}
D -- "Yes" --> E["GATE: PASSED\nwith warnings"]
D -- "No" --> F["GATE: PASSED\n(clean)"]
Critical Distinction: BLOCKING tests are non-negotiable. If AUC is below 0.80, the model does not deploy. Period. Advisory tests flag concerns that should be investigated but do not prevent deployment. This distinction prevents two failure modes: (1) shipping bad models because "the deadline is tomorrow," and (2) blocking good models because of a minor advisory concern.
The Complete DataTest Agent Declaration #
// ═══ BUDGET ═══
budget TestBudget { cost: 50.00, tokens: 500000 }
// ═══ DATATEST AGENT ═══
datatest agent ChurnTester {
provider: "openai",
model: "gpt-4o",
budget: TestBudget
}
// ═══ EXECUTE TEST SUITE ═══
let etl_results = test_run(ChurnTester, PipelineValidation)
let dw_results = test_run(ChurnTester, WarehouseValidation)
let ml_results = test_run(ChurnTester, ModelValidation)
let api_results = test_run(ChurnTester, APIValidation)
// ═══ EVALUATE QUALITY GATE ═══
let gate_result = test_gate(ChurnTester, ChurnGate)
if gate_result.status == "passed" {
print("Quality gate PASSED: " + str(gate_result.summary))
// Proceed to MLOps deployment
} else {
print("Quality gate FAILED: " + str(gate_result.failures))
// Block deployment, create incident
}
Industry Perspective #
Data testing is the least glamorous and most impactful discipline in the data stack. According to Monte Carlo's 2024 State of Data Quality report, data quality issues cost organizations an average of $12.9 million per year. Yet only 34% of data teams have automated testing in their pipelines, and only 11% have quality gates that can block deployment.
The DataTest Agent addresses this gap by making testing a first-class agent rather than an afterthought. In traditional teams, testing is something the same engineers who built the system do in the last sprint before launch. The DataTest Agent inverts this: testing criteria are defined at Day -1 (by the Data-BA Agent), tests are generated and executed independently, and quality gates enforce pass/fail decisions before any artifact reaches production.
The Great Expectations, Soda, and dbt-test frameworks have pioneered data testing automation. The DataTest Agent builds on their concepts while adding three capabilities they lack: (1) test generation from acceptance criteria, (2) ML-specific testing (fairness, explainability), and (3) BLOCKING quality gates with incident creation.
Evidence: DataSims Experimental Proof #
Experiment: Ablation A3 -- System Without DataTest Agent #
Setup: The full SimShop churn prediction workflow was run 5 times with the DataTest Agent disabled (ablation no_test). All other agents remained active.
Results:
| Metric | Full System | Without DataTest | Delta |
|---|---|---|---|
| Total Tests | 47 | 0 | -100% |
| Test Coverage | 0.94 | 0 | -100% |
| Quality Gate | passed | skipped | Lost |
| Model AUC | 0.847 | 0.847 | No change |
| BRD Generated | true | true | No change |
| Root Cause | support_quality_degradation | support_quality_degradation | No change |
| Deploy Strategy | canary | canary | No change |
Experiment: Ablation A7 -- System Without Quality Gates #
Setup: The full workflow was run 5 times with quality gates disabled (ablation no_gates). Tests still execute, but gate enforcement is removed.
Results:
| Metric | Full System | Without Gates | Delta |
|---|---|---|---|
| Total Tests | 47 | 47 | No change |
| Test Coverage | 0.94 | 0.94 | No change |
| Quality Gate | passed | bypassed | Degraded |
| Model AUC | 0.847 | 0.847 | No change |
Analysis:
flowchart LR
subgraph A3["Without DataTest (A3)"]
direction LR
A3_BA["Data-BA"] --> A3_DS["DataScientist\nAUC = 0.847"]
A3_DS --> A3_NT["(no tests)\n0 tests\ngate: SKIPPED"]
A3_NT --> A3_ML["MLOps\ndeploys blind"]
end
subgraph A7["Without Gates (A7)"]
direction LR
A7_BA["Data-BA"] --> A7_DS["DataScientist\nAUC = 0.847"]
A7_DS --> A7_DT["DataTest\n47 tests\ngate: BYPASSED"]
A7_DT --> A7_ML["MLOps\ndeploys regardless"]
end
subgraph FS["Full System"]
direction LR
FS_BA["Data-BA"] --> FS_DS["DataScientist\nAUC = 0.847"]
FS_DS --> FS_DT["DataTest\n47 tests\ngate: PASSED"]
FS_DT --> FS_ML["MLOps\ndeploys only\nif gate passes"]
end
The three scenarios illustrate a critical distinction:
- Without DataTest (A3): No tests run at all. The quality gate is "skipped" because there is nothing to evaluate. The model deploys without any validation. In a regulated environment, this is an audit failure and a compliance violation.
- Without Gates (A7): Tests run and produce results, but the gate is "bypassed" -- failures are logged but do not block deployment. This is the "tests as documentation" anti-pattern: you know what is wrong but do nothing about it.
- Full System: Tests run, the gate evaluates pass/fail, and deployment proceeds only if blocking tests pass. This is the only configuration that provides actionable quality enforcement.
Key Finding: Testing without enforcement is theater. The DataTest Agent's value comes from the combination of independent test generation (from Data-BA criteria), comprehensive test coverage (47 tests across 6 categories), and BLOCKING quality gates that prevent bad artifacts from reaching production.
Reproducibility: 5/5 runs succeeded for both ablations. Full data at github.com/neam-lang/Data-Sims in evaluation/results/ablation_no_test.json and evaluation/results/ablation_no_gates.json.
Key Takeaways #
- The DataTest Agent is architecturally independent -- it never validates artifacts it created, ensuring objective evaluation
- Test generation derives directly from Data-BA acceptance criteria, creating full traceability from business need to test result
- Six test categories cover the complete data lifecycle: ETL quality, DW validation, ML metrics, API performance, compliance, and explainability
- Quality gates distinguish between BLOCKING tests (deployment stops) and advisory tests (warnings logged)
- The quality_gate declaration with on_fail: { block_deployment: true } is the single most important safety mechanism in the agent stack
- DataSims ablation A3 proves: removing the DataTest Agent drops test count to 0 and the quality gate to "skipped" -- the system deploys blind
- DataSims ablation A7 proves: removing gate enforcement results in "bypassed" status -- tests run but failures do not prevent deployment
- The combination of independent testing + blocking gates is what transforms testing from documentation theater into actionable quality enforcement