Chapter 19 -- The DataTest Agent: The Independent Critic #
"Quality is not an act, it is a habit." -- Aristotle
25 min read | Sarah (MLOps), Priya (DE), Marcus (DS), David (VP) | Part V: Analytical Intelligence
What you'll learn:
- Why the DataTest Agent is architecturally independent -- it never validates its own work
- How test generation derives directly from acceptance criteria (Data-BA output)
- ETL testing: data quality checks, row count reconciliation, referential integrity
- Data warehouse testing: dimension/fact validation, SCD correctness, grain verification
- ML testing: model metrics, fairness audits, explainability validation
- API testing: latency benchmarks, throughput validation, error handling
- Quality gates: BLOCKING vs advisory thresholds and their deployment implications
- DataSims proof: ablation A3 drops test count to 0 and quality gate to "skipped"; ablation A7 shows gate "bypassed"
The Problem: Quis Custodiet Ipsos Custodes? #
"Who watches the watchmen?" This ancient question is the central design challenge of any testing system. If the data scientist writes the model and the tests, who catches the test that was accidentally written to always pass? If the ETL engineer builds the pipeline and the validation checks, who notices that the check tests for row count > 0 when it should test for row count within 5% of the source?
In traditional data teams, the answer is "code review" -- a human-powered process that scales poorly and catches errors inconsistently. A 2023 study by Thoughtworks found that code review catches only 60% of data quality issues and only 25% of logic errors in test assertions.
The DataTest Agent solves this with a principle borrowed from financial auditing: the entity being audited never audits itself. The DataTest Agent is architecturally independent. It receives acceptance criteria from the Data-BA Agent and artifacts (pipelines, models, APIs) from the DataScientist, ETL, and MLOps agents, and it generates tests that it -- and only it -- executes.
flowchart LR
A["Data-BA Agent\n(Day -1)"] -- "criteria" --> DT["DataTest Agent\n(Independent Critic)"]
DS["DS / ETL /\nMLOps Agents\n(Day 0-2)"] -- "artifacts" --> DT
DT -. "NEVER validates\nits own work" .-x DS
DT --> QG["Quality Gate\nPASS / FAIL"]
Key Insight: The DataTest Agent does not write the code it tests. It does not train the models it evaluates. It does not build the pipelines it validates. This separation of concerns is not a convenience -- it is a structural guarantee of objectivity.
Test Generation from Acceptance Criteria #
The DataTest Agent's primary input is the acceptance_criteria declaration produced by the Data-BA Agent (Chapter 16). Each acceptance criterion becomes one or more executable tests.
// Input: Acceptance criteria from Data-BA Agent
// Output: Executable test suite
datatest agent ChurnTester {
provider: "openai",
model: "gpt-4o",
budget: TestBudget
}
test_suite ChurnTests {
source_criteria: ChurnAcceptance, // from Data-BA Agent
test_categories: [
"etl_quality",
"dw_validation",
"ml_metrics",
"api_performance",
"compliance",
"explainability"
],
generation: {
method: "criteria_driven", // tests derive FROM criteria
coverage_target: 0.95,
edge_cases: true,
negative_tests: true // test what SHOULD fail
}
}
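The declaration above is purely declarative; the agent turns each criterion into one or more executable checks. A minimal Python sketch of that criteria-to-test mapping (all class and field names here are hypothetical, not part of the DSL) might look like:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical in-memory stand-ins for the Data-BA acceptance criteria
# and the executable tests the DataTest Agent derives from them.
@dataclass
class Criterion:
    id: str           # e.g. "AC-001"
    metric: str       # e.g. "auc_roc"
    threshold: float  # minimum acceptable value

@dataclass
class GeneratedTest:
    id: str
    traces_to: str                      # provenance back to the criterion
    assertion: Callable[[float], bool]  # executable pass/fail check

def generate_tests(criteria: list[Criterion]) -> list[GeneratedTest]:
    """Derive one executable test per acceptance criterion,
    preserving traceability via traces_to."""
    tests = []
    for i, c in enumerate(criteria, start=1):
        tests.append(GeneratedTest(
            id=f"ML-{i:03d}",
            traces_to=c.id,
            # bind the threshold now so each test is self-contained
            assertion=lambda value, t=c.threshold: value >= t,
        ))
    return tests

criteria = [Criterion("AC-001", "auc_roc", 0.80),
            Criterion("AC-001", "f1", 0.65)]
tests = generate_tests(criteria)
print(tests[0].traces_to)         # AC-001
print(tests[0].assertion(0.847))  # True
```

The important property is the `traces_to` field: every generated test carries a pointer back to the business criterion that justified it, which is what makes the audit trail from Chapter 16 to the quality gate unbroken.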
ETL Testing #
ETL tests validate that data moves correctly from source to target with expected transformations applied.
etl_tests PipelineValidation {
suite: ChurnTests,
tests: [
{
id: "ETL-001",
name: "Row count reconciliation",
type: "reconciliation",
source: { table: "simshop_oltp.customers",
query: "SELECT COUNT(*) FROM customers WHERE active = true" },
target: { table: "ml_features.customer_360",
query: "SELECT COUNT(*) FROM customer_360" },
assertion: "abs(source_count - target_count) / source_count < 0.02",
severity: "blocking"
},
{
id: "ETL-002",
name: "Null rate check",
type: "data_quality",
table: "ml_features.customer_360",
columns: ["tenure_days", "days_since_last_order",
"support_tickets_30d"],
assertion: "null_rate < 0.05",
severity: "blocking"
},
{
id: "ETL-003",
name: "Referential integrity",
type: "referential_integrity",
parent: { table: "simshop_oltp.customers", key: "customer_id" },
child: { table: "ml_features.customer_360", key: "customer_id" },
assertion: "orphan_count == 0",
severity: "blocking"
},
{
id: "ETL-004",
name: "Value range validation",
type: "data_quality",
table: "ml_features.customer_360",
checks: [
{ column: "tenure_days", assertion: "value >= 0" },
{ column: "support_tickets_30d", assertion: "value >= 0" },
{ column: "churn_probability",
assertion: "value >= 0.0 AND value <= 1.0" }
],
severity: "blocking"
},
{
id: "ETL-005",
name: "Freshness check",
type: "freshness",
table: "ml_features.customer_360",
column: "updated_at",
assertion: "max_age_hours < 24",
severity: "advisory"
}
]
}
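To make the semantics of the ETL-001 and ETL-002 assertions concrete, here is a minimal sketch against an in-memory SQLite stand-in for the real warehouse. Table and column names follow the suite above; the tolerances match the declared assertions (2% row drift, 5% null rate). This is an illustration of the check logic, not the agent's actual implementation.

```python
import sqlite3

# Tiny in-memory stand-in for the source OLTP table and the feature table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, active BOOLEAN);
    CREATE TABLE customer_360 (customer_id INTEGER, tenure_days INTEGER);
    INSERT INTO customers VALUES (1, 1), (2, 1), (3, 0);
    INSERT INTO customer_360 VALUES (1, 120), (2, NULL);
""")

def row_count_reconciliation(conn, tolerance=0.02):
    """ETL-001: target row count within 2% of active source rows."""
    src = conn.execute(
        "SELECT COUNT(*) FROM customers WHERE active = 1").fetchone()[0]
    tgt = conn.execute("SELECT COUNT(*) FROM customer_360").fetchone()[0]
    return abs(src - tgt) / src < tolerance

def null_rate_ok(conn, column, max_rate=0.05):
    """ETL-002: null rate for a feature column below 5%."""
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM({column} IS NULL) FROM customer_360"
    ).fetchone()
    return nulls / total < max_rate

print(row_count_reconciliation(conn))      # True  (2 active rows, 2 landed)
print(null_rate_ok(conn, "tenure_days"))   # False (1 of 2 rows is NULL)
```

Note that the null-rate check fails here by design: one NULL out of two rows is a 50% null rate, far above the 5% threshold, so ETL-002 would report a blocking failure.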
Data Warehouse Testing #
DW tests validate structural correctness: dimension/fact relationships, SCD (Slowly Changing Dimension) behavior, and grain consistency.
dw_tests WarehouseValidation {
suite: ChurnTests,
tests: [
{
id: "DW-001",
name: "Dimension uniqueness",
type: "dimension_validation",
table: "simshop_dw.dim_customer",
primary_key: "customer_sk",
natural_key: "customer_id",
assertion: "no_duplicate_natural_keys_for_current_records",
severity: "blocking"
},
{
id: "DW-002",
name: "SCD Type 2 correctness",
type: "scd_validation",
table: "simshop_dw.dim_customer",
scd_type: 2,
effective_date: "effective_from",
expiry_date: "effective_to",
current_flag: "is_current",
assertions: [
"exactly_one_current_record_per_natural_key",
"no_gaps_in_date_ranges",
"no_overlapping_date_ranges"
],
severity: "blocking"
},
{
id: "DW-003",
name: "Fact table grain",
type: "grain_validation",
table: "simshop_dw.fact_customer_activity",
grain: ["customer_sk", "activity_date"],
assertion: "no_duplicate_grain",
severity: "blocking"
},
{
id: "DW-004",
name: "Fact-dimension join integrity",
type: "referential_integrity",
fact: "simshop_dw.fact_customer_activity",
dimension: "simshop_dw.dim_customer",
join_key: "customer_sk",
assertion: "all_fact_keys_exist_in_dimension",
severity: "blocking"
}
]
}
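The DW-002 assertions are the subtlest in the suite, so a sketch helps. The following Python illustrates two of them, "exactly one current record per natural key" and "no overlapping date ranges", on a toy dimension table; column ordering and names are illustrative, mirroring dim_customer.

```python
from datetime import date

# Toy SCD Type 2 dimension: (customer_id, effective_from, effective_to, is_current)
dim_customer = [
    (101, date(2023, 1, 1), date(2024, 3, 1), False),
    (101, date(2024, 3, 1), date(9999, 12, 31), True),
    (102, date(2023, 6, 1), date(9999, 12, 31), True),
]

def one_current_per_key(rows):
    """DW-002: exactly one is_current record per natural key."""
    counts = {}
    for key, _, _, current in rows:
        counts[key] = counts.get(key, 0) + (1 if current else 0)
    return all(c == 1 for c in counts.values())

def no_overlaps(rows):
    """DW-002: effective-date ranges for a key never overlap."""
    by_key = {}
    for key, start, end, _ in rows:
        by_key.setdefault(key, []).append((start, end))
    for spans in by_key.values():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start < prev_end:   # next range starts before prior ends
                return False
    return True

print(one_current_per_key(dim_customer))  # True
print(no_overlaps(dim_customer))          # True
```

The "no gaps" assertion is the mirror image of `no_overlaps`: each range's start must equal the previous range's end exactly, not merely not precede it.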
ML Testing #
ML tests go beyond accuracy metrics. They validate fairness, explainability, and robustness.
ml_tests ModelValidation {
suite: ChurnTests,
tests: [
{
id: "ML-001",
name: "AUC-ROC threshold",
type: "metric_threshold",
model: "ChurnModel",
metric: "auc_roc",
assertion: "value >= 0.80",
severity: "blocking",
traces_to: "AC-001" // links to acceptance criteria
},
{
id: "ML-002",
name: "F1 Score threshold",
type: "metric_threshold",
model: "ChurnModel",
metric: "f1",
assertion: "value >= 0.65",
severity: "blocking",
traces_to: "AC-001"
},
{
id: "ML-003",
name: "Fairness audit",
type: "fairness",
model: "ChurnModel",
protected_attributes: ["gender", "age_group"],
metric: "demographic_parity_ratio",
assertion: "value >= 0.80",
severity: "blocking",
traces_to: "AC-004"
},
{
id: "ML-004",
name: "SHAP explainability",
type: "explainability",
model: "ChurnModel",
method: "shap",
assertion: "top_k_features >= 5 AND all_values_computed",
severity: "blocking",
traces_to: "AC-003"
},
{
id: "ML-005",
name: "PII exclusion verification",
type: "compliance",
model: "ChurnModel",
forbidden_features: ["email", "phone", "date_of_birth",
"first_name", "last_name"],
assertion: "none_present_in_feature_set",
severity: "blocking",
traces_to: "AC-004"
},
{
id: "ML-006",
name: "Model stability",
type: "stability",
model: "ChurnModel",
method: "bootstrap",
n_iterations: 100,
assertion: "auc_std < 0.03",
severity: "advisory"
}
]
}
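The demographic parity ratio used in ML-003 is simple to state: compute the positive-prediction rate for each protected group, then take the ratio of the lowest rate to the highest. A minimal sketch (the data here is invented for illustration):

```python
import numpy as np

def demographic_parity_ratio(predictions, groups):
    """min(group positive rate) / max(group positive rate)."""
    rates = [predictions[groups == g].mean() for g in np.unique(groups)]
    return min(rates) / max(rates)

preds = np.array([1, 0, 1, 1, 0, 1, 0, 1])            # model decisions
gender = np.array(["f", "f", "f", "f", "m", "m", "m", "m"])

ratio = demographic_parity_ratio(preds, gender)
print(round(ratio, 3))   # f rate 0.75, m rate 0.50 -> 0.667
print(ratio >= 0.80)     # False: the blocking test would FAIL
```

A ratio of 1.0 means the groups receive positive predictions at identical rates; the 0.80 threshold in ML-003 follows the widely used "four-fifths rule". In this toy example the model flags 75% of one group and 50% of the other, so the gate would block deployment.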
API Testing #
API tests validate the serving infrastructure before production deployment.
api_tests APIValidation {
suite: ChurnTests,
tests: [
{
id: "API-001",
name: "p99 latency",
type: "latency",
endpoint: "/v1/churn/predict",
method: "POST",
payload: { customer_id: "test_customer_001" },
concurrency: 100,
duration_seconds: 60,
assertion: "p99_ms < 200",
severity: "blocking",
traces_to: "AC-002"
},
{
id: "API-002",
name: "Throughput",
type: "throughput",
endpoint: "/v1/churn/predict",
assertion: "requests_per_second >= 500",
severity: "advisory"
},
{
id: "API-003",
name: "Error handling",
type: "error_handling",
scenarios: [
{ input: { customer_id: "nonexistent" },
expected_status: 404 },
{ input: { customer_id: null },
expected_status: 400 },
{ input: {},
expected_status: 400 }
],
severity: "blocking"
}
]
}
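The p99 assertion in API-001 reduces to a percentile computation over the per-request latencies collected during the load test. A sketch using the nearest-rank method, with deterministic stand-in latencies (a real run would collect these from the load-test harness):

```python
import math

# Stand-in latencies in ms for 6000 requests (100 concurrent x 60 s):
# uniform over 100..199 ms, deterministic for illustration.
latencies_ms = [100 + i % 100 for i in range(6000)]

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of
    observations at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

p99 = percentile(latencies_ms, 99)
print(p99)         # 198
print(p99 < 200)   # True: the blocking assertion p99_ms < 200 passes
```

The reason API-001 asserts on p99 rather than the mean is that tail latency is what users of a synchronous scoring endpoint actually experience under load; a healthy average can hide a pathological tail.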
Quality Gates: BLOCKING vs Advisory #
The quality_gate declaration is the enforcement mechanism. It determines whether the pipeline proceeds to deployment or halts.
quality_gate ChurnGate {
suite: ChurnTests,
blocking_rules: {
// ALL blocking tests must pass for the gate to open
required_pass_rate: 1.0,
categories: ["etl_quality", "ml_metrics", "compliance"]
},
advisory_rules: {
// Advisory failures are logged but do not block
min_pass_rate: 0.80,
categories: ["api_performance", "stability"]
},
actions: {
on_pass: {
notify: ["data-team@company.com"],
proceed_to: "deployment",
log: "Quality gate PASSED — proceeding to canary deployment"
},
on_fail: {
notify: ["data-team@company.com", "oncall@company.com"],
block_deployment: true,
create_incident: true,
log: "Quality gate FAILED — deployment BLOCKED"
},
on_advisory_fail: {
notify: ["data-team@company.com"],
proceed_to: "deployment",
log: "Quality gate PASSED with advisory warnings"
}
}
}
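The evaluation semantics declared above can be summarized in a few lines of Python: any blocking failure closes the gate; advisory failures only downgrade a clean pass to a pass-with-warnings. The status strings and field names in this sketch are illustrative, not the DSL's actual return values.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    test_id: str
    passed: bool
    severity: str  # "blocking" or "advisory"

def evaluate_gate(results):
    """Gate semantics: blocking failures close the gate outright;
    advisory failures are logged but do not block."""
    blocking_failures = [r for r in results
                         if r.severity == "blocking" and not r.passed]
    advisory_failures = [r for r in results
                         if r.severity == "advisory" and not r.passed]
    if blocking_failures:
        return "failed"              # block deploy, create incident
    if advisory_failures:
        return "passed_with_warnings"
    return "passed"

results = [
    TestResult("ML-001", True, "blocking"),
    TestResult("ETL-001", True, "blocking"),
    TestResult("ML-006", False, "advisory"),
]
print(evaluate_gate(results))  # passed_with_warnings
```

The asymmetry is deliberate: one blocking failure outweighs any number of advisory passes, which is exactly the required_pass_rate: 1.0 rule in the ChurnGate declaration.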
flowchart TD
A["Execute ALL tests in ChurnTests suite"] --> B{"Any BLOCKING\ntest failed?"}
B -- "Yes" --> C["GATE: FAILED\nBlock deploy\nCreate incident\nNotify oncall"]
B -- "No" --> D{"Any ADVISORY\ntest failed?"}
D -- "Yes" --> E["GATE: PASSED\nwith warnings"]
D -- "No" --> F["GATE: PASSED\n(clean)"]
Critical Distinction: BLOCKING tests are non-negotiable. If AUC is below 0.80, the model does not deploy. Period. Advisory tests flag concerns that should be investigated but do not prevent deployment. This distinction prevents two failure modes: (1) shipping bad models because "the deadline is tomorrow," and (2) blocking good models because of a minor advisory concern.
The Complete DataTest Agent Declaration #
// ═══ BUDGET ═══
budget TestBudget { cost: 50.00, tokens: 500000 }
// ═══ DATATEST AGENT ═══
datatest agent ChurnTester {
provider: "openai",
model: "gpt-4o",
budget: TestBudget
}
// ═══ EXECUTE TEST SUITE ═══
let etl_results = test_run(ChurnTester, PipelineValidation)
let dw_results = test_run(ChurnTester, WarehouseValidation)
let ml_results = test_run(ChurnTester, ModelValidation)
let api_results = test_run(ChurnTester, APIValidation)
// ═══ EVALUATE QUALITY GATE ═══
let gate_result = test_gate(ChurnTester, ChurnGate)
if gate_result.status == "passed" {
print("Quality gate PASSED: " + str(gate_result.summary))
// Proceed to MLOps deployment
} else {
print("Quality gate FAILED: " + str(gate_result.failures))
// Block deployment, create incident
}
Industry Perspective #
Data testing is the least glamorous and most impactful discipline in the data stack. According to Monte Carlo's 2024 State of Data Quality report, data quality issues cost organizations an average of $12.9 million per year. Yet only 34% of data teams have automated testing in their pipelines, and only 11% have quality gates that can block deployment.
The DataTest Agent addresses this gap by making testing a first-class agent rather than an afterthought. In traditional teams, testing is something the same engineers who built the system do in the last sprint before launch. The DataTest Agent inverts this: testing criteria are defined at Day -1 (by the Data-BA Agent), tests are generated and executed independently, and quality gates enforce pass/fail decisions before any artifact reaches production.
The Great Expectations, Soda, and dbt-test frameworks have pioneered data testing automation. The DataTest Agent builds on their concepts while adding three capabilities they lack: (1) test generation from acceptance criteria, (2) ML-specific testing (fairness, explainability), and (3) BLOCKING quality gates with incident creation.
Evidence: DataSims Experimental Proof #
Experiment: Ablation A3 -- System Without DataTest Agent #
Setup: The full SimShop churn prediction workflow was run 5 times with the DataTest Agent disabled (ablation no_test). All other agents remained active.
Results:
| Metric | Full System | Without DataTest | Delta |
|---|---|---|---|
| Total Tests | 47 | 0 | -100% |
| Test Coverage | 0.94 | 0 | -100% |
| Quality Gate | passed | skipped | Lost |
| Model AUC | 0.847 | 0.847 | No change |
| BRD Generated | true | true | No change |
| Root Cause | support_quality_degradation | support_quality_degradation | No change |
| Deploy Strategy | canary | canary | No change |
Experiment: Ablation A7 -- System Without Quality Gates #
Setup: The full workflow was run 5 times with quality gates disabled (ablation no_gates). Tests still execute, but gate enforcement is removed.
Results:
| Metric | Full System | Without Gates | Delta |
|---|---|---|---|
| Total Tests | 47 | 47 | No change |
| Test Coverage | 0.94 | 0.94 | No change |
| Quality Gate | passed | bypassed | Degraded |
| Model AUC | 0.847 | 0.847 | No change |
Analysis:
flowchart LR
subgraph A3["Without DataTest (A3)"]
direction LR
A3_BA["Data-BA"] --> A3_DS["DataScientist\nAUC = 0.847"]
A3_DS --> A3_NT["(no tests)\n0 tests\ngate: SKIPPED"]
A3_NT --> A3_ML["MLOps\ndeploys blind"]
end
subgraph A7["Without Gates (A7)"]
direction LR
A7_BA["Data-BA"] --> A7_DS["DataScientist\nAUC = 0.847"]
A7_DS --> A7_DT["DataTest\n47 tests\ngate: BYPASSED"]
A7_DT --> A7_ML["MLOps\ndeploys regardless"]
end
subgraph FS["Full System"]
direction LR
FS_BA["Data-BA"] --> FS_DS["DataScientist\nAUC = 0.847"]
FS_DS --> FS_DT["DataTest\n47 tests\ngate: PASSED"]
FS_DT --> FS_ML["MLOps\ndeploys only\nif gate passes"]
end
The three scenarios illustrate a critical distinction:
- Without DataTest (A3): No tests run at all. The quality gate is "skipped" because there is nothing to evaluate. The model deploys without any validation. In a regulated environment, this is an audit failure and a compliance violation.
- Without Gates (A7): Tests run and produce results, but the gate is "bypassed" -- failures are logged but do not block deployment. This is the "tests as documentation" anti-pattern: you know what is wrong but do nothing about it.
- Full System: Tests run, the gate evaluates pass/fail, and deployment proceeds only if blocking tests pass. This is the only configuration that provides actionable quality enforcement.
Key Finding: Testing without enforcement is theater. The DataTest Agent's value comes from the combination of independent test generation (from Data-BA criteria), comprehensive test coverage (47 tests across 6 categories), and BLOCKING quality gates that prevent bad artifacts from reaching production.
Reproducibility: 5/5 runs succeeded for both ablations. Full data at github.com/neam-lang/Data-Sims in evaluation/results/ablation_no_test.json and evaluation/results/ablation_no_gates.json.
Key Takeaways #
- The DataTest Agent is architecturally independent -- it never validates artifacts it created, ensuring objective evaluation
- Test generation derives directly from Data-BA acceptance criteria, creating full traceability from business need to test result
- Six test categories cover the complete data lifecycle: ETL quality, DW validation, ML metrics, API performance, compliance, and explainability
- Quality gates distinguish between BLOCKING tests (deployment stops) and advisory tests (warnings logged)
- The quality_gate declaration with on_fail: { block_deployment: true } is the single most important safety mechanism in the agent stack
- DataSims ablation A3 proves: removing the DataTest Agent drops test count to 0 and the quality gate to "skipped" -- the system deploys blind
- DataSims ablation A7 proves: removing gate enforcement results in "bypassed" status -- tests run but failures do not prevent deployment
- The combination of independent testing + blocking gates is what transforms testing from documentation theater into actionable quality enforcement