Chapter 19 -- The DataTest Agent: The Independent Critic #

"Quality is not an act, it is a habit." -- Aristotle


25 min read | Sarah (MLOps), Priya (DE), Marcus (DS), David (VP) | Part V: Analytical Intelligence

What you'll learn:

- Why test independence matters: the entity being audited never audits itself
- How the DataTest Agent generates executable tests from Data-BA acceptance criteria
- The six test categories: ETL quality, DW validation, ML metrics, API performance, compliance, and explainability
- How BLOCKING and advisory quality gates decide whether deployment proceeds
- Experimental evidence for what breaks when tests, or gate enforcement, are removed

The Problem: Quis Custodiet Ipsos Custodes? #

"Who watches the watchmen?" This ancient question is the central design challenge of any testing system. If the data scientist writes the model and the tests, who catches the test that was accidentally written to always pass? If the ETL engineer builds the pipeline and the validation checks, who notices that the check tests for row count > 0 when it should test for row count within 5% of the source?

In traditional data teams, the answer is "code review" -- a human-powered process that scales poorly and catches errors inconsistently. A 2023 study by Thoughtworks found that code review catches only 60% of data quality issues and only 25% of logic errors in test assertions.

The DataTest Agent solves this with a principle borrowed from financial auditing: the entity being audited never audits itself. The DataTest Agent is architecturally independent. It receives acceptance criteria from the Data-BA Agent, receives artifacts (pipelines, models, APIs) from the DataScientist, ETL, and MLOps agents, and generates tests that it -- and only it -- executes.

DIAGRAM Independence Principle
flowchart LR
  BA["Data-BA Agent\n(Day -1)"] -- "criteria" --> DT["DataTest Agent\n(Independent Critic)"]
  DS["DS / ETL /\nMLOps Agents\n(Day 0-2)"] -- "artifacts" --> DT
  DT -- "NEVER validates\nits own work" -.-x DS
  DT --> QG["Quality Gate\nPASS / FAIL"]

Key Insight: The DataTest Agent does not write the code it tests. It does not train the models it evaluates. It does not build the pipelines it validates. This separation of concerns is not a convenience -- it is a structural guarantee of objectivity.


Test Generation from Acceptance Criteria #

The DataTest Agent's primary input is the acceptance_criteria declaration produced by the Data-BA Agent (Chapter 16). Each acceptance criterion becomes one or more executable tests.

NEAM
// Input: Acceptance criteria from Data-BA Agent
// Output: Executable test suite

datatest agent ChurnTester {
    provider: "openai",
    model: "gpt-4o",
    budget: TestBudget
}

test_suite ChurnTests {
    source_criteria: ChurnAcceptance,    // from Data-BA Agent

    test_categories: [
        "etl_quality",
        "dw_validation",
        "ml_metrics",
        "api_performance",
        "compliance",
        "explainability"
    ],

    generation: {
        method: "criteria_driven",       // tests derive FROM criteria
        coverage_target: 0.95,
        edge_cases: true,
        negative_tests: true             // test what SHOULD fail
    }
}
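To make criteria-driven generation concrete, here is a minimal Python sketch of the idea (the `Criterion` and `Test` classes and the `generate_tests` helper are illustrative, not part of the NEAM runtime): each acceptance criterion is mechanically turned into a threshold assertion that traces back to its source criterion.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    id: str          # e.g. "AC-001" from the Data-BA Agent
    metric: str
    threshold: float
    direction: str   # "min" (value must meet or exceed) or "max" (must not exceed)

@dataclass
class Test:
    id: str
    assertion: str
    severity: str
    traces_to: str   # every generated test links back to its criterion

def generate_tests(criteria):
    """Derive one blocking threshold test per acceptance criterion."""
    tests = []
    for i, c in enumerate(criteria, start=1):
        op = ">=" if c.direction == "min" else "<="
        tests.append(Test(
            id=f"GEN-{i:03d}",
            assertion=f"{c.metric} {op} {c.threshold}",
            severity="blocking",
            traces_to=c.id,
        ))
    return tests

suite = generate_tests([
    Criterion("AC-001", "auc_roc", 0.80, "min"),
    Criterion("AC-002", "p99_ms", 200, "max"),
])
print([t.assertion for t in suite])  # ['auc_roc >= 0.8', 'p99_ms <= 200']
```

Negative tests and edge cases would be generated the same way, by inverting or perturbing each assertion.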

ETL Testing #

ETL tests validate that data moves correctly from source to target with expected transformations applied.

NEAM
etl_tests PipelineValidation {
    suite: ChurnTests,
    tests: [
        {
            id: "ETL-001",
            name: "Row count reconciliation",
            type: "reconciliation",
            source: { table: "simshop_oltp.customers",
                      query: "SELECT COUNT(*) FROM customers WHERE active = true" },
            target: { table: "ml_features.customer_360",
                      query: "SELECT COUNT(*) FROM customer_360" },
            assertion: "abs(source_count - target_count) / source_count < 0.02",
            severity: "blocking"
        },
        {
            id: "ETL-002",
            name: "Null rate check",
            type: "data_quality",
            table: "ml_features.customer_360",
            columns: ["tenure_days", "days_since_last_order",
                       "support_tickets_30d"],
            assertion: "null_rate < 0.05",
            severity: "blocking"
        },
        {
            id: "ETL-003",
            name: "Referential integrity",
            type: "referential_integrity",
            parent: { table: "simshop_oltp.customers", key: "customer_id" },
            child: { table: "ml_features.customer_360", key: "customer_id" },
            assertion: "orphan_count == 0",
            severity: "blocking"
        },
        {
            id: "ETL-004",
            name: "Value range validation",
            type: "data_quality",
            table: "ml_features.customer_360",
            checks: [
                { column: "tenure_days", assertion: "value >= 0" },
                { column: "support_tickets_30d", assertion: "value >= 0" },
                { column: "churn_probability",
                  assertion: "value >= 0.0 AND value <= 1.0" }
            ],
            severity: "blocking"
        },
        {
            id: "ETL-005",
            name: "Freshness check",
            type: "freshness",
            table: "ml_features.customer_360",
            column: "updated_at",
            assertion: "max_age_hours < 24",
            severity: "advisory"
        }
    ]
}
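The assertions in ETL-001 and ETL-002 reduce to simple arithmetic. A short Python sketch (the function names are illustrative; the actual checks execute inside the DataTest Agent against live query results):

```python
def reconcile_counts(source_count, target_count, tolerance=0.02):
    """ETL-001: row counts must agree within tolerance of the source."""
    return abs(source_count - target_count) / source_count < tolerance

def null_rate_ok(values, max_rate=0.05):
    """ETL-002: the fraction of NULLs (None) must stay below max_rate."""
    nulls = sum(1 for v in values if v is None)
    return nulls / len(values) < max_rate

print(reconcile_counts(100_000, 99_100))  # True  (0.9% drift, within 2%)
print(reconcile_counts(100_000, 90_000))  # False (10% drift: blocking failure)
print(null_rate_ok([1, None, 3, 4, 5, 6, 7, 8, 9, 10]))  # False (10% nulls)
```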

Data Warehouse Testing #

DW tests validate structural correctness: dimension/fact relationships, SCD (Slowly Changing Dimension) behavior, and grain consistency.

NEAM
dw_tests WarehouseValidation {
    suite: ChurnTests,
    tests: [
        {
            id: "DW-001",
            name: "Dimension uniqueness",
            type: "dimension_validation",
            table: "simshop_dw.dim_customer",
            primary_key: "customer_sk",
            natural_key: "customer_id",
            assertion: "no_duplicate_natural_keys_for_current_records",
            severity: "blocking"
        },
        {
            id: "DW-002",
            name: "SCD Type 2 correctness",
            type: "scd_validation",
            table: "simshop_dw.dim_customer",
            scd_type: 2,
            effective_date: "effective_from",
            expiry_date: "effective_to",
            current_flag: "is_current",
            assertions: [
                "exactly_one_current_record_per_natural_key",
                "no_gaps_in_date_ranges",
                "no_overlapping_date_ranges"
            ],
            severity: "blocking"
        },
        {
            id: "DW-003",
            name: "Fact table grain",
            type: "grain_validation",
            table: "simshop_dw.fact_customer_activity",
            grain: ["customer_sk", "activity_date"],
            assertion: "no_duplicate_grain",
            severity: "blocking"
        },
        {
            id: "DW-004",
            name: "Fact-dimension join integrity",
            type: "referential_integrity",
            fact: "simshop_dw.fact_customer_activity",
            dimension: "simshop_dw.dim_customer",
            join_key: "customer_sk",
            assertion: "all_fact_keys_exist_in_dimension",
            severity: "blocking"
        }
    ]
}
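The SCD Type 2 assertions in DW-002 can be sketched in a few lines of Python. This illustrative check assumes half-open date ranges in which each version's `effective_to` equals the next version's `effective_from`, with `date.max` marking the open-ended current record; under that convention a single inequality catches both gaps and overlaps:

```python
from datetime import date

def scd2_valid(rows):
    """DW-002 assertions for one natural key's history:
    exactly one current record, no gaps, no overlapping ranges.
    Each row: (effective_from, effective_to, is_current)."""
    rows = sorted(rows, key=lambda r: r[0])
    if sum(1 for r in rows if r[2]) != 1:
        return False              # not exactly one current record
    for (_, t1, _), (f2, _, _) in zip(rows, rows[1:]):
        if f2 != t1:
            return False          # gap or overlap between consecutive versions
    return True

history = [
    (date(2023, 1, 1), date(2024, 6, 1), False),
    (date(2024, 6, 1), date.max, True),
]
print(scd2_valid(history))  # True
```

The real test would run this logic per natural key across the entire `dim_customer` table, typically pushed down into SQL.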

ML Testing #

ML tests go beyond accuracy metrics. They validate fairness, explainability, and robustness.

NEAM
ml_tests ModelValidation {
    suite: ChurnTests,
    tests: [
        {
            id: "ML-001",
            name: "AUC-ROC threshold",
            type: "metric_threshold",
            model: "ChurnModel",
            metric: "auc_roc",
            assertion: "value >= 0.80",
            severity: "blocking",
            traces_to: "AC-001"          // links to acceptance criteria
        },
        {
            id: "ML-002",
            name: "F1 Score threshold",
            type: "metric_threshold",
            model: "ChurnModel",
            metric: "f1",
            assertion: "value >= 0.65",
            severity: "blocking",
            traces_to: "AC-001"
        },
        {
            id: "ML-003",
            name: "Fairness audit",
            type: "fairness",
            model: "ChurnModel",
            protected_attributes: ["gender", "age_group"],
            metric: "demographic_parity_ratio",
            assertion: "value >= 0.80",
            severity: "blocking",
            traces_to: "AC-004"
        },
        {
            id: "ML-004",
            name: "SHAP explainability",
            type: "explainability",
            model: "ChurnModel",
            method: "shap",
            assertion: "top_k_features >= 5 AND all_values_computed",
            severity: "blocking",
            traces_to: "AC-003"
        },
        {
            id: "ML-005",
            name: "PII exclusion verification",
            type: "compliance",
            model: "ChurnModel",
            forbidden_features: ["email", "phone", "date_of_birth",
                                  "first_name", "last_name"],
            assertion: "none_present_in_feature_set",
            severity: "blocking",
            traces_to: "AC-004"
        },
        {
            id: "ML-006",
            name: "Model stability",
            type: "stability",
            model: "ChurnModel",
            method: "bootstrap",
            n_iterations: 100,
            assertion: "auc_std < 0.03",
            severity: "advisory"
        }
    ]
}
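The fairness metric in ML-003, the demographic parity ratio, is the lowest group's positive-prediction rate divided by the highest group's; a value of 1.0 means perfectly equal rates. An illustrative Python sketch (not the agent's actual implementation):

```python
def demographic_parity_ratio(predictions, groups):
    """ML-003: min/max ratio of positive-prediction rates across the
    groups of a protected attribute; the gate requires >= 0.80."""
    rates = {}
    for g in set(groups):
        preds = [p for p, gg in zip(predictions, groups) if gg == g]
        rates[g] = sum(preds) / len(preds)
    return min(rates.values()) / max(rates.values())

preds  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]
ratio = demographic_parity_ratio(preds, groups)   # 0.4 / 0.6
print(round(ratio, 2))  # 0.67 -> fails the >= 0.80 assertion
```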

API Testing #

API tests validate the serving infrastructure before production deployment.

NEAM
api_tests APIValidation {
    suite: ChurnTests,
    tests: [
        {
            id: "API-001",
            name: "p99 latency",
            type: "latency",
            endpoint: "/v1/churn/predict",
            method: "POST",
            payload: { customer_id: "test_customer_001" },
            concurrency: 100,
            duration_seconds: 60,
            assertion: "p99_ms < 200",
            severity: "blocking",
            traces_to: "AC-002"
        },
        {
            id: "API-002",
            name: "Throughput",
            type: "throughput",
            endpoint: "/v1/churn/predict",
            assertion: "requests_per_second >= 500",
            severity: "advisory"
        },
        {
            id: "API-003",
            name: "Error handling",
            type: "error_handling",
            scenarios: [
                { input: { customer_id: "nonexistent" },
                  expected_status: 404 },
                { input: { customer_id: null },
                  expected_status: 400 },
                { input: {},
                  expected_status: 400 }
            ],
            severity: "blocking"
        }
    ]
}
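The p99 assertion in API-001 depends on how the percentile is computed. A simple nearest-rank sketch in Python (illustrative; production load-testing tools report percentiles directly):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest observed value with at
    least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# 100 simulated latencies: mostly fast, with a slow tail
latencies_ms = [120] * 97 + [250, 260, 900]
p99 = percentile(latencies_ms, 99)
print(p99, p99 < 200)  # 260 False -> blocking failure against p99_ms < 200
```

Note that a p99 check over only 100 samples is dominated by two observations; the 60-second run at concurrency 100 in API-001 exists precisely to collect enough samples for a stable tail estimate.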

Quality Gates: BLOCKING vs Advisory #

The quality_gate declaration is the enforcement mechanism. It determines whether the pipeline proceeds to deployment or halts.

NEAM
quality_gate ChurnGate {
    suite: ChurnTests,

    blocking_rules: {
        // ALL blocking tests must pass for the gate to open
        required_pass_rate: 1.0,
        categories: ["etl_quality", "ml_metrics", "compliance"]
    },

    advisory_rules: {
        // Advisory failures are logged but do not block
        min_pass_rate: 0.80,
        categories: ["api_performance", "stability"]
    },

    actions: {
        on_pass: {
            notify: ["data-team@company.com"],
            proceed_to: "deployment",
            log: "Quality gate PASSED — proceeding to canary deployment"
        },
        on_fail: {
            notify: ["data-team@company.com", "oncall@company.com"],
            block_deployment: true,
            create_incident: true,
            log: "Quality gate FAILED — deployment BLOCKED"
        },
        on_advisory_fail: {
            notify: ["data-team@company.com"],
            proceed_to: "deployment",
            log: "Quality gate PASSED with advisory warnings"
        }
    }
}
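The gate semantics reduce to two questions, asked in order: did any blocking test fail, and if not, did any advisory test fail? A minimal Python sketch of the decision logic (the `evaluate_gate` helper is illustrative, not the NEAM runtime):

```python
def evaluate_gate(results):
    """Any blocking failure fails the gate outright; advisory failures
    only downgrade a pass to 'passed_with_warnings'.
    Each result: (test_id, severity, passed)."""
    blocking_fail = any(sev == "blocking" and not ok for _, sev, ok in results)
    advisory_fail = any(sev == "advisory" and not ok for _, sev, ok in results)
    if blocking_fail:
        return "failed"           # block deploy, create incident, page oncall
    return "passed_with_warnings" if advisory_fail else "passed"

results = [
    ("ML-001", "blocking", True),
    ("ETL-005", "advisory", False),   # freshness lagging: logged, not blocking
]
print(evaluate_gate(results))  # passed_with_warnings
```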
DIAGRAM Quality Gate Decision Flow
flowchart TD
  A["Execute ALL tests in ChurnTests suite"] --> B{"Any BLOCKING\ntest failed?"}
  B -- "Yes" --> C["GATE: FAILED\nBlock deploy\nCreate incident\nNotify oncall"]
  B -- "No" --> D{"Any ADVISORY\ntest failed?"}
  D -- "Yes" --> E["GATE: PASSED\nwith warnings"]
  D -- "No" --> F["GATE: PASSED\n(clean)"]

Critical Distinction: BLOCKING tests are non-negotiable. If AUC is below 0.80, the model does not deploy. Period. Advisory tests flag concerns that should be investigated but do not prevent deployment. This distinction prevents two failure modes: (1) shipping bad models because "the deadline is tomorrow," and (2) blocking good models because of a minor advisory concern.


The Complete DataTest Agent Declaration #

NEAM
// ═══ BUDGET ═══
budget TestBudget { cost: 50.00, tokens: 500000 }

// ═══ DATATEST AGENT ═══
datatest agent ChurnTester {
    provider: "openai",
    model: "gpt-4o",
    budget: TestBudget
}

// ═══ EXECUTE TEST SUITE ═══
let etl_results = test_run(ChurnTester, PipelineValidation)
let dw_results = test_run(ChurnTester, WarehouseValidation)
let ml_results = test_run(ChurnTester, ModelValidation)
let api_results = test_run(ChurnTester, APIValidation)

// ═══ EVALUATE QUALITY GATE ═══
let gate_result = test_gate(ChurnTester, ChurnGate)

if gate_result.status == "passed" {
    print("Quality gate PASSED: " + str(gate_result.summary))
    // Proceed to MLOps deployment
} else {
    print("Quality gate FAILED: " + str(gate_result.failures))
    // Block deployment, create incident
}

Industry Perspective #

Data testing is the least glamorous and most impactful discipline in the data stack. According to Monte Carlo's 2024 State of Data Quality report, data quality issues cost organizations an average of $12.9 million per year. Yet only 34% of data teams have automated testing in their pipelines, and only 11% have quality gates that can block deployment.

The DataTest Agent addresses this gap by making testing a first-class agent rather than an afterthought. In traditional teams, testing happens in the last sprint before launch, performed by the same engineers who built the system. The DataTest Agent inverts this: testing criteria are defined at Day -1 (by the Data-BA Agent), tests are generated and executed independently, and quality gates enforce pass/fail decisions before any artifact reaches production.

Great Expectations, Soda, and dbt's built-in tests pioneered data testing automation. The DataTest Agent builds on their concepts while adding three capabilities they lack: (1) test generation from acceptance criteria, (2) ML-specific testing (fairness, explainability), and (3) BLOCKING quality gates with incident creation.


Evidence: DataSims Experimental Proof #

Experiment: Ablation A3 -- System Without DataTest Agent #

Setup: The full SimShop churn prediction workflow was run 5 times with the DataTest Agent disabled (ablation no_test). All other agents remained active.

Results:

| Metric | Full System | Without DataTest | Delta |
| --- | --- | --- | --- |
| Total Tests | 47 | 0 | -100% |
| Test Coverage | 0.94 | 0 | -100% |
| Quality Gate | passed | skipped | Lost |
| Model AUC | 0.847 | 0.847 | No change |
| BRD Generated | true | true | No change |
| Root Cause | support_quality_degradation | support_quality_degradation | No change |
| Deploy Strategy | canary | canary | No change |

Experiment: Ablation A7 -- System Without Quality Gates #

Setup: The full workflow was run 5 times with quality gates disabled (ablation no_gates). Tests still execute, but gate enforcement is removed.

Results:

| Metric | Full System | Without Gates | Delta |
| --- | --- | --- | --- |
| Total Tests | 47 | 47 | No change |
| Test Coverage | 0.94 | 0.94 | No change |
| Quality Gate | passed | bypassed | Degraded |
| Model AUC | 0.847 | 0.847 | No change |

Analysis: In both ablations the model itself is unchanged (AUC 0.847 in every run); what disappears is verification. Disabling the DataTest Agent removes all 47 tests, so deployment proceeds with zero evidence of quality. Disabling only the gates leaves the tests running but strips them of consequence: failures are recorded but never enforced.

DIAGRAM Ablation Comparison: DataTest and Quality Gates
flowchart LR
  subgraph A3["Without DataTest (A3)"]
    direction LR
    A3_BA["Data-BA"] --> A3_DS["DataScientist\nAUC = 0.847"]
    A3_DS --> A3_NT["(no tests)\n0 tests\ngate: SKIPPED"]
    A3_NT --> A3_ML["MLOps\ndeploys blind"]
  end

  subgraph A7["Without Gates (A7)"]
    direction LR
    A7_BA["Data-BA"] --> A7_DS["DataScientist\nAUC = 0.847"]
    A7_DS --> A7_DT["DataTest\n47 tests\ngate: BYPASSED"]
    A7_DT --> A7_ML["MLOps\ndeploys regardless"]
  end

  subgraph FS["Full System"]
    direction LR
    FS_BA["Data-BA"] --> FS_DS["DataScientist\nAUC = 0.847"]
    FS_DS --> FS_DT["DataTest\n47 tests\ngate: PASSED"]
    FS_DT --> FS_ML["MLOps\ndeploys only\nif gate passes"]
  end

The three scenarios illustrate a critical distinction:

- Without DataTest (A3): no tests exist, so there is no evidence of quality at all, and MLOps deploys blind.
- Without gates (A7): all 47 tests run and produce evidence, but nothing acts on it, and MLOps deploys regardless.
- Full system: tests produce evidence and the gate enforces it, so MLOps deploys only if the gate passes.

Key Finding: Testing without enforcement is theater. The DataTest Agent's value comes from the combination of independent test generation (from Data-BA criteria), comprehensive test coverage (47 tests across 6 categories), and BLOCKING quality gates that prevent bad artifacts from reaching production.

Reproducibility: 5/5 runs succeeded for both ablations. Full data at github.com/neam-lang/Data-Sims in evaluation/results/ablation_no_test.json and evaluation/results/ablation_no_gates.json.


Key Takeaways #