Chapter 20 -- The MLOps Agent: Production Guardian #

"Everyone has a plan until they get punched in the mouth." -- Mike Tyson


25 min read | Sarah (MLOps), Marcus (DS), Priya (DE), David (VP) | Part V: Analytical Intelligence

What you'll learn:

- The six types of drift and how each is detected
- Continuous retraining pipelines and the triggers that start them
- Four deployment strategies (canary, shadow, blue-green, A/B) and when to use each
- Champion-challenger evaluation and automated promotion
- Serving infrastructure and business KPI tracking

The Problem: The Model That Worked on Tuesday #

Sarah deployed the churn model on a Tuesday. AUC of 0.847. Quality gates passed. Canary deployment looked clean. She closed the deployment ticket and moved on to the next project.

Three weeks later, the VP of Customer Success called. "Your model identified 200 customers as high-churn-risk this week. We reached out to all of them. Eighteen of them had already churned before the model scored them. Another forty said they had no intention of leaving and found the outreach annoying. What is going on?"

What happened was drift. Not a single dramatic failure, but a slow degradation that no one was monitoring. A new product launch changed customer behavior patterns. A pricing change shifted the spend distribution. A CRM migration introduced a 48-hour lag in the support ticket data. None of these changes broke the model -- they made it quietly, invisibly wrong.

This is the Day-2 problem. Building the model is Day 0. Deploying it is Day 1. Keeping it healthy in production -- while the world changes around it -- is Day 2 through Day infinity. And it is where most ML projects silently fail.


The 6 Types of Drift #

Drift is not a single phenomenon. The MLOps Agent monitors six distinct types, each with different detection methods and remediation strategies.

| Type | What Changes | Detection Method |
|---|---|---|
| Data Drift | Input feature distributions shift | KS test, PSI, Jensen-Shannon divergence |
| Concept Drift | P(Y\|X) changes -- same inputs, different correct outputs | Performance monitoring, label delay tracking |
| Prediction Drift | Model output distribution shifts (even if accuracy holds) | Output distribution monitoring, KL divergence |
| Label Drift | Ground truth label distribution changes | Label distribution monitoring, class balance tracking |
| Feature Drift | Individual feature distributions shift (subset of data drift) | Per-feature PSI, correlation tracking |
| Schema Drift | Column names, types, or availability change | Schema comparison, contract validation |

```neam
drift_monitor ChurnDrift {
    model: "ChurnModel",
    baseline: {
        dataset: "ml_features.customer_360_baseline",
        timestamp: "2025-12-31"
    },
    monitors: [
        {
            type: "data_drift",
            method: "psi",              // Population Stability Index
            threshold: 0.2,             // >0.2 = significant drift
            features: "all",
            frequency: "hourly"
        },
        {
            type: "concept_drift",
            method: "performance_decay",
            metric: "auc_roc",
            threshold: 0.05,            // >5% drop = alert
            window: "7d",
            frequency: "daily"
        },
        {
            type: "prediction_drift",
            method: "kl_divergence",
            threshold: 0.1,
            frequency: "hourly"
        },
        {
            type: "feature_drift",
            method: "psi",
            threshold: 0.15,
            features: ["days_since_last_order", "support_tickets_30d",
                       "login_trend_30d", "spend_trend_30d",
                       "cart_abandonment_rate"],
            frequency: "hourly"
        },
        {
            type: "schema_drift",
            method: "contract_validation",
            expected_schema: ChurnFeatures,
            frequency: "on_ingestion"
        }
    ],
    alerts: {
        channels: ["slack", "pagerduty"],
        escalation: {
            warning: "data-team",
            critical: "oncall"
        }
    }
}
```

Key Insight: Data drift and concept drift are fundamentally different problems with different solutions. Data drift means the inputs changed -- maybe your feature pipeline broke, or the real world shifted. Concept drift means the relationship between inputs and outputs changed -- the model's logic is stale even if the data looks normal. The MLOps Agent monitors both independently.
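The PSI method used by the data-drift monitor is simple enough to sketch directly. The following Python is an illustrative implementation, not part of NEAM: it buckets the baseline distribution, compares bucket frequencies in current data against the baseline, and flags drift at the conventional 0.2 threshold used in ChurnDrift.

```python
import math

def psi(baseline, current, bins=10):
    """Population Stability Index between two numeric samples.

    Buckets are derived from the baseline's range; a small epsilon
    guards against empty buckets (log of zero). PSI > 0.2 is the
    conventional threshold for significant drift.
    """
    lo, hi = min(baseline), max(baseline)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_fracs(sample):
        counts = [0] * bins
        for x in sample:
            # count of edges <= x gives the bucket index;
            # values past the last edge land in the final bucket
            counts[sum(e <= x for e in edges)] += 1
        eps = 1e-6
        return [max(c / len(sample), eps) for c in counts]

    b, c = bucket_fracs(baseline), bucket_fracs(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

baseline = [x / 100 for x in range(1000)]        # uniform on [0, 10)
stable   = [x / 100 for x in range(0, 1000, 2)]  # same shape, fewer points
shifted  = [x / 100 + 4 for x in range(1000)]    # mean shifted by +4

assert psi(baseline, stable) < 0.2   # no significant drift
assert psi(baseline, shifted) > 0.2  # significant drift
```

The same bucketing machinery underlies the per-feature `feature_drift` monitor; only the threshold (0.15) and the feature list change.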


Continuous Training Pipelines #

When drift is detected, the MLOps Agent can trigger automated retraining. The retraining_pipeline declaration defines when, how, and under what constraints retraining occurs.

```neam
retraining_pipeline ChurnRetrain {
    model: "ChurnModel",
    triggers: [
        {
            type: "drift_detected",
            source: ChurnDrift,
            min_severity: "warning",
            cooldown_hours: 24          // don't retrain more than daily
        },
        {
            type: "performance_decay",
            metric: "auc_roc",
            threshold: 0.80,            // retrain if AUC drops below target
            window: "7d"
        },
        {
            type: "scheduled",
            cron: "0 2 * * 0",          // Weekly at 2 AM Sunday
            always_run: false           // skip if no drift detected
        },
        {
            type: "data_volume",
            min_new_labels: 1000,       // retrain when 1000 new labels available
            label_table: "ml_labels.churn_actuals"
        }
    ],
    training: {
        use_experiment: "ChurnModel",   // reuse DS agent's experiment config
        data_window: "rolling_12_months",
        validation: "temporal_split",
        auto_hyperparameter_tune: true,
        max_training_hours: 4
    },
    promotion: {
        strategy: "champion_challenger",
        min_improvement: 0.01,          // new model must beat by >= 1%
        metric: "auc_roc",
        require_gate: true              // must pass quality gate
    }
}
```
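The trigger logic an orchestrator would run against this declaration can be sketched in a few lines. This Python is illustrative only; the function and field names are assumptions, not NEAM APIs. It shows the interplay of the drift trigger (gated by the cooldown) and the data-volume trigger.

```python
from datetime import datetime, timedelta

def should_retrain(drift_detected, last_retrain, now,
                   new_labels, cooldown_hours=24, min_new_labels=1000):
    """Combine ChurnRetrain's drift and data-volume triggers.

    Drift fires a retrain only outside the cooldown window; enough
    fresh labels alone can also trigger one, so the model keeps
    learning even when distributions look stable.
    """
    cooled_down = now - last_retrain >= timedelta(hours=cooldown_hours)
    if drift_detected and cooled_down:
        return True
    return new_labels >= min_new_labels

now = datetime(2026, 1, 10, 12, 0)
# Drift seen, but we retrained 3 hours ago: cooldown suppresses it.
assert not should_retrain(True, now - timedelta(hours=3), now, new_labels=0)
# Same drift a day later: retrain.
assert should_retrain(True, now - timedelta(hours=25), now, new_labels=0)
# No drift, but 1,200 fresh labels arrived: retrain on data volume.
assert should_retrain(False, now - timedelta(hours=3), now, new_labels=1200)
```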

Deployment Strategies #

The deployment_strategy declaration manages how new models reach production. Four strategies are supported, each with distinct risk profiles.

| Strategy | Risk | Rollback | Cost | Best For |
|---|---|---|---|---|
| Canary | Low | Fast | Low | Most cases |
| Shadow | None | N/A | Medium | High-risk models |
| Blue-Green | Low | Instant | High | Zero-downtime |
| A/B Test | Medium | Medium | Medium | Business impact |

Canary Deployment #

Route a small percentage of traffic to the new model. Monitor. Gradually increase if healthy.

```neam
deployment_strategy CanaryDeploy {
    type: "canary",
    stages: [
        { traffic_pct: 5,  duration: "1h",  gate: "latency_and_errors" },
        { traffic_pct: 10, duration: "4h",  gate: "latency_errors_metrics" },
        { traffic_pct: 25, duration: "12h", gate: "full_metrics" },
        { traffic_pct: 50, duration: "24h", gate: "full_metrics" },
        { traffic_pct: 100, duration: "stable", gate: "none" }
    ],
    rollback: {
        auto: true,
        conditions: [
            "error_rate > 0.01",
            "p99_latency_ms > 300",
            "auc_degradation > 0.05"
        ]
    }
}
```
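The stage-advancement decision behind this declaration is a small state machine. A minimal sketch in Python, with hypothetical names (this is not the agent's actual controller): check the rollback conditions first, then either advance to the next traffic stage or hold at 100%.

```python
def canary_step(stage_idx, stages, health):
    """Decide the next action for a canary rollout.

    `health` carries the live metrics named in CanaryDeploy's rollback
    conditions. Returns ("rollback", 0), ("advance", next_idx), or
    ("hold", stage_idx) once the rollout has reached 100%.
    """
    breached = (health["error_rate"] > 0.01
                or health["p99_latency_ms"] > 300
                or health["auc_degradation"] > 0.05)
    if breached:
        return ("rollback", 0)
    if stage_idx + 1 < len(stages):
        return ("advance", stage_idx + 1)
    return ("hold", stage_idx)          # already at 100%: stay

stages  = [5, 10, 25, 50, 100]          # traffic percentages
healthy = {"error_rate": 0.002, "p99_latency_ms": 180, "auc_degradation": 0.01}
slow    = {"error_rate": 0.002, "p99_latency_ms": 450, "auc_degradation": 0.01}

assert canary_step(0, stages, healthy) == ("advance", 1)
assert canary_step(1, stages, slow) == ("rollback", 0)   # p99 breach
assert canary_step(4, stages, healthy) == ("hold", 4)
```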

Shadow Deployment #

Run the new model in parallel without serving its predictions. Compare against the champion.

```neam
deployment_strategy ShadowDeploy {
    type: "shadow",
    duration: "7d",
    comparison: {
        champion: "churn_model_v3",
        challenger: "churn_model_v4",
        metrics: ["auc_roc", "f1", "precision_at_10", "latency_p99"]
    },
    promotion_criteria: {
        challenger_wins_on: ["auc_roc", "f1"],
        max_latency_increase_pct: 10
    }
}
```
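After the seven-day shadow window, the promotion criteria reduce to a single boolean check over the collected metrics. A sketch of that check in Python, assuming hypothetical metric dictionaries (these names are illustrative, not a NEAM API):

```python
def shadow_promote(champ, chall, wins_on=("auc_roc", "f1"),
                   max_latency_increase_pct=10):
    """Apply ShadowDeploy's promotion criteria to collected metrics.

    The challenger must beat the champion on every metric in
    `wins_on` while keeping p99 latency within the allowed increase.
    """
    beats_all = all(chall[m] > champ[m] for m in wins_on)
    latency_cap = champ["latency_p99"] * (1 + max_latency_increase_pct / 100)
    return beats_all and chall["latency_p99"] <= latency_cap

champ = {"auc_roc": 0.83, "f1": 0.61, "latency_p99": 120}
chall = {"auc_roc": 0.85, "f1": 0.64, "latency_p99": 128}  # +6.7% latency: OK
assert shadow_promote(champ, chall)

too_slow = dict(chall, latency_p99=140)                    # +16.7%: blocked
assert not shadow_promote(champ, too_slow)
```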

Blue-Green Deployment #

Maintain two identical environments. Switch traffic instantaneously.

```neam
deployment_strategy BlueGreenDeploy {
    type: "blue_green",
    environments: {
        blue: { endpoint: "/v1/churn/predict-blue", model: "v3" },
        green: { endpoint: "/v1/churn/predict-green", model: "v4" }
    },
    switch: {
        method: "dns",              // or "load_balancer"
        health_check_seconds: 30,
        rollback_timeout_seconds: 300
    }
}
```

A/B Test Deployment #

Split traffic for statistical comparison of business outcomes.

```neam
deployment_strategy ABTestDeploy {
    type: "ab_test",
    variants: {
        control: { model: "v3", traffic_pct: 50 },
        treatment: { model: "v4", traffic_pct: 50 }
    },
    success_metric: "intervention_conversion_rate",
    duration: "14d",
    statistical: {
        test: "chi_squared",
        significance: 0.05,
        min_sample_size: 1000
    }
}
```
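The chi-squared test this declaration requests is the standard Pearson test on a 2x2 conversion table. A self-contained Python sketch (illustrative, not the agent's implementation) that compares the statistic to the df=1 critical value rather than computing an exact p-value:

```python
def chi_squared_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-squared statistic for a 2x2 conversion table."""
    table = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    col = [conv_a + conv_b, total - conv_a - conv_b]  # converted / not
    row = [n_a, n_b]                                  # variant sizes
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Control: 110/1000 conversions; treatment: 150/1000.
stat = chi_squared_2x2(110, 1000, 150, 1000)
CRITICAL_05_DF1 = 3.841          # chi-squared critical value, df=1, alpha=0.05
assert stat > CRITICAL_05_DF1    # difference is significant at the 5% level
```

The `min_sample_size: 1000` guard matters here: with small samples the expected cell counts get unreliable and the test loses power.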

When to Use Each Strategy:

- Canary: Default choice. Use when you have clear rollback metrics.
- Shadow: Use for high-risk models (fraud, credit scoring) where bad predictions have immediate financial impact.
- Blue-Green: Use when you need zero-downtime deployment and can afford running two environments.
- A/B: Use when you need to measure business impact, not just model metrics.


Champion-Challenger Evaluation #

The champion_challenger declaration automates the decision of whether a retrained model should replace the current production model.

```neam
champion_challenger ChurnChallenger {
    champion: {
        model: "churn_model_v3",
        registry: "mlflow",
        stage: "production"
    },
    challenger: {
        model: "churn_model_v4",
        registry: "mlflow",
        stage: "staging"
    },
    evaluation: {
        dataset: "ml_features.customer_360_holdout",
        metrics: [
            { name: "auc_roc", weight: 0.4, direction: "higher" },
            { name: "f1", weight: 0.3, direction: "higher" },
            { name: "inference_ms", weight: 0.2, direction: "lower" },
            { name: "model_size_mb", weight: 0.1, direction: "lower" }
        ],
        min_weighted_improvement: 0.01
    },
    promotion: {
        auto_promote: true,
        require_quality_gate: true,
        notify: ["data-team@company.com"],
        deployment_strategy: CanaryDeploy
    }
}
```

Diagram: Champion-Challenger Decision Flow

```mermaid
flowchart TD
  A["Retrained Model (Challenger)"] --> B["Evaluate on holdout set"]
  B --> C["Weighted score comparison"]
  C --> D["Challenger wins"]
  C --> E["Tie / within tolerance"]
  C --> F["Champion wins"]
  D --> G["Quality Gate check (blocking)"]
  E --> H["Keep Champion, Log result"]
  F --> I["Keep Champion, Archive Challenger"]
  G --> J["PASS"]
  G --> K["FAIL"]
  J --> L["Promote via Canary deployment"]
  K --> M["Block promotion, Investigate"]
```
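The weighted score comparison in ChurnChallenger combines per-metric relative deltas, inverting "lower is better" metrics so that every improvement contributes positively. A sketch in Python with hypothetical numbers (illustrative only, not a NEAM API):

```python
def weighted_score(metrics, weights, directions):
    """Weighted relative improvement of challenger over champion.

    Each metric contributes weight * relative delta; metrics with
    direction "lower" (latency, model size) are sign-flipped so that
    a reduction counts as improvement.
    """
    score = 0.0
    for name, (champ, chall) in metrics.items():
        delta = (chall - champ) / champ
        if directions[name] == "lower":
            delta = -delta
        score += weights[name] * delta
    return score

weights    = {"auc_roc": 0.4, "f1": 0.3, "inference_ms": 0.2, "model_size_mb": 0.1}
directions = {"auc_roc": "higher", "f1": "higher",
              "inference_ms": "lower", "model_size_mb": "lower"}
metrics = {
    "auc_roc":       (0.83, 0.85),    # challenger +2.4%
    "f1":            (0.61, 0.63),    # challenger +3.3%
    "inference_ms":  (40.0, 38.0),    # challenger 5% faster
    "model_size_mb": (120.0, 120.0),  # unchanged
}
score = weighted_score(metrics, weights, directions)
assert score >= 0.01  # clears min_weighted_improvement: challenger wins
```

Note that winning the score is necessary but not sufficient: the quality gate check in the flow above still blocks promotion independently.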

Serving Infrastructure #

The serving_infra declaration manages inference infrastructure across three serving patterns.

```neam
serving_infra ChurnServing {
    patterns: [
        {
            name: "real_time",
            endpoint: "/v1/churn/predict",
            framework: "fastapi",
            instances: { min: 2, max: 10 },
            autoscaling: {
                metric: "requests_per_second",
                target: 100,
                scale_up_cooldown: 60,
                scale_down_cooldown: 300
            },
            sla: {
                latency_p99_ms: 200,
                availability: 0.999,
                throughput_rps: 500
            }
        },
        {
            name: "batch",
            schedule: "0 6 * * *",          // Daily at 6 AM
            input: "ml_features.customer_360",
            output: "ml_predictions.churn_daily",
            compute: "databricks",
            timeout_minutes: 60
        }
    ]
}
```
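The real-time pattern's autoscaler is target-tracking: keep each instance near the target requests-per-second, clamped to the declared bounds. A minimal Python sketch of that calculation (illustrative; the cooldown timers are omitted):

```python
import math

def desired_instances(current_rps, target_rps_per_instance=100,
                      min_instances=2, max_instances=10):
    """Target-tracking scaling: enough instances to keep each one
    near the target RPS, clamped to the configured min/max."""
    needed = math.ceil(current_rps / target_rps_per_instance)
    return max(min_instances, min(max_instances, needed))

assert desired_instances(50) == 2      # floor: never below min
assert desired_instances(450) == 5     # ceil(450 / 100)
assert desired_instances(2000) == 10   # ceiling: capped at max
```

The asymmetric cooldowns in the declaration (60s up, 300s down) then rate-limit how often this target is acted on, so brief traffic spikes scale up quickly but scale-down is conservative.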

Business KPI Tracking #

Model metrics (AUC, F1) are proxy measures. The MLOps Agent also tracks the business outcomes that the model is designed to influence.

```neam
business_kpi_tracker ChurnKPIs {
    model: "ChurnModel",
    kpis: [
        {
            name: "churn_rate",
            query: "SELECT COUNT(CASE WHEN churned THEN 1 END)::float /
                    COUNT(*) FROM customers WHERE segment = 'enterprise'",
            target: 0.08,               // goal: reduce to 8%
            baseline: 0.14,             // current: 14%
            frequency: "weekly"
        },
        {
            name: "intervention_roi",
            query: "SELECT SUM(retained_arr) / SUM(intervention_cost)
                    FROM retention_campaigns WHERE model_version = $current",
            target: 5.0,                // $5 retained per $1 spent
            frequency: "monthly"
        },
        {
            name: "net_revenue_retention",
            query: "SELECT (end_arr - churn_arr + expansion_arr) / start_arr
                    FROM arr_summary WHERE period = $current_quarter",
            target: 1.10,               // 110% NRR
            frequency: "quarterly"
        }
    ],
    alerts: {
        kpi_degradation_pct: 10,        // alert if KPI worsens by >10%
        channel: "slack"
    }
}
```

Why Business KPIs Matter: A model can have perfect AUC and still fail the business. If the model correctly predicts churn but the retention team cannot act on the predictions fast enough, business KPIs degrade while model metrics remain green. The MLOps Agent tracks both to catch this disconnect.
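The `kpi_degradation_pct` alert condition has one subtlety worth making explicit: "worse" depends on direction. Churn rate degrades upward; ROI and NRR degrade downward. A sketch in Python (illustrative names and numbers, not a NEAM API):

```python
def kpi_alert(current, baseline, lower_is_better, degradation_pct=10):
    """Flag a KPI that has moved the wrong way by more than the
    configured percentage of its baseline."""
    change_pct = (current - baseline) / baseline * 100
    if lower_is_better:
        return change_pct > degradation_pct     # rising churn is bad
    return change_pct < -degradation_pct        # falling ROI/NRR is bad

# churn_rate: baseline 14%, now 15.8% -> ~12.9% worse: alert fires.
assert kpi_alert(0.158, 0.14, lower_is_better=True)
# intervention_roi: baseline 5.0, now 4.7 -> 6% worse: below threshold.
assert not kpi_alert(4.7, 5.0, lower_is_better=False)
# intervention_roi: now 4.4 -> 12% worse: alert fires.
assert kpi_alert(4.4, 5.0, lower_is_better=False)
```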


The Complete MLOps Agent Declaration #

```neam
// ═══ BUDGET ═══
budget MLOpsBudget { cost: 50.00, tokens: 500000 }

// ═══ MLOPS AGENT ═══
mlops agent ChurnMLOps {
    provider: "openai",
    model: "gpt-4o",
    budget: MLOpsBudget
}

// ═══ OPERATIONS ═══

// Start monitoring
let monitor_status = mlops_start_monitor(ChurnMLOps, ChurnDrift)
print("Drift monitoring: " + str(monitor_status))

// Check drift
let drift_result = mlops_check_drift(ChurnMLOps, ChurnDrift)
if drift_result.drift_detected {
    print("DRIFT DETECTED: " + str(drift_result.details))

    // Trigger retraining
    let retrain_result = mlops_retrain(ChurnMLOps, ChurnRetrain)

    // Evaluate champion vs challenger
    let eval_result = mlops_evaluate(ChurnMLOps, ChurnChallenger)

    if eval_result.challenger_wins {
        // Deploy via canary
        let deploy_result = mlops_deploy(ChurnMLOps, CanaryDeploy)
        print("Canary deployment started: " + str(deploy_result))
    }
}

// Track business KPIs
let kpi_report = mlops_track_kpis(ChurnMLOps, ChurnKPIs)
print("Business KPIs: " + str(kpi_report))
```

Industry Perspective #

MLOps maturity varies dramatically across the industry. Google's MLOps maturity model (2023) defines three levels:

- Level 0: Manual process -- models are trained, validated, and deployed by hand, with no automation between steps.
- Level 1: ML pipeline automation -- training runs automatically and continuously, but deploying a new pipeline is still a manual task.
- Level 2: CI/CD pipeline automation -- the full build, test, deploy, and monitor cycle is automated, including monitoring-triggered retraining.

The Neam MLOps Agent operates at Level 2. The DataSims experiment demonstrates this: continuous drift monitoring, automated retraining triggers, champion-challenger evaluation, canary deployment, and business KPI tracking -- all declaratively specified and automatically executed.

According to Algorithmia's 2024 survey, organizations spend 45% of their ML engineering time on deployment and monitoring -- more than on model development itself. The MLOps Agent automates the repetitive 80% of that 45%, freeing ML engineers to focus on the 20% that requires human judgment: defining monitoring thresholds, selecting deployment strategies, and interpreting business KPI trends.


Evidence: DataSims Experimental Proof #

Experiment: Ablation A4 -- System Without MLOps Agent #

Setup: The full SimShop churn prediction workflow was run 5 times with the MLOps Agent disabled (ablation no_mlops). All other agents remained active.

Results:

| Metric | Full System | Without MLOps | Delta |
|---|---|---|---|
| Deploy Strategy | canary | manual | Degraded |
| Deploy Health | healthy | unmonitored | Degraded |
| Drift Detection | active | -- | Lost |
| Model AUC | 0.847 | 0.847 | No change |
| Test Coverage | 0.94 | 0.94 | No change |
| Quality Gate | passed | passed | No change |
| Root Cause | support_quality_degradation | support_quality_degradation | No change |

Analysis:

Without the MLOps Agent, the system still builds and validates a model. The DataScientist Agent trains it, the DataTest Agent validates it, and the quality gate passes. But the model has no production lifecycle:

Diagram: With MLOps Agent

```mermaid
flowchart LR
  A["Model Trained"] --> B["Quality Gate"]
  B --> C["Canary Deploy (5% > 10% > 25% > 50% > 100%)"]
  C --> D["Drift Monitoring (hourly)"]
  D --> E["No drift"]
  D --> F["Drift detected"]
  E --> G["Continue"]
  F --> H["Retrain"]
  H --> I["Champion-Challenger"]
  I --> J["Promote"]
  I --> K["Keep current"]
```

Diagram: Without MLOps Agent

```mermaid
flowchart LR
  A["Model Trained"] --> B["Quality Gate"]
  B --> C["Manual Deploy (hope for the best)"]
  C --> D["No monitoring"]
  D --> E["No drift detection"]
  E --> F["No retraining"]
  F --> G["No rollback plan"]
```

The deployment strategy reverts to "manual" -- meaning someone must manually deploy the model, manually check if it is healthy, and manually decide when to retrain. The health status is "unmonitored" -- no one is watching for drift, degradation, or failure.

Key Finding: The MLOps Agent does not improve the initial model. It ensures the model stays good over time. Without it, the system is a one-shot deployment with no lifecycle management. In the SimShop experiment, this means the churn model would deploy once and silently degrade as customer behavior changes, CRM data lags, and product updates shift feature distributions -- exactly the scenario that caused Sarah's Tuesday deployment to fail three weeks later.

What "unmonitored" means in practice:

- Drift goes undetected: feature distributions can shift for weeks before anyone notices.
- Degradation surfaces through downstream complaints (the VP's phone call), not through alerts.
- Retraining happens only when someone remembers to schedule it.
- There is no rollback plan if the deployed model misbehaves.

Reproducibility: 5/5 runs succeeded. Results are deterministic. Full data available at github.com/neam-lang/Data-Sims in evaluation/results/ablation_no_mlops.json.


Key Takeaways #