Chapter 20 -- The MLOps Agent: Production Guardian #
"Everyone has a plan until they get punched in the mouth." -- Mike Tyson
25 min read | Sarah (MLOps), Marcus (DS), Priya (DE), David (VP) | Part V: Analytical Intelligence
What you'll learn:
- Why Day-2 operations are where most ML projects actually fail
- The 6 types of drift: data, concept, prediction, label, feature, and schema
- Continuous training pipelines with automated retraining triggers
- Deployment strategies (canary, shadow, blue-green, A/B) with pros, cons, and selection criteria
- Champion-challenger evaluation for safe model promotion
- Serving infrastructure management across real-time, batch, and edge
- Business KPI tracking that connects model metrics to revenue impact
- DataSims proof: ablation A4 reverts deployment strategy to "manual" and health to "unmonitored"
The Problem: The Model That Worked on Tuesday #
Sarah deployed the churn model on a Tuesday. AUC of 0.847. Quality gates passed. Canary deployment looked clean. She closed the deployment ticket and moved on to the next project.
Three weeks later, the VP of Customer Success called. "Your model identified 200 customers as high-churn-risk this week. We reached out to all of them. Eighteen of them had already churned before the model scored them. Another forty said they had no intention of leaving and found the outreach annoying. What is going on?"
What happened was drift. Not a single dramatic failure, but a slow degradation that no one was monitoring. A new product launch changed customer behavior patterns. A pricing change shifted the spend distribution. A CRM migration introduced a 48-hour lag in the support ticket data. None of these changes broke the model -- they made it quietly, invisibly wrong.
This is the Day-2 problem. Building the model is Day 0. Deploying it is Day 1. Keeping it healthy in production -- while the world changes around it -- is Day 2 through Day infinity. And it is where most ML projects silently fail.
The 6 Types of Drift #
Drift is not a single phenomenon. The MLOps Agent monitors six distinct types, each with different detection methods and remediation strategies.
| Type | What Changes | Detection Method |
|---|---|---|
| Data Drift | Input feature distributions shift | KS test, PSI, Jensen-Shannon divergence |
| Concept Drift | P(Y|X) changes -- same inputs, different correct outputs | Performance monitoring, label delay tracking |
| Prediction Drift | Model output distribution shifts (even if accuracy holds) | Output distribution monitoring, KL divergence |
| Label Drift | Ground truth label distribution changes | Label distribution monitoring, class balance tracking |
| Feature Drift | Individual feature distributions shift (subset of data drift) | Per-feature PSI, correlation tracking |
| Schema Drift | Column names, types, or availability change | Schema comparison, contract validation |
drift_monitor ChurnDrift {
model: "ChurnModel",
baseline: {
dataset: "ml_features.customer_360_baseline",
timestamp: "2025-12-31"
},
monitors: [
{
type: "data_drift",
method: "psi", // Population Stability Index
threshold: 0.2, // >0.2 = significant drift
features: "all",
frequency: "hourly"
},
{
type: "concept_drift",
method: "performance_decay",
metric: "auc_roc",
threshold: 0.05, // >5% drop = alert
window: "7d",
frequency: "daily"
},
{
type: "prediction_drift",
method: "kl_divergence",
threshold: 0.1,
frequency: "hourly"
},
{
type: "feature_drift",
method: "psi",
threshold: 0.15,
features: ["days_since_last_order", "support_tickets_30d",
"login_trend_30d", "spend_trend_30d",
"cart_abandonment_rate"],
frequency: "hourly"
},
{
type: "schema_drift",
method: "contract_validation",
expected_schema: ChurnFeatures,
frequency: "on_ingestion"
}
],
alerts: {
channels: ["slack", "pagerduty"],
escalation: {
warning: "data-team",
critical: "oncall"
}
}
}
Key Insight: Data drift and concept drift are fundamentally different problems with different solutions. Data drift means the inputs changed -- maybe your feature pipeline broke, or the real world shifted. Concept drift means the relationship between inputs and outputs changed -- the model's logic is stale even if the data looks normal. The MLOps Agent monitors both independently.
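The PSI method referenced in the monitor above reduces to a short calculation. Here is a minimal Python sketch, assuming equal-frequency buckets derived from baseline quantiles; the function and variable names are illustrative, not part of the Neam runtime:

```python
import bisect
import math
import random

def psi(baseline, current, buckets=10):
    """Population Stability Index between a baseline and a current sample.

    Bucket edges come from baseline quantiles so each bucket holds roughly
    an equal share of the baseline; shares are floored at a small epsilon
    to avoid log(0) on empty buckets.
    """
    srt = sorted(baseline)
    edges = [srt[len(srt) * i // buckets] for i in range(1, buckets)]

    def shares(sample):
        counts = [0] * buckets
        for x in sample:
            counts[bisect.bisect_right(edges, x)] += 1
        return [max(c / len(sample), 1e-6) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

random.seed(0)
base = [random.gauss(0, 1) for _ in range(5000)]
same = [random.gauss(0, 1) for _ in range(5000)]
moved = [random.gauss(0.8, 1) for _ in range(5000)]
# psi(base, same) stays near zero; psi(base, moved) crosses the 0.2 threshold
```

By convention, PSI below 0.1 reads as stable, 0.1 to 0.2 as moderate shift, and above 0.2 (the monitor's threshold) as significant drift.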
Continuous Training Pipelines #
When drift is detected, the MLOps Agent can trigger automated retraining. The retraining_pipeline declaration defines when, how, and under what constraints retraining occurs.
retraining_pipeline ChurnRetrain {
model: "ChurnModel",
triggers: [
{
type: "drift_detected",
source: ChurnDrift,
min_severity: "warning",
cooldown_hours: 24 // don't retrain more than daily
},
{
type: "performance_decay",
metric: "auc_roc",
threshold: 0.80, // retrain if AUC drops below target
window: "7d"
},
{
type: "scheduled",
cron: "0 2 * * 0", // Weekly at 2 AM Sunday
always_run: false // skip if no drift detected
},
{
type: "data_volume",
min_new_labels: 1000, // retrain when 1000 new labels available
label_table: "ml_labels.churn_actuals"
}
],
training: {
use_experiment: "ChurnModel", // reuse DS agent's experiment config
data_window: "rolling_12_months",
validation: "temporal_split",
auto_hyperparameter_tune: true,
max_training_hours: 4
},
promotion: {
strategy: "champion_challenger",
min_improvement: 0.01, // new model must beat by >= 1%
metric: "auc_roc",
require_gate: true // must pass quality gate
}
}
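The drift trigger's cooldown logic is easy to reason about in isolation: fire only when severity meets the floor and enough time has passed since the last retrain. A minimal Python sketch, with hypothetical names mirroring the trigger fields above:

```python
from datetime import datetime, timedelta

def should_retrain(drift_detected: bool, severity: str,
                   last_retrain: datetime, now: datetime,
                   min_severity: str = "warning",
                   cooldown_hours: int = 24) -> bool:
    """Fire the drift trigger only if severity meets the floor and the
    cooldown window since the last retrain has elapsed."""
    rank = {"info": 0, "warning": 1, "critical": 2}
    if not drift_detected or rank[severity] < rank[min_severity]:
        return False
    return now - last_retrain >= timedelta(hours=cooldown_hours)

now = datetime(2026, 1, 10, 12, 0)
# warning-level drift, last retrain 30h ago -> retrain
# same drift, last retrain 6h ago -> suppressed by the 24h cooldown
```

The cooldown is what prevents a noisy drift monitor from thrashing the training cluster with back-to-back retrains.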
Deployment Strategies #
The deployment_strategy declaration manages how new models reach production. Four strategies are supported, each with distinct risk profiles.
| Strategy | Risk | Rollback | Cost | Best For |
|---|---|---|---|---|
| Canary | Low | Fast | Low | Most cases |
| Shadow | None | N/A | Medium | High-risk models |
| Blue-Green | Low | Instant | High | Zero-downtime |
| A/B Test | Medium | Medium | Medium | Business impact |
Canary Deployment #
Route a small percentage of traffic to the new model. Monitor. Gradually increase if healthy.
deployment_strategy CanaryDeploy {
type: "canary",
stages: [
{ traffic_pct: 5, duration: "1h", gate: "latency_and_errors" },
{ traffic_pct: 10, duration: "4h", gate: "latency_errors_metrics" },
{ traffic_pct: 25, duration: "12h", gate: "full_metrics" },
{ traffic_pct: 50, duration: "24h", gate: "full_metrics" },
{ traffic_pct: 100, duration: "stable", gate: "none" }
],
rollback: {
auto: true,
conditions: [
"error_rate > 0.01",
"p99_latency_ms > 300",
"auc_degradation > 0.05"
]
}
}
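The automated rollback amounts to checking each declared condition against live canary metrics on every evaluation tick. A minimal Python sketch (names and metric values are illustrative):

```python
def check_rollback(metrics: dict, conditions: dict) -> list:
    """Return the list of violated rollback conditions for a canary stage.

    `conditions` maps a metric name to its maximum allowed value, mirroring
    the thresholds in the CanaryDeploy declaration above.
    """
    return [name for name, limit in conditions.items()
            if metrics.get(name, 0.0) > limit]

rollback_conditions = {
    "error_rate": 0.01,
    "p99_latency_ms": 300,
    "auc_degradation": 0.05,
}

healthy = {"error_rate": 0.002, "p99_latency_ms": 180, "auc_degradation": 0.01}
degraded = {"error_rate": 0.002, "p99_latency_ms": 420, "auc_degradation": 0.06}
# healthy canary: no violations; degraded canary: latency and AUC both trip
```

Any non-empty violation list would halt the stage progression and route traffic back to the champion.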
Shadow Deployment #
Run the new model in parallel without serving its predictions. Compare against the champion.
deployment_strategy ShadowDeploy {
type: "shadow",
duration: "7d",
comparison: {
champion: "churn_model_v3",
challenger: "churn_model_v4",
metrics: ["auc_roc", "f1", "precision_at_10", "latency_p99"]
},
promotion_criteria: {
challenger_wins_on: ["auc_roc", "f1"],
max_latency_increase_pct: 10
}
}
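The shadow run's promotion decision combines per-metric wins with a latency budget. A minimal Python sketch of that check, with illustrative metric values (the function name and dict layout are assumptions, not Neam APIs):

```python
def shadow_promote(champion: dict, challenger: dict,
                   must_win: list, max_latency_increase_pct: float) -> bool:
    """Promotion check after a shadow run: the challenger must win on every
    required quality metric and stay within the latency budget."""
    wins = all(challenger[m] > champion[m] for m in must_win)
    latency_ok = (challenger["latency_p99"] <=
                  champion["latency_p99"] * (1 + max_latency_increase_pct / 100))
    return wins and latency_ok

champ = {"auc_roc": 0.847, "f1": 0.62, "latency_p99": 180}
chall = {"auc_roc": 0.861, "f1": 0.64, "latency_p99": 192}
ok = shadow_promote(champ, chall, ["auc_roc", "f1"], max_latency_increase_pct=10)
# 192ms is within 10% of 180ms and both quality metrics improved -> promote
```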
Blue-Green Deployment #
Maintain two identical environments. Switch traffic instantaneously.
deployment_strategy BlueGreenDeploy {
type: "blue_green",
environments: {
blue: { endpoint: "/v1/churn/predict-blue", model: "v3" },
green: { endpoint: "/v1/churn/predict-green", model: "v4" }
},
switch: {
method: "dns", // or "load_balancer"
health_check_seconds: 30,
rollback_timeout_seconds: 300
}
}
A/B Test Deployment #
Split traffic for statistical comparison of business outcomes.
deployment_strategy ABTestDeploy {
type: "ab_test",
variants: {
control: { model: "v3", traffic_pct: 50 },
treatment: { model: "v4", traffic_pct: 50 }
},
success_metric: "intervention_conversion_rate",
duration: "14d",
statistical: {
test: "chi_squared",
significance: 0.05,
min_sample_size: 1000
}
}
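The chi-squared test named in the statistical block can be computed directly for a 2x2 conversion table. A self-contained Python sketch; the traffic and conversion numbers are invented for illustration:

```python
def chi_squared_2x2(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pearson chi-squared statistic for a 2x2 table
    (converted vs. not, control vs. treatment)."""
    table = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    col = [conv_a + conv_b, total - conv_a - conv_b]
    row = [n_a, n_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# 50/50 split: 12% vs 15% conversion on 2,000 users per arm
stat = chi_squared_2x2(conv_a=240, n_a=2000, conv_b=300, n_b=2000)
significant = stat > 3.841  # critical value for df=1 at alpha = 0.05
```

With identical conversion rates the statistic is zero; the 3-point lift here clears the 0.05 significance bar comfortably.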
When to Use Each Strategy:
- Canary: Default choice. Use when you have clear rollback metrics
- Shadow: Use for high-risk models (fraud, credit scoring) where bad predictions have immediate financial impact
- Blue-Green: Use when you need zero-downtime deployment and can afford running two environments
- A/B: Use when you need to measure business impact, not just model metrics
Champion-Challenger Evaluation #
The champion_challenger declaration automates the decision of whether a retrained model should replace the current production model.
champion_challenger ChurnChallenger {
champion: {
model: "churn_model_v3",
registry: "mlflow",
stage: "production"
},
challenger: {
model: "churn_model_v4",
registry: "mlflow",
stage: "staging"
},
evaluation: {
dataset: "ml_features.customer_360_holdout",
metrics: [
{ name: "auc_roc", weight: 0.4, direction: "higher" },
{ name: "f1", weight: 0.3, direction: "higher" },
{ name: "inference_ms", weight: 0.2, direction: "lower" },
{ name: "model_size_mb", weight: 0.1, direction: "lower" }
],
min_weighted_improvement: 0.01
},
promotion: {
auto_promote: true,
require_quality_gate: true,
notify: ["data-team@company.com"],
deployment_strategy: CanaryDeploy
}
}
flowchart TD
  A["Retrained Model (Challenger)"] --> B["Evaluate on holdout set"]
  B --> C["Weighted score comparison"]
  C --> D["Challenger wins"]
  C --> E["Tie / within tolerance"]
  C --> F["Champion wins"]
  D --> G["Quality Gate check (blocking)"]
  E --> H["Keep Champion, Log result"]
  F --> I["Keep Champion, Archive Challenger"]
  G --> J["PASS"]
  G --> K["FAIL"]
  J --> L["Promote via Canary deployment"]
  K --> M["Block promotion, Investigate"]
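The weighted score comparison can be made concrete as a relative-improvement calculation. A minimal Python sketch, assuming each metric contributes its relative change versus the champion, signed by direction (this normalization is an assumption; the declaration does not specify one):

```python
def weighted_improvement(champion: dict, challenger: dict, spec: list) -> float:
    """Weighted relative improvement of the challenger over the champion.

    Each metric contributes its relative change, negated for
    'lower is better' metrics so improvement is always positive,
    then weighted per the evaluation spec.
    """
    total = 0.0
    for m in spec:
        champ, chall = champion[m["name"]], challenger[m["name"]]
        rel = (chall - champ) / champ
        if m["direction"] == "lower":
            rel = -rel  # a decrease counts as improvement
        total += m["weight"] * rel
    return total

spec = [
    {"name": "auc_roc",       "weight": 0.4, "direction": "higher"},
    {"name": "f1",            "weight": 0.3, "direction": "higher"},
    {"name": "inference_ms",  "weight": 0.2, "direction": "lower"},
    {"name": "model_size_mb", "weight": 0.1, "direction": "lower"},
]
v3 = {"auc_roc": 0.847, "f1": 0.62, "inference_ms": 40, "model_size_mb": 120}
v4 = {"auc_roc": 0.861, "f1": 0.64, "inference_ms": 38, "model_size_mb": 118}

promote = weighted_improvement(v3, v4, spec) >= 0.01  # min_weighted_improvement
```

The illustrative v4 numbers improve every metric slightly, so the weighted improvement clears the 1% promotion floor.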
Serving Infrastructure #
The serving_infra declaration manages inference infrastructure across three serving patterns.
serving_infra ChurnServing {
patterns: [
{
name: "real_time",
endpoint: "/v1/churn/predict",
framework: "fastapi",
instances: { min: 2, max: 10 },
autoscaling: {
metric: "requests_per_second",
target: 100,
scale_up_cooldown: 60,
scale_down_cooldown: 300
},
sla: {
latency_p99_ms: 200,
availability: 0.999,
throughput_rps: 500
}
},
{
name: "batch",
schedule: "0 6 * * *", // Daily at 6 AM
input: "ml_features.customer_360",
output: "ml_predictions.churn_daily",
compute: "databricks",
timeout_minutes: 60
}
]
}
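The real-time pattern's autoscaling is target tracking on requests per second: size the fleet so each instance sits near the target load, clamped to the declared range. A minimal Python sketch of the sizing rule (names illustrative):

```python
import math

def desired_instances(current_rps: float, per_instance_target: float = 100,
                      min_instances: int = 2, max_instances: int = 10) -> int:
    """Target-tracking autoscaling: enough instances to keep each one at
    roughly the target requests-per-second, clamped to the declared range."""
    needed = math.ceil(current_rps / per_instance_target)
    return max(min_instances, min(max_instances, needed))

# 40 rps -> floor of 2; 450 rps -> 5 instances; 2,000 rps -> capped at 10
```

The asymmetric cooldowns in the declaration (60s up, 300s down) exist because scaling up late violates the SLA, while scaling down late only wastes money.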
Business KPI Tracking #
Model metrics (AUC, F1) are proxy measures. The MLOps Agent also tracks the business outcomes that the model is designed to influence.
business_kpi_tracker ChurnKPIs {
model: "ChurnModel",
kpis: [
{
name: "churn_rate",
query: "SELECT COUNT(CASE WHEN churned THEN 1 END)::float /
COUNT(*) FROM customers WHERE segment = 'enterprise'",
target: 0.08, // goal: reduce to 8%
baseline: 0.14, // current: 14%
frequency: "weekly"
},
{
name: "intervention_roi",
query: "SELECT SUM(retained_arr) / SUM(intervention_cost)
FROM retention_campaigns WHERE model_version = $current",
target: 5.0, // $5 retained per $1 spent
frequency: "monthly"
},
{
name: "net_revenue_retention",
query: "SELECT (end_arr - churn_arr + expansion_arr) / start_arr
FROM arr_summary WHERE period = $current_quarter",
target: 1.10, // 110% NRR
frequency: "quarterly"
}
],
alerts: {
kpi_degradation_pct: 10, // alert if KPI worsens by >10%
channel: "slack"
}
}
Why Business KPIs Matter: A model can have perfect AUC and still fail the business. If the model correctly predicts churn but the retention team cannot act on the predictions fast enough, business KPIs degrade while model metrics remain green. The MLOps Agent tracks both to catch this disconnect.
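The kpi_degradation_pct alert can be sketched as a direction-aware percentage check. A minimal Python illustration; inferring the "good" direction from where the target sits relative to the baseline is an assumption, not specified by the declaration:

```python
def kpi_alert(baseline: float, current: float, target: float,
              degradation_pct: float = 10.0) -> bool:
    """Alert when a KPI moves more than `degradation_pct` percent away from
    baseline in the wrong direction. Direction is inferred from whether the
    target sits above or below the baseline."""
    improving_up = target > baseline
    change_pct = (current - baseline) / abs(baseline) * 100
    return (-change_pct if improving_up else change_pct) > degradation_pct

# churn_rate: baseline 0.14, target 0.08 (lower is better)
# a rise to 0.16 is a ~14% degradation -> alert fires
```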
The Complete MLOps Agent Declaration #
// ═══ BUDGET ═══
budget MLOpsBudget { cost: 50.00, tokens: 500000 }
// ═══ MLOPS AGENT ═══
mlops agent ChurnMLOps {
provider: "openai",
model: "gpt-4o",
budget: MLOpsBudget
}
// ═══ OPERATIONS ═══
// Start monitoring
let monitor_status = mlops_start_monitor(ChurnMLOps, ChurnDrift)
print("Drift monitoring: " + str(monitor_status))
// Check drift
let drift_result = mlops_check_drift(ChurnMLOps, ChurnDrift)
if drift_result.drift_detected {
print("DRIFT DETECTED: " + str(drift_result.details))
// Trigger retraining
let retrain_result = mlops_retrain(ChurnMLOps, ChurnRetrain)
// Evaluate champion vs challenger
let eval_result = mlops_evaluate(ChurnMLOps, ChurnChallenger)
if eval_result.challenger_wins {
// Deploy via canary
let deploy_result = mlops_deploy(ChurnMLOps, CanaryDeploy)
print("Canary deployment started: " + str(deploy_result))
}
}
// Track business KPIs
let kpi_report = mlops_track_kpis(ChurnMLOps, ChurnKPIs)
print("Business KPIs: " + str(kpi_report))
Industry Perspective #
MLOps maturity varies dramatically across the industry. Google's MLOps maturity model defines three levels:
- Level 0 (Manual): Data scientists hand off notebooks to engineers. No automation. No monitoring. This is where ablation A4 leaves the system.
- Level 1 (ML Pipeline Automation): Training is automated, but deployment and monitoring are manual.
- Level 2 (CI/CD for ML): Training, validation, deployment, and monitoring are fully automated with quality gates.
The Neam MLOps Agent operates at Level 2. The DataSims experiment demonstrates this: continuous drift monitoring, automated retraining triggers, champion-challenger evaluation, canary deployment, and business KPI tracking -- all declaratively specified and automatically executed.
According to Algorithmia's enterprise ML survey, organizations spend 45% of their ML engineering time on deployment and monitoring -- more than on model development itself. The MLOps Agent automates the repetitive 80% of that 45%, freeing ML engineers to focus on the 20% that requires human judgment: defining monitoring thresholds, selecting deployment strategies, and interpreting business KPI trends.
Evidence: DataSims Experimental Proof #
Experiment: Ablation A4 -- System Without MLOps Agent #
Setup: The full SimShop churn prediction workflow was run 5 times with the MLOps Agent disabled (ablation no_mlops). All other agents remained active.
Results:
| Metric | Full System | Without MLOps | Delta |
|---|---|---|---|
| Deploy Strategy | canary | manual | Degraded |
| Deploy Health | healthy | unmonitored | Degraded |
| Drift Detection | active | -- | Lost |
| Model AUC | 0.847 | 0.847 | No change |
| Test Coverage | 0.94 | 0.94 | No change |
| Quality Gate | passed | passed | No change |
| Root Cause | support_quality_degradation | support_quality_degradation | No change |
Analysis:
Without the MLOps Agent, the system still builds and validates a model. The DataScientist Agent trains it, the DataTest Agent validates it, and the quality gate passes. But the model has no production lifecycle:
With the MLOps Agent, the lifecycle is a closed loop:
flowchart LR
  A["Model Trained"] --> B["Quality Gate"]
  B --> C["Canary Deploy (5% > 10% > 25% > 50% > 100%)"]
  C --> D["Drift Monitoring (hourly)"]
  D --> E["No drift"]
  D --> F["Drift detected"]
  E --> G["Continue"]
  F --> H["Retrain"]
  H --> I["Champion-Challenger"]
  I --> J["Promote"]
  I --> K["Keep current"]
Without it, the lifecycle ends at deployment:
flowchart LR
  A["Model Trained"] --> B["Quality Gate"]
  B --> C["Manual Deploy (hope for the best)"]
  C --> D["No monitoring"]
  D --> E["No drift detection"]
  E --> F["No retraining"]
  F --> G["No rollback plan"]
The deployment strategy reverts to "manual" -- meaning someone must manually deploy the model, manually check if it is healthy, and manually decide when to retrain. The health status is "unmonitored" -- no one is watching for drift, degradation, or failure.
Key Finding: The MLOps Agent does not improve the initial model. It ensures the model stays good over time. Without it, the system is a one-shot deployment with no lifecycle management. In the SimShop experiment, this means the churn model would deploy once and silently degrade as customer behavior changes, CRM data lags, and product updates shift feature distributions -- exactly the scenario that caused Sarah's Tuesday deployment to fail three weeks later.
What "unmonitored" means in practice:
- No drift detection -- data distribution shifts go unnoticed
- No retraining triggers -- the model stagnates while the world changes
- No canary deployment -- new models go to 100% traffic immediately, with no safety net
- No rollback plan -- if the model fails, there is no automated recovery
- No business KPI tracking -- model metrics and business outcomes are disconnected
Reproducibility: 5/5 runs succeeded. Results are deterministic. Full data available at github.com/neam-lang/Data-Sims in evaluation/results/ablation_no_mlops.json.
Key Takeaways #
- Day-2 operations are where most ML projects silently fail -- not with a crash, but with a slow, unmonitored drift into irrelevance
- The 6 types of drift (data, concept, prediction, label, feature, schema) require distinct detection methods and remediation strategies
- Continuous training pipelines with automated triggers (drift, performance decay, schedule, data volume) prevent model staleness
- Four deployment strategies (canary, shadow, blue-green, A/B) offer different risk-reward tradeoffs; canary is the default for most cases
- Champion-challenger evaluation automates the "should we promote this retrained model?" decision with weighted multi-metric scoring
- Business KPI tracking bridges the gap between model metrics (AUC) and business outcomes (churn rate, revenue retention, intervention ROI)
- The MLOps Agent is the production guardian -- it does not build models, it keeps them healthy
- DataSims ablation A4 proves: without the MLOps Agent, deployment strategy reverts to "manual" and health becomes "unmonitored" -- a one-shot deployment with no lifecycle management