Chapter 23 — Error Handling and Self-Healing #

"The measure of intelligence is the ability to change." -- Albert Einstein


📖 20 min read | 👤 Priya (Data Engineer), Sarah (MLOps), David (VP Data) | 🏷️ Part VI: Orchestration

What you'll learn:

  • The four-tier error handling hierarchy: retry, fallback, graceful degradation, escalation
  • How state machine checkpointing enables resume without re-execution
  • Four self-healing patterns: schema break recovery, pipeline self-heal, provider failover, budget reallocation
  • How error classification selects the appropriate recovery tier

The Problem: The 3 AM Pipeline #

Priya's phone buzzes at 3:14 AM. The nightly ETL pipeline has failed. Again.

She opens her laptop and checks the logs. The churn feature pipeline crashed because the upstream events table had a new column (event_metadata_v2) that broke the fixed-schema SELECT statement. The pipeline did what most pipelines do when they encounter an unexpected column: it crashed. Hard.

The crash cascaded. The feature table was not updated. The model scoring job ran on stale features. The prediction API is now serving yesterday's scores. The monitoring dashboard shows a flatline, which the on-call engineer interprets as "no drift" -- when the real story is "no data."

One schema change. One crash. An entire prediction system serving stale results for 12 hours before anyone noticed.

This is what happens when error handling is an afterthought. In the Neam agent stack, error handling is a first-class architectural concern.


The Four-Tier Error Handling Hierarchy #

The Neam agent stack handles errors through a four-tier hierarchy. Each tier is tried in order; if the current tier cannot resolve the error, it escalates to the next tier.

DIAGRAM Error Handling Hierarchy
flowchart TB
  T1["Tier 1: RETRY"]
  T1D["Same agent, same task, same parameters\nExponential backoff with jitter\nMax retries configurable per agent (default: 3)\nHandles: transient LLM failures, rate limits"]
  T2["Tier 2: FALLBACK"]
  T2D["Alternative agent or alternative provider\nLLM provider chain: OpenAI → Bedrock → Ollama\nAgent substitution: primary → backup specialist\nHandles: provider outages, agent specialization"]
  T3["Tier 3: GRACEFUL DEGRADATION"]
  T3D["Reduced functionality, not total failure\nSkip non-critical phases, flag degraded output\nPartial results with explicit quality warnings\nHandles: budget exhaustion, non-critical failures"]
  T4["Tier 4: ESCALATION"]
  T4D["Human notification with full context\nState checkpoint for manual resume\nRecommended actions based on error classification\nHandles: unrecoverable errors, policy violations"]

  T1 --> T1D
  T1 -- "Cannot resolve" --> T2
  T2 --> T2D
  T2 -- "Cannot resolve" --> T3
  T3 --> T3D
  T3 -- "Cannot resolve" --> T4
  T4 --> T4D

Tier 1: Retry with Backoff #

The simplest and most common recovery. When an agent call fails, the system retries with exponential backoff:

DIAGRAM Retry Timeline
flowchart LR
  A1["Attempt 1\nt=0s"] -- "FAIL\n(rate limit)" --> A2["Attempt 2\nt=2s"]
  A2 -- "FAIL\n(rate limit)" --> A3["Attempt 3\nt=6s"]
  A3 -- "FAIL\n(rate limit)" --> A4["Attempt 4\nt=14s"]
  A4 -- "SUCCESS" --> Done["Complete"]

  style A1 fill:#ff6b6b,color:#fff
  style A2 fill:#ff6b6b,color:#fff
  style A3 fill:#ff6b6b,color:#fff
  style A4 fill:#51cf66,color:#fff
  style Done fill:#51cf66,color:#fff
Backoff Configuration

Backoff formula: delay = base * 2^(attempt-1) + random_jitter

Default base: 2 seconds (matching the retry timeline: retries at t=2s, 6s, 14s)

Max retries: 3 (configurable per agent)
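The backoff formula can be sketched directly. This is a minimal illustration, using a base of 2 seconds to match the retry timeline diagram; the function name and jitter range are illustrative assumptions.

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, max_jitter: float = 0.5) -> float:
    """Delay in seconds before retry `attempt` (1-indexed).

    Implements: delay = base * 2^(attempt-1) + random_jitter.
    The jitter range (0 to max_jitter seconds) is an illustrative choice;
    it spreads out retries so concurrent clients don't hammer in lockstep.
    """
    return base * 2 ** (attempt - 1) + random.uniform(0, max_jitter)

# Without jitter the delays are 2s, 4s, 8s, so retries fall at
# t = 2s, 6s, 14s after the first failure, as in the timeline diagram.
```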

Retries are appropriate for transient failures -- errors that will resolve themselves with time:

| Transient Error | Typical Recovery Time |
| --- | --- |
| LLM rate limit | 1-5 seconds |
| Network timeout | 2-10 seconds |
| Database connection pool exhaustion | 1-3 seconds |
| Temporary API unavailability | 5-30 seconds |

⚠️ Retries are NOT appropriate for deterministic failures. If the error is caused by bad input data or a logic error, retrying the same operation will produce the same failure. The system classifies errors before retrying: transient errors retry, deterministic errors skip to Tier 2.
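A minimal sketch of classify-before-retry, assuming an illustrative error taxonomy (the chapter does not show the real system's error classes): transient errors retry with backoff, deterministic errors propagate immediately so the caller can move to Tier 2.

```python
import time

class RateLimitError(Exception): ...   # transient: will resolve with time
class SchemaError(Exception): ...      # deterministic: same input, same failure

# Illustrative classification: which exception types are worth retrying.
TRANSIENT = (RateLimitError, TimeoutError, ConnectionError)

def call_with_retry(fn, max_retries: int = 3, base: float = 2.0):
    """One initial attempt plus up to max_retries retries with backoff."""
    for attempt in range(1, max_retries + 2):
        try:
            return fn()
        except TRANSIENT:
            if attempt > max_retries:
                raise  # retries exhausted -> caller escalates to Tier 2
            time.sleep(base * 2 ** (attempt - 1))
        # Deterministic errors (e.g. SchemaError) are not caught here:
        # retrying the same operation would produce the same failure.
```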


Tier 2: Fallback #

When retries are exhausted or the error is classified as non-transient, the system falls back to an alternative:

LLM Provider Failover #

DIAGRAM LLM Provider Failover Chain
flowchart LR
  P["OpenAI\ngpt-4o\n(cloud)"] -- "FAIL" --> F1["AWS Bedrock\nClaude\n(cloud)"]
  F1 -- "FAIL" --> F2["Ollama\nlocal\n(on-prem)"]

  style P fill:#4dabf7,color:#fff
  style F1 fill:#fab005,color:#fff
  style F2 fill:#ff6b6b,color:#fff
Failover Criteria
  • 3 consecutive failures on primary
  • Response latency > 30s (timeout)
  • Provider reports capacity issues (429/503)

In Neam, provider failover is built into the agent declaration:

NEAM
agent ChurnDS {
    provider: "openai", model: "gpt-4o",
    // Fallback chain (tried in order if primary fails)
    fallback_providers: [
        { provider: "bedrock", model: "claude-3-5-sonnet" },
        { provider: "ollama", model: "llama3.1:70b" }
    ],
    budget: AgentBudget
}
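At runtime, a declaration like this could be resolved by walking the chain in order. The sketch below assumes a generic `call_provider` hook rather than Neam's actual runtime API; the chain structure mirrors the declaration above.

```python
# Fallback chain mirroring the NEAM declaration above.
PROVIDER_CHAIN = [
    {"provider": "openai",  "model": "gpt-4o"},
    {"provider": "bedrock", "model": "claude-3-5-sonnet"},
    {"provider": "ollama",  "model": "llama3.1:70b"},
]

def call_with_failover(prompt, call_provider, chain=PROVIDER_CHAIN):
    """Try each provider in order; record which one served the call
    so the audit trail can note any fallback."""
    errors = []
    for cfg in chain:
        try:
            return {"response": call_provider(cfg, prompt), "served_by": cfg}
        except Exception as exc:  # outage, 429/503, timeout, ...
            errors.append((cfg["provider"], repr(exc)))
    raise RuntimeError(f"All providers failed: {errors}")
```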

Agent Substitution #

When a specialist agent cannot complete a task, the DIO can reassign to a different agent with overlapping capabilities:

Agent Substitution Example

Task: Feature engineering for churn model

Primary: ETLAgent (specialized, preferred)

Fallback: DataScientist (can do feature engineering, less specialized)

The DIO tracks agent capability overlap and uses it for substitution.
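Capability-overlap substitution can be sketched as a set-containment check. The capability map and function below are illustrative assumptions, not the DIO's actual data model; the agent names match the example above.

```python
# Illustrative capability map: which tasks each agent can perform.
CAPABILITIES = {
    "ETLAgent":      {"extract", "transform", "feature_engineering"},
    "DataScientist": {"feature_engineering", "model_training", "evaluation"},
    "MLOpsAgent":    {"deployment", "monitoring"},
}

def substitute(failed_agent: str, required: set) -> "str | None":
    """Return an agent (other than the failed one) whose capabilities
    cover everything the task requires, or None if no overlap exists."""
    candidates = [
        name for name, caps in CAPABILITIES.items()
        if name != failed_agent and required <= caps
    ]
    return candidates[0] if candidates else None
```

If `substitute` returns None, there is no viable fallback and the system moves on to Tier 3.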

💡 Fallback does not mean equivalent quality. A local Ollama model will typically produce lower-quality outputs than GPT-4o. The system records the fallback in the audit trail so that downstream consumers know the output may have reduced quality.


Tier 3: Graceful Degradation #

When no fallback is available, the system degrades gracefully rather than crashing:

Budget Exhaustion #

This is the most common graceful degradation scenario. When an agent's budget is exhausted mid-task:

Budget Exhaustion -- Graceful Handling

Budget: $50.00 allocated to ChurnDS

Spent: $48.70 after feature engineering

Remaining: $1.30 (insufficient for model training)


OLD behavior (crash):

raise BudgetExhaustedError("$1.30 remaining")
Pipeline halts. All prior work lost.


NEW behavior (graceful degradation):

  1. Checkpoint current state
  2. Flag output as "budget_limited"
  3. Return partial results with quality warning
  4. Notify DIO for potential budget reallocation
  5. DIO can: extend budget, accept partial, escalate
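The five-step graceful path can be sketched as a single handler that replaces the old hard crash. Function and field names here are illustrative assumptions.

```python
def handle_budget_exhaustion(state: dict, remaining: float, needed: float) -> dict:
    """Return partial results with a quality flag instead of raising."""
    if remaining >= needed:
        return {"status": "ok", "state": state}
    checkpoint = dict(state)                       # 1. checkpoint current state
    return {
        "status": "budget_limited",                # 2. flag the output
        "partial_results": checkpoint,             # 3. return what was finished
        "warning": f"${remaining:.2f} remaining, ${needed:.2f} needed",
        "notify": "DIO",                           # 4. DIO may reallocate budget
        "options": ["extend_budget", "accept_partial", "escalate"],  # 5.
    }
```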

Skipping Non-Critical Phases #

Not all phases are equally critical. The DIO classifies phases:

| Phase | Criticality | Degradation Behavior |
| --- | --- | --- |
| Requirements (BA) | HIGH | Cannot skip |
| Feature Engineering | HIGH | Cannot skip |
| Model Training | HIGH | Cannot skip |
| Causal Analysis | MEDIUM | Skip with warning |
| Quality Testing | HIGH | Cannot skip |
| Deployment | HIGH | Cannot skip |
| Monitoring Setup | MEDIUM | Skip with warning |

If the Causal Agent fails and no fallback is available, the system can proceed without causal analysis -- but the output will be explicitly flagged:

JSON
{
  "status": "completed_with_degradation",
  "degraded_phases": ["causal_analysis"],
  "warning": "Causal analysis unavailable. Model deployed without root cause identification. Correlation-only analysis.",
  "quality_impact": "Root cause will show as 'unknown'"
}

This is exactly what the DataSims ablation_no_causal experiment measures: the system completes successfully, but root cause identification drops from support_quality_degradation to unknown.

🎯 Graceful degradation is not failure. It is the difference between a system that delivers 80% of the value and a system that delivers 0% because it crashed. In the DataSims experiments, every ablation still completed all 7 phases -- the system degraded gracefully rather than halting.


Tier 4: Escalation #

When automated recovery is impossible, the system escalates to a human with full context:

Escalation Notification

To: data-team@simshop.com

Subject: [ESCALATION] Churn pipeline requires human intervention


Summary: Quality gate failed after 3 retries

What failed:

  • Phase: model_training
  • Agent: ChurnDS
  • Error: AUC below threshold (0.61 < 0.80)
  • Retries: 3/3 exhausted
  • Fallback: Attempted (Bedrock), same result

State checkpoint: /checkpoints/churn-001-phase3.json

Resume command: neam resume /checkpoints/churn-001

Recommended actions:

  1. Review training data for quality issues
  2. Check if feature distributions have shifted
  3. Consider adjusting AUC threshold
  4. Resume from checkpoint after fixing

The escalation includes everything a human needs to diagnose and resume: the failure summary, the state checkpoint, the resume command, and recommended next actions.


State Machine Checkpointing #

Every phase transition is checkpointed. If the pipeline fails at phase 5, it can resume from phase 5 without re-executing phases 1 through 4:

DIAGRAM State Machine Checkpointing
flowchart LR
  P1["Phase 1\nBA\nDONE"] -- "CP" --> P2["Phase 2\nFeat.\nDONE"]
  P2 -- "CP" --> P3["Phase 3\nModel\nDONE"]
  P3 -- "CP" --> P4["Phase 4\nCausal\nDONE"]
  P4 -- "CP\n(checkpoint)" --> P5["Phase 5\nTest\nFAIL"]

  style P1 fill:#51cf66,color:#fff
  style P2 fill:#51cf66,color:#fff
  style P3 fill:#51cf66,color:#fff
  style P4 fill:#51cf66,color:#fff
  style P5 fill:#ff6b6b,color:#fff
Resume from Checkpoint

$ neam resume /checkpoints/churn-001-phase4.json

Skips phases 1-4, resumes at phase 5 with full state.

Checkpoint contents:

JSON
{
  "pipeline_id": "churn-001",
  "phase": 4,
  "completed_phases": [
    {"phase": "requirements", "output_ref": "brd-001.json"},
    {"phase": "features", "output_ref": "features-001.parquet"},
    {"phase": "model", "output_ref": "model-001.pkl"},
    {"phase": "causal", "output_ref": "dag-001.json"}
  ],
  "agent_states": {
    "ChurnBA": {"status": "idle", "budget_remaining": 42.30},
    "ChurnDS": {"status": "idle", "budget_remaining": 38.15},
    "ChurnCausal": {"status": "idle", "budget_remaining": 45.00}
  },
  "timestamp": "2026-03-15T06:14:17Z"
}
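A resume command could consume this checkpoint by diffing completed phases against the pipeline's phase order. This sketch assumes the seven-phase sequence described in this chapter; the loader itself is illustrative, not the actual `neam resume` implementation.

```python
import json

# Assumed phase order for the churn pipeline, per this chapter's examples.
PHASES = ["requirements", "features", "model", "causal", "test", "deploy", "monitor"]

def resume_plan(checkpoint_json: str) -> list:
    """Return the phases still to run, skipping everything checkpointed as done."""
    cp = json.loads(checkpoint_json)
    done = {p["phase"] for p in cp["completed_phases"]}
    return [phase for phase in PHASES if phase not in done]
```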

Self-Healing Patterns #

Pattern 1: Schema Break Recovery #

When an upstream table schema changes:

Schema Break Self-Healing

1. ETL agent detects schema mismatch

  • Expected: [customer_id, name, email, signup_date]
  • Actual: [customer_id, name, email, signup_date, email_verified]

2. Classify the change:

  • Column added (safe: extend schema)
  • Column removed (dangerous: check dependencies)
  • Column renamed (dangerous: update references)
  • Type changed (dangerous: validate compatibility)

3. For safe changes (column added):

  • Auto-extend downstream schemas
  • Log change in audit trail
  • Notify governance agent for catalog update
  • Continue pipeline

4. For dangerous changes:

  • Halt affected pipelines
  • Run impact analysis (which features affected?)
  • Escalate to human with impact report
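Step 2, the classification, can be sketched as a column diff. The rename heuristic below is an illustrative assumption: names alone cannot distinguish a rename from an add plus a remove, so anything ambiguous is treated as dangerous.

```python
def classify_schema_change(expected: list, actual: list) -> dict:
    """Classify a schema diff as safe (auto-heal) or dangerous (escalate)."""
    added = [c for c in actual if c not in expected]
    removed = [c for c in expected if c not in actual]
    if added and not removed:
        # Column added: safe, downstream schemas can be auto-extended.
        return {"change": "column_added", "columns": added, "action": "auto_extend"}
    if removed and not added:
        return {"change": "column_removed", "columns": removed,
                "action": "halt_and_escalate"}
    if added and removed:
        # Possible rename -- ambiguous, so treated as dangerous.
        return {"change": "possible_rename", "columns": added + removed,
                "action": "halt_and_escalate"}
    return {"change": "none", "columns": [], "action": "continue"}
```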

Pattern 2: Pipeline Self-Heal #

When a pipeline step fails but the overall pipeline can recover:

DIAGRAM Pipeline Self-Heal Flow
flowchart TB
  S1["Step 1: Extract\nSUCCESS"] --> S2["Step 2: Transform\nFAIL (null values in join key)"]
  S2 --> D["Detect: 3% null rate in customer_id"]
  D --> C["Classify: data quality issue, not code bug"]
  C --> H["Heal: Apply imputation strategy from Agent.MD\n(drop rows with null join keys)"]
  H --> L["Log: Dropped 3,247 rows (3%) with null customer_id"]
  L --> R["Resume"]
  R --> S3["Step 3: Load\nSUCCESS (with reduced row count)"]
  S3 --> S4["Step 4: Validate\nSUCCESS (row count within tolerance)"]

  style S1 fill:#51cf66,color:#fff
  style S2 fill:#ff6b6b,color:#fff
  style D fill:#fab005,color:#fff
  style C fill:#fab005,color:#fff
  style H fill:#fab005,color:#fff
  style L fill:#fab005,color:#fff
  style R fill:#fab005,color:#fff
  style S3 fill:#51cf66,color:#fff
  style S4 fill:#51cf66,color:#fff
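The detect-classify-heal steps in the diagram can be sketched as follows. The 5% tolerance is an illustrative assumption; in practice it would come from the Agent.MD imputation strategy.

```python
def heal_null_join_keys(rows: list, key: str, tolerance: float = 0.05):
    """Detect null join keys, heal by dropping affected rows if the null
    rate is within tolerance, and return a log line for the audit trail."""
    null_rows = [r for r in rows if r.get(key) is None]
    rate = len(null_rows) / len(rows)
    if rate > tolerance:
        # Too many nulls: likely a code or upstream bug, not routine dirt.
        raise ValueError(f"{rate:.1%} null {key}: above tolerance, escalate")
    healed = [r for r in rows if r.get(key) is not None]
    log = f"Dropped {len(null_rows)} rows ({rate:.0%}) with null {key}"
    return healed, log
```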

Pattern 3: LLM Provider Failover #

DIAGRAM LLM Provider Failover Sequence
sequenceDiagram
  participant Agent as ChurnDS
  participant OAI as OpenAI gpt-4o
  participant BR as AWS Bedrock Claude 3.5 Sonnet

  Agent->>OAI: Call 1: feature selection
  OAI-->>Agent: SUCCESS
  Agent->>OAI: Call 2: hyperparameter search
  OAI-->>Agent: SUCCESS
  Agent->>OAI: Call 3
  OAI-->>Agent: FAIL (503 capacity)
  Agent->>OAI: Call 4 (retry)
  OAI-->>Agent: FAIL (503)
  Agent->>OAI: Call 5 (retry)
  OAI-->>Agent: FAIL (503)
  Note over Agent: Failover to Bedrock
  Agent->>BR: Call 6: model evaluation
  BR-->>Agent: SUCCESS
  Agent->>BR: Call 7: model export
  BR-->>Agent: SUCCESS
Audit Log

"Calls 1-2 via OpenAI gpt-4o. Calls 6-7 via Bedrock Claude 3.5 Sonnet (failover). Reason: OpenAI 503."

Pattern 4: Budget Reallocation #

When one agent exhausts its budget but the overall DIO budget has surplus:

| Agent | Allocated | Spent | Remaining |
| --- | --- | --- | --- |
| ChurnBA | $50.00 | $12.30 | $37.70 |
| ChurnDS | $50.00 | $48.70 | $1.30 (Exhausted) |
| ChurnCausal | $50.00 | $31.20 | $18.80 |
| ChurnTester | $50.00 | $15.00 | $35.00 |
| ChurnMLOps | $50.00 | $22.00 | $28.00 |
DIO Budget Reallocation

DIO Budget: $500.00

DIO action: Reallocate $20.00 from ChurnBA surplus to ChurnDS

New ChurnDS budget: $70.00 allocated, $21.30 remaining ($1.30 + $20.00 realloc)

ChurnBA new remaining: $17.70

💡 Budget reallocation requires DIO-level authority. Individual agents cannot transfer budget between themselves. This is a security invariant -- see Chapter 24.
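The reallocation, with the authority check enforced, can be sketched as follows. The function signature is an illustrative assumption; the numbers match the table above.

```python
def reallocate(budgets: dict, donor: str, recipient: str, amount: float,
               authority: str) -> dict:
    """Move budget between agents -- DIO authority only (security invariant)."""
    if authority != "DIO":
        raise PermissionError("Only the DIO may transfer budget between agents")
    if budgets[donor]["remaining"] < amount:
        raise ValueError(f"{donor} surplus insufficient for ${amount:.2f}")
    new = {a: dict(b) for a, b in budgets.items()}  # copy, don't mutate input
    new[donor]["remaining"] -= amount
    new[recipient]["remaining"] += amount
    new[recipient]["allocated"] += amount
    return new
```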


Error Classification #

The system classifies every error to determine the appropriate recovery tier:

| Error Type | Tier | Recovery Action |
| --- | --- | --- |
| LLM rate limit (429) | 1 | Retry with backoff |
| Network timeout | 1 | Retry with backoff |
| LLM provider outage | 2 | Failover to backup provider |
| Agent capability gap | 2 | Substitute with backup agent |
| Budget exhaustion | 3 | Graceful degradation + realloc |
| Non-critical phase fail | 3 | Skip with warning flag |
| Quality gate failure | 4 | Escalate to human |
| Policy violation | 4 | Halt and escalate immediately |
| Data corruption detected | 4 | Halt, checkpoint, escalate |
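The classification reads naturally as a lookup table. This is a sketch, with keys paraphrasing the error types above; the defaulting rule (unknown errors go straight to Tier 4) is an illustrative assumption.

```python
# Error type -> (tier, recovery action), mirroring the classification table.
ERROR_TIERS = {
    "rate_limit_429":         (1, "retry_with_backoff"),
    "network_timeout":        (1, "retry_with_backoff"),
    "provider_outage":        (2, "failover_provider"),
    "capability_gap":         (2, "substitute_agent"),
    "budget_exhaustion":      (3, "degrade_and_reallocate"),
    "noncritical_phase_fail": (3, "skip_with_warning"),
    "quality_gate_failure":   (4, "escalate_to_human"),
    "policy_violation":       (4, "halt_and_escalate"),
    "data_corruption":        (4, "halt_checkpoint_escalate"),
}

def recovery_for(error_type: str) -> tuple:
    # Unknown errors escalate rather than guess a recovery path.
    return ERROR_TIERS.get(error_type, (4, "escalate_to_human"))
```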

DataSims Evidence: Error Handling in Practice #

The DataSims ablation experiments demonstrate error handling outcomes across different failure scenarios:

| Ablation | Simulated Failure | Recovery | Outcome |
| --- | --- | --- | --- |
| ablation_no_test | Quality gates removed | Graceful degradation | Gate: "skipped" (not "crashed") |
| ablation_no_mlops | No deployment agent | Fallback to manual deploy | Strategy: "manual" (not "failed") |
| ablation_no_causal | No causal analysis | Skip non-critical phase | Root cause: "unknown" (not "error") |
| ablation_no_gates | Gates disabled | Degraded governance | Gate: "bypassed" (not "crashed") |

In every ablation, the system completed all 7 phases. No ablation caused a crash. This is the self-healing design in action: the system degrades to a lower quality level rather than failing entirely.

From the DataSims evaluation, the full system cost was $23.50 per run. Error handling overhead (retries, fallbacks) added less than 5% to the total cost in normal operation.


Key Takeaways #

For Further Exploration #