Chapter 23 — Error Handling and Self-Healing #
"The measure of intelligence is the ability to change." -- Albert Einstein
📖 20 min read | 👤 Priya (Data Engineer), Sarah (MLOps), David (VP Data) | 🏷️ Part VI: Orchestration
What you'll learn:
- The four-tier error handling hierarchy: Retry, Fallback, Graceful Degradation, Escalation
- How state machine checkpointing enables recovery without re-execution
- Self-healing patterns for schema breaks, budget exhaustion, and provider failures
- Why crash-on-error is the worst possible design for agent systems
The Problem: The 3 AM Pipeline #
Priya's phone buzzes at 3:14 AM. The nightly ETL pipeline has failed. Again.
She opens her laptop and checks the logs. The churn feature pipeline crashed because the upstream events table had a new column (event_metadata_v2) that broke the fixed-schema SELECT statement. The pipeline did what most pipelines do when they encounter an unexpected column: it crashed. Hard.
The crash cascaded. The feature table was not updated. The model scoring job ran on stale features. The prediction API is now serving yesterday's scores. The monitoring dashboard shows a flatline, which the on-call engineer interpreted as "no drift" -- when the real story is "no data."
One schema change. One crash. An entire prediction system serving stale results for 12 hours before anyone noticed.
This is what happens when error handling is an afterthought. In the Neam agent stack, error handling is a first-class architectural concern.
The Four-Tier Error Handling Hierarchy #
The Neam agent stack handles errors through a four-tier hierarchy. Each tier is tried in order; if the current tier cannot resolve the error, it escalates to the next tier.
```mermaid
flowchart TB
    T1["Tier 1: RETRY"]
    T1D["Same agent, same task, same parameters\nExponential backoff with jitter\nMax retries configurable per agent (default: 3)\nHandles: transient LLM failures, rate limits"]
    T2["Tier 2: FALLBACK"]
    T2D["Alternative agent or alternative provider\nLLM provider chain: OpenAI → Bedrock → Ollama\nAgent substitution: primary → backup specialist\nHandles: provider outages, agent specialization"]
    T3["Tier 3: GRACEFUL DEGRADATION"]
    T3D["Reduced functionality, not total failure\nSkip non-critical phases, flag degraded output\nPartial results with explicit quality warnings\nHandles: budget exhaustion, non-critical failures"]
    T4["Tier 4: ESCALATION"]
    T4D["Human notification with full context\nState checkpoint for manual resume\nRecommended actions based on error classification\nHandles: unrecoverable errors, policy violations"]
    T1 --> T1D
    T1 -- "Cannot resolve" --> T2
    T2 --> T2D
    T2 -- "Cannot resolve" --> T3
    T3 --> T3D
    T3 -- "Cannot resolve" --> T4
    T4 --> T4D
```
Tier 1: Retry with Backoff #
The simplest and most common recovery. When an agent call fails, the system retries with exponential backoff:
```mermaid
flowchart LR
    A1["Attempt 1\nt=0s"] -- "FAIL\n(rate limit)" --> A2["Attempt 2\nt=2s"]
    A2 -- "FAIL\n(rate limit)" --> A3["Attempt 3\nt=6s"]
    A3 -- "FAIL\n(rate limit)" --> A4["Attempt 4\nt=14s"]
    A4 -- "SUCCESS" --> Done["Complete"]
    style A1 fill:#ff6b6b,color:#fff
    style A2 fill:#ff6b6b,color:#fff
    style A3 fill:#ff6b6b,color:#fff
    style A4 fill:#51cf66,color:#fff
    style Done fill:#51cf66,color:#fff
```
- Backoff formula: `delay = base * 2^(attempt-1) + random_jitter`
- Default base: 1 second (the timeline above, with delays of 2 s, 4 s, and 8 s, illustrates a base of 2)
- Max retries: 3 (configurable per agent)
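The retry loop can be sketched in a few lines of Python. This is a minimal illustration of the Tier 1 behavior, not the Neam runtime; the function names are ours, and the error-classification callback anticipates the transient-vs-deterministic rule described below:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, max_jitter: float = 0.5) -> float:
    """Tier 1 formula: delay = base * 2^(attempt-1) + random_jitter."""
    return base * 2 ** (attempt - 1) + random.uniform(0.0, max_jitter)

def retry_with_backoff(call, is_transient, max_retries: int = 3,
                       base: float = 1.0, max_jitter: float = 0.5):
    """Retry `call` on transient errors only. Deterministic errors and
    exhausted retries propagate so the next tier (fallback) can take over."""
    for attempt in range(1, max_retries + 2):  # initial attempt + max_retries
        try:
            return call()
        except Exception as err:
            if not is_transient(err) or attempt > max_retries:
                raise  # hand off to Tier 2
            time.sleep(backoff_delay(attempt, base, max_jitter))
```

The jitter term matters in practice: without it, many agents rate-limited at the same moment would all retry at the same moment and collide again.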
Retries are appropriate for transient failures -- errors that will resolve themselves with time:
| Transient Error | Typical Recovery Time |
|---|---|
| LLM rate limit | 1-5 seconds |
| Network timeout | 2-10 seconds |
| Database connection pool exhaustion | 1-3 seconds |
| Temporary API unavailability | 5-30 seconds |
⚠️ Retries are NOT appropriate for deterministic failures. If the error is caused by bad input data or a logic error, retrying the same operation will produce the same failure. The system classifies errors before retrying: transient errors retry, deterministic errors skip to Tier 2.
Tier 2: Fallback #
When retries are exhausted or the error is classified as non-transient, the system falls back to an alternative:
LLM Provider Failover #
```mermaid
flowchart LR
    P["OpenAI\ngpt-4o\n(cloud)"] -- "FAIL" --> F1["AWS Bedrock\nClaude\n(cloud)"]
    F1 -- "FAIL" --> F2["Ollama\nlocal\n(on-prem)"]
    style P fill:#4dabf7,color:#fff
    style F1 fill:#fab005,color:#fff
    style F2 fill:#ff6b6b,color:#fff
```
Failover triggers:
- 3 consecutive failures on primary
- Response latency > 30s (timeout)
- Provider reports capacity issues (429/503)
In Neam, provider failover is built into the agent declaration:
```
agent ChurnDS {
  provider: "openai", model: "gpt-4o",
  // Fallback chain (tried in order if primary fails)
  fallback_providers: [
    { provider: "bedrock", model: "claude-3-5-sonnet" },
    { provider: "ollama", model: "llama3.1:70b" }
  ],
  budget: AgentBudget
}
```
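At runtime, a declaration like this amounts to walking the chain in order and recording any failover for the audit trail. A Python sketch under our own names (`call_with_failover` and the `call_provider` callback are illustrative, not Neam APIs):

```python
# Provider chain mirroring the ChurnDS declaration above (illustrative).
FALLBACK_CHAIN = [
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "bedrock", "model": "claude-3-5-sonnet"},
    {"provider": "ollama", "model": "llama3.1:70b"},
]

def call_with_failover(prompt, providers, call_provider, audit_log):
    """Try each provider in order; record any failover in the audit trail."""
    last_err = None
    for i, cfg in enumerate(providers):
        try:
            result = call_provider(cfg, prompt)
            if i > 0:  # succeeded on a fallback: flag possibly reduced quality
                audit_log.append({"event": "failover",
                                  "provider": cfg["provider"],
                                  "reason": str(last_err)})
            return result
        except Exception as err:
            last_err = err
    raise RuntimeError(f"all providers failed: {last_err}")
```

Note that the audit entry is written only when a fallback succeeds; that is what lets downstream consumers see which outputs came from a lower-quality provider.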
Agent Substitution #
When a specialist agent cannot complete a task, the DIO can reassign to a different agent with overlapping capabilities:
- Task: Feature engineering for churn model
- Primary: ETLAgent (specialized, preferred)
- Fallback: DataScientist (can do feature engineering, less specialized)

The DIO tracks agent capability overlap and uses it for substitution.
💡 Fallback does not mean equivalent quality. A local Ollama model will typically produce lower-quality outputs than GPT-4o. The system records the fallback in the audit trail so that downstream consumers know the output may have reduced quality.
Tier 3: Graceful Degradation #
When no fallback is available, the system degrades gracefully rather than crashing:
Budget Exhaustion #
This is the most common graceful degradation scenario. When an agent's budget is exhausted mid-task:
```
Budget:    $50.00 allocated to ChurnDS
Spent:     $48.70 after feature engineering
Remaining: $1.30  (insufficient for model training)
```

OLD behavior (crash):

```python
raise BudgetExhaustedError("$1.30 remaining")
```

Pipeline halts. All prior work lost.
NEW behavior (graceful degradation):
- Checkpoint current state
- Flag output as "budget_limited"
- Return partial results with quality warning
- Notify DIO for potential budget reallocation
- DIO can: extend budget, accept partial, escalate
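The checkpoint-and-flag behavior above can be sketched as a pre-flight budget check; every name here is hypothetical, a sketch of the pattern rather than Neam's implementation:

```python
def run_with_budget(task_fn, budget_remaining: float, est_cost: float,
                    checkpoint_fn, notify_dio_fn):
    """If remaining budget cannot cover the next step, degrade instead of
    raising: checkpoint state, flag the output, and let the DIO decide."""
    if budget_remaining < est_cost:
        checkpoint_fn()  # preserve all work done so far
        notify_dio_fn({"reason": "budget_exhausted",
                       "remaining": budget_remaining})
        return {"status": "budget_limited",
                "warning": "partial results; budget exhausted before completion"}
    return {"status": "completed", "result": task_fn()}
```

The key design choice is that the exhausted agent returns a flagged partial result instead of raising, so the decision (extend, accept, escalate) moves up to the DIO rather than crashing the pipeline.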
Skipping Non-Critical Phases #
Not all phases are equally critical. The DIO classifies phases:
| Phase | Criticality | Degradation Behavior |
|---|---|---|
| Requirements (BA) | HIGH | Cannot skip |
| Feature Engineering | HIGH | Cannot skip |
| Model Training | HIGH | Cannot skip |
| Causal Analysis | MEDIUM | Skip with warning |
| Quality Testing | HIGH | Cannot skip |
| Deployment | HIGH | Cannot skip |
| Monitoring Setup | MEDIUM | Skip with warning |
If the Causal Agent fails and no fallback is available, the system can proceed without causal analysis -- but the output will be explicitly flagged:
```json
{
  "status": "completed_with_degradation",
  "degraded_phases": ["causal_analysis"],
  "warning": "Causal analysis unavailable. Model deployed without root cause identification. Correlation-only analysis.",
  "quality_impact": "Root cause will show as 'unknown'"
}
```
This is exactly what the DataSims ablation_no_causal experiment measures: the system still completes, but the identified root cause degrades from the correct answer (support_quality_degradation) to "unknown".
🎯 Graceful degradation is not failure. It is the difference between a system that delivers 80% of the value and a system that delivers 0% because it crashed. In the DataSims experiments, every ablation still completed all 7 phases -- the system degraded gracefully rather than halting.
Tier 4: Escalation #
When automated recovery is impossible, the system escalates to a human with full context:
```
To: data-team@simshop.com
Subject: [ESCALATION] Churn pipeline requires human intervention

Summary: Quality gate failed after 3 retries

What failed:
- Phase: model_training
- Agent: ChurnDS
- Error: AUC below threshold (0.61 < 0.80)
- Retries: 3/3 exhausted
- Fallback: Attempted (Bedrock), same result

State checkpoint: /checkpoints/churn-001-phase3.json
Resume command: neam resume /checkpoints/churn-001

Recommended actions:
- Review training data for quality issues
- Check if feature distributions have shifted
- Consider adjusting AUC threshold
- Resume from checkpoint after fixing
```
The escalation includes everything a human needs to diagnose and resume:
- What failed and why
- What was already tried (retries, fallbacks)
- A state checkpoint for resuming without re-executing prior phases
- Recommended diagnostic actions
State Machine Checkpointing #
Every phase transition is checkpointed. If the pipeline fails at phase 5, it can resume from phase 5 without re-executing phases 1 through 4:
```mermaid
flowchart LR
    P1["Phase 1\nBA\nDONE"] -- "CP" --> P2["Phase 2\nFeat.\nDONE"]
    P2 -- "CP" --> P3["Phase 3\nModel\nDONE"]
    P3 -- "CP" --> P4["Phase 4\nCausal\nDONE"]
    P4 -- "CP\n(checkpoint)" --> P5["Phase 5\nTest\nFAIL"]
    style P1 fill:#51cf66,color:#fff
    style P2 fill:#51cf66,color:#fff
    style P3 fill:#51cf66,color:#fff
    style P4 fill:#51cf66,color:#fff
    style P5 fill:#ff6b6b,color:#fff
```
```
$ neam resume /checkpoints/churn-001-phase4.json
```

Skips phases 1-4, resumes at phase 5 with full state.
Checkpoint contents:
```json
{
  "pipeline_id": "churn-001",
  "phase": 4,
  "completed_phases": [
    {"phase": "requirements", "output_ref": "brd-001.json"},
    {"phase": "features", "output_ref": "features-001.parquet"},
    {"phase": "model", "output_ref": "model-001.pkl"},
    {"phase": "causal", "output_ref": "dag-001.json"}
  ],
  "agent_states": {
    "ChurnBA": {"status": "idle", "budget_remaining": 42.30},
    "ChurnDS": {"status": "idle", "budget_remaining": 38.15},
    "ChurnCausal": {"status": "idle", "budget_remaining": 45.00}
  },
  "timestamp": "2026-03-15T06:14:17Z"
}
```
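Resuming against a checkpoint of this shape reduces to: skip everything already recorded, re-run the rest. A minimal sketch (the helper names are ours, not the `neam resume` internals):

```python
import json

def resume_from_checkpoint(checkpoint_json: str, phase_order, run_phase):
    """Skip phases recorded in the checkpoint; execute only the remainder.
    `run_phase(phase, outputs_so_far)` returns the new phase's output ref."""
    cp = json.loads(checkpoint_json)
    done = {p["phase"] for p in cp["completed_phases"]}
    outputs = {p["phase"]: p["output_ref"] for p in cp["completed_phases"]}
    for phase in phase_order:
        if phase in done:
            continue  # already checkpointed: no re-execution, no re-spend
        outputs[phase] = run_phase(phase, outputs)
    return outputs
```

Because completed phases carry output references rather than inlined data, the checkpoint stays small and the resumed phase can load exactly the artifacts its predecessors produced.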
Self-Healing Patterns #
Pattern 1: Schema Break Recovery #
When an upstream table schema changes:
1. ETL agent detects schema mismatch
- Expected: [customer_id, name, email, signup_date]
- Actual: [customer_id, name, email, signup_date, email_verified]
2. Classify the change:
- Column added (safe: extend schema)
- Column removed (dangerous: check dependencies)
- Column renamed (dangerous: update references)
- Type changed (dangerous: validate compatibility)
3. For safe changes (column added):
- Auto-extend downstream schemas
- Log change in audit trail
- Notify governance agent for catalog update
- Continue pipeline
4. For dangerous changes:
- Halt affected pipelines
- Run impact analysis (which features affected?)
- Escalate to human with impact report
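The classification step can be sketched for the name-only case. This is a simplification: type changes need column types, and a rename shows up here as a removal plus an addition, so this sketch conservatively treats any removal as dangerous:

```python
def classify_schema_change(expected, actual):
    """Classify a schema diff by column names: additions are safe and
    auto-heal; removals (including the removal half of a rename) halt."""
    added = [c for c in actual if c not in expected]
    removed = [c for c in expected if c not in actual]
    if removed:
        return {"change": "column_removed", "columns": removed,
                "action": "halt_and_escalate"}
    if added:
        return {"change": "column_added", "columns": added,
                "action": "auto_extend"}
    return {"change": "none", "columns": [], "action": "continue"}
```

This erring-toward-escalation bias is deliberate: a false halt costs a human review, while a false auto-heal silently corrupts downstream features.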
Pattern 2: Pipeline Self-Heal #
When a pipeline step fails but the overall pipeline can recover:
```mermaid
flowchart TB
    S1["Step 1: Extract\nSUCCESS"] --> S2["Step 2: Transform\nFAIL (null values in join key)"]
    S2 --> D["Detect: 3% null rate in customer_id"]
    D --> C["Classify: data quality issue, not code bug"]
    C --> H["Heal: Apply imputation strategy from Agent.MD\n(drop rows with null join keys)"]
    H --> L["Log: Dropped 3,247 rows (3%) with null customer_id"]
    L --> R["Resume"]
    R --> S3["Step 3: Load\nSUCCESS (with reduced row count)"]
    S3 --> S4["Step 4: Validate\nSUCCESS (row count within tolerance)"]
    style S1 fill:#51cf66,color:#fff
    style S2 fill:#ff6b6b,color:#fff
    style D fill:#fab005,color:#fff
    style C fill:#fab005,color:#fff
    style H fill:#fab005,color:#fff
    style L fill:#fab005,color:#fff
    style R fill:#fab005,color:#fff
    style S3 fill:#51cf66,color:#fff
    style S4 fill:#51cf66,color:#fff
```
Pattern 3: LLM Provider Failover #
```mermaid
sequenceDiagram
    participant Agent as ChurnDS
    participant OAI as OpenAI gpt-4o
    participant BR as AWS Bedrock Claude 3.5 Sonnet
    Agent->>OAI: Call 1: feature selection
    OAI-->>Agent: SUCCESS
    Agent->>OAI: Call 2: hyperparameter search
    OAI-->>Agent: SUCCESS
    Agent->>OAI: Call 3
    OAI-->>Agent: FAIL (503 capacity)
    Agent->>OAI: Call 4 (retry)
    OAI-->>Agent: FAIL (503)
    Agent->>OAI: Call 5 (retry)
    OAI-->>Agent: FAIL (503)
    Note over Agent: Failover to Bedrock
    Agent->>BR: Call 6: model evaluation
    BR-->>Agent: SUCCESS
    Agent->>BR: Call 7: model export
    BR-->>Agent: SUCCESS
```
The audit trail records the switch: "Calls 1-2 via OpenAI gpt-4o. Calls 6-7 via Bedrock Claude 3.5 Sonnet (failover). Reason: OpenAI 503."
Pattern 4: Budget Reallocation #
When one agent exhausts its budget but the overall DIO budget has surplus:
| Agent | Allocated | Spent | Remaining |
|---|---|---|---|
| ChurnBA | $50.00 | $12.30 | $37.70 |
| ChurnDS | $50.00 | $48.70 | $1.30 (Exhausted) |
| ChurnCausal | $50.00 | $31.20 | $18.80 |
| ChurnTester | $50.00 | $15.00 | $35.00 |
| ChurnMLOps | $50.00 | $22.00 | $28.00 |
- DIO budget: $500.00
- DIO action: reallocate $20.00 from ChurnBA surplus to ChurnDS
- New ChurnDS allocation: $70.00 ($50.00 + $20.00 reallocated); remaining: $21.30 ($1.30 + $20.00)
- ChurnBA remaining after transfer: $17.70
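A sketch of the DIO-level transfer, assuming a simple largest-surplus donor policy (the policy, names, and `min_surplus` floor are all illustrative, not the Neam scheduler):

```python
def reallocate_budget(agents, exhausted: str, amount: float,
                      min_surplus: float = 5.0):
    """DIO-level transfer from the largest-surplus agent to an exhausted one.
    Only the DIO may call this; agents never transfer budget peer-to-peer
    (a security invariant, covered in Chapter 24)."""
    donor = max((a for a in agents if a != exhausted),
                key=lambda a: agents[a]["remaining"])
    if agents[donor]["remaining"] - amount < min_surplus:
        raise ValueError("no donor with sufficient surplus; escalate instead")
    agents[donor]["remaining"] -= amount
    agents[exhausted]["remaining"] += amount
    agents[exhausted]["allocated"] += amount
    return donor
```

Applied to the table above, ChurnBA is the donor: its remaining drops to $17.70, while ChurnDS ends with a $70.00 allocation and $21.30 to spend.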
💡 Budget reallocation requires DIO-level authority. Individual agents cannot transfer budget between themselves. This is a security invariant -- see Chapter 24.
Error Classification #
The system classifies every error to determine the appropriate recovery tier:
| Error Type | Tier | Recovery Action |
|---|---|---|
| LLM rate limit (429) | 1 | Retry with backoff |
| Network timeout | 1 | Retry with backoff |
| LLM provider outage | 2 | Failover to backup provider |
| Agent capability gap | 2 | Substitute with backup agent |
| Budget exhaustion | 3 | Graceful degradation + realloc |
| Non-critical phase fail | 3 | Skip with warning flag |
| Quality gate failure | 4 | Escalate to human |
| Policy violation | 4 | Halt and escalate immediately |
| Data corruption detected | 4 | Halt, checkpoint, escalate |
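The table above can be mirrored as a lookup that fails safe: any error type the system does not recognize escalates to Tier 4 rather than being retried blindly. The error-type strings here are our own labels, not Neam identifiers:

```python
# Error type -> recovery tier, mirroring the classification table.
ERROR_TIERS = {
    "rate_limit_429": 1, "network_timeout": 1,
    "provider_outage": 2, "capability_gap": 2,
    "budget_exhausted": 3, "noncritical_phase_failed": 3,
    "quality_gate_failed": 4, "policy_violation": 4, "data_corruption": 4,
}

def recovery_tier(error_type: str) -> int:
    """Unknown error types escalate to Tier 4 by default (fail safe)."""
    return ERROR_TIERS.get(error_type, 4)
```

Defaulting unknowns to escalation is the conservative choice: retrying an unclassified error risks repeating a deterministic failure three times before anyone looks at it.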
DataSims Evidence: Error Handling in Practice #
The DataSims ablation experiments demonstrate error handling outcomes across different failure scenarios:
| Ablation | Simulated Failure | Recovery | Outcome |
|---|---|---|---|
| ablation_no_test | Quality gates removed | Graceful degradation | Gate: "skipped" (not "crashed") |
| ablation_no_mlops | No deployment agent | Fallback to manual deploy | Strategy: "manual" (not "failed") |
| ablation_no_causal | No causal analysis | Skip non-critical phase | Root cause: "unknown" (not "error") |
| ablation_no_gates | Gates disabled | Degraded governance | Gate: "bypassed" (not "crashed") |
In every ablation, the system completed all 7 phases. No ablation caused a crash. This is the self-healing design in action: the system degrades to a lower quality level rather than failing entirely.
From the DataSims evaluation, the full system cost was $23.50 per run. Error handling overhead (retries, fallbacks) added less than 5% to the total cost in normal operation.
Key Takeaways #
- The four-tier error hierarchy (Retry, Fallback, Graceful Degradation, Escalation) prevents cascading failures
- State machine checkpointing enables resume-from-failure without re-executing completed phases
- Schema breaks are classified by severity: safe changes auto-heal, dangerous changes escalate
- Budget exhaustion triggers graceful degradation and DIO-level reallocation, not crashes
- LLM provider failover chains ensure no single provider outage halts the pipeline
- Every error is classified to determine the appropriate recovery tier
- DataSims ablation experiments prove the self-healing design: all ablations completed without crashes
For Further Exploration #
- DataSims Repository -- Ablation results showing graceful degradation in action
- Chapter 22 -- How each coordination mode handles errors differently
- Chapter 24 -- Security implications of error handling (budget reallocation authority)