Chapter 23 — Error Handling and Self-Healing #
"The measure of intelligence is the ability to change." -- Albert Einstein
📖 20 min read | 👤 Priya (Data Engineer), Sarah (MLOps), David (VP Data) | 🏷️ Part VI: Orchestration
What you'll learn:
- The four-tier error handling hierarchy: Retry, Fallback, Graceful Degradation, Escalation
- How state machine checkpointing enables recovery without re-execution
- Self-healing patterns for schema breaks, budget exhaustion, and provider failures
- Why crash-on-error is the worst possible design for agent systems
The Problem: The 3 AM Pipeline #
Priya's phone buzzes at 3:14 AM. The nightly ETL pipeline has failed. Again.
She opens her laptop and checks the logs. The churn feature pipeline crashed because the upstream events table had a new column (event_metadata_v2) that broke the fixed-schema SELECT statement. The pipeline did what most pipelines do when they encounter an unexpected column: it crashed. Hard.
The crash cascaded. The feature table was not updated. The model scoring job ran on stale features. The prediction API is now serving yesterday's scores. The monitoring dashboard shows a flatline, which the on-call engineer interpreted as "no drift" -- when the real story is "no data."
One schema change. One crash. An entire prediction system serving stale results for 12 hours before anyone noticed.
This is what happens when error handling is an afterthought. In the Neam agent stack, error handling is a first-class architectural concern.
The Four-Tier Error Handling Hierarchy #
The Neam agent stack handles errors through a four-tier hierarchy. Each tier is tried in order; if the current tier cannot resolve the error, it escalates to the next tier.
```mermaid
flowchart TB
    T1["Tier 1: RETRY"]
    T1D["Same agent, same task, same parameters\nExponential backoff with jitter\nMax retries configurable per agent (default: 3)\nHandles: transient LLM failures, rate limits"]
    T2["Tier 2: FALLBACK"]
    T2D["Alternative agent or alternative provider\nLLM provider chain: OpenAI → Bedrock → Ollama\nAgent substitution: primary → backup specialist\nHandles: provider outages, agent specialization"]
    T3["Tier 3: GRACEFUL DEGRADATION"]
    T3D["Reduced functionality, not total failure\nSkip non-critical phases, flag degraded output\nPartial results with explicit quality warnings\nHandles: budget exhaustion, non-critical failures"]
    T4["Tier 4: ESCALATION"]
    T4D["Human notification with full context\nState checkpoint for manual resume\nRecommended actions based on error classification\nHandles: unrecoverable errors, policy violations"]
    T1 --> T1D
    T1 -- "Cannot resolve" --> T2
    T2 --> T2D
    T2 -- "Cannot resolve" --> T3
    T3 --> T3D
    T3 -- "Cannot resolve" --> T4
    T4 --> T4D
```
Tier 1: Retry with Backoff #
The simplest and most common recovery. When an agent call fails, the system retries with exponential backoff:
```mermaid
flowchart LR
    A1["Attempt 1\nt=0s"] -- "FAIL\n(rate limit)" --> A2["Attempt 2\nt=2s"]
    A2 -- "FAIL\n(rate limit)" --> A3["Attempt 3\nt=6s"]
    A3 -- "FAIL\n(rate limit)" --> A4["Attempt 4\nt=14s"]
    A4 -- "SUCCESS" --> Done["Complete"]
    style A1 fill:#ff6b6b,color:#fff
    style A2 fill:#ff6b6b,color:#fff
    style A3 fill:#ff6b6b,color:#fff
    style A4 fill:#51cf66,color:#fff
    style Done fill:#51cf66,color:#fff
```
- Backoff formula: `delay = base * 2^(attempt-1) + random_jitter`
- Default base: 1 second (the timeline above, with delays of 2 s, 4 s, and 8 s, illustrates a base of 2)
- Max retries: 3 (configurable per agent)
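The retry loop can be sketched in a few lines of Python. This is a minimal illustration of the Tier 1 behavior, not the Neam runtime; the function names are ours, and the error-classification callback anticipates the transient-vs-deterministic rule described below:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, max_jitter: float = 0.5) -> float:
    """Tier 1 formula: delay = base * 2^(attempt-1) + random_jitter."""
    return base * 2 ** (attempt - 1) + random.uniform(0.0, max_jitter)

def retry_with_backoff(call, is_transient, max_retries: int = 3,
                       base: float = 1.0, max_jitter: float = 0.5):
    """Retry `call` on transient errors only. Deterministic errors and
    exhausted retries propagate so the next tier (fallback) can take over."""
    for attempt in range(1, max_retries + 2):  # initial attempt + max_retries
        try:
            return call()
        except Exception as err:
            if not is_transient(err) or attempt > max_retries:
                raise  # hand off to Tier 2
            time.sleep(backoff_delay(attempt, base, max_jitter))
```

The jitter term matters in practice: without it, many agents rate-limited at the same moment would all retry at the same moment and collide again.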
Retries are appropriate for transient failures -- errors that will resolve themselves with time:
| Transient Error | Typical Recovery Time |
|---|---|
| LLM rate limit | 1-5 seconds |
| Network timeout | 2-10 seconds |
| Database connection pool exhaustion | 1-3 seconds |
| Temporary API unavailability | 5-30 seconds |
⚠️ Retries are NOT appropriate for deterministic failures. If the error is caused by bad input data or a logic error, retrying the same operation will produce the same failure. The system classifies errors before retrying: transient errors retry, deterministic errors skip to Tier 2.
Tier 2: Fallback #
When retries are exhausted or the error is classified as non-transient, the system falls back to an alternative:
LLM Provider Failover #
```mermaid
flowchart LR
    P["OpenAI\ngpt-4o\n(cloud)"] -- "FAIL" --> F1["AWS Bedrock\nClaude\n(cloud)"]
    F1 -- "FAIL" --> F2["Ollama\nlocal\n(on-prem)"]
    style P fill:#4dabf7,color:#fff
    style F1 fill:#fab005,color:#fff
    style F2 fill:#ff6b6b,color:#fff
```
Failover triggers:
- 3 consecutive failures on primary
- Response latency > 30s (timeout)
- Provider reports capacity issues (429/503)
In Neam, provider failover is built into the agent declaration:
```
agent ChurnDS {
  provider: "openai", model: "gpt-4o",
  // Fallback chain (tried in order if primary fails)
  fallback_providers: [
    { provider: "bedrock", model: "claude-3-5-sonnet" },
    { provider: "ollama", model: "llama3.1:70b" }
  ],
  budget: AgentBudget
}
```
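At runtime, a declaration like this amounts to walking the chain in order and recording any failover for the audit trail. A Python sketch under our own names (`call_with_failover` and the `call_provider` callback are illustrative, not Neam APIs):

```python
# Provider chain mirroring the ChurnDS declaration above (illustrative).
FALLBACK_CHAIN = [
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "bedrock", "model": "claude-3-5-sonnet"},
    {"provider": "ollama", "model": "llama3.1:70b"},
]

def call_with_failover(prompt, providers, call_provider, audit_log):
    """Try each provider in order; record any failover in the audit trail."""
    last_err = None
    for i, cfg in enumerate(providers):
        try:
            result = call_provider(cfg, prompt)
            if i > 0:  # succeeded on a fallback: flag possibly reduced quality
                audit_log.append({"event": "failover",
                                  "provider": cfg["provider"],
                                  "reason": str(last_err)})
            return result
        except Exception as err:
            last_err = err
    raise RuntimeError(f"all providers failed: {last_err}")
```

Note that the audit entry is written only when a fallback succeeds; that is what lets downstream consumers see which outputs came from a lower-quality provider.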
Agent Substitution #
When a specialist agent cannot complete a task, the DIO can reassign to a different agent with overlapping capabilities:
- Task: Feature engineering for churn model
- Primary: ETLAgent (specialized, preferred)
- Fallback: DataScientist (can do feature engineering, less specialized)

The DIO tracks agent capability overlap and uses it for substitution.
💡 Fallback does not mean equivalent quality. A local Ollama model will typically produce lower-quality outputs than GPT-4o. The system records the fallback in the audit trail so that downstream consumers know the output may have reduced quality.
Tier 3: Graceful Degradation #
When no fallback is available, the system degrades gracefully rather than crashing:
Budget Exhaustion #
This is the most common graceful degradation scenario. When an agent's budget is exhausted mid-task:
```
Budget:    $50.00 allocated to ChurnDS
Spent:     $48.70 after feature engineering
Remaining: $1.30  (insufficient for model training)
```

OLD behavior (crash):

```python
raise BudgetExhaustedError("$1.30 remaining")
```

Pipeline halts. All prior work lost.
NEW behavior (graceful degradation):
- Checkpoint current state
- Flag output as "budget_limited"
- Return partial results with quality warning
- Notify DIO for potential budget reallocation
- DIO can: extend budget, accept partial, escalate
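The checkpoint-and-flag behavior above can be sketched as a pre-flight budget check; every name here is hypothetical, a sketch of the pattern rather than Neam's implementation:

```python
def run_with_budget(task_fn, budget_remaining: float, est_cost: float,
                    checkpoint_fn, notify_dio_fn):
    """If remaining budget cannot cover the next step, degrade instead of
    raising: checkpoint state, flag the output, and let the DIO decide."""
    if budget_remaining < est_cost:
        checkpoint_fn()  # preserve all work done so far
        notify_dio_fn({"reason": "budget_exhausted",
                       "remaining": budget_remaining})
        return {"status": "budget_limited",
                "warning": "partial results; budget exhausted before completion"}
    return {"status": "completed", "result": task_fn()}
```

The key design choice is that the exhausted agent returns a flagged partial result instead of raising, so the decision (extend, accept, escalate) moves up to the DIO rather than crashing the pipeline.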
Skipping Non-Critical Phases #
Not all phases are equally critical. The DIO classifies phases:
| Phase | Criticality | Degradation Behavior |
|---|---|---|
| Requirements (BA) | HIGH | Cannot skip |
| Feature Engineering | HIGH | Cannot skip |
| Model Training | HIGH | Cannot skip |
| Causal Analysis | MEDIUM | Skip with warning |
| Quality Testing | HIGH | Cannot skip |
| Deployment | HIGH | Cannot skip |
| Monitoring Setup | MEDIUM | Skip with warning |
If the Causal Agent fails and no fallback is available, the system can proceed without causal analysis -- but the output will be explicitly flagged:
```json
{
  "status": "completed_with_degradation",
  "degraded_phases": ["causal_analysis"],
  "warning": "Causal analysis unavailable. Model deployed without root cause identification. Correlation-only analysis.",
  "quality_impact": "Root cause will show as 'unknown'"
}
```
This is exactly what the DataSims ablation_no_causal experiment measures: the system still completes, but the identified root cause degrades from the correct answer (support_quality_degradation) to "unknown".
🎯 Graceful degradation is not failure. It is the difference between a system that delivers 80% of the value and a system that delivers 0% because it crashed. In the DataSims experiments, every ablation still completed all 7 phases -- the system degraded gracefully rather than halting.
Tier 4: Escalation #
When automated recovery is impossible, the system escalates to a human with full context:
```
To: data-team@simshop.com
Subject: [ESCALATION] Churn pipeline requires human intervention

Summary: Quality gate failed after 3 retries

What failed:
- Phase: model_training
- Agent: ChurnDS
- Error: AUC below threshold (0.61 < 0.80)
- Retries: 3/3 exhausted
- Fallback: Attempted (Bedrock), same result

State checkpoint: /checkpoints/churn-001-phase3.json
Resume command: neam resume /checkpoints/churn-001

Recommended actions:
- Review training data for quality issues
- Check if feature distributions have shifted
- Consider adjusting AUC threshold
- Resume from checkpoint after fixing
```
The escalation includes everything a human needs to diagnose and resume:
- What failed and why
- What was already tried (retries, fallbacks)
- A state checkpoint for resuming without re-executing prior phases
- Recommended diagnostic actions
State Machine Checkpointing #
Every phase transition is checkpointed. If the pipeline fails at phase 5, it can resume from phase 5 without re-executing phases 1 through 4:
```mermaid
flowchart LR
    P1["Phase 1\nBA\nDONE"] -- "CP" --> P2["Phase 2\nFeat.\nDONE"]
    P2 -- "CP" --> P3["Phase 3\nModel\nDONE"]
    P3 -- "CP" --> P4["Phase 4\nCausal\nDONE"]
    P4 -- "CP\n(checkpoint)" --> P5["Phase 5\nTest\nFAIL"]
    style P1 fill:#51cf66,color:#fff
    style P2 fill:#51cf66,color:#fff
    style P3 fill:#51cf66,color:#fff
    style P4 fill:#51cf66,color:#fff
    style P5 fill:#ff6b6b,color:#fff
```
```
$ neam resume /checkpoints/churn-001-phase4.json
```

Skips phases 1-4, resumes at phase 5 with full state.
Checkpoint contents:
```json
{
  "pipeline_id": "churn-001",
  "phase": 4,
  "completed_phases": [
    {"phase": "requirements", "output_ref": "brd-001.json"},
    {"phase": "features", "output_ref": "features-001.parquet"},
    {"phase": "model", "output_ref": "model-001.pkl"},
    {"phase": "causal", "output_ref": "dag-001.json"}
  ],
  "agent_states": {
    "ChurnBA": {"status": "idle", "budget_remaining": 42.30},
    "ChurnDS": {"status": "idle", "budget_remaining": 38.15},
    "ChurnCausal": {"status": "idle", "budget_remaining": 45.00}
  },
  "timestamp": "2026-03-15T06:14:17Z"
}
```
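Resuming against a checkpoint of this shape reduces to: skip everything already recorded, re-run the rest. A minimal sketch (the helper names are ours, not the `neam resume` internals):

```python
import json

def resume_from_checkpoint(checkpoint_json: str, phase_order, run_phase):
    """Skip phases recorded in the checkpoint; execute only the remainder.
    `run_phase(phase, outputs_so_far)` returns the new phase's output ref."""
    cp = json.loads(checkpoint_json)
    done = {p["phase"] for p in cp["completed_phases"]}
    outputs = {p["phase"]: p["output_ref"] for p in cp["completed_phases"]}
    for phase in phase_order:
        if phase in done:
            continue  # already checkpointed: no re-execution, no re-spend
        outputs[phase] = run_phase(phase, outputs)
    return outputs
```

Because completed phases carry output references rather than inlined data, the checkpoint stays small and the resumed phase can load exactly the artifacts its predecessors produced.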
Self-Healing Patterns #
Pattern 1: Schema Break Recovery #
When an upstream table schema changes:
1. ETL agent detects schema mismatch
- Expected: [customer_id, name, email, signup_date]
- Actual: [customer_id, name, email, signup_date, email_verified]
2. Classify the change:
- Column added (safe: extend schema)
- Column removed (dangerous: check dependencies)
- Column renamed (dangerous: update references)
- Type changed (dangerous: validate compatibility)
3. For safe changes (column added):
- Auto-extend downstream schemas
- Log change in audit trail
- Notify governance agent for catalog update
- Continue pipeline
4. For dangerous changes:
- Halt affected pipelines
- Run impact analysis (which features affected?)
- Escalate to human with impact report
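The classification step can be sketched for the name-only case. This is a simplification: type changes need column types, and a rename shows up here as a removal plus an addition, so this sketch conservatively treats any removal as dangerous:

```python
def classify_schema_change(expected, actual):
    """Classify a schema diff by column names: additions are safe and
    auto-heal; removals (including the removal half of a rename) halt."""
    added = [c for c in actual if c not in expected]
    removed = [c for c in expected if c not in actual]
    if removed:
        return {"change": "column_removed", "columns": removed,
                "action": "halt_and_escalate"}
    if added:
        return {"change": "column_added", "columns": added,
                "action": "auto_extend"}
    return {"change": "none", "columns": [], "action": "continue"}
```

This erring-toward-escalation bias is deliberate: a false halt costs a human review, while a false auto-heal silently corrupts downstream features.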
Pattern 2: Pipeline Self-Heal #
When a pipeline step fails but the overall pipeline can recover:
```mermaid
flowchart TB
    S1["Step 1: Extract\nSUCCESS"] --> S2["Step 2: Transform\nFAIL (null values in join key)"]
    S2 --> D["Detect: 3% null rate in customer_id"]
    D --> C["Classify: data quality issue, not code bug"]
    C --> H["Heal: Apply imputation strategy from Agent.MD\n(drop rows with null join keys)"]
    H --> L["Log: Dropped 3,247 rows (3%) with null customer_id"]
    L --> R["Resume"]
    R --> S3["Step 3: Load\nSUCCESS (with reduced row count)"]
    S3 --> S4["Step 4: Validate\nSUCCESS (row count within tolerance)"]
    style S1 fill:#51cf66,color:#fff
    style S2 fill:#ff6b6b,color:#fff
    style D fill:#fab005,color:#fff
    style C fill:#fab005,color:#fff
    style H fill:#fab005,color:#fff
    style L fill:#fab005,color:#fff
    style R fill:#fab005,color:#fff
    style S3 fill:#51cf66,color:#fff
    style S4 fill:#51cf66,color:#fff
```
Pattern 3: LLM Provider Failover #
```mermaid
sequenceDiagram
    participant Agent as ChurnDS
    participant OAI as OpenAI gpt-4o
    participant BR as AWS Bedrock Claude 3.5 Sonnet
    Agent->>OAI: Call 1: feature selection
    OAI-->>Agent: SUCCESS
    Agent->>OAI: Call 2: hyperparameter search
    OAI-->>Agent: SUCCESS
    Agent->>OAI: Call 3
    OAI-->>Agent: FAIL (503 capacity)
    Agent->>OAI: Call 4 (retry)
    OAI-->>Agent: FAIL (503)
    Agent->>OAI: Call 5 (retry)
    OAI-->>Agent: FAIL (503)
    Note over Agent: Failover to Bedrock
    Agent->>BR: Call 6: model evaluation
    BR-->>Agent: SUCCESS
    Agent->>BR: Call 7: model export
    BR-->>Agent: SUCCESS
```
The audit trail records the switch: "Calls 1-2 via OpenAI gpt-4o. Calls 6-7 via Bedrock Claude 3.5 Sonnet (failover). Reason: OpenAI 503."
Pattern 4: Budget Reallocation #
When one agent exhausts its budget but the overall DIO budget has surplus:
| Agent | Allocated | Spent | Remaining |
|---|---|---|---|
| ChurnBA | $50.00 | $12.30 | $37.70 |
| ChurnDS | $50.00 | $48.70 | $1.30 (Exhausted) |
| ChurnCausal | $50.00 | $31.20 | $18.80 |
| ChurnTester | $50.00 | $15.00 | $35.00 |
| ChurnMLOps | $50.00 | $22.00 | $28.00 |
- DIO budget: $500.00
- DIO action: reallocate $20.00 from ChurnBA surplus to ChurnDS
- New ChurnDS allocation: $70.00 ($50.00 + $20.00 reallocated); remaining: $21.30 ($1.30 + $20.00)
- ChurnBA remaining after transfer: $17.70
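A sketch of the DIO-level transfer, assuming a simple largest-surplus donor policy (the policy, names, and `min_surplus` floor are all illustrative, not the Neam scheduler):

```python
def reallocate_budget(agents, exhausted: str, amount: float,
                      min_surplus: float = 5.0):
    """DIO-level transfer from the largest-surplus agent to an exhausted one.
    Only the DIO may call this; agents never transfer budget peer-to-peer
    (a security invariant, covered in Chapter 24)."""
    donor = max((a for a in agents if a != exhausted),
                key=lambda a: agents[a]["remaining"])
    if agents[donor]["remaining"] - amount < min_surplus:
        raise ValueError("no donor with sufficient surplus; escalate instead")
    agents[donor]["remaining"] -= amount
    agents[exhausted]["remaining"] += amount
    agents[exhausted]["allocated"] += amount
    return donor
```

Applied to the table above, ChurnBA is the donor: its remaining drops to $17.70, while ChurnDS ends with a $70.00 allocation and $21.30 to spend.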
💡 Budget reallocation requires DIO-level authority. Individual agents cannot transfer budget between themselves. This is a security invariant -- see Chapter 24.
Error Classification #
The system classifies every error to determine the appropriate recovery tier:
| Error Type | Tier | Recovery Action |
|---|---|---|
| LLM rate limit (429) | 1 | Retry with backoff |
| Network timeout | 1 | Retry with backoff |
| LLM provider outage | 2 | Failover to backup provider |
| Agent capability gap | 2 | Substitute with backup agent |
| Budget exhaustion | 3 | Graceful degradation + realloc |
| Non-critical phase fail | 3 | Skip with warning flag |
| Quality gate failure | 4 | Escalate to human |
| Policy violation | 4 | Halt and escalate immediately |
| Data corruption detected | 4 | Halt, checkpoint, escalate |
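The table above can be mirrored as a lookup that fails safe: any error type the system does not recognize escalates to Tier 4 rather than being retried blindly. The error-type strings here are our own labels, not Neam identifiers:

```python
# Error type -> recovery tier, mirroring the classification table.
ERROR_TIERS = {
    "rate_limit_429": 1, "network_timeout": 1,
    "provider_outage": 2, "capability_gap": 2,
    "budget_exhausted": 3, "noncritical_phase_failed": 3,
    "quality_gate_failed": 4, "policy_violation": 4, "data_corruption": 4,
}

def recovery_tier(error_type: str) -> int:
    """Unknown error types escalate to Tier 4 by default (fail safe)."""
    return ERROR_TIERS.get(error_type, 4)
```

Defaulting unknowns to escalation is the conservative choice: retrying an unclassified error risks repeating a deterministic failure three times before anyone looks at it.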
DataSims Evidence: Error Handling in Practice #
The DataSims ablation experiments demonstrate error handling outcomes across different failure scenarios:
| Ablation | Simulated Failure | Recovery | Outcome |
|---|---|---|---|
| ablation_no_test | Quality gates removed | Graceful degradation | Gate: "skipped" (not "crashed") |
| ablation_no_mlops | No deployment agent | Fallback to manual deploy | Strategy: "manual" (not "failed") |
| ablation_no_causal | No causal analysis | Skip non-critical phase | Root cause: "unknown" (not "error") |
| ablation_no_gates | Gates disabled | Degraded governance | Gate: "bypassed" (not "crashed") |
In every ablation, the system completed all 7 phases. No ablation caused a crash. This is the self-healing design in action: the system degrades to a lower quality level rather than failing entirely.
From the DataSims evaluation, the full system cost was $23.50 per run. Error handling overhead (retries, fallbacks) added less than 5% to the total cost in normal operation.
Key Takeaways #
- The four-tier error hierarchy (Retry, Fallback, Graceful Degradation, Escalation) prevents cascading failures
- State machine checkpointing enables resume-from-failure without re-executing completed phases
- Schema breaks are classified by severity: safe changes auto-heal, dangerous changes escalate
- Budget exhaustion triggers graceful degradation and DIO-level reallocation, not crashes
- LLM provider failover chains ensure no single provider outage halts the pipeline
- Every error is classified to determine the appropriate recovery tier
- DataSims ablation experiments prove the self-healing design: all ablations completed without crashes
For Further Exploration #
- DataSims Repository -- Ablation results showing graceful degradation in action
- Chapter 22 -- How each coordination mode handles errors differently
- Chapter 24 -- Security implications of error handling (budget reallocation authority)