Chapter 9: The ETL Agent — SQL-First Warehousing #

"The goal is to turn data into information, and information into insight." — Carly Fiorina


Reading time: 30 min | Personas: Priya (Data Engineer), Kim (Data Analyst), Marcus (Data Scientist) | Part III: Data Infrastructure Agents


What You'll Learn #

  - How the ETL Agent extends the Data Agent (Chapter 8) with warehouse intelligence
  - Dimensional modeling methodologies (Kimball, Inmon, Data Vault 2.0) and SCD strategies
  - How the semantic layer makes business metrics executable and NL2SQL reliable
  - Multi-dialect SQL transpilation across Snowflake, BigQuery, Redshift, and more
  - Auto-modeling: AI-proposed star schemas with human-in-the-loop approval
  - Self-healing pipelines that detect, diagnose, and remediate failures

The Problem: The Warehouse That Nobody Trusted #

Kim is a Data Analyst at SimShop. Every Monday morning, she opens the weekly revenue dashboard and sees a number. She does not trust it. Last month, the dashboard showed a 40% revenue spike on a Tuesday. It turned out that the ETL pipeline had loaded Saturday's batch twice — a deduplication failure in the staging layer. The month before, the "active customers" metric disagreed between the marketing dashboard and the finance report by 12%, because each team defined "active" differently.

Kim's frustration is shared by 73% of data professionals surveyed by Monte Carlo (2023): they do not trust their own data pipelines. The problem is not SQL skill or warehouse capacity. It is the gap between business meaning and technical implementation. "Revenue" is a word. SUM(fct_sales.revenue) is a computation. When the mapping between the two is implicit — buried in dbt models, Looker views, or tribal knowledge — trust erodes.

The ETL Agent closes this gap with three innovations: a semantic layer that makes business definitions executable, auto-modeling that proposes dimensional designs from OLTP schemas, and self-healing that detects and fixes pipeline failures without human intervention.


The ETL Agent: Brain, Not Muscle #

The ETL Agent is a specialized subtype of the Data Agent (Chapter 8). It inherits sources, schemas, pipelines, quality gates, lineage, and catalog registration. What it adds is warehouse intelligence: the ability to think about data in terms of facts, dimensions, metrics, and business entities.

Critically, the ETL Agent is the brain, not the muscle. It generates SQL, submits it to a target warehouse, monitors execution, validates results, and self-heals on failure. All data-heavy computation is pushed down to the warehouse engine.

DIAGRAM The ETL Agent Lifecycle
flowchart LR
  A["PERCEIVE"] --> B["MODEL"]
  B --> C["GENERATE"]
  C --> D["EXECUTE"]
  D --> E["VALIDATE"]
  E --> F["SERVE"]

  A1["Connect to sources\nDiscover schemas\nProfile data\nDetect PII"] -.-> A
  B1["Propose star schema\nPropose facts + dims\nRecommend SCD"] -.-> B
  C1["Generate staging SQL\nGenerate integration SQL\nGenerate mart SQL"] -.-> C
  D1["Submit SQL to warehouse\nMonitor execution\nTrack cost\nSelf-heal"] -.-> D
  E1["Quality gates\nDrift detect\nReferential integrity\nRecord lineage"] -.-> E
  F1["Update semantic layer\nRegister in catalog\nNL2SQL ready"] -.-> F

What the ETL Agent Adds to the Data Agent #

| Capability | Data Agent (Ch. 8) | ETL Agent (Ch. 9) |
|---|---|---|
| Pipeline model | Generic E/T/L | SQL-first ELT with dialect-aware generation |
| Schema modeling | Flat schemas | Dimensional modeling (star, snowflake, Data Vault) |
| Pipeline architecture | Single pipeline | Multi-layer (staging, integration, marts) |
| Compute integration | Compute routing | Compute pushdown with SQL transpilation |
| Quality | Quality gates | Quality gates + self-healing |
| Catalog | Registration | Semantic layer with business glossary, metrics |
| Design | Manual | AI-planned (NL2SQL, auto-modeling) |

Dimensional Modeling in Neam #

The ETL Agent supports three modeling methodologies, each suited to different organizational needs.

Kimball Star Schema #

The most common approach. Facts at the center, dimensions around the edges. Optimized for query performance and analyst comprehension.

DIAGRAM Kimball Star Schema
flowchart TD
  F["fct_orders"]
  D1["dim_customer"] --- F
  D2["dim_product"] --- F
  D3["dim_date"] --- F
  D4["dim_channel"] --- F
  D5["dim_store"] --- F
  D6["dim_promotion"] --- F

Inmon (Enterprise Data Warehouse) #

Normalized 3NF in the integration layer, with star schemas built downstream in data marts. Prioritizes data consistency over query speed.

Data Vault 2.0 #

Hub-Link-Satellite architecture. Hubs hold business keys, Links hold relationships, Satellites hold descriptive attributes with full history. Best for environments with frequent schema change.


Mart Declarations #

Mart declarations are the ETL Agent's way of expressing dimensional models. A mart groups related fact and dimension tables and declares their grain, measures, and SCD strategies.

NEAM
// ── Fact Table Declaration ──
mart SalesMart {
    facts: ["fct_orders", "fct_daily_revenue"],

    dimensions: [
        "dim_customer", "dim_product", "dim_date",
        "dim_channel", "dim_store", "dim_promotion"
    ],

    // Slowly Changing Dimension strategies
    scd: {
        dim_customer: "type2",    // Full history tracking
        dim_product: "type1",     // Overwrite (latest only)
        dim_store: "type2"        // Full history tracking
    },

    // Conformed dimensions shared across marts
    conformed: ["dim_date", "dim_customer"],

    // Grain defines the level of detail
    grain: "one row per order line item",

    // Measures define computable metrics on the fact table
    measures: {
        total_revenue: "SUM(amount)",
        order_count: "COUNT(DISTINCT order_id)",
        avg_order_value: "SUM(amount) / COUNT(DISTINCT order_id)",
        items_sold: "SUM(quantity)"
    }
}
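The grain declaration is testable: the fact table's business key should be unique at the declared grain. A minimal Python sketch of the uniqueness check an agent could run, assuming hypothetical `order_id` and `line_number` key columns for the "one row per order line item" grain:

```python
from collections import Counter

def grain_violations(rows, key_columns):
    """Return grain keys that occur more than once; an empty result
    means the fact table honors its declared grain."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical rows at the declared grain: one row per order line item
fct_orders = [
    {"order_id": 1, "line_number": 1, "amount": 20.0},
    {"order_id": 1, "line_number": 2, "amount": 5.0},
    {"order_id": 2, "line_number": 1, "amount": 12.5},
]

print(grain_violations(fct_orders, ["order_id", "line_number"]))  # {}
```

In a real warehouse the same check is pushed down as a `GROUP BY ... HAVING COUNT(*) > 1` query rather than run in application code.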

SCD Strategies #

| Strategy | Behavior | Use When |
|---|---|---|
| type1 | Overwrite with latest value | You only need current state (e.g., product name) |
| type2 | Add new row with validity dates | You need full history (e.g., customer address changes) |
| type3 | Add column for previous value | You need limited history (e.g., previous and current) |

Anti-Pattern: Using SCD Type 1 for customer dimensions. When a customer changes their address and you overwrite it, you lose the ability to analyze historical orders by the address that was current at the time of the order. Use Type 2 for any dimension where history matters for analytics.
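Type 2 mechanics reduce to close-and-append versioning. A hedged Python sketch of the idea, assuming illustrative `valid_from`/`valid_to`/`is_current` housekeeping columns (not necessarily the columns Neam generates):

```python
from datetime import date

def scd2_upsert(dim_rows, incoming, key, tracked, today):
    """SCD Type 2: close the current version of a changed row,
    then append a new current version."""
    for row in dim_rows:
        if row[key] == incoming[key] and row["is_current"]:
            if all(row[c] == incoming[c] for c in tracked):
                return dim_rows  # nothing changed; keep current version
            row["valid_to"] = today       # close out the old version
            row["is_current"] = False
    dim_rows.append(dict(incoming, valid_from=today, valid_to=None,
                         is_current=True))
    return dim_rows

dim_customer = [{"customer_id": 42, "address": "1 Old Rd",
                 "valid_from": date(2023, 1, 1), "valid_to": None,
                 "is_current": True}]
scd2_upsert(dim_customer, {"customer_id": 42, "address": "9 New Ave"},
            key="customer_id", tracked=["address"], today=date(2024, 6, 1))
print(len(dim_customer))  # 2: one historical version, one current
```

This is exactly what Type 1 loses: after the upsert, historical orders can still join to the "1 Old Rd" version that was current when they were placed.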


The Semantic Layer #

The semantic layer bridges the gap between business language and SQL. It defines metrics, entities, synonyms, and time intelligence in one place, so that every dashboard, report, and NL2SQL query uses the same definitions.

NEAM
semantic SalesGlossary {
    // ── Named Metrics ──
    metrics: {
        revenue: {
            sql: "SUM(fct_orders.amount)",
            description: "Total revenue from completed orders",
            type: "measure",
            owner: "finance_team"
        },
        aov: {
            sql: "SUM(fct_orders.amount) / COUNT(DISTINCT fct_orders.order_id)",
            description: "Average order value",
            type: "ratio",
            owner: "analytics_team"
        },
        active_customers: {
            sql: "COUNT(DISTINCT CASE WHEN dim_customer.last_activity > DATEADD(day, -30, CURRENT_DATE) THEN dim_customer.customer_id END)",
            description: "Customers with activity in the last 30 days",
            type: "measure",
            owner: "marketing_team"
        },
        churn_rate: {
            sql: "1.0 - (active_customers / total_customers)",
            description: "Percentage of customers who became inactive",
            type: "ratio",
            owner: "retention_team"
        }
    },

    // ── Business Entities ──
    entities: {
        customer: { table: "dim_customer", key: "customer_key" },
        product: { table: "dim_product", key: "product_key" },
        order: { table: "fct_orders", key: "order_key" }
    },

    // ── Synonyms (business terms → technical names) ──
    synonyms: {
        "sales": "revenue",
        "income": "revenue",
        "buyers": "dim_customer",
        "items": "dim_product",
        "AOV": "aov"
    },

    // ── Time Intelligence ──
    time_intelligence: {
        fiscal_year_start: "April",
        week_start: "Monday",
        default_timezone: "UTC"
    }
}

Insight: The semantic layer is what makes NL2SQL reliable. When Kim asks "what was revenue last quarter?", the ETL Agent does not guess what "revenue" means by scanning SQL. It looks up the revenue metric in the semantic layer and generates SUM(fct_orders.amount) with the correct filters. One definition, used everywhere.


NL2SQL: Natural Language to SQL #

The ETL Agent includes NL2SQL intelligence that translates natural language questions into SQL, respecting the semantic layer and the target dialect.

NEAM
// Ask a question in natural language
let answer = ask(SalesWarehouse, "What was revenue by channel last quarter?")
print(answer)

Under the hood, the ETL Agent:

  1. Resolves "revenue" to SUM(fct_orders.amount) via the semantic layer
  2. Resolves "channel" to dim_channel.channel_name via entities
  3. Resolves "last quarter" using the time intelligence fiscal calendar
  4. Generates dialect-specific SQL for the target warehouse
  5. Submits the SQL for execution
  6. Returns structured results
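Steps 1 and 2 amount to dictionary lookups against the semantic layer rather than free-form guessing. A minimal Python sketch (the `SEMANTIC` structure mirrors the `SalesGlossary` declaration; the resolution logic is an assumption about how such a lookup could work):

```python
# Hypothetical slice of the SalesGlossary semantic layer
SEMANTIC = {
    "metrics": {"revenue": "SUM(fct_orders.amount)"},
    "synonyms": {"sales": "revenue", "income": "revenue"},
}

def resolve_metric(term):
    """Normalize a business word via synonyms, then look up the
    governed SQL expression -- one definition, used everywhere."""
    canonical = SEMANTIC["synonyms"].get(term.lower(), term.lower())
    return SEMANTIC["metrics"][canonical]

print(resolve_metric("sales"))  # SUM(fct_orders.amount)
```

Whether Kim asks about "sales", "income", or "revenue", the generated SQL contains the same finance-team-owned expression.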

Generated SQL (Snowflake dialect) #

SQL
SELECT
    dc.channel_name AS channel,
    SUM(fo.amount) AS revenue
FROM analytics.fct_orders fo
JOIN analytics.dim_channel dc ON fo.channel_key = dc.channel_key
JOIN analytics.dim_date dd ON fo.date_key = dd.date_key
WHERE dd.date_day >= DATE_TRUNC('quarter', DATEADD(quarter, -1, CURRENT_DATE))
  AND dd.date_day < DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY dc.channel_name
ORDER BY revenue DESC;

Multi-Dialect SQL Transpilation #

The same Neam program can target different warehouses. The ETL Agent transpiles SQL to the target dialect automatically.

DIAGRAM SQL Transpilation
flowchart LR
  A["Neam Logical SQL\n(dialect-agnostic)"] --> B["Snowflake SQL\n(Snowpark syntax)"]
  A --> C["BigQuery SQL\n(Standard SQL + BQ functions)"]
  A --> D["Redshift SQL\n(PostgreSQL dialect + Redshift)"]
  A --> E["Spark SQL\n(HiveQL-compatible)"]
  A --> F["PostgreSQL\n(standard pg)"]
  A --> G["Teradata SQL\n(Vantage syntax)"]

Transpilation Examples #

| Operation | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Date diff | `DATEDIFF(day, a, b)` | `DATE_DIFF(b, a, DAY)` | `DATEDIFF(day, a, b)` |
| String concat | `a \|\| b` | `CONCAT(a, b)` | `a \|\| b` |
| Current timestamp | `CURRENT_TIMESTAMP()` | `CURRENT_TIMESTAMP()` | `GETDATE()` |
| Create-or-replace | `CREATE OR REPLACE TABLE` | `CREATE OR REPLACE TABLE` | `DROP TABLE IF EXISTS; CREATE TABLE` |
| Merge/Upsert | `MERGE INTO ... USING` | `MERGE INTO ... USING` | `DELETE + INSERT` (no native MERGE) |

The ETL Agent handles these dialect differences automatically. You write logical SQL once; it generates the correct syntax for each platform.
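One way to picture dialect handling is a rule table keyed by operation and dialect. A simplified Python sketch built from the rows above (a real transpiler operates on a SQL syntax tree, not format strings):

```python
# Rule table mirroring the dialect differences in the table above
RULES = {
    "snowflake": {"date_diff": lambda a, b: f"DATEDIFF(day, {a}, {b})",
                  "concat":    lambda a, b: f"{a} || {b}",
                  "now":       lambda: "CURRENT_TIMESTAMP()"},
    "bigquery":  {"date_diff": lambda a, b: f"DATE_DIFF({b}, {a}, DAY)",
                  "concat":    lambda a, b: f"CONCAT({a}, {b})",
                  "now":       lambda: "CURRENT_TIMESTAMP()"},
    "redshift":  {"date_diff": lambda a, b: f"DATEDIFF(day, {a}, {b})",
                  "concat":    lambda a, b: f"{a} || {b}",
                  "now":       lambda: "GETDATE()"},
}

def transpile(op, dialect, *args):
    """Render one logical operation in the target dialect."""
    return RULES[dialect][op](*args)

print(transpile("date_diff", "bigquery", "order_date", "ship_date"))
# DATE_DIFF(ship_date, order_date, DAY)
```

Note the argument order flips for BigQuery's `DATE_DIFF` — exactly the kind of silent inconsistency that makes hand-porting SQL error-prone.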


Auto-Modeling: AI-Driven Dimensional Design #

When building a new warehouse or onboarding a new source system, the ETL Agent can propose a dimensional model from OLTP schemas.

NEAM
etl agent AutoDesigner {
    provider: "openai", model: "gpt-4o",
    sources: [SimShopOLTP],
    warehouse: SnowflakeWH,

    auto_model: {
        strategy: "kimball",
        source_schema: "simshop_oltp",
        propose: true,
        require_approval: true
    }
}

Auto-Modeling Process #

DIAGRAM Auto-Modeling Process
flowchart LR
  A["OLTP Schema\n20 tables\nForeign keys\nData profiling"] --> B["Profile & Analyze\nCardinality analysis\nJoin pattern detection\nTemporal analysis\nPII detection"]
  B --> C["Propose Model\nStar schema with\n5 facts, 6 dims\nSCD recommendations\nGrain suggestions"]
  C --> D["Human Review\nApprove, modify,\nor reject"]

The agent examines foreign key relationships, cardinality ratios, temporal columns, and naming patterns to propose:

  - Fact table candidates, each with a suggested grain
  - Dimension candidates, each with an SCD strategy recommendation
  - Conformed dimensions shared across marts

When require_approval: true, the proposal is presented to the data architect for review before any SQL is generated. The agent is the brain; the human has the final say.
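A first cut at the fact-versus-dimension split can come from the FK graph alone. A deliberately crude Python sketch (the table names echo the SimShop OLTP schema; the two-FK threshold is an illustrative heuristic, not the agent's actual algorithm):

```python
def classify_tables(fk_graph):
    """Crude heuristic: a table referencing 2+ other tables is a fact
    candidate; a table that others reference is a dimension candidate."""
    referenced = {t for targets in fk_graph.values() for t in targets}
    facts = [t for t, fks in fk_graph.items() if len(fks) >= 2]
    dims = [t for t, fks in fk_graph.items()
            if len(fks) < 2 and t in referenced]
    return facts, dims

# Hypothetical simshop_oltp FK graph: table -> tables it references
oltp = {
    "order_items": ["orders", "products"],
    "orders": ["customers"],
    "customers": [],
    "products": ["product_categories"],
    "product_categories": [],
}
print(classify_tables(oltp))
# (['order_items'], ['orders', 'customers', 'products', 'product_categories'])
```

The heuristic misclassifies `orders` as a dimension, which is why the real agent layers on cardinality analysis, temporal profiling, and LLM reasoning — and why `require_approval: true` exists.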

Try It: Point the ETL Agent at the SimShop OLTP schema (20 tables) in DataSims with auto_model enabled. It should propose a star schema similar to the pre-built simshop_dw warehouse (15 tables: 5 fact tables, 6 dimensions, 4 aggregates).


Self-Healing Pipelines #

Pipelines fail. Sources go down. Schemas drift. Data arrives late. The ETL Agent's self-healing capability detects failures, diagnoses root causes, and applies automated remediation.

NEAM
etl agent ResilientWarehouse {
    provider: "openai", model: "gpt-4o",
    sources: [SimShopOLTP],
    warehouse: SnowflakeWH,

    self_heal: {
        enabled: true,
        max_retries: 3,
        strategies: ["retry", "schema_migrate", "fallback", "alert"],
        require_approval_for: ["schema_migrate"]
    },

    on_failure: "auto_fix"
}

Self-Healing Decision Tree #

DIAGRAM Self-Healing Decision Tree
flowchart TD
  PF["Pipeline Failure"] --> TE{"Transient Error?\n(timeout, network)"}
  PF --> SD{"Schema Drift?\n(column added/renamed)"}
  PF --> DQ{"Data Quality Failure?"}
  PF --> SU{"Source Unavailable?"}

  TE -->|Yes| R["Retry\n(up to max_retries)"]
  R --> RS{"Success?"}
  RS -->|YES| Resume1["Resume"]
  RS -->|NO| SD

  SD -->|Yes| SM["Schema Migration"]
  SM --> NA{"Needs approval?"}
  NA -->|YES| HR["Human review"]
  NA -->|NO| ALT["Apply ALTER TABLE,\nresume"]
  HR --> ALT

  DQ -->|Yes| QR["Quarantine bad rows,\nalert, continue"]

  SU -->|Yes| FB["Fallback to cached data,\nalert"]

Root Cause Detection #

The self-healing engine uses the LLM to diagnose failures:

  1. Collect context: Error message, pipeline state, recent schema changes, data profiling results
  2. Classify failure: Transient vs. structural vs. data quality
  3. Propose remediation: Retry, schema migration, quarantine, fallback, or escalation
  4. Execute or escalate: Automated for safe operations, human approval for destructive ones
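Steps 2 and 4 can be sketched as pattern-based classification plus an approval allowlist. A minimal Python illustration (the error signatures and strategy names are assumptions, loosely mirroring the self_heal block above):

```python
import re

# Hypothetical error signatures mapped to self_heal strategies
CLASSIFIERS = [
    (re.compile(r"timeout|connection reset", re.I), "retry"),
    (re.compile(r"column .* does not exist", re.I), "schema_migrate"),
    (re.compile(r"null value|constraint violation", re.I), "quarantine"),
]

def classify_failure(message):
    for pattern, strategy in CLASSIFIERS:
        if pattern.search(message):
            return strategy
    return "alert"  # escalate anything unrecognized

def remediate(message, approved=frozenset({"retry", "quarantine"})):
    """Execute safe strategies automatically; escalate the rest."""
    strategy = classify_failure(message)
    return strategy if strategy in approved else f"escalate:{strategy}"

print(remediate("connection reset by peer"))          # retry
print(remediate('column "event_ts" does not exist'))  # escalate:schema_migrate
```

In the real agent, classification is done by the LLM over richer context than an error string, but the approval boundary works the same way: destructive strategies never run without a human in the loop.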

Anti-Pattern: Setting require_approval_for: [] (empty) in production. This allows the agent to apply schema migrations without human review. Schema migrations can drop columns, change types, and break downstream consumers. Always require approval for structural changes in production.


Full Working Example: SimShop Sales Warehouse #

NEAM
// ════════════════════════════════════════════════════════
// SimShop Sales Warehouse — Full ETL Agent
// ════════════════════════════════════════════════════════

// ── Budget ──
budget WarehouseBudget { cost: 25.00, tokens: 200000 }

// ── Source System ──
source SimShopOLTP {
    type: "postgres",
    connection: env("SIMSHOP_PG_URL"),
    mode: "cdc",
    track_tables: ["orders", "order_items", "customers",
                   "products", "product_categories"]
}

// ── Target Warehouse ──
compute SimShopDW {
    engine: "postgres",
    connection: env("SIMSHOP_PG_URL"),
    database: "simshop"
}

// ── Semantic Layer ──
semantic SimShopGlossary {
    metrics: {
        revenue: {
            sql: "SUM(fct_orders.total_amount)",
            description: "Total revenue from completed orders",
            type: "measure",
            owner: "finance_team"
        },
        aov: {
            sql: "SUM(fct_orders.total_amount) / COUNT(DISTINCT fct_orders.order_id)",
            description: "Average order value",
            type: "ratio",
            owner: "analytics_team"
        },
        active_customers: {
            sql: "COUNT(DISTINCT CASE WHEN dc.last_activity > CURRENT_DATE - INTERVAL '30 days' THEN dc.customer_id END)",
            description: "Customers with activity in the last 30 days",
            type: "measure",
            owner: "marketing_team"
        }
    },
    entities: {
        customer: { table: "dim_customers", key: "customer_key" },
        product: { table: "dim_products", key: "product_key" },
        order: { table: "fct_orders", key: "order_key" }
    },
    synonyms: {
        "sales": "revenue",
        "income": "revenue",
        "buyers": "dim_customers"
    },
    time_intelligence: {
        fiscal_year_start: "January",
        week_start: "Monday",
        default_timezone: "UTC"
    }
}

// ── Quality Gate ──
quality WarehouseQuality {
    level: "strict",
    freshness: "1h",
    completeness: 0.98,
    uniqueness: ["order_id"],
    drift_detection: true,
    referential_integrity: true,
    on_fail: "gate_and_alert"
}

// ── The ETL Agent ──
etl agent SimShopWarehouse {
    provider: "openai",
    model: "gpt-4o",
    system: "You are a data warehouse architect for SimShop e-commerce.",
    temperature: 0.2,

    sources: [SimShopOLTP],
    warehouse: SimShopDW,

    model_type: "star",

    layers: {
        staging: {
            prefix: "stg_",
            operations: ["rename", "cast", "deduplicate", "filter_deleted"],
            materialization: "view"
        },
        integration: {
            prefix: "int_",
            operations: ["join", "conform", "business_logic"],
            materialization: "incremental"
        },
        marts: [
            mart SalesMart {
                facts: ["fct_orders", "fct_daily_revenue"],
                dimensions: ["dim_customers", "dim_products",
                            "dim_date", "dim_channels"],
                scd: {
                    dim_customers: "type2",
                    dim_products: "type1"
                },
                conformed: ["dim_date", "dim_customers"],
                grain: "one row per order line item",
                measures: {
                    total_revenue: "SUM(total_amount)",
                    order_count: "COUNT(DISTINCT order_id)",
                    avg_order_value: "SUM(total_amount) / COUNT(DISTINCT order_id)"
                }
            }
        ]
    },

    semantic: SimShopGlossary,
    quality: WarehouseQuality,

    self_heal: {
        enabled: true,
        max_retries: 3,
        strategies: ["retry", "schema_migrate", "fallback", "alert"],
        require_approval_for: ["schema_migrate"]
    },

    auto_model: {
        strategy: "kimball",
        source_schema: "simshop_oltp",
        propose: true,
        require_approval: true
    },

    lineage: true,
    budget: WarehouseBudget
}

// ── Execute Pipeline ──
let result = run_pipeline(SimShopWarehouse)
print(result)

// ── NL2SQL Query ──
let revenue = ask(SimShopWarehouse, "What was revenue by channel last quarter?")
print(revenue)

ETL Pipeline Schedule from DataSims #

The DataSims SimShop environment includes 15 pre-defined ETL pipelines that model a realistic warehouse schedule:

| Time | Pipeline | Source → Target |
|---|---|---|
| 2:00 AM | raw_to_staging_customers | oltp → staging |
| 2:00 AM | raw_to_staging_orders | oltp → staging |
| 3:00 AM | raw_to_staging_events | oltp → staging |
| 4:00 AM | staging_to_dim_customers (SCD2) | staging → dw |
| 4:00 AM | staging_to_dim_products | staging → dw |
| 5:00 AM | staging_to_fact_orders | staging → dw |
| 5:00 AM | staging_to_fact_activity | staging → dw |
| 6:00 AM | dw_to_churn_features | dw → features |
| 6:00 AM | dw_to_rec_features | dw → features |
| 6:00 AM | dw_to_ltv_features | dw → features |
| 7:00 AM | churn_model_scoring | features → predictions |
| 8:00 AM | daily_revenue_agg | dw → reports |
| 1st/mo | monthly_segment_agg | dw → reports |
| 9:00 AM | data_quality_checks | dw → quality |
| 10:00 AM | drift_detection | predictions → monitoring |

This schedule shows the layered dependency chain that the ETL Agent must respect: staging before dimensions, dimensions before facts, facts before features, features before scoring. The ETL Agent's layer architecture (staging → integration → marts) maps directly to this real-world scheduling pattern.
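Respecting that chain is a topological ordering problem. A small Python sketch using the standard library's `graphlib`, with dependency edges hand-derived from the schedule above (the mapping is illustrative, not actual DataSims metadata):

```python
from graphlib import TopologicalSorter

# Illustrative dependency edges: pipeline -> pipelines it must wait for
deps = {
    "staging_to_dim_customers": {"raw_to_staging_customers"},
    "staging_to_fact_orders": {"raw_to_staging_orders",
                               "staging_to_dim_customers"},
    "dw_to_churn_features": {"staging_to_fact_orders"},
    "churn_model_scoring": {"dw_to_churn_features"},
}

# static_order yields a valid run order: staging before dims,
# dims before facts, facts before features, features before scoring
print(list(TopologicalSorter(deps).static_order()))
```

A scheduler built this way can also start independent pipelines (the three 6:00 AM feature jobs, say) in parallel, since the sorter exposes which nodes are ready at each step.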


Industry Perspective #

The ETL Agent's design reflects three converging trends in modern data engineering:

The Semantic Layer Renaissance (2022-present): Tools like dbt Semantic Layer, Cube.dev, and AtScale have popularized the idea of a shared metric layer. Neam's semantic declaration brings this into the agent system, so NL2SQL queries and dashboard definitions use the same metric definitions.

SQL-First ELT (dbt, 2016-present): The shift from ETL (transform outside the warehouse) to ELT (transform inside the warehouse using SQL) has been the dominant trend in data engineering. The ETL Agent embodies this: it generates SQL and pushes execution to the warehouse.

AI-Assisted Data Modeling (emerging, 2024-present): Research from MIT, Stanford, and industry labs demonstrates that LLMs can propose dimensional models from OLTP schemas with 70-80% accuracy. The ETL Agent's auto_model capability operationalizes this research with a human-in-the-loop approval step.


The Evidence #

The DataSims environment tests ETL Agent capabilities across five dimensions:

  1. Dimensional modeling accuracy: The ETL Agent's auto-model proposal is compared against the pre-built simshop_dw star schema (15 tables). Measured by table coverage, grain correctness, and SCD strategy appropriateness.
  2. Self-healing reliability: DataSims injects a schema change at month 18 (column rename in events) and a source outage at month 15 (48-hour events gap). The ETL Agent must detect, diagnose, and recover from both.
  3. NL2SQL accuracy: The semantic layer is tested against 50 natural language questions with known correct SQL, spanning simple aggregation, multi-join, time intelligence, and complex filtering.
  4. Pipeline correctness: Row counts, checksums, and referential integrity are validated after each pipeline run against the data_quality.check_results table.
  5. Cost efficiency: LLM token usage and warehouse compute time are tracked via budget enforcement. The ETL Agent's cost is compared against manual development time.

To reproduce these experiments, see DataSims. For the language specification, see Neam: The AI-Native Programming Language.


Key Takeaways #

  1. The ETL Agent is a specialized Data Agent that adds warehouse intelligence: dimensional modeling, semantic layers, NL2SQL, SQL transpilation, auto-modeling, and self-healing.
  2. Mart declarations express facts (grain + measures), dimensions (SCD strategies), and conformed dimensions in a single block.
  3. The semantic layer provides a single source of truth for business metrics, entities, and synonyms — eliminating the "12% disagreement" problem.
  4. NL2SQL resolves natural language against the semantic layer and generates dialect-specific SQL, making data accessible to non-SQL users.
  5. Multi-dialect transpilation means the same Neam program targets Snowflake, BigQuery, Redshift, Spark SQL, and PostgreSQL without code changes.
  6. Auto-modeling proposes dimensional designs from OLTP schemas with human approval, accelerating warehouse bootstrapping.
  7. Self-healing detects transient failures, schema drift, and data quality issues, applying automated remediation within guardrails.