Chapter 9: The ETL Agent — SQL-First Warehousing #

"The goal is to turn data into information, and information into insight." — Carly Fiorina


Reading time: 30 min | Personas: Priya (Data Engineer), Kim (Data Analyst), Marcus (Data Scientist) | Part III: Data Infrastructure Agents


What You'll Learn #

  - How the ETL Agent extends the Data Agent (Chapter 8) with warehouse intelligence
  - Dimensional modeling methodologies (Kimball, Inmon, Data Vault 2.0) and SCD strategies
  - How the semantic layer makes business metrics executable and NL2SQL reliable
  - Multi-dialect SQL transpilation across Snowflake, BigQuery, Redshift, and more
  - Auto-modeling: AI-proposed star schemas with human-in-the-loop approval
  - Self-healing pipelines that detect, diagnose, and remediate failures

The Problem: The Warehouse That Nobody Trusted #

Kim is a Data Analyst at SimShop. Every Monday morning, she opens the weekly revenue dashboard and sees a number. She does not trust it. Last month, the dashboard showed a 40% revenue spike on a Tuesday. It turned out that the ETL pipeline had loaded Saturday's batch twice — a deduplication failure in the staging layer. The month before, the "active customers" metric disagreed between the marketing dashboard and the finance report by 12%, because each team defined "active" differently.

Kim's frustration is shared by 73% of data professionals surveyed by Monte Carlo (2023): they do not trust their own data pipelines. The problem is not SQL skill or warehouse capacity. It is the gap between business meaning and technical implementation. "Revenue" is a word. SUM(fct_sales.revenue) is a computation. When the mapping between the two is implicit — buried in dbt models, Looker views, or tribal knowledge — trust erodes.

The ETL Agent closes this gap with three innovations: a semantic layer that makes business definitions executable, auto-modeling that proposes dimensional designs from OLTP schemas, and self-healing that detects and fixes pipeline failures without human intervention.


The ETL Agent: Brain, Not Muscle #

The ETL Agent is a specialized subtype of the Data Agent (Chapter 8). It inherits sources, schemas, pipelines, quality gates, lineage, and catalog registration. What it adds is warehouse intelligence: the ability to think about data in terms of facts, dimensions, metrics, and business entities.

Critically, the ETL Agent is the brain, not the muscle. It generates SQL, submits it to a target warehouse, monitors execution, validates results, and self-heals on failure. All data-heavy computation is pushed down to the warehouse engine.

DIAGRAM The ETL Agent Lifecycle
flowchart LR
  A["PERCEIVE"] --> B["MODEL"]
  B --> C["GENERATE"]
  C --> D["EXECUTE"]
  D --> E["VALIDATE"]
  E --> F["SERVE"]

  A1["Connect to sources\nDiscover schemas\nProfile data\nDetect PII"] -.-> A
  B1["Propose star schema\nPropose facts + dims\nRecommend SCD"] -.-> B
  C1["Generate staging SQL\nGenerate integration SQL\nGenerate mart SQL"] -.-> C
  D1["Submit SQL to warehouse\nMonitor execution\nTrack cost\nSelf-heal"] -.-> D
  E1["Quality gates\nDrift detect\nReferential integrity\nRecord lineage"] -.-> E
  F1["Update semantic layer\nRegister in catalog\nNL2SQL ready"] -.-> F

What the ETL Agent Adds to the Data Agent #

| Capability | Data Agent (Ch. 8) | ETL Agent (Ch. 9) |
|---|---|---|
| Pipeline model | Generic E/T/L | SQL-first ELT with dialect-aware generation |
| Schema modeling | Flat schemas | Dimensional modeling (star, snowflake, Data Vault) |
| Pipeline architecture | Single pipeline | Multi-layer (staging, integration, marts) |
| Compute integration | Compute routing | Compute pushdown with SQL transpilation |
| Quality | Quality gates | Quality gates + self-healing |
| Catalog | Registration | Semantic layer with business glossary, metrics |
| Design | Manual | AI-planned (NL2SQL, auto-modeling) |

Dimensional Modeling in Neam #

The ETL Agent supports three modeling methodologies, each suited to different organizational needs.

Kimball Star Schema #

The most common approach. Facts at the center, dimensions around the edges. Optimized for query performance and analyst comprehension.

DIAGRAM Kimball Star Schema
flowchart TD
  F["fct_orders"]
  D1["dim_customer"] --- F
  D2["dim_product"] --- F
  D3["dim_date"] --- F
  D4["dim_channel"] --- F
  D5["dim_store"] --- F
  D6["dim_promotion"] --- F

Inmon (Enterprise Data Warehouse) #

Normalized 3NF in the integration layer, with star schemas built downstream in data marts. Prioritizes data consistency over query speed.

Data Vault 2.0 #

Hub-Link-Satellite architecture. Hubs hold business keys, Links hold relationships, Satellites hold descriptive attributes with full history. Best for environments with frequent schema change.


Mart Declarations #

Mart declarations are the ETL Agent's way of expressing dimensional models. A mart groups related fact and dimension tables and declares their grain, measures, and SCD strategies.

NEAM
// ── Fact Table Declaration ──
mart SalesMart {
    facts: ["fct_orders", "fct_daily_revenue"],

    dimensions: [
        "dim_customer", "dim_product", "dim_date",
        "dim_channel", "dim_store", "dim_promotion"
    ],

    // Slowly Changing Dimension strategies
    scd: {
        dim_customer: "type2",    // Full history tracking
        dim_product: "type1",     // Overwrite (latest only)
        dim_store: "type2"        // Full history tracking
    },

    // Conformed dimensions shared across marts
    conformed: ["dim_date", "dim_customer"],

    // Grain defines the level of detail
    grain: "one row per order line item",

    // Measures define computable metrics on the fact table
    measures: {
        total_revenue: "SUM(amount)",
        order_count: "COUNT(DISTINCT order_id)",
        avg_order_value: "SUM(amount) / COUNT(DISTINCT order_id)",
        items_sold: "SUM(quantity)"
    }
}
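The grain declaration is testable: the fact table's business key should be unique at the declared grain. A minimal Python sketch of the uniqueness check an agent could run, assuming hypothetical `order_id` and `line_number` key columns for the "one row per order line item" grain:

```python
from collections import Counter

def grain_violations(rows, key_columns):
    """Return grain keys that occur more than once; an empty result
    means the fact table honors its declared grain."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical rows at the declared grain: one row per order line item
fct_orders = [
    {"order_id": 1, "line_number": 1, "amount": 20.0},
    {"order_id": 1, "line_number": 2, "amount": 5.0},
    {"order_id": 2, "line_number": 1, "amount": 12.5},
]

print(grain_violations(fct_orders, ["order_id", "line_number"]))  # {}
```

In a real warehouse the same check is pushed down as a `GROUP BY ... HAVING COUNT(*) > 1` query rather than run in application code.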

SCD Strategies #

| Strategy | Behavior | Use When |
|---|---|---|
| type1 | Overwrite with latest value | You only need current state (e.g., product name) |
| type2 | Add new row with validity dates | You need full history (e.g., customer address changes) |
| type3 | Add column for previous value | You need limited history (e.g., previous and current) |

Anti-Pattern: Using SCD Type 1 for customer dimensions. When a customer changes their address and you overwrite it, you lose the ability to analyze historical orders by the address that was current at the time of the order. Use Type 2 for any dimension where history matters for analytics.
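Type 2 mechanics reduce to close-and-append versioning. A hedged Python sketch of the idea, assuming illustrative `valid_from`/`valid_to`/`is_current` housekeeping columns (not necessarily the columns Neam generates):

```python
from datetime import date

def scd2_upsert(dim_rows, incoming, key, tracked, today):
    """SCD Type 2: close the current version of a changed row,
    then append a new current version."""
    for row in dim_rows:
        if row[key] == incoming[key] and row["is_current"]:
            if all(row[c] == incoming[c] for c in tracked):
                return dim_rows  # nothing changed; keep current version
            row["valid_to"] = today       # close out the old version
            row["is_current"] = False
    dim_rows.append(dict(incoming, valid_from=today, valid_to=None,
                         is_current=True))
    return dim_rows

dim_customer = [{"customer_id": 42, "address": "1 Old Rd",
                 "valid_from": date(2023, 1, 1), "valid_to": None,
                 "is_current": True}]
scd2_upsert(dim_customer, {"customer_id": 42, "address": "9 New Ave"},
            key="customer_id", tracked=["address"], today=date(2024, 6, 1))
print(len(dim_customer))  # 2: one historical version, one current
```

This is exactly what Type 1 loses: after the upsert, historical orders can still join to the "1 Old Rd" version that was current when they were placed.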


The Semantic Layer #

The semantic layer bridges the gap between business language and SQL. It defines metrics, entities, synonyms, and time intelligence in one place, so that every dashboard, report, and NL2SQL query uses the same definitions.

NEAM
semantic SalesGlossary {
    // ── Named Metrics ──
    metrics: {
        revenue: {
            sql: "SUM(fct_orders.amount)",
            description: "Total revenue from completed orders",
            type: "measure",
            owner: "finance_team"
        },
        aov: {
            sql: "SUM(fct_orders.amount) / COUNT(DISTINCT fct_orders.order_id)",
            description: "Average order value",
            type: "ratio",
            owner: "analytics_team"
        },
        active_customers: {
            sql: "COUNT(DISTINCT CASE WHEN dim_customer.last_activity > DATEADD(day, -30, CURRENT_DATE) THEN dim_customer.customer_id END)",
            description: "Customers with activity in the last 30 days",
            type: "measure",
            owner: "marketing_team"
        },
        churn_rate: {
            sql: "1.0 - (active_customers / total_customers)",
            description: "Percentage of customers who became inactive",
            type: "ratio",
            owner: "retention_team"
        }
    },

    // ── Business Entities ──
    entities: {
        customer: { table: "dim_customer", key: "customer_key" },
        product: { table: "dim_product", key: "product_key" },
        order: { table: "fct_orders", key: "order_key" }
    },

    // ── Synonyms (business terms → technical names) ──
    synonyms: {
        "sales": "revenue",
        "income": "revenue",
        "buyers": "dim_customer",
        "items": "dim_product",
        "AOV": "aov"
    },

    // ── Time Intelligence ──
    time_intelligence: {
        fiscal_year_start: "April",
        week_start: "Monday",
        default_timezone: "UTC"
    }
}

Insight: The semantic layer is what makes NL2SQL reliable. When Kim asks "what was revenue last quarter?", the ETL Agent does not guess what "revenue" means by scanning SQL. It looks up the revenue metric in the semantic layer and generates SUM(fct_orders.amount) with the correct filters. One definition, used everywhere.


NL2SQL: Natural Language to SQL #

The ETL Agent includes NL2SQL intelligence that translates natural language questions into SQL, respecting the semantic layer and the target dialect.

NEAM
// Ask a question in natural language
let answer = ask(SalesWarehouse, "What was revenue by channel last quarter?")
print(answer)

Under the hood, the ETL Agent:

  1. Resolves "revenue" to SUM(fct_orders.amount) via the semantic layer
  2. Resolves "channel" to dim_channel.channel_name via entities
  3. Resolves "last quarter" using the time intelligence fiscal calendar
  4. Generates dialect-specific SQL for the target warehouse
  5. Submits the SQL for execution
  6. Returns structured results
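Steps 1 and 2 amount to dictionary lookups against the semantic layer rather than free-form guessing. A minimal Python sketch (the `SEMANTIC` structure mirrors the `SalesGlossary` declaration; the resolution logic is an assumption about how such a lookup could work):

```python
# Hypothetical slice of the SalesGlossary semantic layer
SEMANTIC = {
    "metrics": {"revenue": "SUM(fct_orders.amount)"},
    "synonyms": {"sales": "revenue", "income": "revenue"},
}

def resolve_metric(term):
    """Normalize a business word via synonyms, then look up the
    governed SQL expression -- one definition, used everywhere."""
    canonical = SEMANTIC["synonyms"].get(term.lower(), term.lower())
    return SEMANTIC["metrics"][canonical]

print(resolve_metric("sales"))  # SUM(fct_orders.amount)
```

Whether Kim asks about "sales", "income", or "revenue", the generated SQL contains the same finance-team-owned expression.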

Generated SQL (Snowflake dialect) #

SQL
SELECT
    dc.channel_name AS channel,
    SUM(fo.amount) AS revenue
FROM analytics.fct_orders fo
JOIN analytics.dim_channel dc ON fo.channel_key = dc.channel_key
JOIN analytics.dim_date dd ON fo.date_key = dd.date_key
WHERE dd.date_day >= DATE_TRUNC('quarter', DATEADD(quarter, -1, CURRENT_DATE))
  AND dd.date_day < DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY dc.channel_name
ORDER BY revenue DESC;

Multi-Dialect SQL Transpilation #

The same Neam program can target different warehouses. The ETL Agent transpiles SQL to the target dialect automatically.

DIAGRAM SQL Transpilation
flowchart LR
  A["Neam Logical SQL\n(dialect-agnostic)"] --> B["Snowflake SQL\n(Snowpark syntax)"]
  A --> C["BigQuery SQL\n(Standard SQL + BQ functions)"]
  A --> D["Redshift SQL\n(PostgreSQL dialect + Redshift)"]
  A --> E["Spark SQL\n(HiveQL-compatible)"]
  A --> F["PostgreSQL\n(standard pg)"]
  A --> G["Teradata SQL\n(Vantage syntax)"]

Transpilation Examples #

| Operation | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Date diff | `DATEDIFF(day, a, b)` | `DATE_DIFF(b, a, DAY)` | `DATEDIFF(day, a, b)` |
| String concat | `a \|\| b` | `CONCAT(a, b)` | `a \|\| b` |
| Current timestamp | `CURRENT_TIMESTAMP()` | `CURRENT_TIMESTAMP()` | `GETDATE()` |
| Create-or-replace | `CREATE OR REPLACE TABLE` | `CREATE OR REPLACE TABLE` | `DROP TABLE IF EXISTS; CREATE TABLE` |
| Merge/Upsert | `MERGE INTO ... USING` | `MERGE INTO ... USING` | `DELETE + INSERT` (no native MERGE) |

The ETL Agent handles these dialect differences automatically. You write logical SQL once; it generates the correct syntax for each platform.
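One way to picture dialect handling is a rule table keyed by operation and dialect. A simplified Python sketch built from the rows above (a real transpiler operates on a SQL syntax tree, not format strings):

```python
# Rule table mirroring the dialect differences in the table above
RULES = {
    "snowflake": {"date_diff": lambda a, b: f"DATEDIFF(day, {a}, {b})",
                  "concat":    lambda a, b: f"{a} || {b}",
                  "now":       lambda: "CURRENT_TIMESTAMP()"},
    "bigquery":  {"date_diff": lambda a, b: f"DATE_DIFF({b}, {a}, DAY)",
                  "concat":    lambda a, b: f"CONCAT({a}, {b})",
                  "now":       lambda: "CURRENT_TIMESTAMP()"},
    "redshift":  {"date_diff": lambda a, b: f"DATEDIFF(day, {a}, {b})",
                  "concat":    lambda a, b: f"{a} || {b}",
                  "now":       lambda: "GETDATE()"},
}

def transpile(op, dialect, *args):
    """Render one logical operation in the target dialect."""
    return RULES[dialect][op](*args)

print(transpile("date_diff", "bigquery", "order_date", "ship_date"))
# DATE_DIFF(ship_date, order_date, DAY)
```

Note the argument order flips for BigQuery's `DATE_DIFF` — exactly the kind of silent inconsistency that makes hand-porting SQL error-prone.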


Auto-Modeling: AI-Driven Dimensional Design #

When building a new warehouse or onboarding a new source system, the ETL Agent can propose a dimensional model from OLTP schemas.

NEAM
etl agent AutoDesigner {
    provider: "openai", model: "gpt-4o",
    sources: [SimShopOLTP],
    warehouse: SnowflakeWH,

    auto_model: {
        strategy: "kimball",
        source_schema: "simshop_oltp",
        propose: true,
        require_approval: true
    }
}

Auto-Modeling Process #

DIAGRAM Auto-Modeling Process
flowchart LR
  A["OLTP Schema\n20 tables\nForeign keys\nData profiling"] --> B["Profile & Analyze\nCardinality analysis\nJoin pattern detection\nTemporal analysis\nPII detection"]
  B --> C["Propose Model\nStar schema with\n5 facts, 6 dims\nSCD recommendations\nGrain suggestions"]
  C --> D["Human Review\nApprove, modify,\nor reject"]

The agent examines foreign key relationships, cardinality ratios, temporal columns, and naming patterns to propose:

  - Fact table candidates, each with a suggested grain
  - Dimension candidates, each with an SCD strategy recommendation
  - Conformed dimensions shared across marts

When require_approval: true, the proposal is presented to the data architect for review before any SQL is generated. The agent is the brain; the human has the final say.
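A first cut at the fact-versus-dimension split can come from the FK graph alone. A deliberately crude Python sketch (the table names echo the SimShop OLTP schema; the two-FK threshold is an illustrative heuristic, not the agent's actual algorithm):

```python
def classify_tables(fk_graph):
    """Crude heuristic: a table referencing 2+ other tables is a fact
    candidate; a table that others reference is a dimension candidate."""
    referenced = {t for targets in fk_graph.values() for t in targets}
    facts = [t for t, fks in fk_graph.items() if len(fks) >= 2]
    dims = [t for t, fks in fk_graph.items()
            if len(fks) < 2 and t in referenced]
    return facts, dims

# Hypothetical simshop_oltp FK graph: table -> tables it references
oltp = {
    "order_items": ["orders", "products"],
    "orders": ["customers"],
    "customers": [],
    "products": ["product_categories"],
    "product_categories": [],
}
print(classify_tables(oltp))
# (['order_items'], ['orders', 'customers', 'products', 'product_categories'])
```

The heuristic misclassifies `orders` as a dimension, which is why the real agent layers on cardinality analysis, temporal profiling, and LLM reasoning — and why `require_approval: true` exists.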

Try It: Point the ETL Agent at the SimShop OLTP schema (20 tables) in DataSims with auto_model enabled. It should propose a star schema similar to the pre-built simshop_dw warehouse (15 tables: 5 fact tables, 6 dimensions, 4 aggregates).


Self-Healing Pipelines #

Pipelines fail. Sources go down. Schemas drift. Data arrives late. The ETL Agent's self-healing capability detects failures, diagnoses root causes, and applies automated remediation.

NEAM
etl agent ResilientWarehouse {
    provider: "openai", model: "gpt-4o",
    sources: [SimShopOLTP],
    warehouse: SnowflakeWH,

    self_heal: {
        enabled: true,
        max_retries: 3,
        strategies: ["retry", "schema_migrate", "fallback", "alert"],
        require_approval_for: ["schema_migrate"]
    },

    on_failure: "auto_fix"
}

Self-Healing Decision Tree #

DIAGRAM Self-Healing Decision Tree
flowchart TD
  PF["Pipeline Failure"] --> TE{"Transient Error?\n(timeout, network)"}
  PF --> SD{"Schema Drift?\n(column added/renamed)"}
  PF --> DQ{"Data Quality Failure?"}
  PF --> SU{"Source Unavailable?"}

  TE -->|Yes| R["Retry\n(up to max_retries)"]
  R --> RS{"Success?"}
  RS -->|YES| Resume1["Resume"]
  RS -->|NO| SD

  SD -->|Yes| SM["Schema Migration"]
  SM --> NA{"Needs approval?"}
  NA -->|YES| HR["Human review"]
  NA -->|NO| ALT["Apply ALTER TABLE,\nresume"]
  HR --> ALT

  DQ -->|Yes| QR["Quarantine bad rows,\nalert, continue"]

  SU -->|Yes| FB["Fallback to cached data,\nalert"]

Root Cause Detection #

The self-healing engine uses the LLM to diagnose failures:

  1. Collect context: Error message, pipeline state, recent schema changes, data profiling results
  2. Classify failure: Transient vs. structural vs. data quality
  3. Propose remediation: Retry, schema migration, quarantine, fallback, or escalation
  4. Execute or escalate: Automated for safe operations, human approval for destructive ones
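Steps 2 and 4 can be sketched as pattern-based classification plus an approval allowlist. A minimal Python illustration (the error signatures and strategy names are assumptions, loosely mirroring the self_heal block above):

```python
import re

# Hypothetical error signatures mapped to self_heal strategies
CLASSIFIERS = [
    (re.compile(r"timeout|connection reset", re.I), "retry"),
    (re.compile(r"column .* does not exist", re.I), "schema_migrate"),
    (re.compile(r"null value|constraint violation", re.I), "quarantine"),
]

def classify_failure(message):
    for pattern, strategy in CLASSIFIERS:
        if pattern.search(message):
            return strategy
    return "alert"  # escalate anything unrecognized

def remediate(message, approved=frozenset({"retry", "quarantine"})):
    """Execute safe strategies automatically; escalate the rest."""
    strategy = classify_failure(message)
    return strategy if strategy in approved else f"escalate:{strategy}"

print(remediate("connection reset by peer"))          # retry
print(remediate('column "event_ts" does not exist'))  # escalate:schema_migrate
```

In the real agent, classification is done by the LLM over richer context than an error string, but the approval boundary works the same way: destructive strategies never run without a human in the loop.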

Anti-Pattern: Setting require_approval_for: [] (empty) in production. This allows the agent to apply schema migrations without human review. Schema migrations can drop columns, change types, and break downstream consumers. Always require approval for structural changes in production.


Full Working Example: SimShop Sales Warehouse #

NEAM
// ════════════════════════════════════════════════════════
// SimShop Sales Warehouse — Full ETL Agent
// ════════════════════════════════════════════════════════

// ── Budget ──
budget WarehouseBudget { cost: 25.00, tokens: 200000 }

// ── Source System ──
source SimShopOLTP {
    type: "postgres",
    connection: env("SIMSHOP_PG_URL"),
    mode: "cdc",
    track_tables: ["orders", "order_items", "customers",
                   "products", "product_categories"]
}

// ── Target Warehouse ──
compute SimShopDW {
    engine: "postgres",
    connection: env("SIMSHOP_PG_URL"),
    database: "simshop"
}

// ── Semantic Layer ──
semantic SimShopGlossary {
    metrics: {
        revenue: {
            sql: "SUM(fct_orders.total_amount)",
            description: "Total revenue from completed orders",
            type: "measure",
            owner: "finance_team"
        },
        aov: {
            sql: "SUM(fct_orders.total_amount) / COUNT(DISTINCT fct_orders.order_id)",
            description: "Average order value",
            type: "ratio",
            owner: "analytics_team"
        },
        active_customers: {
            sql: "COUNT(DISTINCT CASE WHEN dc.last_activity > CURRENT_DATE - INTERVAL '30 days' THEN dc.customer_id END)",
            description: "Customers with activity in the last 30 days",
            type: "measure",
            owner: "marketing_team"
        }
    },
    entities: {
        customer: { table: "dim_customers", key: "customer_key" },
        product: { table: "dim_products", key: "product_key" },
        order: { table: "fct_orders", key: "order_key" }
    },
    synonyms: {
        "sales": "revenue",
        "income": "revenue",
        "buyers": "dim_customers"
    },
    time_intelligence: {
        fiscal_year_start: "January",
        week_start: "Monday",
        default_timezone: "UTC"
    }
}

// ── Quality Gate ──
quality WarehouseQuality {
    level: "strict",
    freshness: "1h",
    completeness: 0.98,
    uniqueness: ["order_id"],
    drift_detection: true,
    referential_integrity: true,
    on_fail: "gate_and_alert"
}

// ── The ETL Agent ──
etl agent SimShopWarehouse {
    provider: "openai",
    model: "gpt-4o",
    system: "You are a data warehouse architect for SimShop e-commerce.",
    temperature: 0.2,

    sources: [SimShopOLTP],
    warehouse: SimShopDW,

    model_type: "star",

    layers: {
        staging: {
            prefix: "stg_",
            operations: ["rename", "cast", "deduplicate", "filter_deleted"],
            materialization: "view"
        },
        integration: {
            prefix: "int_",
            operations: ["join", "conform", "business_logic"],
            materialization: "incremental"
        },
        marts: [
            mart SalesMart {
                facts: ["fct_orders", "fct_daily_revenue"],
                dimensions: ["dim_customers", "dim_products",
                            "dim_date", "dim_channels"],
                scd: {
                    dim_customers: "type2",
                    dim_products: "type1"
                },
                conformed: ["dim_date", "dim_customers"],
                grain: "one row per order line item",
                measures: {
                    total_revenue: "SUM(total_amount)",
                    order_count: "COUNT(DISTINCT order_id)",
                    avg_order_value: "SUM(total_amount) / COUNT(DISTINCT order_id)"
                }
            }
        ]
    },

    semantic: SimShopGlossary,
    quality: WarehouseQuality,

    self_heal: {
        enabled: true,
        max_retries: 3,
        strategies: ["retry", "schema_migrate", "fallback", "alert"],
        require_approval_for: ["schema_migrate"]
    },

    auto_model: {
        strategy: "kimball",
        source_schema: "simshop_oltp",
        propose: true,
        require_approval: true
    },

    lineage: true,
    budget: WarehouseBudget
}

// ── Execute Pipeline ──
let result = run_pipeline(SimShopWarehouse)
print(result)

// ── NL2SQL Query ──
let revenue = ask(SimShopWarehouse, "What was revenue by channel last quarter?")
print(revenue)

ETL Pipeline Schedule from DataSims #

The DataSims SimShop environment includes 15 pre-defined ETL pipelines that model a realistic warehouse schedule:

| Time | Pipeline | Source → Target |
|---|---|---|
| 2:00 AM | raw_to_staging_customers | oltp → staging |
| 2:00 AM | raw_to_staging_orders | oltp → staging |
| 3:00 AM | raw_to_staging_events | oltp → staging |
| 4:00 AM | staging_to_dim_customers (SCD2) | staging → dw |
| 4:00 AM | staging_to_dim_products | staging → dw |
| 5:00 AM | staging_to_fact_orders | staging → dw |
| 5:00 AM | staging_to_fact_activity | staging → dw |
| 6:00 AM | dw_to_churn_features | dw → features |
| 6:00 AM | dw_to_rec_features | dw → features |
| 6:00 AM | dw_to_ltv_features | dw → features |
| 7:00 AM | churn_model_scoring | features → predictions |
| 8:00 AM | daily_revenue_agg | dw → reports |
| 1st/mo | monthly_segment_agg | dw → reports |
| 9:00 AM | data_quality_checks | dw → quality |
| 10:00 AM | drift_detection | predictions → monitoring |

This schedule shows the layered dependency chain that the ETL Agent must respect: staging before dimensions, dimensions before facts, facts before features, features before scoring. The ETL Agent's layer architecture (staging → integration → marts) maps directly to this real-world scheduling pattern.
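Respecting that chain is a topological ordering problem. A small Python sketch using the standard library's `graphlib`, with dependency edges hand-derived from the schedule above (the mapping is illustrative, not actual DataSims metadata):

```python
from graphlib import TopologicalSorter

# Illustrative dependency edges: pipeline -> pipelines it must wait for
deps = {
    "staging_to_dim_customers": {"raw_to_staging_customers"},
    "staging_to_fact_orders": {"raw_to_staging_orders",
                               "staging_to_dim_customers"},
    "dw_to_churn_features": {"staging_to_fact_orders"},
    "churn_model_scoring": {"dw_to_churn_features"},
}

# static_order yields a valid run order: staging before dims,
# dims before facts, facts before features, features before scoring
print(list(TopologicalSorter(deps).static_order()))
```

A scheduler built this way can also start independent pipelines (the three 6:00 AM feature jobs, say) in parallel, since the sorter exposes which nodes are ready at each step.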


Industry Perspective #

The ETL Agent's design reflects three converging trends in modern data engineering:

The Semantic Layer Renaissance (2022-present): Tools like dbt Semantic Layer, Cube.dev, and AtScale have popularized the idea of a shared metric layer. Neam's semantic declaration brings this into the agent system, so NL2SQL queries and dashboard definitions use the same metric definitions.

SQL-First ELT (dbt, 2016-present): The shift from ETL (transform outside the warehouse) to ELT (transform inside the warehouse using SQL) has been the dominant trend in data engineering. The ETL Agent embodies this: it generates SQL and pushes execution to the warehouse.

AI-Assisted Data Modeling (emerging, 2024-present): Research from MIT, Stanford, and industry labs demonstrates that LLMs can propose dimensional models from OLTP schemas with 70-80% accuracy. The ETL Agent's auto_model capability operationalizes this research with a human-in-the-loop approval step.


The Evidence #

The DataSims environment tests ETL Agent capabilities across five dimensions:

  1. Dimensional modeling accuracy: The ETL Agent's auto-model proposal is compared against the pre-built simshop_dw star schema (15 tables). Measured by table coverage, grain correctness, and SCD strategy appropriateness.
  2. Self-healing reliability: DataSims injects a schema change at month 18 (column rename in events) and a source outage at month 15 (48-hour events gap). The ETL Agent must detect, diagnose, and recover from both.
  3. NL2SQL accuracy: The semantic layer is tested against 50 natural language questions with known correct SQL, spanning simple aggregation, multi-join, time intelligence, and complex filtering.
  4. Pipeline correctness: Row counts, checksums, and referential integrity are validated after each pipeline run against the data_quality.check_results table.
  5. Cost efficiency: LLM token usage and warehouse compute time are tracked via budget enforcement. The ETL Agent's cost is compared against manual development time.

To reproduce these experiments, see DataSims. For the language specification, see Neam: The AI-Native Programming Language.


Key Takeaways #

  1. The ETL Agent is a specialized Data Agent that adds warehouse intelligence: dimensional modeling, semantic layers, NL2SQL, SQL transpilation, auto-modeling, and self-healing.
  2. Mart declarations express facts (grain + measures), dimensions (SCD strategies), and conformed dimensions in a single block.
  3. The semantic layer provides a single source of truth for business metrics, entities, and synonyms — eliminating the "12% disagreement" problem.
  4. NL2SQL resolves natural language against the semantic layer and generates dialect-specific SQL, making data accessible to non-SQL users.
  5. Multi-dialect transpilation means the same Neam program targets Snowflake, BigQuery, Redshift, Spark SQL, and PostgreSQL without code changes.
  6. Auto-modeling proposes dimensional designs from OLTP schemas with human approval, accelerating warehouse bootstrapping.
  7. Self-healing detects transient failures, schema drift, and data quality issues, applying automated remediation within guardrails.