Chapter 14: The Modeling Agent — Architecture Intelligence #

"All models are wrong, but some are useful." — George E.P. Box, statistician


20 min read | Priya, Marcus | Part IV: Platform Intelligence Agents

What you'll learn:


The Problem #

Marcus joins a data science team at an insurance company. He is given access to a Snowflake account with 847 tables across 14 schemas. There is no data dictionary. There is no ER diagram. The last data architect left nine months ago, and her whiteboard diagrams were erased for a sprint planning session.

Marcus needs to build a claims prediction model. He needs to understand: Which tables contain claims data? What are the relationships between CLAIMS, POLICY_HOLDERS, POLICIES, COVERAGE_TYPES, and ADJUSTERS? Is CLAIMS.policy_id a foreign key to POLICIES.id or POLICIES.policy_number? Are the tables normalized or denormalized? Is there a dimensional model he should query, or is everything in 3NF?

He spends two weeks writing exploratory SQL, reverse-engineering relationships by joining on candidate keys and checking cardinality. He draws diagrams in Lucidchart. He discovers that CLAIMS_SUMMARY is a denormalized view that duplicates data from four source tables — but it is two months stale because the ETL job that refreshes it was disabled after a schema change.

The Modeling Agent does in minutes what Marcus did in weeks. It connects to INFORMATION_SCHEMA, reverse-engineers the physical schema, infers relationships from foreign keys and naming conventions, evaluates normalization health, and produces ER models at physical, logical, and conceptual levels. When Marcus needs a star schema for claims analytics, the agent designs it — with fact and dimension tables, SCD types, and grain definitions.


Schema Reverse-Engineering #

The schema_source declaration tells the Modeling Agent where to discover existing schemas:

NEAM
schema_source InsuranceDW {
    type: "snowflake",
    connection: env("SNOWFLAKE_ACCOUNT"),
    credentials: vault("snowflake/prod/creds"),
    databases: ["INSURANCE_PROD"],
    schemas: ["CLAIMS", "POLICY", "CUSTOMER", "REFERENCE"],
    include_views: true,
    include_constraints: true,
    include_statistics: true,
    refresh_interval: "6h"
}
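Under the hood, discovery begins with catalog queries. The sketch below illustrates the idea in Python, using SQLite's `sqlite_master` and `PRAGMA table_info` as a stand-in for Snowflake's `INFORMATION_SCHEMA.TABLES` and `COLUMNS` (the stand-in and the row layout are assumptions for illustration, not the agent's actual implementation):

```python
import sqlite3

def extract_metadata(conn):
    """Phase 1 sketch: pull table and column metadata from the catalog.

    SQLite stands in for Snowflake here; a real run would query
    INFORMATION_SCHEMA.TABLES / COLUMNS instead of sqlite_master / PRAGMA.
    """
    tables = {}
    cur = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    for (table,) in cur.fetchall():
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        # PRAGMA rows: (cid, name, type, notnull, dflt_value, pk)
        tables[table] = [{"name": c[1], "type": c[2], "pk": bool(c[5])} for c in cols]
    return tables

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policies (policy_id INTEGER PRIMARY KEY, start_date TEXT)")
conn.execute("CREATE TABLE claims (claim_id INTEGER PRIMARY KEY, policy_id INTEGER, amount REAL)")
meta = extract_metadata(conn)
```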

The reverse-engineering process works in four phases:

PIPELINE Schema Discovery Pipeline
flowchart TD
    P1["Phase 1: Metadata Extraction"]
    P1SRC["INFORMATION_SCHEMA.TABLES\nINFORMATION_SCHEMA.COLUMNS\nINFORMATION_SCHEMA.TABLE_CONSTRAINTS\nINFORMATION_SCHEMA.KEY_COLUMN_USAGE\nINFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS\nPlatform-specific statistics views"]
    P2["Phase 2: Relationship Inference"]
    P2SRC["Explicit: Foreign key constraints\nNaming: column_name matches (e.g., claims.policy_id → policies.id)\nCardinality: JOIN result analysis\nLLM: Semantic relationship guess (confidence-scored)"]
    P3["Phase 3: Pattern Recognition"]
    P3SRC["Identify: fact tables, dimension tables,\nbridge tables, audit columns,\nSCD patterns, junction tables"]
    P4["Phase 4: Model Generation"]
    P4SRC["Output: Physical ER model (as-is)\nLogical ER model (normalized)\nConceptual ER model (business)"]

    P1 --> P1SRC --> P2 --> P2SRC --> P3 --> P3SRC --> P4 --> P4SRC
Insight

Most databases have incomplete foreign key constraints. In the DataSims SimShop environment, only 64% of actual relationships are expressed as explicit foreign keys. The remaining 36% are inferred through naming conventions and cardinality analysis. The LLM adds semantic inference (e.g., recognizing that adj_id in the claims table refers to ADJUSTERS.adjuster_id despite the abbreviated column name).
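The naming-convention pass in Phase 2 can be sketched in a few lines of Python. This is a toy heuristic under assumed naming rules (a column named `<singular_table>_id` points at that table's key); the agent layers cardinality checks and the LLM's semantic pass on top of it:

```python
def singular(name):
    """Crude singularizer: policies -> policy, customers -> customer."""
    if name.endswith("ies"):
        return name[:-3] + "y"
    if name.endswith("s"):
        return name[:-1]
    return name

def infer_fk_candidates(tables):
    """Naming-based FK inference: a column <stem>_id is a candidate
    foreign key to a table whose singular form matches <stem>."""
    candidates = []
    for table, cols in tables.items():
        for col in cols:
            if not col.endswith("_id"):
                continue
            stem = col[:-3]                      # "policy_id" -> "policy"
            for target in tables:
                if target != table and singular(target) == stem:
                    candidates.append((f"{table}.{col}", f"{target}.id"))
    return candidates

schema = {
    "claims":    ["claim_id", "policy_id", "amount"],
    "policies":  ["id", "customer_id", "start_date"],
    "customers": ["id", "name"],
}
fks = infer_fk_candidates(schema)
# Candidates: claims.policy_id -> policies.id, policies.customer_id -> customers.id
```

Abbreviated names like adj_id defeat this heuristic, which is exactly where the confidence-scored LLM pass earns its keep.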


ER Modeling: Physical, Logical, and Conceptual #

The er_model declaration creates entity-relationship models at three levels of abstraction:

NEAM
er_model InsurancePhysical {
    level: "physical",
    schema_source: InsuranceDW,
    include_data_types: true,
    include_indexes: true,
    include_partitions: true
}

er_model InsuranceLogical {
    level: "logical",
    schema_source: InsuranceDW,
    normalize_to: "3NF",
    resolve_denormalization: true,
    include_cardinality: true
}

er_model InsuranceConceptual {
    level: "conceptual",
    schema_source: InsuranceDW,
    business_domains: ["claims", "underwriting", "customer"],
    glossary: BusinessGlossary,
    include_business_rules: true
}

The three levels serve different audiences:

DIAGRAM Three Levels of ER Modeling
flowchart LR
    subgraph Conceptual ["Conceptual (Business)\nAudience: Executives, BAs"]
        C_Cust["Customer"]
        C_Pol["Policy"]
        C_Claim["Claim"]
        C_Cust -- "has" --> C_Pol
        C_Pol -- "covers" --> C_Claim
    end

    subgraph Logical ["Logical (Architecture)"]
        L_Cust["Customer\ncustomer_id\nname\nemail"]
        L_Pol["Policy\npolicy_id\ncustomer_id (FK)\nstart_date"]
        L_Cust -- "1:N" --> L_Pol
    end

    subgraph Physical ["Physical (Platform)\nAudience: DBAs, Data Engineers"]
        P_Claims["CLAIMS.CLAIMS\nCLAIM_ID NUMBER(12)\nPOLICY_ID NUMBER(12)\nAMOUNT DECIMAL\nSTATUS VARCHAR(20)\nIDX: ix_claim_policy\nPART: by CLAIM_DATE"]
    end

Normalization Analysis: 1NF through 5NF #

The normalization_analysis declaration evaluates a schema against normal forms and reports violations:

NEAM
normalization_analysis InsuranceHealth {
    schema_source: InsuranceDW,
    target: "BCNF",
    analyze: {
        first_normal_form: true,
        second_normal_form: true,
        third_normal_form: true,
        boyce_codd_normal_form: true,
        fourth_normal_form: true,
        fifth_normal_form: false
    },
    report: {
        violations: true,
        recommendations: true,
        impact_assessment: true,
        estimated_effort: true
    },
    exclusions: {
        tables: ["*_STAGING", "*_TEMP"],
        schemas: ["SANDBOX"]
    }
}

The analysis produces a report like:

Normalization Report: INSURANCE_PROD
| Schema | Table | 1NF | 2NF | 3NF | BCNF | Issue |
|---|---|---|---|---|---|---|
| CLAIMS | CLAIMS | OK | OK | OK | OK | — |
| CLAIMS | CLAIMS_SUMMARY | OK | FAIL | — | — | Partial dep: adjuster name depends on adjuster_id, not full composite key |
| POLICY | POLICY_DETAILS | OK | OK | FAIL | — | Transitive dep: agent_office depends on agent_id, not policy_id |
| CUSTOMER | CUSTOMER_360 | FAIL | — | — | — | Repeating groups: phone1, phone2, phone3 columns |

Summary: 847 tables analyzed

  • 812 (95.9%) pass 3NF
  • 28 (3.3%) have 2NF violations
  • 7 (0.8%) have 1NF violations (repeating groups)

Recommendation: 3 tables require immediate attention (CLAIMS_SUMMARY, POLICY_DETAILS, CUSTOMER_360)
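The CUSTOMER_360 finding above illustrates how mechanical some of these checks are. A Python sketch of repeating-group detection (the column names and regex are illustrative assumptions, not the agent's actual rule set):

```python
import re
from collections import defaultdict

def find_repeating_groups(columns):
    """1NF sketch: flag numbered column families like phone1, phone2,
    phone3 -- the repeating-group pattern found in CUSTOMER_360."""
    groups = defaultdict(list)
    for col in columns:
        m = re.fullmatch(r"([a-z_]+?)_?(\d+)", col)
        if m:
            groups[m.group(1)].append(col)
    # Only families with more than one member are repeating groups
    return {base: cols for base, cols in groups.items() if len(cols) > 1}

violations = find_repeating_groups(
    ["customer_id", "name", "phone1", "phone2", "phone3",
     "address_line1", "address_line2"]
)
```

Detecting partial and transitive dependencies (the 2NF and 3NF checks) additionally requires profiling functional dependencies in the data, not just the column names.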

Anti-Pattern

Do not normalize everything to 5NF. Analytics workloads benefit from controlled denormalization. The Modeling Agent's normalization analysis is diagnostic, not prescriptive — it tells you where normalization violations exist so you can decide which ones matter. A denormalized fact table in a star schema is intentional. A denormalized OLTP table is a defect.


Dimensional Design: Star, Snowflake, Data Vault #

The dimensional_model declaration generates analytical schemas:

NEAM
dimensional_model ClaimsAnalytics {
    methodology: "star",
    grain: "one row per claim event",
    source_model: InsuranceLogical,

    facts: {
        fact_claims: {
            measures: ["claim_amount", "paid_amount", "reserve_amount"],
            degenerate_dimensions: ["claim_number"],
            partitioned_by: "claim_date",
            scd_type: "none"
        }
    },

    dimensions: {
        dim_customer: {
            source_entity: "Customer",
            scd_type: 2,
            tracked_columns: ["address", "risk_score", "segment"],
            business_key: "customer_number"
        },
        dim_policy: {
            source_entity: "Policy",
            scd_type: 1,
            business_key: "policy_number"
        },
        dim_date: {
            type: "date_dimension",
            range: { start: "2020-01-01", end: "2030-12-31" },
            fiscal_year_start: "April"
        },
        dim_coverage: {
            source_entity: "CoverageType",
            scd_type: 0,
            business_key: "coverage_code"
        }
    }
}
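The scd_type: 2 declaration on dim_customer generates standard expire-and-insert merge logic. A minimal Python sketch of that behavior, assuming is_current / valid_from / valid_to housekeeping columns (a common convention, though column names vary by shop):

```python
from datetime import date

def scd2_apply(dim_rows, incoming, tracked, business_key, today):
    """SCD Type 2 sketch: expire the current row and insert a new
    version when any tracked column changes (as dim_customer declares
    for address, risk_score, and segment)."""
    current = {r[business_key]: r for r in dim_rows if r["is_current"]}
    out = list(dim_rows)
    for row in incoming:
        old = current.get(row[business_key])
        if old and any(old[c] != row[c] for c in tracked):
            old["is_current"] = False           # expire the old version
            old["valid_to"] = today
            out.append({**row, "is_current": True, "valid_from": today, "valid_to": None})
        elif old is None:                       # brand-new business key
            out.append({**row, "is_current": True, "valid_from": today, "valid_to": None})
    return out

dim = [{"customer_number": "C1", "segment": "retail", "is_current": True,
        "valid_from": date(2024, 1, 1), "valid_to": None}]
new = [{"customer_number": "C1", "segment": "premium"}]
result = scd2_apply(dim, new, ["segment"], "customer_number", date(2025, 6, 1))
```

After the merge, the history table holds both versions: the retail row closed out on the change date, and a new current premium row.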

The Modeling Agent supports four dimensional methodologies:

| Methodology | Description | Best For |
|---|---|---|
| Star Schema | Fact table surrounded by denormalized dimensions | Simple queries, fast aggregations, BI tools |
| Snowflake Schema | Dimensions normalized into sub-dimensions | Storage efficiency, complex hierarchies |
| Starflake | Hybrid — some dimensions normalized, some not | Pragmatic balance of star and snowflake |
| Data Vault | Hub (business keys), Link (relationships), Satellite (descriptive) | Auditability, agile loading, historical tracking |

DIAGRAM Star Schema vs Data Vault
flowchart LR
    subgraph Star ["Star Schema"]
        direction TB
        dim_customer["dim_customer"]
        fact_claims["fact_claims"]
        dim_policy["dim_policy"]
        dim_date["dim_date"]
        dim_customer --- fact_claims
        fact_claims --- dim_policy
        fact_claims --- dim_date
    end

    subgraph Vault ["Data Vault"]
        direction TB
        HUB_CLAIM["HUB_CLAIM\nclaim_bk"]
        LNK_CLAIM_POLICY["LNK_CLAIM_POLICY"]
        HUB_POLICY["HUB_POLICY\npolicy_bk"]
        SAT_CLAIM["SAT_CLAIM\namount\nstatus\nload_date"]
        HUB_CLAIM --- LNK_CLAIM_POLICY
        LNK_CLAIM_POLICY --- HUB_POLICY
        LNK_CLAIM_POLICY --- SAT_CLAIM
    end
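The claim_bk and policy_bk labels in the vault diagram are business keys; hubs and links are commonly keyed by a hash of the normalized business key so the same entity hashes identically across loads. A sketch of that convention (MD5 and the `||` separator are typical Data Vault 2.0 choices, but conventions vary by shop):

```python
import hashlib

def hub_hash_key(*business_keys):
    """Data Vault sketch: surrogate key = hash of the normalized
    business key(s). Links hash the concatenation of their hubs' keys."""
    normalized = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

claim_hk = hub_hash_key("CLM-001")             # HUB_CLAIM key
link_hk = hub_hash_key("CLM-001", "POL-42")    # LNK_CLAIM_POLICY key
```

Because the key is deterministic, loads are idempotent and parallelizable: any loader can compute the key without a sequence or a lookup.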

DataMart Generation #

The datamart declaration generates purpose-specific schemas from dimensional models:

NEAM
datamart ClaimsMart {
    source_model: ClaimsAnalytics,
    purpose: "Claims analytics for underwriting team",
    target_platform: "snowflake",
    target_schema: "CLAIMS_MART",

    materialization: {
        strategy: "incremental",
        refresh_schedule: "0 6 * * *",
        incremental_key: "claim_date",
        lookback_window: "3d"
    },

    pre_aggregations: [
        {
            name: "monthly_claims_summary",
            dimensions: ["dim_coverage.coverage_type", "dim_customer.segment"],
            measures: ["SUM(claim_amount)", "COUNT(*)", "AVG(paid_amount)"],
            grain: "monthly"
        }
    ],

    access_policy: UnderwritingAccess
}
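The incremental strategy with a lookback_window reprocesses a trailing slice of the data on every refresh, so late-arriving claims still land in the mart. A Python sketch of the window arithmetic (the function and WHERE clause are illustrative, not generated NEAM output):

```python
from datetime import date, timedelta

def incremental_window(last_loaded, lookback_days=3):
    """ClaimsMart sketch: reprocess rows whose incremental_key
    (claim_date) falls inside the lookback window, so late-arriving
    claims are picked up on the next 6 AM refresh."""
    start = last_loaded - timedelta(days=lookback_days)
    return start, last_loaded

start, end = incremental_window(date(2025, 6, 10))
# Would translate to: WHERE claim_date >= :start AND claim_date <= :end
```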

Schema Amendments with Impact Analysis #

The amendment declaration proposes schema changes with full impact analysis before execution:

NEAM
amendment_config ChangeManagement {
    approval: {
        required_approvers: 2,
        approver_roles: ["data_architect", "data_engineer_lead"],
        auto_approve_minor: false
    },
    impact_analysis: {
        check_downstream_queries: true,
        check_downstream_pipelines: true,
        check_downstream_reports: true,
        check_data_contracts: true,
        breaking_change_threshold: "warn"
    },
    rollback: {
        auto_rollback_on_failure: true,
        keep_backup: "7d"
    }
}

amendment AddRiskScore {
    target: "CUSTOMER.CUSTOMERS",
    operation: "add_column",
    column: {
        name: "risk_score",
        type: "DECIMAL(5,2)",
        nullable: true,
        default: null,
        comment: "Underwriting risk score (0.00-99.99)"
    },
    config: ChangeManagement,
    justification: "Required for claims prediction model feature engineering"
}

The impact analysis report:

Amendment Impact Analysis: AddRiskScore

Target: CUSTOMER.CUSTOMERS.risk_score (new column)

Downstream Impact

| Category | Count | Details |
|---|---|---|
| Pipelines | 3 affected | customer_daily_load, customer_360, crm_sync |
| Queries | 12 affected | SELECT * queries will return new column |
| Reports | 0 affected | No reports reference CUSTOMERS directly |
| Contracts | 1 affected | customer_data_product needs version bump |

Breaking Changes: NONE (additive change, nullable column)

Recommendation: SAFE TO PROCEED

  • Update customer_data_product contract version 2.1 → 2.2
  • Notify downstream pipeline owners (auto-sent)

Estimated Effort: 0.5 person-hours (contract update only)
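The "NONE (additive change, nullable column)" verdict follows a simple rule: adding a nullable column (or one with a default) never breaks existing readers, while drops, renames, type changes, and NOT NULL additions without defaults do. A Python sketch of that classification (the operation names and rule set are illustrative assumptions):

```python
def classify_change(operation, column=None):
    """Impact-analysis sketch: the rule that makes AddRiskScore 'SAFE TO
    PROCEED'. Additive changes are non-breaking; anything that removes
    or reshapes an existing column is breaking."""
    if operation == "add_column":
        if column.get("nullable") or column.get("default") is not None:
            return "additive"
        return "breaking"            # NOT NULL with no default breaks inserts
    if operation in ("drop_column", "rename_column", "change_type"):
        return "breaking"
    return "unknown"

verdict = classify_change("add_column", {"name": "risk_score",
                                         "type": "DECIMAL(5,2)",
                                         "nullable": True, "default": None})
```

One caveat the report calls out: even an additive change ripples into SELECT * queries and data-contract versions, which is why the contract bump is still recommended.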

Try It

In the DataSims SimShop environment, the Modeling Agent can reverse-engineer all 164 tables in the simshop database, identify the star schema in the analytics layer, flag the 3 normalization violations in the staging tables, and generate a Data Vault model from the raw sources. Run neam-agents/programs/simshop_churn.neam to see the full pipeline.


External Tool Connectors #

The modeling_tool declaration synchronizes with external modeling tools:

NEAM
modeling_tool ErwinSync {
    type: "erwin",
    connection: env("ERWIN_API_URL"),
    credentials: vault("erwin/prod/api_key"),
    sync_mode: "bidirectional",
    sync_interval: "1h",
    model_mapping: {
        physical: "INSURANCE_PHYSICAL_v3",
        logical: "INSURANCE_LOGICAL_v3"
    }
}

modeling_tool PowerDesignerSync {
    type: "powerdesigner",
    connection: env("PD_REPO_URL"),
    credentials: vault("pd/prod/creds"),
    sync_mode: "push",
    sync_interval: "6h"
}

| Tool | Type Value | Capabilities | Use Case |
|---|---|---|---|
| Erwin Data Modeler | "erwin" | Physical/logical/conceptual models, forward/reverse engineering | Enterprise data architecture |
| ER/Studio | "erstudio" | Team-based modeling, repository management | Collaborative modeling |
| SAP PowerDesigner | "powerdesigner" | Conceptual, logical, physical, dimensional models | SAP ecosystem modeling |
| dbt | "dbt" | Model definitions, ref() graph, documentation | Analytics engineering |

Neam Code: A Complete Modeling Agent #

NEAM
// ═══════════════════════════════════════════════════════════════
// Modeling Agent — Digital Data Architect for Insurance
// ═══════════════════════════════════════════════════════════════

budget ModelBudget { cost: 20.00, tokens: 200000, time: 3600000 }

modeling agent Architect {
    provider: "anthropic",
    model: "claude-sonnet-4-6",
    system: "You are a data modeling agent. Reverse-engineer schemas, evaluate normalization, and design dimensional models.",
    temperature: 0.2,

    schema_sources: [InsuranceDW],
    er_models: [InsurancePhysical, InsuranceLogical, InsuranceConceptual],
    dimensional_models: [ClaimsAnalytics],
    datamarts: [ClaimsMart],
    normalization: [InsuranceHealth],
    amendments: [AddRiskScore],
    modeling_tools: [ErwinSync],

    amendment_config: ChangeManagement,

    budget: ModelBudget,
    agent_md: "./agents/data_architect.md"
}

// ─── Operational Commands ───
let physical = Architect.reverse_engineer(InsuranceDW)
let norm_report = Architect.analyze_normalization(InsuranceDW)
let star = Architect.generate_dimensional("star", InsuranceLogical)
let impact = Architect.assess_impact(AddRiskScore)

Industry Perspective #

Kimball vs. Inmon vs. Data Vault #

The Modeling Agent does not pick sides in the Kimball-Inmon-Data Vault debate. It supports all three, and the choice depends on organizational context:

| Methodology | Strengths | Best For | Modeling Agent Support |
|---|---|---|---|
| Kimball (star/snowflake) | Simple queries, fast BI | Departmental analytics, known query patterns | methodology: "star" or "snowflake" |
| Inmon (3NF enterprise) | Single source of truth, normalized | Enterprise-wide integration | normalize_to: "3NF" on er_model |
| Data Vault (hub/link/sat) | Auditability, agile loading, historization | Regulated industries, agile data warehousing | methodology: "data_vault" |
| Starflake (hybrid) | Pragmatic blend | Organizations with mixed needs | methodology: "starflake" |

Schema Drift: The Silent Architecture Erosion #

Industry surveys show that the average enterprise data warehouse experiences 23 schema changes per month. Without governance, this leads to "schema drift" — the gradual divergence of the actual schema from the documented schema. The Modeling Agent's continuous reverse-engineering closes this gap by treating the live INFORMATION_SCHEMA as the source of truth and flagging deviations from the modeled state.
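Drift detection itself is a diff between the documented model and the live catalog. A Python sketch of that comparison (the dict-of-columns representation is a simplification; real drift checks also compare types, constraints, and indexes):

```python
def diff_schemas(documented, live):
    """Schema-drift sketch: diff the modeled state against the live
    catalog, which the agent's continuous reverse-engineering treats
    as the source of truth."""
    drift = []
    for table, cols in live.items():
        if table not in documented:
            drift.append(("undocumented_table", table))
            continue
        doc_cols = documented[table]
        for col in cols:
            if col not in doc_cols:
                drift.append(("undocumented_column", f"{table}.{col}"))
        for col in doc_cols:
            if col not in cols:
                drift.append(("missing_column", f"{table}.{col}"))
    return drift

drift = diff_schemas(
    {"CLAIMS": ["CLAIM_ID", "POLICY_ID", "AMOUNT"]},
    {"CLAIMS": ["CLAIM_ID", "POLICY_ID", "AMOUNT", "RISK_FLAG"],
     "CLAIMS_TMP": ["X"]},
)
```

Each finding then becomes either an amendment proposal (update the model) or a flag on the change that bypassed governance.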


The Evidence #

DataSims experiments (DataSims repository) demonstrate the Modeling Agent's impact:

| Metric | Without Modeling Agent | With Modeling Agent | Improvement |
|---|---|---|---|
| Schema Understanding Time | 2.3 weeks (manual) | 12 minutes (auto) | 99.4% |
| Normalization Violations Found | 1.2 per audit (sample) | 7.4 per scan (exhaustive) | 6.2x |
| Dimensional Model Design Time | 3.1 days | 45 minutes | 95.5% |
| Schema Amendment Errors | 2.1 per quarter (no impact analysis) | 0.1 per quarter (with impact analysis) | 95.2% |

Ablation A5 (Modeling Agent removed) in the churn prediction experiment showed that feature engineering took 3.4x longer because the data scientist had to manually discover table relationships, identify the correct grain for joins, and reverse-engineer SCD patterns. The Modeling Agent's domain context — shared with the DataScientist Agent — eliminated this exploratory overhead entirely.


Key Takeaways #

For Further Exploration #