Chapter 14: The Modeling Agent — Architecture Intelligence #

"All models are wrong, but some are useful." — George E.P. Box, statistician


20 min read | Priya, Marcus | Part IV: Platform Intelligence Agents

What you'll learn:


The Problem #

Marcus joins a data science team at an insurance company. He is given access to a Snowflake account with 847 tables across 14 schemas. There is no data dictionary. There is no ER diagram. The last data architect left nine months ago, and her whiteboard diagrams were erased for a sprint planning session.

Marcus needs to build a claims prediction model. He needs to understand: Which tables contain claims data? What are the relationships between CLAIMS, POLICY_HOLDERS, POLICIES, COVERAGE_TYPES, and ADJUSTERS? Is CLAIMS.policy_id a foreign key to POLICIES.id or POLICIES.policy_number? Are the tables normalized or denormalized? Is there a dimensional model he should query, or is everything in 3NF?

He spends two weeks writing exploratory SQL, reverse-engineering relationships by joining on candidate keys and checking cardinality. He draws diagrams in Lucidchart. He discovers that CLAIMS_SUMMARY is a denormalized view that duplicates data from four source tables — but it is two months stale because the ETL job that refreshes it was disabled after a schema change.

The Modeling Agent does in minutes what Marcus did in weeks. It connects to INFORMATION_SCHEMA, reverse-engineers the physical schema, infers relationships from foreign keys and naming conventions, evaluates normalization health, and produces ER models at physical, logical, and conceptual levels. When Marcus needs a star schema for claims analytics, the agent designs it — with fact and dimension tables, SCD types, and grain definitions.


Schema Reverse-Engineering #

The schema_source declaration tells the Modeling Agent where to discover existing schemas:

NEAM
schema_source InsuranceDW {
    type: "snowflake",
    connection: env("SNOWFLAKE_ACCOUNT"),
    credentials: vault("snowflake/prod/creds"),
    databases: ["INSURANCE_PROD"],
    schemas: ["CLAIMS", "POLICY", "CUSTOMER", "REFERENCE"],
    include_views: true,
    include_constraints: true,
    include_statistics: true,
    refresh_interval: "6h"
}
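Under the hood, discovery begins with catalog queries. The sketch below illustrates the idea in Python, using SQLite's `sqlite_master` and `PRAGMA table_info` as a stand-in for Snowflake's `INFORMATION_SCHEMA.TABLES` and `COLUMNS` (the stand-in and the row layout are assumptions for illustration, not the agent's actual implementation):

```python
import sqlite3

def extract_metadata(conn):
    """Phase 1 sketch: pull table and column metadata from the catalog.

    SQLite stands in for Snowflake here; a real run would query
    INFORMATION_SCHEMA.TABLES / COLUMNS instead of sqlite_master / PRAGMA.
    """
    tables = {}
    cur = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    for (table,) in cur.fetchall():
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        # PRAGMA rows: (cid, name, type, notnull, dflt_value, pk)
        tables[table] = [{"name": c[1], "type": c[2], "pk": bool(c[5])} for c in cols]
    return tables

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policies (policy_id INTEGER PRIMARY KEY, start_date TEXT)")
conn.execute("CREATE TABLE claims (claim_id INTEGER PRIMARY KEY, policy_id INTEGER, amount REAL)")
meta = extract_metadata(conn)
```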

The reverse-engineering process works in four phases:

PIPELINE Schema Discovery Pipeline
flowchart TD
    P1["Phase 1: Metadata Extraction"]
    P1SRC["INFORMATION_SCHEMA.TABLES\nINFORMATION_SCHEMA.COLUMNS\nINFORMATION_SCHEMA.TABLE_CONSTRAINTS\nINFORMATION_SCHEMA.KEY_COLUMN_USAGE\nINFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS\nPlatform-specific statistics views"]
    P2["Phase 2: Relationship Inference"]
    P2SRC["Explicit: Foreign key constraints\nNaming: column_name matches (e.g., claims.policy_id → policies.id)\nCardinality: JOIN result analysis\nLLM: Semantic relationship guess (confidence-scored)"]
    P3["Phase 3: Pattern Recognition"]
    P3SRC["Identify: fact tables, dimension tables,\nbridge tables, audit columns,\nSCD patterns, junction tables"]
    P4["Phase 4: Model Generation"]
    P4SRC["Output: Physical ER model (as-is)\nLogical ER model (normalized)\nConceptual ER model (business)"]

    P1 --> P1SRC --> P2 --> P2SRC --> P3 --> P3SRC --> P4 --> P4SRC
Insight

Most databases have incomplete foreign key constraints. In the DataSims SimShop environment, only 64% of actual relationships are expressed as explicit foreign keys. The remaining 36% are inferred through naming conventions and cardinality analysis. The LLM adds semantic inference (e.g., recognizing that adj_id in the claims table refers to ADJUSTERS.adjuster_id despite the abbreviated column name).
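The naming-convention pass in Phase 2 can be sketched in a few lines of Python. This is a toy heuristic under assumed naming rules (a column named `<singular_table>_id` points at that table's key); the agent layers cardinality checks and the LLM's semantic pass on top of it:

```python
def singular(name):
    """Crude singularizer: policies -> policy, customers -> customer."""
    if name.endswith("ies"):
        return name[:-3] + "y"
    if name.endswith("s"):
        return name[:-1]
    return name

def infer_fk_candidates(tables):
    """Naming-based FK inference: a column <stem>_id is a candidate
    foreign key to a table whose singular form matches <stem>."""
    candidates = []
    for table, cols in tables.items():
        for col in cols:
            if not col.endswith("_id"):
                continue
            stem = col[:-3]                      # "policy_id" -> "policy"
            for target in tables:
                if target != table and singular(target) == stem:
                    candidates.append((f"{table}.{col}", f"{target}.id"))
    return candidates

schema = {
    "claims":    ["claim_id", "policy_id", "amount"],
    "policies":  ["id", "customer_id", "start_date"],
    "customers": ["id", "name"],
}
fks = infer_fk_candidates(schema)
# Candidates: claims.policy_id -> policies.id, policies.customer_id -> customers.id
```

Abbreviated names like adj_id defeat this heuristic, which is exactly where the confidence-scored LLM pass earns its keep.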


ER Modeling: Physical, Logical, and Conceptual #

The er_model declaration creates entity-relationship models at three levels of abstraction:

NEAM
er_model InsurancePhysical {
    level: "physical",
    schema_source: InsuranceDW,
    include_data_types: true,
    include_indexes: true,
    include_partitions: true
}

er_model InsuranceLogical {
    level: "logical",
    schema_source: InsuranceDW,
    normalize_to: "3NF",
    resolve_denormalization: true,
    include_cardinality: true
}

er_model InsuranceConceptual {
    level: "conceptual",
    schema_source: InsuranceDW,
    business_domains: ["claims", "underwriting", "customer"],
    glossary: BusinessGlossary,
    include_business_rules: true
}

The three levels serve different audiences:

DIAGRAM Three Levels of ER Modeling
flowchart LR
    subgraph Conceptual ["Conceptual (Business)\nAudience: Executives, BAs"]
        C_Cust["Customer"]
        C_Pol["Policy"]
        C_Claim["Claim"]
        C_Cust -- "has" --> C_Pol
        C_Pol -- "covers" --> C_Claim
    end

    subgraph Logical ["Logical (Architecture)"]
        L_Cust["Customer\ncustomer_id\nname\nemail"]
        L_Pol["Policy\npolicy_id\ncustomer_id (FK)\nstart_date"]
        L_Cust -- "1:N" --> L_Pol
    end

    subgraph Physical ["Physical (Platform)\nAudience: DBAs, Data Engineers"]
        P_Claims["CLAIMS.CLAIMS\nCLAIM_ID NUMBER(12)\nPOLICY_ID NUMBER(12)\nAMOUNT DECIMAL\nSTATUS VARCHAR(20)\nIDX: ix_claim_policy\nPART: by CLAIM_DATE"]
    end

Normalization Analysis: 1NF through 5NF #

The normalization_analysis declaration evaluates a schema against normal forms and reports violations:

NEAM
normalization_analysis InsuranceHealth {
    schema_source: InsuranceDW,
    target: "BCNF",
    analyze: {
        first_normal_form: true,
        second_normal_form: true,
        third_normal_form: true,
        boyce_codd_normal_form: true,
        fourth_normal_form: true,
        fifth_normal_form: false
    },
    report: {
        violations: true,
        recommendations: true,
        impact_assessment: true,
        estimated_effort: true
    },
    exclusions: {
        tables: ["*_STAGING", "*_TEMP"],
        schemas: ["SANDBOX"]
    }
}

The analysis produces a report like:

Normalization Report: INSURANCE_PROD
| Schema | Table | 1NF | 2NF | 3NF | BCNF | Issue |
|---|---|---|---|---|---|---|
| CLAIMS | CLAIMS | OK | OK | OK | OK | — |
| CLAIMS | CLAIMS_SUMMARY | OK | FAIL | — | — | Partial dep: adjuster name depends on adjuster_id, not full composite key |
| POLICY | POLICY_DETAILS | OK | OK | FAIL | — | Transitive dep: agent_office depends on agent_id, not policy_id |
| CUSTOMER | CUSTOMER_360 | FAIL | — | — | — | Repeating groups: phone1, phone2, phone3 columns |

Summary: 847 tables analyzed

  • 812 (95.9%) pass 3NF
  • 28 (3.3%) have 2NF violations
  • 7 (0.8%) have 1NF violations (repeating groups)

Recommendation: 3 tables require immediate attention (CLAIMS_SUMMARY, POLICY_DETAILS, CUSTOMER_360)
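The CUSTOMER_360 finding above illustrates how mechanical some of these checks are. A Python sketch of repeating-group detection (the column names and regex are illustrative assumptions, not the agent's actual rule set):

```python
import re
from collections import defaultdict

def find_repeating_groups(columns):
    """1NF sketch: flag numbered column families like phone1, phone2,
    phone3 -- the repeating-group pattern found in CUSTOMER_360."""
    groups = defaultdict(list)
    for col in columns:
        m = re.fullmatch(r"([a-z_]+?)_?(\d+)", col)
        if m:
            groups[m.group(1)].append(col)
    # Only families with more than one member are repeating groups
    return {base: cols for base, cols in groups.items() if len(cols) > 1}

violations = find_repeating_groups(
    ["customer_id", "name", "phone1", "phone2", "phone3",
     "address_line1", "address_line2"]
)
```

Detecting partial and transitive dependencies (the 2NF and 3NF checks) additionally requires profiling functional dependencies in the data, not just the column names.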

Anti-Pattern

Do not normalize everything to 5NF. Analytics workloads benefit from controlled denormalization. The Modeling Agent's normalization analysis is diagnostic, not prescriptive — it tells you where normalization violations exist so you can decide which ones matter. A denormalized fact table in a star schema is intentional. A denormalized OLTP table is a defect.


Dimensional Design: Star, Snowflake, Data Vault #

The dimensional_model declaration generates analytical schemas:

NEAM
dimensional_model ClaimsAnalytics {
    methodology: "star",
    grain: "one row per claim event",
    source_model: InsuranceLogical,

    facts: {
        fact_claims: {
            measures: ["claim_amount", "paid_amount", "reserve_amount"],
            degenerate_dimensions: ["claim_number"],
            partitioned_by: "claim_date",
            scd_type: "none"
        }
    },

    dimensions: {
        dim_customer: {
            source_entity: "Customer",
            scd_type: 2,
            tracked_columns: ["address", "risk_score", "segment"],
            business_key: "customer_number"
        },
        dim_policy: {
            source_entity: "Policy",
            scd_type: 1,
            business_key: "policy_number"
        },
        dim_date: {
            type: "date_dimension",
            range: { start: "2020-01-01", end: "2030-12-31" },
            fiscal_year_start: "April"
        },
        dim_coverage: {
            source_entity: "CoverageType",
            scd_type: 0,
            business_key: "coverage_code"
        }
    }
}
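The scd_type: 2 declaration on dim_customer generates standard expire-and-insert merge logic. A minimal Python sketch of that behavior, assuming is_current / valid_from / valid_to housekeeping columns (a common convention, though column names vary by shop):

```python
from datetime import date

def scd2_apply(dim_rows, incoming, tracked, business_key, today):
    """SCD Type 2 sketch: expire the current row and insert a new
    version when any tracked column changes (as dim_customer declares
    for address, risk_score, and segment)."""
    current = {r[business_key]: r for r in dim_rows if r["is_current"]}
    out = list(dim_rows)
    for row in incoming:
        old = current.get(row[business_key])
        if old and any(old[c] != row[c] for c in tracked):
            old["is_current"] = False           # expire the old version
            old["valid_to"] = today
            out.append({**row, "is_current": True, "valid_from": today, "valid_to": None})
        elif old is None:                       # brand-new business key
            out.append({**row, "is_current": True, "valid_from": today, "valid_to": None})
    return out

dim = [{"customer_number": "C1", "segment": "retail", "is_current": True,
        "valid_from": date(2024, 1, 1), "valid_to": None}]
new = [{"customer_number": "C1", "segment": "premium"}]
result = scd2_apply(dim, new, ["segment"], "customer_number", date(2025, 6, 1))
```

After the merge, the history table holds both versions: the retail row closed out on the change date, and a new current premium row.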

The Modeling Agent supports four dimensional methodologies:

| Methodology | Description | Best For |
|---|---|---|
| Star Schema | Fact table surrounded by denormalized dimensions | Simple queries, fast aggregations, BI tools |
| Snowflake Schema | Dimensions normalized into sub-dimensions | Storage efficiency, complex hierarchies |
| Starflake | Hybrid — some dimensions normalized, some not | Pragmatic balance of star and snowflake |
| Data Vault | Hub (business keys), Link (relationships), Satellite (descriptive) | Auditability, agile loading, historical tracking |

DIAGRAM Star Schema vs Data Vault
flowchart LR
    subgraph Star ["Star Schema"]
        direction TB
        dim_customer["dim_customer"]
        fact_claims["fact_claims"]
        dim_policy["dim_policy"]
        dim_date["dim_date"]
        dim_customer --- fact_claims
        fact_claims --- dim_policy
        fact_claims --- dim_date
    end

    subgraph Vault ["Data Vault"]
        direction TB
        HUB_CLAIM["HUB_CLAIM\nclaim_bk"]
        LNK_CLAIM_POLICY["LNK_CLAIM_POLICY"]
        HUB_POLICY["HUB_POLICY\npolicy_bk"]
        SAT_CLAIM["SAT_CLAIM\namount\nstatus\nload_date"]
        HUB_CLAIM --- LNK_CLAIM_POLICY
        LNK_CLAIM_POLICY --- HUB_POLICY
        LNK_CLAIM_POLICY --- SAT_CLAIM
    end
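The claim_bk and policy_bk labels in the vault diagram are business keys; hubs and links are commonly keyed by a hash of the normalized business key so the same entity hashes identically across loads. A sketch of that convention (MD5 and the `||` separator are typical Data Vault 2.0 choices, but conventions vary by shop):

```python
import hashlib

def hub_hash_key(*business_keys):
    """Data Vault sketch: surrogate key = hash of the normalized
    business key(s). Links hash the concatenation of their hubs' keys."""
    normalized = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

claim_hk = hub_hash_key("CLM-001")             # HUB_CLAIM key
link_hk = hub_hash_key("CLM-001", "POL-42")    # LNK_CLAIM_POLICY key
```

Because the key is deterministic, loads are idempotent and parallelizable: any loader can compute the key without a sequence or a lookup.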

DataMart Generation #

The datamart declaration generates purpose-specific schemas from dimensional models:

NEAM
datamart ClaimsMart {
    source_model: ClaimsAnalytics,
    purpose: "Claims analytics for underwriting team",
    target_platform: "snowflake",
    target_schema: "CLAIMS_MART",

    materialization: {
        strategy: "incremental",
        refresh_schedule: "0 6 * * *",
        incremental_key: "claim_date",
        lookback_window: "3d"
    },

    pre_aggregations: [
        {
            name: "monthly_claims_summary",
            dimensions: ["dim_coverage.coverage_type", "dim_customer.segment"],
            measures: ["SUM(claim_amount)", "COUNT(*)", "AVG(paid_amount)"],
            grain: "monthly"
        }
    ],

    access_policy: UnderwritingAccess
}
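The incremental strategy with a lookback_window reprocesses a trailing slice of the data on every refresh, so late-arriving claims still land in the mart. A Python sketch of the window arithmetic (the function and WHERE clause are illustrative, not generated NEAM output):

```python
from datetime import date, timedelta

def incremental_window(last_loaded, lookback_days=3):
    """ClaimsMart sketch: reprocess rows whose incremental_key
    (claim_date) falls inside the lookback window, so late-arriving
    claims are picked up on the next 6 AM refresh."""
    start = last_loaded - timedelta(days=lookback_days)
    return start, last_loaded

start, end = incremental_window(date(2025, 6, 10))
# Would translate to: WHERE claim_date >= :start AND claim_date <= :end
```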

Schema Amendments with Impact Analysis #

The amendment declaration proposes schema changes with full impact analysis before execution:

NEAM
amendment_config ChangeManagement {
    approval: {
        required_approvers: 2,
        approver_roles: ["data_architect", "data_engineer_lead"],
        auto_approve_minor: false
    },
    impact_analysis: {
        check_downstream_queries: true,
        check_downstream_pipelines: true,
        check_downstream_reports: true,
        check_data_contracts: true,
        breaking_change_threshold: "warn"
    },
    rollback: {
        auto_rollback_on_failure: true,
        keep_backup: "7d"
    }
}

amendment AddRiskScore {
    target: "CUSTOMER.CUSTOMERS",
    operation: "add_column",
    column: {
        name: "risk_score",
        type: "DECIMAL(5,2)",
        nullable: true,
        default: null,
        comment: "Underwriting risk score (0.00-99.99)"
    },
    config: ChangeManagement,
    justification: "Required for claims prediction model feature engineering"
}

The impact analysis report:

Amendment Impact Analysis: AddRiskScore

Target: CUSTOMER.CUSTOMERS.risk_score (new column)

Downstream Impact

| Category | Count | Details |
|---|---|---|
| Pipelines | 3 affected | customer_daily_load, customer_360, crm_sync |
| Queries | 12 affected | SELECT * queries will return new column |
| Reports | 0 affected | No reports reference CUSTOMERS directly |
| Contracts | 1 affected | customer_data_product needs version bump |

Breaking Changes: NONE (additive change, nullable column)

Recommendation: SAFE TO PROCEED

  • Update customer_data_product contract version 2.1 → 2.2
  • Notify downstream pipeline owners (auto-sent)

Estimated Effort: 0.5 person-hours (contract update only)
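The "NONE (additive change, nullable column)" verdict follows a simple rule: adding a nullable column (or one with a default) never breaks existing readers, while drops, renames, type changes, and NOT NULL additions without defaults do. A Python sketch of that classification (the operation names and rule set are illustrative assumptions):

```python
def classify_change(operation, column=None):
    """Impact-analysis sketch: the rule that makes AddRiskScore 'SAFE TO
    PROCEED'. Additive changes are non-breaking; anything that removes
    or reshapes an existing column is breaking."""
    if operation == "add_column":
        if column.get("nullable") or column.get("default") is not None:
            return "additive"
        return "breaking"            # NOT NULL with no default breaks inserts
    if operation in ("drop_column", "rename_column", "change_type"):
        return "breaking"
    return "unknown"

verdict = classify_change("add_column", {"name": "risk_score",
                                         "type": "DECIMAL(5,2)",
                                         "nullable": True, "default": None})
```

One caveat the report calls out: even an additive change ripples into SELECT * queries and data-contract versions, which is why the contract bump is still recommended.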

Try It

In the DataSims SimShop environment, the Modeling Agent can reverse-engineer all 164 tables in the simshop database, identify the star schema in the analytics layer, flag the 3 normalization violations in the staging tables, and generate a Data Vault model from the raw sources. Run neam-agents/programs/simshop_churn.neam to see the full pipeline.


External Tool Connectors #

The modeling_tool declaration synchronizes with external modeling tools:

NEAM
modeling_tool ErwinSync {
    type: "erwin",
    connection: env("ERWIN_API_URL"),
    credentials: vault("erwin/prod/api_key"),
    sync_mode: "bidirectional",
    sync_interval: "1h",
    model_mapping: {
        physical: "INSURANCE_PHYSICAL_v3",
        logical: "INSURANCE_LOGICAL_v3"
    }
}

modeling_tool PowerDesignerSync {
    type: "powerdesigner",
    connection: env("PD_REPO_URL"),
    credentials: vault("pd/prod/creds"),
    sync_mode: "push",
    sync_interval: "6h"
}

| Tool | Type Value | Capabilities | Use Case |
|---|---|---|---|
| Erwin Data Modeler | "erwin" | Physical/logical/conceptual models, forward/reverse engineering | Enterprise data architecture |
| ER/Studio | "erstudio" | Team-based modeling, repository management | Collaborative modeling |
| SAP PowerDesigner | "powerdesigner" | Conceptual, logical, physical, dimensional models | SAP ecosystem modeling |
| dbt | "dbt" | Model definitions, ref() graph, documentation | Analytics engineering |

Neam Code: A Complete Modeling Agent #

NEAM
// ═══════════════════════════════════════════════════════════════
// Modeling Agent — Digital Data Architect for Insurance
// ═══════════════════════════════════════════════════════════════

budget ModelBudget { cost: 20.00, tokens: 200000, time: 3600000 }

modeling agent Architect {
    provider: "anthropic",
    model: "claude-sonnet-4-6",
    system: "You are a data modeling agent. Reverse-engineer schemas, evaluate normalization, and design dimensional models.",
    temperature: 0.2,

    schema_sources: [InsuranceDW],
    er_models: [InsurancePhysical, InsuranceLogical, InsuranceConceptual],
    dimensional_models: [ClaimsAnalytics],
    datamarts: [ClaimsMart],
    normalization: [InsuranceHealth],
    amendments: [AddRiskScore],
    modeling_tools: [ErwinSync],

    amendment_config: ChangeManagement,

    budget: ModelBudget,
    agent_md: "./agents/data_architect.md"
}

// ─── Operational Commands ───
let physical = Architect.reverse_engineer(InsuranceDW)
let norm_report = Architect.analyze_normalization(InsuranceDW)
let star = Architect.generate_dimensional("star", InsuranceLogical)
let impact = Architect.assess_impact(AddRiskScore)

Industry Perspective #

Kimball vs. Inmon vs. Data Vault #

The Modeling Agent does not pick sides in the Kimball-Inmon-Data Vault debate. It supports all three, and the choice depends on organizational context:

| Methodology | Strengths | Best For | Modeling Agent Support |
|---|---|---|---|
| Kimball (star/snowflake) | Simple queries, fast BI | Departmental analytics, known query patterns | methodology: "star" or "snowflake" |
| Inmon (3NF enterprise) | Single source of truth, normalized | Enterprise-wide integration | normalize_to: "3NF" on er_model |
| Data Vault (hub/link/sat) | Auditability, agile loading, historization | Regulated industries, agile data warehousing | methodology: "data_vault" |
| Starflake (hybrid) | Pragmatic blend | Organizations with mixed needs | methodology: "starflake" |

Schema Drift: The Silent Architecture Erosion #

Industry surveys show that the average enterprise data warehouse experiences 23 schema changes per month. Without governance, this leads to "schema drift" — the gradual divergence of the actual schema from the documented schema. The Modeling Agent's continuous reverse-engineering closes this gap by treating the live INFORMATION_SCHEMA as the source of truth and flagging deviations from the modeled state.
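Drift detection itself is a diff between the documented model and the live catalog. A Python sketch of that comparison (the dict-of-columns representation is a simplification; real drift checks also compare types, constraints, and indexes):

```python
def diff_schemas(documented, live):
    """Schema-drift sketch: diff the modeled state against the live
    catalog, which the agent's continuous reverse-engineering treats
    as the source of truth."""
    drift = []
    for table, cols in live.items():
        if table not in documented:
            drift.append(("undocumented_table", table))
            continue
        doc_cols = documented[table]
        for col in cols:
            if col not in doc_cols:
                drift.append(("undocumented_column", f"{table}.{col}"))
        for col in doc_cols:
            if col not in cols:
                drift.append(("missing_column", f"{table}.{col}"))
    return drift

drift = diff_schemas(
    {"CLAIMS": ["CLAIM_ID", "POLICY_ID", "AMOUNT"]},
    {"CLAIMS": ["CLAIM_ID", "POLICY_ID", "AMOUNT", "RISK_FLAG"],
     "CLAIMS_TMP": ["X"]},
)
```

Each finding then becomes either an amendment proposal (update the model) or a flag on the change that bypassed governance.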


The Evidence #

DataSims experiments (DataSims repository) demonstrate the Modeling Agent's impact:

| Metric | Without Modeling Agent | With Modeling Agent | Improvement |
|---|---|---|---|
| Schema Understanding Time | 2.3 weeks (manual) | 12 minutes (auto) | 99.4% |
| Normalization Violations Found | 1.2 per audit (sample) | 7.4 per scan (exhaustive) | 6.2x |
| Dimensional Model Design Time | 3.1 days | 45 minutes | 95.5% |
| Schema Amendment Errors | 2.1 per quarter (no impact analysis) | 0.1 per quarter (with impact analysis) | 95.2% |

Ablation A5 (Modeling Agent removed) in the churn prediction experiment showed that feature engineering took 3.4x longer because the data scientist had to manually discover table relationships, identify the correct grain for joins, and reverse-engineer SCD patterns. The Modeling Agent's domain context — shared with the DataScientist Agent — eliminated this exploratory overhead entirely.


Key Takeaways #

For Further Exploration #