Chapter 14: The Modeling Agent — Architecture Intelligence #
"All models are wrong, but some are useful." — George E.P. Box, statistician
20 min read | Priya, Marcus | Part IV: Platform Intelligence Agents
What you'll learn:
- How the Modeling Agent reverse-engineers existing schemas from INFORMATION_SCHEMA and catalog metadata
- How to build ER models at three levels — physical, logical, and conceptual
- How normalization analysis evaluates 1NF through 5NF/BCNF and flags violations
- How dimensional design generates star, snowflake, starflake, and Data Vault schemas
- How schema amendments are proposed with impact analysis before execution
- How external tool connectors synchronize with Erwin, ER/Studio, PowerDesigner, and dbt
The Problem #
Marcus joins a data science team at an insurance company. He is given access to a Snowflake account with 847 tables across 14 schemas. There is no data dictionary. There is no ER diagram. The last data architect left nine months ago, and her whiteboard diagrams were erased for a sprint planning session.
Marcus needs to build a claims prediction model. He needs to understand: Which tables contain claims data? What are the relationships between CLAIMS, POLICY_HOLDERS, POLICIES, COVERAGE_TYPES, and ADJUSTERS? Is CLAIMS.policy_id a foreign key to POLICIES.id or POLICIES.policy_number? Are the tables normalized or denormalized? Is there a dimensional model he should query, or is everything in 3NF?
He spends two weeks writing exploratory SQL, reverse-engineering relationships by joining on candidate keys and checking cardinality. He draws diagrams in Lucidchart. He discovers that CLAIMS_SUMMARY is a denormalized view that duplicates data from four source tables — but it is two months stale because the ETL job that refreshes it was disabled after a schema change.
The Modeling Agent does in minutes what Marcus did in weeks. It connects to INFORMATION_SCHEMA, reverse-engineers the physical schema, infers relationships from foreign keys and naming conventions, evaluates normalization health, and produces ER models at physical, logical, and conceptual levels. When Marcus needs a star schema for claims analytics, the agent designs it — with fact and dimension tables, SCD types, and grain definitions.
Schema Reverse-Engineering #
The schema_source declaration tells the Modeling Agent where to discover existing schemas:
schema_source InsuranceDW {
type: "snowflake",
connection: env("SNOWFLAKE_ACCOUNT"),
credentials: vault("snowflake/prod/creds"),
databases: ["INSURANCE_PROD"],
schemas: ["CLAIMS", "POLICY", "CUSTOMER", "REFERENCE"],
include_views: true,
include_constraints: true,
include_statistics: true,
refresh_interval: "6h"
}
The reverse-engineering process works in four phases:
flowchart TD
P1["Phase 1: Metadata Extraction"]
P1SRC["INFORMATION_SCHEMA.TABLES\nINFORMATION_SCHEMA.COLUMNS\nINFORMATION_SCHEMA.TABLE_CONSTRAINTS\nINFORMATION_SCHEMA.KEY_COLUMN_USAGE\nINFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS\nPlatform-specific statistics views"]
P2["Phase 2: Relationship Inference"]
P2SRC["Explicit: Foreign key constraints\nNaming: column_name matches (e.g., claims.policy_id → policies.id)\nCardinality: JOIN result analysis\nLLM: Semantic relationship guess (confidence-scored)"]
P3["Phase 3: Pattern Recognition"]
P3SRC["Identify: fact tables, dimension tables,\nbridge tables, audit columns,\nSCD patterns, junction tables"]
P4["Phase 4: Model Generation"]
P4SRC["Output: Physical ER model (as-is)\nLogical ER model (normalized)\nConceptual ER model (business)"]
P1 --> P1SRC --> P2 --> P2SRC --> P3 --> P3SRC --> P4 --> P4SRC
Most databases have incomplete foreign key constraints. In the DataSims SimShop environment, only 64% of actual relationships are expressed as explicit foreign keys. The remaining 36% are inferred through naming conventions and cardinality analysis. The LLM adds semantic inference (e.g., recognizing that adj_id in the claims table refers to ADJUSTERS.adjuster_id despite the abbreviated column name).
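The naming-convention pass can be sketched in a few lines of Python. The metadata, the `<entity>_id` pattern, and the naive pluralization below are illustrative assumptions, not the agent's actual heuristics:

```python
import re

# Illustrative (table, column) metadata, as read from INFORMATION_SCHEMA.COLUMNS
columns = [
    ("claims", "claim_id"), ("claims", "policy_id"), ("claims", "adj_id"),
    ("policies", "id"), ("policies", "policy_number"),
    ("adjusters", "adjuster_id"),
]
tables = {t for t, _ in columns}

def infer_by_naming(columns, tables):
    """Guess FK candidates: a column '<entity>_id' pointing at another table
    named '<entity>' or its naive plural. Columns that resolve only to their
    own table (claims.claim_id) are treated as primary keys, not FKs."""
    candidates = []
    for table, col in columns:
        m = re.fullmatch(r"(\w+)_id", col)
        if not m:
            continue
        entity = m.group(1)
        targets = [entity, entity + "s"]
        if entity.endswith("y"):                  # policy -> policies
            targets.append(entity[:-1] + "ies")
        for target in targets:
            if target in tables and target != table:
                candidates.append((f"{table}.{col}", target))
                break
    return candidates

# claims.adj_id is missed here: no pattern resolves the abbreviation, which
# is exactly the gap cardinality analysis and LLM semantic inference fill
print(infer_by_naming(columns, tables))  # [('claims.policy_id', 'policies')]
```

A real implementation would then confirm each candidate by joining on the pair and checking cardinality, as phase 2 of the flowchart indicates.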
ER Modeling: Physical, Logical, and Conceptual #
The er_model declaration creates entity-relationship models at three levels of abstraction:
er_model InsurancePhysical {
level: "physical",
schema_source: InsuranceDW,
include_data_types: true,
include_indexes: true,
include_partitions: true
}
er_model InsuranceLogical {
level: "logical",
schema_source: InsuranceDW,
normalize_to: "3NF",
resolve_denormalization: true,
include_cardinality: true
}
er_model InsuranceConceptual {
level: "conceptual",
schema_source: InsuranceDW,
business_domains: ["claims", "underwriting", "customer"],
glossary: BusinessGlossary,
include_business_rules: true
}
The three levels serve different audiences:
flowchart LR
subgraph Conceptual ["Conceptual (Business)\nAudience: Executives, BAs"]
C_Cust["Customer"]
C_Pol["Policy"]
C_Claim["Claim"]
C_Cust -- "has" --> C_Pol
C_Pol -- "covers" --> C_Claim
end
subgraph Logical ["Logical (Architecture)\nAudience: Data Architects"]

L_Cust["Customer\ncustomer_id\nname\nemail"]
L_Pol["Policy\npolicy_id\ncustomer_id (FK)\nstart_date"]
L_Cust -- "1:N" --> L_Pol
end
subgraph Physical ["Physical (Platform)\nAudience: DBAs, Data Engineers"]
P_Claims["CLAIMS.CLAIMS\nCLAIM_ID NUMBER(12)\nPOLICY_ID NUMBER(12)\nAMOUNT DECIMAL\nSTATUS VARCHAR(20)\nIDX: ix_claim_policy\nPART: by CLAIM_DATE"]
end
Normalization Analysis: 1NF through 5NF #
The normalization_analysis declaration evaluates a schema against normal forms and reports violations:
normalization_analysis InsuranceHealth {
schema_source: InsuranceDW,
target: "BCNF",
analyze: {
first_normal_form: true,
second_normal_form: true,
third_normal_form: true,
boyce_codd_normal_form: true,
fourth_normal_form: true,
fifth_normal_form: false
},
report: {
violations: true,
recommendations: true,
impact_assessment: true,
estimated_effort: true
},
exclusions: {
tables: ["*_STAGING", "*_TEMP"],
schemas: ["SANDBOX"]
}
}
The analysis produces a report like:
| Schema | Table | 1NF | 2NF | 3NF | BCNF | Issue |
|---|---|---|---|---|---|---|
| CLAIMS | CLAIMS | OK | OK | OK | OK | — |
| CLAIMS | CLAIMS_SUMMARY | OK | FAIL | — | — | Partial dep: adjuster name depends on adjuster_id, not full composite key |
| POLICY | POLICY_DETAILS | OK | OK | FAIL | — | Transitive dep: agent_office depends on agent_id, not policy_id |
| CUSTOMER | CUSTOMER_360 | FAIL | — | — | — | Repeating groups: phone1, phone2, phone3 columns |
Summary: 847 tables analyzed
- 812 (95.9%) pass 3NF
- 28 (3.3%) have 2NF violations
- 7 (0.8%) have 1NF violations (repeating groups)
Recommendation: 3 tables require immediate attention (CLAIMS_SUMMARY, POLICY_DETAILS, CUSTOMER_360)
Do not normalize everything to 5NF. Analytics workloads benefit from controlled denormalization. The Modeling Agent's normalization analysis is diagnostic, not prescriptive — it tells you where normalization violations exist so you can decide which ones matter. A denormalized fact table in a star schema is intentional. A denormalized OLTP table is a defect.
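To make the 2NF check concrete, here is a toy data-driven detector for partial dependencies, in the spirit of the CLAIMS_SUMMARY finding above. The sample rows, column names, and brute-force subset search are illustrative; a production analyzer would mine functional dependencies from catalog statistics rather than scan raw rows:

```python
from itertools import combinations

# Toy rows shaped like CLAIMS_SUMMARY: composite key (claim_id, adjuster_id),
# with adjuster_name copied into every row (column names are made up)
rows = [
    {"claim_id": 1, "adjuster_id": 10, "adjuster_name": "Ruiz",   "paid": 500},
    {"claim_id": 1, "adjuster_id": 11, "adjuster_name": "Okafor", "paid": 300},
    {"claim_id": 2, "adjuster_id": 10, "adjuster_name": "Ruiz",   "paid": 750},
]

def determines(rows, lhs, attr):
    """True if the columns in lhs functionally determine attr in this sample."""
    seen = {}
    for r in rows:
        key = tuple(r[c] for c in lhs)
        if key in seen and seen[key] != r[attr]:
            return False
        seen[key] = r[attr]
    return True

def partial_dependencies(rows, composite_key, attrs):
    """2NF check: flag non-key attrs determined by a proper subset of the key."""
    found = []
    for attr in attrs:
        for size in range(1, len(composite_key)):
            for subset in combinations(composite_key, size):
                if determines(rows, subset, attr):
                    found.append((attr, subset))
    return found

print(partial_dependencies(rows, ("claim_id", "adjuster_id"),
                           ["adjuster_name", "paid"]))
# [('adjuster_name', ('adjuster_id',))]  -- the CLAIMS_SUMMARY-style violation
```

Note the caveat with any sample-based check: a dependency that happens to hold in the sample may not hold in general, so findings like this are candidates for review, not proofs.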
Dimensional Design: Star, Snowflake, Data Vault #
The dimensional_model declaration generates analytical schemas:
dimensional_model ClaimsAnalytics {
methodology: "star",
grain: "one row per claim event",
source_model: InsuranceLogical,
facts: {
fact_claims: {
measures: ["claim_amount", "paid_amount", "reserve_amount"],
degenerate_dimensions: ["claim_number"],
partitioned_by: "claim_date",
scd_type: "none"
}
},
dimensions: {
dim_customer: {
source_entity: "Customer",
scd_type: 2,
tracked_columns: ["address", "risk_score", "segment"],
business_key: "customer_number"
},
dim_policy: {
source_entity: "Policy",
scd_type: 1,
business_key: "policy_number"
},
dim_date: {
type: "date_dimension",
range: { start: "2020-01-01", end: "2030-12-31" },
fiscal_year_start: "April"
},
dim_coverage: {
source_entity: "CoverageType",
scd_type: 0,
business_key: "coverage_code"
}
}
}
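The `scd_type: 2` setting on `dim_customer` implies close-and-insert semantics on every tracked-column change. A minimal sketch, assuming an in-memory dimension with `effective_from`/`effective_to` columns (the column names are this sketch's convention, not generated DDL):

```python
from datetime import date

# Current dim_customer rows; one open row per business key
# (effective_to=None means "current version")
dim = [
    {"sk": 1, "customer_number": "C-100", "segment": "standard",
     "effective_from": date(2023, 1, 1), "effective_to": None},
]

def apply_scd2(dim, business_key, incoming, tracked, as_of):
    """Close the open row and insert a new one when a tracked column changes."""
    current = next(r for r in dim
                   if r["customer_number"] == business_key
                   and r["effective_to"] is None)
    if all(current[c] == incoming[c] for c in tracked):
        return                                   # no change, nothing to do
    current["effective_to"] = as_of              # close the old version
    dim.append({"sk": max(r["sk"] for r in dim) + 1,
                "customer_number": business_key,
                **{c: incoming[c] for c in tracked},
                "effective_from": as_of, "effective_to": None})

apply_scd2(dim, "C-100", {"segment": "preferred"}, ["segment"], date(2024, 6, 1))
print(len(dim))                # 2 rows: closed history + new current version
print(dim[0]["effective_to"])  # 2024-06-01
```

Type 1 (`dim_policy`) would overwrite in place instead, and Type 0 (`dim_coverage`) would ignore the incoming change entirely.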
The Modeling Agent supports four dimensional methodologies:
| Methodology | Description | Best For |
|---|---|---|
| Star Schema | Fact table surrounded by denormalized dimensions | Simple queries, fast aggregations, BI tools |
| Snowflake Schema | Dimensions normalized into sub-dimensions | Storage efficiency, complex hierarchies |
| Starflake | Hybrid — some dimensions normalized, some not | Pragmatic balance of star and snowflake |
| Data Vault | Hub (business keys), Link (relationships), Satellite (descriptive) | Auditability, agile loading, historical tracking |
flowchart LR
subgraph Star ["Star Schema"]
direction TB
dim_customer["dim_customer"]
fact_claims["fact_claims"]
dim_policy["dim_policy"]
dim_date["dim_date"]
dim_customer --- fact_claims
fact_claims --- dim_policy
fact_claims --- dim_date
end
subgraph Vault ["Data Vault"]
direction TB
HUB_CLAIM["HUB_CLAIM\nclaim_bk"]
LNK_CLAIM_POLICY["LNK_CLAIM_POLICY"]
HUB_POLICY["HUB_POLICY\npolicy_bk"]
SAT_CLAIM["SAT_CLAIM\namount\nstatus\nload_date"]
HUB_CLAIM --- LNK_CLAIM_POLICY
LNK_CLAIM_POLICY --- HUB_POLICY
LNK_CLAIM_POLICY --- SAT_CLAIM
end
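The hub/link/satellite structure on the right is conventionally keyed by hashes of business keys in Data Vault 2.0. A sketch of that keying, with made-up business keys and a normalization rule chosen for illustration:

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys):
    """DV2-style surrogate: MD5 of the trimmed, upper-cased, delimited key(s)."""
    normalized = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode()).hexdigest()

# Hub rows carry only the business key and its hash
hub_claim  = {"claim_hk": hash_key("CLM-2024-0001"), "claim_bk": "CLM-2024-0001"}
hub_policy = {"policy_hk": hash_key("POL-88231"),    "policy_bk": "POL-88231"}

# The link row joins the two hubs via a hash over both business keys
lnk = {"claim_policy_hk": hash_key("CLM-2024-0001", "POL-88231"),
       "claim_hk": hub_claim["claim_hk"], "policy_hk": hub_policy["policy_hk"]}

# Satellite rows hang descriptive attributes off the hub, stamped at load time
sat = {"claim_hk": hub_claim["claim_hk"], "amount": 500.0, "status": "OPEN",
       "load_date": datetime.now(timezone.utc)}
```

Because the keys are deterministic, hubs and links can be loaded in parallel from independent sources without sequence coordination, which is what makes the methodology's "agile loading" claim work.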
DataMart Generation #
The datamart declaration generates purpose-specific schemas from dimensional models:
datamart ClaimsMart {
source_model: ClaimsAnalytics,
purpose: "Claims analytics for underwriting team",
target_platform: "snowflake",
target_schema: "CLAIMS_MART",
materialization: {
strategy: "incremental",
refresh_schedule: "0 6 * * *",
incremental_key: "claim_date",
lookback_window: "3d"
},
pre_aggregations: [
{
name: "monthly_claims_summary",
dimensions: ["dim_coverage.coverage_type", "dim_customer.segment"],
measures: ["SUM(claim_amount)", "COUNT(*)", "AVG(paid_amount)"],
grain: "monthly"
}
],
access_policy: UnderwritingAccess
}
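The `incremental` strategy with a `lookback_window` of `3d` amounts to recomputing a sliding window of partitions so that late-arriving claims are picked up. A sketch of the window arithmetic, assuming the lookback is measured back from the newest loaded partition:

```python
from datetime import date, timedelta

def partitions_to_refresh(max_loaded: date, today: date, lookback_days: int):
    """Incremental refresh: reprocess everything from
    (max_loaded - lookback) through today, catching late-arriving rows."""
    start = max_loaded - timedelta(days=lookback_days)
    days = (today - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]

parts = partitions_to_refresh(date(2024, 6, 10), date(2024, 6, 11), 3)
print(parts[0], parts[-1])  # 2024-06-07 2024-06-11
print(len(parts))           # 5
```

The trade-off is standard: a wider lookback tolerates later-arriving data at the cost of reprocessing more partitions every run at 06:00.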
Schema Amendments with Impact Analysis #
The amendment declaration proposes schema changes with full impact analysis before execution:
amendment_config ChangeManagement {
approval: {
required_approvers: 2,
approver_roles: ["data_architect", "data_engineer_lead"],
auto_approve_minor: false
},
impact_analysis: {
check_downstream_queries: true,
check_downstream_pipelines: true,
check_downstream_reports: true,
check_data_contracts: true,
breaking_change_threshold: "warn"
},
rollback: {
auto_rollback_on_failure: true,
keep_backup: "7d"
}
}
amendment AddRiskScore {
target: "CUSTOMER.CUSTOMERS",
operation: "add_column",
column: {
name: "risk_score",
type: "DECIMAL(5,2)",
nullable: true,
default: null,
comment: "Underwriting risk score (0.00-99.99)"
},
config: ChangeManagement,
justification: "Required for claims prediction model feature engineering"
}
The impact analysis report:
Target: CUSTOMER.CUSTOMERS.risk_score (new column)
Downstream Impact
| Category | Count | Details |
|---|---|---|
| Pipelines | 3 affected | customer_daily_load, customer_360, crm_sync |
| Queries | 12 affected | SELECT * queries will return the new column |
| Reports | 0 affected | No reports reference CUSTOMERS directly |
| Contracts | 1 affected | customer_data_product needs version bump |
Breaking Changes: NONE (additive change, nullable column)
Recommendation: SAFE TO PROCEED
- Update customer_data_product contract version 2.1 → 2.2
- Notify downstream pipeline owners (auto-sent)
Estimated Effort: 0.5 person-hours (contract update only)
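The "additive change, nullable column" verdict above follows from simple rules over the amendment's operation and column definition. A toy classifier, with the rule set reduced for illustration (real analysis would also walk the downstream query, pipeline, and contract graph):

```python
def classify_change(operation: str, column: dict) -> str:
    """Toy breaking-change rules for a single column amendment."""
    if operation == "drop_column":
        return "BREAKING"              # removes data consumers may read
    if operation == "add_column":
        if column.get("nullable", True) and column.get("default") is None:
            return "ADDITIVE"          # SELECT * widens, nothing breaks
        return "WARN"                  # NOT NULL backfill or a new default
    if operation == "alter_type":
        return "WARN"                  # depends on cast compatibility
    return "UNKNOWN"

print(classify_change("add_column",
                      {"name": "risk_score", "type": "DECIMAL(5,2)",
                       "nullable": True, "default": None}))  # ADDITIVE
```

Even an additive change still triggers the contract version bump, since consumers pin to a specific schema version.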
In the DataSims SimShop environment, the Modeling Agent can reverse-engineer all 164 tables in the simshop database, identify the star schema in the analytics layer, flag the 3 normalization violations in the staging tables, and generate a Data Vault model from the raw sources. Run neam-agents/programs/simshop_churn.neam to see the full pipeline.
External Tool Connectors #
The modeling_tool declaration synchronizes with external modeling tools:
modeling_tool ErwinSync {
type: "erwin",
connection: env("ERWIN_API_URL"),
credentials: vault("erwin/prod/api_key"),
sync_mode: "bidirectional",
sync_interval: "1h",
model_mapping: {
physical: "INSURANCE_PHYSICAL_v3",
logical: "INSURANCE_LOGICAL_v3"
}
}
modeling_tool PowerDesignerSync {
type: "powerdesigner",
connection: env("PD_REPO_URL"),
credentials: vault("pd/prod/creds"),
sync_mode: "push",
sync_interval: "6h"
}
| Tool | Type Value | Capabilities | Use Case |
|---|---|---|---|
| Erwin Data Modeler | "erwin" | Physical/logical/conceptual models, forward/reverse engineering | Enterprise data architecture |
| ER/Studio | "erstudio" | Team-based modeling, repository management | Collaborative modeling |
| SAP PowerDesigner | "powerdesigner" | Conceptual, logical, physical, dimensional models | SAP ecosystem modeling |
| dbt | "dbt" | Model definitions, ref() graph, documentation | Analytics engineering |
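For the dbt connector, synchronization plausibly means emitting model properties files. A sketch that renders one model into dbt's documented `schema.yml` layout; the shape of the input dict is this sketch's own convention, not a Neam or dbt API:

```python
def to_dbt_schema_yml(model: dict) -> str:
    """Render a minimal dbt schema.yml entry for one model, as plain text."""
    lines = ["version: 2",
             "models:",
             f"  - name: {model['name']}",
             f"    description: \"{model['description']}\"",
             "    columns:"]
    for col in model["columns"]:
        lines.append(f"      - name: {col['name']}")
        lines.append(f"        description: \"{col['description']}\"")
    return "\n".join(lines) + "\n"

fact = {"name": "fact_claims",
        "description": "One row per claim event",
        "columns": [{"name": "claim_id", "description": "Degenerate dimension"},
                    {"name": "claim_amount", "description": "Claimed amount"}]}
print(to_dbt_schema_yml(fact))
```

A `push` sync would write files like this into the dbt project; a `bidirectional` sync (as with Erwin) would also read the `ref()` graph back into the agent's model.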
Neam Code: A Complete Modeling Agent #
// ═══════════════════════════════════════════════════════════════
// Modeling Agent — Digital Data Architect for Insurance
// ═══════════════════════════════════════════════════════════════
budget ModelBudget { cost: 20.00, tokens: 200000, time: 3600000 }
modeling agent Architect {
provider: "anthropic",
model: "claude-sonnet-4-6",
system: "You are a data modeling agent. Reverse-engineer schemas, evaluate normalization, and design dimensional models.",
temperature: 0.2,
schema_sources: [InsuranceDW],
er_models: [InsurancePhysical, InsuranceLogical, InsuranceConceptual],
dimensional_models: [ClaimsAnalytics],
datamarts: [ClaimsMart],
normalization: [InsuranceHealth],
amendments: [AddRiskScore],
modeling_tools: [ErwinSync],
amendment_config: ChangeManagement,
budget: ModelBudget,
agent_md: "./agents/data_architect.md"
}
// ─── Operational Commands ───
let physical = Architect.reverse_engineer(InsuranceDW)
let norm_report = Architect.analyze_normalization(InsuranceDW)
let star = Architect.generate_dimensional("star", InsuranceLogical)
let impact = Architect.assess_impact(AddRiskScore)
Industry Perspective #
Kimball vs. Inmon vs. Data Vault #
The Modeling Agent does not pick sides in the Kimball-Inmon-Data Vault debate. It supports all three, and the choice depends on organizational context:
| Methodology | Strengths | Best For | Modeling Agent Support |
|---|---|---|---|
| Kimball (star/snowflake) | Simple queries, fast BI | Departmental analytics, known query patterns | methodology: "star" or "snowflake" |
| Inmon (3NF enterprise) | Single source of truth, normalized | Enterprise-wide integration | normalize_to: "3NF" on er_model |
| Data Vault (hub/link/sat) | Auditability, agile loading, historization | Regulated industries, agile data warehousing | methodology: "data_vault" |
| Starflake (hybrid) | Pragmatic blend | Organizations with mixed needs | methodology: "starflake" |
Schema Drift: The Silent Architecture Erosion #
Industry surveys show that the average enterprise data warehouse experiences 23 schema changes per month. Without governance, this leads to "schema drift" — the gradual divergence of the actual schema from the documented schema. The Modeling Agent's continuous reverse-engineering closes this gap by treating the live INFORMATION_SCHEMA as the source of truth and flagging deviations from the modeled state.
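Drift detection itself is a diff between the documented model and a fresh snapshot of the live catalog. A minimal column-level sketch (the type strings and table contents are illustrative):

```python
def diff_schemas(documented: dict, live: dict) -> dict:
    """Compare a documented model against a live INFORMATION_SCHEMA snapshot.
    Each input maps column name -> declared type for one table."""
    return {
        "added":   sorted(set(live) - set(documented)),
        "removed": sorted(set(documented) - set(live)),
        "retyped": sorted(c for c in documented.keys() & live.keys()
                          if documented[c] != live[c]),
    }

documented = {"claim_id": "NUMBER(12)", "amount": "DECIMAL(12,2)",
              "status": "VARCHAR(20)"}
live       = {"claim_id": "NUMBER(12)", "amount": "DECIMAL(14,2)",
              "status": "VARCHAR(20)", "risk_flag": "BOOLEAN"}

print(diff_schemas(documented, live))
# {'added': ['risk_flag'], 'removed': [], 'retyped': ['amount']}
```

Run on the `refresh_interval: "6h"` cadence from the `schema_source` declaration, a diff like this is what turns silent drift into an actionable deviation report.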
The Evidence #
DataSims experiments (DataSims repository) demonstrate the Modeling Agent's impact:
| Metric | Without Modeling Agent | With Modeling Agent | Improvement |
|---|---|---|---|
| Schema Understanding Time | 2.3 weeks (manual) | 12 minutes (auto) | 99.4% |
| Normalization Violations Found | 1.2 per audit (sample) | 7.4 per scan (exhaustive) | 6.2x |
| Dimensional Model Design Time | 3.1 days | 45 minutes | 95.5% |
| Schema Amendment Errors | 2.1 per quarter (no impact analysis) | 0.1 per quarter (with impact analysis) | 95.2% |
Ablation A5 (Modeling Agent removed) in the churn prediction experiment showed that feature engineering took 3.4x longer because the data scientist had to manually discover table relationships, identify the correct grain for joins, and reverse-engineer SCD patterns. The Modeling Agent's domain context — shared with the DataScientist Agent — eliminated this exploratory overhead entirely.
Key Takeaways #
- The Modeling Agent reverse-engineers existing schemas from INFORMATION_SCHEMA, inferring relationships even when foreign key constraints are missing (naming convention + cardinality analysis + LLM semantic inference)
- Three levels of ER modeling (physical, logical, conceptual) serve different audiences: DBAs, architects, and business stakeholders
- Normalization analysis (1NF-5NF/BCNF) is diagnostic, not prescriptive — it tells you where violations exist so you can decide which ones are intentional
- Dimensional design supports star, snowflake, starflake, and Data Vault methodologies — the choice depends on organizational context, not dogma
- Schema amendments include full impact analysis before execution: downstream queries, pipelines, reports, and data contracts
- The Modeling Agent designs the blueprint — then hands off to ETLAgent for pipeline generation and GovernanceAgent for policy enforcement
For Further Exploration #
- Neam Language Reference: Modeling Agent
- DataSims: Simulated Enterprise Environment — 164 tables across 12 schemas for modeling exercises
- The Data Warehouse Toolkit (Kimball & Ross, 3rd edition, 2013) — the definitive dimensional modeling reference
- Building a Scalable Data Warehouse with Data Vault 2.0 (Linstedt & Olschimke, 2015)
- Database Design for Mere Mortals (Hernandez, 4th edition, 2020) — ER modeling fundamentals