Chapter 11: Infrastructure Profiles — One Program, Ten Platforms #
"Any sufficiently advanced abstraction is indistinguishable from portability." — adapted from Arthur C. Clarke
Reading time: 20 min | Personas: Priya (Data Engineer), David (VP of Data), Sarah (ML Engineer) | Part III: Data Infrastructure Agents
What You'll Learn #
- How the `infrastructure_profile` declaration abstracts platform details from agent logic
- How the same `.neam` program runs on Snowflake, BigQuery, Databricks, Redshift, Oracle, PostgreSQL, Teradata, Microsoft Fabric, Bigtable, and standalone Spark
- The platform capability matrix: which platforms support SQL pushdown, ML-in-DB, feature stores, streaming, and governance
- How agents adapt their execution strategy based on the active profile
- A complete example: the same churn prediction program running against a PostgreSQL profile and a Snowflake profile
The Problem: "We Rewrote Everything When We Switched to Snowflake" #
Priya's company spent 18 months building a data platform on Databricks. Spark jobs, Delta Lake tables, Unity Catalog integration, MLflow for model tracking — the full stack. Then the CFO mandated a switch to Snowflake for cost reasons. The data engineering team estimated 3 months to migrate. It took 9.
The problem was not the data migration (Chapter 10 handles that). The problem was that every pipeline, every agent configuration, every SQL query, and every ML integration was written against Databricks-specific APIs. Spark SQL syntax. Delta Lake MERGE semantics. Unity Catalog's Python SDK. MLflow's experiment tracking API. Changing the platform meant rewriting the application.
This is the vendor lock-in tax. It is not hypothetical — it is the single largest hidden cost in enterprise data infrastructure. According to Flexera's 2024 State of the Cloud report, organizations running multi-cloud strategies spend 30% more on integration than those locked into a single platform. But single-platform strategies carry the risk Priya experienced: when the platform changes, everything changes.
The Infrastructure Profile solves this by placing a declarative abstraction layer between agent logic and platform implementation. Agents write against abstract interfaces. The profile maps those interfaces to platform-specific implementations. Changing the platform means changing one declaration — not rewriting the application.
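In conventional code, the same idea is programming against an interface. A minimal Python sketch of the dispatch pattern (class and method names here are illustrative, not Neam APIs):

```python
# Illustrative sketch: agents call an abstract interface; the profile picks
# the concrete platform implementation at configuration time.
from typing import Protocol


class Warehouse(Protocol):
    def execute_sql(self, query: str) -> str: ...


class SnowflakeWarehouse:
    def execute_sql(self, query: str) -> str:
        return f"[snowflake] {query}"


class PostgresWarehouse:
    def execute_sql(self, query: str) -> str:
        return f"[postgres] {query}"


# The "profile" is declarative data; only this lookup is platform-aware.
PLATFORMS = {"snowflake": SnowflakeWarehouse, "postgres": PostgresWarehouse}


def warehouse_from_profile(profile: dict) -> Warehouse:
    return PLATFORMS[profile["data_warehouse"]["platform"]]()


wh = warehouse_from_profile({"data_warehouse": {"platform": "postgres"}})
print(wh.execute_sql("SELECT 1"))  # → [postgres] SELECT 1
```

Swapping platforms touches only the dictionary passed to `warehouse_from_profile`, never the caller, which is exactly the contract the profile enforces.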
The Infrastructure Profile #
An infrastructure profile declares the platform-specific configuration for data warehousing, data science, governance, and streaming in a single block. Agents reference the profile; they never reference platform APIs directly.
infrastructure_profile ProductionInfra {
data_warehouse: {
platform: "snowflake",
connection: env("SNOWFLAKE_URL"),
warehouse: "ANALYTICS_WH",
database: "PRODUCTION",
schemas: ["raw", "staging", "analytics", "ml_features", "ml_predictions"]
},
data_science: {
mlflow: { uri: env("MLFLOW_TRACKING_URI") },
feature_store: { type: "feast", uri: env("FEAST_SERVER_URI") },
compute: { local: true, gpu: false }
},
governance: {
catalog: { type: "unity", uri: env("UNITY_CATALOG_URI") },
regulations: ["GDPR", "CCPA"],
pii_columns: ["email", "phone", "date_of_birth", "first_name", "last_name"]
},
streaming: {
platform: "snowpipe",
connection: env("SNOWPIPE_URL")
}
}
What the Profile Abstracts #
flowchart TB
A["AGENT LAYER (Platform-Agnostic)\n\nAgents write against ABSTRACT interfaces:\nexecute_sql(query) — dispatched to target dialect\ntrain_model(config) — dispatched to target ML platform\nstore_artifact(model) — dispatched to target registry\nmonitor_drift(config) — dispatched to target monitoring\nregister_table(meta) — dispatched to target catalog\ningest_stream(source) — dispatched to target streaming"]
A --> B["INFRASTRUCTURE PROFILE\n\nSelected at configuration time — maps abstract\noperations to platform-specific implementations"]
B --> C1["Snowflake"]
B --> C2["BigQuery"]
B --> C3["Databricks"]
B --> C4["Redshift"]
B --> C5["Oracle"]
B --> C6["PostgreSQL"]
B --> C7["Teradata"]
B --> C8["Microsoft\nFabric"]
The key insight: same .neam program, same agent logic, different platform. Change only the infrastructure_profile declaration.
Platform Capability Matrix #
Not all platforms support all capabilities. The infrastructure profile understands what each platform can and cannot do, and agents adapt accordingly.
| Capability | Snowflake | BigQuery | Databricks | Redshift | Oracle | PostgreSQL | Teradata | Fabric |
|---|---|---|---|---|---|---|---|---|
| SQL Pushdown | Snowpark | BQ SQL | Spark SQL | Redshift SQL | PL/SQL | pgSQL | Vantage SQL | Spark SQL |
| ML-in-DB | Snowpark ML | BQ ML | MLlib | Redshift ML | OML | MADlib | Vantage Analytics | Synapse ML |
| Feature Store | Feast | Vertex AI FS | Built-in FS | SageMaker FS | Custom | Custom | Custom | Built-in FS |
| Streaming | Snowpipe | BQ Streams | Structured Streaming | Kinesis | GoldenGate | Logical Replication | QueryGrid | Event Streams |
| Governance | Horizon | Dataplex | Unity Catalog | Lake Formation | Data Safe | Row-Level Security | Column-Level Sec | Purview |
| Semi-structured | VARIANT | JSON | Struct | SUPER | JSON | JSONB | JSON | JSON |
| Time Travel | 90 days | 7 days | 30 days | None | Flashback | None | None | None |
| Materialized Views | Yes | Yes | Yes (Delta) | Yes | Yes | Yes | Yes | Yes |
| Serverless | Yes | Yes | Yes (SQL) | Serverless | Autonomous | No | Vantage | Yes |
How Agents Use the Capability Matrix #
When the ETL Agent generates SQL, it checks the capability matrix:
- SQL Pushdown supported? Generate dialect-specific SQL and push to warehouse.
- ML-in-DB supported? Train models inside the warehouse (no data movement).
- Feature Store available? Use the platform's feature store; otherwise, use feature tables.
- Streaming supported? Use native streaming; otherwise, use micro-batch.
- Governance built-in? Use native access policies; otherwise, generate RBAC views.
flowchart TB
A["ETL Agent needs to train a churn model"] --> B["Platform = Snowflake"]
A --> C["Platform = BigQuery"]
A --> D["Platform = PostgreSQL"]
A --> E["Platform = Teradata"]
B --> B1["Use Snowpark ML (ML-in-DB)\nNo data leaves the warehouse"]
C --> C1["Use BigQuery ML\nCREATE MODEL directly in BQ"]
D --> D1["Use MADlib or export to MLflow\nData moves to external ML platform"]
E --> E1["Use Vantage Analytics Library\nIn-database scoring"]
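The decision logic above reduces to a lookup against the capability matrix. A toy sketch of that dispatch (the table is abbreviated from the chapter's matrix; treating PostgreSQL's MADlib as a fallback rather than a built-in is an assumption, since the extension is often not installed):

```python
# Illustrative sketch: choose an ML training strategy from the capability
# matrix. Abbreviated table; entries mirror the chapter's matrix.
ML_IN_DB = {
    "snowflake": "Snowpark ML",
    "bigquery": "BigQuery ML",
    "databricks": "MLlib",
    "redshift": "Redshift ML",
    "oracle": "OML",
    "teradata": "Vantage Analytics Library",
}


def training_strategy(platform: str) -> str:
    engine = ML_IN_DB.get(platform)
    if engine:
        # ML-in-DB: train inside the warehouse, no data egress.
        return f"in-database ({engine})"
    # Fallback (e.g., PostgreSQL without MADlib): extract data and train
    # on an external ML platform, paying egress cost and latency.
    return "export to MLflow (data leaves the warehouse)"


print(training_strategy("snowflake"))  # → in-database (Snowpark ML)
print(training_strategy("postgres"))   # → export to MLflow (data leaves the warehouse)
```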
Insight: The capability matrix is not just about feature support — it is about cost. Running ML training inside Snowflake (Snowpark ML) avoids data egress charges. Running it outside requires extracting data, which incurs both egress cost and latency. The profile-aware agent makes the optimal choice automatically.
Declaring Profiles for Each Platform #
PostgreSQL Profile (Development / Small Scale) #
infrastructure_profile DevProfile {
data_warehouse: {
platform: "postgres",
connection: env("PG_URL"),
schemas: ["raw", "staging", "analytics", "ml_features"]
},
data_science: {
mlflow: { uri: "http://localhost:5000" },
compute: { local: true, gpu: false }
},
governance: {
regulations: ["GDPR"],
pii_columns: ["email", "phone", "date_of_birth"]
}
}
Snowflake Profile (Production / Enterprise) #
infrastructure_profile SnowflakeProfile {
data_warehouse: {
platform: "snowflake",
connection: env("SNOWFLAKE_URL"),
warehouse: "ANALYTICS_WH",
database: "PRODUCTION",
schemas: ["raw", "staging", "analytics", "ml_features", "ml_predictions"]
},
data_science: {
mlflow: { uri: env("MLFLOW_TRACKING_URI") },
feature_store: { type: "feast", uri: env("FEAST_URI") },
compute: { local: false, gpu: false }
},
governance: {
catalog: { type: "snowflake_horizon" },
regulations: ["GDPR", "CCPA", "SOX"],
pii_columns: ["email", "phone", "date_of_birth", "first_name", "last_name", "ssn"]
},
streaming: {
platform: "snowpipe",
connection: env("SNOWPIPE_URL")
}
}
BigQuery Profile (Google Cloud) #
infrastructure_profile BigQueryProfile {
data_warehouse: {
platform: "bigquery",
project: env("GCP_PROJECT_ID"),
dataset: "analytics",
location: "US"
},
data_science: {
mlflow: { uri: env("MLFLOW_TRACKING_URI") },
feature_store: { type: "vertex_ai", project: env("GCP_PROJECT_ID") },
compute: { local: false, gpu: true }
},
governance: {
catalog: { type: "dataplex", project: env("GCP_PROJECT_ID") },
regulations: ["GDPR", "CCPA"],
pii_columns: ["email", "phone", "date_of_birth"]
},
streaming: {
platform: "bigquery_streaming",
project: env("GCP_PROJECT_ID")
}
}
Databricks Profile (Lakehouse) #
infrastructure_profile DatabricksProfile {
data_warehouse: {
platform: "databricks",
connection: env("DATABRICKS_URL"),
catalog: "main",
schemas: ["raw", "staging", "analytics", "ml_features"]
},
data_science: {
mlflow: { uri: env("DATABRICKS_URL"), managed: true },
feature_store: { type: "databricks", catalog: "main" },
compute: { local: false, gpu: true }
},
governance: {
catalog: { type: "unity_catalog", url: env("DATABRICKS_URL") },
regulations: ["GDPR", "HIPAA"],
pii_columns: ["email", "phone", "date_of_birth", "ssn"]
},
streaming: {
platform: "structured_streaming",
checkpoint: "dbfs:/checkpoints/"
}
}
Anti-Pattern: Embedding platform-specific configuration in agent declarations instead of in the infrastructure profile. If you write `warehouse: "snowflake"` directly in your `etl agent` block, you cannot swap platforms without modifying agent logic. Always reference the profile.
How Agents Adapt Execution Per Profile #
The same agent declaration produces different execution plans depending on the active profile.
Example: Churn Feature Pipeline #
// This agent logic is IDENTICAL regardless of platform
etl agent ChurnFeatures {
provider: "openai", model: "gpt-4o-mini",
sources: [OLTPSource],
warehouse: InfraWarehouse,
model_type: "star",
pipeline: {
extract: {
from: [OLTPSource],
mode: "incremental",
watermark: "updated_at"
},
transform: {
derive: {
days_since_last_order: "DATEDIFF(day, last_order_date, CURRENT_DATE)",
order_frequency_30d: "COUNT(order_id) OVER (PARTITION BY customer_id ORDER BY order_date RANGE BETWEEN INTERVAL 30 DAY PRECEDING AND CURRENT ROW)",
avg_order_value: "AVG(total_amount) OVER (PARTITION BY customer_id)"
}
},
load: {
target: InfraWarehouse,
mode: "upsert",
key: ["customer_id"]
}
},
quality: FeatureQuality,
lineage: true
}
PostgreSQL Execution (DevProfile) #
Agent generates PostgreSQL-dialect SQL:
INSERT INTO ml_features.churn_features (customer_id, days_since_last_order, ...)
SELECT
c.customer_id,
CURRENT_DATE - c.last_order_date AS days_since_last_order,
COUNT(o.order_id) OVER (
PARTITION BY c.customer_id
ORDER BY o.order_date
RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW
) AS order_frequency_30d,
AVG(o.total_amount) OVER (PARTITION BY c.customer_id) AS avg_order_value
FROM simshop_oltp.customers c
JOIN simshop_oltp.orders o ON c.customer_id = o.customer_id
ON CONFLICT (customer_id) DO UPDATE SET ...
Snowflake Execution (SnowflakeProfile) #
Agent generates Snowflake-dialect SQL:
MERGE INTO ml_features.churn_features t
USING (
SELECT
c.customer_id,
DATEDIFF(day, c.last_order_date, CURRENT_DATE()) AS days_since_last_order,
COUNT(o.order_id) OVER (
PARTITION BY c.customer_id
ORDER BY o.order_date
RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW
) AS order_frequency_30d,
AVG(o.total_amount) OVER (PARTITION BY c.customer_id) AS avg_order_value
FROM raw.customers c
JOIN raw.orders o ON c.customer_id = o.customer_id
) s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
Key Differences Handled Automatically #
| Aspect | PostgreSQL | Snowflake |
|---|---|---|
| Date arithmetic | CURRENT_DATE - column | DATEDIFF(day, column, CURRENT_DATE()) |
| Interval syntax | INTERVAL '30 days' | INTERVAL '30 days' (same) |
| Upsert | ON CONFLICT ... DO UPDATE | MERGE INTO ... USING |
| Schema qualification | simshop_oltp.customers | raw.customers (from profile) |
| Timestamp function | CURRENT_DATE | CURRENT_DATE() (parentheses) |
The agent logic did not change. The infrastructure profile changed. The SQL changed automatically.
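The dialect differences in the table can be pictured as a small dispatch layer. A toy sketch of two such rules (real transpilers such as SQLGlot handle far more; the function names here are illustrative, not the Neam transpiler):

```python
# Illustrative sketch: render two dialect-sensitive fragments from the
# table above. The dispatch-by-dialect pattern is the point, not coverage.
def days_since(column: str, dialect: str) -> str:
    """Render 'days since a date column' in the target dialect."""
    if dialect == "postgres":
        return f"CURRENT_DATE - {column}"
    if dialect == "snowflake":
        return f"DATEDIFF(day, {column}, CURRENT_DATE())"
    raise ValueError(f"unsupported dialect: {dialect}")


def upsert_clause(key: str, dialect: str) -> str:
    """Render the upsert idiom for the target dialect (elided for brevity)."""
    if dialect == "postgres":
        return f"ON CONFLICT ({key}) DO UPDATE SET"
    if dialect == "snowflake":
        return f"MERGE INTO ... ON t.{key} = s.{key}"
    raise ValueError(f"unsupported dialect: {dialect}")


print(days_since("last_order_date", "postgres"))
# → CURRENT_DATE - last_order_date
print(days_since("last_order_date", "snowflake"))
# → DATEDIFF(day, last_order_date, CURRENT_DATE())
```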
Complete Example: Same Churn Prediction, Two Platforms #
This example shows the complete churn prediction program from DataSims, configured to run against two different platforms by swapping only the infrastructure profile.
// ═══════════════════════════════════════════════════════════════
// SimShop Churn Prediction — Platform-Agnostic Program
// ═══════════════════════════════════════════════════════════════
// ── Budgets ──
budget DIOBudget { cost: 500.00, tokens: 2000000 }
budget AgentBudget { cost: 50.00, tokens: 500000 }
// ╔═══════════════════════════════════════════════════════════╗
// ║ SWAP THIS BLOCK TO CHANGE PLATFORMS ║
// ╚═══════════════════════════════════════════════════════════╝
// Option A: PostgreSQL (development, DataSims)
infrastructure_profile SimShopInfra {
data_warehouse: {
platform: "postgres",
connection: env("SIMSHOP_PG_URL"),
schemas: ["simshop_oltp", "simshop_staging", "simshop_dw",
"ml_features", "ml_predictions"]
},
data_science: {
mlflow: { uri: env("MLFLOW_TRACKING_URI") },
compute: { local: true, gpu: false }
},
governance: {
regulations: ["GDPR"],
pii_columns: ["email", "phone", "date_of_birth",
"first_name", "last_name"]
}
}
// Option B: Snowflake (production)
// infrastructure_profile SimShopInfra {
// data_warehouse: {
// platform: "snowflake",
// connection: env("SNOWFLAKE_URL"),
// warehouse: "ANALYTICS_WH",
// database: "SIMSHOP_PROD",
// schemas: ["raw", "staging", "analytics",
// "ml_features", "ml_predictions"]
// },
// data_science: {
// mlflow: { uri: env("MLFLOW_TRACKING_URI") },
// feature_store: { type: "feast", uri: env("FEAST_URI") },
// compute: { local: false, gpu: false }
// },
// governance: {
// catalog: { type: "snowflake_horizon" },
// regulations: ["GDPR", "CCPA"],
// pii_columns: ["email", "phone", "date_of_birth",
// "first_name", "last_name"]
// },
// streaming: {
// platform: "snowpipe",
// connection: env("SNOWPIPE_URL")
// }
// }
// ══════════════════════════════════════════════════════════════
// EVERYTHING BELOW IS PLATFORM-AGNOSTIC
// ══════════════════════════════════════════════════════════════
// ── SQL Connection (source OLTP database — update alongside the profile) ──
sql_connection SimShopDB {
platform: "postgres",
connection: env("SIMSHOP_PG_URL"),
database: "simshop"
}
// ── Sub-Agents ──
analyst agent SimShopAnalyst {
provider: "openai", model: "gpt-4o-mini",
connections: [SimShopDB],
budget: AgentBudget
}
datascientist agent ChurnDS {
provider: "openai", model: "gpt-4o",
budget: AgentBudget
}
causal agent ChurnCausal {
provider: "openai", model: "o3-mini",
budget: AgentBudget
}
datatest agent ChurnTester {
provider: "openai", model: "gpt-4o",
budget: AgentBudget
}
mlops agent ChurnMLOps {
provider: "openai", model: "gpt-4o",
budget: AgentBudget
}
// ── The Orchestrator ──
dio agent SimShopDIO {
mode: "auto",
task: "Predict which SimShop customers will churn in the next 90 days",
infrastructure: SimShopInfra,
provider: "openai",
model: "gpt-4o",
budget: DIOBudget
}
// ── Execute ──
let result = dio_solve(SimShopDIO, "Predict churn and identify top drivers")
print(result)
What Changes Between Platforms #
| Component | PostgreSQL Profile | Snowflake Profile |
|---|---|---|
| SQL dialect | pg (PostgreSQL) | Snowflake SQL |
| Upsert syntax | ON CONFLICT DO UPDATE | MERGE INTO ... USING |
| Date functions | CURRENT_DATE - INTERVAL | DATEADD(day, -N, CURRENT_DATE()) |
| Feature store | Local tables | Feast managed service |
| Governance | Row-level security | Snowflake Horizon |
| Streaming | Logical replication | Snowpipe |
| ML training | Export to MLflow | Snowpark ML (in-DB) or MLflow |
| Schema names | simshop_oltp, simshop_dw | raw, analytics |
What Does NOT Change #
- Agent declarations (ETL, DataScientist, Causal, DataTest, MLOps)
- Business logic (churn definition, feature engineering formulas)
- Quality gates (completeness, freshness, drift thresholds)
- Orchestration (DIO crew formation, RACI matrix)
- Budget constraints
Try It: Clone the DataSims repo and run the churn prediction program with the PostgreSQL profile. Then modify the `infrastructure_profile` block to target a different platform (even a mock/dry-run is instructive). Observe how the generated SQL changes while the agent logic stays identical.
Standalone Spark and Bigtable #
Beyond the eight primary warehouse platforms, the infrastructure profile also supports standalone Spark clusters and Google Bigtable for specialized workloads.
Standalone Spark #
infrastructure_profile SparkProfile {
data_warehouse: {
platform: "spark",
master: env("SPARK_MASTER_URL"),
deploy_mode: "cluster",
executor_memory: "8g",
num_executors: 16
},
data_science: {
mlflow: { uri: env("MLFLOW_TRACKING_URI") },
compute: { local: false, gpu: true }
}
}
Spark profiles are used when the workload requires distributed processing beyond what a SQL warehouse can handle — typically large-scale ML feature engineering, graph processing, or streaming ETL.
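Under the hood, a Spark profile like the one above plausibly maps onto `spark-submit` arguments. A hedged sketch of that mapping (the field names follow the profile; the mapping itself is an assumption, not documented Neam behavior):

```python
# Illustrative sketch: translate SparkProfile fields into spark-submit
# flags. This mapping is an assumption, not documented Neam behavior.
def spark_submit_args(profile: dict) -> list[str]:
    wh = profile["data_warehouse"]
    return [
        "--master", wh["master"],
        "--deploy-mode", wh["deploy_mode"],
        "--executor-memory", wh["executor_memory"],
        "--num-executors", str(wh["num_executors"]),
    ]


profile = {"data_warehouse": {
    "master": "spark://master:7077",   # placeholder for env("SPARK_MASTER_URL")
    "deploy_mode": "cluster",
    "executor_memory": "8g",
    "num_executors": 16,
}}
print(" ".join(spark_submit_args(profile)))
```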
Google Bigtable #
infrastructure_profile BigtableProfile {
data_warehouse: {
platform: "bigtable",
project: env("GCP_PROJECT_ID"),
instance: "analytics-instance",
table_prefix: "simshop_"
}
}
Bigtable profiles are used for high-throughput, low-latency workloads like real-time feature serving or time-series storage. Agents adapt by generating Bigtable-compatible key designs instead of SQL.
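Row-key design is the Bigtable analogue of SQL schema design. A sketch of one common pattern an agent might generate, a table prefix plus entity ID plus reversed timestamp so the most recent rows sort first in a scan (the prefix and layout are illustrative):

```python
# Illustrative sketch: build a Bigtable row key for time-series feature
# rows. Reversing the timestamp makes recent rows sort first in a scan.
MAX_TS_MS = 10**13  # assumption: millisecond-epoch ceiling used for reversal


def feature_row_key(prefix: str, customer_id: str, ts_ms: int) -> str:
    reversed_ts = MAX_TS_MS - ts_ms
    # Zero-pad so lexicographic key order matches numeric timestamp order.
    return f"{prefix}{customer_id}#{reversed_ts:013d}"


earlier = feature_row_key("simshop_", "cust_42", 1_700_000_000_000)
later = feature_row_key("simshop_", "cust_42", 1_700_000_100_000)
print(later < earlier)  # later event sorts first → True
```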
Industry Perspective #
Infrastructure abstraction is a well-established principle in software engineering (think JDBC, ODBC, SQLAlchemy). What is new in the Neam approach is extending this abstraction beyond SQL to cover the entire data lifecycle:
Multi-Cloud Reality (Flexera, 2024): 89% of enterprises use multi-cloud. The infrastructure profile lets organizations run the same data programs across AWS (Redshift), GCP (BigQuery), Azure (Fabric), and on-premises (PostgreSQL, Oracle) without rewriting agent logic.
Platform Churn (Gartner, 2023): The average enterprise replaces or adds a major data platform every 18 months. Infrastructure profiles reduce the cost of platform transitions from months of rewriting to a configuration change.
Regulatory Geography (GDPR, CCPA, DORA): Different regions require different platforms due to data residency laws. A single program with region-specific profiles (EU data stays in EU BigQuery, US data stays in US Snowflake) satisfies compliance without code duplication.
The Evidence #
The DataSims environment demonstrates infrastructure profiles in practice. The SimShop churn prediction program (neam-agents/programs/simshop_churn.neam) includes an infrastructure_profile SimShopInfra targeting PostgreSQL. The same program is designed to work against any supported platform by changing only the profile block.
Evaluation criteria:
- Functional equivalence: The same 47 churn features must be computed correctly on both PostgreSQL and the target platform.
- SQL correctness: Generated SQL must be syntactically valid and semantically equivalent across dialects.
- Performance parity: Query execution time should be within 2x across platforms (accounting for scale differences).
- Zero agent changes: No agent declaration should require modification when the profile changes.
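The first two criteria can be checked mechanically. A sketch of a cross-platform equivalence test that compares per-customer feature tuples from two warehouses, rounding floats so dialect-level numeric noise does not register as a mismatch (the data shape and helper names are hypothetical):

```python
# Illustrative sketch: verify that the same features computed on two
# platforms agree. Input shape (assumed): {customer_id: {feature: value}}.
def feature_signatures(rows: dict) -> dict:
    """Round floats before comparison so minor dialect-level float noise
    does not register as a functional-equivalence failure."""
    return {
        cid: tuple(sorted((name, round(val, 6)) for name, val in feats.items()))
        for cid, feats in rows.items()
    }


def mismatched_customers(pg_rows: dict, sf_rows: dict) -> list[str]:
    pg, sf = feature_signatures(pg_rows), feature_signatures(sf_rows)
    return [cid for cid in pg if pg[cid] != sf.get(cid)]


pg = {"c1": {"avg_order_value": 42.0, "days_since_last_order": 7.0}}
sf = {"c1": {"avg_order_value": 42.0000001, "days_since_last_order": 7.0}}
print(mismatched_customers(pg, sf))  # → [] (equal after rounding)
```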
For the complete simulation environment, see DataSims. For the language specification, see Neam: The AI-Native Programming Language.
Key Takeaways #
- Infrastructure profiles decouple agent logic from platform implementation. Change the profile, not the program.
- Ten platforms supported: Snowflake, BigQuery, Databricks, Redshift, Oracle, PostgreSQL, Teradata, Microsoft Fabric, standalone Spark, and Bigtable.
- The capability matrix tells agents what each platform can do (SQL pushdown, ML-in-DB, feature store, streaming, governance), so agents adapt automatically.
- SQL transpilation generates the correct dialect for each platform — date functions, upsert syntax, type casting, and schema qualification all adjust automatically.
- The agent layer never changes. Business logic, quality gates, orchestration, and budgets are platform-agnostic by design.
- Multi-cloud and regulatory compliance are configuration problems, not engineering problems, when infrastructure profiles are used.