Chapter 11: Infrastructure Profiles — One Program, Ten Platforms #

"Any sufficiently advanced abstraction is indistinguishable from portability." — adapted from Arthur C. Clarke


Reading time: 20 min | Personas: Priya (Data Engineer), David (VP of Data), Sarah (ML Engineer) | Part III: Data Infrastructure Agents


What You'll Learn #

  1. How infrastructure profiles decouple agent logic from platform implementation
  2. How the platform capability matrix lets agents adapt to what each platform can and cannot do
  3. How to declare profiles for PostgreSQL, Snowflake, BigQuery, Databricks, standalone Spark, and Bigtable
  4. How the same agent declaration produces dialect-correct SQL and execution plans on different platforms


The Problem: "We Rewrote Everything When We Switched to Snowflake" #

Priya's company spent 18 months building a data platform on Databricks. Spark jobs, Delta Lake tables, Unity Catalog integration, MLflow for model tracking — the full stack. Then the CFO mandated a switch to Snowflake for cost reasons. The data engineering team estimated 3 months to migrate. It took 9.

The problem was not the data migration (Chapter 10 handles that). The problem was that every pipeline, every agent configuration, every SQL query, and every ML integration was written against Databricks-specific APIs. Spark SQL syntax. Delta Lake MERGE semantics. Unity Catalog's Python SDK. MLflow's experiment tracking API. Changing the platform meant rewriting the application.

This is the vendor lock-in tax. It is not hypothetical — it is the single largest hidden cost in enterprise data infrastructure. According to Flexera's 2024 State of the Cloud report, organizations running multi-cloud strategies spend 30% more on integration than those locked into a single platform. But single-platform strategies carry the risk Priya experienced: when the platform changes, everything changes.

The Infrastructure Profile solves this by placing a declarative abstraction layer between agent logic and platform implementation. Agents write against abstract interfaces. The profile maps those interfaces to platform-specific implementations. Changing the platform means changing one declaration — not rewriting the application.


The Infrastructure Profile #

An infrastructure profile declares the platform-specific configuration for data warehousing, data science, governance, and streaming in a single block. Agents reference the profile; they never reference platform APIs directly.

NEAM
infrastructure_profile ProductionInfra {
    data_warehouse: {
        platform: "snowflake",
        connection: env("SNOWFLAKE_URL"),
        warehouse: "ANALYTICS_WH",
        database: "PRODUCTION",
        schemas: ["raw", "staging", "analytics", "ml_features", "ml_predictions"]
    },

    data_science: {
        mlflow: { uri: env("MLFLOW_TRACKING_URI") },
        feature_store: { type: "feast", uri: env("FEAST_SERVER_URI") },
        compute: { local: true, gpu: false }
    },

    governance: {
        catalog: { type: "unity", uri: env("UNITY_CATALOG_URI") },
        regulations: ["GDPR", "CCPA"],
        pii_columns: ["email", "phone", "date_of_birth", "first_name", "last_name"]
    },

    streaming: {
        platform: "snowpipe",
        connection: env("SNOWPIPE_URL")
    }
}
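The env(...) references defer credentials to the environment at load time. A runtime consuming this declaration might resolve them along these lines; this is an illustrative Python sketch, not Neam's actual resolver, and the {"$env": ...} marker convention is an assumption about how env("X") parses:

```python
import os

def resolve_env_refs(node):
    """Recursively replace {"$env": "VAR"} markers with os.environ values.

    Hypothetical convention: env("X") in NEAM parses to {"$env": "X"}.
    Fails fast on missing variables so misconfiguration surfaces at load
    time, not mid-pipeline.
    """
    if isinstance(node, dict):
        if set(node) == {"$env"}:
            var = node["$env"]
            if var not in os.environ:
                raise KeyError(f"missing environment variable: {var}")
            return os.environ[var]
        return {k: resolve_env_refs(v) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_env_refs(v) for v in node]
    return node

production_infra = {
    "data_warehouse": {
        "platform": "snowflake",
        "connection": {"$env": "SNOWFLAKE_URL"},
        "warehouse": "ANALYTICS_WH",
    },
}

os.environ["SNOWFLAKE_URL"] = "snowflake://account.region"
resolved = resolve_env_refs(production_infra)
print(resolved["data_warehouse"]["connection"])  # snowflake://account.region
```

Keeping connection strings out of the profile body means the same declaration can move between environments (dev, staging, prod) without edits.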

What the Profile Abstracts #

DIAGRAM Infrastructure Profile Abstraction
flowchart TB
    A["AGENT LAYER (Platform-Agnostic)\n\nAgents write against ABSTRACT interfaces:\nexecute_sql(query) — dispatched to target dialect\ntrain_model(config) — dispatched to target ML platform\nstore_artifact(model) — dispatched to target registry\nmonitor_drift(config) — dispatched to target monitoring\nregister_table(meta) — dispatched to target catalog\ningest_stream(source) — dispatched to target streaming"]
    A --> B["INFRASTRUCTURE PROFILE\n\nSelected at configuration time — maps abstract\noperations to platform-specific implementations"]
    B --> C1["Snowflake"]
    B --> C2["BigQuery"]
    B --> C3["Databricks"]
    B --> C4["Redshift"]
    B --> C5["Oracle"]
    B --> C6["PostgreSQL"]
    B --> C7["Teradata"]
    B --> C8["Microsoft\nFabric"]

The key insight: same .neam program, same agent logic, different platform. Change only the infrastructure_profile declaration.
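The dispatch the diagram describes can be sketched in a few lines. This Python sketch is illustrative only; the adapter registry, the InfrastructureProfile class, and the bracketed result strings are hypothetical stand-ins for real platform drivers:

```python
from typing import Callable, Dict

# Hypothetical adapter registry: each platform supplies concrete
# implementations of the abstract operations agents call.
ADAPTERS: Dict[str, Dict[str, Callable]] = {
    "postgres": {
        "execute_sql": lambda q: f"[pg] {q}",
        "ingest_stream": lambda src: f"[logical-replication] {src}",
    },
    "snowflake": {
        "execute_sql": lambda q: f"[snowflake] {q}",
        "ingest_stream": lambda src: f"[snowpipe] {src}",
    },
}

class InfrastructureProfile:
    def __init__(self, platform: str):
        self.platform = platform

    def dispatch(self, operation: str, *args):
        """Route an abstract operation to the active platform's adapter."""
        return ADAPTERS[self.platform][operation](*args)

# Agent code never names a platform; swapping the profile swaps behavior.
profile = InfrastructureProfile("snowflake")
print(profile.dispatch("execute_sql", "SELECT 1"))  # [snowflake] SELECT 1
```

The agent layer holds a reference to the profile, never to an adapter, which is what makes the swap a one-line configuration change.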


Platform Capability Matrix #

Not all platforms support all capabilities. The infrastructure profile understands what each platform can and cannot do, and agents adapt accordingly.

| Capability | Snowflake | BigQuery | Databricks | Redshift | Oracle | PostgreSQL | Teradata | Fabric |
|---|---|---|---|---|---|---|---|---|
| SQL Pushdown | Snowpark | BQ SQL | Spark SQL | Redshift SQL | PL/SQL | pgSQL | Vantage SQL | Spark SQL |
| ML-in-DB | Snowpark ML | BQ ML | MLlib | Redshift ML | OML | MADlib | Vantage Analytics | Synapse ML |
| Feature Store | Feast | Vertex AI FS | Built-in FS | SageMaker FS | Custom | Custom | Custom | Built-in FS |
| Streaming | Snowpipe | BQ Streams | Structured Streaming | Kinesis | GoldenGate | Logical Replication | QueryGrid | Event Streams |
| Governance | Horizon | Dataplex | Unity Catalog | Lake Formation | Data Safe | Row-Level Security | Column-Level Sec | Purview |
| Semi-structured | VARIANT | JSON | Struct | SUPER | JSON | JSONB | JSON | JSON |
| Time Travel | 90 days | 7 days | 30 days | None | Flashback | None | None | None |
| Materialized Views | Yes | Yes | Yes (Delta) | Yes | Yes | Yes | Yes | Yes |
| Serverless | Yes | Yes | Yes (SQL) | Serverless | Autonomous | No | Vantage | Yes |

How Agents Use the Capability Matrix #

When an agent plans an operation such as training a churn model, it consults the capability matrix:

DIAGRAM ETL Agent Platform-Specific ML Training
flowchart TB
    A["ETL Agent needs to train a churn model"] --> B["Platform = Snowflake"]
    A --> C["Platform = BigQuery"]
    A --> D["Platform = PostgreSQL"]
    A --> E["Platform = Teradata"]
    B --> B1["Use Snowpark ML (ML-in-DB)\nNo data leaves the warehouse"]
    C --> C1["Use BigQuery ML\nCREATE MODEL directly in BQ"]
    D --> D1["Use MADlib or export to MLflow\nData moves to external ML platform"]
    E --> E1["Use Vantage Analytics Library\nIn-database scoring"]

Insight: The capability matrix is not just about feature support — it is about cost. Running ML training inside Snowflake (Snowpark ML) avoids data egress charges. Running it outside requires extracting data, which incurs both egress cost and latency. The profile-aware agent makes the optimal choice automatically.
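One way to picture this decision is a simple lookup over the capability matrix. A minimal Python sketch with a hypothetical CAPABILITIES table; for simplicity, PostgreSQL's optional MADlib path is modeled here as falling back to external training:

```python
# Hypothetical subset of the capability matrix, held as data so agents
# can query it instead of hardcoding platform knowledge.
CAPABILITIES = {
    "snowflake": {"ml_in_db": "snowpark_ml"},
    "bigquery":  {"ml_in_db": "bigquery_ml"},
    "teradata":  {"ml_in_db": "vantage_analytics"},
    "postgres":  {"ml_in_db": None},  # MADlib is optional; treated as external here
}

def choose_training_strategy(platform: str) -> str:
    """Prefer in-database training (no egress cost); otherwise export."""
    engine = CAPABILITIES[platform]["ml_in_db"]
    if engine:
        return f"train in-db via {engine}"
    return "export features, train via external MLflow"

print(choose_training_strategy("snowflake"))  # train in-db via snowpark_ml
print(choose_training_strategy("postgres"))   # export features, train via external MLflow
```

Encoding the matrix as data also means adding an eleventh platform is a table entry, not an agent rewrite.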


Declaring Profiles for Each Platform #

PostgreSQL Profile (Development / Small Scale) #

NEAM
infrastructure_profile DevProfile {
    data_warehouse: {
        platform: "postgres",
        connection: env("PG_URL"),
        schemas: ["raw", "staging", "analytics", "ml_features"]
    },

    data_science: {
        mlflow: { uri: "http://localhost:5000" },
        compute: { local: true, gpu: false }
    },

    governance: {
        regulations: ["GDPR"],
        pii_columns: ["email", "phone", "date_of_birth"]
    }
}

Snowflake Profile (Production / Enterprise) #

NEAM
infrastructure_profile SnowflakeProfile {
    data_warehouse: {
        platform: "snowflake",
        connection: env("SNOWFLAKE_URL"),
        warehouse: "ANALYTICS_WH",
        database: "PRODUCTION",
        schemas: ["raw", "staging", "analytics", "ml_features", "ml_predictions"]
    },

    data_science: {
        mlflow: { uri: env("MLFLOW_TRACKING_URI") },
        feature_store: { type: "feast", uri: env("FEAST_URI") },
        compute: { local: false, gpu: false }
    },

    governance: {
        catalog: { type: "snowflake_horizon" },
        regulations: ["GDPR", "CCPA", "SOX"],
        pii_columns: ["email", "phone", "date_of_birth", "first_name", "last_name", "ssn"]
    },

    streaming: {
        platform: "snowpipe",
        connection: env("SNOWPIPE_URL")
    }
}

BigQuery Profile (Google Cloud) #

NEAM
infrastructure_profile BigQueryProfile {
    data_warehouse: {
        platform: "bigquery",
        project: env("GCP_PROJECT_ID"),
        dataset: "analytics",
        location: "US"
    },

    data_science: {
        mlflow: { uri: env("MLFLOW_TRACKING_URI") },
        feature_store: { type: "vertex_ai", project: env("GCP_PROJECT_ID") },
        compute: { local: false, gpu: true }
    },

    governance: {
        catalog: { type: "dataplex", project: env("GCP_PROJECT_ID") },
        regulations: ["GDPR", "CCPA"],
        pii_columns: ["email", "phone", "date_of_birth"]
    },

    streaming: {
        platform: "bigquery_streaming",
        project: env("GCP_PROJECT_ID")
    }
}

Databricks Profile (Lakehouse) #

NEAM
infrastructure_profile DatabricksProfile {
    data_warehouse: {
        platform: "databricks",
        connection: env("DATABRICKS_URL"),
        catalog: "main",
        schemas: ["raw", "staging", "analytics", "ml_features"]
    },

    data_science: {
        mlflow: { uri: env("DATABRICKS_URL"), managed: true },
        feature_store: { type: "databricks", catalog: "main" },
        compute: { local: false, gpu: true }
    },

    governance: {
        catalog: { type: "unity_catalog", url: env("DATABRICKS_URL") },
        regulations: ["GDPR", "HIPAA"],
        pii_columns: ["email", "phone", "date_of_birth", "ssn"]
    },

    streaming: {
        platform: "structured_streaming",
        checkpoint: "dbfs:/checkpoints/"
    }
}

Anti-Pattern: Embedding platform-specific configuration in agent declarations instead of in the infrastructure profile. If you write warehouse: "snowflake" directly in your etl agent block, you cannot swap platforms without modifying agent logic. Always reference the profile.


How Agents Adapt Execution Per Profile #

The same agent declaration produces different execution plans depending on the active profile.

Example: Churn Feature Pipeline #

NEAM
// This agent logic is IDENTICAL regardless of platform
etl agent ChurnFeatures {
    provider: "openai", model: "gpt-4o-mini",
    sources: [OLTPSource],
    warehouse: InfraWarehouse,
    model_type: "star",

    pipeline: {
        extract: {
            from: [OLTPSource],
            mode: "incremental",
            watermark: "updated_at"
        },
        transform: {
            derive: {
                days_since_last_order: "DATEDIFF(day, last_order_date, CURRENT_DATE)",
                order_frequency_30d: "COUNT(order_id) OVER (PARTITION BY customer_id ORDER BY order_date RANGE BETWEEN INTERVAL 30 DAY PRECEDING AND CURRENT ROW)",
                avg_order_value: "AVG(total_amount) OVER (PARTITION BY customer_id)"
            }
        },
        load: {
            target: InfraWarehouse,
            mode: "upsert",
            key: ["customer_id"]
        }
    },

    quality: FeatureQuality,
    lineage: true
}

PostgreSQL Execution (DevProfile) #

CODE
  Agent generates PostgreSQL-dialect SQL:

  INSERT INTO ml_features.churn_features (customer_id, days_since_last_order, ...)
  SELECT
      c.customer_id,
      CURRENT_DATE - c.last_order_date AS days_since_last_order,
      COUNT(o.order_id) OVER (
          PARTITION BY c.customer_id
          ORDER BY o.order_date
          RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW
      ) AS order_frequency_30d,
      AVG(o.total_amount) OVER (PARTITION BY c.customer_id) AS avg_order_value
  FROM simshop_oltp.customers c
  JOIN simshop_oltp.orders o ON c.customer_id = o.customer_id
  ON CONFLICT (customer_id) DO UPDATE SET ...

Snowflake Execution (SnowflakeProfile) #

CODE
  Agent generates Snowflake-dialect SQL:

  MERGE INTO ml_features.churn_features t
  USING (
      SELECT
          c.customer_id,
          DATEDIFF(day, c.last_order_date, CURRENT_DATE()) AS days_since_last_order,
          COUNT(o.order_id) OVER (
              PARTITION BY c.customer_id
              ORDER BY o.order_date
              RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW
          ) AS order_frequency_30d,
          AVG(o.total_amount) OVER (PARTITION BY c.customer_id) AS avg_order_value
      FROM raw.customers c
      JOIN raw.orders o ON c.customer_id = o.customer_id
  ) s
  ON t.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET ...
  WHEN NOT MATCHED THEN INSERT ...

Key Differences Handled Automatically #

| Aspect | PostgreSQL | Snowflake |
|---|---|---|
| Date arithmetic | CURRENT_DATE - column | DATEDIFF(day, column, CURRENT_DATE()) |
| Interval syntax | INTERVAL '30 days' | INTERVAL '30 days' (same) |
| Upsert | ON CONFLICT ... DO UPDATE | MERGE INTO ... USING |
| Schema qualification | simshop_oltp.customers | raw.customers (from profile) |
| Timestamp function | CURRENT_DATE | CURRENT_DATE() (parentheses) |

The agent logic did not change. The infrastructure profile changed. The SQL changed automatically.
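The upsert difference above can be sketched as a dialect dispatch. This is an illustrative Python sketch, not the actual Neam transpiler; the upsert_sql helper and its output shapes are assumptions, with value lists elided as "(...)":

```python
def upsert_sql(platform: str, table: str, key: str, cols: list) -> str:
    """Emit the upsert statement shape for each dialect (illustrative only)."""
    col_list = ", ".join(cols)
    if platform == "postgres":
        updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in cols)
        return (f"INSERT INTO {table} ({key}, {col_list}) VALUES (...)\n"
                f"ON CONFLICT ({key}) DO UPDATE SET {updates}")
    if platform == "snowflake":
        updates = ", ".join(f"t.{c} = s.{c}" for c in cols)
        return (f"MERGE INTO {table} t USING (...) s ON t.{key} = s.{key}\n"
                f"WHEN MATCHED THEN UPDATE SET {updates}\n"
                f"WHEN NOT MATCHED THEN INSERT ({key}, {col_list}) VALUES (...)")
    raise ValueError(f"unsupported platform: {platform}")

print(upsert_sql("postgres", "ml_features.churn_features", "customer_id",
                 ["days_since_last_order", "avg_order_value"]))
```

A real transpiler would work from a parsed expression tree rather than string templates, but the dispatch-on-platform shape is the same.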


Complete Example: Same Churn Prediction, Two Platforms #

This example shows the complete churn prediction program from DataSims, configured to run against two different platforms by swapping only the infrastructure profile.

NEAM
// ═══════════════════════════════════════════════════════════════
// SimShop Churn Prediction — Platform-Agnostic Program
// ═══════════════════════════════════════════════════════════════

// ── Budgets ──
budget DIOBudget { cost: 500.00, tokens: 2000000 }
budget AgentBudget { cost: 50.00, tokens: 500000 }

// ╔═══════════════════════════════════════════════════════════╗
// ║  SWAP THIS BLOCK TO CHANGE PLATFORMS                     ║
// ╚═══════════════════════════════════════════════════════════╝

// Option A: PostgreSQL (development, DataSims)
infrastructure_profile SimShopInfra {
    data_warehouse: {
        platform: "postgres",
        connection: env("SIMSHOP_PG_URL"),
        schemas: ["simshop_oltp", "simshop_staging", "simshop_dw",
                  "ml_features", "ml_predictions"]
    },
    data_science: {
        mlflow: { uri: env("MLFLOW_TRACKING_URI") },
        compute: { local: true, gpu: false }
    },
    governance: {
        regulations: ["GDPR"],
        pii_columns: ["email", "phone", "date_of_birth",
                      "first_name", "last_name"]
    }
}

// Option B: Snowflake (production)
// infrastructure_profile SimShopInfra {
//     data_warehouse: {
//         platform: "snowflake",
//         connection: env("SNOWFLAKE_URL"),
//         warehouse: "ANALYTICS_WH",
//         database: "SIMSHOP_PROD",
//         schemas: ["raw", "staging", "analytics",
//                   "ml_features", "ml_predictions"]
//     },
//     data_science: {
//         mlflow: { uri: env("MLFLOW_TRACKING_URI") },
//         feature_store: { type: "feast", uri: env("FEAST_URI") },
//         compute: { local: false, gpu: false }
//     },
//     governance: {
//         catalog: { type: "snowflake_horizon" },
//         regulations: ["GDPR", "CCPA"],
//         pii_columns: ["email", "phone", "date_of_birth",
//                       "first_name", "last_name"]
//     },
//     streaming: {
//         platform: "snowpipe",
//         connection: env("SNOWPIPE_URL")
//     }
// }

// ══════════════════════════════════════════════════════════════
// EVERYTHING BELOW IS PLATFORM-AGNOSTIC
// ══════════════════════════════════════════════════════════════

// ── SQL Connection (keep in sync with the active profile) ──
sql_connection SimShopDB {
    platform: "postgres",              // matches Option A; update when swapping profiles
    connection: env("SIMSHOP_PG_URL"),
    database: "simshop"
}

// ── Sub-Agents ──
analyst agent SimShopAnalyst {
    provider: "openai", model: "gpt-4o-mini",
    connections: [SimShopDB],
    budget: AgentBudget
}

datascientist agent ChurnDS {
    provider: "openai", model: "gpt-4o",
    budget: AgentBudget
}

causal agent ChurnCausal {
    provider: "openai", model: "o3-mini",
    budget: AgentBudget
}

datatest agent ChurnTester {
    provider: "openai", model: "gpt-4o",
    budget: AgentBudget
}

mlops agent ChurnMLOps {
    provider: "openai", model: "gpt-4o",
    budget: AgentBudget
}

// ── The Orchestrator ──
dio agent SimShopDIO {
    mode: "auto",
    task: "Predict which SimShop customers will churn in the next 90 days",
    infrastructure: SimShopInfra,
    provider: "openai",
    model: "gpt-4o",
    budget: DIOBudget
}

// ── Execute ──
let result = dio_solve(SimShopDIO, "Predict churn and identify top drivers")
print(result)

What Changes Between Platforms #

| Component | PostgreSQL Profile | Snowflake Profile |
|---|---|---|
| SQL dialect | pg (PostgreSQL) | Snowflake SQL |
| Upsert syntax | ON CONFLICT DO UPDATE | MERGE INTO ... USING |
| Date functions | CURRENT_DATE - INTERVAL | DATEADD(day, -N, CURRENT_DATE()) |
| Feature store | Local tables | Feast managed service |
| Governance | Row-level security | Snowflake Horizon |
| Streaming | Logical replication | Snowpipe |
| ML training | Export to MLflow | Snowpark ML (in-DB) or MLflow |
| Schema names | simshop_oltp, simshop_dw | raw, analytics |

What Does NOT Change #

The sub-agent declarations, the DIO orchestrator and its task, budgets, quality gates, and lineage tracking are identical under both profiles. Everything below the profile block is platform-agnostic; only the infrastructure_profile declaration differs.

Try It: Clone the DataSims repo and run the churn prediction program with the PostgreSQL profile. Then modify the infrastructure_profile block to target a different platform (even a mock/dry-run is instructive). Observe how the generated SQL changes while the agent logic stays identical.


Standalone Spark and Bigtable #

Beyond the eight primary warehouse platforms, the infrastructure profile also supports standalone Spark clusters and Google Bigtable for specialized workloads.

Standalone Spark #

NEAM
infrastructure_profile SparkProfile {
    data_warehouse: {
        platform: "spark",
        master: env("SPARK_MASTER_URL"),
        deploy_mode: "cluster",
        executor_memory: "8g",
        num_executors: 16
    },

    data_science: {
        mlflow: { uri: env("MLFLOW_TRACKING_URI") },
        compute: { local: false, gpu: true }
    }
}

Spark profiles are used when the workload requires distributed processing beyond what a SQL warehouse can handle — typically large-scale ML feature engineering, graph processing, or streaming ETL.

Google Bigtable #

NEAM
infrastructure_profile BigtableProfile {
    data_warehouse: {
        platform: "bigtable",
        project: env("GCP_PROJECT_ID"),
        instance: "analytics-instance",
        table_prefix: "simshop_"
    }
}

Bigtable profiles are used for high-throughput, low-latency workloads like real-time feature serving or time-series storage. Agents adapt by generating Bigtable-compatible key designs instead of SQL.
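For time-series feature serving, a common Bigtable pattern is an entity-plus-reversed-timestamp row key, so the newest observation for an entity sorts first in a scan. A Python sketch; the feature_row_key helper and key layout are illustrative, not a Neam or Bigtable API:

```python
import time

def feature_row_key(prefix: str, customer_id: int, ts: float) -> str:
    """Build a row key that sorts newest-first for a given customer.

    Bigtable stores rows in lexicographic key order, so the timestamp is
    inverted (max_int64 - millis) and zero-padded: a prefix scan on the
    customer returns the most recent feature row first.
    """
    reverse_ts = 2**63 - 1 - int(ts * 1000)  # milliseconds, inverted
    return f"{prefix}customer#{customer_id:010d}#{reverse_ts:020d}"

k_new = feature_row_key("simshop_", 42, time.time())
k_old = feature_row_key("simshop_", 42, time.time() - 3600)
assert k_new < k_old  # newer rows sort before older ones
```

Zero-padding both components keeps lexicographic order aligned with numeric order, which is the property the whole design depends on.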


Industry Perspective #

Infrastructure abstraction is a well-established principle in software engineering (think JDBC, ODBC, SQLAlchemy). What is new in the Neam approach is extending this abstraction beyond SQL to cover the entire data lifecycle:

Multi-Cloud Reality (Flexera, 2024): 89% of enterprises use multi-cloud. The infrastructure profile lets organizations run the same data programs across AWS (Redshift), GCP (BigQuery), Azure (Fabric), and on-premises (PostgreSQL, Oracle) without rewriting agent logic.

Platform Churn (Gartner, 2023): The average enterprise replaces or adds a major data platform every 18 months. Infrastructure profiles reduce the cost of platform transitions from months of rewriting to a configuration change.

Regulatory Geography (GDPR, CCPA, DORA): Different regions require different platforms due to data residency laws. A single program with region-specific profiles (EU data stays in EU BigQuery, US data stays in US Snowflake) satisfies compliance without code duplication.


The Evidence #

The DataSims environment demonstrates infrastructure profiles in practice. The SimShop churn prediction program (neam-agents/programs/simshop_churn.neam) includes an infrastructure_profile SimShopInfra targeting PostgreSQL. The same program is designed to work against any supported platform by changing only the profile block.

Evaluation criteria:

For the complete simulation environment, see DataSims. For the language specification, see Neam: The AI-Native Programming Language.


Key Takeaways #

  1. Infrastructure profiles decouple agent logic from platform implementation. Change the profile, not the program.
  2. Ten platforms supported: Snowflake, BigQuery, Databricks, Redshift, Oracle, PostgreSQL, Teradata, Microsoft Fabric, standalone Spark, and Bigtable.
  3. The capability matrix tells agents what each platform can do (SQL pushdown, ML-in-DB, feature store, streaming, governance), so agents adapt automatically.
  4. SQL transpilation generates the correct dialect for each platform — date functions, upsert syntax, type casting, and schema qualification all adjust automatically.
  5. The agent layer never changes. Business logic, quality gates, orchestration, and budgets are platform-agnostic by design.
  6. Multi-cloud and regulatory compliance are configuration problems, not engineering problems, when infrastructure profiles are used.