Chapter 11: Infrastructure Profiles — One Program, Ten Platforms #
"Any sufficiently advanced abstraction is indistinguishable from portability." — adapted from Arthur C. Clarke
Reading time: 20 min | Personas: Priya (Data Engineer), David (VP of Data), Sarah (ML Engineer) | Part III: Data Infrastructure Agents
What You'll Learn #
- How the `infrastructure_profile` declaration abstracts platform details from agent logic
- How the same `.neam` program runs on Snowflake, BigQuery, Databricks, Redshift, Oracle, PostgreSQL, Teradata, Microsoft Fabric, Bigtable, and standalone Spark
- The platform capability matrix: which platforms support SQL pushdown, ML-in-DB, feature stores, streaming, and governance
- How agents adapt their execution strategy based on the active profile
- A complete example: the same churn prediction program running against a PostgreSQL profile and a Snowflake profile
The Problem: "We Rewrote Everything When We Switched to Snowflake" #
Priya's company spent 18 months building a data platform on Databricks. Spark jobs, Delta Lake tables, Unity Catalog integration, MLflow for model tracking — the full stack. Then the CFO mandated a switch to Snowflake for cost reasons. The data engineering team estimated 3 months to migrate. It took 9.
The problem was not the data migration (Chapter 10 handles that). The problem was that every pipeline, every agent configuration, every SQL query, and every ML integration was written against Databricks-specific APIs. Spark SQL syntax. Delta Lake MERGE semantics. Unity Catalog's Python SDK. MLflow's experiment tracking API. Changing the platform meant rewriting the application.
This is the vendor lock-in tax. It is not hypothetical — it is the single largest hidden cost in enterprise data infrastructure. According to Flexera's 2024 State of the Cloud report, organizations running multi-cloud strategies spend 30% more on integration than those locked into a single platform. But single-platform strategies carry the risk Priya experienced: when the platform changes, everything changes.
The Infrastructure Profile solves this by placing a declarative abstraction layer between agent logic and platform implementation. Agents write against abstract interfaces. The profile maps those interfaces to platform-specific implementations. Changing the platform means changing one declaration — not rewriting the application.
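In conventional code, the same idea is programming against an interface. A minimal Python sketch of the dispatch pattern (class and method names here are illustrative, not Neam APIs):

```python
# Illustrative sketch: agents call an abstract interface; the profile picks
# the concrete platform implementation at configuration time.
from typing import Protocol


class Warehouse(Protocol):
    def execute_sql(self, query: str) -> str: ...


class SnowflakeWarehouse:
    def execute_sql(self, query: str) -> str:
        return f"[snowflake] {query}"


class PostgresWarehouse:
    def execute_sql(self, query: str) -> str:
        return f"[postgres] {query}"


# The "profile" is declarative data; only this lookup is platform-aware.
PLATFORMS = {"snowflake": SnowflakeWarehouse, "postgres": PostgresWarehouse}


def warehouse_from_profile(profile: dict) -> Warehouse:
    return PLATFORMS[profile["data_warehouse"]["platform"]]()


wh = warehouse_from_profile({"data_warehouse": {"platform": "postgres"}})
print(wh.execute_sql("SELECT 1"))  # → [postgres] SELECT 1
```

Swapping platforms touches only the dictionary passed to `warehouse_from_profile`, never the caller, which is exactly the contract the profile enforces.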
The Infrastructure Profile #
An infrastructure profile declares the platform-specific configuration for data warehousing, data science, governance, and streaming in a single block. Agents reference the profile; they never reference platform APIs directly.
infrastructure_profile ProductionInfra {
data_warehouse: {
platform: "snowflake",
connection: env("SNOWFLAKE_URL"),
warehouse: "ANALYTICS_WH",
database: "PRODUCTION",
schemas: ["raw", "staging", "analytics", "ml_features", "ml_predictions"]
},
data_science: {
mlflow: { uri: env("MLFLOW_TRACKING_URI") },
feature_store: { type: "feast", uri: env("FEAST_SERVER_URI") },
compute: { local: true, gpu: false }
},
governance: {
catalog: { type: "unity", uri: env("UNITY_CATALOG_URI") },
regulations: ["GDPR", "CCPA"],
pii_columns: ["email", "phone", "date_of_birth", "first_name", "last_name"]
},
streaming: {
platform: "snowpipe",
connection: env("SNOWPIPE_URL")
}
}
What the Profile Abstracts #
flowchart TB
A["AGENT LAYER (Platform-Agnostic)\n\nAgents write against ABSTRACT interfaces:\nexecute_sql(query) — dispatched to target dialect\ntrain_model(config) — dispatched to target ML platform\nstore_artifact(model) — dispatched to target registry\nmonitor_drift(config) — dispatched to target monitoring\nregister_table(meta) — dispatched to target catalog\ningest_stream(source) — dispatched to target streaming"]
A --> B["INFRASTRUCTURE PROFILE\n\nSelected at configuration time — maps abstract\noperations to platform-specific implementations"]
B --> C1["Snowflake"]
B --> C2["BigQuery"]
B --> C3["Databricks"]
B --> C4["Redshift"]
B --> C5["Oracle"]
B --> C6["PostgreSQL"]
B --> C7["Teradata"]
B --> C8["Microsoft\nFabric"]
The key insight: same .neam program, same agent logic, different platform. Change only the infrastructure_profile declaration.
Platform Capability Matrix #
Not all platforms support all capabilities. The infrastructure profile understands what each platform can and cannot do, and agents adapt accordingly.
| Capability | Snowflake | BigQuery | Databricks | Redshift | Oracle | PostgreSQL | Teradata | Fabric |
|---|---|---|---|---|---|---|---|---|
| SQL Pushdown | Snowpark | BQ SQL | Spark SQL | Redshift SQL | PL/SQL | pgSQL | Vantage SQL | Spark SQL |
| ML-in-DB | Snowpark ML | BQ ML | MLlib | Redshift ML | OML | MADlib | Vantage Analytics | Synapse ML |
| Feature Store | Feast | Vertex AI FS | Built-in FS | SageMaker FS | Custom | Custom | Custom | Built-in FS |
| Streaming | Snowpipe | BQ Streams | Structured Streaming | Kinesis | GoldenGate | Logical Replication | QueryGrid | Event Streams |
| Governance | Horizon | Dataplex | Unity Catalog | Lake Formation | Data Safe | Row-Level Security | Column-Level Sec | Purview |
| Semi-structured | VARIANT | JSON | Struct | SUPER | JSON | JSONB | JSON | JSON |
| Time Travel | 90 days | 7 days | 30 days | None | Flashback | None | None | None |
| Materialized Views | Yes | Yes | Yes (Delta) | Yes | Yes | Yes | Yes | Yes |
| Serverless | Yes | Yes | Yes (SQL) | Serverless | Autonomous | No | Vantage | Yes |
How Agents Use the Capability Matrix #
When the ETL Agent generates SQL, it checks the capability matrix:
- SQL Pushdown supported? Generate dialect-specific SQL and push to warehouse.
- ML-in-DB supported? Train models inside the warehouse (no data movement).
- Feature Store available? Use the platform's feature store; otherwise, use feature tables.
- Streaming supported? Use native streaming; otherwise, use micro-batch.
- Governance built-in? Use native access policies; otherwise, generate RBAC views.
flowchart TB
A["ETL Agent needs to train a churn model"] --> B["Platform = Snowflake"]
A --> C["Platform = BigQuery"]
A --> D["Platform = PostgreSQL"]
A --> E["Platform = Teradata"]
B --> B1["Use Snowpark ML (ML-in-DB)\nNo data leaves the warehouse"]
C --> C1["Use BigQuery ML\nCREATE MODEL directly in BQ"]
D --> D1["Use MADlib or export to MLflow\nData moves to external ML platform"]
E --> E1["Use Vantage Analytics Library\nIn-database scoring"]
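The decision logic above reduces to a lookup against the capability matrix. A toy sketch of that dispatch (the table is abbreviated from the chapter's matrix; treating PostgreSQL's MADlib as a fallback rather than a built-in is an assumption, since the extension is often not installed):

```python
# Illustrative sketch: choose an ML training strategy from the capability
# matrix. Abbreviated table; entries mirror the chapter's matrix.
ML_IN_DB = {
    "snowflake": "Snowpark ML",
    "bigquery": "BigQuery ML",
    "databricks": "MLlib",
    "redshift": "Redshift ML",
    "oracle": "OML",
    "teradata": "Vantage Analytics Library",
}


def training_strategy(platform: str) -> str:
    engine = ML_IN_DB.get(platform)
    if engine:
        # ML-in-DB: train inside the warehouse, no data egress.
        return f"in-database ({engine})"
    # Fallback (e.g., PostgreSQL without MADlib): extract data and train
    # on an external ML platform, paying egress cost and latency.
    return "export to MLflow (data leaves the warehouse)"


print(training_strategy("snowflake"))  # → in-database (Snowpark ML)
print(training_strategy("postgres"))   # → export to MLflow (data leaves the warehouse)
```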
Insight: The capability matrix is not just about feature support — it is about cost. Running ML training inside Snowflake (Snowpark ML) avoids data egress charges. Running it outside requires extracting data, which incurs both egress cost and latency. The profile-aware agent makes the optimal choice automatically.
Declaring Profiles for Each Platform #
PostgreSQL Profile (Development / Small Scale) #
infrastructure_profile DevProfile {
data_warehouse: {
platform: "postgres",
connection: env("PG_URL"),
schemas: ["raw", "staging", "analytics", "ml_features"]
},
data_science: {
mlflow: { uri: "http://localhost:5000" },
compute: { local: true, gpu: false }
},
governance: {
regulations: ["GDPR"],
pii_columns: ["email", "phone", "date_of_birth"]
}
}
Snowflake Profile (Production / Enterprise) #
infrastructure_profile SnowflakeProfile {
data_warehouse: {
platform: "snowflake",
connection: env("SNOWFLAKE_URL"),
warehouse: "ANALYTICS_WH",
database: "PRODUCTION",
schemas: ["raw", "staging", "analytics", "ml_features", "ml_predictions"]
},
data_science: {
mlflow: { uri: env("MLFLOW_TRACKING_URI") },
feature_store: { type: "feast", uri: env("FEAST_URI") },
compute: { local: false, gpu: false }
},
governance: {
catalog: { type: "snowflake_horizon" },
regulations: ["GDPR", "CCPA", "SOX"],
pii_columns: ["email", "phone", "date_of_birth", "first_name", "last_name", "ssn"]
},
streaming: {
platform: "snowpipe",
connection: env("SNOWPIPE_URL")
}
}
BigQuery Profile (Google Cloud) #
infrastructure_profile BigQueryProfile {
data_warehouse: {
platform: "bigquery",
project: env("GCP_PROJECT_ID"),
dataset: "analytics",
location: "US"
},
data_science: {
mlflow: { uri: env("MLFLOW_TRACKING_URI") },
feature_store: { type: "vertex_ai", project: env("GCP_PROJECT_ID") },
compute: { local: false, gpu: true }
},
governance: {
catalog: { type: "dataplex", project: env("GCP_PROJECT_ID") },
regulations: ["GDPR", "CCPA"],
pii_columns: ["email", "phone", "date_of_birth"]
},
streaming: {
platform: "bigquery_streaming",
project: env("GCP_PROJECT_ID")
}
}
Databricks Profile (Lakehouse) #
infrastructure_profile DatabricksProfile {
data_warehouse: {
platform: "databricks",
connection: env("DATABRICKS_URL"),
catalog: "main",
schemas: ["raw", "staging", "analytics", "ml_features"]
},
data_science: {
mlflow: { uri: env("DATABRICKS_URL"), managed: true },
feature_store: { type: "databricks", catalog: "main" },
compute: { local: false, gpu: true }
},
governance: {
catalog: { type: "unity_catalog", url: env("DATABRICKS_URL") },
regulations: ["GDPR", "HIPAA"],
pii_columns: ["email", "phone", "date_of_birth", "ssn"]
},
streaming: {
platform: "structured_streaming",
checkpoint: "dbfs:/checkpoints/"
}
}
Anti-Pattern: Embedding platform-specific configuration in agent declarations instead of in the infrastructure profile. If you write `warehouse: "snowflake"` directly in your `etl agent` block, you cannot swap platforms without modifying agent logic. Always reference the profile.
How Agents Adapt Execution Per Profile #
The same agent declaration produces different execution plans depending on the active profile.
Example: Churn Feature Pipeline #
// This agent logic is IDENTICAL regardless of platform
etl agent ChurnFeatures {
provider: "openai", model: "gpt-4o-mini",
sources: [OLTPSource],
warehouse: InfraWarehouse,
model_type: "star",
pipeline: {
extract: {
from: [OLTPSource],
mode: "incremental",
watermark: "updated_at"
},
transform: {
derive: {
days_since_last_order: "DATEDIFF(day, last_order_date, CURRENT_DATE)",
order_frequency_30d: "COUNT(order_id) OVER (PARTITION BY customer_id ORDER BY order_date RANGE BETWEEN INTERVAL 30 DAY PRECEDING AND CURRENT ROW)",
avg_order_value: "AVG(total_amount) OVER (PARTITION BY customer_id)"
}
},
load: {
target: InfraWarehouse,
mode: "upsert",
key: ["customer_id"]
}
},
quality: FeatureQuality,
lineage: true
}
PostgreSQL Execution (DevProfile) #
Agent generates PostgreSQL-dialect SQL:
INSERT INTO ml_features.churn_features (customer_id, days_since_last_order, ...)
SELECT
c.customer_id,
CURRENT_DATE - c.last_order_date AS days_since_last_order,
COUNT(o.order_id) OVER (
PARTITION BY c.customer_id
ORDER BY o.order_date
RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW
) AS order_frequency_30d,
AVG(o.total_amount) OVER (PARTITION BY c.customer_id) AS avg_order_value
FROM simshop_oltp.customers c
JOIN simshop_oltp.orders o ON c.customer_id = o.customer_id
ON CONFLICT (customer_id) DO UPDATE SET ...
Snowflake Execution (SnowflakeProfile) #
Agent generates Snowflake-dialect SQL:
MERGE INTO ml_features.churn_features t
USING (
SELECT
c.customer_id,
DATEDIFF(day, c.last_order_date, CURRENT_DATE()) AS days_since_last_order,
COUNT(o.order_id) OVER (
PARTITION BY c.customer_id
ORDER BY o.order_date
RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW
) AS order_frequency_30d,
AVG(o.total_amount) OVER (PARTITION BY c.customer_id) AS avg_order_value
FROM raw.customers c
JOIN raw.orders o ON c.customer_id = o.customer_id
) s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
Key Differences Handled Automatically #
| Aspect | PostgreSQL | Snowflake |
|---|---|---|
| Date arithmetic | CURRENT_DATE - column | DATEDIFF(day, column, CURRENT_DATE()) |
| Interval syntax | INTERVAL '30 days' | INTERVAL '30 days' (same) |
| Upsert | ON CONFLICT ... DO UPDATE | MERGE INTO ... USING |
| Schema qualification | simshop_oltp.customers | raw.customers (from profile) |
| Timestamp function | CURRENT_DATE | CURRENT_DATE() (parentheses) |
The agent logic did not change. The infrastructure profile changed. The SQL changed automatically.
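The dialect differences in the table can be pictured as a small dispatch layer. A toy sketch of two such rules (real transpilers such as SQLGlot handle far more; the function names here are illustrative, not the Neam transpiler):

```python
# Illustrative sketch: render two dialect-sensitive fragments from the
# table above. The dispatch-by-dialect pattern is the point, not coverage.
def days_since(column: str, dialect: str) -> str:
    """Render 'days since a date column' in the target dialect."""
    if dialect == "postgres":
        return f"CURRENT_DATE - {column}"
    if dialect == "snowflake":
        return f"DATEDIFF(day, {column}, CURRENT_DATE())"
    raise ValueError(f"unsupported dialect: {dialect}")


def upsert_clause(key: str, dialect: str) -> str:
    """Render the upsert idiom for the target dialect (elided for brevity)."""
    if dialect == "postgres":
        return f"ON CONFLICT ({key}) DO UPDATE SET"
    if dialect == "snowflake":
        return f"MERGE INTO ... ON t.{key} = s.{key}"
    raise ValueError(f"unsupported dialect: {dialect}")


print(days_since("last_order_date", "postgres"))
# → CURRENT_DATE - last_order_date
print(days_since("last_order_date", "snowflake"))
# → DATEDIFF(day, last_order_date, CURRENT_DATE())
```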
Complete Example: Same Churn Prediction, Two Platforms #
This example shows the complete churn prediction program from DataSims, configured to run against two different platforms by swapping only the infrastructure profile.
// ═══════════════════════════════════════════════════════════════
// SimShop Churn Prediction — Platform-Agnostic Program
// ═══════════════════════════════════════════════════════════════
// ── Budgets ──
budget DIOBudget { cost: 500.00, tokens: 2000000 }
budget AgentBudget { cost: 50.00, tokens: 500000 }
// ╔═══════════════════════════════════════════════════════════╗
// ║ SWAP THIS BLOCK TO CHANGE PLATFORMS ║
// ╚═══════════════════════════════════════════════════════════╝
// Option A: PostgreSQL (development, DataSims)
infrastructure_profile SimShopInfra {
data_warehouse: {
platform: "postgres",
connection: env("SIMSHOP_PG_URL"),
schemas: ["simshop_oltp", "simshop_staging", "simshop_dw",
"ml_features", "ml_predictions"]
},
data_science: {
mlflow: { uri: env("MLFLOW_TRACKING_URI") },
compute: { local: true, gpu: false }
},
governance: {
regulations: ["GDPR"],
pii_columns: ["email", "phone", "date_of_birth",
"first_name", "last_name"]
}
}
// Option B: Snowflake (production)
// infrastructure_profile SimShopInfra {
// data_warehouse: {
// platform: "snowflake",
// connection: env("SNOWFLAKE_URL"),
// warehouse: "ANALYTICS_WH",
// database: "SIMSHOP_PROD",
// schemas: ["raw", "staging", "analytics",
// "ml_features", "ml_predictions"]
// },
// data_science: {
// mlflow: { uri: env("MLFLOW_TRACKING_URI") },
// feature_store: { type: "feast", uri: env("FEAST_URI") },
// compute: { local: false, gpu: false }
// },
// governance: {
// catalog: { type: "snowflake_horizon" },
// regulations: ["GDPR", "CCPA"],
// pii_columns: ["email", "phone", "date_of_birth",
// "first_name", "last_name"]
// },
// streaming: {
// platform: "snowpipe",
// connection: env("SNOWPIPE_URL")
// }
// }
// ══════════════════════════════════════════════════════════════
// EVERYTHING BELOW IS PLATFORM-AGNOSTIC
// ══════════════════════════════════════════════════════════════
// ── SQL Connection (source OLTP database — update alongside the profile) ──
sql_connection SimShopDB {
platform: "postgres",
connection: env("SIMSHOP_PG_URL"),
database: "simshop"
}
// ── Sub-Agents ──
analyst agent SimShopAnalyst {
provider: "openai", model: "gpt-4o-mini",
connections: [SimShopDB],
budget: AgentBudget
}
datascientist agent ChurnDS {
provider: "openai", model: "gpt-4o",
budget: AgentBudget
}
causal agent ChurnCausal {
provider: "openai", model: "o3-mini",
budget: AgentBudget
}
datatest agent ChurnTester {
provider: "openai", model: "gpt-4o",
budget: AgentBudget
}
mlops agent ChurnMLOps {
provider: "openai", model: "gpt-4o",
budget: AgentBudget
}
// ── The Orchestrator ──
dio agent SimShopDIO {
mode: "auto",
task: "Predict which SimShop customers will churn in the next 90 days",
infrastructure: SimShopInfra,
provider: "openai",
model: "gpt-4o",
budget: DIOBudget
}
// ── Execute ──
let result = dio_solve(SimShopDIO, "Predict churn and identify top drivers")
print(result)
What Changes Between Platforms #
| Component | PostgreSQL Profile | Snowflake Profile |
|---|---|---|
| SQL dialect | pg (PostgreSQL) | Snowflake SQL |
| Upsert syntax | ON CONFLICT DO UPDATE | MERGE INTO ... USING |
| Date functions | CURRENT_DATE - INTERVAL | DATEADD(day, -N, CURRENT_DATE()) |
| Feature store | Local tables | Feast managed service |
| Governance | Row-level security | Snowflake Horizon |
| Streaming | Logical replication | Snowpipe |
| ML training | Export to MLflow | Snowpark ML (in-DB) or MLflow |
| Schema names | simshop_oltp, simshop_dw | raw, analytics |
What Does NOT Change #
- Agent declarations (ETL, DataScientist, Causal, DataTest, MLOps)
- Business logic (churn definition, feature engineering formulas)
- Quality gates (completeness, freshness, drift thresholds)
- Orchestration (DIO crew formation, RACI matrix)
- Budget constraints
Try It: Clone the DataSims repo and run the churn prediction program with the PostgreSQL profile. Then modify the `infrastructure_profile` block to target a different platform (even a mock/dry-run is instructive). Observe how the generated SQL changes while the agent logic stays identical.
Standalone Spark and Bigtable #
Beyond the eight primary warehouse platforms, the infrastructure profile also supports standalone Spark clusters and Google Bigtable for specialized workloads.
Standalone Spark #
infrastructure_profile SparkProfile {
data_warehouse: {
platform: "spark",
master: env("SPARK_MASTER_URL"),
deploy_mode: "cluster",
executor_memory: "8g",
num_executors: 16
},
data_science: {
mlflow: { uri: env("MLFLOW_TRACKING_URI") },
compute: { local: false, gpu: true }
}
}
Spark profiles are used when the workload requires distributed processing beyond what a SQL warehouse can handle — typically large-scale ML feature engineering, graph processing, or streaming ETL.
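Under the hood, a Spark profile like the one above plausibly maps onto `spark-submit` arguments. A hedged sketch of that mapping (the field names follow the profile; the mapping itself is an assumption, not documented Neam behavior):

```python
# Illustrative sketch: translate SparkProfile fields into spark-submit
# flags. This mapping is an assumption, not documented Neam behavior.
def spark_submit_args(profile: dict) -> list[str]:
    wh = profile["data_warehouse"]
    return [
        "--master", wh["master"],
        "--deploy-mode", wh["deploy_mode"],
        "--executor-memory", wh["executor_memory"],
        "--num-executors", str(wh["num_executors"]),
    ]


profile = {"data_warehouse": {
    "master": "spark://master:7077",   # placeholder for env("SPARK_MASTER_URL")
    "deploy_mode": "cluster",
    "executor_memory": "8g",
    "num_executors": 16,
}}
print(" ".join(spark_submit_args(profile)))
```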
Google Bigtable #
infrastructure_profile BigtableProfile {
data_warehouse: {
platform: "bigtable",
project: env("GCP_PROJECT_ID"),
instance: "analytics-instance",
table_prefix: "simshop_"
}
}
Bigtable profiles are used for high-throughput, low-latency workloads like real-time feature serving or time-series storage. Agents adapt by generating Bigtable-compatible key designs instead of SQL.
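Row-key design is the Bigtable analogue of SQL schema design. A sketch of one common pattern an agent might generate, a table prefix plus entity ID plus reversed timestamp so the most recent rows sort first in a scan (the prefix and layout are illustrative):

```python
# Illustrative sketch: build a Bigtable row key for time-series feature
# rows. Reversing the timestamp makes recent rows sort first in a scan.
MAX_TS_MS = 10**13  # assumption: millisecond-epoch ceiling used for reversal


def feature_row_key(prefix: str, customer_id: str, ts_ms: int) -> str:
    reversed_ts = MAX_TS_MS - ts_ms
    # Zero-pad so lexicographic key order matches numeric timestamp order.
    return f"{prefix}{customer_id}#{reversed_ts:013d}"


earlier = feature_row_key("simshop_", "cust_42", 1_700_000_000_000)
later = feature_row_key("simshop_", "cust_42", 1_700_000_100_000)
print(later < earlier)  # later event sorts first → True
```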
Industry Perspective #
Infrastructure abstraction is a well-established principle in software engineering (think JDBC, ODBC, SQLAlchemy). What is new in the Neam approach is extending this abstraction beyond SQL to cover the entire data lifecycle:
Multi-Cloud Reality (Flexera, 2024): 89% of enterprises use multi-cloud. The infrastructure profile lets organizations run the same data programs across AWS (Redshift), GCP (BigQuery), Azure (Fabric), and on-premises (PostgreSQL, Oracle) without rewriting agent logic.
Platform Churn (Gartner, 2023): The average enterprise replaces or adds a major data platform every 18 months. Infrastructure profiles reduce the cost of platform transitions from months of rewriting to a configuration change.
Regulatory Geography (GDPR, CCPA, DORA): Different regions require different platforms due to data residency laws. A single program with region-specific profiles (EU data stays in EU BigQuery, US data stays in US Snowflake) satisfies compliance without code duplication.
The Evidence #
The DataSims environment demonstrates infrastructure profiles in practice. The SimShop churn prediction program (neam-agents/programs/simshop_churn.neam) includes an infrastructure_profile SimShopInfra targeting PostgreSQL. The same program is designed to work against any supported platform by changing only the profile block.
Evaluation criteria:
- Functional equivalence: The same 47 churn features must be computed correctly on both PostgreSQL and the target platform.
- SQL correctness: Generated SQL must be syntactically valid and semantically equivalent across dialects.
- Performance parity: Query execution time should be within 2x across platforms (accounting for scale differences).
- Zero agent changes: No agent declaration should require modification when the profile changes.
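The first two criteria can be checked mechanically. A sketch of a cross-platform equivalence test that compares per-customer feature tuples from two warehouses, rounding floats so dialect-level numeric noise does not register as a mismatch (the data shape and helper names are hypothetical):

```python
# Illustrative sketch: verify that the same features computed on two
# platforms agree. Input shape (assumed): {customer_id: {feature: value}}.
def feature_signatures(rows: dict) -> dict:
    """Round floats before comparison so minor dialect-level float noise
    does not register as a functional-equivalence failure."""
    return {
        cid: tuple(sorted((name, round(val, 6)) for name, val in feats.items()))
        for cid, feats in rows.items()
    }


def mismatched_customers(pg_rows: dict, sf_rows: dict) -> list[str]:
    pg, sf = feature_signatures(pg_rows), feature_signatures(sf_rows)
    return [cid for cid in pg if pg[cid] != sf.get(cid)]


pg = {"c1": {"avg_order_value": 42.0, "days_since_last_order": 7.0}}
sf = {"c1": {"avg_order_value": 42.0000001, "days_since_last_order": 7.0}}
print(mismatched_customers(pg, sf))  # → [] (equal after rounding)
```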
For the complete simulation environment, see DataSims. For the language specification, see Neam: The AI-Native Programming Language.
Key Takeaways #
- Infrastructure profiles decouple agent logic from platform implementation. Change the profile, not the program.
- Ten platforms supported: Snowflake, BigQuery, Databricks, Redshift, Oracle, PostgreSQL, Teradata, Microsoft Fabric, standalone Spark, and Bigtable.
- The capability matrix tells agents what each platform can do (SQL pushdown, ML-in-DB, feature store, streaming, governance), so agents adapt automatically.
- SQL transpilation generates the correct dialect for each platform — date functions, upsert syntax, type casting, and schema qualification all adjust automatically.
- The agent layer never changes. Business logic, quality gates, orchestration, and budgets are platform-agnostic by design.
- Multi-cloud and regulatory compliance are configuration problems, not engineering problems, when infrastructure profiles are used.