Chapter 15: The Analyst Agent — NL-to-SQL Across 9 Dialects #
"The goal is not to build a better SQL generator. The goal is to let anyone in the organization ask a question and get a trustworthy answer in 30 seconds." — Benn Stancil, co-founder of Mode Analytics
25 min read | Kim, Raj | Part IV: Platform Intelligence Agents
What you'll learn:
- How the Analyst Agent translates natural language business questions into platform-specific SQL across 9+ dialects
- How domain transfer learning from the v0.9.x ecosystem gives the agent schema understanding, business glossary, and governance context — before a single query is written
- How platform-specific optimization generates cost-aware SQL with materialized view rewriting, partition pruning, and result caching
- How governed execution enforces access policies, row limits, cost limits, and query timeouts from the Governance Agent
- How multi-format output delivers results as Excel workbooks, PDF reports, HTML dashboards, CSV, JSON, and Slack messages
- How automated insight discovery finds trends, outliers, correlations, drill-down opportunities, and forecasts
The Problem #
Kim is a data analyst at a retail company. She receives a Slack message from the VP of Marketing at 9:14 AM: "What were our top 10 products by revenue last quarter, broken down by region? And how does that compare to the same quarter last year?"
Kim knows the answer is in Snowflake. She opens her SQL editor. But which tables? The data warehouse has 412 tables across 8 schemas. She checks the wiki — last updated seven months ago. She finds a SALES_MART.FACT_ORDERS table, joins it with SALES_MART.DIM_PRODUCT and SALES_MART.DIM_REGION, filters by date, groups by product and region, calculates revenue, adds a year-over-year comparison using a window function, formats it, and sends it back at 10:47 AM — 93 minutes later.
The VP follows up: "Can you break that down by channel too? And add the margin percentage?"
Kim goes back to Snowflake. She needs DIM_CHANNEL, which she discovers is actually called DIM_SALES_CHANNEL after 10 minutes of searching. Margin requires cost data from FACT_ORDERS.unit_cost, which she hopes includes shipping costs but is not sure. She sends the second answer at 11:38 AM.
This cycle repeats 8-12 times per day across the analytics team. Each question takes 30-90 minutes. The delay is not SQL difficulty — Kim is an expert SQL writer. The delay is context acquisition: finding the right tables, understanding the schema, verifying business definitions, and translating ambiguous English into precise SQL.
The Analyst Agent eliminates the context acquisition bottleneck. It already knows the schema (from the Modeling Agent), the business glossary (from the Governance Agent), the semantic layer (from the ETL Agent), and the access policies (from the Governance Agent). When Kim types "top 10 products by revenue last quarter by region, YoY comparison," the agent generates optimized Snowflake SQL in seconds — not because it is a better SQL writer than Kim, but because it already has the context that Kim spends 80% of her time acquiring.
The 9+ Supported SQL Dialects #
SQL is not one language. It is a family of dialects with significant syntactic and semantic differences. The Analyst Agent generates native, optimized SQL for each platform — not lowest-common-denominator ANSI SQL.
SQL Dialect Differences — Same Question, Different SQL
═══════════════════════════════════════════════════════════
Question: "Top 10 products by revenue, last 30 days"
Snowflake:
SELECT product_name, SUM(revenue) AS total_rev
FROM analytics.fact_orders
WHERE order_date >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
BigQuery:
SELECT product_name, SUM(revenue) AS total_rev
FROM `project.analytics.fact_orders`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
Redshift:
SELECT product_name, SUM(revenue) AS total_rev
FROM analytics.fact_orders
WHERE order_date >= GETDATE() - INTERVAL '30 days'
GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
Oracle:
SELECT product_name, SUM(revenue) AS total_rev
FROM analytics.fact_orders
WHERE order_date >= SYSDATE - 30
GROUP BY product_name
ORDER BY total_rev DESC
FETCH FIRST 10 ROWS ONLY;
SQL Server:
SELECT TOP 10 product_name, SUM(revenue) AS total_rev
FROM analytics.fact_orders
WHERE order_date >= DATEADD(day, -30, GETDATE())
GROUP BY product_name
ORDER BY total_rev DESC;
| Platform | `platform:` value | Key Dialect Differences |
|---|---|---|
| Snowflake | "snowflake" | DATEADD, FLATTEN, QUALIFY, semi-structured VARIANT |
| BigQuery | "bigquery" | Backtick quoting, DATE_SUB, UNNEST, STRUCT, slot-based pricing |
| Redshift | "redshift" | GETDATE(), DISTKEY/SORTKEY hints, WLM queues |
| Databricks | "databricks" | Spark SQL, DELTA table hints, Unity Catalog references |
| PostgreSQL | "postgres" | INTERVAL literals, LATERAL JOIN, CTE materialization hints |
| MySQL | "mysql" | Backtick quoting, DATE_SUB, STRAIGHT_JOIN |
| Oracle | "oracle" | SYSDATE, FETCH FIRST, CONNECT BY, optimizer hints |
| SQL Server | "sqlserver" | TOP, DATEADD, CROSS APPLY, query store hints |
| DuckDB | "duckdb" | Postgres-compatible, PIVOT/UNPIVOT, local analytical engine |
Domain Transfer Learning from the v0.9.x Ecosystem #
The Analyst Agent's key advantage over generic NL-to-SQL tools is domain context — deep knowledge of the data estate that it absorbs from the entire v0.9.x agent ecosystem.
flowchart TB
DA["DataAgent (v0.9.0)"] -->|"Source schemas, column types, quality rules"| RESULT["Analyst Agent understands the schema BEFORE the first question is asked"]
ETL["ETLAgent (v0.9.1)"] -->|"Semantic layer: business metrics, calculated fields\nMart definitions: which tables serve which use cases"| RESULT
GOV["GovernanceAgent (v0.9.4)"] -->|"Business glossary: 'revenue' = unit_price * quantity\nClassification: which columns contain PII\nAccess policies: which roles can see which columns"| RESULT
MOD["ModelingAgent (v0.9.5)"] -->|"ER models: table relationships, foreign keys\nDimensional models: facts, dimensions, grain\nNormalization status: which tables are trustworthy"| RESULT
The domain_context declaration aggregates these references:
domain_context RetailAnalytics {
// v0.9.5 — Schema structure understanding
models: [RetailERModel],
dimensional_models: [SalesStarSchema],
// v0.9.1 — Business metric definitions
semantic_layers: [SalesSemantic],
marts: [SalesMart, MarketingMart],
// v0.9.4 — Business vocabulary and access rules
glossary: BusinessGlossary,
access_policy: AnalystAccess,
classification: DataSensitivity,
// v0.9.0 — Source metadata
schemas: [OrderSchema, ProductSchema, CustomerSchema]
}
Generic NL-to-SQL tools (standalone products without data platform integration) achieve 60-70% accuracy on novel questions because they lack business context. They do not know that "revenue" in your organization means unit_price * quantity - discount_amount rather than total_amount. The domain_context declaration gives the Analyst Agent this context explicitly, lifting accuracy to 92%+ on domain-specific questions.
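The glossary lookup described above can be sketched as a strict dictionary resolution: the agent substitutes the governed expression, or fails loudly instead of guessing from column names. The "revenue" definition comes from the text above; the "margin" entry and all names are illustrative assumptions:

```python
# Hypothetical sketch: resolve a glossary metric into its governed SQL
# expression. "revenue" matches the definition in the text; "margin" is
# an illustrative assumption, not a documented definition.

GLOSSARY = {
    "revenue": "unit_price * quantity - discount_amount",
    "margin":  "(unit_price * quantity - discount_amount) - unit_cost * quantity",
}

def resolve_metric(name: str) -> str:
    """Return the aggregated SQL expression for a metric, or fail loudly
    rather than guessing from column names."""
    expr = GLOSSARY.get(name.lower())
    if expr is None:
        raise KeyError(f"'{name}' is not defined in the business glossary")
    return f"SUM({expr})"

print(resolve_metric("revenue"))
# SUM(unit_price * quantity - discount_amount)
```

The failure mode matters as much as the lookup: a generic tool would silently use total_amount; the glossary-backed agent refuses to answer a question it cannot ground in an approved definition.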
Platform-Specific Query Optimization #
The query_optimizer declaration configures platform-aware cost optimization:
query_optimizer SnowflakeOptimizer {
platform: "snowflake",
strategies: {
materialized_view_rewriting: {
enabled: true,
prefer_over: "base_tables",
stale_threshold: "4h"
},
partition_pruning: {
enabled: true,
warn_full_scan: true,
max_partitions_scanned: 365
},
clustering_alignment: {
enabled: true,
suggest_clustering_keys: true
},
result_caching: {
enabled: true,
ttl: "24h"
},
warehouse_sizing: {
auto_select: true,
prefer_smaller: true,
cost_estimate: true
}
},
cost_estimation: {
enabled: true,
warn_above: 1.00,
block_above: 10.00,
display_to_user: true
}
}
The optimizer transforms the generated SQL before execution:
flowchart TB
A["NL Question"] --> B["Raw SQL (semantically correct)"]
B --> C["Optimizer Pass 1:\nMaterialized View Rewriting\n\nfact_orders WHERE date > X\n→ mv_monthly_sales (pre-agg)\n(10B rows → 120K rows)"]
C --> D["Optimizer Pass 2:\nPartition Pruning\n\nWHERE order_date >= '2025-10-01'\n→ Prune 274/365 daily parts\n(scan 91 partitions, not 365)"]
D --> E["Optimizer Pass 3:\nCost Estimation\n\nEstimated: 0.03 credits ($0.12)\nStatus: APPROVED (< $1.00 warn)"]
E --> F["Optimized SQL (ready for execution)"]
Do not disable cost estimation to "move faster." A single unoptimized BigQuery query scanning a multi-petabyte table can cost thousands of dollars. The Analyst Agent's cost estimation is not a speed bump — it is a financial guardrail. One customer reported a $47,000 BigQuery bill from a single analyst's ad-hoc query that scanned 3 years of event data without a date filter.
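The warn_above / block_above logic from the query_optimizer declaration is simple to state precisely. A minimal sketch, using the thresholds from the example config (the function name is illustrative):

```python
# Hypothetical sketch of the cost_estimation guardrail. Thresholds
# mirror the query_optimizer example: warn above $1.00, block above $10.00.

WARN_ABOVE = 1.00    # USD: surface a warning to the user, then run
BLOCK_ABOVE = 10.00  # USD: refuse to execute at all

def cost_gate(estimated_usd: float) -> str:
    """Classify an estimated query cost: 'approved', 'warn', or 'blocked'."""
    if estimated_usd > BLOCK_ABOVE:
        return "blocked"
    if estimated_usd > WARN_ABOVE:
        return "warn"
    return "approved"

assert cost_gate(0.12) == "approved"     # the example query above
assert cost_gate(4.50) == "warn"
assert cost_gate(47000.00) == "blocked"  # the $47K query never runs
```

Because the gate runs on the estimate, not the bill, the unfiltered three-year scan is rejected before a single byte is read.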
Governed Execution #
The execution_policy declaration enforces governance during query execution:
execution_policy AnalystExecution {
connection_pool: {
max_connections: 10,
timeout: "30s",
idle_timeout: "5m"
},
query_limits: {
max_rows: 100000,
max_runtime: "300s",
max_cost: 5.00,
max_concurrent: 3
},
governance: {
enforce_access_policy: true,
enforce_classification: true,
mask_restricted_columns: true,
audit_all_queries: true,
require_purpose: true
},
safety: {
block_ddl: true,
block_dml: true,
block_truncate: true,
read_only: true
}
}
Governed execution means:
- The Analyst Agent can only read — DDL, DML, and TRUNCATE are blocked at the execution policy level
- Access policies from the Governance Agent are enforced — if the requesting user's role cannot see RESTRICTED columns, those columns are masked in the output
- Every query is logged in the audit trail with user, purpose, SQL text, row count, cost, and timestamp
- Cost limits prevent runaway queries — if estimated cost exceeds max_cost, the query is blocked before execution
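The block_ddl / block_dml / block_truncate settings amount to a statement-type gate in front of the connection. A real gate would parse the SQL; this token-level Python sketch is illustrative only:

```python
# Hypothetical sketch of the read-only safety gate from execution_policy.
# A production gate would use a real SQL parser; this token check only
# illustrates the policy: accept SELECT (or WITH ... SELECT), nothing else.

BLOCKED = {"INSERT", "UPDATE", "DELETE", "MERGE", "TRUNCATE",
           "CREATE", "ALTER", "DROP", "GRANT"}

def is_read_only(sql: str) -> bool:
    """Accept only plain SELECT (or WITH ... SELECT) statements."""
    tokens = sql.upper().split()
    if not tokens or tokens[0] not in ("SELECT", "WITH"):
        return False
    return BLOCKED.isdisjoint(tokens)

assert is_read_only("SELECT * FROM analytics.fact_orders")
assert not is_read_only("TRUNCATE TABLE analytics.fact_orders")
assert not is_read_only("SELECT 1; DELETE FROM analytics.fact_orders")
```

Enforcing this at the execution-policy level, rather than in the prompt, means even a misgenerated or injected statement cannot mutate data.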
Multi-Format Output #
The output_format declaration configures how results are delivered:
output_format ExcelReport {
type: "excel",
template: "./templates/quarterly_report.xlsx",
sheets: {
summary: { chart: "bar", title: "Revenue by Region" },
detail: { auto_filter: true, freeze_panes: "A2" },
metadata: { query_text: true, run_timestamp: true }
},
delivery: { channel: "email", recipients: ["vp_marketing@company.com"] }
}
output_format SlackSummary {
type: "slack",
channel: "#analytics-results",
format: {
max_rows: 20,
include_chart: true,
chart_type: "bar",
thread_reply: true
}
}
output_format PDFDashboard {
type: "pdf",
template: "./templates/executive_dashboard.html",
page_size: "A4",
orientation: "landscape",
charts: ["bar", "line", "pie"],
delivery: { channel: "email", recipients: ["cfo@company.com"] }
}
output_format JSONApi {
type: "json",
schema: {
include_metadata: true,
include_column_types: true,
date_format: "ISO-8601",
null_handling: "omit"
},
delivery: { channel: "webhook", url: env("ANALYTICS_API_URL") }
}
| Format | Use Case | Key Features |
|---|---|---|
| Excel | Business stakeholders, finance teams | Multi-sheet workbooks, charts, auto-filters, freeze panes |
| PDF | Executive reports, board decks | Template-based, charts, professional formatting |
| HTML | Interactive dashboards, email embeds | Responsive tables, inline charts, sortable columns |
| CSV | Data exchange, downstream processing | Simple, universal, large datasets |
| JSON | API integration, programmatic access | Typed, metadata-rich, webhook delivery |
| Slack | Quick answers, team collaboration | Inline tables, charts, thread replies |
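The JSONApi settings above (ISO-8601 dates, null_handling: omit, type metadata) can be sketched as a small serializer. The function and field names are illustrative assumptions:

```python
# Hypothetical sketch of the JSONApi output format: ISO-8601 dates,
# null_handling "omit", and column-type metadata. Names are illustrative.
import json
from datetime import date

def to_json_payload(rows, column_types):
    """Serialize result rows, dropping nulls and formatting dates."""
    cleaned = [
        {col: (val.isoformat() if isinstance(val, date) else val)
         for col, val in row.items()
         if val is not None}          # null_handling: "omit"
        for row in rows
    ]
    return json.dumps({"metadata": {"column_types": column_types},
                       "rows": cleaned})

payload = to_json_payload(
    [{"product": "Widget Pro", "sold_on": date(2025, 10, 1), "note": None}],
    {"product": "STRING", "sold_on": "DATE", "note": "STRING"},
)
print(payload)  # "sold_on" is "2025-10-01"; "note" is omitted
```

Omitting nulls rather than emitting them keeps webhook payloads small, at the cost of forcing consumers to treat absent keys as null.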
Automated Insight Discovery #
The Analyst Agent does not just answer questions — it discovers insights that the user did not ask about:
analyst agent DataAnalyst {
provider: "anthropic",
model: "claude-sonnet-4-6",
system: "You are a data analyst. Generate optimized SQL, discover insights, and explain findings in business terms.",
temperature: 0.3,
connections: [SnowflakeDW, BigQueryAnalytics, PostgresApp],
domain_context: RetailAnalytics,
optimizer: SnowflakeOptimizer,
execution_policy: AnalystExecution,
output_formats: [ExcelReport, SlackSummary, PDFDashboard, JSONApi],
insights: {
trends: {
enabled: true,
lookback: "12m",
granularity: "weekly",
significance_threshold: 0.05
},
outliers: {
enabled: true,
method: "iqr",
threshold: 1.5
},
correlations: {
enabled: true,
min_correlation: 0.7,
max_features: 20
},
drill_down: {
enabled: true,
auto_suggest: true,
max_depth: 3
},
forecasts: {
enabled: true,
methods: ["linear_regression", "moving_average", "exponential_smoothing"],
horizon: "3m",
confidence_interval: 0.95
}
},
query_library: RetailQueryLibrary,
budget: AnalystBudget,
agent_md: "./agents/data_analyst.md"
}
When Kim asks "top 10 products by revenue last quarter by region," the Analyst Agent returns the answer and also discovers:
| Type | Finding |
|---|---|
| Trend | Product "Widget Pro" revenue grew 34% QoQ in the West region — the fastest growth of any product-region combination. |
| Outlier | Product "Cable Basic" revenue dropped 87% in the Northeast — from $420K to $54K. This exceeds the 1.5x IQR threshold. |
| Correlation | Regions with higher marketing spend (from DIM_MARKETING_BUDGET) show 0.78 correlation with revenue growth (p < 0.01). |
| Drill-Down Suggestion | "Cable Basic" drop in Northeast appears driven by a single large customer (CUST-4892) who stopped ordering in Week 3. Suggest drill-down by customer. |
| Forecast | At current trajectory, West region will exceed East region in total revenue by Q3 2026 (95% CI: Q2-Q4 2026). |
The most valuable insights are often the ones nobody asked for. An analyst answering "top 10 products by revenue" would typically deliver exactly that table and move on. The Analyst Agent's automated insight discovery surfaces the outlier, the correlation, and the drill-down suggestion — turning a data retrieval task into a data analysis task.
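The outlier finding in the table above uses the configured "iqr" method with a 1.5x threshold: values outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR] are flagged. A minimal sketch (the function name and sample figures are illustrative; they echo the Cable Basic example):

```python
# Hypothetical sketch of insights.outliers: method "iqr", threshold 1.5.
# Values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged as outliers.
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the k*IQR fences."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles (exclusive method)
    spread = q3 - q1
    lo, hi = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lo or v > hi]

# Regional revenue ($K): Cable Basic's Northeast collapse stands out.
print(iqr_outliers([420, 405, 398, 412, 430, 54]))
# → [54]
```

Running this per product-region slice is what lets the agent flag the 87% drop without being asked about outliers at all.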
Organizational Query Learning #
The query_library declaration captures organizational knowledge:
query_library RetailQueryLibrary {
storage: "snowflake://ANALYTICS/QUERY_LIBRARY",
learning: {
enabled: true,
capture_approved_queries: true,
capture_user_corrections: true,
similarity_threshold: 0.85
},
templates: {
auto_generate: true,
approval_required: true,
categories: ["sales", "marketing", "finance", "operations"]
}
}
Every time the Analyst Agent generates a query, it checks the library for similar past queries. If a user corrects a generated query ("No, revenue should exclude returns"), the correction is captured and applied to future queries with similar patterns. Over time, the agent learns the organization's specific definitions, preferences, and edge cases.
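The similarity_threshold lookup can be sketched with token-set overlap. A production system would likely use embeddings; plain Jaccard similarity keeps the sketch self-contained, and the function names are illustrative:

```python
# Hypothetical sketch of the query-library similarity check: Jaccard
# overlap of normalized question tokens against past entries. A real
# implementation would use embeddings; names here are illustrative.

def jaccard(a: str, b: str) -> float:
    """Token-set similarity in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def find_similar(question, library, threshold=0.85):
    """Return the closest past query above the threshold, else None."""
    best = max(library, key=lambda q: jaccard(question, q), default=None)
    if best is not None and jaccard(question, best) >= threshold:
        return best
    return None

library = ["top 10 products by revenue last quarter"]
print(find_similar("top 10 products by revenue last quarter", library))
print(find_similar("weekly churn by segment", library))  # no match: None
```

On a hit, the agent reuses the corrected, approved SQL from the library rather than regenerating from scratch, which is how a one-time correction ("revenue should exclude returns") propagates to every similar future question.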
Neam Code: Complete Analyst Agent #
// ═══════════════════════════════════════════════════════════════
// Analyst Agent — Digital Data Analyst for Retail Analytics
// ═══════════════════════════════════════════════════════════════
budget AnalystBudget { cost: 25.00, tokens: 250000, time: 3600000 }
// ─── SQL Connections ───
sql_connection SnowflakeDW {
platform: "snowflake",
connection: env("SNOWFLAKE_URL"),
credentials: env("SNOWFLAKE_CREDS"),
warehouse: "ANALYTICS_WH",
database: "PROD_DB",
schema: "SALES_MART",
timeout: 600,
max_rows: 100000,
cost_limit: 5.00,
prefer_materialized_views: true,
use_result_cache: true,
partition_pruning: true,
schema_source: Snowflake_Analytics,
semantic_layer: Sales_Semantic
}
sql_connection BigQueryAnalytics {
platform: "bigquery",
connection: env("BIGQUERY_CREDS"),
project: "retail-prod",
dataset: "analytics",
cost_limit: 10.00,
max_rows: 500000
}
sql_connection PostgresApp {
platform: "postgres",
connection: env("PG_URL"),
credentials: env("PG_CREDS"),
database: "retail_app",
schema: "public",
max_rows: 5000,
timeout: 30
}
// ─── Domain Context ───
domain_context RetailAnalytics {
models: [RetailERModel],
dimensional_models: [SalesStarSchema],
semantic_layers: [SalesSemantic],
marts: [SalesMart, MarketingMart],
glossary: BusinessGlossary,
access_policy: AnalystAccess,
classification: DataSensitivity,
schemas: [OrderSchema, ProductSchema, CustomerSchema]
}
// ─── The Analyst Agent ───
analyst agent RetailAnalyst {
provider: "anthropic",
model: "claude-sonnet-4-6",
system: "You are a retail data analyst. Answer business questions with optimized SQL across Snowflake, BigQuery, and PostgreSQL.",
temperature: 0.3,
connections: [SnowflakeDW, BigQueryAnalytics, PostgresApp],
domain_context: RetailAnalytics,
optimizer: SnowflakeOptimizer,
execution_policy: AnalystExecution,
output_formats: [ExcelReport, SlackSummary, PDFDashboard, JSONApi],
insights: {
trends: { enabled: true, lookback: "12m" },
outliers: { enabled: true, method: "iqr" },
correlations: { enabled: true, min_correlation: 0.7 },
drill_down: { enabled: true, auto_suggest: true },
forecasts: { enabled: true, horizon: "3m" }
},
query_library: RetailQueryLibrary,
budget: AnalystBudget,
agent_md: "./agents/data_analyst.md"
}
// ─── Usage ───
let answer = RetailAnalyst.ask(
"Top 10 products by revenue last quarter by region, YoY comparison",
output: ExcelReport
)
let scheduled = RetailAnalyst.schedule(
"Weekly revenue summary by channel",
schedule: "0 8 * * MON",
output: SlackSummary
)
Industry Perspective #
The NL-to-SQL Landscape #
The NL-to-SQL problem has been studied for decades, from early systems like LUNAR (1973) to modern LLM-based approaches. The Analyst Agent differentiates from generic NL-to-SQL tools in three ways:
| Capability | Generic NL-to-SQL | Analyst Agent |
|---|---|---|
| Schema understanding | Introspect at query time | Pre-loaded via domain_context |
| Business definitions | None (guesses from column names) | Business glossary from GovernanceAgent |
| Governance | None (trusts the user) | Access policy enforcement, column masking |
| Multi-platform | Usually single-platform | 9+ dialects with platform-specific optimization |
| Learning | Stateless | Query library captures organizational knowledge |
The 80/20 of Analyst Time #
Industry surveys consistently show that analysts spend 60-80% of their time on data discovery, preparation, and context acquisition — not analysis. The Analyst Agent inverts this ratio by pre-loading context from the v0.9.x ecosystem, allowing analysts to spend 80% of their time on interpretation, communication, and decision support.
The Evidence #
DataSims experiments (DataSims repository) demonstrate the Analyst Agent's impact:
| Metric | Without Analyst Agent | With Analyst Agent | Improvement |
|---|---|---|---|
| Time to First Answer | 47 min (manual SQL) | 28 seconds (NL-to-SQL) | 99.0% |
| Queries Per Analyst Per Day | 8-12 | 40-60 | 4.5x |
| Incorrect Join Errors | 2.3 per week | 0.1 per week (domain context) | 95.7% |
| Cross-Platform Query Effort | 2.5 hours (manual translation) | 0 (automatic dialect) | 100% |
| Insight Discovery Rate | 0.3 insights per query (human) | 2.1 insights per query (auto) | 7x |
Ablation A6 (Agent.MD removed across all agents, including the Analyst Agent) showed a 7.7% decrease in churn model AUC. While the Analyst Agent is not directly involved in model building, its domain context feeds the DataScientist Agent's feature engineering. When the Analyst Agent loses its Agent.MD — which encodes organizational-specific business definitions — the domain context becomes generic, and downstream feature engineering suffers.
Clone DataSims, start the SimShop environment, and use the Analyst Agent to answer: "Which customer segments have the highest churn rate, and what is the average order value for each segment?" The agent will generate Snowflake SQL against the SimShop dimensional model, execute it, and discover that the "High Value" segment has a paradoxically high churn rate — the seed insight for the churn prediction case study in Chapter 26.
Key Takeaways #
- The Analyst Agent translates natural language to platform-native SQL across 9+ dialects — not lowest-common-denominator ANSI SQL
- Domain transfer learning from the v0.9.x ecosystem (schemas, semantic layers, glossary, ER models, access policies) eliminates the context acquisition bottleneck that consumes 60-80% of analyst time
- Platform-specific optimization includes materialized view rewriting, partition pruning, result caching, and cost estimation — preventing both slow queries and expensive queries
- Governed execution enforces access policies, column masking, cost limits, and read-only mode — the Analyst Agent cannot modify data, only read it
- Multi-format output (Excel, PDF, HTML, CSV, JSON, Slack) delivers results where stakeholders actually consume them
- Automated insight discovery surfaces trends, outliers, correlations, drill-down suggestions, and forecasts that the user did not explicitly ask for
- Organizational query learning captures corrections and approved queries, improving accuracy over time
For Further Exploration #
- Neam Language Reference: Analyst Agent
- DataSims: Simulated Enterprise Environment — SimShop dimensional model for NL-to-SQL testing
- Designing Data-Intensive Applications (Kleppmann, 2017) — Chapter 2: Data Models and Query Languages
- SQL Antipatterns (Karwin, 2010) — common SQL mistakes that NL-to-SQL systems must avoid
- Spider Benchmark (Yale, 2018) — academic NL-to-SQL evaluation methodology