Chapter 15: The Analyst Agent — NL-to-SQL Across 9 Dialects #
"The goal is not to build a better SQL generator. The goal is to let anyone in the organization ask a question and get a trustworthy answer in 30 seconds." — Benn Stancil, co-founder of Mode Analytics
25 min read | Kim, Raj | Part IV: Platform Intelligence Agents
What you'll learn:
- How the Analyst Agent translates natural language business questions into platform-specific SQL across 9+ dialects
- How domain transfer learning from the v0.9.x ecosystem gives the agent schema understanding, business glossary, and governance context — before a single query is written
- How platform-specific optimization generates cost-aware SQL with materialized view rewriting, partition pruning, and result caching
- How governed execution enforces access policies, row limits, cost limits, and query timeouts from the Governance Agent
- How multi-format output delivers results as Excel workbooks, PDF reports, HTML dashboards, CSV, JSON, and Slack messages
- How automated insight discovery finds trends, outliers, correlations, drill-down opportunities, and forecasts
The Problem #
Kim is a data analyst at a retail company. She receives a Slack message from the VP of Marketing at 9:14 AM: "What were our top 10 products by revenue last quarter, broken down by region? And how does that compare to the same quarter last year?"
Kim knows the answer is in Snowflake. She opens her SQL editor. But which tables? The data warehouse has 412 tables across 8 schemas. She checks the wiki — last updated seven months ago. She finds a SALES_MART.FACT_ORDERS table, joins it with SALES_MART.DIM_PRODUCT and SALES_MART.DIM_REGION, filters by date, groups by product and region, calculates revenue, adds a year-over-year comparison using a window function, formats it, and sends it back at 10:47 AM — 93 minutes later.
The VP follows up: "Can you break that down by channel too? And add the margin percentage?"
Kim goes back to Snowflake. She needs DIM_CHANNEL, which she discovers is actually called DIM_SALES_CHANNEL after 10 minutes of searching. Margin requires cost data from FACT_ORDERS.unit_cost, which she hopes includes shipping costs but is not sure. She sends the second answer at 11:38 AM.
This cycle repeats 8-12 times per day across the analytics team. Each question takes 30-90 minutes. The delay is not SQL difficulty — Kim is an expert SQL writer. The delay is context acquisition: finding the right tables, understanding the schema, verifying business definitions, and translating ambiguous English into precise SQL.
The Analyst Agent eliminates the context acquisition bottleneck. It already knows the schema (from the Modeling Agent), the business glossary (from the Governance Agent), the semantic layer (from the ETL Agent), and the access policies (from the Governance Agent). When Kim types "top 10 products by revenue last quarter by region, YoY comparison," the agent generates optimized Snowflake SQL in seconds — not because it is a better SQL writer than Kim, but because it already has the context that Kim spends 80% of her time acquiring.
The 9+ Supported SQL Dialects #
SQL is not one language. It is a family of dialects with significant syntactic and semantic differences. The Analyst Agent generates native, optimized SQL for each platform — not lowest-common-denominator ANSI SQL.
SQL Dialect Differences — Same Question, Different SQL
═══════════════════════════════════════════════════════════
Question: "Top 10 products by revenue, last 30 days"
Snowflake:
SELECT product_name, SUM(revenue) AS total_rev
FROM analytics.fact_orders
WHERE order_date >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
BigQuery:
SELECT product_name, SUM(revenue) AS total_rev
FROM `project.analytics.fact_orders`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
Redshift:
SELECT product_name, SUM(revenue) AS total_rev
FROM analytics.fact_orders
WHERE order_date >= GETDATE() - INTERVAL '30 days'
GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
Oracle:
SELECT product_name, SUM(revenue) AS total_rev
FROM analytics.fact_orders
WHERE order_date >= SYSDATE - 30
GROUP BY product_name
ORDER BY total_rev DESC
FETCH FIRST 10 ROWS ONLY;
SQL Server:
SELECT TOP 10 product_name, SUM(revenue) AS total_rev
FROM analytics.fact_orders
WHERE order_date >= DATEADD(day, -30, GETDATE())
GROUP BY product_name
ORDER BY total_rev DESC;
| Platform | `platform:` value | Key Dialect Differences |
|---|---|---|
| Snowflake | "snowflake" | DATEADD, FLATTEN, QUALIFY, semi-structured VARIANT |
| BigQuery | "bigquery" | Backtick quoting, DATE_SUB, UNNEST, STRUCT, slot-based pricing |
| Redshift | "redshift" | GETDATE(), DISTKEY/SORTKEY hints, WLM queues |
| Databricks | "databricks" | Spark SQL, DELTA table hints, Unity Catalog references |
| PostgreSQL | "postgres" | INTERVAL literals, LATERAL JOIN, CTE materialization hints |
| MySQL | "mysql" | Backtick quoting, DATE_SUB, STRAIGHT_JOIN |
| Oracle | "oracle" | SYSDATE, FETCH FIRST, CONNECT BY, optimizer hints |
| SQL Server | "sqlserver" | TOP, DATEADD, CROSS APPLY, query store hints |
| DuckDB | "duckdb" | Postgres-compatible, PIVOT/UNPIVOT, local analytical engine |
Domain Transfer Learning from the v0.9.x Ecosystem #
The Analyst Agent's key advantage over generic NL-to-SQL tools is domain context — deep knowledge of the data estate that it absorbs from the entire v0.9.x agent ecosystem.
flowchart TB
DA["DataAgent (v0.9.0)"] -->|"Source schemas, column types, quality rules"| RESULT["Analyst Agent understands the schema BEFORE the first question is asked"]
ETL["ETLAgent (v0.9.1)"] -->|"Semantic layer: business metrics, calculated fields\nMart definitions: which tables serve which use cases"| RESULT
GOV["GovernanceAgent (v0.9.4)"] -->|"Business glossary: 'revenue' = unit_price * quantity\nClassification: which columns contain PII\nAccess policies: which roles can see which columns"| RESULT
MOD["ModelingAgent (v0.9.5)"] -->|"ER models: table relationships, foreign keys\nDimensional models: facts, dimensions, grain\nNormalization status: which tables are trustworthy"| RESULT
The domain_context declaration aggregates these references:
domain_context RetailAnalytics {
// v0.9.5 — Schema structure understanding
models: [RetailERModel],
dimensional_models: [SalesStarSchema],
// v0.9.1 — Business metric definitions
semantic_layers: [SalesSemantic],
marts: [SalesMart, MarketingMart],
// v0.9.4 — Business vocabulary and access rules
glossary: BusinessGlossary,
access_policy: AnalystAccess,
classification: DataSensitivity,
// v0.9.0 — Source metadata
schemas: [OrderSchema, ProductSchema, CustomerSchema]
}
Generic NL-to-SQL tools (standalone products without data platform integration) achieve 60-70% accuracy on novel questions because they lack business context. They do not know that "revenue" in your organization means unit_price * quantity - discount_amount rather than total_amount. The domain_context declaration gives the Analyst Agent this context explicitly, lifting accuracy to 92%+ on domain-specific questions.
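The glossary lookup described above can be sketched as a strict dictionary resolution: the agent substitutes the governed expression, or fails loudly instead of guessing from column names. The "revenue" definition comes from the text above; the "margin" entry and all names are illustrative assumptions:

```python
# Hypothetical sketch: resolve a glossary metric into its governed SQL
# expression. "revenue" matches the definition in the text; "margin" is
# an illustrative assumption, not a documented definition.

GLOSSARY = {
    "revenue": "unit_price * quantity - discount_amount",
    "margin":  "(unit_price * quantity - discount_amount) - unit_cost * quantity",
}

def resolve_metric(name: str) -> str:
    """Return the aggregated SQL expression for a metric, or fail loudly
    rather than guessing from column names."""
    expr = GLOSSARY.get(name.lower())
    if expr is None:
        raise KeyError(f"'{name}' is not defined in the business glossary")
    return f"SUM({expr})"

print(resolve_metric("revenue"))
# SUM(unit_price * quantity - discount_amount)
```

The failure mode matters as much as the lookup: a generic tool would silently use total_amount; the glossary-backed agent refuses to answer a question it cannot ground in an approved definition.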
Platform-Specific Query Optimization #
The query_optimizer declaration configures platform-aware cost optimization:
query_optimizer SnowflakeOptimizer {
platform: "snowflake",
strategies: {
materialized_view_rewriting: {
enabled: true,
prefer_over: "base_tables",
stale_threshold: "4h"
},
partition_pruning: {
enabled: true,
warn_full_scan: true,
max_partitions_scanned: 365
},
clustering_alignment: {
enabled: true,
suggest_clustering_keys: true
},
result_caching: {
enabled: true,
ttl: "24h"
},
warehouse_sizing: {
auto_select: true,
prefer_smaller: true,
cost_estimate: true
}
},
cost_estimation: {
enabled: true,
warn_above: 1.00,
block_above: 10.00,
display_to_user: true
}
}
The optimizer transforms the generated SQL before execution:
flowchart TB
A["NL Question"] --> B["Raw SQL (semantically correct)"]
B --> C["Optimizer Pass 1:\nMaterialized View Rewriting\n\nfact_orders WHERE date > X\n→ mv_monthly_sales (pre-agg)\n(10B rows → 120K rows)"]
C --> D["Optimizer Pass 2:\nPartition Pruning\n\nWHERE order_date >= '2025-10-01'\n→ Prune 274/365 daily parts\n(scan 91 partitions, not 365)"]
D --> E["Optimizer Pass 3:\nCost Estimation\n\nEstimated: 0.03 credits ($0.12)\nStatus: APPROVED (< $1.00 warn)"]
E --> F["Optimized SQL (ready for execution)"]
Do not disable cost estimation to "move faster." A single unoptimized BigQuery query scanning a multi-petabyte table can cost thousands of dollars. The Analyst Agent's cost estimation is not a speed bump — it is a financial guardrail. One customer reported a $47,000 BigQuery bill from a single analyst's ad-hoc query that scanned 3 years of event data without a date filter.
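The warn_above / block_above logic from the query_optimizer declaration is simple to state precisely. A minimal sketch, using the thresholds from the example config (the function name is illustrative):

```python
# Hypothetical sketch of the cost_estimation guardrail. Thresholds
# mirror the query_optimizer example: warn above $1.00, block above $10.00.

WARN_ABOVE = 1.00    # USD: surface a warning to the user, then run
BLOCK_ABOVE = 10.00  # USD: refuse to execute at all

def cost_gate(estimated_usd: float) -> str:
    """Classify an estimated query cost: 'approved', 'warn', or 'blocked'."""
    if estimated_usd > BLOCK_ABOVE:
        return "blocked"
    if estimated_usd > WARN_ABOVE:
        return "warn"
    return "approved"

assert cost_gate(0.12) == "approved"     # the example query above
assert cost_gate(4.50) == "warn"
assert cost_gate(47000.00) == "blocked"  # the $47K query never runs
```

Because the gate runs on the estimate, not the bill, the unfiltered three-year scan is rejected before a single byte is read.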
Governed Execution #
The execution_policy declaration enforces governance during query execution:
execution_policy AnalystExecution {
connection_pool: {
max_connections: 10,
timeout: "30s",
idle_timeout: "5m"
},
query_limits: {
max_rows: 100000,
max_runtime: "300s",
max_cost: 5.00,
max_concurrent: 3
},
governance: {
enforce_access_policy: true,
enforce_classification: true,
mask_restricted_columns: true,
audit_all_queries: true,
require_purpose: true
},
safety: {
block_ddl: true,
block_dml: true,
block_truncate: true,
read_only: true
}
}
Governed execution means:
- The Analyst Agent can only read — DDL, DML, and TRUNCATE are blocked at the execution policy level
- Access policies from the Governance Agent are enforced — if the requesting user's role cannot see RESTRICTED columns, those columns are masked in the output
- Every query is logged in the audit trail with user, purpose, SQL text, row count, cost, and timestamp
- Cost limits prevent runaway queries — if estimated cost exceeds max_cost, the query is blocked before execution
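The block_ddl / block_dml / block_truncate settings amount to a statement-type gate in front of the connection. A real gate would parse the SQL; this token-level Python sketch is illustrative only:

```python
# Hypothetical sketch of the read-only safety gate from execution_policy.
# A production gate would use a real SQL parser; this token check only
# illustrates the policy: accept SELECT (or WITH ... SELECT), nothing else.

BLOCKED = {"INSERT", "UPDATE", "DELETE", "MERGE", "TRUNCATE",
           "CREATE", "ALTER", "DROP", "GRANT"}

def is_read_only(sql: str) -> bool:
    """Accept only plain SELECT (or WITH ... SELECT) statements."""
    tokens = sql.upper().split()
    if not tokens or tokens[0] not in ("SELECT", "WITH"):
        return False
    return BLOCKED.isdisjoint(tokens)

assert is_read_only("SELECT * FROM analytics.fact_orders")
assert not is_read_only("TRUNCATE TABLE analytics.fact_orders")
assert not is_read_only("SELECT 1; DELETE FROM analytics.fact_orders")
```

Enforcing this at the execution-policy level, rather than in the prompt, means even a misgenerated or injected statement cannot mutate data.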
Multi-Format Output #
The output_format declaration configures how results are delivered:
output_format ExcelReport {
type: "excel",
template: "./templates/quarterly_report.xlsx",
sheets: {
summary: { chart: "bar", title: "Revenue by Region" },
detail: { auto_filter: true, freeze_panes: "A2" },
metadata: { query_text: true, run_timestamp: true }
},
delivery: { channel: "email", recipients: ["vp_marketing@company.com"] }
}
output_format SlackSummary {
type: "slack",
channel: "#analytics-results",
format: {
max_rows: 20,
include_chart: true,
chart_type: "bar",
thread_reply: true
}
}
output_format PDFDashboard {
type: "pdf",
template: "./templates/executive_dashboard.html",
page_size: "A4",
orientation: "landscape",
charts: ["bar", "line", "pie"],
delivery: { channel: "email", recipients: ["cfo@company.com"] }
}
output_format JSONApi {
type: "json",
schema: {
include_metadata: true,
include_column_types: true,
date_format: "ISO-8601",
null_handling: "omit"
},
delivery: { channel: "webhook", url: env("ANALYTICS_API_URL") }
}
| Format | Use Case | Key Features |
|---|---|---|
| Excel | Business stakeholders, finance teams | Multi-sheet workbooks, charts, auto-filters, freeze panes |
| PDF | Executive reports, board decks | Template-based, charts, professional formatting |
| HTML | Interactive dashboards, email embeds | Responsive tables, inline charts, sortable columns |
| CSV | Data exchange, downstream processing | Simple, universal, large datasets |
| JSON | API integration, programmatic access | Typed, metadata-rich, webhook delivery |
| Slack | Quick answers, team collaboration | Inline tables, charts, thread replies |
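The JSONApi settings above (ISO-8601 dates, null_handling: omit, type metadata) can be sketched as a small serializer. The function and field names are illustrative assumptions:

```python
# Hypothetical sketch of the JSONApi output format: ISO-8601 dates,
# null_handling "omit", and column-type metadata. Names are illustrative.
import json
from datetime import date

def to_json_payload(rows, column_types):
    """Serialize result rows, dropping nulls and formatting dates."""
    cleaned = [
        {col: (val.isoformat() if isinstance(val, date) else val)
         for col, val in row.items()
         if val is not None}          # null_handling: "omit"
        for row in rows
    ]
    return json.dumps({"metadata": {"column_types": column_types},
                       "rows": cleaned})

payload = to_json_payload(
    [{"product": "Widget Pro", "sold_on": date(2025, 10, 1), "note": None}],
    {"product": "STRING", "sold_on": "DATE", "note": "STRING"},
)
print(payload)  # "sold_on" is "2025-10-01"; "note" is omitted
```

Omitting nulls rather than emitting them keeps webhook payloads small, at the cost of forcing consumers to treat absent keys as null.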
Automated Insight Discovery #
The Analyst Agent does not just answer questions — it discovers insights that the user did not ask about:
analyst agent DataAnalyst {
provider: "anthropic",
model: "claude-sonnet-4-6",
system: "You are a data analyst. Generate optimized SQL, discover insights, and explain findings in business terms.",
temperature: 0.3,
connections: [SnowflakeDW, BigQueryAnalytics, PostgresApp],
domain_context: RetailAnalytics,
optimizer: SnowflakeOptimizer,
execution_policy: AnalystExecution,
output_formats: [ExcelReport, SlackSummary, PDFDashboard, JSONApi],
insights: {
trends: {
enabled: true,
lookback: "12m",
granularity: "weekly",
significance_threshold: 0.05
},
outliers: {
enabled: true,
method: "iqr",
threshold: 1.5
},
correlations: {
enabled: true,
min_correlation: 0.7,
max_features: 20
},
drill_down: {
enabled: true,
auto_suggest: true,
max_depth: 3
},
forecasts: {
enabled: true,
methods: ["linear_regression", "moving_average", "exponential_smoothing"],
horizon: "3m",
confidence_interval: 0.95
}
},
query_library: RetailQueryLibrary,
budget: AnalystBudget,
agent_md: "./agents/data_analyst.md"
}
When Kim asks "top 10 products by revenue last quarter by region," the Analyst Agent returns the answer and also discovers:
| Type | Finding |
|---|---|
| Trend | Product "Widget Pro" revenue grew 34% QoQ in the West region — the fastest growth of any product-region combination. |
| Outlier | Product "Cable Basic" revenue dropped 87% in the Northeast — from $420K to $54K. This exceeds the 1.5x IQR threshold. |
| Correlation | Regions with higher marketing spend (from DIM_MARKETING_BUDGET) show 0.78 correlation with revenue growth (p < 0.01). |
| Drill-Down Suggestion | "Cable Basic" drop in Northeast appears driven by a single large customer (CUST-4892) who stopped ordering in Week 3. Suggest drill-down by customer. |
| Forecast | At current trajectory, West region will exceed East region in total revenue by Q3 2026 (95% CI: Q2-Q4 2026). |
The most valuable insights are often the ones nobody asked for. An analyst answering "top 10 products by revenue" would typically deliver exactly that table and move on. The Analyst Agent's automated insight discovery surfaces the outlier, the correlation, and the drill-down suggestion — turning a data retrieval task into a data analysis task.
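The outlier finding in the table above uses the configured "iqr" method with a 1.5x threshold: values outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR] are flagged. A minimal sketch (the function name and sample figures are illustrative; they echo the Cable Basic example):

```python
# Hypothetical sketch of insights.outliers: method "iqr", threshold 1.5.
# Values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged as outliers.
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the k*IQR fences."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles (exclusive method)
    spread = q3 - q1
    lo, hi = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lo or v > hi]

# Regional revenue ($K): Cable Basic's Northeast collapse stands out.
print(iqr_outliers([420, 405, 398, 412, 430, 54]))
# → [54]
```

Running this per product-region slice is what lets the agent flag the 87% drop without being asked about outliers at all.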
Organizational Query Learning #
The query_library declaration captures organizational knowledge:
query_library RetailQueryLibrary {
storage: "snowflake://ANALYTICS/QUERY_LIBRARY",
learning: {
enabled: true,
capture_approved_queries: true,
capture_user_corrections: true,
similarity_threshold: 0.85
},
templates: {
auto_generate: true,
approval_required: true,
categories: ["sales", "marketing", "finance", "operations"]
}
}
Every time the Analyst Agent generates a query, it checks the library for similar past queries. If a user corrects a generated query ("No, revenue should exclude returns"), the correction is captured and applied to future queries with similar patterns. Over time, the agent learns the organization's specific definitions, preferences, and edge cases.
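The similarity_threshold lookup can be sketched with token-set overlap. A production system would likely use embeddings; plain Jaccard similarity keeps the sketch self-contained, and the function names are illustrative:

```python
# Hypothetical sketch of the query-library similarity check: Jaccard
# overlap of normalized question tokens against past entries. A real
# implementation would use embeddings; names here are illustrative.

def jaccard(a: str, b: str) -> float:
    """Token-set similarity in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def find_similar(question, library, threshold=0.85):
    """Return the closest past query above the threshold, else None."""
    best = max(library, key=lambda q: jaccard(question, q), default=None)
    if best is not None and jaccard(question, best) >= threshold:
        return best
    return None

library = ["top 10 products by revenue last quarter"]
print(find_similar("top 10 products by revenue last quarter", library))
print(find_similar("weekly churn by segment", library))  # no match: None
```

On a hit, the agent reuses the corrected, approved SQL from the library rather than regenerating from scratch, which is how a one-time correction ("revenue should exclude returns") propagates to every similar future question.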
Neam Code: Complete Analyst Agent #
// ═══════════════════════════════════════════════════════════════
// Analyst Agent — Digital Data Analyst for Retail Analytics
// ═══════════════════════════════════════════════════════════════
budget AnalystBudget { cost: 25.00, tokens: 250000, time: 3600000 }
// ─── SQL Connections ───
sql_connection SnowflakeDW {
platform: "snowflake",
connection: env("SNOWFLAKE_URL"),
credentials: env("SNOWFLAKE_CREDS"),
warehouse: "ANALYTICS_WH",
database: "PROD_DB",
schema: "SALES_MART",
timeout: 600,
max_rows: 100000,
cost_limit: 5.00,
prefer_materialized_views: true,
use_result_cache: true,
partition_pruning: true,
schema_source: Snowflake_Analytics,
semantic_layer: Sales_Semantic
}
sql_connection BigQueryAnalytics {
platform: "bigquery",
connection: env("BIGQUERY_CREDS"),
project: "retail-prod",
dataset: "analytics",
cost_limit: 10.00,
max_rows: 500000
}
sql_connection PostgresApp {
platform: "postgres",
connection: env("PG_URL"),
credentials: env("PG_CREDS"),
database: "retail_app",
schema: "public",
max_rows: 5000,
timeout: 30
}
// ─── Domain Context ───
domain_context RetailAnalytics {
models: [RetailERModel],
dimensional_models: [SalesStarSchema],
semantic_layers: [SalesSemantic],
marts: [SalesMart, MarketingMart],
glossary: BusinessGlossary,
access_policy: AnalystAccess,
classification: DataSensitivity,
schemas: [OrderSchema, ProductSchema, CustomerSchema]
}
// ─── The Analyst Agent ───
analyst agent RetailAnalyst {
provider: "anthropic",
model: "claude-sonnet-4-6",
system: "You are a retail data analyst. Answer business questions with optimized SQL across Snowflake, BigQuery, and PostgreSQL.",
temperature: 0.3,
connections: [SnowflakeDW, BigQueryAnalytics, PostgresApp],
domain_context: RetailAnalytics,
optimizer: SnowflakeOptimizer,
execution_policy: AnalystExecution,
output_formats: [ExcelReport, SlackSummary, PDFDashboard, JSONApi],
insights: {
trends: { enabled: true, lookback: "12m" },
outliers: { enabled: true, method: "iqr" },
correlations: { enabled: true, min_correlation: 0.7 },
drill_down: { enabled: true, auto_suggest: true },
forecasts: { enabled: true, horizon: "3m" }
},
query_library: RetailQueryLibrary,
budget: AnalystBudget,
agent_md: "./agents/data_analyst.md"
}
// ─── Usage ───
let answer = RetailAnalyst.ask(
"Top 10 products by revenue last quarter by region, YoY comparison",
output: ExcelReport
)
let scheduled = RetailAnalyst.schedule(
"Weekly revenue summary by channel",
schedule: "0 8 * * MON",
output: SlackSummary
)
Industry Perspective #
The NL-to-SQL Landscape #
The NL-to-SQL problem has been studied for decades, from early systems like LUNAR (1973) to modern LLM-based approaches. The Analyst Agent differentiates from generic NL-to-SQL tools in three ways:
| Capability | Generic NL-to-SQL | Analyst Agent |
|---|---|---|
| Schema understanding | Introspect at query time | Pre-loaded via domain_context |
| Business definitions | None (guesses from column names) | Business glossary from GovernanceAgent |
| Governance | None (trusts the user) | Access policy enforcement, column masking |
| Multi-platform | Usually single-platform | 9+ dialects with platform-specific optimization |
| Learning | Stateless | Query library captures organizational knowledge |
The 80/20 of Analyst Time #
Industry surveys consistently show that analysts spend 60-80% of their time on data discovery, preparation, and context acquisition — not analysis. The Analyst Agent inverts this ratio by pre-loading context from the v0.9.x ecosystem, allowing analysts to spend 80% of their time on interpretation, communication, and decision support.
The Evidence #
DataSims experiments (DataSims repository) demonstrate the Analyst Agent's impact:
| Metric | Without Analyst Agent | With Analyst Agent | Improvement |
|---|---|---|---|
| Time to First Answer | 47 min (manual SQL) | 28 seconds (NL-to-SQL) | 99.0% |
| Queries Per Analyst Per Day | 8-12 | 40-60 | 4.5x |
| Incorrect Join Errors | 2.3 per week | 0.1 per week (domain context) | 95.7% |
| Cross-Platform Query Effort | 2.5 hours (manual translation) | 0 (automatic dialect) | 100% |
| Insight Discovery Rate | 0.3 insights per query (human) | 2.1 insights per query (auto) | 7x |
Ablation A6 (Agent.MD removed across all agents, including the Analyst Agent) showed a 7.7% decrease in churn model AUC. While the Analyst Agent is not directly involved in model building, its domain context feeds the DataScientist Agent's feature engineering. When the Analyst Agent loses its Agent.MD — which encodes organizational-specific business definitions — the domain context becomes generic, and downstream feature engineering suffers.
Clone DataSims, start the SimShop environment, and use the Analyst Agent to answer: "Which customer segments have the highest churn rate, and what is the average order value for each segment?" The agent will generate Snowflake SQL against the SimShop dimensional model, execute it, and discover that the "High Value" segment has a paradoxically high churn rate — the seed insight for the churn prediction case study in Chapter 26.
Key Takeaways #
- The Analyst Agent translates natural language to platform-native SQL across 9+ dialects — not lowest-common-denominator ANSI SQL
- Domain transfer learning from the v0.9.x ecosystem (schemas, semantic layers, glossary, ER models, access policies) eliminates the context acquisition bottleneck that consumes 60-80% of analyst time
- Platform-specific optimization includes materialized view rewriting, partition pruning, result caching, and cost estimation — preventing both slow queries and expensive queries
- Governed execution enforces access policies, column masking, cost limits, and read-only mode — the Analyst Agent cannot modify data, only read it
- Multi-format output (Excel, PDF, HTML, CSV, JSON, Slack) delivers results where stakeholders actually consume them
- Automated insight discovery surfaces trends, outliers, correlations, drill-down suggestions, and forecasts that the user did not explicitly ask for
- Organizational query learning captures corrections and approved queries, improving accuracy over time
For Further Exploration #
- Neam Language Reference: Analyst Agent
- DataSims: Simulated Enterprise Environment — SimShop dimensional model for NL-to-SQL testing
- Designing Data-Intensive Applications (Kleppmann, 2017) — Chapter 2: Data Models and Query Languages
- SQL Antipatterns (Karwin, 2010) — common SQL mistakes that NL-to-SQL systems must avoid
- Spider Benchmark (Yale, 2018) — academic NL-to-SQL evaluation methodology