Chapter 15: The Analyst Agent — NL-to-SQL Across 9 Dialects #

"The goal is not to build a better SQL generator. The goal is to let anyone in the organization ask a question and get a trustworthy answer in 30 seconds." — Benn Stancil, co-founder of Mode Analytics


25 min read | Kim, Raj | Part IV: Platform Intelligence Agents

What you'll learn:


The Problem #

Kim is a data analyst at a retail company. She receives a Slack message from the VP of Marketing at 9:14 AM: "What were our top 10 products by revenue last quarter, broken down by region? And how does that compare to the same quarter last year?"

Kim knows the answer is in Snowflake. She opens her SQL editor. But which tables? The data warehouse has 412 tables across 8 schemas. She checks the wiki — last updated seven months ago. She finds a SALES_MART.FACT_ORDERS table, joins it with SALES_MART.DIM_PRODUCT and SALES_MART.DIM_REGION, filters by date, groups by product and region, calculates revenue, adds a year-over-year comparison using a window function, formats it, and sends it back at 10:47 AM — 93 minutes later.

The VP follows up: "Can you break that down by channel too? And add the margin percentage?"

Kim goes back to Snowflake. She needs DIM_CHANNEL, which she discovers is actually called DIM_SALES_CHANNEL after 10 minutes of searching. Margin requires cost data from FACT_ORDERS.unit_cost, which she hopes includes shipping costs but is not sure. She sends the second answer at 11:38 AM.

This cycle repeats 8-12 times per day across the analytics team. Each question takes 30-90 minutes. The delay is not SQL difficulty — Kim is an expert SQL writer. The delay is context acquisition: finding the right tables, understanding the schema, verifying business definitions, and translating ambiguous English into precise SQL.

The Analyst Agent eliminates the context acquisition bottleneck. It already knows the schema (from the Modeling Agent), the business glossary (from the Governance Agent), the semantic layer (from the ETL Agent), and the access policies (from the Governance Agent). When Kim types "top 10 products by revenue last quarter by region, YoY comparison," the agent generates optimized Snowflake SQL in seconds — not because it is a better SQL writer than Kim, but because it already has the context that Kim spends 80% of her time acquiring.


The 9+ Supported SQL Dialects #

SQL is not one language. It is a family of dialects with significant syntactic and semantic differences. The Analyst Agent generates native, optimized SQL for each platform — not lowest-common-denominator ANSI SQL.

CODE
 SQL Dialect Differences — Same Question, Different SQL
 ═══════════════════════════════════════════════════════════

 Question: "Top 10 products by revenue, last 30 days"

 Snowflake:
   SELECT product_name, SUM(revenue) AS total_rev
   FROM analytics.fact_orders
   WHERE order_date >= DATEADD(day, -30, CURRENT_DATE())
   GROUP BY 1 ORDER BY 2 DESC LIMIT 10;

 BigQuery:
   SELECT product_name, SUM(revenue) AS total_rev
   FROM `project.analytics.fact_orders`
   WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
   GROUP BY 1 ORDER BY 2 DESC LIMIT 10;

 Redshift:
   SELECT product_name, SUM(revenue) AS total_rev
   FROM analytics.fact_orders
   WHERE order_date >= GETDATE() - INTERVAL '30 days'
   GROUP BY 1 ORDER BY 2 DESC LIMIT 10;

 Oracle:
   SELECT product_name, SUM(revenue) AS total_rev
   FROM analytics.fact_orders
   WHERE order_date >= SYSDATE - 30
   GROUP BY product_name
   ORDER BY total_rev DESC
   FETCH FIRST 10 ROWS ONLY;

 SQL Server:
   SELECT TOP 10 product_name, SUM(revenue) AS total_rev
   FROM analytics.fact_orders
   WHERE order_date >= DATEADD(day, -30, GETDATE())
   GROUP BY product_name
   ORDER BY total_rev DESC;
| Platform | `platform:` value | Key Dialect Differences |
| --- | --- | --- |
| Snowflake | `"snowflake"` | DATEADD, FLATTEN, QUALIFY, semi-structured VARIANT |
| BigQuery | `"bigquery"` | Backtick quoting, DATE_SUB, UNNEST, STRUCT, slot-based pricing |
| Redshift | `"redshift"` | GETDATE(), DISTKEY/SORTKEY hints, WLM queues |
| Databricks | `"databricks"` | Spark SQL, DELTA table hints, Unity Catalog references |
| PostgreSQL | `"postgres"` | INTERVAL literals, LATERAL JOIN, CTE materialization hints |
| MySQL | `"mysql"` | Backtick quoting, DATE_SUB, STRAIGHT_JOIN |
| Oracle | `"oracle"` | SYSDATE, FETCH FIRST, CONNECT BY, optimizer hints |
| SQL Server | `"sqlserver"` | TOP, DATEADD, CROSS APPLY, query store hints |
| DuckDB | `"duckdb"` | Postgres-compatible, PIVOT/UNPIVOT, local analytical engine |
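These differences are mechanical enough to sketch in code. The following Python fragment (a hypothetical helper, not the agent's actual code generator) renders the "last 30 days" predicate for five of the dialects above, using the same expressions shown in the SQL examples:

```python
# Minimal sketch: per-dialect date predicates for "last 30 days".
# The templates mirror the SQL examples above; this is illustrative only.
DATE_30_DAYS_AGO = {
    "snowflake": "DATEADD(day, -30, CURRENT_DATE())",
    "bigquery":  "DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)",
    "redshift":  "GETDATE() - INTERVAL '30 days'",
    "oracle":    "SYSDATE - 30",
    "sqlserver": "DATEADD(day, -30, GETDATE())",
}

def last_30_days_predicate(platform: str, column: str = "order_date") -> str:
    """Render the WHERE predicate for the given SQL dialect."""
    try:
        expr = DATE_30_DAYS_AGO[platform]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform}")
    return f"{column} >= {expr}"
```

A real generator works at the AST level rather than with string templates, but the point stands: the dialect-specific part of the query is a small, well-defined surface.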

Domain Transfer Learning from the v0.9.x Ecosystem #

The Analyst Agent's key advantage over generic NL-to-SQL tools is domain context — deep knowledge of the data estate that it absorbs from the entire v0.9.x agent ecosystem.

DIAGRAM Domain Context: What the Analyst Agent Already Knows
flowchart TB
    DA["DataAgent (v0.9.0)"] -->|"Source schemas, column types, quality rules"| RESULT["Analyst Agent understands the schema BEFORE the first question is asked"]
    ETL["ETLAgent (v0.9.1)"] -->|"Semantic layer: business metrics, calculated fields\nMart definitions: which tables serve which use cases"| RESULT
    GOV["GovernanceAgent (v0.9.4)"] -->|"Business glossary: 'revenue' = unit_price * quantity\nClassification: which columns contain PII\nAccess policies: which roles can see which columns"| RESULT
    MOD["ModelingAgent (v0.9.5)"] -->|"ER models: table relationships, foreign keys\nDimensional models: facts, dimensions, grain\nNormalization status: which tables are trustworthy"| RESULT

The domain_context declaration aggregates these references:

NEAM
domain_context RetailAnalytics {
    // v0.9.5 — Schema structure understanding
    models: [RetailERModel],
    dimensional_models: [SalesStarSchema],

    // v0.9.1 — Business metric definitions
    semantic_layers: [SalesSemantic],
    marts: [SalesMart, MarketingMart],

    // v0.9.4 — Business vocabulary and access rules
    glossary: BusinessGlossary,
    access_policy: AnalystAccess,
    classification: DataSensitivity,

    // v0.9.0 — Source metadata
    schemas: [OrderSchema, ProductSchema, CustomerSchema]
}
Insight

Generic NL-to-SQL tools (standalone products without data platform integration) achieve 60-70% accuracy on novel questions because they lack business context. They do not know that "revenue" in your organization means unit_price * quantity - discount_amount rather than total_amount. The domain_context declaration gives the Analyst Agent this context explicitly, lifting accuracy to 92%+ on domain-specific questions.
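A glossary lookup of this kind can be sketched in a few lines. The `GLOSSARY` mapping and `expand_metric` helper below are hypothetical; the margin definition in particular is illustrative, not your organization's:

```python
# Sketch of glossary-driven metric expansion (hypothetical structure;
# in the real system, BusinessGlossary comes from the GovernanceAgent).
GLOSSARY = {
    # Matches the revenue definition discussed in the text.
    "revenue": "unit_price * quantity - discount_amount",
    # Illustrative only -- your glossary defines this.
    "margin":  "(unit_price - unit_cost) * quantity - discount_amount",
}

def expand_metric(term: str) -> str:
    """Resolve a business term to its governed SQL expression."""
    if term not in GLOSSARY:
        raise KeyError(f"'{term}' is not defined in the business glossary")
    return GLOSSARY[term]
```

Because the expansion is explicit, "revenue" always means the same thing in every generated query, which is precisely what a generic NL-to-SQL tool cannot guarantee.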


Platform-Specific Query Optimization #

The query_optimizer declaration configures platform-aware cost optimization:

NEAM
query_optimizer SnowflakeOptimizer {
    platform: "snowflake",
    strategies: {
        materialized_view_rewriting: {
            enabled: true,
            prefer_over: "base_tables",
            stale_threshold: "4h"
        },
        partition_pruning: {
            enabled: true,
            warn_full_scan: true,
            max_partitions_scanned: 365
        },
        clustering_alignment: {
            enabled: true,
            suggest_clustering_keys: true
        },
        result_caching: {
            enabled: true,
            ttl: "24h"
        },
        warehouse_sizing: {
            auto_select: true,
            prefer_smaller: true,
            cost_estimate: true
        }
    },
    cost_estimation: {
        enabled: true,
        warn_above: 1.00,
        block_above: 10.00,
        display_to_user: true
    }
}

The optimizer transforms the generated SQL before execution:

DIAGRAM Optimization Pipeline
flowchart TB
    A["NL Question"] --> B["Raw SQL (semantically correct)"]
    B --> C["Optimizer Pass 1:\nMaterialized View Rewriting\n\nfact_orders WHERE date > X\n→ mv_monthly_sales (pre-agg)\n(10B rows → 120K rows)"]
    C --> D["Optimizer Pass 2:\nPartition Pruning\n\nWHERE order_date >= '2025-10-01'\n→ Prune 274/365 daily parts\n(scan 91 partitions, not 365)"]
    D --> E["Optimizer Pass 3:\nCost Estimation\n\nEstimated: 0.03 credits ($0.12)\nStatus: APPROVED (< $1.00 warn)"]
    E --> F["Optimized SQL (ready for execution)"]
Anti-Pattern

Do not disable cost estimation to "move faster." A single unoptimized BigQuery query scanning a multi-petabyte table can cost thousands of dollars. The Analyst Agent's cost estimation is not a speed bump — it is a financial guardrail. One customer reported a $47,000 BigQuery bill from a single analyst's ad-hoc query that scanned 3 years of event data without a date filter.
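The guardrail logic itself is simple arithmetic. A sketch, assuming BigQuery-style on-demand pricing (the $6.25/TiB rate is an assumption; check your contract) and the `warn_above`/`block_above` thresholds from the optimizer config:

```python
# Sketch: cost guardrail in the spirit of the cost_estimation block above.
# PRICE_PER_TIB is an assumed on-demand rate; thresholds mirror
# warn_above: 1.00 and block_above: 10.00.
PRICE_PER_TIB = 6.25

def cost_guardrail(bytes_scanned: int,
                   warn_above: float = 1.00,
                   block_above: float = 10.00) -> str:
    cost = bytes_scanned / 2**40 * PRICE_PER_TIB
    if cost > block_above:
        return f"BLOCKED (est. ${cost:.2f})"
    if cost > warn_above:
        return f"WARN (est. ${cost:.2f})"
    return f"APPROVED (est. ${cost:.2f})"
```

A query scanning a few gigabytes sails through; a full scan of a multi-terabyte table trips the block before a single byte is billed.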


Governed Execution #

The execution_policy declaration enforces governance during query execution:

NEAM
execution_policy AnalystExecution {
    connection_pool: {
        max_connections: 10,
        timeout: "30s",
        idle_timeout: "5m"
    },
    query_limits: {
        max_rows: 100000,
        max_runtime: "300s",
        max_cost: 5.00,
        max_concurrent: 3
    },
    governance: {
        enforce_access_policy: true,
        enforce_classification: true,
        mask_restricted_columns: true,
        audit_all_queries: true,
        require_purpose: true
    },
    safety: {
        block_ddl: true,
        block_dml: true,
        block_truncate: true,
        read_only: true
    }
}
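The `safety` block can be approximated by a pre-execution statement check. A minimal sketch (a production implementation would use a real SQL parser rather than a regex):

```python
import re

# Sketch of the block_ddl / block_dml / read_only safety check.
# Illustrative only: regexes miss edge cases a real parser would catch.
BLOCKED_KEYWORDS = re.compile(
    r"^\s*(INSERT|UPDATE|DELETE|MERGE|CREATE|ALTER|DROP|TRUNCATE|GRANT|REVOKE)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Reject any statement batch containing a write or DDL statement."""
    return not any(
        BLOCKED_KEYWORDS.match(stmt)
        for stmt in sql.split(";")
        if stmt.strip()
    )
```

Note that the check runs per statement, so appending `; TRUNCATE t` to an innocent SELECT is still rejected.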

Governed execution means:

- Every query is checked against the access policy before it runs (`enforce_access_policy`), and restricted columns are masked in results rather than exposed (`mask_restricted_columns`).
- Every query is audited, and the requester must state a purpose (`audit_all_queries`, `require_purpose`).
- DDL, DML, and TRUNCATE are blocked outright; the agent is strictly read-only (`block_ddl`, `block_dml`, `read_only`).
- Row counts, runtime, cost, and concurrency are capped by `query_limits`, so no single question can monopolize the warehouse.


Multi-Format Output #

The output_format declaration configures how results are delivered:

NEAM
output_format ExcelReport {
    type: "excel",
    template: "./templates/quarterly_report.xlsx",
    sheets: {
        summary: { chart: "bar", title: "Revenue by Region" },
        detail: { auto_filter: true, freeze_panes: "A2" },
        metadata: { query_text: true, run_timestamp: true }
    },
    delivery: { channel: "email", recipients: ["vp_marketing@company.com"] }
}

output_format SlackSummary {
    type: "slack",
    channel: "#analytics-results",
    format: {
        max_rows: 20,
        include_chart: true,
        chart_type: "bar",
        thread_reply: true
    }
}

output_format PDFDashboard {
    type: "pdf",
    template: "./templates/executive_dashboard.html",
    page_size: "A4",
    orientation: "landscape",
    charts: ["bar", "line", "pie"],
    delivery: { channel: "email", recipients: ["cfo@company.com"] }
}

output_format JSONApi {
    type: "json",
    schema: {
        include_metadata: true,
        include_column_types: true,
        date_format: "ISO-8601",
        null_handling: "omit"
    },
    delivery: { channel: "webhook", url: env("ANALYTICS_API_URL") }
}
| Format | Use Case | Key Features |
| --- | --- | --- |
| Excel | Business stakeholders, finance teams | Multi-sheet workbooks, charts, auto-filters, freeze panes |
| PDF | Executive reports, board decks | Template-based, charts, professional formatting |
| HTML | Interactive dashboards, email embeds | Responsive tables, inline charts, sortable columns |
| CSV | Data exchange, downstream processing | Simple, universal, large datasets |
| JSON | API integration, programmatic access | Typed, metadata-rich, webhook delivery |
| Slack | Quick answers, team collaboration | Inline tables, charts, thread replies |
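Two of the JSONApi options, ISO-8601 dates and `null_handling: "omit"`, are easy to illustrate. A hypothetical serializer sketch (not the agent's actual output layer):

```python
import json
from datetime import date

# Sketch of the JSONApi options: ISO-8601 dates, null_handling "omit".
def to_json_row(row: dict) -> str:
    cleaned = {
        k: (v.isoformat() if isinstance(v, date) else v)
        for k, v in row.items()
        if v is not None  # null_handling: "omit"
    }
    return json.dumps(cleaned, sort_keys=True)
```

Omitting nulls keeps webhook payloads compact, and ISO-8601 dates avoid the locale ambiguity that plagues downstream consumers.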

Automated Insight Discovery #

The Analyst Agent does not just answer questions — it discovers insights that the user did not ask about:

NEAM
analyst agent DataAnalyst {
    provider: "anthropic",
    model: "claude-sonnet-4-6",
    system: "You are a data analyst. Generate optimized SQL, discover insights, and explain findings in business terms.",
    temperature: 0.3,

    connections: [SnowflakeDW, BigQueryAnalytics, PostgresApp],
    domain_context: RetailAnalytics,
    optimizer: SnowflakeOptimizer,
    execution_policy: AnalystExecution,

    output_formats: [ExcelReport, SlackSummary, PDFDashboard, JSONApi],

    insights: {
        trends: {
            enabled: true,
            lookback: "12m",
            granularity: "weekly",
            significance_threshold: 0.05
        },
        outliers: {
            enabled: true,
            method: "iqr",
            threshold: 1.5
        },
        correlations: {
            enabled: true,
            min_correlation: 0.7,
            max_features: 20
        },
        drill_down: {
            enabled: true,
            auto_suggest: true,
            max_depth: 3
        },
        forecasts: {
            enabled: true,
            methods: ["linear_regression", "moving_average", "exponential_smoothing"],
            horizon: "3m",
            confidence_interval: 0.95
        }
    },

    query_library: RetailQueryLibrary,

    budget: AnalystBudget,
    agent_md: "./agents/data_analyst.md"
}

When Kim asks "top 10 products by revenue last quarter by region," the Analyst Agent returns the answer and also discovers:

Automated Insights

| Type | Finding |
| --- | --- |
| Trend | Product "Widget Pro" revenue grew 34% QoQ in the West region — the fastest growth of any product-region combination. |
| Outlier | Product "Cable Basic" revenue dropped 87% in the Northeast — from $420K to $54K. This exceeds the 1.5x IQR threshold. |
| Correlation | Regions with higher marketing spend (from DIM_MARKETING_BUDGET) show 0.78 correlation with revenue growth (p < 0.01). |
| Drill-Down Suggestion | "Cable Basic" drop in Northeast appears driven by a single large customer (CUST-4892) who stopped ordering in Week 3. Suggest drill-down by customer. |
| Forecast | At current trajectory, West region will exceed East region in total revenue by Q3 2026 (95% CI: Q2-Q4 2026). |
Insight

The most valuable insights are often the ones nobody asked for. An analyst answering "top 10 products by revenue" would typically deliver exactly that table and move on. The Analyst Agent's automated insight discovery surfaces the outlier, the correlation, and the drill-down suggestion — turning a data retrieval task into a data analysis task.
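The `outliers` block above (`method: "iqr"`, `threshold: 1.5`) corresponds to the standard interquartile-range rule. A self-contained sketch of the detection step:

```python
# Sketch of the outliers.method = "iqr" check with threshold 1.5
# (illustrative; real detection runs over query result columns).
def iqr_outliers(values, threshold=1.5):
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # Linear interpolation between closest ranks.
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - threshold * iqr, q3 + threshold * iqr
    return [v for v in values if v < low or v > high]
```

Applied to per-region revenue, a collapse like Cable Basic's $420K-to-$54K drop falls far outside the 1.5x IQR fence and gets flagged automatically.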


Organizational Query Learning #

The query_library declaration captures organizational knowledge:

NEAM
query_library RetailQueryLibrary {
    storage: "snowflake://ANALYTICS/QUERY_LIBRARY",
    learning: {
        enabled: true,
        capture_approved_queries: true,
        capture_user_corrections: true,
        similarity_threshold: 0.85
    },
    templates: {
        auto_generate: true,
        approval_required: true,
        categories: ["sales", "marketing", "finance", "operations"]
    }
}

Every time the Analyst Agent generates a query, it checks the library for similar past queries. If a user corrects a generated query ("No, revenue should exclude returns"), the correction is captured and applied to future queries with similar patterns. Over time, the agent learns the organization's specific definitions, preferences, and edge cases.
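The similarity check can be sketched with a plain string-similarity ratio. The real library would likely use embeddings; `difflib` here is illustrative, with the threshold mirroring `similarity_threshold: 0.85`:

```python
from difflib import SequenceMatcher

# Sketch of similarity lookup against the query library (hypothetical).
def find_similar(question: str, library: dict, threshold: float = 0.85):
    """Return the stored SQL for the most similar past question, if any."""
    best_sql, best_score = None, 0.0
    for past_question, sql in library.items():
        score = SequenceMatcher(None, question.lower(), past_question.lower()).ratio()
        if score > best_score:
            best_sql, best_score = sql, score
    return best_sql if best_score >= threshold else None
```

A near-duplicate question reuses the approved, corrected SQL; a genuinely novel question falls below the threshold and goes through full generation.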


Neam Code: Complete Analyst Agent #

NEAM
// ═══════════════════════════════════════════════════════════════
// Analyst Agent — Digital Data Analyst for Retail Analytics
// ═══════════════════════════════════════════════════════════════

budget AnalystBudget { cost: 25.00, tokens: 250000, time: 3600000 }

// ─── SQL Connections ───
sql_connection SnowflakeDW {
    platform: "snowflake",
    connection: env("SNOWFLAKE_URL"),
    credentials: env("SNOWFLAKE_CREDS"),
    warehouse: "ANALYTICS_WH",
    database: "PROD_DB",
    schema: "SALES_MART",
    timeout: 600,
    max_rows: 100000,
    cost_limit: 5.00,
    prefer_materialized_views: true,
    use_result_cache: true,
    partition_pruning: true,
    schema_source: Snowflake_Analytics,
    semantic_layer: Sales_Semantic
}

sql_connection BigQueryAnalytics {
    platform: "bigquery",
    connection: env("BIGQUERY_CREDS"),
    project: "retail-prod",
    dataset: "analytics",
    cost_limit: 10.00,
    max_rows: 500000
}

sql_connection PostgresApp {
    platform: "postgres",
    connection: env("PG_URL"),
    credentials: env("PG_CREDS"),
    database: "retail_app",
    schema: "public",
    max_rows: 5000,
    timeout: 30
}

// ─── Domain Context ───
domain_context RetailAnalytics {
    models: [RetailERModel],
    dimensional_models: [SalesStarSchema],
    semantic_layers: [SalesSemantic],
    marts: [SalesMart, MarketingMart],
    glossary: BusinessGlossary,
    access_policy: AnalystAccess,
    classification: DataSensitivity,
    schemas: [OrderSchema, ProductSchema, CustomerSchema]
}

// ─── The Analyst Agent ───
analyst agent RetailAnalyst {
    provider: "anthropic",
    model: "claude-sonnet-4-6",
    system: "You are a retail data analyst. Answer business questions with optimized SQL across Snowflake, BigQuery, and PostgreSQL.",
    temperature: 0.3,

    connections: [SnowflakeDW, BigQueryAnalytics, PostgresApp],
    domain_context: RetailAnalytics,
    optimizer: SnowflakeOptimizer,
    execution_policy: AnalystExecution,
    output_formats: [ExcelReport, SlackSummary, PDFDashboard, JSONApi],

    insights: {
        trends: { enabled: true, lookback: "12m" },
        outliers: { enabled: true, method: "iqr" },
        correlations: { enabled: true, min_correlation: 0.7 },
        drill_down: { enabled: true, auto_suggest: true },
        forecasts: { enabled: true, horizon: "3m" }
    },

    query_library: RetailQueryLibrary,
    budget: AnalystBudget,
    agent_md: "./agents/data_analyst.md"
}

// ─── Usage ───
let answer = RetailAnalyst.ask(
    "Top 10 products by revenue last quarter by region, YoY comparison",
    output: ExcelReport
)

let scheduled = RetailAnalyst.schedule(
    "Weekly revenue summary by channel",
    schedule: "0 8 * * MON",
    output: SlackSummary
)

Industry Perspective #

The NL-to-SQL Landscape #

The NL-to-SQL problem has been studied for decades, from early-1970s systems like LUNAR to modern LLM-based approaches. The Analyst Agent differentiates itself from generic NL-to-SQL tools in five ways:

| Capability | Generic NL-to-SQL | Analyst Agent |
| --- | --- | --- |
| Schema understanding | Introspect at query time | Pre-loaded via domain_context |
| Business definitions | None (guesses from column names) | Business glossary from GovernanceAgent |
| Governance | None (trusts the user) | Access policy enforcement, column masking |
| Multi-platform | Usually single-platform | 9+ dialects with platform-specific optimization |
| Learning | Stateless | Query library captures organizational knowledge |

The 80/20 of Analyst Time #

Industry surveys consistently show that analysts spend 60-80% of their time on data discovery, preparation, and context acquisition — not analysis. The Analyst Agent inverts this ratio by pre-loading context from the v0.9.x ecosystem, allowing analysts to spend 80% of their time on interpretation, communication, and decision support.


The Evidence #

DataSims experiments (DataSims repository) demonstrate the Analyst Agent's impact:

| Metric | Without Analyst Agent | With Analyst Agent | Improvement |
| --- | --- | --- | --- |
| Time to First Answer | 47 min (manual SQL) | 28 seconds (NL-to-SQL) | 99.0% |
| Queries Per Analyst Per Day | 8-12 | 40-60 | 4.5x |
| Incorrect Join Errors | 2.3 per week | 0.1 per week (domain context) | 95.7% |
| Cross-Platform Query Effort | 2.5 hours (manual translation) | 0 (automatic dialect) | 100% |
| Insight Discovery Rate | 0.3 insights per query (human) | 2.1 insights per query (auto) | 7x |

Ablation A6 (Agent.MD removed across all agents, including the Analyst Agent) showed a 7.7% decrease in churn model AUC. While the Analyst Agent is not directly involved in model building, its domain context feeds the DataScientist Agent's feature engineering. When the Analyst Agent loses its Agent.MD — which encodes organizational-specific business definitions — the domain context becomes generic, and downstream feature engineering suffers.

Try It

Clone DataSims, start the SimShop environment, and use the Analyst Agent to answer: "Which customer segments have the highest churn rate, and what is the average order value for each segment?" The agent will generate Snowflake SQL against the SimShop dimensional model, execute it, and discover that the "High Value" segment has a paradoxically high churn rate — the seed insight for the churn prediction case study in Chapter 26.


Key Takeaways #

For Further Exploration #