Chapter 13: The Governance Agent — Compliance by Design #
"You can't govern what you can't see. And you can't see what you haven't classified." — Peter Aiken, founding president, International Data Management Association
25 min read | David, Raj | Part IV: Platform Intelligence Agents
What you'll learn:
- How to encode data governance as code — classification, access control, lineage, compliance, and quality scoring as compilable declarations
- How auto-classification uses LLM-assisted semantic analysis to identify PII, PHI, and financial data
- How RBAC/ABAC access policies enforce column-level masking without modifying upstream pipelines
- How six-dimension quality scoring provides a single health metric for every data asset
- How compliance policies map directly to GDPR, CCPA, and DORA regulatory requirements
- How external tool connectors orchestrate Collibra, Atlas, Alation, Purview, Informatica CDG, Atlan, and Immuta
The Problem #
David is the VP of Data at a mid-sized fintech company. His team has 340 tables across Snowflake, PostgreSQL, and S3. Somewhere in those tables are Social Security numbers, credit card numbers, email addresses, and health indicators from a wellness program integration. He knows this because a security audit found unmasked PII in a development Snowflake clone six months ago. He does not know exactly which columns, in which tables, contain sensitive data today — because the schema changes weekly and nobody maintains the classification spreadsheet.
His compliance officer needs to demonstrate GDPR readiness for the next board meeting. His security team needs column-level access controls. His data engineers need to know which tables they can use for analytics without legal review. And his analysts need to know what cust_ssn_hash actually means in business terms — the column naming conventions were set by three different teams over four years.
David has Collibra. It has been partially configured. Twenty-two percent of tables are cataloged. Classification was attempted manually last quarter but fell behind within weeks as new tables appeared. Access policies exist in Snowflake's RBAC system but are not connected to the catalog. Lineage is drawn on a whiteboard in the data engineering room.
The Governance Agent does not replace Collibra. It orchestrates Collibra — and Atlas, and Alation, and every other governance tool in the stack — as an intelligence layer that classifies continuously, enforces policies as code, and closes the gap between "governance strategy" and "governance reality."
Data Governance as Code #
Traditional governance relies on documentation — policies written in Word documents, classification maintained in spreadsheets, access reviews conducted quarterly via email threads. The fundamental insight of governance-as-code is that policies should be compilable, testable, and executable — the same properties we demand of data pipelines.
flowchart TB
subgraph Traditional["Traditional Governance"]
direction TB
A["Word Doc:
PII must be masked"] -->|"manual"| B["Spreadsheet:
Table A has SSN in col 5"]
B -->|"stale"| C["Snowflake RBAC:
maybe configured"]
end
subgraph Neam["Governance as Code (Neam)"]
direction TB
D["classification_policy DataSensitivity {
levels: { RESTRICTED: {
controls: [encryption, masking] } }
auto_classify: { enabled: true } }"] -->|"compiled"| E["neamc → bytecode
VM enforces at every query"]
end
The gap between "policy written" and "policy enforced" is where breaches happen. When policies are code, enforcement is continuous and automated. When policies are documents, enforcement depends on someone reading the document and configuring the platform correctly — a process that degrades over time.
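What "compiled and enforced" means in practice can be sketched in a few lines. The policy table and `check_query` function below are illustrative assumptions, not Neam's actual VM — the point is that a policy expressed as data can be checked mechanically on every query:

```python
# Illustrative sketch: a compiled policy is just data the runtime checks
# on every query -- unlike a Word document, it cannot silently drift.
POLICY = {
    "RESTRICTED": {"required_controls": {"encryption_at_rest", "column_masking", "audit_logging"}},
    "INTERNAL": {"required_controls": {"access_logging"}},
}

def check_query(columns, active_controls):
    """Deny the query if any touched column's classification demands a
    control that is not active on the connection."""
    violations = []
    for col, classification in columns.items():
        missing = POLICY[classification]["required_controls"] - active_controls
        if missing:
            violations.append((col, sorted(missing)))
    return ("deny", violations) if violations else ("allow", [])

decision, why = check_query(
    {"cust_ssn_hash": "RESTRICTED", "region": "INTERNAL"},
    active_controls={"encryption_at_rest", "access_logging"},
)
# The RESTRICTED column lacks column_masking and audit_logging,
# so the query is denied with an explanation attached.
```

The explanation attached to each denial is what turns enforcement into an audit trail rather than a silent failure.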
Auto-Classification: PII, PHI, and Financial Data #
The classification_policy declaration defines sensitivity levels and enables LLM-assisted auto-classification:
classification_policy DataSensitivity {
levels: {
RESTRICTED: {
level: 4,
controls: ["encryption_at_rest", "column_masking", "audit_logging"],
retention_max: "7y",
cross_border: "prohibited"
},
CONFIDENTIAL: {
level: 3,
controls: ["encryption_at_rest", "row_filtering"],
retention_max: "10y",
cross_border: "with_approval"
},
INTERNAL: {
level: 2,
controls: ["access_logging"],
retention_max: "indefinite",
cross_border: "allowed"
},
PUBLIC: {
level: 1,
controls: ["none"],
retention_max: "indefinite",
cross_border: "allowed"
}
},
auto_classify: {
enabled: true,
provider: "openai",
model: "gpt-4o-mini",
semantic: {
column_name_analysis: true,
sample_value_analysis: true,
confidence_threshold: 0.80
},
drift_detection: {
enabled: true,
scan_interval: "24h"
}
},
propagation: {
lineage_based: true,
inheritance: "strictest"
}
}
Auto-classification works in three phases:
- Column Name Analysis — The LLM examines column names against known PII patterns (ssn, social_security, credit_card, email, dob, date_of_birth, phone, address, diagnosis_code)
- Sample Value Analysis — The LLM examines sampled values (not bulk data) to confirm classification. A column named customer_id might contain SSNs if the naming convention is poor
- Drift Detection — Every 24 hours, the agent re-scans for new columns, renamed columns, and changed data patterns
Do not set confidence_threshold below 0.70. Low thresholds create excessive false positives — every column gets flagged as "maybe PII" — which leads to governance fatigue and teams ignoring classifications entirely. Start at 0.80 and adjust based on your false-negative tolerance.
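The first two phases can be sketched with a rule-based stand-in for the LLM. Everything below — the patterns, the max-of-signals scoring, the 0.85 name confidence — is an illustrative assumption, not the agent's actual model; it only shows why both signals are needed:

```python
import re

# Hypothetical stand-in for LLM-assisted classification: name patterns
# plus a sample-value check, combined into a single confidence score.
NAME_PATTERNS = re.compile(
    r"ssn|social_security|credit_card|email|dob|date_of_birth|phone|address|diagnosis_code"
)
SSN_VALUE = re.compile(r"^\d{3}-\d{2}-\d{4}$")  # only SSNs, for brevity

def classify_column(name, samples, threshold=0.80):
    name_conf = 0.85 if NAME_PATTERNS.search(name.lower()) else 0.0
    value_hits = sum(bool(SSN_VALUE.match(str(v))) for v in samples)
    value_conf = value_hits / len(samples) if samples else 0.0
    # Either signal alone can flag the column: a poorly named column is
    # still caught by its values (the customer_id-full-of-SSNs case).
    confidence = max(name_conf, value_conf)
    label = "RESTRICTED" if confidence >= threshold else "UNCLASSIFIED"
    return label, round(confidence, 2)

print(classify_column("cust_ssn_hash", []))                         # name signal
print(classify_column("customer_id", ["123-45-6789", "987-65-4321"]))  # value signal
```

The threshold sits exactly where the note above says it should: high enough that a weak single signal does not flood the catalog with "maybe PII" flags.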
Access Control: RBAC and ABAC #
The access_policy declaration encodes role-based (RBAC) and attribute-based (ABAC) access control:
access_policy DataAccess {
rbac: {
roles: {
"data_engineer": {
tables: ["*"],
columns: ["*"],
exclude_classifications: ["RESTRICTED"],
operations: ["SELECT", "INSERT"]
},
"analyst": {
tables: ["*_MART.*", "*_DIM.*", "*_FACT.*"],
columns: ["*"],
exclude_classifications: ["RESTRICTED", "CONFIDENTIAL"],
operations: ["SELECT"]
},
"data_scientist": {
tables: ["*"],
columns: ["*"],
masking: {
RESTRICTED: "hash",
CONFIDENTIAL: "partial_mask"
},
operations: ["SELECT"]
}
}
},
abac: {
rules: [
{
name: "geo_restriction",
condition: "user.location NOT IN ['EU'] AND data.classification == 'RESTRICTED' AND data.regulation == 'GDPR'",
action: "deny",
reason: "GDPR data cannot be accessed from non-EU locations"
},
{
name: "purpose_limitation",
condition: "query.purpose NOT IN data.allowed_purposes",
action: "deny",
reason: "Data can only be accessed for declared purposes"
}
]
},
masking: {
strategies: {
"hash": "SHA-256 one-way hash",
"partial_mask": "Show first/last N characters",
"redact": "Replace with [REDACTED]",
"tokenize": "Reversible tokenization via vault",
"generalize": "Reduce precision (ZIP5 → ZIP3)"
}
},
review: {
frequency: "quarterly",
auto_revoke_inactive: "90d",
notification: "slack_security"
}
}
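Four of the five masking strategies are simple enough to sketch directly (these implementations are illustrative; tokenization is omitted because it requires a reversible vault, and the runtime's exact formats may differ):

```python
import hashlib

def hash_mask(value: str) -> str:
    """One-way SHA-256 hash -- joinable, never reversible."""
    return hashlib.sha256(value.encode()).hexdigest()

def partial_mask(value: str, show: int = 2) -> str:
    """Show the first/last N characters, mask the middle."""
    if len(value) <= 2 * show:
        return "*" * len(value)
    return value[:show] + "*" * (len(value) - 2 * show) + value[-show:]

def redact(value: str) -> str:
    return "[REDACTED]"

def generalize_zip(zip5: str) -> str:
    """Reduce precision: ZIP5 -> ZIP3."""
    return zip5[:3] + "XX"

print(partial_mask("123-45-6789"))  # 12*******89
print(generalize_zip("94105"))      # 941XX
```

Hashing preserves joinability (the same SSN hashes to the same token across tables), which is why it is the right choice for the data_scientist role above: models can still use the column as a key without ever seeing the raw value.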
Six-Dimension Quality Scoring #
The quality_policy declaration defines six quality dimensions. Each data asset receives a composite score from 0.0 to 1.0:
| Dimension | Weight | Measurement |
|---|---|---|
| Completeness | 0.20 | % of non-null required fields |
| Accuracy | 0.20 | % passing validation rules |
| Consistency | 0.15 | Cross-source agreement rate |
| Timeliness | 0.15 | Data freshness vs. SLA |
| Uniqueness | 0.15 | Duplicate detection rate |
| Validity | 0.15 | Format/range/referential checks |
Composite score = weighted average of the six dimensions, yielding a value from 0.0 to 1.0. Pipelines are blocked when the composite score falls below the 0.85 threshold.
quality_policy DataQuality {
dimensions: {
completeness: { weight: 0.20, threshold: 0.95 },
accuracy: { weight: 0.20, threshold: 0.98 },
consistency: { weight: 0.15, threshold: 0.90 },
timeliness: { weight: 0.15, threshold: 0.95 },
uniqueness: { weight: 0.15, threshold: 0.99 },
validity: { weight: 0.15, threshold: 0.95 }
},
scoring: {
composite_threshold: 0.85,
action_below_threshold: "block_and_notify",
trend_window: "30d",
trend_alert: "declining"
},
profiling: {
schedule: "daily",
sample_size: 10000,
include_distribution: true
}
}
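The composite score is a plain weighted average. A minimal sketch using the weights from the policy above (the sample scores are hypothetical):

```python
# Weights from the quality_policy above; they sum to 1.0.
WEIGHTS = {
    "completeness": 0.20, "accuracy": 0.20, "consistency": 0.15,
    "timeliness": 0.15, "uniqueness": 0.15, "validity": 0.15,
}

def composite_score(scores, threshold=0.85):
    """Weighted average of the six dimensions, plus the action to take."""
    total = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    action = "pass" if total >= threshold else "block_and_notify"
    return round(total, 4), action

scores = {"completeness": 0.97, "accuracy": 0.99, "consistency": 0.88,
          "timeliness": 0.95, "uniqueness": 1.00, "validity": 0.92}
print(composite_score(scores))  # a healthy table: ~0.95, pass
```

Because the score is a single number per asset, a 30-day declining trend is detectable even while every individual dimension still clears its own threshold — which is exactly what the trend_alert setting watches for.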
Compliance Mapping: GDPR, CCPA, DORA #
The compliance_policy declaration maps regulatory requirements directly to technical controls:
compliance_policy RegulatoryCompliance {
regulations: {
GDPR: {
enabled: true,
scope: "EU_CUSTOMER_DATA",
requirements: {
right_to_access: {
implementation: "dsar_automation",
sla: "30d"
},
right_to_erasure: {
implementation: "delete_cascade_with_audit",
sla: "30d",
exceptions: ["legal_hold", "regulatory_retention"]
},
data_portability: {
implementation: "export_json_csv",
sla: "30d"
},
breach_notification: {
implementation: "auto_detect_and_notify",
sla: "72h",
channels: ["dpo_email", "legal_team"]
},
privacy_by_design: {
implementation: "classification_propagation",
default_classification: "CONFIDENTIAL"
}
}
},
CCPA: {
enabled: true,
scope: "CA_CONSUMER_DATA",
requirements: {
right_to_know: { implementation: "dsar_automation", sla: "45d" },
right_to_delete: { implementation: "delete_cascade_with_audit", sla: "45d" },
right_to_opt_out: { implementation: "consent_flag_check", sla: "15d" }
}
},
DORA: {
enabled: true,
scope: "FINANCIAL_SYSTEMS",
requirements: {
ict_risk_management: { implementation: "continuous_monitoring" },
incident_reporting: { implementation: "auto_classify_and_report", sla: "4h" },
resilience_testing: { implementation: "chaos_engineering_schedule", frequency: "quarterly" }
}
}
},
audit_trail: {
retention: "7y",
immutable: true,
format: "json_lines",
storage: "s3://compliance-audit/trails/"
}
}
DSAR Automation #
Data Subject Access Requests (DSARs) are a major operational burden under GDPR and CCPA. The Governance Agent automates the process:
flowchart TB
A["1. Request received (email/portal)"] --> B["2. Identity verification"]
B --> C["3. Governance Agent traces lineage: Where does data for subject X exist?"]
C --> D["ANALYTICS.CUSTOMERS (email, name, phone)"]
C --> E["RAW_VAULT.HUB_CUSTOMER (customer_key)"]
C --> F["FINANCE_MART.FACT_ORDERS (order history)"]
C --> G["S3://logs/web-events/ (click data)"]
C --> H["Salesforce CRM (via API connector)"]
D --> I["4. Generate access report / execute deletion"]
E --> I
F --> I
G --> I
H --> I
I --> J["5. Audit trail entry (immutable)"]
J --> K["6. Confirmation to data subject"]
The average DSAR takes 12-18 person-hours to fulfill manually, primarily because of the lineage discovery step: finding every table and system that contains data for a specific individual. The Governance Agent's lineage policy reduces this to minutes because it already knows where data flows.
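The lineage discovery step is a graph traversal. A minimal sketch over a hypothetical edge list (the asset names mirror the diagram above; a real lineage graph would come from the lineage policy's metadata store):

```python
from collections import deque

# Hypothetical lineage edges: asset -> downstream assets that consume it.
LINEAGE = {
    "RAW_VAULT.HUB_CUSTOMER": ["ANALYTICS.CUSTOMERS", "FINANCE_MART.FACT_ORDERS"],
    "ANALYTICS.CUSTOMERS": ["s3://logs/web-events/"],
    "FINANCE_MART.FACT_ORDERS": [],
    "s3://logs/web-events/": [],
}

def assets_for_subject(root):
    """BFS from the hub that keys the data subject: every reachable
    asset is in scope for the DSAR access report or deletion."""
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for child in LINEAGE.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

print(assets_for_subject("RAW_VAULT.HUB_CUSTOMER"))
```

The person-hours in a manual DSAR go into reconstructing this edge list by interviewing engineers; with lineage maintained continuously, the traversal itself is trivial.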
Column-Level Lineage #
The lineage_policy declaration configures automated lineage discovery:
lineage_policy DataLineage {
discovery: {
mode: "automatic",
sources: ["sql_parsing", "etl_metadata", "query_history"],
granularity: "column",
refresh_interval: "6h"
},
impact_analysis: {
enabled: true,
downstream_depth: 10,
notification: "slack_data_engineering"
},
visualization: {
format: "interactive_graph",
export: ["svg", "json"]
}
}
Column-level lineage answers questions that table-level lineage cannot:
- "If I change the data type of CUSTOMERS.phone_number, which downstream reports break?"
- "Which columns in the finance mart originate from the Oracle source vs. the Salesforce source?"
- "Is the revenue metric in Dashboard A calculated the same way as in Dashboard B?"
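Column-level lineage is also what drives classification propagation. With inheritance set to "strictest" (as in the classification policy earlier), a derived column inherits the highest sensitivity level among its inputs. An illustrative sketch — the levels match the policy, but the column names and derivation mapping are hypothetical:

```python
# Sensitivity levels from the classification_policy above.
LEVELS = {"PUBLIC": 1, "INTERNAL": 2, "CONFIDENTIAL": 3, "RESTRICTED": 4}

def propagate(source_labels, derived_from):
    """derived_from maps each derived column to the columns it is built
    from, in topological order; each inherits the strictest parent label."""
    labels = dict(source_labels)
    for col, parents in derived_from.items():
        strictest = max(parents, key=lambda p: LEVELS[labels[p]])
        labels[col] = labels[strictest]
    return labels

labels = propagate(
    {"CUSTOMERS.ssn": "RESTRICTED", "CUSTOMERS.region": "INTERNAL"},
    {"MART.cust_profile": ["CUSTOMERS.ssn", "CUSTOMERS.region"]},
)
print(labels["MART.cust_profile"])  # inherits RESTRICTED, the stricter parent
```

This is how the agent catches PII three transformation stages downstream of the source: the label travels with the column through every join and derivation.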
External Tool Connectors #
The Governance Agent does not replace existing governance tools — it orchestrates them:
external_tool CollibraSync {
type: "collibra",
connection: env("COLLIBRA_API_URL"),
credentials: vault("collibra/prod/api_key"),
sync_mode: "bidirectional",
sync_interval: "1h",
capabilities: [
"catalog_sync", "classification_sync",
"glossary_sync", "lineage_sync"
],
conflict_resolution: "external_wins"
}
external_tool AtlasSync {
type: "atlas",
connection: env("ATLAS_API_URL"),
credentials: vault("atlas/prod/creds"),
sync_mode: "pull",
sync_interval: "6h",
capabilities: ["catalog_sync", "lineage_sync"]
}
| Tool | Vendor | Capabilities | Sync Mode |
|---|---|---|---|
| Collibra | Collibra | Catalog, classification, glossary, lineage, workflow | Bidirectional |
| Apache Atlas | Apache | Catalog, lineage, classification | Pull |
| Alation | Alation | Catalog, glossary, lineage, query log analysis | Bidirectional |
| Microsoft Purview | Microsoft | Catalog, classification, lineage, access | Bidirectional |
| Informatica CDG | Informatica | Catalog, quality, classification, lineage | Bidirectional |
| Atlan | Atlan | Catalog, glossary, lineage, collaboration | Pull/Push |
| Immuta | Immuta | Access control, masking, audit | Push |
Do not set conflict_resolution: "neam_wins" when connecting to a production Collibra instance that your governance team actively maintains. Start with "external_wins" so the Governance Agent supplements the existing catalog without overwriting manual curation. Switch to "bidirectional" after trust is established.
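Conflict resolution itself reduces to a merge order. An illustrative sketch of "external_wins" vs. "neam_wins" over one catalog entry (the field names are hypothetical, and the bidirectional mode's field-level reconciliation is omitted):

```python
# Illustrative merge of a local (agent) and external (e.g. Collibra)
# catalog entry under two conflict_resolution strategies.
def resolve(local: dict, external: dict, strategy: str) -> dict:
    if strategy == "external_wins":
        return {**local, **external}   # external fields overwrite local ones
    if strategy == "neam_wins":
        return {**external, **local}   # local fields overwrite external ones
    raise ValueError(f"unknown strategy: {strategy}")

local = {"classification": "RESTRICTED", "owner": "data-platform"}
external = {"classification": "CONFIDENTIAL", "steward": "governance-team"}

print(resolve(local, external, "external_wins"))
# the manually curated external classification is kept;
# fields that exist only locally still survive the merge
```

Note that "external_wins" is not "external only": fields the external catalog has never seen (here, owner) are still contributed by the agent, which is exactly the supplement-without-overwrite behavior recommended above.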
Neam Code: A Complete Governance Agent #
// ═══════════════════════════════════════════════════════════════
// Governance Agent — Digital Data Steward for SimShop
// ═══════════════════════════════════════════════════════════════
budget GovBudget { cost: 50.00, tokens: 500000, time: 86400000 }
// ... classification_policy, access_policy, quality_policy,
// ... lineage_policy, compliance_policy defined above ...
governance agent DataSteward {
provider: "anthropic",
model: "claude-sonnet-4-6",
system: "You are a data governance agent enforcing classification, access, and compliance policies.",
temperature: 0.3,
classification: DataSensitivity,
access_control: DataAccess,
quality: DataQuality,
lineage: DataLineage,
compliance: RegulatoryCompliance,
external_tools: [CollibraSync, AtlasSync],
coordinates_with: [PlatformWatch, ETLBuilder],
reports: {
governance_dashboard: { frequency: "daily", channel: "slack_governance" },
compliance_report: { frequency: "weekly", channel: "email_compliance" },
cost_report: { frequency: "monthly", channel: "email_finance" }
},
budget: GovBudget,
agent_md: "./agents/data_steward.md"
}
// ─── Operational Commands ───
DataSteward.classify_all()
DataSteward.run_quality_scan()
let lineage = DataSteward.trace_lineage("FINANCE_MART.fact_orders")
let dsar = DataSteward.process_dsar("customer_id", "CUST-12345")
Industry Perspective #
DAMA-DMBOK Alignment #
The DAMA Data Management Body of Knowledge (DMBOK 2.0) defines 11 knowledge areas for data management. The Governance Agent maps to six of them:
| DAMA-DMBOK Knowledge Area | Governance Agent Implementation |
|---|---|
| Data Governance | governance agent orchestration |
| Data Quality | quality_policy with 6-dimension scoring |
| Data Security | access_policy with RBAC/ABAC/masking |
| Metadata Management | catalog_source, glossary, lineage_policy |
| Reference & Master Data | master_data golden record management |
| Regulatory Compliance | compliance_policy with GDPR/CCPA/DORA |
The Real Cost of Non-Compliance #
| Regulation | Maximum Fine | Notable Fines (2021-2025) |
|---|---|---|
| GDPR | 4% of global annual revenue or 20M EUR, whichever is higher | Meta: 1.2B EUR, Amazon: 746M EUR |
| CCPA | $7,500 per intentional violation | Sephora: $1.2M (first major) |
| DORA | Up to 1% of average daily global turnover | Enforcement began Jan 2025 |
The cost of a Governance Agent (LLM tokens for classification and monitoring) is negligible compared to the cost of a single compliance failure.
The Evidence #
DataSims experiments (DataSims repository) demonstrate the Governance Agent's impact on the churn prediction pipeline:
| Metric | Without Governance | With Governance | Improvement |
|---|---|---|---|
| PII Exposure Incidents | 3.4 / quarter | 0 | 100% |
| Classification Coverage | 22% (manual) | 98.7% (auto) | 4.5x |
| DSAR Fulfillment Time | 14.2 hours | 23 minutes | 97.3% |
| Access Policy Drift | 34% stale after 90 days | 0% (continuous enforcement) | 100% |
| Quality Score Availability | Ad hoc | Every table, daily | Continuous |
Ablation A4 (Governance Agent removed) in the churn prediction experiment showed two consequences: (1) PII columns were included in model training features, creating a compliance violation, and (2) access controls were not enforced during analyst queries, allowing unrestricted access to sensitive financial data. The Governance Agent's classification propagation caught PII flowing from source to mart through three transformation stages — a lineage depth that manual classification consistently misses.
Key Takeaways #
- Governance as code means policies are compilable, testable, executable, and version-controlled — not Word documents that drift from reality
- Auto-classification using LLM-assisted semantic analysis (column names + sample values) achieves 98.7% coverage vs. 22% for manual approaches
- RBAC/ABAC access policies enforce column-level masking and purpose limitation without modifying upstream pipelines
- Six-dimension quality scoring (completeness, accuracy, consistency, timeliness, uniqueness, validity) provides a single 0.0-1.0 health metric for every data asset
- Compliance policies map directly to GDPR, CCPA, and DORA requirements — including automated DSAR fulfillment
- External tool connectors orchestrate Collibra, Atlas, Alation, Purview, Informatica CDG, Atlan, and Immuta — the Governance Agent is an intelligence layer, not a replacement
- The Governance Agent governs the data estate — it never builds pipelines, moves data, or designs schemas
For Further Exploration #
- Neam Language Reference: Governance Agent
- DataSims: Simulated Enterprise Environment — 340 tables with controlled governance scenarios
- DAMA-DMBOK 2.0 (DAMA International, 2017) — the definitive data management framework
- GDPR Regulation (EU) 2016/679 — full text and recitals
- DORA Regulation (EU) 2022/2554 — Digital Operational Resilience Act