Chapter 13: The Governance Agent — Compliance by Design #

"You can't govern what you can't see. And you can't see what you haven't classified." — Peter Aiken, past president, DAMA International


25 min read | David, Raj | Part IV: Platform Intelligence Agents

What you'll learn:

  - How governance-as-code turns policy documents into compilable, testable, executable declarations
  - LLM-assisted auto-classification of PII, PHI, and financial data, with drift detection
  - RBAC and ABAC access control with column masking strategies
  - Six-dimension quality scoring and compliance mapping for GDPR, CCPA, and DORA
  - DSAR automation, column-level lineage, and connectors to external governance tools


The Problem #

David is the VP of Data at a mid-sized fintech company. His team has 340 tables across Snowflake, PostgreSQL, and S3. Somewhere in those tables are Social Security numbers, credit card numbers, email addresses, and health indicators from a wellness program integration. He knows this because a security audit found unmasked PII in a development Snowflake clone six months ago. He does not know exactly which columns, in which tables, contain sensitive data today — because the schema changes weekly and nobody maintains the classification spreadsheet.

His compliance officer needs to demonstrate GDPR readiness for the next board meeting. His security team needs column-level access controls. His data engineers need to know which tables they can use for analytics without legal review. And his analysts need to know what cust_ssn_hash actually means in business terms — the column naming conventions were set by three different teams over four years.

David has Collibra. It has been partially configured. Twenty-two percent of tables are cataloged. Classification was attempted manually last quarter but fell behind within weeks as new tables appeared. Access policies exist in Snowflake's RBAC system but are not connected to the catalog. Lineage is drawn on a whiteboard in the data engineering room.

The Governance Agent does not replace Collibra. It orchestrates Collibra — and Atlas, and Alation, and every other governance tool in the stack — as an intelligence layer that classifies continuously, enforces policies as code, and closes the gap between "governance strategy" and "governance reality."


Data Governance as Code #

Traditional governance relies on documentation — policies written in Word documents, classification maintained in spreadsheets, access reviews conducted quarterly via email threads. The fundamental insight of governance-as-code is that policies should be compilable, testable, and executable — the same properties we demand of data pipelines.

COMPARISON Traditional Governance vs Governance as Code

flowchart TB
  subgraph Traditional["Traditional Governance"]
    direction TB
    A["Word Doc:
PII must be masked"] -->|"manual"| B["Spreadsheet:
Table A has SSN in col 5"]
    B -->|"stale"| C["Snowflake RBAC:
maybe configured"]
  end
  subgraph Neam["Governance as Code (Neam)"]
    direction TB
    D["classification_policy DataSensitivity {
levels: { RESTRICTED: {
controls: [encryption, masking] } }
auto_classify: { enabled: true } }"] -->|"compiled"| E["neamc → bytecode
VM enforces at every query"]
  end
Insight

The gap between "policy written" and "policy enforced" is where breaches happen. When policies are code, enforcement is continuous and automated. When policies are documents, enforcement depends on someone reading the document and configuring the platform correctly — a process that degrades over time.
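
The difference is concrete: a policy expressed as data can be unit-tested before it ever reaches production. A minimal Python sketch of that idea (the policy structure and `is_compliant` check are illustrative, not Neam's actual runtime):

```python
# Hypothetical sketch: a classification policy expressed as data, with an
# executable compliance check -- "compilable, testable" in miniature.

POLICY = {
    "RESTRICTED": {"controls": ["encryption_at_rest", "column_masking"]},
    "INTERNAL": {"controls": ["access_logging"]},
}

def is_compliant(classification: str, applied_controls: set[str]) -> bool:
    """A column is compliant if every control required by its
    classification level has actually been applied."""
    required = set(POLICY.get(classification, {}).get("controls", []))
    return required <= applied_controls

# Enforcement becomes a test that can run on every deploy,
# not a document someone must remember to read.
assert is_compliant("RESTRICTED", {"encryption_at_rest", "column_masking"})
assert not is_compliant("RESTRICTED", {"encryption_at_rest"})
```

Because the check is code, it can run in CI and at query time, which is exactly the gap the Insight above describes.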


Auto-Classification: PII, PHI, and Financial Data #

The classification_policy declaration defines sensitivity levels and enables LLM-assisted auto-classification:

NEAM
classification_policy DataSensitivity {
    levels: {
        RESTRICTED: {
            level: 4,
            controls: ["encryption_at_rest", "column_masking", "audit_logging"],
            retention_max: "7y",
            cross_border: "prohibited"
        },
        CONFIDENTIAL: {
            level: 3,
            controls: ["encryption_at_rest", "row_filtering"],
            retention_max: "10y",
            cross_border: "with_approval"
        },
        INTERNAL: {
            level: 2,
            controls: ["access_logging"],
            retention_max: "indefinite",
            cross_border: "allowed"
        },
        PUBLIC: {
            level: 1,
            controls: ["none"],
            retention_max: "indefinite",
            cross_border: "allowed"
        }
    },
    auto_classify: {
        enabled: true,
        provider: "openai",
        model: "gpt-4o-mini",
        semantic: {
            column_name_analysis: true,
            sample_value_analysis: true,
            confidence_threshold: 0.80
        },
        drift_detection: {
            enabled: true,
            scan_interval: "24h"
        }
    },
    propagation: {
        lineage_based: true,
        inheritance: "strictest"
    }
}

Auto-classification works in three phases:

  1. Column Name Analysis — The LLM examines column names against known PII patterns (ssn, social_security, credit_card, email, dob, date_of_birth, phone, address, diagnosis_code)
  2. Sample Value Analysis — The LLM examines sampled values (not bulk data) to confirm classification. A column named customer_id might contain SSNs if the naming convention is poor
  3. Drift Detection — Every 24 hours, the agent re-scans for new columns, renamed columns, and changed data patterns
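
The column-name phase does not strictly need an LLM for the obvious cases; a deterministic pre-pass can label exact pattern matches with high confidence and defer everything else to sample-value analysis. A hedged Python sketch (the token lists and confidence values are illustrative, not Neam's classifier):

```python
# Hypothetical name-based pre-classifier. Obvious PII name patterns get a
# high-confidence label; ambiguous names are deferred to the LLM phases.
TOKEN_LABELS = {"ssn": "RESTRICTED", "email": "CONFIDENTIAL",
                "phone": "CONFIDENTIAL", "dob": "CONFIDENTIAL",
                "address": "CONFIDENTIAL", "diagnosis": "RESTRICTED"}
SUBSTRING_LABELS = {"social_security": "RESTRICTED",
                    "credit_card": "RESTRICTED",
                    "date_of_birth": "CONFIDENTIAL"}

def classify_by_name(column: str) -> tuple[str, float]:
    """Return (label, confidence) from name patterns alone, or
    ('UNKNOWN', 0.0) to signal that sample-value analysis is needed."""
    name = column.lower()
    for fragment, label in SUBSTRING_LABELS.items():
        if fragment in name:
            return label, 0.95   # multi-word pattern match
    for token in name.split("_"):
        if token in TOKEN_LABELS:
            return TOKEN_LABELS[token], 0.90   # exact token match
    return "UNKNOWN", 0.0        # defer to sample-value analysis

print(classify_by_name("cust_ssn_hash"))   # ('RESTRICTED', 0.9)
print(classify_by_name("customer_id"))     # ('UNKNOWN', 0.0) -- may still hold SSNs
```

Note that `customer_id` is deferred rather than cleared: only phase 2's value sampling can rule out a badly named PII column.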
Anti-Pattern

Do not set confidence_threshold below 0.70. Low thresholds create excessive false positives — every column gets flagged as "maybe PII" — which leads to governance fatigue and teams ignoring classifications entirely. Start at 0.80 and adjust based on your false-negative tolerance.


Access Control: RBAC and ABAC #

The access_policy declaration encodes role-based (RBAC) and attribute-based (ABAC) access control:

NEAM
access_policy DataAccess {
    rbac: {
        roles: {
            "data_engineer": {
                tables: ["*"],
                columns: ["*"],
                exclude_classifications: ["RESTRICTED"],
                operations: ["SELECT", "INSERT"]
            },
            "analyst": {
                tables: ["*_MART.*", "*_DIM.*", "*_FACT.*"],
                columns: ["*"],
                exclude_classifications: ["RESTRICTED", "CONFIDENTIAL"],
                operations: ["SELECT"]
            },
            "data_scientist": {
                tables: ["*"],
                columns: ["*"],
                masking: {
                    RESTRICTED: "hash",
                    CONFIDENTIAL: "partial_mask"
                },
                operations: ["SELECT"]
            }
        }
    },
    abac: {
        rules: [
            {
                name: "geo_restriction",
                condition: "user.location NOT IN ['EU'] AND data.classification == 'RESTRICTED' AND data.regulation == 'GDPR'",
                action: "deny",
                reason: "GDPR data cannot be accessed from non-EU locations"
            },
            {
                name: "purpose_limitation",
                condition: "query.purpose NOT IN data.allowed_purposes",
                action: "deny",
                reason: "Data can only be accessed for declared purposes"
            }
        ]
    },
    masking: {
        strategies: {
            "hash": "SHA-256 one-way hash",
            "partial_mask": "Show first/last N characters",
            "redact": "Replace with [REDACTED]",
            "tokenize": "Reversible tokenization via vault",
            "generalize": "Reduce precision (ZIP5 → ZIP3)"
        }
    },
    review: {
        frequency: "quarterly",
        auto_revoke_inactive: "90d",
        notification: "slack_security"
    }
}
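
The masking strategies declared above map to straightforward transformations. A Python sketch of three of them (function names and parameters are illustrative, not the Neam runtime API):

```python
import hashlib

def mask_hash(value: str) -> str:
    """'hash' strategy: SHA-256 one-way hash."""
    return hashlib.sha256(value.encode()).hexdigest()

def mask_partial(value: str, show: int = 4) -> str:
    """'partial_mask' strategy: keep only the last N characters."""
    return "*" * max(len(value) - show, 0) + value[-show:]

def mask_generalize_zip(zip5: str) -> str:
    """'generalize' strategy: reduce ZIP5 to ZIP3 precision."""
    return zip5[:3] + "XX"

print(mask_partial("4111111111111111"))  # ************1111
print(mask_generalize_zip("94110"))      # 941XX
```

The `tokenize` strategy is the odd one out: it requires a vault round-trip so the mapping can be reversed, which is why it is the only reversible option in the list.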

Six-Dimension Quality Scoring #

The quality_policy declaration defines six quality dimensions. Each data asset receives a composite score from 0.0 to 1.0:

| Dimension | Weight | Measurement |
| --- | --- | --- |
| Completeness | 0.20 | % of non-null required fields |
| Accuracy | 0.20 | % passing validation rules |
| Consistency | 0.15 | Cross-source agreement rate |
| Timeliness | 0.15 | Data freshness vs. SLA |
| Uniqueness | 0.15 | Duplicate detection rate |
| Validity | 0.15 | Format/range/referential checks |

Composite score = weighted average, ranging 0.0 to 1.0. The pipeline is blocked if the score falls below 0.85.

NEAM
quality_policy DataQuality {
    dimensions: {
        completeness: { weight: 0.20, threshold: 0.95 },
        accuracy: { weight: 0.20, threshold: 0.98 },
        consistency: { weight: 0.15, threshold: 0.90 },
        timeliness: { weight: 0.15, threshold: 0.95 },
        uniqueness: { weight: 0.15, threshold: 0.99 },
        validity: { weight: 0.15, threshold: 0.95 }
    },
    scoring: {
        composite_threshold: 0.85,
        action_below_threshold: "block_and_notify",
        trend_window: "30d",
        trend_alert: "declining"
    },
    profiling: {
        schedule: "daily",
        sample_size: 10000,
        include_distribution: true
    }
}
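
The composite score is plain arithmetic: a weighted average of the per-dimension scores compared against composite_threshold. A minimal sketch (weights copied from the policy above; the function names are illustrative):

```python
# Weights from the quality_policy above; they sum to 1.0.
WEIGHTS = {"completeness": 0.20, "accuracy": 0.20, "consistency": 0.15,
           "timeliness": 0.15, "uniqueness": 0.15, "validity": 0.15}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of the six dimension scores."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

def gate(scores: dict[str, float], threshold: float = 0.85) -> str:
    """Apply the composite_threshold gate from the policy."""
    return "pass" if composite_score(scores) >= threshold else "block_and_notify"

scores = {"completeness": 0.99, "accuracy": 0.97, "consistency": 0.92,
          "timeliness": 0.88, "uniqueness": 1.00, "validity": 0.95}
print(round(composite_score(scores), 4), gate(scores))  # 0.9545 pass
```

Note that a table can pass the composite gate while failing an individual dimension threshold (timeliness 0.88 is below its 0.95 threshold here), which is why the policy tracks both.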

Compliance Mapping: GDPR, CCPA, DORA #

The compliance_policy declaration maps regulatory requirements directly to technical controls:

NEAM
compliance_policy RegulatoryCompliance {
    regulations: {
        GDPR: {
            enabled: true,
            scope: "EU_CUSTOMER_DATA",
            requirements: {
                right_to_access: {
                    implementation: "dsar_automation",
                    sla: "30d"
                },
                right_to_erasure: {
                    implementation: "delete_cascade_with_audit",
                    sla: "30d",
                    exceptions: ["legal_hold", "regulatory_retention"]
                },
                data_portability: {
                    implementation: "export_json_csv",
                    sla: "30d"
                },
                breach_notification: {
                    implementation: "auto_detect_and_notify",
                    sla: "72h",
                    channels: ["dpo_email", "legal_team"]
                },
                privacy_by_design: {
                    implementation: "classification_propagation",
                    default_classification: "CONFIDENTIAL"
                }
            }
        },
        CCPA: {
            enabled: true,
            scope: "CA_CONSUMER_DATA",
            requirements: {
                right_to_know: { implementation: "dsar_automation", sla: "45d" },
                right_to_delete: { implementation: "delete_cascade_with_audit", sla: "45d" },
                right_to_opt_out: { implementation: "consent_flag_check", sla: "15d" }
            }
        },
        DORA: {
            enabled: true,
            scope: "FINANCIAL_SYSTEMS",
            requirements: {
                ict_risk_management: { implementation: "continuous_monitoring" },
                incident_reporting: { implementation: "auto_classify_and_report", sla: "4h" },
                resilience_testing: { implementation: "chaos_engineering_schedule", frequency: "quarterly" }
            }
        }
    },
    audit_trail: {
        retention: "7y",
        immutable: true,
        format: "json_lines",
        storage: "s3://compliance-audit/trails/"
    }
}
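
Each requirement's sla field becomes a concrete deadline the moment a request or incident is registered. A sketch of that computation (the parsing helper is hypothetical; the SLA values come from the policy above):

```python
from datetime import datetime, timedelta

# SLA strings taken from the compliance_policy above.
SLAS = {("GDPR", "right_to_erasure"): "30d",
        ("GDPR", "breach_notification"): "72h",
        ("CCPA", "right_to_opt_out"): "15d",
        ("DORA", "incident_reporting"): "4h"}

def sla_deadline(regulation: str, requirement: str,
                 received: datetime) -> datetime:
    """Translate an sla string like '30d' or '72h' into a hard deadline."""
    sla = SLAS[(regulation, requirement)]
    amount, unit = int(sla[:-1]), sla[-1]
    delta = timedelta(days=amount) if unit == "d" else timedelta(hours=amount)
    return received + delta

received = datetime(2025, 3, 1, 9, 0)
print(sla_deadline("GDPR", "breach_notification", received))  # 2025-03-04 09:00:00
```

Making the deadline computable means the agent can alert before a 72-hour breach-notification window closes, rather than after.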

DSAR Automation #

Data Subject Access Requests (DSARs) are a major operational burden under GDPR and CCPA. The Governance Agent automates the process:

WORKFLOW DSAR Workflow

flowchart TB
  A["1. Request received
(email/portal)"] --> B["2. Identity verification"]
  B --> C["3. Governance Agent traces lineage:
Where does data for subject X exist?"]
  C --> D["ANALYTICS.CUSTOMERS
(email, name, phone)"]
  C --> E["RAW_VAULT.HUB_CUSTOMER
(customer_key)"]
  C --> F["FINANCE_MART.FACT_ORDERS
(order history)"]
  C --> G["S3://logs/web-events/
(click data)"]
  C --> H["Salesforce CRM
(via API connector)"]
  D --> I["4. Generate access report /
execute deletion"]
  E --> I
  F --> I
  G --> I
  H --> I
  I --> J["5. Audit trail entry (immutable)"]
  J --> K["6. Confirmation to data subject"]
Insight

The average DSAR takes 12-18 person-hours to fulfill manually, primarily because of the lineage discovery step: finding every table and system that contains data for a specific individual. The Governance Agent's lineage policy reduces this to minutes because it already knows where data flows.
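
Once the lineage graph exists, the discovery step is a breadth-first traversal from the subject's origin table to everything downstream. A hedged Python sketch (the graph shape and table names are illustrative, drawn from the workflow diagram above):

```python
from collections import deque

# Hypothetical lineage graph: table -> downstream tables it feeds.
LINEAGE = {
    "RAW_VAULT.HUB_CUSTOMER": ["ANALYTICS.CUSTOMERS"],
    "ANALYTICS.CUSTOMERS": ["FINANCE_MART.FACT_ORDERS"],
    "FINANCE_MART.FACT_ORDERS": [],
}

def tables_holding_subject(root: str) -> list[str]:
    """BFS from the table where the subject's key originates to every
    downstream table that may hold derived data for that subject."""
    seen, queue, order = {root}, deque([root]), []
    while queue:
        table = queue.popleft()
        order.append(table)
        for downstream in LINEAGE.get(table, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return order

print(tables_holding_subject("RAW_VAULT.HUB_CUSTOMER"))
```

A graph walk like this finishes in milliseconds; the manual equivalent is interviewing every team that might have copied the data.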


Column-Level Lineage #

The lineage_policy declaration configures automated lineage discovery:

NEAM
lineage_policy DataLineage {
    discovery: {
        mode: "automatic",
        sources: ["sql_parsing", "etl_metadata", "query_history"],
        granularity: "column",
        refresh_interval: "6h"
    },
    impact_analysis: {
        enabled: true,
        downstream_depth: 10,
        notification: "slack_data_engineering"
    },
    visualization: {
        format: "interactive_graph",
        export: ["svg", "json"]
    }
}

Column-level lineage answers questions that table-level lineage cannot:

  - Which downstream columns derive from a PII column, and should they inherit its classification?
  - If a column changes type or meaning, which downstream tables, reports, and models are affected?
  - For a DSAR, which specific columns (not just tables) hold data for a given subject?

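How does sql_parsing yield column granularity in practice? A toy sketch that extracts column-level edges from a plain INSERT...SELECT statement (real parsers handle aliases, expressions, and CTEs; the regex here covers only the trivial case and is purely illustrative):

```python
import re

def column_edges(sql: str) -> list[tuple[str, str]]:
    """Map source columns to target columns for a simple
    'INSERT INTO t (a, b) SELECT x, y FROM s' statement."""
    m = re.search(
        r"INSERT INTO (\w+)\s*\(([^)]*)\)\s*SELECT (.*?) FROM (\w+)",
        sql, re.IGNORECASE | re.DOTALL)
    target, tcols, scols, source = m.groups()
    targets = [c.strip() for c in tcols.split(",")]
    sources = [c.strip() for c in scols.split(",")]
    # One lineage edge per (source column, target column) pair.
    return [(f"{source}.{s}", f"{target}.{t}")
            for s, t in zip(sources, targets)]

sql = ("INSERT INTO dim_customer (email, phone) "
       "SELECT cust_email, cust_phone FROM stg_customers")
print(column_edges(sql))
```

Each edge is what lets a RESTRICTED label on `stg_customers.cust_email` propagate, via the "strictest" inheritance rule, to `dim_customer.email`.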
External Tool Connectors #

The Governance Agent does not replace existing governance tools — it orchestrates them:

NEAM
external_tool CollibraSync {
    type: "collibra",
    connection: env("COLLIBRA_API_URL"),
    credentials: vault("collibra/prod/api_key"),
    sync_mode: "bidirectional",
    sync_interval: "1h",
    capabilities: [
        "catalog_sync", "classification_sync",
        "glossary_sync", "lineage_sync"
    ],
    conflict_resolution: "external_wins"
}

external_tool AtlasSync {
    type: "atlas",
    connection: env("ATLAS_API_URL"),
    credentials: vault("atlas/prod/creds"),
    sync_mode: "pull",
    sync_interval: "6h",
    capabilities: ["catalog_sync", "lineage_sync"]
}

| Tool | Vendor | Capabilities | Sync Mode |
| --- | --- | --- | --- |
| Collibra | Collibra | Catalog, classification, glossary, lineage, workflow | Bidirectional |
| Apache Atlas | Apache | Catalog, lineage, classification | Pull |
| Alation | Alation | Catalog, glossary, lineage, query log analysis | Bidirectional |
| Microsoft Purview | Microsoft | Catalog, classification, lineage, access | Bidirectional |
| Informatica CDG | Informatica | Catalog, quality, classification, lineage | Bidirectional |
| Atlan | Atlan | Catalog, glossary, lineage, collaboration | Pull/Push |
| Immuta | Immuta | Access control, masking, audit | Push |

Anti-Pattern

Do not set conflict_resolution: "neam_wins" when connecting to a production Collibra instance that your governance team actively maintains. Start with "external_wins" so the Governance Agent supplements the existing catalog without overwriting manual curation. Switch to "bidirectional" after trust is established.


Neam Code: A Complete Governance Agent #

NEAM
// ═══════════════════════════════════════════════════════════════
// Governance Agent — Digital Data Steward for SimShop
// ═══════════════════════════════════════════════════════════════

budget GovBudget { cost: 50.00, tokens: 500000, time: 86400000 }

// ... classification_policy, access_policy, quality_policy,
// ... lineage_policy, compliance_policy defined above ...

governance agent DataSteward {
    provider: "anthropic",
    model: "claude-sonnet-4-6",
    system: "You are a data governance agent enforcing classification, access, and compliance policies.",
    temperature: 0.3,

    classification: DataSensitivity,
    access_control: DataAccess,
    quality: DataQuality,
    lineage: DataLineage,
    compliance: RegulatoryCompliance,

    external_tools: [CollibraSync, AtlasSync],

    coordinates_with: [PlatformWatch, ETLBuilder],

    reports: {
        governance_dashboard: { frequency: "daily", channel: "slack_governance" },
        compliance_report: { frequency: "weekly", channel: "email_compliance" },
        cost_report: { frequency: "monthly", channel: "email_finance" }
    },

    budget: GovBudget,
    agent_md: "./agents/data_steward.md"
}

// ─── Operational Commands ───
DataSteward.classify_all()
DataSteward.run_quality_scan()
let lineage = DataSteward.trace_lineage("FINANCE_MART.fact_orders")
let dsar = DataSteward.process_dsar("customer_id", "CUST-12345")

Industry Perspective #

DAMA-DMBOK Alignment #

The DAMA Data Management Body of Knowledge (DMBOK 2.0) defines 11 knowledge areas for data management. The Governance Agent maps to six of them:

| DAMA-DMBOK Knowledge Area | Governance Agent Implementation |
| --- | --- |
| Data Governance | governance agent orchestration |
| Data Quality | quality_policy with 6-dimension scoring |
| Data Security | access_policy with RBAC/ABAC/masking |
| Metadata Management | catalog_source, glossary, lineage_policy |
| Reference & Master Data | master_data golden record management |
| Regulatory Compliance | compliance_policy with GDPR/CCPA/DORA |

The Real Cost of Non-Compliance #

| Regulation | Maximum Fine | Notable Fines (2023-2025) |
| --- | --- | --- |
| GDPR | 4% of global revenue or 20M EUR, whichever is higher | Meta: 1.2B EUR; Amazon: 746M EUR |
| CCPA | $7,500 per intentional violation | Sephora: $1.2M (first major) |
| DORA | Up to 1% of average daily global turnover | Enforcement began Jan 2025 |

The cost of a Governance Agent (LLM tokens for classification and monitoring) is negligible compared to the cost of a single compliance failure.


The Evidence #

DataSims experiments (DataSims repository) demonstrate the Governance Agent's impact on the churn prediction pipeline:

| Metric | Without Governance | With Governance | Improvement |
| --- | --- | --- | --- |
| PII Exposure Incidents | 3.4 / quarter | 0 | 100% |
| Classification Coverage | 22% (manual) | 98.7% (auto) | 4.5x |
| DSAR Fulfillment Time | 14.2 hours | 23 minutes | 97.3% |
| Access Policy Drift | 34% stale after 90 days | 0% (continuous enforcement) | 100% |
| Quality Score Availability | Ad hoc | Every table, daily | Continuous |

Ablation A4 (Governance Agent removed) in the churn prediction experiment showed two consequences: (1) PII columns were included in model training features, creating a compliance violation, and (2) access controls were not enforced during analyst queries, allowing unrestricted access to sensitive financial data. The Governance Agent's classification propagation caught PII flowing from source to mart through three transformation stages, a lineage depth that manual classification consistently misses.


Key Takeaways #

For Further Exploration #