Chapter 13: The Governance Agent — Compliance by Design #
"You can't govern what you can't see. And you can't see what you haven't classified." — Peter Aiken, founding president, International Data Management Association
25 min read | David, Raj | Part IV: Platform Intelligence Agents
What you'll learn:
- How to encode data governance as code — classification, access control, lineage, compliance, and quality scoring as compilable declarations
- How auto-classification uses LLM-assisted semantic analysis to identify PII, PHI, and financial data
- How RBAC/ABAC access policies enforce column-level masking without modifying upstream pipelines
- How six-dimension quality scoring provides a single health metric for every data asset
- How compliance policies map directly to GDPR, CCPA, and DORA regulatory requirements
- How external tool connectors orchestrate Collibra, Atlas, Alation, Purview, Informatica CDG, Atlan, and Immuta
The Problem #
David is the VP of Data at a mid-sized fintech company. His team has 340 tables across Snowflake, PostgreSQL, and S3. Somewhere in those tables are Social Security numbers, credit card numbers, email addresses, and health indicators from a wellness program integration. He knows this because a security audit found unmasked PII in a development Snowflake clone six months ago. He does not know exactly which columns, in which tables, contain sensitive data today — because the schema changes weekly and nobody maintains the classification spreadsheet.
His compliance officer needs to demonstrate GDPR readiness for the next board meeting. His security team needs column-level access controls. His data engineers need to know which tables they can use for analytics without legal review. And his analysts need to know what cust_ssn_hash actually means in business terms — the column naming conventions were set by three different teams over four years.
David has Collibra. It has been partially configured. Twenty-two percent of tables are cataloged. Classification was attempted manually last quarter but fell behind within weeks as new tables appeared. Access policies exist in Snowflake's RBAC system but are not connected to the catalog. Lineage is drawn on a whiteboard in the data engineering room.
The Governance Agent does not replace Collibra. It orchestrates Collibra — and Atlas, and Alation, and every other governance tool in the stack — as an intelligence layer that classifies continuously, enforces policies as code, and closes the gap between "governance strategy" and "governance reality."
Data Governance as Code #
Traditional governance relies on documentation — policies written in Word documents, classification maintained in spreadsheets, access reviews conducted quarterly via email threads. The fundamental insight of governance-as-code is that policies should be compilable, testable, and executable — the same properties we demand of data pipelines.
flowchart TB
subgraph Traditional["Traditional Governance"]
direction TB
A["Word Doc:
PII must be masked"] -->|"manual"| B["Spreadsheet:
Table A has SSN in col 5"]
B -->|"stale"| C["Snowflake RBAC:
maybe configured"]
end
subgraph Neam["Governance as Code (Neam)"]
direction TB
D["classification_policy DataSensitivity {
levels: { RESTRICTED: {
controls: [encryption, masking] } }
auto_classify: { enabled: true } }"] -->|"compiled"| E["neamc → bytecode
VM enforces at every query"]
end
The gap between "policy written" and "policy enforced" is where breaches happen. When policies are code, enforcement is continuous and automated. When policies are documents, enforcement depends on someone reading the document and configuring the platform correctly — a process that degrades over time.
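What "compiled and enforced" means in practice can be sketched in a few lines. The policy table and `check_query` function below are illustrative assumptions, not Neam's actual VM — the point is that a policy expressed as data can be checked mechanically on every query:

```python
# Illustrative sketch: a compiled policy is just data the runtime checks
# on every query -- unlike a Word document, it cannot silently drift.
POLICY = {
    "RESTRICTED": {"required_controls": {"encryption_at_rest", "column_masking", "audit_logging"}},
    "INTERNAL": {"required_controls": {"access_logging"}},
}

def check_query(columns, active_controls):
    """Deny the query if any touched column's classification demands a
    control that is not active on the connection."""
    violations = []
    for col, classification in columns.items():
        missing = POLICY[classification]["required_controls"] - active_controls
        if missing:
            violations.append((col, sorted(missing)))
    return ("deny", violations) if violations else ("allow", [])

decision, why = check_query(
    {"cust_ssn_hash": "RESTRICTED", "region": "INTERNAL"},
    active_controls={"encryption_at_rest", "access_logging"},
)
# The RESTRICTED column lacks column_masking and audit_logging,
# so the query is denied with an explanation attached.
```

The explanation attached to each denial is what turns enforcement into an audit trail rather than a silent failure.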
Auto-Classification: PII, PHI, and Financial Data #
The classification_policy declaration defines sensitivity levels and enables LLM-assisted auto-classification:
classification_policy DataSensitivity {
levels: {
RESTRICTED: {
level: 4,
controls: ["encryption_at_rest", "column_masking", "audit_logging"],
retention_max: "7y",
cross_border: "prohibited"
},
CONFIDENTIAL: {
level: 3,
controls: ["encryption_at_rest", "row_filtering"],
retention_max: "10y",
cross_border: "with_approval"
},
INTERNAL: {
level: 2,
controls: ["access_logging"],
retention_max: "indefinite",
cross_border: "allowed"
},
PUBLIC: {
level: 1,
controls: ["none"],
retention_max: "indefinite",
cross_border: "allowed"
}
},
auto_classify: {
enabled: true,
provider: "openai",
model: "gpt-4o-mini",
semantic: {
column_name_analysis: true,
sample_value_analysis: true,
confidence_threshold: 0.80
},
drift_detection: {
enabled: true,
scan_interval: "24h"
}
},
propagation: {
lineage_based: true,
inheritance: "strictest"
}
}
Auto-classification works in three phases:
- Column Name Analysis — The LLM examines column names against known PII patterns (ssn, social_security, credit_card, email, dob, date_of_birth, phone, address, diagnosis_code)
- Sample Value Analysis — The LLM examines sampled values (not bulk data) to confirm classification. A column named customer_id might contain SSNs if the naming convention is poor
- Drift Detection — Every 24 hours, the agent re-scans for new columns, renamed columns, and changed data patterns
Do not set confidence_threshold below 0.70. Low thresholds create excessive false positives — every column gets flagged as "maybe PII" — which leads to governance fatigue and teams ignoring classifications entirely. Start at 0.80 and adjust based on your false-negative tolerance.
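The first two phases can be sketched with a rule-based stand-in for the LLM. Everything below — the patterns, the max-of-signals scoring, the 0.85 name confidence — is an illustrative assumption, not the agent's actual model; it only shows why both signals are needed:

```python
import re

# Hypothetical stand-in for LLM-assisted classification: name patterns
# plus a sample-value check, combined into a single confidence score.
NAME_PATTERNS = re.compile(
    r"ssn|social_security|credit_card|email|dob|date_of_birth|phone|address|diagnosis_code"
)
SSN_VALUE = re.compile(r"^\d{3}-\d{2}-\d{4}$")  # only SSNs, for brevity

def classify_column(name, samples, threshold=0.80):
    name_conf = 0.85 if NAME_PATTERNS.search(name.lower()) else 0.0
    value_hits = sum(bool(SSN_VALUE.match(str(v))) for v in samples)
    value_conf = value_hits / len(samples) if samples else 0.0
    # Either signal alone can flag the column: a poorly named column is
    # still caught by its values (the customer_id-full-of-SSNs case).
    confidence = max(name_conf, value_conf)
    label = "RESTRICTED" if confidence >= threshold else "UNCLASSIFIED"
    return label, round(confidence, 2)

print(classify_column("cust_ssn_hash", []))                         # name signal
print(classify_column("customer_id", ["123-45-6789", "987-65-4321"]))  # value signal
```

The threshold sits exactly where the note above says it should: high enough that a weak single signal does not flood the catalog with "maybe PII" flags.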
Access Control: RBAC and ABAC #
The access_policy declaration encodes role-based (RBAC) and attribute-based (ABAC) access control:
access_policy DataAccess {
rbac: {
roles: {
"data_engineer": {
tables: ["*"],
columns: ["*"],
exclude_classifications: ["RESTRICTED"],
operations: ["SELECT", "INSERT"]
},
"analyst": {
tables: ["*_MART.*", "*_DIM.*", "*_FACT.*"],
columns: ["*"],
exclude_classifications: ["RESTRICTED", "CONFIDENTIAL"],
operations: ["SELECT"]
},
"data_scientist": {
tables: ["*"],
columns: ["*"],
masking: {
RESTRICTED: "hash",
CONFIDENTIAL: "partial_mask"
},
operations: ["SELECT"]
}
}
},
abac: {
rules: [
{
name: "geo_restriction",
condition: "user.location NOT IN ['EU'] AND data.classification == 'RESTRICTED' AND data.regulation == 'GDPR'",
action: "deny",
reason: "GDPR data cannot be accessed from non-EU locations"
},
{
name: "purpose_limitation",
condition: "query.purpose NOT IN data.allowed_purposes",
action: "deny",
reason: "Data can only be accessed for declared purposes"
}
]
},
masking: {
strategies: {
"hash": "SHA-256 one-way hash",
"partial_mask": "Show first/last N characters",
"redact": "Replace with [REDACTED]",
"tokenize": "Reversible tokenization via vault",
"generalize": "Reduce precision (ZIP5 → ZIP3)"
}
},
review: {
frequency: "quarterly",
auto_revoke_inactive: "90d",
notification: "slack_security"
}
}
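Four of the five masking strategies are simple enough to sketch directly (these implementations are illustrative; tokenization is omitted because it requires a reversible vault, and the runtime's exact formats may differ):

```python
import hashlib

def hash_mask(value: str) -> str:
    """One-way SHA-256 hash -- joinable, never reversible."""
    return hashlib.sha256(value.encode()).hexdigest()

def partial_mask(value: str, show: int = 2) -> str:
    """Show the first/last N characters, mask the middle."""
    if len(value) <= 2 * show:
        return "*" * len(value)
    return value[:show] + "*" * (len(value) - 2 * show) + value[-show:]

def redact(value: str) -> str:
    return "[REDACTED]"

def generalize_zip(zip5: str) -> str:
    """Reduce precision: ZIP5 -> ZIP3."""
    return zip5[:3] + "XX"

print(partial_mask("123-45-6789"))  # 12*******89
print(generalize_zip("94105"))      # 941XX
```

Hashing preserves joinability (the same SSN hashes to the same token across tables), which is why it is the right choice for the data_scientist role above: models can still use the column as a key without ever seeing the raw value.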
Six-Dimension Quality Scoring #
The quality_policy declaration defines six quality dimensions. Each data asset receives a composite score from 0.0 to 1.0:
| Dimension | Weight | Measurement |
|---|---|---|
| Completeness | 0.20 | % of non-null required fields |
| Accuracy | 0.20 | % passing validation rules |
| Consistency | 0.15 | Cross-source agreement rate |
| Timeliness | 0.15 | Data freshness vs. SLA |
| Uniqueness | 0.15 | Duplicate detection rate |
| Validity | 0.15 | Format/range/referential checks |
Composite score = weighted average of the six dimensions, yielding a value from 0.0 to 1.0. Pipelines are blocked when the composite score falls below the 0.85 threshold.
quality_policy DataQuality {
dimensions: {
completeness: { weight: 0.20, threshold: 0.95 },
accuracy: { weight: 0.20, threshold: 0.98 },
consistency: { weight: 0.15, threshold: 0.90 },
timeliness: { weight: 0.15, threshold: 0.95 },
uniqueness: { weight: 0.15, threshold: 0.99 },
validity: { weight: 0.15, threshold: 0.95 }
},
scoring: {
composite_threshold: 0.85,
action_below_threshold: "block_and_notify",
trend_window: "30d",
trend_alert: "declining"
},
profiling: {
schedule: "daily",
sample_size: 10000,
include_distribution: true
}
}
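The composite score is a plain weighted average. A minimal sketch using the weights from the policy above (the sample scores are hypothetical):

```python
# Weights from the quality_policy above; they sum to 1.0.
WEIGHTS = {
    "completeness": 0.20, "accuracy": 0.20, "consistency": 0.15,
    "timeliness": 0.15, "uniqueness": 0.15, "validity": 0.15,
}

def composite_score(scores, threshold=0.85):
    """Weighted average of the six dimensions, plus the action to take."""
    total = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    action = "pass" if total >= threshold else "block_and_notify"
    return round(total, 4), action

scores = {"completeness": 0.97, "accuracy": 0.99, "consistency": 0.88,
          "timeliness": 0.95, "uniqueness": 1.00, "validity": 0.92}
print(composite_score(scores))  # a healthy table: ~0.95, pass
```

Because the score is a single number per asset, a 30-day declining trend is detectable even while every individual dimension still clears its own threshold — which is exactly what the trend_alert setting watches for.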
Compliance Mapping: GDPR, CCPA, DORA #
The compliance_policy declaration maps regulatory requirements directly to technical controls:
compliance_policy RegulatoryCompliance {
regulations: {
GDPR: {
enabled: true,
scope: "EU_CUSTOMER_DATA",
requirements: {
right_to_access: {
implementation: "dsar_automation",
sla: "30d"
},
right_to_erasure: {
implementation: "delete_cascade_with_audit",
sla: "30d",
exceptions: ["legal_hold", "regulatory_retention"]
},
data_portability: {
implementation: "export_json_csv",
sla: "30d"
},
breach_notification: {
implementation: "auto_detect_and_notify",
sla: "72h",
channels: ["dpo_email", "legal_team"]
},
privacy_by_design: {
implementation: "classification_propagation",
default_classification: "CONFIDENTIAL"
}
}
},
CCPA: {
enabled: true,
scope: "CA_CONSUMER_DATA",
requirements: {
right_to_know: { implementation: "dsar_automation", sla: "45d" },
right_to_delete: { implementation: "delete_cascade_with_audit", sla: "45d" },
right_to_opt_out: { implementation: "consent_flag_check", sla: "15d" }
}
},
DORA: {
enabled: true,
scope: "FINANCIAL_SYSTEMS",
requirements: {
ict_risk_management: { implementation: "continuous_monitoring" },
incident_reporting: { implementation: "auto_classify_and_report", sla: "4h" },
resilience_testing: { implementation: "chaos_engineering_schedule", frequency: "quarterly" }
}
}
},
audit_trail: {
retention: "7y",
immutable: true,
format: "json_lines",
storage: "s3://compliance-audit/trails/"
}
}
DSAR Automation #
Data Subject Access Requests (DSARs) are a major operational burden under GDPR and CCPA. The Governance Agent automates the process:
flowchart TB
A["1. Request received (email/portal)"] --> B["2. Identity verification"]
B --> C["3. Governance Agent traces lineage: Where does data for subject X exist?"]
C --> D["ANALYTICS.CUSTOMERS (email, name, phone)"]
C --> E["RAW_VAULT.HUB_CUSTOMER (customer_key)"]
C --> F["FINANCE_MART.FACT_ORDERS (order history)"]
C --> G["S3://logs/web-events/ (click data)"]
C --> H["Salesforce CRM (via API connector)"]
D --> I["4. Generate access report / execute deletion"]
E --> I
F --> I
G --> I
H --> I
I --> J["5. Audit trail entry (immutable)"]
J --> K["6. Confirmation to data subject"]
The average DSAR takes 12-18 person-hours to fulfill manually, primarily because of the lineage discovery step: finding every table and system that contains data for a specific individual. The Governance Agent's lineage policy reduces this to minutes because it already knows where data flows.
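The lineage discovery step is a graph traversal. A minimal sketch over a hypothetical edge list (the asset names mirror the diagram above; a real lineage graph would come from the lineage policy's metadata store):

```python
from collections import deque

# Hypothetical lineage edges: asset -> downstream assets that consume it.
LINEAGE = {
    "RAW_VAULT.HUB_CUSTOMER": ["ANALYTICS.CUSTOMERS", "FINANCE_MART.FACT_ORDERS"],
    "ANALYTICS.CUSTOMERS": ["s3://logs/web-events/"],
    "FINANCE_MART.FACT_ORDERS": [],
    "s3://logs/web-events/": [],
}

def assets_for_subject(root):
    """BFS from the hub that keys the data subject: every reachable
    asset is in scope for the DSAR access report or deletion."""
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for child in LINEAGE.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

print(assets_for_subject("RAW_VAULT.HUB_CUSTOMER"))
```

The person-hours in a manual DSAR go into reconstructing this edge list by interviewing engineers; with lineage maintained continuously, the traversal itself is trivial.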
Column-Level Lineage #
The lineage_policy declaration configures automated lineage discovery:
lineage_policy DataLineage {
discovery: {
mode: "automatic",
sources: ["sql_parsing", "etl_metadata", "query_history"],
granularity: "column",
refresh_interval: "6h"
},
impact_analysis: {
enabled: true,
downstream_depth: 10,
notification: "slack_data_engineering"
},
visualization: {
format: "interactive_graph",
export: ["svg", "json"]
}
}
Column-level lineage answers questions that table-level lineage cannot:
- "If I change the data type of CUSTOMERS.phone_number, which downstream reports break?"
- "Which columns in the finance mart originate from the Oracle source vs. the Salesforce source?"
- "Is the revenue metric in Dashboard A calculated the same way as in Dashboard B?"
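Column-level lineage is also what drives classification propagation. With inheritance set to "strictest" (as in the classification policy earlier), a derived column inherits the highest sensitivity level among its inputs. An illustrative sketch — the levels match the policy, but the column names and derivation mapping are hypothetical:

```python
# Sensitivity levels from the classification_policy above.
LEVELS = {"PUBLIC": 1, "INTERNAL": 2, "CONFIDENTIAL": 3, "RESTRICTED": 4}

def propagate(source_labels, derived_from):
    """derived_from maps each derived column to the columns it is built
    from, in topological order; each inherits the strictest parent label."""
    labels = dict(source_labels)
    for col, parents in derived_from.items():
        strictest = max(parents, key=lambda p: LEVELS[labels[p]])
        labels[col] = labels[strictest]
    return labels

labels = propagate(
    {"CUSTOMERS.ssn": "RESTRICTED", "CUSTOMERS.region": "INTERNAL"},
    {"MART.cust_profile": ["CUSTOMERS.ssn", "CUSTOMERS.region"]},
)
print(labels["MART.cust_profile"])  # inherits RESTRICTED, the stricter parent
```

This is how the agent catches PII three transformation stages downstream of the source: the label travels with the column through every join and derivation.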
External Tool Connectors #
The Governance Agent does not replace existing governance tools — it orchestrates them:
external_tool CollibraSync {
type: "collibra",
connection: env("COLLIBRA_API_URL"),
credentials: vault("collibra/prod/api_key"),
sync_mode: "bidirectional",
sync_interval: "1h",
capabilities: [
"catalog_sync", "classification_sync",
"glossary_sync", "lineage_sync"
],
conflict_resolution: "external_wins"
}
external_tool AtlasSync {
type: "atlas",
connection: env("ATLAS_API_URL"),
credentials: vault("atlas/prod/creds"),
sync_mode: "pull",
sync_interval: "6h",
capabilities: ["catalog_sync", "lineage_sync"]
}
| Tool | Vendor | Capabilities | Sync Mode |
|---|---|---|---|
| Collibra | Collibra | Catalog, classification, glossary, lineage, workflow | Bidirectional |
| Apache Atlas | Apache | Catalog, lineage, classification | Pull |
| Alation | Alation | Catalog, glossary, lineage, query log analysis | Bidirectional |
| Microsoft Purview | Microsoft | Catalog, classification, lineage, access | Bidirectional |
| Informatica CDG | Informatica | Catalog, quality, classification, lineage | Bidirectional |
| Atlan | Atlan | Catalog, glossary, lineage, collaboration | Pull/Push |
| Immuta | Immuta | Access control, masking, audit | Push |
Do not set conflict_resolution: "neam_wins" when connecting to a production Collibra instance that your governance team actively maintains. Start with "external_wins" so the Governance Agent supplements the existing catalog without overwriting manual curation. Switch to "bidirectional" after trust is established.
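Conflict resolution itself reduces to a merge order. An illustrative sketch of "external_wins" vs. "neam_wins" over one catalog entry (the field names are hypothetical, and the bidirectional mode's field-level reconciliation is omitted):

```python
# Illustrative merge of a local (agent) and external (e.g. Collibra)
# catalog entry under two conflict_resolution strategies.
def resolve(local: dict, external: dict, strategy: str) -> dict:
    if strategy == "external_wins":
        return {**local, **external}   # external fields overwrite local ones
    if strategy == "neam_wins":
        return {**external, **local}   # local fields overwrite external ones
    raise ValueError(f"unknown strategy: {strategy}")

local = {"classification": "RESTRICTED", "owner": "data-platform"}
external = {"classification": "CONFIDENTIAL", "steward": "governance-team"}

print(resolve(local, external, "external_wins"))
# the manually curated external classification is kept;
# fields that exist only locally still survive the merge
```

Note that "external_wins" is not "external only": fields the external catalog has never seen (here, owner) are still contributed by the agent, which is exactly the supplement-without-overwrite behavior recommended above.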
Neam Code: A Complete Governance Agent #
// ═══════════════════════════════════════════════════════════════
// Governance Agent — Digital Data Steward for SimShop
// ═══════════════════════════════════════════════════════════════
budget GovBudget { cost: 50.00, tokens: 500000, time: 86400000 }
// ... classification_policy, access_policy, quality_policy,
// ... lineage_policy, compliance_policy defined above ...
governance agent DataSteward {
provider: "anthropic",
model: "claude-sonnet-4-6",
system: "You are a data governance agent enforcing classification, access, and compliance policies.",
temperature: 0.3,
classification: DataSensitivity,
access_control: DataAccess,
quality: DataQuality,
lineage: DataLineage,
compliance: RegulatoryCompliance,
external_tools: [CollibraSync, AtlasSync],
coordinates_with: [PlatformWatch, ETLBuilder],
reports: {
governance_dashboard: { frequency: "daily", channel: "slack_governance" },
compliance_report: { frequency: "weekly", channel: "email_compliance" },
cost_report: { frequency: "monthly", channel: "email_finance" }
},
budget: GovBudget,
agent_md: "./agents/data_steward.md"
}
// ─── Operational Commands ───
DataSteward.classify_all()
DataSteward.run_quality_scan()
let lineage = DataSteward.trace_lineage("FINANCE_MART.fact_orders")
let dsar = DataSteward.process_dsar("customer_id", "CUST-12345")
Industry Perspective #
DAMA-DMBOK Alignment #
The DAMA Data Management Body of Knowledge (DMBOK 2.0) defines 11 knowledge areas for data management. The Governance Agent maps to six of them:
| DAMA-DMBOK Knowledge Area | Governance Agent Implementation |
|---|---|
| Data Governance | governance agent orchestration |
| Data Quality | quality_policy with 6-dimension scoring |
| Data Security | access_policy with RBAC/ABAC/masking |
| Metadata Management | catalog_source, glossary, lineage_policy |
| Reference & Master Data | master_data golden record management |
| Regulatory Compliance | compliance_policy with GDPR/CCPA/DORA |
The Real Cost of Non-Compliance #
| Regulation | Maximum Fine | Notable Fines (2021-2025) |
|---|---|---|
| GDPR | 4% of global annual revenue or 20M EUR, whichever is higher | Meta: 1.2B EUR, Amazon: 746M EUR |
| CCPA | $7,500 per intentional violation | Sephora: $1.2M (first major) |
| DORA | Up to 1% of average daily global turnover | Enforcement began Jan 2025 |
The cost of a Governance Agent (LLM tokens for classification and monitoring) is negligible compared to the cost of a single compliance failure.
The Evidence #
DataSims experiments (DataSims repository) demonstrate the Governance Agent's impact on the churn prediction pipeline:
| Metric | Without Governance | With Governance | Improvement |
|---|---|---|---|
| PII Exposure Incidents | 3.4 / quarter | 0 | 100% |
| Classification Coverage | 22% (manual) | 98.7% (auto) | 4.5x |
| DSAR Fulfillment Time | 14.2 hours | 23 minutes | 97.3% |
| Access Policy Drift | 34% stale after 90 days | 0% (continuous enforcement) | 100% |
| Quality Score Availability | Ad hoc | Every table, daily | Continuous |
Ablation A4 (Governance Agent removed) in the churn prediction experiment showed two consequences: (1) PII columns were included in model training features, creating a compliance violation, and (2) access controls were not enforced during analyst queries, allowing unrestricted access to sensitive financial data. The Governance Agent's classification propagation caught PII flowing from source to mart through three transformation stages — a lineage depth that manual classification consistently misses.
Key Takeaways #
- Governance as code means policies are compilable, testable, executable, and version-controlled — not Word documents that drift from reality
- Auto-classification using LLM-assisted semantic analysis (column names + sample values) achieves 98.7% coverage vs. 22% for manual approaches
- RBAC/ABAC access policies enforce column-level masking and purpose limitation without modifying upstream pipelines
- Six-dimension quality scoring (completeness, accuracy, consistency, timeliness, uniqueness, validity) provides a single 0.0-1.0 health metric for every data asset
- Compliance policies map directly to GDPR, CCPA, and DORA requirements — including automated DSAR fulfillment
- External tool connectors orchestrate Collibra, Atlas, Alation, Purview, Informatica CDG, Atlan, and Immuta — the Governance Agent is an intelligence layer, not a replacement
- The Governance Agent governs the data estate — it never builds pipelines, moves data, or designs schemas
For Further Exploration #
- Neam Language Reference: Governance Agent
- DataSims: Simulated Enterprise Environment — 340 tables with controlled governance scenarios
- DAMA-DMBOK 2.0 (DAMA International, 2017) — the definitive data management framework
- GDPR Regulation (EU) 2016/679 — full text and recitals
- DORA Regulation (EU) 2022/2554 — Digital Operational Resilience Act