Chapter 7: Agent.MD -- Encoding Human Expertise #

"An expert is a person who has made all the mistakes that can be made in a very narrow field." -- Niels Bohr


20 min read | All personas | Part II: The Architecture

What you'll learn:


The Problem #

Kim is a data analyst who has been with her company for six years. She knows that the signup_date column in the customers table has timezone inconsistencies before March 2024 -- a migration artifact that was never cleaned up. She knows that product ratings are self-reported and skew positive because unhappy customers do not leave reviews. She knows that the support ticket sentiment scores were computed by a basic model with roughly 75% accuracy, so they should be treated as noisy features, not ground truth.

None of this knowledge exists in any system. It lives in Kim's head, in scattered Slack messages, in a Confluence page last updated eighteen months ago, and in the tribal memory of a team that has 40% annual turnover. When Kim goes on vacation, the data scientist building a churn model has no idea that the event data contains bot traffic that needs filtering.

This is the domain knowledge problem, and it is universal. Every data organization accumulates hard-won knowledge about data quirks, business definitions, methodology preferences, and known issues. The question is how to encode that knowledge so that agents -- and new team members -- can access it reliably.


Three Approaches, Two Failures #

Before we present Agent.MD, let us examine why the two obvious alternatives fail.

Approach 1: Prompts #

The simplest encoding is to stuff domain knowledge into the system prompt:

```text
System prompt (attempt):
"You are a data scientist. Note: the signup_date column has
timezone issues before March 2024. Product ratings skew positive.
Event data has bot traffic. Support ticket sentiment scores have
~75% accuracy. Use XGBoost for tabular data. Prefer AUC-ROC.
GDPR applies -- mask PII columns: email, phone, dob..."
```

This approach has five fatal flaws:

Why Prompts Fail for Domain Knowledge
  1. Ephemeral -- Prompts exist for one conversation. Close the chat, knowledge is gone. Re-open, you start from zero.
  2. Unversioned -- Who changed the prompt? When? Why? No git history, no diff, no code review, no rollback.
  3. Unreviewable -- Domain experts cannot review prompts in a pull request. Few organizations have any formal "prompt review" process.
  4. Monolithic -- One giant prompt for all agents. The ETL Agent does not need to know about SHAP preferences. The Causal Agent does not need ETL pipeline schedules. Yet both pay the token cost.
  5. Context-Window Limited -- As domain knowledge grows (and it always grows), the prompt eventually exhausts the context window. What gets cut? Nobody knows. The LLM silently drops knowledge.
Anti-Pattern

- Encoding domain knowledge in system prompts. Prompts are for behavior instructions ("be concise," "output JSON"), not for domain facts. Domain knowledge changes on a different cadence than behavior instructions and should be versioned, reviewed, and maintained separately.

Approach 2: Agent Memory #

Some frameworks give agents persistent memory -- a vector store of past interactions that the agent can recall:

```text
Agent memory (attempt):
- Stored: "In conversation on Jan 15, user mentioned timezone
  issues in signup_date before March 2024"
- Stored: "User prefers XGBoost for tabular data"
- Stored: "GDPR compliance required for EU customers"
```

This is better than prompts -- at least it persists. But it has its own problems:

Why Agent Memory Fails for Domain Knowledge
  1. Volatile -- Memory is tied to one agent instance. Deploy a new version, memory resets. Scale to multiple instances, each has different memories. Crash, and recovery is uncertain.
  2. Context-Dependent -- Memories are stored as they were discussed, not as they should be structured. "User mentioned timezone issues" is less useful than a structured entry: "Column: signup_date, Issue: timezone inconsistency, Affected Period: before 2024-03, Action: normalize to UTC before feature engineering."
  3. Non-Transferable -- Agent A's memory cannot be shared with Agent B. The ETL Agent learns about a data issue but cannot transfer that knowledge to the DataScientist Agent. Each agent builds its own silo of understanding.
  4. Unreviewable -- Same problem as prompts: no code review, no expert validation, no approval workflow. The agent might "remember" something incorrectly.
  5. Drift-Prone -- Over time, memories accumulate outdated information. "Prefer scikit-learn" from two years ago conflicts with "prefer XGBoost" from last month. There is no mechanism to resolve conflicts or deprecate old knowledge.
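The "context-dependent" problem in point 2 is easiest to see in code. Below is a minimal sketch of the structured form a known issue should take -- the field names (`column`, `issue`, `action`, `affected_before`) are illustrative, not part of any formal specification:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class KnownIssue:
    """Structured known-data-issue entry (illustrative field names)."""
    column: str
    issue: str
    action: str
    affected_before: Optional[date] = None  # None = applies to all rows

signup_tz = KnownIssue(
    column="signup_date",
    issue="timezone inconsistency",
    action="normalize to UTC before feature engineering",
    affected_before=date(2024, 3, 1),
)

def applies(issue: KnownIssue, row_date: date) -> bool:
    """An issue applies if the row falls inside its affected period."""
    return issue.affected_before is None or row_date < issue.affected_before

print(applies(signup_tz, date(2024, 1, 15)))  # True: pre-migration row
print(applies(signup_tz, date(2024, 6, 1)))   # False: clean data
```

A raw memory like "user mentioned timezone issues" cannot support the `applies()` check; the structured entry can.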

The Agent.MD Solution #

Agent.MD is a structured Markdown file that encodes domain knowledge in a form designed to avoid every failure mode above:

| Property | Prompt | Agent Memory | Agent.MD |
|---|---|---|---|
| Persistence | Ephemeral | Volatile | Persistent |
| Version control | Unversioned | Unversioned | Git-versioned |
| Reviewability | Unreviewable | Unreviewable | PR-reviewable |
| Scope | Monolithic | Per-agent silo | Composable sections |
| Content quality | Context-limited | Accumulates noise | Curated + pruned |
| Authorship | Written by engineer | Written by LLM | Written by domain expert |
| Change cadence | Changed per conversation | Drifts over time | Changed per release |
| Audit trail | No audit trail | No audit trail | Full git history |
| Result | Knowledge lost in hours | Knowledge decays in weeks | Knowledge compounds over months/years |

Full Agent.MD Structure #

An Agent.MD file contains six sections, each prefixed with @ for machine-parseable identification. Here is the complete structure, using the SimShop DIO Agent.MD as a reference:

Agent.MD Structure

# Agent.MD -- [Organization] [Agent Role]

  • @organization-context -- Company, platform, scale, data period
  • @causal-domain-knowledge (also: @data-landscape in DIO variant) -- Schemas, tables, relationships, known patterns
  • @methodology-preferences (also: @agent-preferences) -- Algorithm choices, metric preferences, tool prefs
  • @known-data-issues -- Documented quirks, biases, quality problems
  • @delegation-rules -- Ordering constraints, mandatory checks, limits
  • @etl-pipeline-catalog -- Pipeline names, schedules, dependencies
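Because the sections are @-prefixed Markdown headers, parsing them is mechanical. Here is a minimal sketch of such a parser -- the actual Neam runtime parser may differ, and the `## @section-name` header convention is an assumption based on the examples in this chapter:

```python
import re

def parse_agent_md(text: str) -> dict[str, str]:
    """Split an Agent.MD file into its @-prefixed sections.

    Assumes section headers are lines of the form '## @section-name';
    everything until the next header belongs to that section.
    """
    sections: dict[str, str] = {}
    current = None
    lines: list[str] = []
    for line in text.splitlines():
        m = re.match(r"^##\s+(@[\w-]+)", line)
        if m:
            if current is not None:
                sections[current] = "\n".join(lines).strip()
            current, lines = m.group(1), []
        elif current is not None:
            lines.append(line)
    if current is not None:
        sections[current] = "\n".join(lines).strip()
    return sections

doc = """## @organization-context
Company: SimShop (simulated e-commerce)

## @known-data-issues
- Product ratings are self-reported (potential bias)
"""
parsed = parse_agent_md(doc)
print(sorted(parsed))  # ['@known-data-issues', '@organization-context']
```

The same file a human reads top to bottom becomes a dictionary of named sections an agent runtime can load selectively.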

Section 1: @organization-context #

This section provides the business context that every agent needs to understand the operating environment:

```markdown
## @organization-context
Company: SimShop (simulated e-commerce)
Platform: PostgreSQL data warehouse + MLflow + Evidently
Scale: 100K customers, 10K products, 2M orders, 50M events
Data period: Jan 2024 - Dec 2025 (24 months)
```

Why it matters: Without organizational context, agents make generic decisions. The DataScientist Agent might suggest a deep learning approach for 100K customers when XGBoost would be more appropriate for that scale. The Governance Agent might apply HIPAA rules when only GDPR applies. Context eliminates entire categories of misguided decisions.
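To make the scale point concrete, here is a hypothetical routing rule an agent might apply once it knows the organization's scale. The thresholds are illustrative, not from the chapter; the point is that explicit scale numbers turn "pick a reasonable model" into a checkable decision:

```python
def pick_model_family(n_rows: int) -> str:
    """Hypothetical default-model routing by dataset scale (tabular data).

    Thresholds are illustrative assumptions, not organizational policy.
    """
    if n_rows < 1_000:
        return "regularized linear model"  # too little data for tree ensembles
    if n_rows < 10_000_000:
        return "gradient boosting (XGBoost/LightGBM)"
    return "distributed training required"

# SimShop scale from @organization-context: 100K customers
print(pick_model_family(100_000))  # gradient boosting (XGBoost/LightGBM)
```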

Section 2: @causal-domain-knowledge / @data-landscape #

This section maps the data terrain -- schemas, tables, relationships, and the logical flow of data through the organization:

```markdown
## @data-landscape
OLTP source: simshop_oltp schema (20 tables)
Staging: simshop_staging schema (cleaned, validated intermediate)
Warehouse: simshop_dw schema (star schema - dims + facts + aggregates)
Feature store: ml_features schema (churn, recommendation, LTV features)
Predictions: ml_predictions schema (scored outputs)
Monitoring: ml_monitoring schema (drift checks, model performance)
Catalog: data_catalog schema (Unity Catalog simulation)
Quality: data_quality schema (profiling, check results)
Operations: operational schema (pipeline definitions, runs, alerts)
```

Why it matters: The ETL Agent needs to know which schemas exist and how data flows between them. Without this section, it would need to discover the schema landscape through exploratory queries -- a time-consuming and error-prone process.

Section 3: @methodology-preferences / @agent-preferences #

This section encodes the organization's preferred approaches for each type of work:

```markdown
## @agent-preferences
DataScientist: Use XGBoost/LightGBM for tabular, prefer AUC-ROC metric
Causal: Use PyMC for Bayesian, DoWhy for identification
Testing: All quality gates mandatory, require >95% data completeness
MLOps: Canary deployment, Evidently for drift detection
```

Why it matters: Without methodology preferences, the DataScientist Agent makes its own choices. It might use random forests when the organization has standardized on gradient boosting. It might optimize for accuracy when the business cares about AUC-ROC. Preferences align agent behavior with organizational standards.

Insight

- Methodology preferences are not constraints -- they are defaults. The DataScientist Agent can deviate from "prefer XGBoost" if the data characteristics strongly favor a different algorithm (e.g., a computer vision task would not use XGBoost). The preference tells the agent "all else being equal, choose this." It is a prior, not a mandate.
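The "prior, not a mandate" idea can be sketched as a default-with-justified-override pattern. This is an illustrative sketch, not the Neam runtime API; the function and mapping names are hypothetical:

```python
from typing import Optional, Tuple

# From @methodology-preferences: organizational defaults by task type.
PREFERENCES = {"tabular": "xgboost"}

def choose_algorithm(task_type: str,
                     override: Optional[str] = None,
                     reason: Optional[str] = None) -> Tuple[str, str]:
    """Return (algorithm, justification). Deviations must be justified."""
    default = PREFERENCES.get(task_type)
    if override and override != default:
        if not reason:
            raise ValueError("deviating from a preference requires a reason")
        return override, f"override: {reason}"
    return default or "agent's choice", "organizational default"

print(choose_algorithm("tabular"))
# ('xgboost', 'organizational default')
print(choose_algorithm("vision", override="resnet",
                       reason="XGBoost does not apply to image inputs"))
```

The preference wins by default, but a deviation is allowed -- and leaves an auditable justification behind.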

Section 4: @known-data-issues #

This is the section that makes the biggest difference. It encodes the tribal knowledge that experienced team members carry in their heads:

```markdown
## @known-data-issues
- Customer data for signup_date before 2024-03 has timezone inconsistencies
- Product ratings are self-reported (potential bias)
- Event data may have bot traffic (filter user_agent patterns)
- Support ticket sentiment scores computed by basic model (accuracy ~75%)
```

Why it matters: This is Kim's knowledge from the opening of this chapter, encoded in a structured format. When the ETL Agent builds features, it knows to normalize signup_date timezones before March 2024. When the DataScientist Agent uses product ratings as features, it knows they skew positive. When the Causal Agent analyzes support ticket sentiment, it knows the signal is noisy.

Without this section, agents treat all data as equally reliable. They build features from noisy columns, train models on biased data, and draw causal conclusions from unreliable measurements. The @known-data-issues section is the difference between a senior engineer and a junior one.
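Two of these issues translate directly into pre-feature-engineering cleanup steps. Below is a sketch of how an ETL agent might apply them; column names and the bot-detection pattern are illustrative assumptions, and the real SimShop schemas may differ:

```python
import pandas as pd

def apply_known_issues(events: pd.DataFrame,
                       customers: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Apply two @known-data-issues entries before feature engineering."""
    # Known issue: event data may have bot traffic -> filter user_agent patterns
    bots = events["user_agent"].str.contains(r"bot|crawler|spider",
                                             case=False, na=False)
    events = events.loc[~bots].copy()

    # Known issue: signup_date before 2024-03 has timezone inconsistencies
    # -> normalize every timestamp to UTC before deriving features
    customers = customers.assign(
        signup_date=pd.to_datetime(customers["signup_date"], utc=True)
    )
    return events, customers

events = pd.DataFrame({"user_agent": ["Mozilla/5.0", "Googlebot/2.1"]})
customers = pd.DataFrame(
    {"signup_date": ["2024-01-15T08:00:00+02:00", "2024-06-01T08:00:00+00:00"]}
)
clean_events, clean_customers = apply_known_issues(events, customers)
print(len(clean_events))  # 1 -- the bot row is dropped
```

Without the @known-data-issues section, neither step happens, and the noise flows straight into the feature store.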

Section 5: @delegation-rules #

This section defines ordering constraints and mandatory checkpoints for the DIO:

```markdown
## @delegation-rules
- Requirements must be validated before build starts
- Governance check mandatory (GDPR - PII columns identified in catalog)
- Quality gates block deployment on any critical failure
- Max 3 retries before human escalation
```

Why it matters: Without delegation rules, the DIO might skip the governance check (faster execution, but compliance risk) or allow deployment despite quality gate failures (the model is "probably fine"). Delegation rules encode organizational risk tolerance into the orchestration logic.
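A minimal sketch of how an orchestrator might enforce these rules as hard gates. The state fields and function name are hypothetical, not the Neam runtime API; only the rule wording comes from the section above:

```python
MAX_RETRIES = 3  # from @delegation-rules: max 3 retries before escalation

def may_deploy(state: dict) -> tuple[bool, str]:
    """Check the delegation rules; return (allowed, reason)."""
    if not state.get("requirements_validated"):
        return False, "requirements must be validated before build starts"
    if not state.get("governance_check_passed"):
        return False, "governance check (GDPR/PII) is mandatory"
    if state.get("critical_quality_failures", 0) > 0:
        return False, "quality gates block deployment on critical failure"
    if state.get("retries", 0) >= MAX_RETRIES:
        return False, "max retries reached -- escalate to a human"
    return True, "all delegation rules satisfied"

ok, why = may_deploy({"requirements_validated": True,
                      "governance_check_passed": True,
                      "critical_quality_failures": 1})
print(ok, "-", why)  # False - quality gates block deployment on critical failure
```

The rules are checks that return a refusal with a reason, not suggestions the orchestrator may weigh against speed.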

Section 6: @etl-pipeline-catalog #

This section documents existing data pipelines, their schedules, and dependencies:

```markdown
## @etl-pipeline-catalog
15 ETL pipelines defined in operational.pipeline_definitions:
  raw_to_staging_customers (daily 2AM) -> staging_to_dim_customers (daily 4AM)
  raw_to_staging_orders (daily 2AM) -> staging_to_fact_orders (daily 5AM)
  raw_to_staging_events (daily 3AM) -> staging_to_fact_activity (daily 5AM)
  dw_to_churn_features (daily 6AM) -> churn_model_scoring (daily 7AM)
  dw_to_rec_features (daily 6AM) | dw_to_ltv_features (daily 6AM)
  daily_revenue_agg (daily 8AM) | data_quality_checks (daily 9AM)
  drift_detection (daily 10AM)
```

Why it matters: When the DIO plans execution, it needs to know when data is fresh. If it triggers the DataScientist Agent at 5:30 AM, the churn features table has not been refreshed yet (the dw_to_churn_features pipeline runs at 6 AM). The pipeline catalog prevents the DIO from scheduling work against stale data.
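The 5:30 AM scenario reduces to a simple freshness check against the catalog's schedule. This is a deliberately naive sketch -- a real check would consult actual run status in `operational.pipeline_runs` rather than just the scheduled start time:

```python
from datetime import time

# Illustrative subset of @etl-pipeline-catalog: pipeline -> daily run time.
SCHEDULE = {
    "dw_to_churn_features": time(6, 0),
    "churn_model_scoring": time(7, 0),
}

def data_is_fresh(pipeline: str, now: time) -> bool:
    """True once today's scheduled run time for the pipeline has passed."""
    return now >= SCHEDULE[pipeline]

print(data_is_fresh("dw_to_churn_features", time(5, 30)))  # False: stale
print(data_is_fresh("dw_to_churn_features", time(6, 15)))  # True
```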


How Each Agent Consumes Agent.MD #

Not every agent reads every section. The Neam runtime selectively loads relevant sections to keep agent context focused:

| Agent | @org-context | @data-landscape | @method-prefs | @known-issues | @deleg-rules | @etl-catalog |
|---|---|---|---|---|---|---|
| DIO | X | X | X | X | X | X |
| Data-BA | X | X | | X | | |
| ETL Agent | X | X | X | X | | X |
| DataScientist | X | | X | X | | |
| Causal | X | X | X | X | | |
| DataTest | X | | X | X | | |
| DataOps | X | X | | X | | X |
| Governance | X | | | X | X | |
| Modeling | X | | X | | | |
| Analyst | X | | X | X | | |
| MLOps | X | | X | X | | X |
| Migration | X | X | | | | |
| Deploy | X | | | | | |

X = Agent reads this section at initialization

Section Loading Flow:

```mermaid
flowchart LR
  FILE["Agent.MD file on disk"]
  ORG["@organization-context"]
  DATA["@data-landscape"]
  METH["@methodology-preferences"]
  KNOWN["@known-data-issues"]
  DELEG["@delegation-rules"]
  ETL["@etl-pipeline-catalog"]
  ALL["All agents"]
  INFRA["Infrastructure + Platform agents"]
  ANALYTICAL["Analytical agents"]
  MOST["Most agents"]
  DIOGOV["DIO + Governance"]
  ETLDATAOPS["ETL + DataOps + MLOps"]

  FILE --- ORG --> ALL
  FILE --- DATA --> INFRA
  FILE --- METH --> ANALYTICAL
  FILE --- KNOWN --> MOST
  FILE --- DELEG --> DIOGOV
  FILE --- ETL --> ETLDATAOPS
```

Each agent receives only its relevant sections in the system prompt, minimizing token usage and context noise.
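A sketch of what selective loading looks like in practice. The section-to-agent mapping below covers two agents; the `build_context` helper is hypothetical, and `sections` is assumed to come from a parsed Agent.MD file:

```python
# Which Agent.MD sections each agent loads at initialization
# (DataScientist's list is stated in the chapter; Governance's is inferred).
AGENT_SECTIONS = {
    "DataScientist": ["@organization-context", "@methodology-preferences",
                      "@known-data-issues"],
    "Governance": ["@organization-context", "@known-data-issues",
                   "@delegation-rules"],
}

def build_context(agent: str, sections: dict[str, str]) -> str:
    """Compose an agent's system context from only its relevant sections."""
    wanted = AGENT_SECTIONS[agent]
    return "\n\n".join(f"## {name}\n{sections[name]}"
                       for name in wanted if name in sections)

sections = {"@organization-context": "Company: SimShop",
            "@known-data-issues": "- ratings skew positive",
            "@etl-pipeline-catalog": "15 pipelines..."}
ctx = build_context("DataScientist", sections)
print("@etl-pipeline-catalog" in ctx)  # False: irrelevant section excluded
```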

Insight

- Selective section loading is a token optimization with a quality benefit. By giving the DataScientist Agent only @organization-context, @methodology-preferences, and @known-data-issues, its context window is focused on what matters for modeling. It does not waste attention on ETL pipeline schedules or delegation rules. Focused context means better generation quality.


The Evidence: Ablation A6 #

Ablation A6 in the DataSims experiment suite removes Agent.MD from the system while keeping all other components intact. The same DIO, the same 14 agents, the same infrastructure, the same task -- but without the structured domain knowledge.

| Metric | With Agent.MD | Without Agent.MD | Delta |
|---|---|---|---|
| Model AUC-ROC | 0.847 | 0.782 | -7.7% |
| Root cause found | Yes (support) | Partial | Degraded |
| Feature quality | 47/47 correct | 39/47 | -17% |
| Known issues caught | 4/4 | 0/4 | -100% |
| Pipeline scheduling | Optimal | Sub-optimal | Degraded |

Statistical significance:

  • p < 0.01 (Welch's t-test)
  • Cohen's d = 0.85 (large effect size)
  • 95% CI for the AUC difference: [0.055, 0.098]
  • 5 repetitions per condition

The 7.7% AUC improvement (0.782 to 0.847) is a large, statistically significant effect. In practical terms, this means the model with Agent.MD correctly identifies approximately 65 additional churning customers per 10,000 -- customers who would have been missed without the domain knowledge.

Breaking down where the improvement comes from:

Agent.MD Impact Decomposition
  1. @known-data-issues -- Feature Quality (+4.2% AUC)
    • Without: Bot traffic in events creates noisy features
    • With: Bot traffic filtered, timezone normalized, sentiment scores treated as noisy
  2. @methodology-preferences -- Algorithm Selection (+1.8% AUC)
    • Without: Agent experiments with 5 algorithms, settles on random forest (locally optimal, not globally)
    • With: Starts with XGBoost (known good for tabular), focuses tuning budget on hyperparameters
  3. @data-landscape -- Feature Discovery (+1.7% AUC)
    • Without: Agent explores schemas incrementally, misses cross-schema features (support + purchase patterns)
    • With: Knows all 9 schemas upfront, engineers features that span OLTP + DW + support data
Try It

- Run the churn prediction experiment in DataSims twice: once with the Agent.MD file, once with an empty Agent.MD. Compare the model AUC, the features selected, and the causal DAG produced. The difference is immediately visible in the feature engineering step -- without @known-data-issues, the ETL Agent includes raw signup_date without timezone normalization and includes bot traffic in the event features.


The Knowledge Compounding Effect #

Agent.MD is not static. It improves over time as the organization learns. Each project run can surface new knowledge that gets encoded back into the Agent.MD:

Knowledge Compounding Over Time

Run 1 (Month 1):

  • Agent.MD: Initial 4 known issues
  • Result: AUC = 0.847
  • Learning: Coupon-heavy customers have different churn patterns
  • Update: Add to @known-data-issues: "Coupon-heavy customers (>5 coupons/month) have 30% higher retention but 40% lower LTV -- segment separately"

Run 2 (Month 3):

  • Agent.MD: 5 known issues (original 4 + coupon insight)
  • Result: AUC = 0.861 (+1.4% from new knowledge)
  • Learning: Weekend orders have different return rates
  • Update: Add to @known-data-issues: "Weekend orders have 15% higher return rates -- include is_weekend as feature"

Run 3 (Month 6):

  • Agent.MD: 6 known issues + refined preferences
  • Result: AUC = 0.873 (a further +1.2%; +2.6% cumulative since Run 1)
  • Learning: LightGBM outperforms XGBoost on this dataset
  • Update: @methodology-preferences: "LightGBM for SimShop churn, XGBoost for recommendation"

AUC Improvement Over Time:

```mermaid
xychart-beta
  title "AUC Improvement with Agent.MD"
  x-axis ["M1", "M3", "M6"]
  y-axis "AUC" 0.77 --> 0.89
  line "With Agent.MD" [0.847, 0.861, 0.873]
  line "Without Agent.MD" [0.782, 0.782, 0.782]
```

This is the compounding effect: each run generates learnings that improve the Agent.MD, which improves the next run, which generates more learnings. The gap between "with Agent.MD" and "without Agent.MD" widens over time, not narrows.

Traditional approaches also compound knowledge, but in people's heads. When those people leave (40% annual turnover in data teams is common), the compounded knowledge leaves with them. Agent.MD stays.

Insight

- Agent.MD is the institutional memory of the data organization. It captures the lessons that would otherwise be lost to turnover, context-switching, and the passage of time. A well-maintained Agent.MD after two years of operation represents millions of dollars worth of accumulated domain expertise -- encoded in a file that any new team member (or agent) can read in minutes.


Writing an Agent.MD: Practical Guidelines #

Agent.MD is written by domain experts, not by engineers. Here are practical guidelines for each section:

| Section | Who Writes It | When Updated | Length | Tip |
|---|---|---|---|---|
| @organization-context | VP of Data / Data Architecture lead | Quarterly or on major platform changes | 5-10 lines | Include scale numbers (customers, orders, events) -- they drive compute routing decisions |
| @data-landscape | Data Engineer / Data Architect | When schemas change | 10-30 lines | List ALL schemas, even empty ones. Agents cannot use what they do not know about. |
| @methodology-preferences | Lead Data Scientist / ML Engineer | After each model retraining cycle | 5-15 lines | Be specific: "XGBoost for tabular churn," not "use good algorithms" |
| @known-data-issues | Anyone who discovers an issue | Continuously (every discovery is a PR) | 5-20 lines (grows over time) | Include the IMPACT, not just the issue. "Timezone inconsistency" becomes "Timezone inconsistency in signup_date before 2024-03, causing 2% feature error rate if not normalized" |
| @delegation-rules | Project Manager / VP of Data | On process changes or incident learnings | 5-10 lines | Frame as constraints, not preferences. "Requirements must be validated before build," not "it would be nice to have requirements" |
| @etl-pipeline-catalog | Data Engineer / DataOps | When pipelines change | 10-30 lines | Include schedules -- agents need to know when data is fresh |
Anti-Pattern

- Writing Agent.MD once and never updating it. An Agent.MD that was accurate six months ago and has not been updated since is worse than no Agent.MD at all -- it encodes outdated knowledge that agents will act on confidently. Treat Agent.MD like code: it requires maintenance, review, and periodic audits.


Industry Perspective #

Agent.MD addresses a well-documented challenge in knowledge management:

Nonaka and Takeuchi's SECI Model (1995) describes four modes of knowledge conversion: Socialization (tacit to tacit), Externalization (tacit to explicit), Combination (explicit to explicit), and Internalization (explicit to tacit). Agent.MD is the Externalization step -- converting the tacit knowledge in domain experts' heads into explicit, structured, machine-readable documentation.

DAMA-DMBOK 2.0 identifies "Data Knowledge Management" as a cross-cutting concern but provides no specific mechanism for encoding domain-specific data quality knowledge into automated systems. Agent.MD fills this gap.

Google's MLOps Maturity Model (Levels 0-4) describes Level 3 as "automated ML pipeline with human-in-the-loop monitoring" and Level 4 as "fully automated, self-improving systems." Agent.MD is the mechanism that enables the transition from Level 3 to Level 4: the domain knowledge that humans contribute is encoded in a structured, versioned format that the automated system can consume and improve upon.

In regulated industries (healthcare, finance), Agent.MD also serves as documentation of organizational knowledge for audit purposes. When a regulator asks "how does your model handle known data quality issues?", the answer is in the Agent.MD file, with full git history showing when each issue was documented and how it was addressed.


Agent.MD vs. Other Knowledge Formats #

How does Agent.MD compare to other approaches organizations have tried?

| Format | Persistent | Versioned | Structured | Machine-Readable | Human-Readable |
|---|---|---|---|---|---|
| System prompts | No | No | No | Yes | Yes |
| Agent memory | Partial | No | No | Yes | No |
| Wiki/Confluence | Yes | Partial | No | No | Yes |
| YAML config | Yes | Yes | Yes | Yes | Partial |
| JSON schema | Yes | Yes | Yes | Yes | No |
| Agent.MD | Yes | Yes | Yes | Yes | Yes |

Agent.MD combines the human-readability of documentation with the machine-parsability of structured configuration.

The key advantage is the dual readability. A domain expert reads the Agent.MD and validates the content as a human document. The Neam runtime parses the @-prefixed sections and loads them as structured context for agents. The same file serves both audiences without translation.


The Evidence #

Beyond ablation A6 (the 7.7% AUC improvement), Agent.MD's impact shows across multiple dimensions of the DataSims evaluation:

| Dimension | With Agent.MD | Without Agent.MD | Impact |
|---|---|---|---|
| Speed (hours) | 3.2 | 4.8 | -33% |
| Quality (AUC) | 0.847 | 0.782 | +8.3% |
| Reliability | 100% | 72% | +28% |
| Traceability | 95% | 60% | +35% |
| Documentation | Complete | Partial | Improved |
| Cost ($) | $34.70 | $52.10 | -33% |
| Adaptability | 4/4 issues | 0/4 issues | +100% |
Statistical Summary: All differences significant at p < 0.05 (Bonferroni corrected). Composite Effectiveness Score: 1.42x with Agent.MD

Agent.MD does not just improve model quality. It reduces execution time (agents do not waste time exploring known issues), reduces cost (focused context means fewer LLM tokens), and improves reliability (agents avoid known data traps). The 33% cost reduction alone justifies maintaining Agent.MD -- the engineering effort to keep it updated is a fraction of the LLM cost savings.


Key Takeaways #

For Further Exploration #