Chapter 7: Agent.MD -- Encoding Human Expertise #

"An expert is a person who has made all the mistakes that can be made in a very narrow field." -- Niels Bohr


20 min read | All personas | Part II: The Architecture

What you'll learn:


The Problem #

Kim is a data analyst who has been with her company for six years. She knows that the signup_date column in the customers table has timezone inconsistencies before March 2024 -- a migration artifact that was never cleaned up. She knows that product ratings are self-reported and skew positive because unhappy customers do not leave reviews. She knows that the support ticket sentiment scores were computed by a basic model with roughly 75% accuracy, so they should be treated as noisy features, not ground truth.

None of this knowledge exists in any system. It lives in Kim's head, in scattered Slack messages, in a Confluence page last updated eighteen months ago, and in the tribal memory of a team that has 40% annual turnover. When Kim goes on vacation, the data scientist building a churn model has no idea that the event data contains bot traffic that needs filtering.

This is the domain knowledge problem, and it is universal. Every data organization accumulates hard-won knowledge about data quirks, business definitions, methodology preferences, and known issues. The question is how to encode that knowledge so that agents -- and new team members -- can access it reliably.


Three Approaches, Two Failures #

Before we present Agent.MD, let us examine why the two obvious alternatives fail.

Approach 1: Prompts #

The simplest encoding is to stuff domain knowledge into the system prompt:

```text
System prompt (attempt):
"You are a data scientist. Note: the signup_date column has
timezone issues before March 2024. Product ratings skew positive.
Event data has bot traffic. Support ticket sentiment scores have
~75% accuracy. Use XGBoost for tabular data. Prefer AUC-ROC.
GDPR applies -- mask PII columns: email, phone, dob..."
```

This approach has five fatal flaws:

Why Prompts Fail for Domain Knowledge
  1. Ephemeral -- Prompts exist for one conversation. Close the chat, knowledge is gone. Re-open, you start from zero.
  2. Unversioned -- Who changed the prompt? When? Why? No git history, no diff, no code review, no rollback.
  3. Unreviewable -- Domain experts cannot review prompts in a pull request. Few organizations have any formal "prompt review" process.
  4. Monolithic -- One giant prompt for all agents. The ETL Agent does not need to know about SHAP preferences. The Causal Agent does not need ETL pipeline schedules. Yet both pay the token cost.
  5. Context-Window Limited -- As domain knowledge grows (and it always grows), the prompt eventually exhausts the context window. What gets cut? Nobody knows. The LLM silently drops knowledge.
Anti-Pattern

- Encoding domain knowledge in system prompts. Prompts are for behavior instructions ("be concise," "output JSON"), not for domain facts. Domain knowledge changes on a different cadence than behavior instructions and should be versioned, reviewed, and maintained separately.

Approach 2: Agent Memory #

Some frameworks give agents persistent memory -- a vector store of past interactions that the agent can recall:

```text
Agent memory (attempt):
- Stored: "In conversation on Jan 15, user mentioned timezone
  issues in signup_date before March 2024"
- Stored: "User prefers XGBoost for tabular data"
- Stored: "GDPR compliance required for EU customers"
```

This is better than prompts -- at least it persists. But it has its own problems:

Why Agent Memory Fails for Domain Knowledge
  1. Volatile -- Memory is tied to one agent instance. Deploy a new version, memory resets. Scale to multiple instances, each has different memories. Crash, and recovery is uncertain.
  2. Context-Dependent -- Memories are stored as they were discussed, not as they should be structured. "User mentioned timezone issues" is less useful than a structured entry: "Column: signup_date, Issue: timezone inconsistency, Affected Period: before 2024-03, Action: normalize to UTC before feature engineering."
  3. Non-Transferable -- Agent A's memory cannot be shared with Agent B. The ETL Agent learns about a data issue but cannot transfer that knowledge to the DataScientist Agent. Each agent builds its own silo of understanding.
  4. Unreviewable -- Same problem as prompts: no code review, no expert validation, no approval workflow. The agent might "remember" something incorrectly.
  5. Drift-Prone -- Over time, memories accumulate outdated information. "Prefer scikit-learn" from two years ago conflicts with "prefer XGBoost" from last month. There is no mechanism to resolve conflicts or deprecate old knowledge.
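The "context-dependent" problem in point 2 is easiest to see in code. Below is a minimal sketch of the structured form a known issue should take -- the field names (`column`, `issue`, `action`, `affected_before`) are illustrative, not part of any formal specification:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class KnownIssue:
    """Structured known-data-issue entry (illustrative field names)."""
    column: str
    issue: str
    action: str
    affected_before: Optional[date] = None  # None = applies to all rows

signup_tz = KnownIssue(
    column="signup_date",
    issue="timezone inconsistency",
    action="normalize to UTC before feature engineering",
    affected_before=date(2024, 3, 1),
)

def applies(issue: KnownIssue, row_date: date) -> bool:
    """An issue applies if the row falls inside its affected period."""
    return issue.affected_before is None or row_date < issue.affected_before

print(applies(signup_tz, date(2024, 1, 15)))  # True: pre-migration row
print(applies(signup_tz, date(2024, 6, 1)))   # False: clean data
```

A raw memory like "user mentioned timezone issues" cannot support the `applies()` check; the structured entry can.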

The Agent.MD Solution #

Agent.MD is a structured Markdown file that encodes domain knowledge in a form designed to avoid every failure mode above:

| Property | Prompt | Agent Memory | Agent.MD |
|---|---|---|---|
| Persistence | Ephemeral | Volatile | Persistent |
| Version control | Unversioned | Unversioned | Git-versioned |
| Reviewability | Unreviewable | Unreviewable | PR-reviewable |
| Scope | Monolithic | Per-agent silo | Composable sections |
| Content quality | Context-limited | Accumulates noise | Curated + pruned |
| Authorship | Written by engineer | Written by LLM | Written by domain expert |
| Change cadence | Changed per conversation | Drifts over time | Changed per release |
| Audit trail | No audit trail | No audit trail | Full git history |
| Result | Knowledge lost in hours | Knowledge decays in weeks | Knowledge compounds over months/years |

Full Agent.MD Structure #

An Agent.MD file contains six sections, each prefixed with @ for machine-parseable identification. Here is the complete structure, using the SimShop DIO Agent.MD as a reference:

Agent.MD Structure

# Agent.MD -- [Organization] [Agent Role]

  • @organization-context -- Company, platform, scale, data period
  • @causal-domain-knowledge (also: @data-landscape in DIO variant) -- Schemas, tables, relationships, known patterns
  • @methodology-preferences (also: @agent-preferences) -- Algorithm choices, metric preferences, tool prefs
  • @known-data-issues -- Documented quirks, biases, quality problems
  • @delegation-rules -- Ordering constraints, mandatory checks, limits
  • @etl-pipeline-catalog -- Pipeline names, schedules, dependencies
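Because the sections are @-prefixed Markdown headers, parsing them is mechanical. Here is a minimal sketch of such a parser -- the actual Neam runtime parser may differ, and the `## @section-name` header convention is an assumption based on the examples in this chapter:

```python
import re

def parse_agent_md(text: str) -> dict[str, str]:
    """Split an Agent.MD file into its @-prefixed sections.

    Assumes section headers are lines of the form '## @section-name';
    everything until the next header belongs to that section.
    """
    sections: dict[str, str] = {}
    current = None
    lines: list[str] = []
    for line in text.splitlines():
        m = re.match(r"^##\s+(@[\w-]+)", line)
        if m:
            if current is not None:
                sections[current] = "\n".join(lines).strip()
            current, lines = m.group(1), []
        elif current is not None:
            lines.append(line)
    if current is not None:
        sections[current] = "\n".join(lines).strip()
    return sections

doc = """## @organization-context
Company: SimShop (simulated e-commerce)

## @known-data-issues
- Product ratings are self-reported (potential bias)
"""
parsed = parse_agent_md(doc)
print(sorted(parsed))  # ['@known-data-issues', '@organization-context']
```

The same file a human reads top to bottom becomes a dictionary of named sections an agent runtime can load selectively.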

Section 1: @organization-context #

This section provides the business context that every agent needs to understand the operating environment:

```markdown
## @organization-context
Company: SimShop (simulated e-commerce)
Platform: PostgreSQL data warehouse + MLflow + Evidently
Scale: 100K customers, 10K products, 2M orders, 50M events
Data period: Jan 2024 - Dec 2025 (24 months)
```

Why it matters: Without organizational context, agents make generic decisions. The DataScientist Agent might suggest a deep learning approach for 100K customers when XGBoost would be more appropriate for that scale. The Governance Agent might apply HIPAA rules when only GDPR applies. Context eliminates entire categories of misguided decisions.
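To make the scale point concrete, here is a hypothetical routing rule an agent might apply once it knows the organization's scale. The thresholds are illustrative, not from the chapter; the point is that explicit scale numbers turn "pick a reasonable model" into a checkable decision:

```python
def pick_model_family(n_rows: int) -> str:
    """Hypothetical default-model routing by dataset scale (tabular data).

    Thresholds are illustrative assumptions, not organizational policy.
    """
    if n_rows < 1_000:
        return "regularized linear model"  # too little data for tree ensembles
    if n_rows < 10_000_000:
        return "gradient boosting (XGBoost/LightGBM)"
    return "distributed training required"

# SimShop scale from @organization-context: 100K customers
print(pick_model_family(100_000))  # gradient boosting (XGBoost/LightGBM)
```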

Section 2: @causal-domain-knowledge / @data-landscape #

This section maps the data terrain -- schemas, tables, relationships, and the logical flow of data through the organization:

```markdown
## @data-landscape
OLTP source: simshop_oltp schema (20 tables)
Staging: simshop_staging schema (cleaned, validated intermediate)
Warehouse: simshop_dw schema (star schema - dims + facts + aggregates)
Feature store: ml_features schema (churn, recommendation, LTV features)
Predictions: ml_predictions schema (scored outputs)
Monitoring: ml_monitoring schema (drift checks, model performance)
Catalog: data_catalog schema (Unity Catalog simulation)
Quality: data_quality schema (profiling, check results)
Operations: operational schema (pipeline definitions, runs, alerts)
```

Why it matters: The ETL Agent needs to know which schemas exist and how data flows between them. Without this section, it would need to discover the schema landscape through exploratory queries -- a time-consuming and error-prone process.

Section 3: @methodology-preferences / @agent-preferences #

This section encodes the organization's preferred approaches for each type of work:

```markdown
## @agent-preferences
DataScientist: Use XGBoost/LightGBM for tabular, prefer AUC-ROC metric
Causal: Use PyMC for Bayesian, DoWhy for identification
Testing: All quality gates mandatory, require >95% data completeness
MLOps: Canary deployment, Evidently for drift detection
```

Why it matters: Without methodology preferences, the DataScientist Agent makes its own choices. It might use random forests when the organization has standardized on gradient boosting. It might optimize for accuracy when the business cares about AUC-ROC. Preferences align agent behavior with organizational standards.

Insight

- Methodology preferences are not constraints -- they are defaults. The DataScientist Agent can deviate from "prefer XGBoost" if the data characteristics strongly favor a different algorithm (e.g., a computer vision task would not use XGBoost). The preference tells the agent "all else being equal, choose this." It is a prior, not a mandate.
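The "prior, not a mandate" idea can be sketched as a default-with-justified-override pattern. This is an illustrative sketch, not the Neam runtime API; the function and mapping names are hypothetical:

```python
from typing import Optional, Tuple

# From @methodology-preferences: organizational defaults by task type.
PREFERENCES = {"tabular": "xgboost"}

def choose_algorithm(task_type: str,
                     override: Optional[str] = None,
                     reason: Optional[str] = None) -> Tuple[str, str]:
    """Return (algorithm, justification). Deviations must be justified."""
    default = PREFERENCES.get(task_type)
    if override and override != default:
        if not reason:
            raise ValueError("deviating from a preference requires a reason")
        return override, f"override: {reason}"
    return default or "agent's choice", "organizational default"

print(choose_algorithm("tabular"))
# ('xgboost', 'organizational default')
print(choose_algorithm("vision", override="resnet",
                       reason="XGBoost does not apply to image inputs"))
```

The preference wins by default, but a deviation is allowed -- and leaves an auditable justification behind.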

Section 4: @known-data-issues #

This is the section that makes the biggest difference. It encodes the tribal knowledge that experienced team members carry in their heads:

```markdown
## @known-data-issues
- Customer data for signup_date before 2024-03 has timezone inconsistencies
- Product ratings are self-reported (potential bias)
- Event data may have bot traffic (filter user_agent patterns)
- Support ticket sentiment scores computed by basic model (accuracy ~75%)
```

Why it matters: This is Kim's knowledge from the opening of this chapter, encoded in a structured format. When the ETL Agent builds features, it knows to normalize signup_date timezones before March 2024. When the DataScientist Agent uses product ratings as features, it knows they skew positive. When the Causal Agent analyzes support ticket sentiment, it knows the signal is noisy.

Without this section, agents treat all data as equally reliable. They build features from noisy columns, train models on biased data, and draw causal conclusions from unreliable measurements. The @known-data-issues section is the difference between a senior engineer and a junior one.
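Two of these issues translate directly into pre-feature-engineering cleanup steps. Below is a sketch of how an ETL agent might apply them; column names and the bot-detection pattern are illustrative assumptions, and the real SimShop schemas may differ:

```python
import pandas as pd

def apply_known_issues(events: pd.DataFrame,
                       customers: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Apply two @known-data-issues entries before feature engineering."""
    # Known issue: event data may have bot traffic -> filter user_agent patterns
    bots = events["user_agent"].str.contains(r"bot|crawler|spider",
                                             case=False, na=False)
    events = events.loc[~bots].copy()

    # Known issue: signup_date before 2024-03 has timezone inconsistencies
    # -> normalize every timestamp to UTC before deriving features
    customers = customers.assign(
        signup_date=pd.to_datetime(customers["signup_date"], utc=True)
    )
    return events, customers

events = pd.DataFrame({"user_agent": ["Mozilla/5.0", "Googlebot/2.1"]})
customers = pd.DataFrame(
    {"signup_date": ["2024-01-15T08:00:00+02:00", "2024-06-01T08:00:00+00:00"]}
)
clean_events, clean_customers = apply_known_issues(events, customers)
print(len(clean_events))  # 1 -- the bot row is dropped
```

Without the @known-data-issues section, neither step happens, and the noise flows straight into the feature store.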

Section 5: @delegation-rules #

This section defines ordering constraints and mandatory checkpoints for the DIO:

```markdown
## @delegation-rules
- Requirements must be validated before build starts
- Governance check mandatory (GDPR - PII columns identified in catalog)
- Quality gates block deployment on any critical failure
- Max 3 retries before human escalation
```

Why it matters: Without delegation rules, the DIO might skip the governance check (faster execution, but compliance risk) or allow deployment despite quality gate failures (the model is "probably fine"). Delegation rules encode organizational risk tolerance into the orchestration logic.
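A minimal sketch of how an orchestrator might enforce these rules as hard gates. The state fields and function name are hypothetical, not the Neam runtime API; only the rule wording comes from the section above:

```python
MAX_RETRIES = 3  # from @delegation-rules: max 3 retries before escalation

def may_deploy(state: dict) -> tuple[bool, str]:
    """Check the delegation rules; return (allowed, reason)."""
    if not state.get("requirements_validated"):
        return False, "requirements must be validated before build starts"
    if not state.get("governance_check_passed"):
        return False, "governance check (GDPR/PII) is mandatory"
    if state.get("critical_quality_failures", 0) > 0:
        return False, "quality gates block deployment on critical failure"
    if state.get("retries", 0) >= MAX_RETRIES:
        return False, "max retries reached -- escalate to a human"
    return True, "all delegation rules satisfied"

ok, why = may_deploy({"requirements_validated": True,
                      "governance_check_passed": True,
                      "critical_quality_failures": 1})
print(ok, "-", why)  # False - quality gates block deployment on critical failure
```

The rules are checks that return a refusal with a reason, not suggestions the orchestrator may weigh against speed.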

Section 6: @etl-pipeline-catalog #

This section documents existing data pipelines, their schedules, and dependencies:

```markdown
## @etl-pipeline-catalog
15 ETL pipelines defined in operational.pipeline_definitions:
  raw_to_staging_customers (daily 2AM) -> staging_to_dim_customers (daily 4AM)
  raw_to_staging_orders (daily 2AM) -> staging_to_fact_orders (daily 5AM)
  raw_to_staging_events (daily 3AM) -> staging_to_fact_activity (daily 5AM)
  dw_to_churn_features (daily 6AM) -> churn_model_scoring (daily 7AM)
  dw_to_rec_features (daily 6AM) | dw_to_ltv_features (daily 6AM)
  daily_revenue_agg (daily 8AM) | data_quality_checks (daily 9AM)
  drift_detection (daily 10AM)
```

Why it matters: When the DIO plans execution, it needs to know when data is fresh. If it triggers the DataScientist Agent at 5:30 AM, the churn features table has not been refreshed yet (the dw_to_churn_features pipeline runs at 6 AM). The pipeline catalog prevents the DIO from scheduling work against stale data.
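The 5:30 AM scenario reduces to a simple freshness check against the catalog's schedule. This is a deliberately naive sketch -- a real check would consult actual run status in `operational.pipeline_runs` rather than just the scheduled start time:

```python
from datetime import time

# Illustrative subset of @etl-pipeline-catalog: pipeline -> daily run time.
SCHEDULE = {
    "dw_to_churn_features": time(6, 0),
    "churn_model_scoring": time(7, 0),
}

def data_is_fresh(pipeline: str, now: time) -> bool:
    """True once today's scheduled run time for the pipeline has passed."""
    return now >= SCHEDULE[pipeline]

print(data_is_fresh("dw_to_churn_features", time(5, 30)))  # False: stale
print(data_is_fresh("dw_to_churn_features", time(6, 15)))  # True
```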


How Each Agent Consumes Agent.MD #

Not every agent reads every section. The Neam runtime selectively loads relevant sections to keep agent context focused:

| Agent | @org-context | @data-landscape | @method-prefs | @known-issues | @deleg-rules | @etl-catalog |
|---|---|---|---|---|---|---|
| DIO | X | X | X | X | X | X |
| Data-BA | X | X | | X | | |
| ETL Agent | X | X | X | X | | X |
| DataScientist | X | | X | X | | |
| Causal | X | X | X | X | | |
| DataTest | X | | X | X | | |
| DataOps | X | X | | X | | X |
| Governance | X | | | X | X | |
| Modeling | X | | X | | | |
| Analyst | X | | X | X | | |
| MLOps | X | | X | X | | X |
| Migration | X | X | | | | |
| Deploy | X | | | | | |

X = Agent reads this section at initialization

Section Loading Flow:

```mermaid
flowchart LR
  FILE["Agent.MD file on disk"]
  ORG["@organization-context"]
  DATA["@data-landscape"]
  METH["@methodology-preferences"]
  KNOWN["@known-data-issues"]
  DELEG["@delegation-rules"]
  ETL["@etl-pipeline-catalog"]
  ALL["All agents"]
  INFRA["Infrastructure + Platform agents"]
  ANALYTICAL["Analytical agents"]
  MOST["Most agents"]
  DIOGOV["DIO + Governance"]
  ETLDATAOPS["ETL + DataOps + MLOps"]

  FILE --- ORG --> ALL
  FILE --- DATA --> INFRA
  FILE --- METH --> ANALYTICAL
  FILE --- KNOWN --> MOST
  FILE --- DELEG --> DIOGOV
  FILE --- ETL --> ETLDATAOPS
```

Each agent receives only its relevant sections in the system prompt, minimizing token usage and context noise.
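A sketch of what selective loading looks like in practice. The section-to-agent mapping below covers two agents; the `build_context` helper is hypothetical, and `sections` is assumed to come from a parsed Agent.MD file:

```python
# Which Agent.MD sections each agent loads at initialization
# (DataScientist's list is stated in the chapter; Governance's is inferred).
AGENT_SECTIONS = {
    "DataScientist": ["@organization-context", "@methodology-preferences",
                      "@known-data-issues"],
    "Governance": ["@organization-context", "@known-data-issues",
                   "@delegation-rules"],
}

def build_context(agent: str, sections: dict[str, str]) -> str:
    """Compose an agent's system context from only its relevant sections."""
    wanted = AGENT_SECTIONS[agent]
    return "\n\n".join(f"## {name}\n{sections[name]}"
                       for name in wanted if name in sections)

sections = {"@organization-context": "Company: SimShop",
            "@known-data-issues": "- ratings skew positive",
            "@etl-pipeline-catalog": "15 pipelines..."}
ctx = build_context("DataScientist", sections)
print("@etl-pipeline-catalog" in ctx)  # False: irrelevant section excluded
```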

Insight

- Selective section loading is a token optimization with a quality benefit. By giving the DataScientist Agent only @organization-context, @methodology-preferences, and @known-data-issues, its context window is focused on what matters for modeling. It does not waste attention on ETL pipeline schedules or delegation rules. Focused context means better generation quality.


The Evidence: Ablation A6 #

Ablation A6 in the DataSims experiment suite removes Agent.MD from the system while keeping all other components intact. The same DIO, the same 14 agents, the same infrastructure, the same task -- but without the structured domain knowledge.

| Metric | With Agent.MD | Without Agent.MD | Delta |
|---|---|---|---|
| Model AUC-ROC | 0.847 | 0.782 | -7.7% |
| Root cause found | Yes (support) | Partial | Degraded |
| Feature quality | 47/47 correct | 39/47 | -17% |
| Known issues caught | 4/4 | 0/4 | -100% |
| Pipeline scheduling | Optimal | Sub-optimal | Degraded |

Statistical significance:

  • p < 0.01 (Welch's t-test)
  • Cohen's d = 0.85 (large effect size)
  • 95% CI for the AUC difference: [0.055, 0.098]
  • 5 repetitions per condition

The 7.7% AUC improvement (0.782 to 0.847) is a large, statistically significant effect. In practical terms, this means the model with Agent.MD correctly identifies approximately 65 additional churning customers per 10,000 -- customers who would have been missed without the domain knowledge.

Breaking down where the improvement comes from:

Agent.MD Impact Decomposition
  1. @known-data-issues -- Feature Quality (+4.2% AUC)
    • Without: Bot traffic in events creates noisy features
    • With: Bot traffic filtered, timezone normalized, sentiment scores treated as noisy
  2. @methodology-preferences -- Algorithm Selection (+1.8% AUC)
    • Without: Agent experiments with 5 algorithms, settles on random forest (locally optimal, not globally)
    • With: Starts with XGBoost (known good for tabular), focuses tuning budget on hyperparameters
  3. @data-landscape -- Feature Discovery (+1.7% AUC)
    • Without: Agent explores schemas incrementally, misses cross-schema features (support + purchase patterns)
    • With: Knows all 9 schemas upfront, engineers features that span OLTP + DW + support data
Try It

- Run the churn prediction experiment in DataSims twice: once with the Agent.MD file, once with an empty Agent.MD. Compare the model AUC, the features selected, and the causal DAG produced. The difference is immediately visible in the feature engineering step -- without @known-data-issues, the ETL Agent includes raw signup_date without timezone normalization and includes bot traffic in the event features.


The Knowledge Compounding Effect #

Agent.MD is not static. It improves over time as the organization learns. Each project run can surface new knowledge that gets encoded back into the Agent.MD:

Knowledge Compounding Over Time

Run 1 (Month 1):

  • Agent.MD: Initial 4 known issues
  • Result: AUC = 0.847
  • Learning: Coupon-heavy customers have different churn patterns
  • Update: Add to @known-data-issues: "Coupon-heavy customers (>5 coupons/month) have 30% higher retention but 40% lower LTV -- segment separately"

Run 2 (Month 3):

  • Agent.MD: 5 known issues (original 4 + coupon insight)
  • Result: AUC = 0.861 (+1.4% from new knowledge)
  • Learning: Weekend orders have different return rates
  • Update: Add to @known-data-issues: "Weekend orders have 15% higher return rates -- include is_weekend as feature"

Run 3 (Month 6):

  • Agent.MD: 6 known issues + refined preferences
  • Result: AUC = 0.873 (a further +1.2%; +2.6% cumulative since Run 1)
  • Learning: LightGBM outperforms XGBoost on this dataset
  • Update: @methodology-preferences: "LightGBM for SimShop churn, XGBoost for recommendation"

AUC Improvement Over Time:

```mermaid
xychart-beta
  title "AUC Improvement with Agent.MD"
  x-axis ["M1", "M3", "M6"]
  y-axis "AUC" 0.77 --> 0.89
  line "With Agent.MD" [0.847, 0.861, 0.873]
  line "Without Agent.MD" [0.782, 0.782, 0.782]
```

This is the compounding effect: each run generates learnings that improve the Agent.MD, which improves the next run, which generates more learnings. The gap between "with Agent.MD" and "without Agent.MD" widens over time, not narrows.

Traditional approaches also compound knowledge, but in people's heads. When those people leave (40% annual turnover in data teams is common), the compounded knowledge leaves with them. Agent.MD stays.

Insight

- Agent.MD is the institutional memory of the data organization. It captures the lessons that would otherwise be lost to turnover, context-switching, and the passage of time. A well-maintained Agent.MD after two years of operation represents millions of dollars worth of accumulated domain expertise -- encoded in a file that any new team member (or agent) can read in minutes.


Writing an Agent.MD: Practical Guidelines #

Agent.MD is written by domain experts, not by engineers. Here are practical guidelines for each section:

| Section | Who Writes It | When Updated | Length | Tip |
|---|---|---|---|---|
| @organization-context | VP of Data / Data Architecture lead | Quarterly or on major platform changes | 5-10 lines | Include scale numbers (customers, orders, events) -- they drive compute routing decisions |
| @data-landscape | Data Engineer / Data Architect | When schemas change | 10-30 lines | List ALL schemas, even empty ones. Agents cannot use what they do not know about. |
| @methodology-preferences | Lead Data Scientist / ML Engineer | After each model retraining cycle | 5-15 lines | Be specific: "XGBoost for tabular churn," not "use good algorithms" |
| @known-data-issues | Anyone who discovers an issue | Continuously (every discovery is a PR) | 5-20 lines (grows over time) | Include the IMPACT, not just the issue. "Timezone inconsistency" becomes "Timezone inconsistency in signup_date before 2024-03, causing 2% feature error rate if not normalized" |
| @delegation-rules | Project Manager / VP of Data | On process changes or incident learnings | 5-10 lines | Frame as constraints, not preferences. "Requirements must be validated before build," not "it would be nice to have requirements" |
| @etl-pipeline-catalog | Data Engineer / DataOps | When pipelines change | 10-30 lines | Include schedules -- agents need to know when data is fresh |
Anti-Pattern

- Writing Agent.MD once and never updating it. An Agent.MD that was accurate six months ago and has not been updated since is worse than no Agent.MD at all -- it encodes outdated knowledge that agents will act on confidently. Treat Agent.MD like code: it requires maintenance, review, and periodic audits.


Industry Perspective #

Agent.MD addresses a well-documented challenge in knowledge management:

Nonaka and Takeuchi's SECI Model (1995) describes four modes of knowledge conversion: Socialization (tacit to tacit), Externalization (tacit to explicit), Combination (explicit to explicit), and Internalization (explicit to tacit). Agent.MD is the Externalization step -- converting the tacit knowledge in domain experts' heads into explicit, structured, machine-readable documentation.

DAMA-DMBOK 2.0 identifies "Data Knowledge Management" as a cross-cutting concern but provides no specific mechanism for encoding domain-specific data quality knowledge into automated systems. Agent.MD fills this gap.

Google's MLOps Maturity Model (Levels 0-4) describes Level 3 as "automated ML pipeline with human-in-the-loop monitoring" and Level 4 as "fully automated, self-improving systems." Agent.MD is the mechanism that enables the transition from Level 3 to Level 4: the domain knowledge that humans contribute is encoded in a structured, versioned format that the automated system can consume and improve upon.

In regulated industries (healthcare, finance), Agent.MD also serves as documentation of organizational knowledge for audit purposes. When a regulator asks "how does your model handle known data quality issues?", the answer is in the Agent.MD file, with full git history showing when each issue was documented and how it was addressed.


Agent.MD vs. Other Knowledge Formats #

How does Agent.MD compare to other approaches organizations have tried?

| Format | Persistent | Versioned | Structured | Machine-Readable | Human-Readable |
|---|---|---|---|---|---|
| System prompts | No | No | No | Yes | Yes |
| Agent memory | Partial | No | No | Yes | No |
| Wiki/Confluence | Yes | Partial | No | No | Yes |
| YAML config | Yes | Yes | Yes | Yes | Partial |
| JSON schema | Yes | Yes | Yes | Yes | No |
| Agent.MD | Yes | Yes | Yes | Yes | Yes |

Agent.MD combines the human-readability of documentation with the machine-parsability of structured configuration.

The key advantage is the dual readability. A domain expert reads the Agent.MD and validates the content as a human document. The Neam runtime parses the @-prefixed sections and loads them as structured context for agents. The same file serves both audiences without translation.


The Evidence #

Beyond ablation A6 (the 7.7% AUC improvement), Agent.MD's impact shows across multiple dimensions of the DataSims evaluation:

| Dimension | With Agent.MD | Without Agent.MD | Impact |
|---|---|---|---|
| Speed (hours) | 3.2 | 4.8 | -33% |
| Quality (AUC) | 0.847 | 0.782 | +8.3% |
| Reliability | 100% | 72% | +28% |
| Traceability | 95% | 60% | +35% |
| Documentation | Complete | Partial | Improved |
| Cost ($) | $34.70 | $52.10 | -33% |
| Adaptability | 4/4 issues | 0/4 issues | +100% |
Statistical Summary: All differences significant at p < 0.05 (Bonferroni corrected). Composite Effectiveness Score: 1.42x with Agent.MD

Agent.MD does not just improve model quality. It reduces execution time (agents do not waste time exploring known issues), reduces cost (focused context means fewer LLM tokens), and improves reliability (agents avoid known data traps). The 33% cost reduction alone justifies maintaining Agent.MD -- the engineering effort to keep it updated is a fraction of the LLM cost savings.


Key Takeaways #

For Further Exploration #