Case Study: Autonomous Code Builder #
The previous case studies demonstrated agents that answer questions, search documents,
and manage data pipelines. This chapter tackles a fundamentally different challenge:
building software autonomously. The system we construct here takes a feature requirement
in plain English, generates a plan, writes Python code with tests using a forge agent,
verifies correctness through test execution and coverage checks, uses git checkpoints
to preserve every verified step, tracks progress and learnings across iterations, and
integrates with a claw agent "Project Manager" that orchestrates the entire workflow
via spawn.
By the end of this case study, you will have a complete TDD-driven autonomous code builder that combines the forge agent's iterative build-verify loop with the claw agent's persistent session management and the Orchestrable trait's spawn/delegate callbacks. This is the full NeamClaw build workflow in action.
Writing code is a multi-step, iterative process. A human developer reads requirements, plans an approach, writes code, runs tests, reads error messages, fixes bugs, and repeats until the tests pass. No single LLM call can replicate this workflow reliably. The forge agent encodes the entire cycle into a language construct: write, verify, checkpoint, repeat. Combined with a claw agent PM that manages requirements and tracks progress across features, you get a system that mirrors how real engineering teams operate -- a project manager assigns work, a developer builds it iteratively, and the PM reviews the result.
Requirements #
An autonomous code builder must satisfy seven core requirements:
- Feature intake. The system accepts feature requirements in natural language. A user describes what they want, and the system produces working code.
- Plan generation. Before writing any code, the system generates a step-by-step plan that decomposes the feature into discrete, testable tasks.
- TDD code generation. A forge agent writes Python source code and test files following test-driven development. The agent never modifies pre-existing tests.
- Automated verification. After each iteration, the system runs pytest, checks that all tests pass, validates that code coverage meets 80%, and runs pylint for static analysis. Verification is external and deterministic.
- Git checkpoints. Every verified task produces a git commit. The commit history provides a clean audit trail of incremental progress.
- Progress and learnings tracking. The system records completed tasks, iteration counts, costs, and cross-iteration insights that carry forward into subsequent iterations via the fresh context model.
- Project Manager integration. A claw agent PM receives feature requests, creates plans, spawns the forge agent Builder, monitors progress, and reports results with full session persistence.
Architecture Design #
The claw agent PM handles the conversational interface and orchestration. The forge
agent Builder handles the iterative code generation loop. They communicate through
spawn, with the PM launching the Builder and receiving results through the
Orchestrable trait callbacks.
Workspace Layout #
The forge agent operates inside a structured workspace directory:
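A representative layout, assembled from the files this chapter creates and reads later (the spawn setup writes plan.md, requirements.txt, and the src/ and tests/ packages; the runtime maintains progress.jsonl and learnings.jsonl):

```
workspace/
├── plan.md            # numbered build tasks (the task board)
├── progress.jsonl     # one line per verified task
├── learnings.jsonl    # cross-iteration insights
├── requirements.txt   # pytest, pytest-cov, pylint
├── src/               # implementation code
│   └── __init__.py
└── tests/             # pytest test files
    └── __init__.py
```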
Think of the workspace as a developer's desk. The plan is the task board. The progress file is the standup report. The learnings file is the team wiki. The forge agent sits down at this desk fresh each iteration, reads the board, checks the wiki, does its work, and commits. The next iteration is a fresh developer sitting down at the same desk.
The Project Manager #
The Project Manager is a claw agent that receives feature requests, creates plans, spawns the Builder, and reports results. Session persistence lets users ask follow-up questions about previous builds.
// ================================================================
// Project Manager -- Claw Agent
// ================================================================
channel pm_cli {
type: "cli"
prompt: "pm> "
greeting: "Autonomous Code Builder ready. Describe a feature to build."
}
channel pm_http {
type: "http"
port: 8080
path: "/pm/chat"
}
skill create_plan {
description: "Create a step-by-step build plan from a feature description"
params: { feature_description: string }
impl(feature_description) {
let plan = ProjectManager.ask(
"Generate a TDD build plan for the following feature. "
+ "Output ONLY the plan as a numbered list, one task per line. "
+ "Each task must be small enough to implement in a single file. "
+ "Each task must have a clear test criterion.\n\n"
+ "Feature: " + feature_description
);
return plan;
}
}
skill check_build_status {
description: "Check the status of the current or most recent build"
params: { workspace_path: string }
impl(workspace_path) {
let progress = workspace_read(workspace_path + "/progress.jsonl");
if (progress == nil) {
return "No build in progress. No progress file found.";
}
return "Build progress:\n" + progress;
}
}
claw agent ProjectManager {
provider: "anthropic"
model: "claude-sonnet-4-20250514"
temperature: 0.4
system: "You are a Project Manager for an autonomous code builder.
YOUR RESPONSIBILITIES:
1. Receive feature requests from the user.
2. Create a clear, step-by-step TDD build plan.
3. Launch the Builder agent to implement the plan.
4. Report build results back to the user.
5. Answer follow-up questions about builds.
RULES:
- Always create a plan before launching a build.
- Each plan task must be small, focused, and testable.
- Report both successes and failures honestly.
- Track costs and iterations for the user."
channels: [pm_cli, pm_http]
skills: [create_plan, check_build_status]
session: {
idle_reset_minutes: 120
daily_reset_hour: 4
max_history_turns: 100
compaction: "auto"
}
semantic_memory: {
backend: "sqlite"
embedding_model: "nomic-embed-text"
search: "hybrid"
top_k: 5
}
}
Separating plan creation from code generation gives you a review checkpoint. The PM can present the plan to the user before spawning the Builder, letting the user approve, modify, or reject the plan before spending any build budget.
The Forge Agent #
The Builder is a forge agent that executes the TDD build loop. It reads the plan, writes code, and relies on the verify callback to validate its work.
// ================================================================
// Builder -- Forge Agent
// ================================================================
forge agent Builder {
provider: "anthropic"
model: "claude-sonnet-4-20250514"
system: "You are a senior Python developer practicing strict TDD.
YOUR WORKFLOW FOR EACH TASK:
1. Read the current task description carefully.
2. Use read_file to examine existing source and test files.
3. Use list_files to understand the project structure.
4. Write test files FIRST using write_file (if tests do not exist yet).
5. Write implementation code using write_file.
6. Do NOT run tests yourself -- the verification system handles that.
CODE STANDARDS:
- Use Python type hints on all function signatures.
- Write docstrings for all public functions and classes.
- Follow PEP 8 naming conventions (snake_case).
- Keep functions small and focused (under 30 lines).
- Use dataclasses for data models where appropriate.
CRITICAL RULES:
- Never modify pre-existing test files that you did not create.
- Never run pytest yourself. The external verifier does this.
- Never assume a file exists -- always read it first.
- If you receive feedback about a test failure, read the error carefully
and fix only the specific issue identified."
temperature: 0.2
skills: [write_file, read_file, exec_command, list_files]
verify: fun(ctx) {
// Run pytest with coverage
let result = exec(
"cd " + ctx.workspace_path
+ " && python -m pytest tests/ -q --cov=src --cov-report=term-missing",
60
);
if (result["exit_code"] != 0) {
return f"Tests failed:\n{result['stderr']}\n{result['stdout']}";
}
// Parse coverage percentage from pytest-cov output
let cov_match = regex_find(
result["stdout"],
"TOTAL\\s+\\d+\\s+\\d+\\s+(\\d+)%"
);
if (cov_match != "" && num(cov_match) < 80) {
return f"Coverage is {cov_match}%, need at least 80%. Add more tests.";
}
// Run pylint for static analysis
let lint = exec(
"cd " + ctx.workspace_path
+ " && python -m pylint src/ --disable=C0114,C0115,C0116 --fail-under=7.0",
30
);
if (lint["exit_code"] != 0) {
return f"Lint errors:\n{lint['stdout']}";
}
return true;
}
max_iterations: 15
max_cost: 5.0
max_tokens: 500000
checkpoint: "git"
files: ["src/**/*.py", "tests/**/*.py"]
}
Each field explained:

| Field | Value | Purpose |
|---|---|---|
| provider | "anthropic" | LLM provider for code generation |
| model | "claude-sonnet-4-20250514" | Model with strong coding ability |
| system | (multi-line) | Detailed TDD instructions and coding standards |
| temperature | 0.2 | Low temperature for deterministic code output |
| skills | [write_file, ...] | File I/O and command execution capabilities |
| verify | inline function | External verification: pytest + coverage + pylint |
| max_iterations | 15 | Upper bound on build-verify cycles |
| max_cost | 5.0 | Maximum USD spend before BudgetExhausted |
| max_tokens | 500000 | Maximum token consumption across all iterations |
| checkpoint | "git" | Git commit after each verified task |
| files | ["src/**/*.py", ...] | File patterns the agent can modify |
Setting max_iterations too low forces the agent to succeed on the first or
second try, but TDD workflows inherently require iteration. A value of 15 gives
a 5-task plan an average of three attempts per task -- the initial attempt plus
two retries.
The Verify Callback #
The verify callback is the quality gate that makes the autonomous builder reliable. Let us examine the verify function in depth.
The Context Object #
The verify callback receives a context object (ctx) with five fields:
| Field | Type | Description |
|---|---|---|
| ctx.iteration | int | Current iteration number, starting from 1 |
| ctx.workspace_path | string | Absolute path to the forge workspace directory |
| ctx.files_changed | list | List of file paths modified during this iteration |
| ctx.cost_so_far | float | Total USD spent across all iterations so far |
| ctx.tokens_so_far | int | Total tokens consumed across all iterations so far |
Return Values #
The verify function controls the loop through its return value:
| Return Value | Effect |
|---|---|
| true | Done. Checkpoint, mark task complete, advance to next task. |
| false or nil | Retry. Re-run the current task with no additional feedback. |
| "abort" | Abort. Stop the entire loop immediately. |
| Any other string | Retry with feedback. The string becomes context in the next iteration. |
The VerifyResult Sealed Type #
For structured control flow, Neam provides the VerifyResult sealed type:
sealed VerifyResult {
Done,
Retry(feedback: string),
Abort(reason: string)
}
| Variant | Behavior |
|---|---|
| VerifyResult.Done | Checkpoint, mark task complete, advance. If no tasks remain, return Completed. |
| VerifyResult.Retry(feedback) | Feed feedback into the next iteration's context. |
| VerifyResult.Abort(reason) | Stop immediately. Return LoopOutcome.Aborted(reason). |
Full Verify Implementation #
fun verify_build(ctx) {
// ── Step 1: Run pytest with coverage ────────────────────────────
let test_result = exec(
"cd " + ctx.workspace_path
+ " && python -m pytest tests/ -q --cov=src --cov-report=term-missing",
60
);
if (test_result["exit_code"] != 0) {
if (ctx.iteration >= 12) {
return VerifyResult.Abort(
f"Tests still failing after {ctx.iteration} iterations. "
+ "Manual intervention required."
);
}
let feedback = f"Tests failed on iteration {ctx.iteration}.\n";
feedback = feedback + f"Output:\n{test_result['stdout']}\n";
feedback = feedback + f"Errors:\n{test_result['stderr']}\n";
feedback = feedback + "Fix ONLY the specific issue identified.";
return VerifyResult.Retry(feedback);
}
// ── Step 2: Parse coverage percentage ───────────────────────────
let cov_match = regex_find(
test_result["stdout"],
"TOTAL\\s+\\d+\\s+\\d+\\s+(\\d+)%"
);
if (cov_match != "" && num(cov_match) < 80) {
return VerifyResult.Retry(
f"Tests pass but coverage is {cov_match}%, need at least 80%. "
+ "Add test cases for uncovered code paths."
);
}
// ── Step 3: Run pylint ──────────────────────────────────────────
let lint_result = exec(
"cd " + ctx.workspace_path
+ " && python -m pylint src/ --disable=C0114,C0115,C0116 --fail-under=7.0",
30
);
if (lint_result["exit_code"] != 0) {
return VerifyResult.Retry(
f"Lint check failed:\n{lint_result['stdout']}\n"
+ "Fix the lint issues identified above."
);
}
// ── All checks passed ──────────────────────────────────────────
return VerifyResult.Done;
}
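The coverage check in Step 2 hinges on parsing the TOTAL line of pytest-cov's terminal summary. A minimal Python sketch of that parse, using the same regular expression as the verify function (parse_total_coverage is an illustrative helper, not part of the Neam runtime):

```python
import re

def parse_total_coverage(pytest_stdout: str):
    """Extract the TOTAL percentage from pytest-cov's terminal summary.

    The summary ends with a line like: TOTAL  120  10  92%
    Returns the percentage as an int, or None if no TOTAL line is found.
    """
    match = re.search(r"TOTAL\s+\d+\s+\d+\s+(\d+)%", pytest_stdout)
    return int(match.group(1)) if match else None

sample = "src/models.py   40   2   95%\nTOTAL  120  10  92%"
print(parse_total_coverage(sample))  # -> 92
```

Returning None (rather than failing) when no TOTAL line exists mirrors the verify function's behavior of skipping the coverage gate when the regex finds nothing.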
The LoopOutcome Sealed Type #
When .run() completes, it returns a LoopOutcome value:
sealed LoopOutcome {
Completed,
MaxIterations,
Aborted(reason: string),
BudgetExhausted
}
Handle the outcome with a match expression:
let outcome = Builder.run();
match outcome {
Completed => {
emit "All tasks completed and verified.";
emit f"Iterations: {outcome.iterations}, Cost: ${outcome.total_cost}";
},
MaxIterations => {
emit f"Reached {outcome.iterations} iterations.";
emit f"Tasks done: {outcome.tasks_completed}/{outcome.tasks_total}";
},
Aborted(reason) => emit f"Build aborted: {reason}",
BudgetExhausted => emit f"Budget exhausted. Spent ${outcome.total_cost}."
}
Always log outcome.total_cost in production. If you see a task consistently
requiring more than 3-4 retries, the task is probably too broad and should be
decomposed into smaller steps.
Skills #
The Builder requires four skills for interacting with the workspace. Each skill wraps a built-in workspace function.
write_file #
skill write_file {
description: "Write content to a file in the workspace. Creates parent
directories if they do not exist."
params: { path: string, content: string }
impl(path, content) {
workspace_write(path, content);
return "Wrote " + path + " (" + str(len(content)) + " bytes)";
}
}
read_file #
skill read_file {
description: "Read the contents of a file in the workspace."
params: { path: string }
impl(path) {
let content = workspace_read(path);
if (content == nil) {
return "File not found: " + path;
}
return content;
}
}
exec_command #
skill exec_command {
description: "Execute a shell command in the workspace directory. Do NOT use
this to run tests -- the verification system handles testing automatically."
params: { command: string }
impl(command) {
let result = exec(command, 30);
let output = "Exit code: " + str(result["exit_code"]) + "\n";
output = output + "stdout:\n" + result["stdout"] + "\n";
if (result["stderr"] != "") {
output = output + "stderr:\n" + result["stderr"];
}
return output;
}
}
list_files #
skill list_files {
description: "List files matching a glob pattern in the workspace."
params: { pattern: string }
impl(pattern) {
let files = glob(pattern);
if (len(files) == 0) {
return "No files match pattern: " + pattern;
}
return join(files, "\n");
}
}
Giving the agent exec_command without restricting test execution leads to the
agent running tests itself and claiming success based on its own interpretation.
The system prompt forbids this, and the verify callback is the only authority on
test results. This separation of concerns is what makes the builder trustworthy.
The .run() Pipeline #
When you call Builder.run(), the forge runtime executes a multi-stage pipeline:
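Summarized from the behavior described in the rest of this section, the stages are:

```
1. Load plan.md; read progress.jsonl to find the current task
2. Build a fresh context: system prompt + current task + learnings
3. Let the LLM act through its skills (read_file, write_file, ...)
4. Run the verify callback
5. Done  -> checkpoint, append to progress.jsonl, advance to the next task
   Retry -> re-enter the loop, carrying the feedback string
   Abort -> stop with LoopOutcome.Aborted(reason)
6. Stop when tasks are exhausted or an iteration/cost/token limit trips
```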
Fresh Context Model #
The critical insight is that context is built fresh on every iteration. The VM discards all messages from the previous iteration. It re-reads the plan, progress, and learnings from disk. This means:
- The context window size is bounded and predictable, regardless of iteration count.
- The system prompt always receives maximum attention (no drift).
- The agent on iteration 10 follows instructions as faithfully as on iteration 1.
- The filesystem is the continuity mechanism, not the conversation history.
If the agent wrote src/models.py during iteration 3, iteration 4 does not "remember"
writing it. But the file is on disk, and the agent can read it with read_file. The
world carries the state. The agent carries none.
Plan Injection and Learnings Accumulation #
On each iteration, the runtime injects the current task and a progress summary:
"Current task: Implement the User class with name, email, and password_hash fields"
"Completed tasks: [1] Models module (iter 2), [2] Repository (iter 5)"
Learnings from learnings.jsonl are also included:
"Learnings from previous iterations:
- This project uses dataclasses, not plain classes
- The repository expects async methods
- Test files use pytest fixtures defined in conftest.py"
This gives the fresh-context agent access to hard-won knowledge without carrying the full conversation history.
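The shape of that injected block is simple to reproduce. A Python sketch, assuming the learnings.jsonl schema shown above (format_learnings is an illustrative helper, not a runtime API):

```python
import json

def format_learnings(jsonl_text: str) -> str:
    """Render learnings.jsonl entries as the bulleted block injected into context."""
    entries = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if not entries:
        return ""  # nothing learned yet: inject nothing
    bullets = "\n".join("- " + e["learning"] for e in entries)
    return "Learnings from previous iterations:\n" + bullets

jsonl = '{"learning": "This project uses dataclasses", "iteration": 1, "source": "code_analysis"}\n'
print(format_learnings(jsonl))
```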
Checkpoint Strategies #
| Strategy | Value | What It Does | When to Use |
|---|---|---|---|
| Git | "git" | git add -A && git commit -m "forge: <task>" | Code projects in git repos |
| Snapshot | "snapshot" | Copy workspace to .neam/snapshots/<iteration>/ | Binary files, non-git projects |
| None | "none" | No checkpointing | Development, ephemeral workspaces |
Git Checkpoints in Detail #
With checkpoint: "git", the forge runtime creates a commit after each verified
task. Retries do not produce commits.
┌─────────────────────────────────────────────────────────────────────┐
│ Git History After a 5-Task Build │
│ │
│ Iteration 1: Verify → Done │
│ ● commit: "forge: Implement User class" │
│ │ │
│ Iteration 2: Verify → Retry (test failure) │
│ │ (no commit) │
│ │ │
│ Iteration 3: Verify → Done │
│ ● commit: "forge: Implement UserRepository" │
│ │ │
│ Iteration 4: Verify → Done │
│ ● commit: "forge: Implement AuthService" │
│ │ │
│ Iteration 5: Verify → Retry (coverage low) │
│ │ (no commit) │
│ │ │
│ Iteration 6: Verify → Done │
│ ● commit: "forge: Add input validation" │
│ │ │
│ Iteration 7: Verify → Done │
│ ● commit: "forge: Add integration tests" │
│ │
│ Result: 5 clean commits, 7 iterations total │
└─────────────────────────────────────────────────────────────────────┘
Every commit represents verified, working code. Use git log to see build
progression, git diff between commits to see what changed, and git revert to
undo a specific task.
Snapshot Checkpoints #
Snapshots create a full filesystem copy to .neam/snapshots/<iteration>/. Useful
for workspaces with binary artifacts that git handles poorly. The tradeoff is disk
space -- each snapshot is a complete copy.
Start with checkpoint: "none" during development. Switch to "git" once your
verify function is reliable. The overhead is negligible and the rollback safety
net is invaluable.
Plan Files and Progress Tracking #
The plan file, progress file, and learnings file form the forge agent's external memory system.
Plan File Format #
# Build Plan: User Authentication Module
1. Create the User dataclass with name, email, and password_hash fields in src/models.py
2. Create the UserRepository class with create, find_by_id, and find_by_email methods
3. Create the AuthService class with register and login methods
4. Add input validation to the register method (email format, password length)
5. Write integration tests for the full registration and login flow
The runtime parses numbered tasks, skipping blank lines and # comments.
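That parse rule is easy to pin down. A Python sketch of a compatible parser, under the assumption that tasks follow the "N. description" form shown above (parse_plan is an illustrative helper, not the runtime's actual implementation):

```python
import re

def parse_plan(text: str) -> list:
    """Extract numbered task descriptions, skipping blank lines and '#' comments."""
    tasks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment/heading lines
        match = re.match(r"(\d+)\.\s+(.*)", line)
        if match:
            tasks.append(match.group(2))
    return tasks

plan = "# Build Plan: User Authentication Module\n1. Create the User dataclass\n\n2. Create the UserRepository class\n"
print(parse_plan(plan))  # -> ['Create the User dataclass', 'Create the UserRepository class']
```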
Progress File (progress.jsonl) #
The runtime appends a line each time verify returns Done:
{"task": 1, "description": "Create User dataclass", "status": "done", "iteration": 2, "timestamp": "2026-02-18T10:23:45Z", "cost": 0.42, "tokens": 12840}
{"task": 2, "description": "Create UserRepository", "status": "done", "iteration": 5, "timestamp": "2026-02-18T10:31:12Z", "cost": 1.15, "tokens": 34200}
If the program crashes and restarts, the runtime reads progress.jsonl and resumes
from the last completed task -- built-in resumability.
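The resume logic reduces to "highest completed task number, plus one." A Python sketch against the progress.jsonl schema above (next_task is an illustrative helper):

```python
import json

def next_task(progress_jsonl: str) -> int:
    """Return the next task number to run, given progress.jsonl contents.

    An empty or missing file means nothing is done: start at task 1.
    """
    done = [json.loads(line)["task"]
            for line in progress_jsonl.splitlines() if line.strip()]
    return max(done, default=0) + 1

print(next_task(""))                                              # -> 1
print(next_task('{"task": 1, "status": "done"}\n{"task": 2, "status": "done"}'))  # -> 3
```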
Learnings File (learnings.jsonl) #
Stores cross-iteration insights:
{"learning": "This project uses dataclasses for all models", "iteration": 1, "source": "code_analysis"}
{"learning": "The repository uses async methods -- all database calls must be awaited", "iteration": 4, "source": "test_failure"}
{"learning": "Email validation uses the re module, not a third-party library", "iteration": 6, "source": "lint_feedback"}
The learnings file creates a feedback loop across iterations:
┌──────────────────────────────────────────────────────────────────┐
│ Iteration 3: Agent writes async code without await │
│ → Test fails → Verify returns Retry with error feedback │
│ │
│ Iteration 4: Agent reads feedback, fixes code, verify passes │
│ → Verify writes learning: "repository uses async methods" │
│ │
│ Iteration 5: New task, fresh context │
│ → Context includes learning about async methods │
│ → Agent correctly writes async code from the start │
│ → Verify passes on first try (no retry needed) │
└──────────────────────────────────────────────────────────────────┘
The learnings file is like a team wiki. When a developer discovers that "the CI requires Python 3.11, not 3.10," they add it to the wiki. The next developer reads the wiki first and avoids the same mistake.
Spawning from the PM #
The PM uses spawn to launch the Builder, passing the workspace and plan. The
Orchestrable trait callbacks provide hooks for logging.
Spawn Call #
fun launch_build(feature_description) {
// Step 1: Generate the plan
let plan_content = create_plan.call({ feature_description: feature_description });
// Step 2: Set up the workspace
let workspace = "./workspace/" + slugify(feature_description);
workspace_write(workspace + "/plan.md", plan_content);
workspace_write(workspace + "/requirements.txt", "pytest\npytest-cov\npylint\n");
workspace_write(workspace + "/src/__init__.py", "");
workspace_write(workspace + "/tests/__init__.py", "");
// Step 3: Initialize git
exec("cd " + workspace + " && git init && git add -A && git commit -m 'initial'", 10);
// Step 4: Spawn the Builder
let build_result = spawn(Builder, {
workspace: workspace,
plan: plan_content
});
return build_result;
}
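The slugify call above turns a free-text feature description into a safe directory name. The chapter does not define it; a minimal Python sketch of the assumed behavior:

```python
import re

def slugify(text: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

print(slugify("Build a User Auth Module!"))  # -> build-a-user-auth-module
```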
The Orchestrable Trait #
impl Orchestrable for ProjectManager {
fn on_spawn(self, child_name) {
emit f"[PM] Spawning builder agent: {child_name}";
workspace_append(
"build_log.jsonl",
f'{{"event": "spawn", "agent": "{child_name}", '
+ f'"timestamp": "{now()}"}}\n'
);
return nil;
}
fn on_delegate(self, result) {
let preview = str(result);
if (len(preview) > 200) {
preview = preview[0:200] + "...";
}
emit f"[PM] Build result received: {preview}";
workspace_append(
"build_log.jsonl",
f'{{"event": "delegate_result", "preview": "{preview}", '
+ f'"timestamp": "{now()}"}}\n'
);
return nil;
}
}
Spawn-Delegate Lifecycle #
┌─────────────────────────────────────────────────────────────────────┐
│ PM Spawn-Delegate Flow │
│ │
│ User: "Build a user auth module" │
│ │ │
│ ▼ │
│ PM: Generate plan → spawn(Builder, { workspace, plan }) │
│ │ │
│ ├─── on_spawn(self, "Builder") triggered │
│ │ │
│ ▼ │
│ Builder: Runs forge loop (iterations 1..N) │
│ │ │
│ ▼ │
│ Builder: Returns LoopOutcome │
│ │ │
│ ├─── on_delegate(self, result) triggered │
│ │ │
│ ▼ │
│ PM: Reports to user │
│ PM: "Build complete. 5 tasks verified in 7 iterations. Cost: $2.15" │
└─────────────────────────────────────────────────────────────────────┘
Result Handling #
fun report_build_result(outcome) {
match outcome {
Completed => {
return f"Build completed successfully.\n"
+ f"Tasks: {outcome.tasks_completed}/{outcome.tasks_total}\n"
+ f"Iterations: {outcome.iterations}\n"
+ f"Cost: ${outcome.total_cost}\n"
+ "All tests pass. Code is committed to the workspace.";
},
MaxIterations => {
return f"Build reached the iteration limit.\n"
+ f"Completed {outcome.tasks_completed} of {outcome.tasks_total} tasks.";
},
Aborted(reason) => {
return f"Build was aborted: {reason}\n"
+ "Review the workspace and progress.jsonl for details.";
},
BudgetExhausted => {
return f"Build exceeded its budget (${outcome.total_cost}).\n"
+ "Consider increasing max_cost or simplifying the plan.";
}
}
}
Testing the System #
Testing the Verify Function #
Test the verify function in isolation with known-good and known-bad workspaces:
// test_verify.neam -- Run with: neamc test test_verify.neam
fun test_verify_passes_with_good_code() {
let ws = "/tmp/test_workspace_good";
exec("mkdir -p " + ws + "/src " + ws + "/tests", 5);
workspace_write(ws + "/src/__init__.py", "");
workspace_write(ws + "/src/models.py",
"class User:\n def __init__(self, name: str, email: str):\n"
+ " self.name = name\n self.email = email\n");
workspace_write(ws + "/tests/__init__.py", "");
workspace_write(ws + "/tests/test_models.py",
"from src.models import User\n\n"
+ "def test_user_creation():\n"
+ " user = User('Alice', 'alice@example.com')\n"
+ " assert user.name == 'Alice'\n");
let ctx = { "iteration": 1, "workspace_path": ws,
"files_changed": ["src/models.py"], "cost_so_far": 0.5,
"tokens_so_far": 15000 };
let result = verify_build(ctx);
assert(result == VerifyResult.Done, "Expected Done for passing tests");
}
fun test_verify_retries_on_failure() {
let ws = "/tmp/test_workspace_fail";
exec("mkdir -p " + ws + "/src " + ws + "/tests", 5);
workspace_write(ws + "/src/__init__.py", "");
workspace_write(ws + "/src/models.py",
"class User:\n def __init__(self, name: str):\n"
+ " self.name = name\n");
workspace_write(ws + "/tests/__init__.py", "");
workspace_write(ws + "/tests/test_models.py",
"from src.models import User\n\n"
+ "def test_user_has_email():\n"
+ " user = User('Alice', 'alice@example.com')\n"
+ " assert user.email == 'alice@example.com'\n");
let ctx = { "iteration": 1, "workspace_path": ws,
"files_changed": ["src/models.py"], "cost_so_far": 0.5,
"tokens_so_far": 15000 };
let result = verify_build(ctx);
assert(is_retry(result), "Expected Retry for failing tests");
}
fun test_verify_aborts_after_max_retries() {
let ws = "/tmp/test_workspace_abort";
exec("mkdir -p " + ws + "/src " + ws + "/tests", 5);
workspace_write(ws + "/src/__init__.py", "");
workspace_write(ws + "/src/models.py", "invalid python syntax !!!");
workspace_write(ws + "/tests/__init__.py", "");
workspace_write(ws + "/tests/test_models.py", "def test_something(): pass");
let ctx = { "iteration": 12, "workspace_path": ws,
"files_changed": ["src/models.py"], "cost_so_far": 4.5,
"tokens_so_far": 400000 };
let result = verify_build(ctx);
assert(is_abort(result), "Expected Abort after iteration 12");
}
Testing Spawn Integration #
fun test_full_build_cycle() {
let outcome = launch_build("Create a simple calculator with add and multiply");
assert(outcome.tasks_completed > 0, "Should complete at least one task");
let git_log = exec("cd ./workspace/calculator && git log --oneline", 5);
assert(git_log["stdout"].contains("forge:"), "Should have forge commits");
}
Run all tests:
neamc test test_verify.neam test_integration.neam
Running with neam-forge CLI #
The neam-forge CLI runs the Builder directly from the terminal, without the PM.
Basic Usage #
# Run with all defaults
neam-forge --agent code_builder.neam
# Run with overrides
neam-forge --agent code_builder.neam \
--workspace ./workspace/auth-feature \
--max-iterations 10 \
--max-cost 3.0 \
--verbose
CLI Output #
Progress messages go to stderr (with --verbose). The LoopOutcome prints as
JSON to stdout:
$ neam-forge --agent code_builder.neam --verbose
# stderr:
# [forge] Compiling code_builder.neam...
# [forge] Found forge agent: Builder
# [forge] Plan: 5 tasks loaded from plan.md
# [forge] Iteration 1: Task 1 → Verify → Done
# [forge] Iteration 2: Task 2 → Verify → Retry
# [forge] Iteration 3: Task 2 → Verify → Done
# ...
# [forge] All tasks completed.
# stdout:
{"outcome": "completed", "iterations": 7, "tasks_completed": 5, "tasks_total": 5, "total_cost": 2.15, "total_tokens": 78400}
| Exit Code | Meaning |
|---|---|
| 0 | Forge loop completed successfully |
| 1 | Compilation failure, missing agent, or runtime error |
Pipe the JSON output to jq for scripting: neam-forge --agent code_builder.neam | jq '.total_cost'
Docker Deployment #
Package the system as Docker containers: the PM runs as neam-api, the Builder
runs as neam-forge.
Dockerfile #
# Autonomous Code Builder -- Docker Image
# Stage 1: Build Neam binaries
FROM ubuntu:24.04 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential cmake g++ libcurl4-openssl-dev libssl-dev ca-certificates \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY CMakeLists.txt ./
COPY deps/ deps/
COPY NeamC/ NeamC/
RUN cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc)
# Stage 2: Runtime
FROM ubuntu:24.04
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates curl libcurl4 libssl3 git \
python3 python3-pip python3-venv \
&& rm -rf /var/lib/apt/lists/*
COPY --from=builder /build/build/neamc /usr/local/bin/neamc
COPY --from=builder /build/build/neam /usr/local/bin/neam
COPY --from=builder /build/build/neam-api /usr/local/bin/neam-api
COPY --from=builder /build/build/neam-forge /usr/local/bin/neam-forge
RUN useradd --create-home --shell /bin/bash neam
USER neam
WORKDIR /home/neam
RUN python3 -m venv /home/neam/venv
ENV PATH="/home/neam/venv/bin:$PATH"
RUN pip install pytest pytest-cov pylint
RUN mkdir -p /home/neam/sessions /home/neam/workspace /home/neam/agents
COPY --chown=neam:neam agents/ /home/neam/agents/
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD curl -f http://localhost:8080/health || exit 1
ENTRYPOINT ["neam-api"]
CMD ["--program", "/home/neam/agents/code_builder.neam", "--port", "8080"]
Note: python3 and python3-venv are included because the Builder runs Python
tests. The virtual environment pre-installs pytest, pytest-cov, and pylint.
Docker Compose #
services:
pm:
build: .
ports:
- "8080:8080"
environment:
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- NEAM_API_KEY=${NEAM_API_KEY}
volumes:
- session-data:/home/neam/sessions
- workspace-data:/home/neam/workspace
command: ["--program", "/home/neam/agents/code_builder.neam",
"--port", "8080", "--workers", "2"]
builder:
build: .
environment:
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
volumes:
- workspace-data:/home/neam/workspace
entrypoint: ["neam-forge"]
command: ["--agent", "/home/neam/agents/code_builder.neam",
"--name", "Builder"]
profiles:
- build
volumes:
session-data:
workspace-data:
CI/CD Integration #
# .github/workflows/auto-build.yml
name: Autonomous Build
on:
workflow_dispatch:
inputs:
feature:
description: "Feature to build"
required: true
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run autonomous builder
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
neam-forge --agent agents/code_builder.neam \
--workspace ./workspace --max-cost 5.0
- uses: actions/upload-artifact@v4
with:
name: build-output
path: workspace/
A common pitfall: forgetting to mount the workspace volume as shared between the PM and Builder containers. The PM writes the plan and the Builder reads it; with separate volumes, the Builder will never find the plan file.
Monitoring the Build Loop #
Reading progress.jsonl #
fun monitor_build(workspace_path) {
let progress = workspace_read(workspace_path + "/progress.jsonl");
if (progress == nil) { emit "No build in progress."; return; }
let lines = split(progress, "\n");
let tasks_done = 0;
for (line in lines) {
if (line == "") { continue; }
let entry = json_parse(line);
tasks_done = tasks_done + 1;
emit f"Task {entry['task']}: {entry['description']} "
+ f"(iteration {entry['iteration']}, ${entry['cost']})";
}
emit f"\nSummary: {tasks_done} tasks completed.";
}
Cost Tracking Dashboard #
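Because every verified task appends its cost to progress.jsonl, a dashboard is mostly an aggregation over those entries. A Python sketch of the core rollup (cost_summary is an illustrative helper built on the schema shown earlier in this chapter):

```python
import json

def cost_summary(progress_jsonl: str) -> dict:
    """Total spend and per-task cost, aggregated from progress.jsonl entries."""
    per_task = {}
    total = 0.0
    for line in progress_jsonl.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        per_task[entry["task"]] = entry["cost"]  # one cost per verified task
        total += entry["cost"]
    return {"total_cost": round(total, 2), "per_task": per_task}

progress = (
    '{"task": 1, "status": "done", "cost": 0.42, "tokens": 12840}\n'
    '{"task": 2, "status": "done", "cost": 1.15, "tokens": 34200}\n'
)
print(cost_summary(progress))  # -> {'total_cost': 1.57, 'per_task': {1: 0.42, 2: 1.15}}
```

Feed the same rollup into your alerting threshold (e.g., warn at 80% of max_cost) to catch runaway retry loops before the budget is exhausted.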
Alerting on Budget Exhaustion #
fun handle_build_outcome(outcome) {
match outcome {
BudgetExhausted => {
exec("curl -X POST https://alerts.example.com/webhook "
+ "-H 'Content-Type: application/json' "
+ "-d '{\"severity\":\"warning\",\"message\":\"Budget exhausted\"}'", 10);
},
Aborted(reason) => {
exec("curl -X POST https://alerts.example.com/webhook "
+ "-H 'Content-Type: application/json' "
+ "-d '{\"severity\":\"error\",\"message\":\"Build aborted\"}'", 10);
},
_ => {}
}
}
Lessons Learned #
1. Verify design is the most important decision. A verify function that is too strict causes excessive retries and budget exhaustion. One that is too lenient lets broken code through. The sweet spot is correctness (tests pass) plus a reasonable quality bar (80% coverage, no critical lint errors).
2. Git checkpoints pay for themselves immediately. The ability to git diff
between verified tasks is invaluable for debugging. The overhead is under 100ms per
checkpoint. Always use checkpoint: "git" for code generation workloads.
3. Cost management requires active monitoring. A forge agent can exhaust its
budget in under 10 minutes in a retry loop. Monitor progress.jsonl in real time.
Consider a per-task cost limit in your verify function.
4. Fresh context is a feature, not a limitation. Iteration 10 follows the system prompt as faithfully as iteration 1. The learnings file carries essential knowledge, and the filesystem carries the code. Accumulated context would introduce drift.
5. The PM integration pattern is reusable. The claw agent PM + forge agent Builder pattern generalizes to document generation, data migration, infrastructure setup, and any domain where you need a conversational interface for requirements and an iterative builder for execution.
6. Plan decomposition quality determines build success. Builds with 5-8 small, specific tasks succeed more often and cost less than builds with 2-3 large tasks. Invest time in the plan generation prompt.
7. Learnings accumulation prevents redundant retries. An insight discovered on task 1 prevents the same error on tasks 2 through 5. Over a 5-task build, this typically saves 2-3 iterations and $0.50-$1.00 in cost.
Exercises #
Exercise 1: Add a Linting-Only Verify Mode
Modify the verify_build function to support a lint_only mode that checks only
pylint results without running pytest. Add a configuration flag to the Builder that
selects between "full" and "lint_only" modes. Write tests for both.
Exercise 2: Multi-Language Support
Extend the Builder to support JavaScript/Node.js projects. Create a JSBuilder
forge agent with npm test and eslint in its verify function. The PM should
detect the target language from the feature description and spawn the right Builder.
Exercise 3: Plan Review Checkpoint
Add a human-in-the-loop step between plan generation and build execution. The PM should present the plan and wait for "approve", "revise", or "cancel" before spawning the Builder. Use session persistence across the approval flow.
Exercise 4: Cost Optimization Dashboard
Write a standalone Neam program that reads all progress.jsonl files from a
directory and generates a report showing: average cost per task, tasks requiring
the most retries, total spending over 7 days, and decomposition recommendations.
Exercise 5: Rollback and Retry
Implement a rollback_and_retry skill for the PM that rolls back to the last
successful git checkpoint when a build fails, modifies the failed task's description
based on the abort reason, and spawns a new Builder instance.
Exercise 6: Parallel Task Execution
Redesign the Builder for parallel task execution. Modify the plan format to include
dependency annotations. Implement a scheduler that identifies independent tasks and
runs multiple forge agents concurrently using dag_execute.