Case Study: Autonomous Code Builder #
The previous case studies demonstrated agents that answer questions, search documents,
and manage data pipelines. This chapter tackles a fundamentally different challenge:
building software autonomously. The system we construct here takes a feature requirement
in plain English, generates a plan, writes Python code with tests using a forge agent,
verifies correctness through test execution and coverage checks, uses git checkpoints
to preserve every verified step, tracks progress and learnings across iterations, and
integrates with a claw agent "Project Manager" that orchestrates the entire workflow
via spawn.
By the end of this case study, you will have a complete TDD-driven autonomous code builder that combines the forge agent's iterative build-verify loop with the claw agent's persistent session management and the Orchestrable trait's spawn/delegate callbacks. This is the full NeamClaw build workflow in action.
Writing code is a multi-step, iterative process. A human developer reads requirements, plans an approach, writes code, runs tests, reads error messages, fixes bugs, and repeats until the tests pass. No single LLM call can replicate this workflow reliably. The forge agent encodes the entire cycle into a language construct: write, verify, checkpoint, repeat. Combined with a claw agent PM that manages requirements and tracks progress across features, you get a system that mirrors how real engineering teams operate -- a project manager assigns work, a developer builds it iteratively, and the PM reviews the result.
Requirements #
An autonomous code builder must satisfy seven core requirements:
- Feature intake. The system accepts feature requirements in natural language. A user describes what they want, and the system produces working code.
- Plan generation. Before writing any code, the system generates a step-by-step plan that decomposes the feature into discrete, testable tasks.
- TDD code generation. A forge agent writes Python source code and test files following test-driven development. The agent never modifies pre-existing tests.
- Automated verification. After each iteration, the system runs pytest, checks that all tests pass, validates that code coverage meets 80%, and runs pylint for static analysis. Verification is external and deterministic.
- Git checkpoints. Every verified task produces a git commit. The commit history provides a clean audit trail of incremental progress.
- Progress and learnings tracking. The system records completed tasks, iteration counts, costs, and cross-iteration insights that carry forward into subsequent iterations via the fresh context model.
- Project Manager integration. A claw agent PM receives feature requests, creates plans, spawns the forge agent Builder, monitors progress, and reports results with full session persistence.
Architecture Design #
The claw agent PM handles the conversational interface and orchestration. The forge
agent Builder handles the iterative code generation loop. They communicate through
spawn, with the PM launching the Builder and receiving results through the
Orchestrable trait callbacks.
Workspace Layout #
The forge agent operates inside a structured workspace directory:
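A representative layout, assembled from the files this chapter creates and reads later (the spawn setup writes plan.md, requirements.txt, and the src/ and tests/ packages; the runtime maintains progress.jsonl and learnings.jsonl):

```
workspace/
├── plan.md            # numbered build tasks (the task board)
├── progress.jsonl     # one line per verified task
├── learnings.jsonl    # cross-iteration insights
├── requirements.txt   # pytest, pytest-cov, pylint
├── src/               # implementation code
│   └── __init__.py
└── tests/             # pytest test files
    └── __init__.py
```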
Think of the workspace as a developer's desk. The plan is the task board. The progress file is the standup report. The learnings file is the team wiki. The forge agent sits down at this desk fresh each iteration, reads the board, checks the wiki, does its work, and commits. The next iteration is a fresh developer sitting down at the same desk.
The Project Manager #
The Project Manager is a claw agent that receives feature requests, creates plans, spawns the Builder, and reports results. Session persistence lets users ask follow-up questions about previous builds.
// ================================================================
// Project Manager -- Claw Agent
// ================================================================
channel pm_cli {
type: "cli"
prompt: "pm> "
greeting: "Autonomous Code Builder ready. Describe a feature to build."
}
channel pm_http {
type: "http"
port: 8080
path: "/pm/chat"
}
skill create_plan {
description: "Create a step-by-step build plan from a feature description"
params: { feature_description: string }
impl(feature_description) {
let plan = ProjectManager.ask(
"Generate a TDD build plan for the following feature. "
+ "Output ONLY the plan as a numbered list, one task per line. "
+ "Each task must be small enough to implement in a single file. "
+ "Each task must have a clear test criterion.\n\n"
+ "Feature: " + feature_description
);
return plan;
}
}
skill check_build_status {
description: "Check the status of the current or most recent build"
params: { workspace_path: string }
impl(workspace_path) {
let progress = workspace_read(workspace_path + "/progress.jsonl");
if (progress == nil) {
return "No build in progress. No progress file found.";
}
return "Build progress:\n" + progress;
}
}
claw agent ProjectManager {
provider: "anthropic"
model: "claude-sonnet-4-20250514"
temperature: 0.4
system: "You are a Project Manager for an autonomous code builder.
YOUR RESPONSIBILITIES:
1. Receive feature requests from the user.
2. Create a clear, step-by-step TDD build plan.
3. Launch the Builder agent to implement the plan.
4. Report build results back to the user.
5. Answer follow-up questions about builds.
RULES:
- Always create a plan before launching a build.
- Each plan task must be small, focused, and testable.
- Report both successes and failures honestly.
- Track costs and iterations for the user."
channels: [pm_cli, pm_http]
skills: [create_plan, check_build_status]
session: {
idle_reset_minutes: 120
daily_reset_hour: 4
max_history_turns: 100
compaction: "auto"
}
semantic_memory: {
backend: "sqlite"
embedding_model: "nomic-embed-text"
search: "hybrid"
top_k: 5
}
}
Separating plan creation from code generation gives you a review checkpoint. The PM can present the plan to the user before spawning the Builder, letting the user approve, modify, or reject the plan before spending any build budget.
The Forge Agent #
The Builder is a forge agent that executes the TDD build loop. It reads the plan, writes code, and relies on the verify callback to validate its work.
// ================================================================
// Builder -- Forge Agent
// ================================================================
forge agent Builder {
provider: "anthropic"
model: "claude-sonnet-4-20250514"
system: "You are a senior Python developer practicing strict TDD.
YOUR WORKFLOW FOR EACH TASK:
1. Read the current task description carefully.
2. Use read_file to examine existing source and test files.
3. Use list_files to understand the project structure.
4. Write test files FIRST using write_file (if tests do not exist yet).
5. Write implementation code using write_file.
6. Do NOT run tests yourself -- the verification system handles that.
CODE STANDARDS:
- Use Python type hints on all function signatures.
- Write docstrings for all public functions and classes.
- Follow PEP 8 naming conventions (snake_case).
- Keep functions small and focused (under 30 lines).
- Use dataclasses for data models where appropriate.
CRITICAL RULES:
- Never modify pre-existing test files that you did not create.
- Never run pytest yourself. The external verifier does this.
- Never assume a file exists -- always read it first.
- If you receive feedback about a test failure, read the error carefully
and fix only the specific issue identified."
temperature: 0.2
skills: [write_file, read_file, exec_command, list_files]
verify: fun(ctx) {
// Run pytest with coverage
let result = exec(
"cd " + ctx.workspace_path
+ " && python -m pytest tests/ -q --cov=src --cov-report=term-missing",
60
);
if (result["exit_code"] != 0) {
return f"Tests failed:\n{result['stderr']}\n{result['stdout']}";
}
// Parse coverage percentage from pytest-cov output
let cov_match = regex_find(
result["stdout"],
"TOTAL\\s+\\d+\\s+\\d+\\s+(\\d+)%"
);
if (cov_match != "" && num(cov_match) < 80) {
return f"Coverage is {cov_match}%, need at least 80%. Add more tests.";
}
// Run pylint for static analysis
let lint = exec(
"cd " + ctx.workspace_path
+ " && python -m pylint src/ --disable=C0114,C0115,C0116 --fail-under=7.0",
30
);
if (lint["exit_code"] != 0) {
return f"Lint errors:\n{lint['stdout']}";
}
return true;
}
max_iterations: 15
max_cost: 5.0
max_tokens: 500000
checkpoint: "git"
files: ["src/**/*.py", "tests/**/*.py"]
}
Each field explained:

| Field | Value | Purpose |
|---|---|---|
| provider | "anthropic" | LLM provider for code generation |
| model | "claude-sonnet-4-20250514" | Model with strong coding ability |
| system | (multi-line) | Detailed TDD instructions and coding standards |
| temperature | 0.2 | Low temperature for deterministic code output |
| skills | [write_file, ...] | File I/O and command execution capabilities |
| verify | inline function | External verification: pytest + coverage + pylint |
| max_iterations | 15 | Upper bound on build-verify cycles |
| max_cost | 5.0 | Maximum USD spend before BudgetExhausted |
| max_tokens | 500000 | Maximum token consumption across all iterations |
| checkpoint | "git" | Git commit after each verified task |
| files | ["src/**/*.py", ...] | File patterns the agent can modify |
Setting max_iterations too low forces the agent to succeed on the first or
second try, but TDD workflows inherently require iteration. A value of 15 gives
a 5-task plan an average of three attempts per task -- the initial attempt plus
two retries.
The Verify Callback #
The verify callback is the quality gate that makes the autonomous builder reliable. Let us examine the verify function in depth.
The Context Object #
The verify callback receives a context object (ctx) with five fields:
| Field | Type | Description |
|---|---|---|
| ctx.iteration | int | Current iteration number, starting from 1 |
| ctx.workspace_path | string | Absolute path to the forge workspace directory |
| ctx.files_changed | list | List of file paths modified during this iteration |
| ctx.cost_so_far | float | Total USD spent across all iterations so far |
| ctx.tokens_so_far | int | Total tokens consumed across all iterations so far |
Return Values #
The verify function controls the loop through its return value:
| Return Value | Effect |
|---|---|
| true | Done. Checkpoint, mark task complete, advance to next task. |
| false or nil | Retry. Re-run the current task with no additional feedback. |
| "abort" | Abort. Stop the entire loop immediately. |
| Any other string | Retry with feedback. The string becomes context in the next iteration. |
The VerifyResult Sealed Type #
For structured control flow, Neam provides the VerifyResult sealed type:
sealed VerifyResult {
Done,
Retry(feedback: string),
Abort(reason: string)
}
| Variant | Behavior |
|---|---|
| VerifyResult.Done | Checkpoint, mark task complete, advance. If no tasks remain, return Completed. |
| VerifyResult.Retry(feedback) | Feed feedback into the next iteration's context. |
| VerifyResult.Abort(reason) | Stop immediately. Return LoopOutcome.Aborted(reason). |
Full Verify Implementation #
fun verify_build(ctx) {
// ── Step 1: Run pytest with coverage ────────────────────────────
let test_result = exec(
"cd " + ctx.workspace_path
+ " && python -m pytest tests/ -q --cov=src --cov-report=term-missing",
60
);
if (test_result["exit_code"] != 0) {
if (ctx.iteration >= 12) {
return VerifyResult.Abort(
f"Tests still failing after {ctx.iteration} iterations. "
+ "Manual intervention required."
);
}
let feedback = f"Tests failed on iteration {ctx.iteration}.\n";
feedback = feedback + f"Output:\n{test_result['stdout']}\n";
feedback = feedback + f"Errors:\n{test_result['stderr']}\n";
feedback = feedback + "Fix ONLY the specific issue identified.";
return VerifyResult.Retry(feedback);
}
// ── Step 2: Parse coverage percentage ───────────────────────────
let cov_match = regex_find(
test_result["stdout"],
"TOTAL\\s+\\d+\\s+\\d+\\s+(\\d+)%"
);
if (cov_match != "" && num(cov_match) < 80) {
return VerifyResult.Retry(
f"Tests pass but coverage is {cov_match}%, need at least 80%. "
+ "Add test cases for uncovered code paths."
);
}
// ── Step 3: Run pylint ──────────────────────────────────────────
let lint_result = exec(
"cd " + ctx.workspace_path
+ " && python -m pylint src/ --disable=C0114,C0115,C0116 --fail-under=7.0",
30
);
if (lint_result["exit_code"] != 0) {
return VerifyResult.Retry(
f"Lint check failed:\n{lint_result['stdout']}\n"
+ "Fix the lint issues identified above."
);
}
// ── All checks passed ──────────────────────────────────────────
return VerifyResult.Done;
}
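The coverage check in Step 2 hinges on parsing the TOTAL line of pytest-cov's terminal summary. A minimal Python sketch of that parse, using the same regular expression as the verify function (parse_total_coverage is an illustrative helper, not part of the Neam runtime):

```python
import re

def parse_total_coverage(pytest_stdout: str):
    """Extract the TOTAL percentage from pytest-cov's terminal summary.

    The summary ends with a line like: TOTAL  120  10  92%
    Returns the percentage as an int, or None if no TOTAL line is found.
    """
    match = re.search(r"TOTAL\s+\d+\s+\d+\s+(\d+)%", pytest_stdout)
    return int(match.group(1)) if match else None

sample = "src/models.py   40   2   95%\nTOTAL  120  10  92%"
print(parse_total_coverage(sample))  # -> 92
```

Returning None (rather than failing) when no TOTAL line exists mirrors the verify function's behavior of skipping the coverage gate when the regex finds nothing.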
The LoopOutcome Sealed Type #
When .run() completes, it returns a LoopOutcome value:
sealed LoopOutcome {
Completed,
MaxIterations,
Aborted(reason: string),
BudgetExhausted
}
Handle the outcome with a match expression:
let outcome = Builder.run();
match outcome {
Completed => {
emit "All tasks completed and verified.";
emit f"Iterations: {outcome.iterations}, Cost: ${outcome.total_cost}";
},
MaxIterations => {
emit f"Reached {outcome.iterations} iterations.";
emit f"Tasks done: {outcome.tasks_completed}/{outcome.tasks_total}";
},
Aborted(reason) => emit f"Build aborted: {reason}",
BudgetExhausted => emit f"Budget exhausted. Spent ${outcome.total_cost}."
}
Always log outcome.total_cost in production. If you see a task consistently
requiring more than 3-4 retries, the task is probably too broad and should be
decomposed into smaller steps.
Skills #
The Builder requires four skills for interacting with the workspace. Each skill wraps a built-in workspace function.
write_file #
skill write_file {
description: "Write content to a file in the workspace. Creates parent
directories if they do not exist."
params: { path: string, content: string }
impl(path, content) {
workspace_write(path, content);
return "Wrote " + path + " (" + str(len(content)) + " bytes)";
}
}
read_file #
skill read_file {
description: "Read the contents of a file in the workspace."
params: { path: string }
impl(path) {
let content = workspace_read(path);
if (content == nil) {
return "File not found: " + path;
}
return content;
}
}
exec_command #
skill exec_command {
description: "Execute a shell command in the workspace directory. Do NOT use
this to run tests -- the verification system handles testing automatically."
params: { command: string }
impl(command) {
let result = exec(command, 30);
let output = "Exit code: " + str(result["exit_code"]) + "\n";
output = output + "stdout:\n" + result["stdout"] + "\n";
if (result["stderr"] != "") {
output = output + "stderr:\n" + result["stderr"];
}
return output;
}
}
list_files #
skill list_files {
description: "List files matching a glob pattern in the workspace."
params: { pattern: string }
impl(pattern) {
let files = glob(pattern);
if (len(files) == 0) {
return "No files match pattern: " + pattern;
}
return join(files, "\n");
}
}
Giving the agent exec_command without restricting test execution leads to the
agent running tests itself and claiming success based on its own interpretation.
The system prompt forbids this, and the verify callback is the only authority on
test results. This separation of concerns is what makes the builder trustworthy.
The .run() Pipeline #
When you call Builder.run(), the forge runtime executes a multi-stage pipeline:
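Summarized from the behavior described in the rest of this section, the stages are:

```
1. Load plan.md; read progress.jsonl to find the current task
2. Build a fresh context: system prompt + current task + learnings
3. Let the LLM act through its skills (read_file, write_file, ...)
4. Run the verify callback
5. Done  -> checkpoint, append to progress.jsonl, advance to the next task
   Retry -> re-enter the loop, carrying the feedback string
   Abort -> stop with LoopOutcome.Aborted(reason)
6. Stop when tasks are exhausted or an iteration/cost/token limit trips
```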
Fresh Context Model #
The critical insight is that context is built fresh on every iteration. The VM discards all messages from the previous iteration. It re-reads the plan, progress, and learnings from disk. This means:
- The context window size is bounded and predictable, regardless of iteration count.
- The system prompt always receives maximum attention (no drift).
- The agent on iteration 10 follows instructions as faithfully as on iteration 1.
- The filesystem is the continuity mechanism, not the conversation history.
If the agent wrote src/models.py during iteration 3, iteration 4 does not "remember"
writing it. But the file is on disk, and the agent can read it with read_file. The
world carries the state. The agent carries none.
Plan Injection and Learnings Accumulation #
On each iteration, the runtime injects the current task and a progress summary:
"Current task: Implement the User class with name, email, and password_hash fields"
"Completed tasks: [1] Models module (iter 2), [2] Repository (iter 5)"
Learnings from learnings.jsonl are also included:
"Learnings from previous iterations:
- This project uses dataclasses, not plain classes
- The repository expects async methods
- Test files use pytest fixtures defined in conftest.py"
This gives the fresh-context agent access to hard-won knowledge without carrying the full conversation history.
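The shape of that injected block is simple to reproduce. A Python sketch, assuming the learnings.jsonl schema shown above (format_learnings is an illustrative helper, not a runtime API):

```python
import json

def format_learnings(jsonl_text: str) -> str:
    """Render learnings.jsonl entries as the bulleted block injected into context."""
    entries = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if not entries:
        return ""  # nothing learned yet: inject nothing
    bullets = "\n".join("- " + e["learning"] for e in entries)
    return "Learnings from previous iterations:\n" + bullets

jsonl = '{"learning": "This project uses dataclasses", "iteration": 1, "source": "code_analysis"}\n'
print(format_learnings(jsonl))
```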
Checkpoint Strategies #
| Strategy | Value | What It Does | When to Use |
|---|---|---|---|
| Git | "git" | git add -A && git commit -m "forge: <task>" | Code projects in git repos |
| Snapshot | "snapshot" | Copy workspace to .neam/snapshots/<iteration>/ | Binary files, non-git projects |
| None | "none" | No checkpointing | Development, ephemeral workspaces |
Git Checkpoints in Detail #
With checkpoint: "git", the forge runtime creates a commit after each verified
task. Retries do not produce commits.
┌─────────────────────────────────────────────────────────────────────┐
│ Git History After a 5-Task Build │
│ │
│ Iteration 1: Verify → Done │
│ ● commit: "forge: Implement User class" │
│ │ │
│ Iteration 2: Verify → Retry (test failure) │
│ │ (no commit) │
│ │ │
│ Iteration 3: Verify → Done │
│ ● commit: "forge: Implement UserRepository" │
│ │ │
│ Iteration 4: Verify → Done │
│ ● commit: "forge: Implement AuthService" │
│ │ │
│ Iteration 5: Verify → Retry (coverage low) │
│ │ (no commit) │
│ │ │
│ Iteration 6: Verify → Done │
│ ● commit: "forge: Add input validation" │
│ │ │
│ Iteration 7: Verify → Done │
│ ● commit: "forge: Add integration tests" │
│ │
│ Result: 5 clean commits, 7 iterations total │
└─────────────────────────────────────────────────────────────────────┘
Every commit represents verified, working code. Use git log to see build
progression, git diff between commits to see what changed, and git revert to
undo a specific task.
Snapshot Checkpoints #
Snapshots create a full filesystem copy to .neam/snapshots/<iteration>/. Useful
for workspaces with binary artifacts that git handles poorly. The tradeoff is disk
space -- each snapshot is a complete copy.
Start with checkpoint: "none" during development. Switch to "git" once your
verify function is reliable. The overhead is negligible and the rollback safety
net is invaluable.
Plan Files and Progress Tracking #
The plan file, progress file, and learnings file form the forge agent's external memory system.
Plan File Format #
# Build Plan: User Authentication Module
1. Create the User dataclass with name, email, and password_hash fields in src/models.py
2. Create the UserRepository class with create, find_by_id, and find_by_email methods
3. Create the AuthService class with register and login methods
4. Add input validation to the register method (email format, password length)
5. Write integration tests for the full registration and login flow
The runtime parses numbered tasks, skipping blank lines and # comments.
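That parse rule is easy to pin down. A Python sketch of a compatible parser, under the assumption that tasks follow the "N. description" form shown above (parse_plan is an illustrative helper, not the runtime's actual implementation):

```python
import re

def parse_plan(text: str) -> list:
    """Extract numbered task descriptions, skipping blank lines and '#' comments."""
    tasks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment/heading lines
        match = re.match(r"(\d+)\.\s+(.*)", line)
        if match:
            tasks.append(match.group(2))
    return tasks

plan = "# Build Plan: User Authentication Module\n1. Create the User dataclass\n\n2. Create the UserRepository class\n"
print(parse_plan(plan))  # -> ['Create the User dataclass', 'Create the UserRepository class']
```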
Progress File (progress.jsonl) #
The runtime appends a line each time verify returns Done:
{"task": 1, "description": "Create User dataclass", "status": "done", "iteration": 2, "timestamp": "2026-02-18T10:23:45Z", "cost": 0.42, "tokens": 12840}
{"task": 2, "description": "Create UserRepository", "status": "done", "iteration": 5, "timestamp": "2026-02-18T10:31:12Z", "cost": 1.15, "tokens": 34200}
If the program crashes and restarts, the runtime reads progress.jsonl and resumes
from the last completed task -- built-in resumability.
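The resume logic reduces to "highest completed task number, plus one." A Python sketch against the progress.jsonl schema above (next_task is an illustrative helper):

```python
import json

def next_task(progress_jsonl: str) -> int:
    """Return the next task number to run, given progress.jsonl contents.

    An empty or missing file means nothing is done: start at task 1.
    """
    done = [json.loads(line)["task"]
            for line in progress_jsonl.splitlines() if line.strip()]
    return max(done, default=0) + 1

print(next_task(""))                                              # -> 1
print(next_task('{"task": 1, "status": "done"}\n{"task": 2, "status": "done"}'))  # -> 3
```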
Learnings File (learnings.jsonl) #
Stores cross-iteration insights:
{"learning": "This project uses dataclasses for all models", "iteration": 1, "source": "code_analysis"}
{"learning": "The repository uses async methods -- all database calls must be awaited", "iteration": 4, "source": "test_failure"}
{"learning": "Email validation uses the re module, not a third-party library", "iteration": 6, "source": "lint_feedback"}
The learnings file creates a feedback loop across iterations:
┌──────────────────────────────────────────────────────────────────┐
│ Iteration 3: Agent writes async code without await │
│ → Test fails → Verify returns Retry with error feedback │
│ │
│ Iteration 4: Agent reads feedback, fixes code, verify passes │
│ → Verify writes learning: "repository uses async methods" │
│ │
│ Iteration 5: New task, fresh context │
│ → Context includes learning about async methods │
│ → Agent correctly writes async code from the start │
│ → Verify passes on first try (no retry needed) │
└──────────────────────────────────────────────────────────────────┘
The learnings file is like a team wiki. When a developer discovers that "the CI requires Python 3.11, not 3.10," they add it to the wiki. The next developer reads the wiki first and avoids the same mistake.
Spawning from the PM #
The PM uses spawn to launch the Builder, passing the workspace and plan. The
Orchestrable trait callbacks provide hooks for logging.
Spawn Call #
fun launch_build(feature_description) {
// Step 1: Generate the plan
let plan_content = create_plan.call({ feature_description: feature_description });
// Step 2: Set up the workspace
let workspace = "./workspace/" + slugify(feature_description);
workspace_write(workspace + "/plan.md", plan_content);
workspace_write(workspace + "/requirements.txt", "pytest\npytest-cov\npylint\n");
workspace_write(workspace + "/src/__init__.py", "");
workspace_write(workspace + "/tests/__init__.py", "");
// Step 3: Initialize git
exec("cd " + workspace + " && git init && git add -A && git commit -m 'initial'", 10);
// Step 4: Spawn the Builder
let build_result = spawn(Builder, {
workspace: workspace,
plan: plan_content
});
return build_result;
}
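The slugify call above turns a free-text feature description into a safe directory name. The chapter does not define it; a minimal Python sketch of the assumed behavior:

```python
import re

def slugify(text: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

print(slugify("Build a User Auth Module!"))  # -> build-a-user-auth-module
```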
The Orchestrable Trait #
impl Orchestrable for ProjectManager {
fn on_spawn(self, child_name) {
emit f"[PM] Spawning builder agent: {child_name}";
workspace_append(
"build_log.jsonl",
f'{{"event": "spawn", "agent": "{child_name}", '
+ f'"timestamp": "{now()}"}}\n'
);
return nil;
}
fn on_delegate(self, result) {
let preview = str(result);
if (len(preview) > 200) {
preview = preview[0:200] + "...";
}
emit f"[PM] Build result received: {preview}";
workspace_append(
"build_log.jsonl",
f'{{"event": "delegate_result", "preview": "{preview}", '
+ f'"timestamp": "{now()}"}}\n'
);
return nil;
}
}
Spawn-Delegate Lifecycle #
┌─────────────────────────────────────────────────────────────────────┐
│ PM Spawn-Delegate Flow │
│ │
│ User: "Build a user auth module" │
│ │ │
│ ▼ │
│ PM: Generate plan → spawn(Builder, { workspace, plan }) │
│ │ │
│ ├─── on_spawn(self, "Builder") triggered │
│ │ │
│ ▼ │
│ Builder: Runs forge loop (iterations 1..N) │
│ │ │
│ ▼ │
│ Builder: Returns LoopOutcome │
│ │ │
│ ├─── on_delegate(self, result) triggered │
│ │ │
│ ▼ │
│ PM: Reports to user │
│ PM: "Build complete. 5 tasks verified in 7 iterations. Cost: $2.15" │
└─────────────────────────────────────────────────────────────────────┘
Result Handling #
fun report_build_result(outcome) {
match outcome {
Completed => {
return f"Build completed successfully.\n"
+ f"Tasks: {outcome.tasks_completed}/{outcome.tasks_total}\n"
+ f"Iterations: {outcome.iterations}\n"
+ f"Cost: ${outcome.total_cost}\n"
+ "All tests pass. Code is committed to the workspace.";
},
MaxIterations => {
return f"Build reached the iteration limit.\n"
+ f"Completed {outcome.tasks_completed} of {outcome.tasks_total} tasks.";
},
Aborted(reason) => {
return f"Build was aborted: {reason}\n"
+ "Review the workspace and progress.jsonl for details.";
},
BudgetExhausted => {
return f"Build exceeded its budget (${outcome.total_cost}).\n"
+ "Consider increasing max_cost or simplifying the plan.";
}
}
}
Testing the System #
Testing the Verify Function #
Test the verify function in isolation with known-good and known-bad workspaces:
// test_verify.neam -- Run with: neamc test test_verify.neam
fun test_verify_passes_with_good_code() {
let ws = "/tmp/test_workspace_good";
exec("mkdir -p " + ws + "/src " + ws + "/tests", 5);
workspace_write(ws + "/src/__init__.py", "");
workspace_write(ws + "/src/models.py",
"class User:\n def __init__(self, name: str, email: str):\n"
+ " self.name = name\n self.email = email\n");
workspace_write(ws + "/tests/__init__.py", "");
workspace_write(ws + "/tests/test_models.py",
"from src.models import User\n\n"
+ "def test_user_creation():\n"
+ " user = User('Alice', 'alice@example.com')\n"
+ " assert user.name == 'Alice'\n");
let ctx = { "iteration": 1, "workspace_path": ws,
"files_changed": ["src/models.py"], "cost_so_far": 0.5,
"tokens_so_far": 15000 };
let result = verify_build(ctx);
assert(result == VerifyResult.Done, "Expected Done for passing tests");
}
fun test_verify_retries_on_failure() {
let ws = "/tmp/test_workspace_fail";
exec("mkdir -p " + ws + "/src " + ws + "/tests", 5);
workspace_write(ws + "/src/__init__.py", "");
workspace_write(ws + "/src/models.py",
"class User:\n def __init__(self, name: str):\n"
+ " self.name = name\n");
workspace_write(ws + "/tests/__init__.py", "");
workspace_write(ws + "/tests/test_models.py",
"from src.models import User\n\n"
+ "def test_user_has_email():\n"
+ " user = User('Alice', 'alice@example.com')\n"
+ " assert user.email == 'alice@example.com'\n");
let ctx = { "iteration": 1, "workspace_path": ws,
"files_changed": ["src/models.py"], "cost_so_far": 0.5,
"tokens_so_far": 15000 };
let result = verify_build(ctx);
assert(is_retry(result), "Expected Retry for failing tests");
}
fun test_verify_aborts_after_max_retries() {
let ws = "/tmp/test_workspace_abort";
exec("mkdir -p " + ws + "/src " + ws + "/tests", 5);
workspace_write(ws + "/src/__init__.py", "");
workspace_write(ws + "/src/models.py", "invalid python syntax !!!");
workspace_write(ws + "/tests/__init__.py", "");
workspace_write(ws + "/tests/test_models.py", "def test_something(): pass");
let ctx = { "iteration": 12, "workspace_path": ws,
"files_changed": ["src/models.py"], "cost_so_far": 4.5,
"tokens_so_far": 400000 };
let result = verify_build(ctx);
assert(is_abort(result), "Expected Abort after iteration 12");
}
Testing Spawn Integration #
fun test_full_build_cycle() {
let outcome = launch_build("Create a simple calculator with add and multiply");
assert(outcome.tasks_completed > 0, "Should complete at least one task");
let git_log = exec("cd ./workspace/calculator && git log --oneline", 5);
assert(git_log["stdout"].contains("forge:"), "Should have forge commits");
}
Run all tests:
neamc test test_verify.neam test_integration.neam
Running with neam-forge CLI #
The neam-forge CLI runs the Builder directly from the terminal, without the PM.
Basic Usage #
# Run with all defaults
neam-forge --agent code_builder.neam
# Run with overrides
neam-forge --agent code_builder.neam \
--workspace ./workspace/auth-feature \
--max-iterations 10 \
--max-cost 3.0 \
--verbose
CLI Output #
Progress messages go to stderr (with --verbose). The LoopOutcome prints as
JSON to stdout:
$ neam-forge --agent code_builder.neam --verbose
# stderr:
# [forge] Compiling code_builder.neam...
# [forge] Found forge agent: Builder
# [forge] Plan: 5 tasks loaded from plan.md
# [forge] Iteration 1: Task 1 → Verify → Done
# [forge] Iteration 2: Task 2 → Verify → Retry
# [forge] Iteration 3: Task 2 → Verify → Done
# ...
# [forge] All tasks completed.
# stdout:
{"outcome": "completed", "iterations": 7, "tasks_completed": 5, "tasks_total": 5, "total_cost": 2.15, "total_tokens": 78400}
| Exit Code | Meaning |
|---|---|
| 0 | Forge loop completed successfully |
| 1 | Compilation failure, missing agent, or runtime error |
Pipe the JSON output to jq for scripting: neam-forge --agent code_builder.neam | jq '.total_cost'
Docker Deployment #
Package the system as Docker containers: the PM runs as neam-api, the Builder
runs as neam-forge.
Dockerfile #
# Autonomous Code Builder -- Docker Image
# Stage 1: Build Neam binaries
FROM ubuntu:24.04 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential cmake g++ libcurl4-openssl-dev libssl-dev ca-certificates \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY CMakeLists.txt ./
COPY deps/ deps/
COPY NeamC/ NeamC/
RUN cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc)
# Stage 2: Runtime
FROM ubuntu:24.04
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates curl libcurl4 libssl3 git \
python3 python3-pip python3-venv \
&& rm -rf /var/lib/apt/lists/*
COPY --from=builder /build/build/neamc /usr/local/bin/neamc
COPY --from=builder /build/build/neam /usr/local/bin/neam
COPY --from=builder /build/build/neam-api /usr/local/bin/neam-api
COPY --from=builder /build/build/neam-forge /usr/local/bin/neam-forge
RUN useradd --create-home --shell /bin/bash neam
USER neam
WORKDIR /home/neam
RUN python3 -m venv /home/neam/venv
ENV PATH="/home/neam/venv/bin:$PATH"
RUN pip install pytest pytest-cov pylint
RUN mkdir -p /home/neam/sessions /home/neam/workspace /home/neam/agents
COPY --chown=neam:neam agents/ /home/neam/agents/
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD curl -f http://localhost:8080/health || exit 1
ENTRYPOINT ["neam-api"]
CMD ["--program", "/home/neam/agents/code_builder.neam", "--port", "8080"]
Note: python3 and python3-venv are included because the Builder runs Python
tests. The virtual environment pre-installs pytest, pytest-cov, and pylint.
Docker Compose #
services:
pm:
build: .
ports:
- "8080:8080"
environment:
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- NEAM_API_KEY=${NEAM_API_KEY}
volumes:
- session-data:/home/neam/sessions
- workspace-data:/home/neam/workspace
command: ["--program", "/home/neam/agents/code_builder.neam",
"--port", "8080", "--workers", "2"]
builder:
build: .
environment:
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
volumes:
- workspace-data:/home/neam/workspace
entrypoint: ["neam-forge"]
command: ["--agent", "/home/neam/agents/code_builder.neam",
"--name", "Builder"]
profiles:
- build
volumes:
session-data:
workspace-data:
CI/CD Integration #
# .github/workflows/auto-build.yml
name: Autonomous Build
on:
workflow_dispatch:
inputs:
feature:
description: "Feature to build"
required: true
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run autonomous builder
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
neam-forge --agent agents/code_builder.neam \
--workspace ./workspace --max-cost 5.0
- uses: actions/upload-artifact@v4
with:
name: build-output
path: workspace/
A common pitfall: forgetting to mount the workspace volume as shared between the PM and Builder containers. The PM writes the plan and the Builder reads it; with separate volumes, the Builder will never find the plan file.
Monitoring the Build Loop #
Reading progress.jsonl #
fun monitor_build(workspace_path) {
let progress = workspace_read(workspace_path + "/progress.jsonl");
if (progress == nil) { emit "No build in progress."; return; }
let lines = split(progress, "\n");
let tasks_done = 0;
for (line in lines) {
if (line == "") { continue; }
let entry = json_parse(line);
tasks_done = tasks_done + 1;
emit f"Task {entry['task']}: {entry['description']} "
+ f"(iteration {entry['iteration']}, ${entry['cost']})";
}
emit f"\nSummary: {tasks_done} tasks completed.";
}
Cost Tracking Dashboard #
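Because every verified task appends its cost to progress.jsonl, a dashboard is mostly an aggregation over those entries. A Python sketch of the core rollup (cost_summary is an illustrative helper built on the schema shown earlier in this chapter):

```python
import json

def cost_summary(progress_jsonl: str) -> dict:
    """Total spend and per-task cost, aggregated from progress.jsonl entries."""
    per_task = {}
    total = 0.0
    for line in progress_jsonl.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        per_task[entry["task"]] = entry["cost"]  # one cost per verified task
        total += entry["cost"]
    return {"total_cost": round(total, 2), "per_task": per_task}

progress = (
    '{"task": 1, "status": "done", "cost": 0.42, "tokens": 12840}\n'
    '{"task": 2, "status": "done", "cost": 1.15, "tokens": 34200}\n'
)
print(cost_summary(progress))  # -> {'total_cost': 1.57, 'per_task': {1: 0.42, 2: 1.15}}
```

Feed the same rollup into your alerting threshold (e.g., warn at 80% of max_cost) to catch runaway retry loops before the budget is exhausted.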
Alerting on Budget Exhaustion #
fun handle_build_outcome(outcome) {
match outcome {
BudgetExhausted => {
exec("curl -X POST https://alerts.example.com/webhook "
+ "-H 'Content-Type: application/json' "
+ "-d '{\"severity\":\"warning\",\"message\":\"Budget exhausted\"}'", 10);
},
Aborted(reason) => {
exec("curl -X POST https://alerts.example.com/webhook "
+ "-H 'Content-Type: application/json' "
+ "-d '{\"severity\":\"error\",\"message\":\"Build aborted\"}'", 10);
},
_ => {}
}
}
Lessons Learned #
1. Verify design is the most important decision. A verify function that is too strict causes excessive retries and budget exhaustion. One that is too lenient lets broken code through. The sweet spot is correctness (tests pass) plus a reasonable quality bar (80% coverage, no critical lint errors).
2. Git checkpoints pay for themselves immediately. The ability to git diff
between verified tasks is invaluable for debugging. The overhead is under 100ms per
checkpoint. Always use checkpoint: "git" for code generation workloads.
3. Cost management requires active monitoring. A forge agent can exhaust its
budget in under 10 minutes in a retry loop. Monitor progress.jsonl in real time.
Consider a per-task cost limit in your verify function.
4. Fresh context is a feature, not a limitation. Iteration 10 follows the system prompt as faithfully as iteration 1. The learnings file carries essential knowledge, and the filesystem carries the code. Accumulated context would introduce drift.
5. The PM integration pattern is reusable. The claw agent PM + forge agent Builder pattern generalizes to document generation, data migration, infrastructure setup, and any domain where you need a conversational interface for requirements and an iterative builder for execution.
6. Plan decomposition quality determines build success. Builds with 5-8 small, specific tasks succeed more often and cost less than builds with 2-3 large tasks. Invest time in the plan generation prompt.
7. Learnings accumulation prevents redundant retries. An insight discovered on task 1 prevents the same error on tasks 2 through 5. Over a 5-task build, this typically saves 2-3 iterations and $0.50-$1.00 in cost.
Exercises #
Exercise 1: Add a Linting-Only Verify Mode
Modify the verify_build function to support a lint_only mode that checks only
pylint results without running pytest. Add a configuration flag to the Builder that
selects between "full" and "lint_only" modes. Write tests for both.
Exercise 2: Multi-Language Support
Extend the Builder to support JavaScript/Node.js projects. Create a JSBuilder
forge agent with npm test and eslint in its verify function. The PM should
detect the target language from the feature description and spawn the right Builder.
Exercise 3: Plan Review Checkpoint
Add a human-in-the-loop step between plan generation and build execution. The PM should present the plan and wait for "approve", "revise", or "cancel" before spawning the Builder. Use session persistence across the approval flow.
Exercise 4: Cost Optimization Dashboard
Write a standalone Neam program that reads all progress.jsonl files from a
directory and generates a report showing: average cost per task, tasks requiring
the most retries, total spending over 7 days, and decomposition recommendations.
Exercise 5: Rollback and Retry
Implement a rollback_and_retry skill for the PM that rolls back to the last
successful git checkpoint when a build fails, modifies the failed task's description
based on the abort reason, and spawns a new Builder instance.
Exercise 6: Parallel Task Execution
Redesign the Builder for parallel task execution. Modify the plan format to include
dependency annotations. Implement a scheduler that identifies independent tasks and
runs multiple forge agents concurrently using dag_execute.