Chapter 30: Agent Evaluation with Neam-Gym #
How do you know your agent is good enough for production? Neam v1.0 evolves the Neam-Gym evaluation harness into a seven-mode framework with LLM-as-Judge grading, multi-dimensional rubrics, statistical reproducibility, and CI/CD quality gates -- all declared as compiled constructs.
Why Evaluation Matters #
AI agents are non-deterministic. The same agent can produce different quality results across runs. Without systematic evaluation, you are deploying blind. Neam-Gym makes evaluation a first-class part of the development lifecycle:
- Before merging -- CI gates reject agent regressions
- Before deploying -- Quality thresholds block bad versions
- In production -- Continuous monitoring catches drift
- When comparing -- Elo-style arena ranks agent versions
Seven Evaluation Modes #
| Mode | Purpose | Key Metrics |
|---|---|---|
"unit" | Test individual outputs with LLM judge + rubrics | Pass rate, average score |
"trajectory" | Evaluate full execution path (tool calls, reasoning, state) | Step accuracy, tool usage |
"rag" | RAGAS metrics for RAG pipelines | Faithfulness, context precision/recall |
"multi_agent" | DIO orchestration with coordination metrics | Coordination score, delegation accuracy |
"security" | OWASP ASI01-10 compliance + red teaming | Compliance rate, injection resistance |
"cost" | Token cost tracking + Pareto frontier analysis | Cost per task, quality/cost ratio |
"arena" | Pairwise Elo-style comparison between agent versions | Elo rating, win rate |
The gym_evaluator Declaration #
Unit Mode with LLM-as-Judge #
gym_evaluator ChurnEval {
mode: "unit",
agent: "./build/churn_agent.neamb",
dataset: "./eval/churn_test.jsonl",
graders: {
primary: "llm_judge",
fallback: "contains",
judge: {
provider: "openai",
model: "gpt-4o-mini",
rubric: {
correctness: { weight: 0.4, scale: "1-5" },
completeness: { weight: 0.3, scale: "1-5" },
safety: { weight: 0.2, scale: "1-5" },
coherence: { weight: 0.1, scale: "1-5" }
}
}
},
thresholds: {
pass_rate: 0.85,
avg_score: 3.5,
exit_on_fail: true
},
reproducibility: {
repetitions: 5,
seed: 42
},
output: "./eval/report.json"
}
RAG Mode with RAGAS Metrics #
gym_evaluator RAGQuality {
mode: "rag",
agent: "./build/rag_agent.neamb",
dataset: "./eval/rag_questions.jsonl",
graders: {
primary: "ragas",
metrics: ["faithfulness", "answer_relevance", "context_precision", "context_recall"]
},
thresholds: {
faithfulness: 0.8,
answer_relevance: 0.75,
context_precision: 0.7
}
}
Security Mode #
gym_evaluator SecurityAudit {
mode: "security",
agent: "./build/prod_agent.neamb",
dataset: "./eval/red_team_prompts.jsonl",
graders: {
primary: "owasp_compliance",
risks: ["ASI01", "ASI02", "ASI03", "ASI04", "ASI05",
"ASI06", "ASI07", "ASI08", "ASI09", "ASI10"]
},
thresholds: {
compliance_rate: 1.0,
injection_resistance: 0.95
}
}
Arena Mode #
gym_evaluator ModelCompare {
mode: "arena",
agents: [
{ name: "v1_gpt4", bytecode: "./build/agent_gpt4.neamb" },
{ name: "v1_claude", bytecode: "./build/agent_claude.neamb" },
{ name: "v1_local", bytecode: "./build/agent_ollama.neamb" }
],
dataset: "./eval/arena_tasks.jsonl",
graders: {
primary: "llm_judge",
judge: { provider: "openai", model: "gpt-4o" }
},
arena: {
rounds: 100,
initial_elo: 1200,
k_factor: 32
}
}
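After each pairwise round, ratings move by the standard Elo update; the initial_elo and k_factor fields above are its parameters. A sketch of that update (the function is illustrative, but the formula is the textbook Elo rule):

```python
# Standard Elo update as used in arena mode; parameter names mirror
# the config above (initial_elo: 1200, k_factor: 32).

def elo_update(ra, rb, score_a, k_factor=32):
    """score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))
    delta = k_factor * (score_a - expected_a)
    return ra + delta, rb - delta

ra, rb = 1200, 1200
ra, rb = elo_update(ra, rb, 1.0)  # v1_gpt4 beats v1_claude in round 1
print(round(ra), round(rb))       # 1216 1184
```

Because the update is zero-sum, the total rating mass stays constant across the 100 rounds; rankings emerge from which agent wins against stronger opponents.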
Evaluation Methods #
LLM-as-Judge #
Uses a separate LLM (the "judge") to evaluate agent outputs against a multi-dimensional rubric. Each dimension has a weight and scale. The judge returns structured scores that are aggregated into a final quality metric.
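The aggregation step can be sketched as a weighted sum over the rubric dimensions from the ChurnEval example. This assumes the judge returns one 1-5 score per dimension; the exact aggregation Neam-Gym performs may differ:

```python
# Sketch: combining per-dimension judge scores with the ChurnEval
# rubric weights. Assumes one 1-5 score per dimension from the judge.

rubric = {"correctness": 0.4, "completeness": 0.3, "safety": 0.2, "coherence": 0.1}

def aggregate(scores, rubric):
    """Weighted sum of dimension scores; weights sum to 1.0."""
    return sum(rubric[dim] * scores[dim] for dim in rubric)

scores = {"correctness": 4, "completeness": 3, "safety": 5, "coherence": 4}
print(round(aggregate(scores, rubric), 2))  # 0.4*4 + 0.3*3 + 0.2*5 + 0.1*4 = 3.9
```

A 3.9 here would clear the avg_score threshold of 3.5 declared in ChurnEval.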
Agent-as-Judge #
Uses another Neam agent as the evaluator -- useful when the evaluation itself requires tool calling or RAG access.
Heuristic Graders #
Programmatic scoring functions for deterministic checks: contains, regex_match, json_valid, exact_match.
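The four graders named above are simple enough to sketch in full; the signatures here are illustrative rather than Neam-Gym's actual grader interface:

```python
import json
import re

# Minimal sketches of the four heuristic graders; each returns a boolean.
# Signatures are illustrative, not Neam-Gym's grader interface.

def contains(output, expected):
    return expected in output

def regex_match(output, pattern):
    return re.search(pattern, output) is not None

def json_valid(output):
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def exact_match(output, expected):
    return output.strip() == expected.strip()

print(contains("churn risk: high", "high"))  # True
print(json_valid('{"risk": 0.87}'))          # True
```

Deterministic graders like these make a natural fallback (as in ChurnEval's `fallback: "contains"`) when the LLM judge is unavailable or too expensive to run on every item.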
RAGAS Metrics #
Specialized metrics for RAG evaluation:
| Metric | Measures |
|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? |
| Answer Relevance | Does the answer address the question? |
| Context Precision | What fraction of the retrieved chunks is relevant to the question? |
| Context Recall | Were all relevant chunks retrieved? |
Trace Collection and Observability #
Neam-Gym collects native traces during evaluation that can be exported via OTLP to external observability platforms:
- Langfuse -- LLM observability platform
- Phoenix (Arize) -- ML observability
- Grafana / Jaeger -- Distributed tracing
Traces include span-level detail: LLM calls, tool invocations, RAG retrievals, budget consumption, and guard evaluations.
CI/CD Integration #
Neam-Gym outputs JUnit XML and GitHub Actions annotations for CI/CD integration:
# Run evaluation in CI pipeline
neam-gym run --config eval_config.neamb --format junit --output results.xml
# Quality gate -- fail the build if thresholds are not met
neam-gym gate --config eval_config.neamb --strict
Configure regression detection thresholds:
| Regression | Threshold |
|---|---|
| Quality score decrease | > 5% |
| Latency increase | > 10% |
| Cost increase | > 5% |
| Throughput decrease | > 10% |
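The table above can be read as a candidate-versus-baseline comparison on relative change. A minimal sketch of that check, with illustrative metric names and a sign convention where negative limits guard against decreases and positive limits against increases:

```python
# Sketch: flagging regressions against a baseline using the thresholds
# from the table above. Metric names and shapes are illustrative.

THRESHOLDS = {
    "quality": -0.05,     # fail if quality drops more than 5%
    "latency": 0.10,      # fail if latency rises more than 10%
    "cost": 0.05,         # fail if cost rises more than 5%
    "throughput": -0.10,  # fail if throughput drops more than 10%
}

def regressions(baseline, candidate):
    flagged = []
    for metric, limit in THRESHOLDS.items():
        change = (candidate[metric] - baseline[metric]) / baseline[metric]
        if (limit < 0 and change < limit) or (limit > 0 and change > limit):
            flagged.append(metric)
    return flagged

baseline = {"quality": 0.90, "latency": 2.0, "cost": 0.010, "throughput": 50}
candidate = {"quality": 0.84, "latency": 2.1, "cost": 0.0104, "throughput": 48}
print(regressions(baseline, candidate))  # ['quality']: -6.7% exceeds the -5% limit
```

In practice a check like this runs behind `neam-gym gate --strict`, failing the build when any metric crosses its limit.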
Statistical Reproducibility #
Agent outputs are non-deterministic. Neam-Gym addresses this with:
- Multiple repetitions with configurable seed
- Confidence intervals (95% CI by default)
- Effect size calculation for comparisons
- Statistical significance testing before declaring regressions
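The repetitions-plus-seed mechanism can be sketched as follows, using a normal-approximation 95% interval on the mean score. The helper and the stand-in scoring function are illustrative; Neam-Gym's exact statistical method may differ:

```python
import random
import statistics

# Sketch: repeated evaluation runs under a fixed seed, with a 95% CI
# on the mean score (normal approximation). Illustrative, not Neam-Gym's
# internal implementation.

def evaluate_with_ci(run_once, repetitions=5, seed=42):
    random.seed(seed)  # fixed seed -> the same run sequence every time
    scores = [run_once() for _ in range(repetitions)]
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, (mean - 1.96 * sem, mean + 1.96 * sem)

# Stand-in for one non-deterministic evaluation run of an agent.
mean, (lo, hi) = evaluate_with_ci(lambda: random.gauss(3.8, 0.3))
print(lo <= mean <= hi)  # True
```

Reporting the interval rather than a single mean is what allows the regression gate to distinguish real quality drops from run-to-run noise.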
For comprehensive architecture details on how Neam-Gym integrates with the 14 specialist agents and the DIO orchestrator, see The Intelligent Data Organization with Neam.