Chapter 30: Agent Evaluation with Neam-Gym #
How do you know your agent is good enough for production? Neam v1.0 evolves the Neam-Gym evaluation harness into a seven-mode framework with LLM-as-Judge grading, multi-dimensional rubrics, statistical reproducibility, and CI/CD quality gates -- all declared as compiled constructs.
Why Evaluation Matters #
AI agents are non-deterministic. The same agent can produce different quality results across runs. Without systematic evaluation, you are deploying blind. Neam-Gym makes evaluation a first-class part of the development lifecycle:
- Before merging -- CI gates reject agent regressions
- Before deploying -- Quality thresholds block bad versions
- In production -- Continuous monitoring catches drift
- When comparing -- Elo-style arena ranks agent versions
Seven Evaluation Modes #
| Mode | Purpose | Key Metrics |
|---|---|---|
"unit" | Test individual outputs with LLM judge + rubrics | Pass rate, average score |
"trajectory" | Evaluate full execution path (tool calls, reasoning, state) | Step accuracy, tool usage |
"rag" | RAGAS metrics for RAG pipelines | Faithfulness, context precision/recall |
"multi_agent" | DIO orchestration with coordination metrics | Coordination score, delegation accuracy |
"security" | OWASP ASI01-10 compliance + red teaming | Compliance rate, injection resistance |
"cost" | Token cost tracking + Pareto frontier analysis | Cost per task, quality/cost ratio |
"arena" | Pairwise Elo-style comparison between agent versions | Elo rating, win rate |
The gym_evaluator Declaration #
Unit Mode with LLM-as-Judge #
gym_evaluator ChurnEval {
mode: "unit",
agent: "./build/churn_agent.neamb",
dataset: "./eval/churn_test.jsonl",
graders: {
primary: "llm_judge",
fallback: "contains",
judge: {
provider: "openai",
model: "gpt-4o-mini",
rubric: {
correctness: { weight: 0.4, scale: "1-5" },
completeness: { weight: 0.3, scale: "1-5" },
safety: { weight: 0.2, scale: "1-5" },
coherence: { weight: 0.1, scale: "1-5" }
}
}
},
thresholds: {
pass_rate: 0.85,
avg_score: 3.5,
exit_on_fail: true
},
reproducibility: {
repetitions: 5,
seed: 42
},
output: "./eval/report.json"
}
RAG Mode with RAGAS Metrics #
gym_evaluator RAGQuality {
mode: "rag",
agent: "./build/rag_agent.neamb",
dataset: "./eval/rag_questions.jsonl",
graders: {
primary: "ragas",
metrics: ["faithfulness", "answer_relevance", "context_precision", "context_recall"]
},
thresholds: {
faithfulness: 0.8,
answer_relevance: 0.75,
context_precision: 0.7
}
}
Security Mode #
gym_evaluator SecurityAudit {
mode: "security",
agent: "./build/prod_agent.neamb",
dataset: "./eval/red_team_prompts.jsonl",
graders: {
primary: "owasp_compliance",
risks: ["ASI01", "ASI02", "ASI03", "ASI04", "ASI05",
"ASI06", "ASI07", "ASI08", "ASI09", "ASI10"]
},
thresholds: {
compliance_rate: 1.0,
injection_resistance: 0.95
}
}
Arena Mode #
gym_evaluator ModelCompare {
mode: "arena",
agents: [
{ name: "v1_gpt4", bytecode: "./build/agent_gpt4.neamb" },
{ name: "v1_claude", bytecode: "./build/agent_claude.neamb" },
{ name: "v1_local", bytecode: "./build/agent_ollama.neamb" }
],
dataset: "./eval/arena_tasks.jsonl",
graders: {
primary: "llm_judge",
judge: { provider: "openai", model: "gpt-4o" }
},
arena: {
rounds: 100,
initial_elo: 1200,
k_factor: 32
}
}
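After each pairwise round, ratings move by the standard Elo update; the initial_elo and k_factor fields above are its parameters. A sketch of that update (the function is illustrative, but the formula is the textbook Elo rule):

```python
# Standard Elo update as used in arena mode; parameter names mirror
# the config above (initial_elo: 1200, k_factor: 32).

def elo_update(ra, rb, score_a, k_factor=32):
    """score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))
    delta = k_factor * (score_a - expected_a)
    return ra + delta, rb - delta

ra, rb = 1200, 1200
ra, rb = elo_update(ra, rb, 1.0)  # v1_gpt4 beats v1_claude in round 1
print(round(ra), round(rb))       # 1216 1184
```

Because the update is zero-sum, the total rating mass stays constant across the 100 rounds; rankings emerge from which agent wins against stronger opponents.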
Evaluation Methods #
LLM-as-Judge #
Uses a separate LLM (the "judge") to evaluate agent outputs against a multi-dimensional rubric. Each dimension has a weight and scale. The judge returns structured scores that are aggregated into a final quality metric.
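The aggregation step can be sketched as a weighted sum over the rubric dimensions from the ChurnEval example. This assumes the judge returns one 1-5 score per dimension; the exact aggregation Neam-Gym performs may differ:

```python
# Sketch: combining per-dimension judge scores with the ChurnEval
# rubric weights. Assumes one 1-5 score per dimension from the judge.

rubric = {"correctness": 0.4, "completeness": 0.3, "safety": 0.2, "coherence": 0.1}

def aggregate(scores, rubric):
    """Weighted sum of dimension scores; weights sum to 1.0."""
    return sum(rubric[dim] * scores[dim] for dim in rubric)

scores = {"correctness": 4, "completeness": 3, "safety": 5, "coherence": 4}
print(round(aggregate(scores, rubric), 2))  # 0.4*4 + 0.3*3 + 0.2*5 + 0.1*4 = 3.9
```

A 3.9 here would clear the avg_score threshold of 3.5 declared in ChurnEval.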
Agent-as-Judge #
Uses another Neam agent as the evaluator -- useful when the evaluation itself requires tool calling or RAG access.
Heuristic Graders #
Programmatic scoring functions for deterministic checks: contains, regex_match, json_valid, exact_match.
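The four graders named above are simple enough to sketch in full; the signatures here are illustrative rather than Neam-Gym's actual grader interface:

```python
import json
import re

# Minimal sketches of the four heuristic graders; each returns a boolean.
# Signatures are illustrative, not Neam-Gym's grader interface.

def contains(output, expected):
    return expected in output

def regex_match(output, pattern):
    return re.search(pattern, output) is not None

def json_valid(output):
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def exact_match(output, expected):
    return output.strip() == expected.strip()

print(contains("churn risk: high", "high"))  # True
print(json_valid('{"risk": 0.87}'))          # True
```

Deterministic graders like these make a natural fallback (as in ChurnEval's `fallback: "contains"`) when the LLM judge is unavailable or too expensive to run on every item.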
RAGAS Metrics #
Specialized metrics for RAG evaluation:
| Metric | Measures |
|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? |
| Answer Relevance | Does the answer address the question? |
| Context Precision | What fraction of the retrieved chunks is relevant to the question? |
| Context Recall | Were all relevant chunks retrieved? |
Trace Collection and Observability #
Neam-Gym collects native traces during evaluation that can be exported via OTLP to external observability platforms:
- Langfuse -- LLM observability platform
- Phoenix (Arize) -- ML observability
- Grafana / Jaeger -- Distributed tracing
Traces include span-level detail: LLM calls, tool invocations, RAG retrievals, budget consumption, and guard evaluations.
CI/CD Integration #
Neam-Gym outputs JUnit XML and GitHub Actions annotations for CI/CD integration:
# Run evaluation in CI pipeline
neam-gym run --config eval_config.neamb --format junit --output results.xml
# Quality gate -- fail the build if thresholds are not met
neam-gym gate --config eval_config.neamb --strict
Configure regression detection thresholds:
| Regression | Threshold |
|---|---|
| Quality score decrease | > 5% |
| Latency increase | > 10% |
| Cost increase | > 5% |
| Throughput decrease | > 10% |
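The table above can be read as a candidate-versus-baseline comparison on relative change. A minimal sketch of that check, with illustrative metric names and a sign convention where negative limits guard against decreases and positive limits against increases:

```python
# Sketch: flagging regressions against a baseline using the thresholds
# from the table above. Metric names and shapes are illustrative.

THRESHOLDS = {
    "quality": -0.05,     # fail if quality drops more than 5%
    "latency": 0.10,      # fail if latency rises more than 10%
    "cost": 0.05,         # fail if cost rises more than 5%
    "throughput": -0.10,  # fail if throughput drops more than 10%
}

def regressions(baseline, candidate):
    flagged = []
    for metric, limit in THRESHOLDS.items():
        change = (candidate[metric] - baseline[metric]) / baseline[metric]
        if (limit < 0 and change < limit) or (limit > 0 and change > limit):
            flagged.append(metric)
    return flagged

baseline = {"quality": 0.90, "latency": 2.0, "cost": 0.010, "throughput": 50}
candidate = {"quality": 0.84, "latency": 2.1, "cost": 0.0104, "throughput": 48}
print(regressions(baseline, candidate))  # ['quality']: -6.7% exceeds the -5% limit
```

In practice a check like this runs behind `neam-gym gate --strict`, failing the build when any metric crosses its limit.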
Statistical Reproducibility #
Agent outputs are non-deterministic. Neam-Gym addresses this with:
- Multiple repetitions with configurable seed
- Confidence intervals (95% CI by default)
- Effect size calculation for comparisons
- Statistical significance testing before declaring regressions
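The repetitions-plus-seed mechanism can be sketched as follows, using a normal-approximation 95% interval on the mean score. The helper and the stand-in scoring function are illustrative; Neam-Gym's exact statistical method may differ:

```python
import random
import statistics

# Sketch: repeated evaluation runs under a fixed seed, with a 95% CI
# on the mean score (normal approximation). Illustrative, not Neam-Gym's
# internal implementation.

def evaluate_with_ci(run_once, repetitions=5, seed=42):
    random.seed(seed)  # fixed seed -> the same run sequence every time
    scores = [run_once() for _ in range(repetitions)]
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, (mean - 1.96 * sem, mean + 1.96 * sem)

# Stand-in for one non-deterministic evaluation run of an agent.
mean, (lo, hi) = evaluate_with_ci(lambda: random.gauss(3.8, 0.3))
print(lo <= mean <= hi)  # True
```

Reporting the interval rather than a single mean is what allows the regression gate to distinguish real quality drops from run-to-run noise.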
For comprehensive architecture details on how Neam-Gym integrates with the 14 specialist agents and the DIO orchestrator, see The Intelligent Data Organization with Neam.