Programming Neam

Chapter 30: Agent Evaluation with Neam-Gym #

How do you know your agent is good enough for production? Neam v1.0 evolves the Neam-Gym evaluation harness into a seven-mode evaluation framework with LLM-as-Judge, multi-dimensional rubrics, statistical reproducibility, and CI/CD quality gates -- all declared as compiled constructs.


Why Evaluation Matters #

AI agents are non-deterministic. The same agent can produce different quality results across runs. Without systematic evaluation, you are deploying blind. Neam-Gym makes evaluation a first-class part of the development lifecycle.


Seven Evaluation Modes #

| Mode | Purpose | Key Metrics |
| --- | --- | --- |
| `"unit"` | Test individual outputs with LLM judge + rubrics | Pass rate, average score |
| `"trajectory"` | Evaluate full execution path (tool calls, reasoning, state) | Step accuracy, tool usage |
| `"rag"` | RAGAS metrics for RAG pipelines | Faithfulness, context precision/recall |
| `"multi_agent"` | DIO orchestration with coordination metrics | Coordination score, delegation accuracy |
| `"security"` | OWASP ASI01-10 compliance + red teaming | Compliance rate, injection resistance |
| `"cost"` | Token cost tracking + Pareto frontier analysis | Cost per task, quality/cost ratio |
| `"arena"` | Pairwise Elo-style comparison between agent versions | Elo rating, win rate |

The gym_evaluator Declaration #

Unit Mode with LLM-as-Judge #

```neam
gym_evaluator ChurnEval {
    mode: "unit",
    agent: "./build/churn_agent.neamb",
    dataset: "./eval/churn_test.jsonl",
    graders: {
        primary: "llm_judge",
        fallback: "contains",
        judge: {
            provider: "openai",
            model: "gpt-4o-mini",
            rubric: {
                correctness: { weight: 0.4, scale: "1-5" },
                completeness: { weight: 0.3, scale: "1-5" },
                safety: { weight: 0.2, scale: "1-5" },
                coherence: { weight: 0.1, scale: "1-5" }
            }
        }
    },
    thresholds: {
        pass_rate: 0.85,
        avg_score: 3.5,
        exit_on_fail: true
    },
    reproducibility: {
        repetitions: 5,
        seed: 42
    },
    output: "./eval/report.json"
}
```
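The `thresholds` block acts as a quality gate over aggregated results: the run passes only if both the pass rate and the average score clear their floors. A minimal Python sketch of that gating logic (the `results` shape and function name here are hypothetical, not Neam-Gym's actual report format):

```python
def check_thresholds(results, pass_rate=0.85, avg_score=3.5):
    """Aggregate per-example judge results and apply quality gates.

    `results` is a list of dicts with a 1-5 `score` and a boolean
    `passed` flag -- an assumed shape for illustration only.
    """
    rate = sum(r["passed"] for r in results) / len(results)
    avg = sum(r["score"] for r in results) / len(results)
    return {
        "pass_rate": rate,
        "avg_score": avg,
        "ok": rate >= pass_rate and avg >= avg_score,
    }
```

With `exit_on_fail: true`, a failing gate would terminate the run with a non-zero status, which is what makes the construct usable in CI.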

RAG Mode with RAGAS Metrics #

```neam
gym_evaluator RAGQuality {
    mode: "rag",
    agent: "./build/rag_agent.neamb",
    dataset: "./eval/rag_questions.jsonl",
    graders: {
        primary: "ragas",
        metrics: ["faithfulness", "answer_relevance", "context_precision", "context_recall"]
    },
    thresholds: {
        faithfulness: 0.8,
        answer_relevance: 0.75,
        context_precision: 0.7
    }
}
```

Security Mode #

```neam
gym_evaluator SecurityAudit {
    mode: "security",
    agent: "./build/prod_agent.neamb",
    dataset: "./eval/red_team_prompts.jsonl",
    graders: {
        primary: "owasp_compliance",
        risks: ["ASI01", "ASI02", "ASI03", "ASI04", "ASI05",
                "ASI06", "ASI07", "ASI08", "ASI09", "ASI10"]
    },
    thresholds: {
        compliance_rate: 1.0,
        injection_resistance: 0.95
    }
}
```

Arena Mode #

```neam
gym_evaluator ModelCompare {
    mode: "arena",
    agents: [
        { name: "v1_gpt4", bytecode: "./build/agent_gpt4.neamb" },
        { name: "v1_claude", bytecode: "./build/agent_claude.neamb" },
        { name: "v1_local", bytecode: "./build/agent_ollama.neamb" }
    ],
    dataset: "./eval/arena_tasks.jsonl",
    graders: {
        primary: "llm_judge",
        judge: { provider: "openai", model: "gpt-4o" }
    },
    arena: {
        rounds: 100,
        initial_elo: 1200,
        k_factor: 32
    }
}
```
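The `arena` block's `initial_elo` and `k_factor` parameters map directly onto the standard Elo update rule: after each pairwise judgment, the winner gains rating proportional to how unexpected the win was. A sketch of one update step (standard Elo math, not Neam-Gym internals):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One pairwise Elo update.

    score_a is 1.0 if agent A wins, 0.5 for a tie, 0.0 if A loses.
    Returns the new (rating_a, rating_b) pair.
    """
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

Two agents starting at `initial_elo: 1200` that split wins evenly will stay near 1200; a consistent winner pulls ahead quickly, with `k_factor` controlling how fast ratings move.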

Evaluation Methods #

LLM-as-Judge #

Uses a separate LLM (the "judge") to evaluate agent outputs against a multi-dimensional rubric. Each dimension has a weight and scale. The judge returns structured scores that are aggregated into a final quality metric.
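The aggregation itself is just a weighted average over the rubric dimensions. A sketch, using the `weight` values from the unit-mode example above (function name and dict shapes are illustrative assumptions):

```python
def aggregate_rubric(scores, rubric):
    """Combine per-dimension judge scores (1-5 scale) into one quality metric.

    `scores` maps dimension name -> judge score; `rubric` maps dimension
    name -> {"weight": float}, mirroring the rubric in the config.
    """
    total_weight = sum(d["weight"] for d in rubric.values())
    weighted = sum(scores[dim] * d["weight"] for dim, d in rubric.items())
    return weighted / total_weight
```

With the weights 0.4/0.3/0.2/0.1 from the `ChurnEval` rubric, a response scoring 5 on correctness, 4 on completeness, 3 on safety, and 4 on coherence aggregates to 4.2.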

Agent-as-Judge #

Uses another Neam agent as the evaluator -- useful when the evaluation itself requires tool calling or RAG access.

Heuristic Graders #

Programmatic scoring functions for deterministic checks: `contains`, `regex_match`, `json_valid`, `exact_match`.
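These graders reduce to a few lines of deterministic logic each. A sketch of what the four named checks plausibly compute (function names here are hypothetical; only the grader identifiers come from the framework):

```python
import json
import re


def grade_contains(output, expected):
    """Pass if the expected substring appears anywhere in the output."""
    return expected in output


def grade_regex_match(output, pattern):
    """Pass if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def grade_json_valid(output):
    """Pass if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False


def grade_exact_match(output, expected):
    """Pass only on exact equality (ignoring surrounding whitespace)."""
    return output.strip() == expected.strip()
```

Because they are deterministic and cheap, these make good `fallback` graders when the LLM judge is unavailable, as in the `fallback: "contains"` setting of the unit-mode example.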

RAGAS Metrics #

Specialized metrics for RAG evaluation:

| Metric | Measures |
| --- | --- |
| Faithfulness | Is the answer grounded in the retrieved context? |
| Answer Relevance | Does the answer address the question? |
| Context Precision | How precise are the retrieved chunks? |
| Context Recall | Were all relevant chunks retrieved? |
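Context precision and recall have familiar information-retrieval shapes. The sketch below uses simplified set-based versions to convey the intuition; real RAGAS implementations are LLM-assisted and judge relevance per chunk rather than assuming ground-truth labels:

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that were actually retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)
```

A retriever that returns mostly on-topic chunks but misses some relevant ones scores high on precision and lower on recall; thresholds like `context_precision: 0.7` in the RAG-mode example gate on exactly this trade-off.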

Trace Collection and Observability #

Neam-Gym collects native traces during evaluation that can be exported via OTLP to external observability platforms.

Traces include span-level detail: LLM calls, tool invocations, RAG retrievals, budget consumption, and guard evaluations.


CI/CD Integration #

Neam-Gym outputs JUnit XML and GitHub Actions annotations for CI/CD integration:

```bash
# Run evaluation in CI pipeline
neam-gym run --config eval_config.neamb --format junit --output results.xml

# Quality gate -- fail the build if thresholds are not met
neam-gym gate --config eval_config.neamb --strict
```

Configure regression detection thresholds:

| Regression | Threshold |
| --- | --- |
| Quality score decrease | > 5% |
| Latency increase | > 10% |
| Cost increase | > 5% |
| Throughput decrease | > 10% |
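A regression gate of this shape compares a candidate run against a stored baseline and flags any metric whose relative change crosses its limit, with the sign of the limit encoding direction (decreases hurt quality and throughput; increases hurt latency and cost). A sketch under those assumptions (the dict shapes and function name are illustrative, not Neam-Gym's API):

```python
# Signed relative-change limits mirroring the thresholds above:
# negative limit = a decrease beyond it is a regression,
# positive limit = an increase beyond it is a regression.
REGRESSION_LIMITS = {
    "quality": -0.05,
    "latency": 0.10,
    "cost": 0.05,
    "throughput": -0.10,
}


def detect_regressions(baseline, current):
    """Return the list of metrics that regressed relative to baseline."""
    failures = []
    for metric, limit in REGRESSION_LIMITS.items():
        change = (current[metric] - baseline[metric]) / baseline[metric]
        if (limit < 0 and change < limit) or (limit > 0 and change > limit):
            failures.append(metric)
    return failures
```

In a CI pipeline this check would run after the evaluation step, with a non-empty failure list failing the build.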

Statistical Reproducibility #

Agent outputs are non-deterministic. Neam-Gym addresses this by running each evaluation multiple times with a fixed seed -- the `reproducibility` block's `repetitions` and `seed` settings in the unit-mode example -- and aggregating scores across runs rather than trusting a single sample.
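Aggregating repeated runs is straightforward summary statistics. A sketch of what a repetition report might compute (the function and result shape are assumptions for illustration):

```python
import statistics


def aggregate_runs(scores):
    """Summarize repeated runs of the same evaluation.

    `scores` holds one aggregate score per repetition, e.g. five values
    when the config sets `repetitions: 5`.
    """
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```

A small standard deviation across repetitions is what lets you trust a threshold comparison like `avg_score: 3.5`; a large one signals the agent is too unstable for the gate to be meaningful.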

💡 Tip

For comprehensive architecture details on how Neam-Gym integrates with the 14 specialist agents and the DIO orchestrator, see The Intelligent Data Organization with Neam.
