Chapter 24: Case Study -- Research Assistant #
Academic research involves a repetitive cycle: search for papers, read abstracts, identify relevant work, extract key findings, synthesize themes, and format citations. Each step is well-suited to AI assistance, but no single agent can handle the entire pipeline well. In this chapter, we build a multi-agent research assistant that combines RAG-enhanced literature search, chain-of-thought analysis, reflective summarization, citation formatting, learning from user feedback, and a voice interface for hands-free operation.
By the end of this chapter, you will have a system that accepts a research question, searches an indexed corpus of academic papers, analyzes relevance, produces a structured summary with citations, and improves over time as you provide feedback.
24.1 Requirements #
A useful research assistant must handle five tasks:
- Literature search. Given a research question, retrieve relevant papers from an indexed corpus. The retrieval must go beyond keyword matching -- it should understand semantic similarity and handle paraphrased queries.
- Analysis and relevance assessment. For each retrieved paper, the system must assess its relevance to the research question, extract key findings, and identify methodological strengths and weaknesses.
- Summarization. The system must synthesize findings from multiple papers into a coherent summary that identifies themes, contradictions, and research gaps.
- Citation management. All claims must be properly cited. The system must format citations in multiple styles (APA, IEEE, Chicago) and generate a bibliography.
- Continuous improvement. The system must learn from user feedback to improve its relevance assessments and summarization quality over time.
24.2 Architecture #
The research assistant uses a pipeline architecture where each stage is handled by a dedicated agent. The output of one agent feeds into the next. A RAG pipeline connects the search agent to the knowledge base, and cognitive features (reasoning, reflection, learning) improve quality at each stage.
24.3 Step 1: Knowledge Base with Academic Papers #
The knowledge base indexes academic papers. We configure it with a larger chunk
size (512 tokens) because academic papers contain long, interconnected arguments
that lose meaning when split too finely. The corrective_rag strategy adds a
self-correction loop: if initial retrieval results are scored as low-relevance, the
system reformulates the query and searches again.
// ================================================================
// STEP 1: Knowledge Base
// ================================================================
// Academic paper corpus. Each paper is a markdown or PDF file
// in the ./papers/ directory. The RAG engine automatically
// handles parsing, chunking, embedding, and indexing.
knowledge AcademicPapers {
vector_store: "usearch"
embedding_model: "text-embedding-3-small"
chunk_size: 512
chunk_overlap: 128
retrieval_strategy: "corrective_rag"
sources: [
{ type: "directory", path: "./papers/", pattern: "*.md" },
{ type: "directory", path: "./papers/", pattern: "*.pdf" }
]
}
// A secondary knowledge base for methodology references.
// This contains textbook-style content on research methods,
// statistical techniques, and experimental design.
knowledge MethodologyRef {
vector_store: "usearch"
embedding_model: "text-embedding-3-small"
chunk_size: 256
chunk_overlap: 64
retrieval_strategy: "hybrid"
sources: [
{ type: "file", path: "./references/research_methods.md" },
{ type: "file", path: "./references/statistics_guide.md" }
]
}
Standard RAG retrieves the top-K chunks and passes them to the LLM. Corrective RAG adds a verification step: after retrieval, a lightweight LLM call scores each chunk for relevance. If the average relevance score falls below a threshold, the system generates an improved query and retrieves again. This is particularly valuable for academic search, where initial queries are often too broad or use different terminology than the indexed papers.
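The retrieve-score-reformulate loop can be sketched in plain Python. This is a minimal illustration of the control flow only; `retrieve`, `score_chunk`, and `reformulate` are hypothetical stand-ins for the embedding search and the lightweight LLM calls, not part of any real API:

```python
def corrective_retrieve(query, retrieve, score_chunk, reformulate,
                        threshold=0.5, max_rounds=2):
    """Retrieve chunks; if average relevance is low, reformulate and retry."""
    chunks = []
    for _ in range(max_rounds):
        chunks = retrieve(query)
        scores = [score_chunk(query, c) for c in chunks]
        avg = sum(scores) / len(scores) if scores else 0.0
        if avg >= threshold:
            return chunks
        # Low relevance: generate an improved query (e.g. add synonyms,
        # swap in the terminology the indexed papers actually use).
        query = reformulate(query)
    return chunks  # best effort after max_rounds
```

The loop terminates either when the average chunk score clears the threshold or after a fixed number of rounds, which bounds the extra LLM cost per query.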
24.4 Step 2: Research Agent with Reasoning #
The research agent is the entry point. It takes the user's research question,
searches the knowledge base, and returns a set of relevant paper excerpts with
relevance scores. It uses chain_of_thought reasoning to break down complex
research questions before searching.
// ================================================================
// STEP 2: Research Agent
// ================================================================
agent ResearchAgent {
provider: "openai"
model: "gpt-4o"
temperature: 0.3
system: "You are an academic research agent. Your job is to:
1. Analyze the user's research question
2. Identify key concepts and search terms
3. Search the knowledge base for relevant papers
4. Return structured results with relevance assessments
For each paper found, provide:
- Title and authors
- Key findings relevant to the question
- Relevance score (1-10)
- Methodology used
If the initial search results are poor, reformulate the query and search again.
Think step by step about what the user is really asking before searching."
reasoning: chain_of_thought
connected_knowledge: [AcademicPapers]
memory: "research_memory"
reflect: {
after: each_response
evaluate: [relevance, completeness, specificity]
min_confidence: 0.7
on_low_quality: {
strategy: "revise"
max_revisions: 2
}
}
}
What happens when you call ResearchAgent.ask(question):
- The VM prepends chain-of-thought instructions to the system prompt.
- The agent breaks the question into key concepts (e.g., "RLHF" -> "reinforcement learning from human feedback," "preference optimization," "reward modeling").
- The RAG engine embeds the query and retrieves the top-K chunks from AcademicPapers.
- Corrective RAG scores each chunk for relevance. If scores are low, it reformulates the query (e.g., adds "direct preference optimization" as a synonym) and re-retrieves.
- The agent synthesizes the retrieved chunks into a structured response.
- Reflection evaluates the response on relevance, completeness, and specificity.
- If the average score is below 0.7, the agent revises (up to 2 times).
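The reflect-and-revise behavior in the last two steps can be sketched as a small control loop. In this hypothetical Python sketch, `generate` stands in for the agent's LLM call and `evaluate` for the reflection scorer; neither is a real Neam API:

```python
def reflect_and_revise(generate, evaluate, min_confidence=0.7, max_revisions=2):
    """Generate a response, score it, and revise while quality stays low."""
    response = generate(None)                # initial attempt
    for _ in range(max_revisions):
        scores = evaluate(response)          # e.g. {"relevance": 0.8, ...}
        avg = sum(scores.values()) / len(scores)
        if avg >= min_confidence:
            break
        response = generate(response)        # revise, conditioned on the draft
    return response
```

Note that the revision budget caps cost: a response that never clears the confidence bar is returned after `max_revisions` attempts rather than looping forever.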
24.5 Step 3: Summarizer Agent with Reflection #
The summarizer agent takes the research agent's output and produces a coherent
narrative summary. It uses plan_and_execute reasoning to first create an outline,
then fill in each section, and finally synthesize the complete summary.
// ================================================================
// STEP 3: Summarizer Agent
// ================================================================
agent SummarizerAgent {
provider: "openai"
model: "gpt-4o"
temperature: 0.4
system: "You are an academic summarizer. Given a set of paper excerpts and
relevance assessments, you must:
1. Identify major themes across the papers
2. Note areas of agreement and contradiction
3. Highlight research gaps and open questions
4. Synthesize findings into a coherent narrative
5. Maintain academic tone and precision
Structure your summary with clear sections:
- Overview (1-2 paragraphs)
- Key Themes (2-4 themes, each with supporting evidence)
- Contradictions and Debates
- Research Gaps
- Conclusion
Every factual claim must reference the source paper by title or author.
Do not make claims that are not supported by the retrieved papers."
reasoning: plan_and_execute
reflect: {
after: each_response
evaluate: [coherence, accuracy, coverage, academic_tone]
min_confidence: 0.8
on_low_quality: {
strategy: "revise"
max_revisions: 2
}
}
learning: {
strategy: "experience_replay"
review_interval: 10
max_adaptations: 50
rollback_on_decline: true
}
memory: "summarizer_memory"
}
Academic summarization is a multi-step task. The
plan_and_execute reasoning strategy generates a plan (the outline), executes
each step as a separate LLM call (filling in each section), and then synthesizes
the results. This produces more structured and thorough summaries than a single
LLM call, because each section receives the model's full attention.
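The plan-execute-synthesize shape reduces to three phases. As a rough sketch (assuming hypothetical `plan`, `execute_step`, and `synthesize` callables standing in for separate LLM calls):

```python
def plan_and_execute(question, plan, execute_step, synthesize):
    """Outline first, fill each section separately, then merge the results."""
    outline = plan(question)                               # list of section titles
    sections = [execute_step(question, s) for s in outline]  # one call per section
    return synthesize(question, outline, sections)         # final merge pass
```

Because each section is a separate call, no single generation has to juggle the whole outline at once, which is why per-section depth improves.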
24.6 Step 4: Citation Formatter Tool #
The citation formatter is a tool rather than an agent because it performs a deterministic transformation: given paper metadata, it produces a formatted citation string. No LLM call is needed.
// ================================================================
// STEP 4: Citation Formatter Tool
// ================================================================
// Tool: Format a citation in a specified style.
// Supports APA, IEEE, and Chicago formats.
tool format_citation {
description: "Format a paper citation in the specified style (APA, IEEE, or Chicago)."
parameters: {
"title": { "type": "string", "description": "Paper title" },
"authors": { "type": "string", "description": "Comma-separated author list" },
"year": { "type": "string", "description": "Publication year" },
"journal": { "type": "string", "description": "Journal or conference name" },
"volume": { "type": "string", "description": "Volume number (optional)" },
"pages": { "type": "string", "description": "Page range (optional)" },
"doi": { "type": "string", "description": "DOI (optional)" },
"style": { "type": "string", "description": "Citation style: APA, IEEE, or Chicago" }
}
execute: fun(args) {
let title = args["title"];
let authors = args["authors"];
let year = args["year"];
let journal = args["journal"];
let volume = args["volume"];
let pages = args["pages"];
let doi = args["doi"];
let style = args["style"];
if (style == "APA") {
// APA 7th Edition: Author, A. A., & Author, B. B. (Year). Title. Journal, Volume, Pages. DOI
let citation = authors + " (" + year + "). " + title + ". ";
citation = citation + journal;
if (volume != "" && volume != nil) {
citation = citation + ", " + volume;
}
if (pages != "" && pages != nil) {
citation = citation + ", " + pages;
}
citation = citation + ".";
if (doi != "" && doi != nil) {
citation = citation + " https://doi.org/" + doi;
}
return citation;
}
if (style == "IEEE") {
// IEEE: A. Author and B. Author, "Title," Journal, vol. V, pp. P-P, Year.
let citation = authors + ", \"" + title + ",\" " + journal;
if (volume != "" && volume != nil) {
citation = citation + ", vol. " + volume;
}
if (pages != "" && pages != nil) {
citation = citation + ", pp. " + pages;
}
citation = citation + ", " + year + ".";
return citation;
}
if (style == "Chicago") {
// Chicago: Author. "Title." Journal Volume (Year): Pages. DOI.
let citation = authors + ". \"" + title + ".\" " + journal;
if (volume != "" && volume != nil) {
citation = citation + " " + volume;
}
citation = citation + " (" + year + ")";
if (pages != "" && pages != nil) {
citation = citation + ": " + pages;
}
citation = citation + ".";
if (doi != "" && doi != nil) {
citation = citation + " https://doi.org/" + doi;
}
return citation;
}
return "Unsupported citation style: " + style;
}
}
// Tool: Generate a complete bibliography from a list of papers.
tool generate_bibliography {
description: "Generate a formatted bibliography from a list of paper metadata."
parameters: {
"papers": {
"type": "string",
"description": "JSON array of paper objects with title, authors, year, journal, volume, pages, doi"
},
"style": {
"type": "string",
"description": "Citation style: APA, IEEE, or Chicago"
}
}
execute: fun(args) {
let papers = json_parse(args["papers"]);
let style = args["style"];
let bibliography = "";
for (index, paper) in enumerate(papers) {
let citation = format_citation.call({
"title": paper["title"],
"authors": paper["authors"],
"year": paper["year"],
"journal": paper["journal"],
"volume": paper["volume"],
"pages": paper["pages"],
"doi": paper["doi"],
"style": style
});
if (style == "IEEE") {
bibliography = bibliography + "[" + str(index + 1) + "] " + citation + "\n";
} else {
bibliography = bibliography + citation + "\n\n";
}
}
return bibliography;
}
}
24.7 Step 5: Multi-Agent Pipeline #
Now we connect the agents into a pipeline. The orchestration function takes a research question and passes it through the search agent, the analysis agent, the summarizer, and the citation formatter in sequence. Each stage's output feeds into the next.
// ================================================================
// STEP 5: Multi-Agent Pipeline
// ================================================================
// Analysis agent: bridges search and summarization.
// Takes raw search results and produces a structured analysis
// that the summarizer can work with.
agent AnalysisAgent {
provider: "openai"
model: "gpt-4o"
temperature: 0.3
system: "You are a research analysis agent. Given search results from an
academic corpus, you must:
1. Score each result for relevance (1-10)
2. Extract the key methodology from each paper
3. Identify the main findings and their strength of evidence
4. Note any limitations or potential biases
5. Group papers by theme or approach
Output a structured analysis in the following format:
## Paper Analysis
For each paper:
- Title:
- Relevance: X/10
- Methodology:
- Key Findings:
- Limitations:
- Theme Group:
## Cross-Paper Themes
List the major themes that emerge across papers."
reasoning: chain_of_thought
connected_knowledge: [AcademicPapers, MethodologyRef]
memory: "analysis_memory"
}
// The pipeline function orchestrates the entire research flow.
fun research_pipeline(question, citation_style) {
emit "=== Research Pipeline Started ===";
emit "Question: " + question;
emit "";
// Stage 1: Search for relevant papers
emit "[Stage 1] Searching academic corpus...";
let search_results = ResearchAgent.ask(
"Find papers relevant to the following research question: " + question
);
emit "[Stage 1] Search complete. Results:";
emit search_results;
emit "";
// Stage 2: Analyze the results
emit "[Stage 2] Analyzing retrieved papers...";
let analysis = AnalysisAgent.ask(
"Analyze these search results for the research question: '"
+ question + "'\n\nSearch Results:\n" + search_results
);
emit "[Stage 2] Analysis complete.";
emit analysis;
emit "";
// Stage 3: Summarize into a coherent narrative
emit "[Stage 3] Generating summary...";
let summary = SummarizerAgent.ask(
"Summarize the following analysis into a coherent research summary. "
+ "The original question was: '" + question + "'\n\nAnalysis:\n" + analysis
);
emit "[Stage 3] Summary complete.";
emit summary;
emit "";
// Stage 4: Format citations
emit "[Stage 4] Formatting citations (" + citation_style + ")...";
// The summarizer's response includes paper references.
// We ask the research agent to extract citation metadata.
let citation_data = ResearchAgent.ask(
"Extract citation metadata (title, authors, year, journal, volume, pages, doi) "
+ "as a JSON array for all papers referenced in this summary. "
+ "Return only the JSON array, with no surrounding text:\n\n" + summary
);
// Format the extracted metadata in the requested style.
let bibliography = generate_bibliography.call({
"papers": citation_data,
"style": citation_style
});
emit "[Stage 4] Citations formatted.";
emit bibliography;
emit "";
return {
"question": question,
"summary": summary,
"analysis": analysis,
"search_results": search_results,
"citation_data": citation_data,
"bibliography": bibliography
};
}
24.8 Step 6: Learning from User Feedback #
The summarizer agent has learning enabled, but we also want an explicit feedback loop. After the user reviews the summary, they can rate it, and the system records this feedback for future learning reviews.
// ================================================================
// STEP 6: Learning from Feedback
// ================================================================
// Configure the summarizer's evolution to refine its prompt over time.
// (This is already configured in Step 3 with learning.strategy = "experience_replay".)
// Here we add evolution for long-term prompt improvement.
agent SummarizerAgentV2 {
provider: "openai"
model: "gpt-4o"
temperature: 0.4
system: "You are an academic summarizer. You synthesize research findings into
coherent, well-structured summaries with proper academic tone."
reasoning: plan_and_execute
reflect: {
after: each_response
evaluate: [coherence, accuracy, coverage, academic_tone]
min_confidence: 0.8
on_low_quality: { strategy: "revise", max_revisions: 2 }
}
learning: {
strategy: "experience_replay"
review_interval: 10
max_adaptations: 50
rollback_on_decline: true
}
evolve: {
mutable: [system_prompt, temperature]
review_after: 50
core_identity: "You are an academic summarizer."
allow_rollback: true
}
memory: "summarizer_v2_memory"
}
// Feedback collection function.
// Call this after presenting the summary to the user.
fun collect_feedback(agent_name, rating, comments) {
// Record the numeric rating (0.0 to 1.0) along with the user's comment
agent_rate(agent_name, rating);
emit "Feedback recorded: " + str(rating) + "/1.0 -- " + comments;
// Check learning progress
let stats = agent_learning_stats(agent_name);
emit "Total interactions: " + str(stats["total_interactions"]);
emit "Avg reflection score: " + str(stats["avg_reflection_score"]);
emit "Reviews completed: " + str(stats["reviews_completed"]);
// If we have enough data, check evolution status
let status = agent_status(agent_name);
emit "Evolution version: " + str(status["evolution_version"]);
return stats;
}
// Example usage of the feedback loop:
fun demo_feedback_loop() {
// Run the pipeline
let result = research_pipeline(
"What are the latest advances in direct preference optimization for LLMs?",
"APA"
);
emit "";
emit "=== User Review ===";
emit "Summary quality: 4.5/5";
emit "";
// User rates the summary 0.9 out of 1.0
collect_feedback("SummarizerAgentV2", 0.9, "Good coverage but missed DPO variants");
emit "";
emit "=== After 50 interactions, the summarizer's prompt will evolve ===";
emit "=== to incorporate lessons from feedback patterns. ===";
}
24.9 Step 7: Voice Interface for Hands-Free Research #
Researchers often want to query the system while reading papers or working at a whiteboard. A voice interface lets them ask questions hands-free and listen to summaries without switching to a keyboard.
// ================================================================
// STEP 7: Voice Interface
// ================================================================
// Voice pipeline for hands-free research.
// STT: OpenAI Whisper (accurate for academic terminology)
// TTS: OpenAI (clear, natural speech)
voice ResearchVoice {
agent: ResearchAgent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "nova"
}
// Voice-enabled research function.
// Records audio, transcribes, runs the pipeline, and speaks the summary.
fun voice_research(audio_input_path, audio_output_path) {
// Step 1: Transcribe the spoken question
emit "[Voice] Transcribing audio...";
let question = voice_transcribe("ResearchVoice", audio_input_path);
emit "[Voice] Transcription: " + question;
// Step 2: Run the research pipeline
let result = research_pipeline(question, "APA");
// Step 3: Synthesize the summary as speech
emit "[Voice] Synthesizing audio response...";
voice_synthesize("ResearchVoice", result["summary"], audio_output_path);
emit "[Voice] Audio saved to: " + audio_output_path;
return result;
}
// Full pipeline run using the voice interface.
// In production, this would be triggered by a microphone input event.
fun voice_pipeline_demo() {
emit "=== Voice Research Demo ===";
emit "";
// Run the full voice pipeline
let result = voice_pipeline_run("ResearchVoice", "/tmp/research_question.wav");
emit "Pipeline result: " + result;
// Or run it step by step for more control:
let result2 = voice_research(
"/tmp/research_question.wav",
"/tmp/research_answer.mp3"
);
emit "Summary length: " + str(len(result2["summary"])) + " characters";
}
24.10 Full Code Listing #
// ================================================================
// Academic Research Assistant
// Version: 1.0.0
//
// Multi-agent research pipeline with RAG, reasoning,
// reflection, learning, evolution, and voice interface.
// ================================================================
// ── Knowledge Bases ────────────────────────────────────────────
knowledge AcademicPapers {
vector_store: "usearch"
embedding_model: "text-embedding-3-small"
chunk_size: 512
chunk_overlap: 128
retrieval_strategy: "corrective_rag"
sources: [
{ type: "directory", path: "./papers/", pattern: "*.md" },
{ type: "directory", path: "./papers/", pattern: "*.pdf" }
]
}
knowledge MethodologyRef {
vector_store: "usearch"
embedding_model: "text-embedding-3-small"
chunk_size: 256
chunk_overlap: 64
retrieval_strategy: "hybrid"
sources: [
{ type: "file", path: "./references/research_methods.md" },
{ type: "file", path: "./references/statistics_guide.md" }
]
}
// ── Tools ──────────────────────────────────────────────────────
tool format_citation {
description: "Format a paper citation in APA, IEEE, or Chicago style."
parameters: {
"title": { "type": "string", "description": "Paper title" },
"authors": { "type": "string", "description": "Author list" },
"year": { "type": "string", "description": "Publication year" },
"journal": { "type": "string", "description": "Journal name" },
"volume": { "type": "string", "description": "Volume number" },
"pages": { "type": "string", "description": "Page range" },
"doi": { "type": "string", "description": "DOI" },
"style": { "type": "string", "description": "APA, IEEE, or Chicago" }
}
execute: fun(args) {
let s = args["style"];
let t = args["title"];
let a = args["authors"];
let y = args["year"];
let j = args["journal"];
if (s == "APA") {
return a + " (" + y + "). " + t + ". " + j + ".";
}
if (s == "IEEE") {
return a + ", \"" + t + ",\" " + j + ", " + y + ".";
}
return a + ". \"" + t + ".\" " + j + " (" + y + ").";
}
}
// ── Agents ─────────────────────────────────────────────────────
agent ResearchAgent {
provider: "openai"
model: "gpt-4o"
temperature: 0.3
system: "You are an academic research agent. Search the knowledge base,
assess relevance, and return structured results with key findings."
reasoning: chain_of_thought
connected_knowledge: [AcademicPapers]
memory: "research_memory"
reflect: {
after: each_response
evaluate: [relevance, completeness, specificity]
min_confidence: 0.7
on_low_quality: { strategy: "revise", max_revisions: 2 }
}
}
agent AnalysisAgent {
provider: "openai"
model: "gpt-4o"
temperature: 0.3
system: "You are a research analysis agent. Score papers for relevance,
extract methodologies, identify findings, and group by theme."
reasoning: chain_of_thought
connected_knowledge: [AcademicPapers, MethodologyRef]
memory: "analysis_memory"
}
agent SummarizerAgent {
provider: "openai"
model: "gpt-4o"
temperature: 0.4
system: "You are an academic summarizer. Synthesize research findings into
coherent summaries with themes, contradictions, and research gaps."
reasoning: plan_and_execute
reflect: {
after: each_response
evaluate: [coherence, accuracy, coverage, academic_tone]
min_confidence: 0.8
on_low_quality: { strategy: "revise", max_revisions: 2 }
}
learning: {
strategy: "experience_replay"
review_interval: 10
max_adaptations: 50
rollback_on_decline: true
}
evolve: {
mutable: [system_prompt, temperature]
review_after: 50
core_identity: "You are an academic summarizer."
allow_rollback: true
}
memory: "summarizer_memory"
}
// ── Voice Pipeline ─────────────────────────────────────────────
voice ResearchVoice {
agent: ResearchAgent
stt_provider: "whisper"
stt_model: "whisper-1"
tts_provider: "openai"
tts_model: "tts-1"
tts_voice: "nova"
}
// ── Pipeline Orchestration ─────────────────────────────────────
fun research_pipeline(question, citation_style) {
emit "[Pipeline] Question: " + question;
let search_results = ResearchAgent.ask(
"Find papers relevant to: " + question
);
let analysis = AnalysisAgent.ask(
"Analyze for '" + question + "':\n" + search_results
);
let summary = SummarizerAgent.ask(
"Summarize for '" + question + "':\n" + analysis
);
return {
"question": question,
"summary": summary,
"analysis": analysis,
"search_results": search_results
};
}
// ── Main ───────────────────────────────────────────────────────
{
let result = research_pipeline(
"What are the latest advances in direct preference optimization for LLMs?",
"APA"
);
emit "";
emit "=== Research Summary ===";
emit result["summary"];
emit "";
// Provide feedback
agent_rate("SummarizerAgent", 0.85);
// Check cognitive status
let stats = agent_learning_stats("SummarizerAgent");
emit "Learning stats: " + str(stats);
}
24.11 Evaluation with neam-gym #
The neam-gym tool provides systematic evaluation of agent performance. We define
a dataset of research questions with known-good answers and measure how well the
research assistant performs.
Creating an Evaluation Dataset #
Create a JSONL file with test cases:
{"input": "What are the main approaches to RLHF?", "expected_themes": ["reward modeling", "preference learning", "policy optimization"], "min_papers": 3}
{"input": "Compare transformer and state-space model architectures", "expected_themes": ["attention mechanism", "linear recurrence", "efficiency"], "min_papers": 4}
{"input": "What are the limitations of chain-of-thought prompting?", "expected_themes": ["faithfulness", "consistency", "computational cost"], "min_papers": 2}
{"input": "Survey recent work on AI alignment", "expected_themes": ["value alignment", "safety", "interpretability"], "min_papers": 5}
{"input": "How does retrieval-augmented generation compare to fine-tuning?", "expected_themes": ["knowledge updating", "hallucination", "cost"], "min_papers": 3}
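As a rough illustration of how the `expected_themes` field can be checked, here is a hypothetical string-matching baseline; the actual judge is an LLM and is far more lenient about wording, so this is a lower-bound sanity check, not the neam-gym scoring method:

```python
def theme_coverage(summary, expected_themes):
    """Fraction of expected themes literally mentioned in the summary."""
    text = summary.lower()
    hits = [t for t in expected_themes if t.lower() in text]
    return len(hits) / len(expected_themes)
```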
Running the Evaluation #
# Run evaluation with 3 independent runs per test case
neam-gym \
--agent research_assistant.neam \
--dataset ./eval/research_questions.jsonl \
--output ./eval/results/ \
--runs 3 \
--judge gpt-4o \
--threshold 0.75
# Output:
# === neam-gym Evaluation Report ===
# Agent: research_assistant.neam
# Dataset: 5 test cases x 3 runs = 15 evaluations
# Judge: gpt-4o
#
# Results:
# Test 1 (RLHF): 0.87 +/- 0.04 PASS
# Test 2 (Transformers): 0.82 +/- 0.06 PASS
# Test 3 (CoT): 0.79 +/- 0.03 PASS
# Test 4 (Alignment): 0.91 +/- 0.02 PASS
# Test 5 (RAG vs FT): 0.76 +/- 0.05 PASS
#
# Overall: 0.83 +/- 0.05 (threshold: 0.75) -> PASS
# All 5 test cases passed.
Evaluation Metrics #
The judge model evaluates each response on four dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Relevance | 0.30 | Are the retrieved papers relevant to the question? |
| Coverage | 0.25 | Does the summary cover the expected themes? |
| Accuracy | 0.25 | Are the factual claims supported by the cited papers? |
| Coherence | 0.20 | Is the summary well-structured and readable? |
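The overall score is the weighted average of the four dimensions. Assuming per-dimension scores in [0, 1], the combination is:

```python
WEIGHTS = {"relevance": 0.30, "coverage": 0.25, "accuracy": 0.25, "coherence": 0.20}

def overall_score(dimension_scores):
    """Weighted average of per-dimension judge scores (each in [0, 1])."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
```

A response scoring 0.8 on relevance, 0.6 on coverage, 1.0 on accuracy, and 0.5 on coherence combines to 0.74, just below the 0.75 pass threshold used above.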
Run evaluations after every significant change to agent prompts or
RAG configuration. The neam-gym output is deterministic given the same
random seed, so you can track regressions reliably. Use --runs 5 for
publication-quality results and --runs 1 for quick feedback during development.
24.12 Upgrading to Agentic Retrieval #
The initial implementation uses corrective_rag, which adds a single
verification-and-reformulation loop. For research questions that span multiple
sub-topics, the agentic retrieval strategy produces better results by
iteratively refining the search across multiple rounds.
Agentic RAG Configuration #
knowledge AcademicPapersV2 {
vector_store: "usearch"
embedding_model: "text-embedding-3-small"
chunk_size: 512
chunk_overlap: 128
retrieval_strategy: "agentic"
agentic_config: {
max_iterations: 3
reflection_model: "gpt-4o-mini"
decompose_queries: true
cross_reference: true
}
sources: [
{ type: "directory", path: "./papers/", pattern: "*.md" },
{ type: "directory", path: "./papers/", pattern: "*.pdf" }
]
}
The agentic strategy differs from corrective_rag in three ways:
| Feature | Corrective RAG | Agentic RAG |
|---|---|---|
| Query reformulation | 1 round | Up to max_iterations rounds |
| Query decomposition | No | Splits complex queries into sub-queries |
| Cross-referencing | No | Checks if retrieved chunks reference each other |
| Cost per query | ~$0.003 | ~$0.008 |
| Relevance improvement | +19% over basic | +31% over basic |
For the research assistant, query decomposition is the most impactful feature. A question like "How does RLHF compare to DPO for alignment?" is decomposed into sub-queries: "RLHF reward modeling methods," "direct preference optimization techniques," and "RLHF vs DPO experimental comparisons." Each sub-query retrieves independently, and the results are merged and deduplicated before being passed to the analysis agent.
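The decompose-retrieve-merge step can be sketched as follows; `decompose` and `retrieve` are hypothetical stand-ins for the sub-query generator and the per-query retrieval, and deduplication here keys on a chunk identifier:

```python
def agentic_retrieve(question, decompose, retrieve):
    """Split a complex question into sub-queries, retrieve each, merge results."""
    merged, seen = [], set()
    for sub_query in decompose(question):
        for chunk in retrieve(sub_query):
            key = chunk["id"]          # dedupe chunks retrieved by several sub-queries
            if key not in seen:
                seen.add(key)
                merged.append(chunk)
    return merged
```

Keeping the first occurrence of each chunk preserves sub-query order, so results for the earliest (usually most central) sub-query lead the merged list.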
Cross-Corpus Search #
When papers span multiple repositories (published papers, preprints, internal reports), connect the research agent to multiple knowledge bases:
knowledge ArxivPreprints {
vector_store: "usearch"
embedding_model: "text-embedding-3-small"
chunk_size: 512
chunk_overlap: 128
retrieval_strategy: "agentic"
sources: [
{ type: "directory", path: "./preprints/", pattern: "*.pdf" }
]
}
agent MultiCorpusResearcher {
provider: "openai"
model: "gpt-4o"
temperature: 0.3
system: "You are a research agent with access to both published papers and
preprints. When results appear in both corpora, prefer the published version.
Note when a finding exists only as a preprint (not yet peer-reviewed)."
reasoning: chain_of_thought
connected_knowledge: [AcademicPapersV2, ArxivPreprints]
memory: "multi_corpus_memory"
}
The agent receives retrieval results from both knowledge bases with source annotations. Its system prompt instructs it to prefer published versions and flag preprint-only findings, giving the user appropriate confidence levels for each cited claim.
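The prefer-published merge policy can be sketched outside the agent as a deterministic step. This is an assumption-laden illustration (papers keyed by title; real systems would match on DOI or a normalized identifier):

```python
def merge_corpora(published, preprints):
    """Merge results from two corpora, preferring the published copy of a paper."""
    by_title = {p["title"]: {**p, "source": "published"} for p in published}
    for p in preprints:
        if p["title"] not in by_title:
            # Preprint-only finding: keep it, but flag it as not peer-reviewed.
            by_title[p["title"]] = {**p, "source": "preprint"}
    return list(by_title.values())
```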
24.13 Lessons Learned #
1. Chunk size matters enormously for academic text. We tested chunk sizes of 128, 256, and 512 tokens. At 128, paragraphs were split mid-argument and the agent frequently misinterpreted findings. At 512, retrieval quality improved by 34% as measured by relevance scores. The tradeoff is that larger chunks consume more of the LLM's context window, but for academic text, this is almost always worth it.
2. Corrective RAG justifies its cost. The corrective RAG strategy adds an extra LLM call per query for relevance scoring, costing approximately $0.003 per query. But it improved retrieval relevance from 0.72 to 0.86 on our test set. For research applications where precision matters more than cost, corrective RAG is the right choice.
3. Plan-and-execute produces better summaries. We compared chain-of-thought and plan-and-execute for the summarizer agent. Plan-and-execute summaries scored 0.12 higher on coherence and 0.08 higher on coverage, because the explicit planning step ensures all themes are addressed before writing begins.
4. Learning improves over time, but slowly. After 100 interactions, the summarizer's average reflection score improved from 0.78 to 0.84. After 200 interactions, it reached 0.87. The improvement curve flattens after about 150 interactions, suggesting that the system prompt reaches its practical limit and further improvement requires new training data or a more capable base model.
5. Voice interfaces need shorter summaries. When we added the voice interface, users reported that full summaries were too long to listen to. We added a separate "voice summary mode" that limits the summarizer to 3-4 sentences. This is a UI consideration, not an agent architecture issue, but it is worth planning for from the start.
24.14 Exercises #
- Add a citation graph. Extend the analysis agent to identify citation relationships between retrieved papers (which papers cite which). Display the graph as a structured output.
- Multi-corpus search. Add a second knowledge base containing preprints from arXiv and modify the research agent to search both corpora, de-duplicate results, and prefer published versions over preprints.
- Autonomous literature monitoring. Configure the research agent with goals and triggers to automatically search for new papers on a daily schedule. Use a budget of 20 calls/day and $2/day.
- Evaluation expansion. Create a dataset of 20 research questions spanning four domains (NLP, computer vision, reinforcement learning, systems). Run neam-gym with 5 runs per question and analyze the variance across domains.
Summary #
In this chapter, we built a multi-agent research assistant that demonstrates the power of Neam's pipeline architecture. The system chains four agents (search, analysis, summarization, citation formatting) into a coherent workflow, enhanced by RAG retrieval, cognitive reasoning strategies, reflection for quality control, learning from user feedback, prompt evolution, a voice interface, and upgradeable agentic retrieval with cross-corpus search.
The key insight is that each stage of the research process maps naturally to a separate agent with its own reasoning strategy. The search agent uses chain-of-thought to decompose queries. The summarizer uses plan-and-execute to structure multi-section output. Reflection ensures quality at each stage. Learning gradually improves the system over hundreds of interactions. And agentic retrieval with query decomposition handles complex multi-topic research questions that simpler strategies miss.
In the next chapter, we tackle a fundamentally different architecture: autonomous agents managing a data pipeline with minimal human intervention.