Programming Neam

Chapter 15: RAG and Knowledge Bases #

"The best way to reduce hallucination is to give the model something true to talk about."

Large language models are powerful, but they have a fundamental limitation: their knowledge is frozen at training time. Ask a model about your company's internal documentation, last week's policy changes, or a dataset that did not exist when the model was trained, and it will either refuse to answer or -- worse -- fabricate a confident, plausible-sounding response. This phenomenon is called hallucination, and it is the single largest barrier to deploying LLMs in production.

Retrieval-Augmented Generation (RAG) solves this problem by injecting relevant documents into the prompt at query time. Instead of relying solely on parametric memory (the model's weights), a RAG system retrieves context from an external knowledge base and presents it alongside the user's question. The model then grounds its answer in the retrieved material.

Neam makes RAG a first-class language construct. You declare a knowledge block, connect it to an agent, and the entire retrieval pipeline -- chunking, embedding, indexing, querying, and context injection -- is handled by the runtime. This chapter teaches you how to build knowledge-augmented agents from the ground up, starting with the simplest configuration and progressing through all eight retrieval strategies.


15.1 What Is RAG? #

Retrieval-Augmented Generation was introduced by Lewis et al. (2020) as a technique that combines a retrieval component with a generative model. The core idea is straightforward:

  1. Index a corpus of documents into a searchable store.
  2. Retrieve the most relevant documents for a given query.
  3. Augment the LLM prompt with the retrieved documents.
  4. Generate an answer grounded in the retrieved context.

User Query -> Embed Query -> Vector Store (uSearch) -> Top-K Results (Documents)
           -> Augmented Prompt (= System + Context + User Query) -> LLM Provider (OpenAI / Ollama)

The key insight is that the retrieval step is non-parametric -- it does not depend on the model's training data. You can update the knowledge base at any time, and the next query will immediately reflect the new information.

Why RAG Matters #

| Problem Without RAG | How RAG Solves It |
|---|---|
| Hallucination on domain-specific questions | Grounds answers in retrieved facts |
| Knowledge frozen at training cutoff | Knowledge base can be updated in real time |
| No access to private/internal data | Index proprietary documents locally |
| Expensive fine-tuning for new domains | Swap knowledge bases without retraining |
| No source attribution | Retrieved chunks provide traceable citations |

📝 Note

RAG does not eliminate hallucination entirely. A model can still misinterpret retrieved context or generate unsupported inferences. However, RAG dramatically reduces the frequency and severity of hallucination compared to ungrounded generation.


15.2 Declaring a Knowledge Base #

In Neam, a knowledge base is declared with the knowledge keyword. Here is the minimal configuration:

neam
knowledge ProductDocs {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [
    { type: "text", content: "Neam is a domain-specific language for AI agent orchestration." },
    { type: "text", content: "Neam compiles to bytecode and runs on a custom VM." }
  ]
  retrieval_strategy: "basic"
  top_k: 3
}

Let us break down each field:

vector_store #

Specifies the vector index implementation. Currently Neam supports:

| Value | Implementation | Description |
|---|---|---|
| "usearch" | uSearch HNSW | High-performance approximate nearest neighbor search |

uSearch uses the Hierarchical Navigable Small World (HNSW) algorithm, which provides sub-millisecond search latency even on large document collections. The index is built in-memory at program startup.

embedding_model #

The model used to convert text into dense vector representations. Neam uses Ollama to serve embedding models locally:

| Model | Dimensions | Context | Notes |
|---|---|---|---|
| nomic-embed-text | 768 | 8192 tokens | Default. Excellent general-purpose embeddings. |

Before using RAG, pull the embedding model:

bash
ollama pull nomic-embed-text

The embedding model runs on your local machine via Ollama's embedding API (http://localhost:11434/api/embeddings). No cloud API key is required for embeddings.

chunk_size and chunk_overlap #

Documents are split into chunks before embedding. These two parameters control how that splitting works:

The quick brown fox jumps over the lazy dog. This is a long
document that needs to be split into manageable pieces for
embedding. Each piece should be small enough to fit in the
embedding model's context window...

Choosing chunk size: Smaller chunks (100-200) produce more focused embeddings but may lose context. Larger chunks (500-1000) retain more context but may dilute the relevance signal. A chunk size of 200 with overlap of 50 is a good starting point for most use cases.

💡 Tip

If your documents contain short, self-contained paragraphs (like FAQ entries), a smaller chunk size (100-150) works well. For narrative text or technical documentation, a larger chunk size (300-500) preserves more context.
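The splitting behavior these two parameters describe can be sketched as a word-based sliding window. This is an illustrative Python sketch, not the actual Neam chunker (the original does not specify whether sizes are counted in words or tokens):

```python
def chunk_text(text, chunk_size=200, chunk_overlap=50):
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the document
    return chunks

# A 500-word document with chunk_size=200 and overlap=50 yields
# windows starting at words 0, 150, and 300.
doc = " ".join(f"w{i}" for i in range(500))
chunks = chunk_text(doc)
```

Note how the 50-word overlap makes the tail of each chunk reappear at the head of the next, so a sentence falling on a chunk boundary is still seen whole by at least one chunk.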

sources #

The list of documents to index. Neam supports two source types:

Inline Text Sources #

neam
sources: [
  { type: "text", content: "Neam is a programming language for AI agents." },
  { type: "text", content: "Agents connect to LLM providers like OpenAI and Ollama." }
]

Use inline text for small, self-contained facts, FAQ entries, or test data.

File Sources #

neam
sources: [
  { type: "file", path: "./docs/README.md" },
  { type: "file", path: "./data/product_catalog.txt" }
]

File sources are read and chunked when the index is built at program startup. The path is relative to the working directory where neam is executed.

⚠️ Warning

File paths are resolved relative to the current working directory at runtime, not relative to the .neam source file. If you run your program from a different directory, ensure the paths still resolve correctly.

retrieval_strategy #

Specifies which retrieval algorithm to use when the agent queries the knowledge base. Neam supports eight strategies, covered in detail in Section 15.4.

top_k #

The number of document chunks to retrieve and include in the augmented prompt. The default is 4.


15.3 Connecting Knowledge to Agents #

A knowledge base becomes useful only when connected to an agent. Use the connected_knowledge property:

neam
knowledge ProductDocs {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [
    { type: "text", content: "Neam is a programming language designed for AI agent orchestration. It supports agents, handoffs, guardrails, and runners. Neam compiles to bytecode and runs on a custom virtual machine." },
    { type: "text", content: "To create an agent in Neam, use the 'agent' keyword followed by a name and configuration block. Agents can have a provider (openai, ollama), model, temperature, and system prompt." },
    { type: "text", content: "Handoffs allow agents to transfer control to other agents. Use the handoffs property in an agent declaration. The runner orchestrates the handoff flow with a max_turns limit." }
  ]
  retrieval_strategy: "basic"
  top_k: 3
}

agent DocAssistant {
  provider: "ollama"
  model: "llama3.2:3b"
  temperature: 0.3
  system: "You are a documentation assistant. Answer questions using only the provided context. Be concise."
  connected_knowledge: [ProductDocs]
}

{
  let answer = DocAssistant.ask("How do I create an agent in Neam?");
  emit "Q: How do I create an agent in Neam?";
  emit "A: " + answer;
}

When the agent receives a query via .ask(), the runtime performs these steps automatically:

  1. Embed the query using the knowledge base's embedding model.
  2. Search the vector store for the top_k most similar chunks.
  3. Inject the retrieved chunks into the prompt as context.
  4. Call the LLM with the augmented prompt.
  5. Return the response to the caller.

The agent's system prompt is preserved. The retrieved context is inserted between the system prompt and the user's message.
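Conceptually, that assembly looks like the following sketch. The message layout and the "Context:" delimiter are assumptions for illustration; the runtime's exact prompt format may differ:

```python
def build_augmented_prompt(system, retrieved_chunks, user_query):
    """Insert retrieved context between the system prompt and the user message."""
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": system},
        {"role": "system", "content": "Context:\n" + context},
        {"role": "user", "content": user_query},
    ]

messages = build_augmented_prompt(
    "You are a documentation assistant.",
    ["Neam compiles to bytecode.", "Agents connect to LLM providers."],
    "How does Neam run programs?",
)
```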

Connecting Multiple Knowledge Bases #

An agent can connect to multiple knowledge bases:

neam
knowledge TechDocs {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [
    { type: "file", path: "./docs/technical.md" }
  ]
  retrieval_strategy: "basic"
  top_k: 3
}

knowledge PolicyDocs {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 150
  chunk_overlap: 30
  sources: [
    { type: "file", path: "./docs/policies.md" }
  ]
  retrieval_strategy: "mmr"
  top_k: 2
}

agent SupportAgent {
  provider: "openai"
  model: "gpt-4o"
  system: "You are a customer support agent. Answer using the provided context."
  connected_knowledge: [TechDocs, PolicyDocs]
}

When an agent is connected to multiple knowledge bases, the runtime queries each one independently and merges the retrieved chunks before injecting them into the prompt.


15.4 Retrieval Strategies #

Neam supports eight retrieval strategies. Each makes different trade-offs between accuracy, diversity, latency, and cost. The following table provides an overview:

+-----------+-----------+---------+----------+---------------+
| Strategy  | LLM Calls | Latency | Accuracy | Best For      |
+-----------+-----------+---------+----------+---------------+
| basic     | 0         | Low     | Good     | Simple Q&A    |
| mmr       | 0         | Low     | Good     | Diverse docs  |
| hybrid    | 0         | Low     | Better   | Precise match |
| hyde      | 1         | Medium  | Better   | Abstract Q    |
| self_rag  | 1         | Medium  | High     | High accuracy |
| crag      | 1-3       | Medium  | High     | Complex Q     |
| agentic   | 2-5+      | High    | Highest  | Research      |
| graph_rag | 1-2       | Medium  | High     | Relationships |
+-----------+-----------+---------+----------+---------------+

Strategy 1: basic -- Standard Vector Similarity #

The default and simplest strategy. The query is embedded and compared against all document chunk embeddings using cosine similarity. The top-K most similar chunks are returned.

neam
knowledge BasicKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./readme.md" } ]
  retrieval_strategy: "basic"
  top_k: 3
}

How it works:

  1. Embed the query: q_vec = embed("What is Neam?")
  2. Compute cosine similarity against every chunk embedding.
  3. Return the 3 highest-scoring chunks.

When to use: Simple factual questions where the answer is likely contained in a single chunk. This is the fastest strategy with zero additional LLM calls.

Configuration options:

| Option | Default | Description |
|---|---|---|
| top_k | 4 | Number of chunks to retrieve |
| relevance_threshold | 0.5 | Minimum similarity score (0.0-1.0) |
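Steps 1-3, plus the relevance_threshold cutoff, can be sketched in plain Python. The 3-dimensional toy vectors here stand in for real 768-dimensional embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def basic_retrieve(query_vec, chunks, top_k=3, relevance_threshold=0.5):
    """Score every chunk by cosine similarity, filter, return the best top_k."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored = [(s, t) for s, t in scored if s >= relevance_threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

chunks = [
    ("Neam is a DSL for agents",  [0.9, 0.1, 0.0]),
    ("A chunk about cooking",     [0.0, 0.2, 0.9]),  # filtered: low similarity
    ("Neam compiles to bytecode", [0.8, 0.3, 0.1]),
]
results = basic_retrieve([1.0, 0.0, 0.0], chunks, top_k=2)
```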

Strategy 2: mmr -- Maximal Marginal Relevance #

MMR balances relevance and diversity. After finding the most relevant chunks, it penalizes chunks that are too similar to already-selected chunks. This produces a more diverse set of results.

neam
knowledge MMRKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./readme.md" } ]
  retrieval_strategy: "mmr"
  top_k: 3
  mmr_lambda: 0.7
}

How it works:

  1. Retrieve an initial candidate set (typically 2x top_k chunks).
  2. Select the first chunk (highest relevance to query).
  3. For each remaining selection, choose the chunk that maximizes: MMR = lambda * similarity(chunk, query) - (1 - lambda) * max_similarity(chunk, selected_chunks)
  4. Repeat until top_k chunks are selected.

The mmr_lambda parameter controls the trade-off: at 1.0, MMR reduces to pure relevance ranking (equivalent to basic); at 0.0 it maximizes diversity with no regard for relevance. The default of 0.7 leans toward relevance while still penalizing near-duplicate chunks.

When to use: When your knowledge base contains many similar or overlapping passages and you want the retrieved context to cover different aspects of the topic.
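The selection loop in step 3 can be written out directly from the MMR formula. In this toy example mmr_lambda is lowered to 0.5 so the diversity term visibly wins: the near-duplicate of the best chunk is skipped in favor of a different passage.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, candidates, top_k=2, mmr_lambda=0.7):
    """Greedily pick chunks maximizing lambda*relevance - (1-lambda)*redundancy."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < top_k:
        def mmr_score(item):
            relevance = cosine(query_vec, item[1])
            redundancy = max((cosine(item[1], s[1]) for s in selected), default=0.0)
            return mmr_lambda * relevance - (1 - mmr_lambda) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [text for text, _ in selected]

candidates = [
    ("agents intro",           [1.0, 0.90]),
    ("agents intro, restated", [1.0, 0.85]),  # near-duplicate of the first
    ("handoffs guide",         [0.5, 1.00]),
]
picked = mmr_select([1.0, 1.0], candidates, top_k=2, mmr_lambda=0.5)
```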


Strategy 3: hybrid -- Keyword + Vector Search #

Combines traditional keyword matching (BM25-style) with vector similarity search. This catches cases where semantically relevant documents use different vocabulary than the query.

neam
knowledge HybridKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./readme.md" } ]
  retrieval_strategy: "hybrid"
  top_k: 3
}

How it works:

  1. Run vector similarity search (same as basic).
  2. Run keyword/token matching against chunk text.
  3. Combine scores using reciprocal rank fusion.
  4. Return the top-K chunks from the fused ranking.

When to use: When queries contain specific technical terms, product names, error codes, or identifiers that should be matched exactly. Vector search might miss "ERR-4021" if no similar text exists in training data, but keyword search catches it.
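Step 3's reciprocal rank fusion is simple enough to show in full. Each ranked list contributes 1/(k + rank) per document; k = 60 is the constant from the original RRF paper and an assumption here, since Neam's value is not stated:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into a single ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_rank  = ["c2", "c1", "c3"]  # from embedding similarity
keyword_rank = ["c1", "c4", "c2"]  # from BM25-style keyword match
fused = reciprocal_rank_fusion([vector_rank, keyword_rank])
```

A chunk like c1 that sits near the top of both lists outranks c2, which leads only one of them.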


Strategy 4: hyde -- Hypothetical Document Embeddings #

HyDE generates a hypothetical answer to the query, embeds that answer, and uses it for retrieval instead of the raw query. The intuition is that a hypothetical answer is closer in embedding space to the actual answer than the question itself.

neam
knowledge HyDEKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./docs.md" } ]
  retrieval_strategy: "hyde"
  top_k: 3
  num_hypothetical: 1
}

How it works:

  1. Send the query to the LLM: "Write a short passage that would answer: [query]"
  2. Embed the hypothetical answer (not the original query).
  3. Search the vector store using the hypothetical embedding.
  4. Return the top-K actual document chunks.

Configuration options:

| Option | Default | Description |
|---|---|---|
| num_hypothetical | 1 | Number of hypothetical documents to generate |

When to use: Abstract or conceptual queries where the question phrasing is very different from how the answer would appear in the documents. For example, "What should I do when a customer is upset?" retrieves better documents when the hypothetical answer ("When dealing with an upset customer, first acknowledge their frustration...") is used as the search vector.

Trade-off: HyDE requires one additional LLM call, adding latency and cost. The hypothetical answer may also steer retrieval in the wrong direction if the LLM generates an incorrect hypothesis.
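The control flow is only a few lines once the model calls are abstracted away. In this sketch, llm, embed, and search are stand-in callables, not real Neam or Ollama APIs:

```python
def hyde_retrieve(query, llm, embed, search, top_k=3):
    """Embed a hypothetical answer instead of the raw query, then search."""
    hypothetical = llm(f"Write a short passage that would answer: {query}")
    return search(embed(hypothetical), top_k)

# Toy stand-ins to show the wiring:
def fake_llm(prompt):
    return "When dealing with an upset customer, first acknowledge their frustration."

def fake_embed(text):
    return [len(text), text.count("customer")]

def fake_search(vector, top_k):
    return [f"chunk-{i}" for i in range(top_k)]

docs = hyde_retrieve("What should I do when a customer is upset?",
                     fake_llm, fake_embed, fake_search)
```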


Strategy 5: self_rag -- Self-Reflective RAG #

Self-RAG adds a relevance check after retrieval. The LLM evaluates whether each retrieved chunk is actually relevant to the query, filtering out false positives before generating the answer.

neam
knowledge SelfRAGKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./readme.md" } ]
  retrieval_strategy: "self_rag"
  top_k: 4
  enable_relevance_check: true
  enable_support_check: true
}

How it works:

  1. Retrieve top-K chunks (same as basic).
  2. For each chunk, ask the LLM: "Is this chunk relevant to the query? Rate 0-1."
  3. Filter chunks below the relevance threshold.
  4. Generate the answer using only the validated chunks.
  5. (Optional) Check whether the answer is supported by the retrieved chunks.

Configuration options:

| Option | Default | Description |
|---|---|---|
| enable_relevance_check | true | Check each chunk's relevance before use |
| enable_support_check | true | Verify the answer is supported by context |

When to use: High-stakes applications (medical, legal, financial) where using irrelevant context could lead to harmful or misleading answers. The relevance check acts as a guardrail on the retrieval step.

Trade-off: Adds one LLM call for the relevance check. Can discard too many chunks if the threshold is too aggressive, leaving insufficient context.
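The filtering in steps 2-3 amounts to a guard between retrieval and generation. In this sketch, judge stands in for the LLM relevance call; the word-overlap judge is a toy substitute:

```python
def self_rag_filter(query, chunks, judge, threshold=0.5):
    """Keep only chunks the judge rates at or above the relevance threshold."""
    return [chunk for chunk in chunks if judge(query, chunk) >= threshold]

# Toy judge: rate a chunk by the fraction of query words it shares.
def word_overlap_judge(query, chunk):
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

chunks = [
    "Neam agents use handoffs to transfer control",
    "The cafeteria opens at nine",
]
kept = self_rag_filter("how do neam agents transfer control",
                       chunks, word_overlap_judge)
```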


Strategy 6: crag -- Corrective RAG #

CRAG (Corrective Retrieval Augmented Generation) adds query decomposition and iterative correction. If the initial retrieval does not produce confident results, CRAG decomposes the query into sub-queries and tries again.

neam
knowledge CRAGKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./readme.md" } ]
  retrieval_strategy: "crag"
  top_k: 3
  enable_query_decomposition: true
  max_corrections: 2
}

How it works:

  1. Retrieve top-K chunks for the original query.
  2. Evaluate retrieval confidence.
  3. If confidence is low: a. Decompose the query into sub-queries. b. Retrieve chunks for each sub-query. c. Merge and re-rank results.
  4. Repeat up to max_corrections times.
  5. Generate the answer using the refined context.

Configuration options:

| Option | Default | Description |
|---|---|---|
| enable_query_decomposition | true | Break complex queries into sub-queries |
| max_corrections | 2 | Maximum correction rounds |
| enable_web_fallback | false | Fall back to web search if local retrieval fails |

When to use: Complex, multi-part questions that cannot be answered by a single retrieval pass. For example: "Compare the performance characteristics of basic and agentic RAG strategies, and explain when to use each one."


Strategy 7: agentic -- Tool-Based Planning with Reflection #

Agentic RAG treats retrieval as an iterative research process. The LLM plans what information it needs, retrieves it, reflects on whether it has enough context, and repeats until satisfied.

neam
knowledge AgenticKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./research/*.md" } ]
  retrieval_strategy: "agentic"
  top_k: 5
  max_iterations: 3
  enable_reflection: true
}

How it works:

  1. The LLM analyzes the query and generates a retrieval plan.
  2. Execute the first retrieval based on the plan.
  3. The LLM reflects: "Do I have enough context to answer? What is missing?"
  4. If more context is needed, refine the search query and retrieve again.
  5. Repeat up to max_iterations times.
  6. Generate the final answer using all accumulated context.

Configuration options:

| Option | Default | Description |
|---|---|---|
| max_iterations | 5 | Maximum retrieval-reflection cycles |
| enable_reflection | true | Enable self-reflection between iterations |

When to use: Research tasks, deep-dive questions, or scenarios where a single retrieval pass is unlikely to surface all necessary information. This is the most thorough strategy but also the most expensive.

Trade-off: Multiple LLM calls per query (2 to max_iterations * 2). Best reserved for high-value queries where accuracy justifies the cost.


Strategy 8: graph_rag -- Knowledge Graph Retrieval #

Graph RAG builds a knowledge graph from your documents, extracting entities and relationships. Retrieval traverses the graph starting from entities mentioned in the query, producing richly connected context.

neam
knowledge GraphKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./docs/architecture.md" } ]
  retrieval_strategy: "graph_rag"
  top_k: 5
}

How it works:

  1. At index time, extract entities and relationships from each chunk using an LLM.
  2. Build a graph with entity nodes, document nodes, and relationship edges.
  3. Optionally detect communities (clusters of related entities).
  4. At query time, extract entities from the query.
  5. Traverse the graph from matched entities, collecting related nodes up to a configurable depth.
  6. Include community summaries for broader context.
  7. Generate the answer using the graph-derived context.

When to use: Documents with rich relationships between concepts -- organizational charts, technical architectures, legal contracts with cross-references, scientific papers with citation networks.
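Step 5, the depth-limited traversal, is an ordinary breadth-first search over the entity graph. The node names below are invented for illustration:

```python
from collections import deque

def traverse(graph, start_entities, max_depth=2):
    """Collect all nodes reachable from the start entities within max_depth hops."""
    seen = set(start_entities)
    frontier = deque((entity, 0) for entity in start_entities)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

graph = {
    "Neam":     ["VM", "Agent"],
    "VM":       ["Bytecode"],
    "Agent":    ["Handoff"],
    "Bytecode": ["Interpreter"],  # 3 hops from Neam: excluded at max_depth=2
}
context_nodes = traverse(graph, ["Neam"], max_depth=2)
```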

The Neam standard library provides graph construction utilities in std.rag.advanced.graph_rag:

neam
import std::rag::advanced::graph_rag;

// Create entities
let neam = graph_rag::node_entity("e1", "Neam", "Language", {
  "version": "0.5.0",
  "paradigm": "agentic"
});

let vm = graph_rag::node_entity("e2", "Virtual Machine", "Component", {
  "type": "bytecode interpreter"
});

// Create relationship
let runs_on = graph_rag::edge_related_to("e1", "e2", "runs_on", 0.95);

// Build graph
let graph = graph_rag::knowledge_graph();
graph = graph_rag::add_node(graph, neam);
graph = graph_rag::add_node(graph, vm);
graph = graph_rag::add_edge(graph, runs_on);

15.5 Strategy Selection Guide #

Choosing the right retrieval strategy depends on your use case. Here is a decision framework:

Start
  |
  v
Is the question simple and factual?
  |--- Yes --> Use "basic"
  |--- No
       |
       v
     Do your docs have many overlapping passages?
       |--- Yes --> Use "mmr"
       |--- No
            |
            v
          Does the query contain specific terms/codes?
            |--- Yes --> Use "hybrid"
            |--- No
                 |
                 v
               Is the query abstract or conceptual?
                 |--- Yes --> Use "hyde"
                 |--- No
                      |
                      v
                    Is accuracy critical (medical/legal)?
                      |--- Yes --> Use "self_rag"
                      |--- No
                           |
                           v
                         Is the query multi-part or complex?
                           |--- Yes --> Use "crag"
                           |--- No
                                |
                                v
                              Is this a research/deep-dive task?
                                |--- Yes --> Use "agentic"
                                |--- No
                                     |
                                     v
                                   Do docs have entity relationships?
                                     |--- Yes --> Use "graph_rag"
                                     |--- No  --> Use "basic"

15.6 Practical Walkthrough: Building a Documentation QA Bot #

Let us build a complete documentation assistant that answers questions about a project using its README and documentation files.

Step 1: Prepare the Knowledge Base #

neam
knowledge ProjectDocs {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 250
  chunk_overlap: 50
  sources: [
    { type: "file", path: "./README.md" },
    { type: "file", path: "./docs/AGENT_HANDOFFS_GUIDE.md" },
    { type: "text", content: "To build Neam, run: mkdir -p build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release && cmake --build . --parallel" },
    { type: "text", content: "Neam supports four LLM providers: OpenAI, Anthropic, Google Gemini, and Ollama. Set the corresponding API key environment variable before running." }
  ]
  retrieval_strategy: "hybrid"
  top_k: 4
}

Step 2: Define the Agent #

neam
agent DocsBot {
  provider: "ollama"
  model: "llama3.2:3b"
  temperature: 0.3
  system: "You are a documentation assistant for the Neam programming language.
           Answer questions using ONLY the provided context. If the context does
           not contain the answer, say 'I don't have that information in the docs.'
           Be concise and cite specific details from the context."
  connected_knowledge: [ProjectDocs]
}

Step 3: Build the Interactive Loop #

neam
{
  emit "=== Neam Documentation Assistant ===";
  emit "Ask questions about the Neam language.";
  emit "Type 'quit' to exit.";
  emit "";

  let running = true;
  while (running) {
    emit "Q: ";
    let question = input();

    if (question == "quit") {
      running = false;
    } else {
      let answer = DocsBot.ask(question);
      emit "A: " + answer;
      emit "";
    }
  }

  emit "Goodbye!";
}

Step 4: Compile and Run #

bash
# Prerequisites: Ollama with the required models
ollama pull llama3.2:3b
ollama pull nomic-embed-text

# Compile
./neamc docs_qa.neam -o docs_qa.neamb

# Run (from the project root so file paths resolve correctly)
./neam docs_qa.neamb

Step 5: Test with Sample Questions #

text
=== Neam Documentation Assistant ===
Ask questions about the Neam language.
Type 'quit' to exit.

Q: How do I build Neam?
A: To build Neam, run the following commands:
   mkdir -p build && cd build
   cmake .. -DCMAKE_BUILD_TYPE=Release
   cmake --build . --parallel

Q: What LLM providers does Neam support?
A: Neam supports four LLM providers: OpenAI, Anthropic, Google Gemini,
   and Ollama. You need to set the corresponding API key environment
   variable before running.

Q: How do handoffs work?
A: Handoffs allow agents to transfer control to other agents. You use
   the handoffs property in an agent declaration, and the runner
   orchestrates the handoff flow with a max_turns limit.

15.7 Comparing All Strategies Side by Side #

The following program queries the same knowledge base with seven of the eight strategies (graph_rag is omitted here) and compares the results. This is an excellent way to evaluate which strategy works best for your specific dataset.

neam
knowledge BasicKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./readme.md" } ]
  retrieval_strategy: "basic"
  top_k: 3
}

knowledge MMRKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./readme.md" } ]
  retrieval_strategy: "mmr"
  top_k: 3
  mmr_lambda: 0.7
}

knowledge HybridKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./readme.md" } ]
  retrieval_strategy: "hybrid"
  top_k: 3
}

knowledge HyDEKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./readme.md" } ]
  retrieval_strategy: "hyde"
  top_k: 3
  num_hypothetical: 1
}

knowledge SelfRAGKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./readme.md" } ]
  retrieval_strategy: "self_rag"
  top_k: 4
  enable_relevance_check: true
}

knowledge CRAGKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./readme.md" } ]
  retrieval_strategy: "crag"
  top_k: 3
  enable_query_decomposition: true
}

knowledge AgenticKB {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 200
  chunk_overlap: 50
  sources: [ { type: "file", path: "./readme.md" } ]
  retrieval_strategy: "agentic"
  top_k: 3
  max_iterations: 2
  enable_reflection: true
}

// One agent per strategy
agent BasicAgent {
  provider: "ollama"
  model: "qwen3:1.7b"
  system: "Answer in exactly one sentence."
  connected_knowledge: [BasicKB]
}

agent MMRAgent {
  provider: "ollama"
  model: "qwen3:1.7b"
  system: "Answer in exactly one sentence."
  connected_knowledge: [MMRKB]
}

agent HybridAgent {
  provider: "ollama"
  model: "qwen3:1.7b"
  system: "Answer in exactly one sentence."
  connected_knowledge: [HybridKB]
}

agent HyDEAgent {
  provider: "ollama"
  model: "qwen3:1.7b"
  system: "Answer in exactly one sentence."
  connected_knowledge: [HyDEKB]
}

agent SelfRAGAgent {
  provider: "ollama"
  model: "qwen3:1.7b"
  system: "Answer in exactly one sentence."
  connected_knowledge: [SelfRAGKB]
}

agent CRAGAgent {
  provider: "ollama"
  model: "qwen3:1.7b"
  system: "Answer in exactly one sentence."
  connected_knowledge: [CRAGKB]
}

agent AgenticAgent {
  provider: "ollama"
  model: "qwen3:1.7b"
  system: "Answer in exactly one sentence."
  connected_knowledge: [AgenticKB]
}

{
  emit "===============================================";
  emit "  NEAM RAG STRATEGIES - COMPREHENSIVE TEST    ";
  emit "===============================================";
  emit "";

  let q = "What is Neam?";
  emit "Question: " + q;
  emit "";
  emit "-----------------------------------------------";

  emit "1. BASIC (standard similarity):";
  emit "   " + BasicAgent.ask(q);
  emit "";

  emit "2. MMR (diversity-focused):";
  emit "   " + MMRAgent.ask(q);
  emit "";

  emit "3. HYBRID (keyword + vector):";
  emit "   " + HybridAgent.ask(q);
  emit "";

  emit "4. HYDE (hypothetical document):";
  emit "   " + HyDEAgent.ask(q);
  emit "";

  emit "5. SELF-RAG (relevance-checked):";
  emit "   " + SelfRAGAgent.ask(q);
  emit "";

  emit "6. CRAG (query decomposition):";
  emit "   " + CRAGAgent.ask(q);
  emit "";

  emit "7. AGENTIC (iterative refinement):";
  emit "   " + AgenticAgent.ask(q);
  emit "";

  emit "===============================================";
  emit "         ALL STRATEGIES TEST COMPLETE          ";
  emit "===============================================";
}

15.8 Tuning Chunk Size and Overlap #

The chunk size and overlap parameters have a significant impact on retrieval quality. Here is a practical guide for tuning them:

Experiment Setup #

neam
knowledge SmallChunks {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 100
  chunk_overlap: 20
  sources: [ { type: "file", path: "./docs/README.md" } ]
  retrieval_strategy: "basic"
  top_k: 5
}

knowledge MediumChunks {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 250
  chunk_overlap: 50
  sources: [ { type: "file", path: "./docs/README.md" } ]
  retrieval_strategy: "basic"
  top_k: 3
}

knowledge LargeChunks {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 500
  chunk_overlap: 100
  sources: [ { type: "file", path: "./docs/README.md" } ]
  retrieval_strategy: "basic"
  top_k: 2
}

Guidelines #

| Document Type | Recommended Chunk Size | Recommended Overlap | Reasoning |
|---|---|---|---|
| FAQ entries | 100-150 | 20-30 | Each entry is self-contained |
| Technical docs | 200-300 | 50-75 | Need enough context for code examples |
| Narrative text | 400-600 | 100-150 | Preserve paragraph coherence |
| Legal documents | 300-500 | 75-125 | Clauses often reference each other |
| API references | 150-250 | 30-50 | Each endpoint description is independent |

💡 Tip

When in doubt, start with chunk_size: 200 and chunk_overlap: 50. These values work well for most technical documentation. Adjust based on the quality of retrieved results.


15.9 Advanced Pattern: RAG with Multiple Sources and Strategies #

A common production pattern uses different knowledge bases with different strategies for different types of content:

neam
knowledge FAQs {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 150
  chunk_overlap: 30
  sources: [
    { type: "file", path: "./docs/faq.md" }
  ]
  retrieval_strategy: "hybrid"
  top_k: 3
}

knowledge TechDocs {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 300
  chunk_overlap: 75
  sources: [
    { type: "file", path: "./docs/architecture.md" },
    { type: "file", path: "./docs/api-reference.md" }
  ]
  retrieval_strategy: "mmr"
  top_k: 4
  mmr_lambda: 0.6
}

knowledge Research {
  vector_store: "usearch"
  embedding_model: "nomic-embed-text"
  chunk_size: 400
  chunk_overlap: 100
  sources: [
    { type: "file", path: "./papers/survey.md" }
  ]
  retrieval_strategy: "agentic"
  top_k: 5
  max_iterations: 3
  enable_reflection: true
}

agent ResearchAssistant {
  provider: "openai"
  model: "gpt-4o"
  system: "You are a research assistant. Use the provided context to give
           thorough, well-sourced answers. Cite specific sections when possible."
  connected_knowledge: [FAQs, TechDocs, Research]
}

{
  let answer = ResearchAssistant.ask(
    "How does Neam's agentic RAG strategy compare to standard vector search?"
  );
  emit answer;
}

15.10 Performance Considerations #

Indexing Time #

Indexing happens at program startup. The time depends on the total volume of source text, the chunk size (smaller chunks mean more embedding calls), and the throughput of the embedding model.

For a 50-page document with chunk_size: 200, expect roughly 100-200 chunks and 2-5 seconds of indexing time on a modern machine.

Query Latency by Strategy #

| Strategy | Typical Latency | LLM Calls | Notes |
|---|---|---|---|
| basic | 50-100ms | 0 extra | Vector search only |
| mmr | 50-100ms | 0 extra | Slightly more computation |
| hybrid | 50-150ms | 0 extra | Keyword + vector |
| hyde | 1-3s | 1 extra | One LLM call for hypothesis |
| self_rag | 1-3s | 1 extra | One LLM call for relevance check |
| crag | 2-8s | 1-3 extra | Decomposition + correction |
| agentic | 3-15s | 2-5+ extra | Iterative retrieval |
| graph_rag | 1-5s | 1-2 extra | Graph traversal + LLM |

Memory Usage #

The uSearch HNSW index resides in memory. For 1,000 chunks with 768-dimensional float32 embeddings, expect approximately 3 MB of memory for the index (1,000 × 768 × 4 bytes ≈ 3.1 MB of raw vectors, plus graph overhead). This scales linearly with the number of chunks.


15.11 Embedding Providers #

The examples throughout this chapter use nomic-embed-text via Ollama, but the Neam standard library (std.ingest.embed) supports a wide range of embedding providers for production deployments:

| Provider | Models | Dimensions | Notes |
|---|---|---|---|
| OpenAI | text-embedding-3-small, text-embedding-3-large, ada-002 | 1536-3072 | Cloud API, highest quality |
| Cohere | embed-english-v3.0, embed-multilingual-v3.0 | 1024 | Strong multilingual support |
| Voyage AI | voyage-2, voyage-large-2, voyage-code-2 | 1024-1536 | Specialized for code retrieval |
| Ollama | nomic-embed-text, mxbai-embed-large | 768-1024 | Free, local, no API key |
| Local | all-MiniLM-L6-v2 (sentence-transformers), bge-base, e5-base | 384-768 | Self-hosted models |

You can use the standard library's batch embedder for large document sets. It supports rate limiting and caching to manage API costs:

neam
import std::ingest::embed;

let embedder = embed::create_embedder({
  "provider": "openai",
  "model": "text-embedding-3-small",
  "batch_size": 100,
  "cache_enabled": true
});

let vectors = embed::batch_embed(embedder, chunks);

The library also provides similarity functions for comparing vectors directly.
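A minimal sketch of direct vector comparison, assuming helper functions `embed_text` and `cosine_similarity` in the embed module (both names are assumptions, not confirmed API):

```neam
import std::ingest::embed;

let embedder = embed::create_embedder({
  "provider": "ollama",
  "model": "nomic-embed-text"
});

// embed_text and cosine_similarity are assumed helper names
let a = embed::embed_text(embedder, "vector databases");
let b = embed::embed_text(embedder, "semantic search");

// Cosine similarity ranges over [-1, 1]; higher means more semantically similar
let score = embed::cosine_similarity(a, b);
emit "Similarity: " + str(score);
```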

💡 Tip

For most applications, start with nomic-embed-text via Ollama during development (free, local, no API key), then switch to text-embedding-3-small or embed-english-v3.0 for production quality.


15.12 Vector Store Options #

The knowledge declaration uses uSearch (in-memory HNSW) by default, but the standard library (std.ingest.store) provides connectors for production-grade vector databases:

| Store | Type | Scaling | Best For |
|---|---|---|---|
| uSearch | In-memory HNSW | Single process | Development, small corpora |
| Pinecone | Cloud managed | Serverless | Production SaaS, no ops |
| Qdrant | Self-hosted / Cloud | Horizontal | High-performance, filtering |
| Weaviate | Self-hosted / Cloud | Horizontal | Multi-modal, module ecosystem |
| Milvus | Self-hosted / Cloud | Horizontal | Massive-scale (billions) |
| ChromaDB | Embedded / Server | Single node | Prototyping, simple setup |
| PgVector | PostgreSQL extension | PostgreSQL cluster | Existing Postgres stack |
| SqliteVec | SQLite extension | Single file | Edge deployment, embedded apps |

To use a cloud vector store, configure your knowledge base with the appropriate store and connection settings. The standard library provides a unified filter expression syntax that works across all stores:

neam
import std::ingest::store;

let qdrant = store::create_store({
  "type": "qdrant",
  "url": "http://localhost:6333",
  "collection": "product_docs"
});

// Filter expressions work across all backends
let filter = store::filter_and([
  store::filter_eq("category", "technical"),
  store::filter_gte("updated_at", "2025-01-01")
]);

let results = store::search(qdrant, query_vector, 5, filter);

📝 Note

The knowledge declaration currently uses the built-in uSearch store directly. To use external vector stores, work with the std.ingest.store module in your program logic alongside the knowledge declaration.


15.13 Advanced Chunking Strategies #

The chunk_size / chunk_overlap fields in the knowledge declaration use fixed-size character splitting. For finer control, the standard library (std.ingest.chunk) provides nine chunking strategies:

| Strategy | Description | Best For |
|---|---|---|
| Fixed | Fixed character size with overlap | General-purpose (default) |
| Sentence | Split on sentence boundaries | Narrative text, articles |
| Paragraph | Split on paragraph boundaries | Well-structured documents |
| Recursive | Hierarchical splitting (text → paragraph → sentence → character) | Mixed content |
| Semantic | Split when embedding similarity drops | Maintaining topic coherence |
| Code | Split on function/class/block boundaries | Source code files |
| Markdown | Split on headers, preserve code blocks | Documentation, READMEs |
| Sliding window | Overlapping windows at fixed intervals | Dense information extraction |
| Hybrid | Multiple strategies with content-type matchers | Multi-format corpora |

neam
import std::ingest::chunk;

// Semantic chunking groups text by topic similarity
let semantic_chunker = chunk::create_chunker({
  "strategy": "semantic",
  "similarity_threshold": 0.75,
  "min_chunk_size": 100,
  "max_chunk_size": 500
});

let chunks = chunk::split(semantic_chunker, document_text);

The recursive chunker is particularly effective for mixed-content documents. It tries progressively finer split points, starting with double newlines (paragraphs), then single newlines, then sentences, then characters:

neam
let recursive_chunker = chunk::create_chunker({
  "strategy": "recursive",
  "separators": ["\n\n", "\n", ". ", " "],
  "chunk_size": 300,
  "chunk_overlap": 50
});
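For documentation and READMEs, the markdown strategy from the table above splits on headers while keeping code blocks intact. A sketch, assuming the same create_chunker interface (the max_header_level option is an assumption):

```neam
// Markdown-aware chunking: header boundaries, code blocks kept whole
let md_chunker = chunk::create_chunker({
  "strategy": "markdown",
  "max_header_level": 3,
  "chunk_size": 300
});

let chunks = chunk::split(md_chunker, readme_text);
```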

15.14 Reranking for Better Results #

Initial retrieval returns an approximate set of relevant chunks. Reranking applies a more expensive but more accurate model to re-score and re-order those results. The standard library (std.rag.reranker) supports four reranking methods:

| Method | Approach | Quality | Latency |
|---|---|---|---|
| Cross-encoder | BERT-style pairwise scoring | Highest | High |
| ColBERT | Late-interaction token matching | High | Medium |
| Cohere | API-based neural reranking | High | Medium |
| LLM-based | Ask an LLM to rank results | Variable | High |

neam
import std::rag::reranker;

let reranker_config = reranker::create_reranker({
  "method": "cross_encoder",
  "model": "cross-encoder/ms-marco-MiniLM-L-12-v2",
  "top_k": 3,
  "min_score": 0.5
});

// Rerank retrieved chunks before passing to the agent
let reranked = reranker::rerank(reranker_config, query, initial_chunks);

When to add reranking:

- initial retrieval returns many near-duplicate or marginally relevant chunks
- answer quality matters more than the added latency of a second scoring pass
- you retrieve a large candidate set (e.g. top_k: 20) and want only the best few chunks in the prompt


15.15 The Document Ingestion Pipeline #

For production systems with large, diverse document collections, the standard library provides a complete ingestion pipeline (std.ingest.pipeline) that automates the flow from raw documents to indexed vectors:

Source → Parser → Chunker → Embedder → Store

Each stage is independently configurable:

neam
import std::ingest::pipeline;
import std::ingest::parser;
import std::ingest::chunk;
import std::ingest::embed;
import std::ingest::store;

let pipe = pipeline::create_pipeline({
  "parser": parser::create_parser({ "type": "auto" }),
  "chunker": chunk::create_chunker({
    "strategy": "recursive",
    "chunk_size": 300,
    "chunk_overlap": 50
  }),
  "embedder": embed::create_embedder({
    "provider": "openai",
    "model": "text-embedding-3-small"
  }),
  "store": store::create_store({
    "type": "qdrant",
    "url": "http://localhost:6333",
    "collection": "docs"
  })
});

let result = pipeline::ingest(pipe, [
  { "type": "file", "path": "./docs/manual.pdf" },
  { "type": "file", "path": "./docs/api-reference.md" },
  { "type": "file", "path": "./data/faq.csv" }
]);

emit "Ingested " + str(result["chunks_created"]) + " chunks";
emit "Cost: $" + str(result["cost"]);

The pipeline supports incremental ingestion -- only new or modified documents are processed on subsequent runs. It tracks document fingerprints to detect changes.
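Incremental ingestion needs no extra configuration: rerunning the same ingest call reprocesses only documents whose fingerprints have changed. A sketch of a second run, reusing the pipe from above:

```neam
// Second run: only ./docs/manual.pdf was edited since the first ingest,
// so api-reference.md is skipped based on its stored fingerprint
let rerun = pipeline::ingest(pipe, [
  { "type": "file", "path": "./docs/manual.pdf" },
  { "type": "file", "path": "./docs/api-reference.md" }
]);

emit "New or changed chunks: " + str(rerun["chunks_created"]);
```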

Supported Document Parsers #

The std.ingest.parser module includes parsers for common document formats:

| Parser | Formats | Notes |
|---|---|---|
| Text | .txt, .md, .csv | Plain text extraction |
| PDF | .pdf | Text extraction with layout preservation |
| Office | .docx, .xlsx, .pptx | Microsoft Office formats |
| Code | .py, .js, .rs, .neam, etc. | Language-aware parsing |
| Image | .png, .jpg, .tiff | OCR-based text extraction |
| Audio | .mp3, .wav | Transcription-based parsing |
| HTML | .html | Content extraction, tag stripping |
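Parsers can also be used standalone, outside the pipeline. A sketch, assuming a parser::parse(parser, source) entry point and a "text" field on the result (both hypothetical names):

```neam
import std::ingest::parser;

// "auto" dispatches on file extension; an explicit type like "pdf" forces a parser
let pdf_parser = parser::create_parser({ "type": "pdf" });

// parse and the "text" result field are assumed names, not confirmed API
let doc = parser::parse(pdf_parser, { "type": "file", "path": "./docs/manual.pdf" });
emit doc["text"];
```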

Source Connectors #

Documents can be loaded from various sources beyond the local filesystem:

| Source | Description |
|---|---|
| File | Local filesystem paths |
| S3 | Amazon S3 buckets |
| GCS | Google Cloud Storage |
| Azure | Azure Blob Storage |
| HTTP | Remote URLs |
| Database | SQL query results |
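Remote sources plug into the same ingest call as file sources. A sketch of mixing S3 and HTTP sources, assuming bucket/prefix and url fields on the source objects (field names are assumptions):

```neam
// S3 and HTTP sources alongside local files; field names are assumed
let result = pipeline::ingest(pipe, [
  { "type": "s3", "bucket": "support-docs", "prefix": "manuals/" },
  { "type": "http", "url": "https://example.com/changelog.html" },
  { "type": "file", "path": "./docs/faq.md" }
]);

emit "Ingested " + str(result["chunks_created"]) + " chunks";
```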

Summary #

In this chapter you learned:

- what Retrieval-Augmented Generation is and how it grounds answers to reduce hallucination
- how to declare a knowledge block and connect it to an agent with connected_knowledge
- the eight retrieval strategies, their latency profiles, and when to use each
- how chunk_size and chunk_overlap affect retrieval quality
- the standard library modules for embedding (std.ingest.embed), vector stores (std.ingest.store), chunking (std.ingest.chunk), reranking (std.rag.reranker), and ingestion pipelines (std.ingest.pipeline)

Exercises #

Exercise 15.1: Basic RAG Bot #

Create a knowledge base from three inline text sources about a topic you know well (e.g., a programming language, a cooking recipe, a historical event). Connect it to an agent using the basic strategy and test with five questions. Record which questions are answered correctly and which are not.

Exercise 15.2: Strategy Comparison #

Take the knowledge base from Exercise 15.1 and create seven copies, each using a different retrieval strategy (basic, mmr, hybrid, hyde, self_rag, crag, agentic). Run the same five questions through all seven agents. Create a table comparing the quality of answers across strategies. Which strategy performed best for your dataset? Why?

Exercise 15.3: Chunk Size Experiment #

Using a single file source (at least 2000 characters), create three knowledge bases with different chunk sizes: 100, 250, and 500. Keep the overlap at 25% of the chunk size. Test with the same set of questions and observe how chunk size affects answer quality. Write a brief analysis of your findings.

Exercise 15.4: Multi-Source Documentation Bot #

Build a documentation QA bot that indexes at least three different files from a real project (e.g., README, API docs, contribution guide). Use the hybrid strategy. Include at least two inline text sources for information not covered in the files. Demonstrate the bot answering questions that require information from different sources.

Exercise 15.5: Production RAG Architecture #

Design (on paper or in code) a RAG architecture for a customer support system with the following requirements:

- multiple document sources, including a product manual, FAQ pages, and an archive of support tickets
- the support ticket archive is updated daily

For each document source, justify your choice of chunk size, overlap, and retrieval strategy. Explain how you would handle the daily update of support tickets.

Exercise 15.6: Graph RAG Exploration #

Using the std.rag.advanced.graph_rag module, build a small knowledge graph with at least 5 entities and 8 relationships. Implement a function that, given an entity name, returns all entities within 2 hops. Test your graph with queries that require understanding relationships between entities.
