Chapter 15: RAG and Knowledge Bases #
"The best way to reduce hallucination is to give the model something true to talk about."
Large language models are powerful, but they have a fundamental limitation: their knowledge is frozen at training time. Ask a model about your company's internal documentation, last week's policy changes, or a dataset that did not exist when the model was trained, and it will either refuse to answer or -- worse -- fabricate a confident, plausible-sounding response. This phenomenon is called hallucination, and it is one of the largest barriers to deploying LLMs in production.
Retrieval-Augmented Generation (RAG) solves this problem by injecting relevant documents into the prompt at query time. Instead of relying solely on parametric memory (the model's weights), a RAG system retrieves context from an external knowledge base and presents it alongside the user's question. The model then grounds its answer in the retrieved material.
Neam makes RAG a first-class language construct. You declare a knowledge block, connect
it to an agent, and the entire retrieval pipeline -- chunking, embedding, indexing,
querying, and context injection -- is handled by the runtime. This chapter teaches you
how to build knowledge-augmented agents from the ground up, starting with the simplest
configuration and progressing through all eight retrieval strategies.
15.1 What Is RAG? #
Retrieval-Augmented Generation was introduced by Lewis et al. (2020) as a technique that combines a retrieval component with a generative model. The core idea is straightforward:
- Index a corpus of documents into a searchable store.
- Retrieve the most relevant documents for a given query.
- Augment the LLM prompt with the retrieved documents.
- Generate an answer grounded in the retrieved context.
The key insight is that the retrieval step is non-parametric -- it does not depend on the model's training data. You can update the knowledge base at any time, and the next query will immediately reflect the new information.
Why RAG Matters #
| Problem Without RAG | How RAG Solves It |
|---|---|
| Hallucination on domain-specific questions | Grounds answers in retrieved facts |
| Knowledge frozen at training cutoff | Knowledge base can be updated in real time |
| No access to private/internal data | Index proprietary documents locally |
| Expensive fine-tuning for new domains | Swap knowledge bases without retraining |
| No source attribution | Retrieved chunks provide traceable citations |
RAG does not eliminate hallucination entirely. A model can still misinterpret retrieved context or generate unsupported inferences. However, RAG dramatically reduces the frequency and severity of hallucination compared to ungrounded generation.
15.2 Declaring a Knowledge Base #
In Neam, a knowledge base is declared with the knowledge keyword. Here is the minimal
configuration:
knowledge ProductDocs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [
{ type: "text", content: "Neam is a domain-specific language for AI agent orchestration." },
{ type: "text", content: "Neam compiles to bytecode and runs on a custom VM." }
]
retrieval_strategy: "basic"
top_k: 3
}
Let us break down each field:
vector_store #
Specifies the vector index implementation. Currently Neam supports:
| Value | Implementation | Description |
|---|---|---|
| "usearch" | uSearch HNSW | High-performance approximate nearest neighbor search |
uSearch uses the Hierarchical Navigable Small World (HNSW) algorithm, which provides sub-millisecond search latency even on large document collections. The index is built in-memory at program startup.
embedding_model #
The model used to convert text into dense vector representations. Neam uses Ollama to serve embedding models locally:
| Model | Dimensions | Context | Notes |
|---|---|---|---|
| nomic-embed-text | 768 | 8192 tokens | Default. Excellent general-purpose embeddings. |
Before using RAG, pull the embedding model:
ollama pull nomic-embed-text
The embedding model runs on your local machine via Ollama's embedding API
(http://localhost:11434/api/embeddings). No cloud API key is required for embeddings.
chunk_size and chunk_overlap #
Documents are split into chunks before embedding. These two parameters control how that splitting works:
- chunk_size: The maximum number of characters per chunk.
- chunk_overlap: The number of characters that overlap between consecutive chunks.
Choosing chunk size: Smaller chunks (100-200) produce more focused embeddings but may lose context. Larger chunks (500-1000) retain more context but may dilute the relevance signal. A chunk size of 200 with overlap of 50 is a good starting point for most use cases.
If your documents contain short, self-contained paragraphs (like FAQ entries), a smaller chunk size (100-150) works well. For narrative text or technical documentation, a larger chunk size (300-500) preserves more context.
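The Neam runtime performs this splitting automatically, but the sliding-window logic is easy to picture. The following Python sketch illustrates character-based chunking with overlap; it is an illustration of the technique, not the runtime's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap between neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # how far the window advances per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = chunk_text("a" * 500, chunk_size=200, chunk_overlap=50)
# windows start at 0, 150, 300 -> 3 chunks; adjacent chunks share 50 characters
```

Because each window advances by `chunk_size - chunk_overlap` characters, a larger overlap produces more chunks (and more embeddings to compute) for the same document.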
sources #
The list of documents to index. Neam supports two source types:
Inline Text Sources #
sources: [
{ type: "text", content: "Neam is a programming language for AI agents." },
{ type: "text", content: "Agents connect to LLM providers like OpenAI and Ollama." }
]
Use inline text for small, self-contained facts, FAQ entries, or test data.
File Sources #
sources: [
{ type: "file", path: "./docs/README.md" },
{ type: "file", path: "./data/product_catalog.txt" }
]
File sources read the file when the knowledge base is indexed at program startup and chunk its contents. The path is relative to the working directory where neam is executed.
File paths are resolved relative to the current working directory at
runtime, not relative to the .neam source file. If you run your program from a
different directory, ensure the paths still resolve correctly.
retrieval_strategy #
Specifies which retrieval algorithm to use when the agent queries the knowledge base. Neam supports eight strategies, covered in detail in Section 15.4.
top_k #
The number of document chunks to retrieve and include in the augmented prompt. The default is 4.
- Low top_k (1-3): Faster, less context. Good for simple factual questions.
- High top_k (5-10): More context, but increases prompt size and cost. Good for complex questions that require synthesizing multiple sources.
15.3 Connecting Knowledge to Agents #
A knowledge base becomes useful only when connected to an agent. Use the
connected_knowledge property:
knowledge ProductDocs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [
{ type: "text", content: "Neam is a programming language designed for AI agent orchestration. It supports agents, handoffs, guardrails, and runners. Neam compiles to bytecode and runs on a custom virtual machine." },
{ type: "text", content: "To create an agent in Neam, use the 'agent' keyword followed by a name and configuration block. Agents can have a provider (openai, ollama), model, temperature, and system prompt." },
{ type: "text", content: "Handoffs allow agents to transfer control to other agents. Use the handoffs property in an agent declaration. The runner orchestrates the handoff flow with a max_turns limit." }
]
retrieval_strategy: "basic"
top_k: 3
}
agent DocAssistant {
provider: "ollama"
model: "llama3.2:3b"
temperature: 0.3
system: "You are a documentation assistant. Answer questions using only the provided context. Be concise."
connected_knowledge: [ProductDocs]
}
{
let answer = DocAssistant.ask("How do I create an agent in Neam?");
emit "Q: How do I create an agent in Neam?";
emit "A: " + answer;
}
When the agent receives a query via .ask(), the runtime performs these steps
automatically:
- Embed the query using the knowledge base's embedding model.
- Search the vector store for the top_k most similar chunks.
- Inject the retrieved chunks into the prompt as context.
- Call the LLM with the augmented prompt.
- Return the response to the caller.
The agent's system prompt is preserved. The retrieved context is inserted between the system prompt and the user's message.
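The five runtime steps can be sketched as a single function. In this Python illustration, `embed_fn`, `search_fn`, and `llm_fn` are stand-ins for the embedding model, the vector store, and the chat model; all of the names here are illustrative, not Neam runtime APIs:

```python
def rag_ask(question, system_prompt, embed_fn, search_fn, llm_fn, top_k=3):
    """Minimal RAG flow: embed the query, retrieve chunks, inject context, call the LLM."""
    q_vec = embed_fn(question)             # 1. embed the query
    chunks = search_fn(q_vec, top_k)       # 2. top_k most similar chunks
    context = "\n\n".join(chunks)          # 3. context goes between the system
    prompt = (                             #    prompt and the user's message
        f"{system_prompt}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm_fn(prompt)                  # 4-5. call the LLM, return the answer

# Stubbed components that only demonstrate the data flow:
answer = rag_ask(
    "What is Neam?",
    "Answer from context only.",
    embed_fn=lambda q: [0.1, 0.2],
    search_fn=lambda vec, k: ["Neam is a DSL for agents."],
    llm_fn=lambda p: "Neam is a DSL for agents.",
)
```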
Connecting Multiple Knowledge Bases #
An agent can connect to multiple knowledge bases:
knowledge TechDocs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [
{ type: "file", path: "./docs/technical.md" }
]
retrieval_strategy: "basic"
top_k: 3
}
knowledge PolicyDocs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 150
chunk_overlap: 30
sources: [
{ type: "file", path: "./docs/policies.md" }
]
retrieval_strategy: "mmr"
top_k: 2
}
agent SupportAgent {
provider: "openai"
model: "gpt-4o"
system: "You are a customer support agent. Answer using the provided context."
connected_knowledge: [TechDocs, PolicyDocs]
}
When an agent is connected to multiple knowledge bases, the runtime queries each one independently and merges the retrieved chunks before injecting them into the prompt.
15.4 Retrieval Strategies #
Neam supports eight retrieval strategies. Each makes different trade-offs between accuracy, diversity, latency, and cost. The following table provides an overview:
| Strategy | LLM Calls | Latency | Accuracy | Best For |
|---|---|---|---|---|
| basic | 0 | Low | Good | Simple Q&A |
| mmr | 0 | Low | Good | Diverse docs |
| hybrid | 0 | Low | Better | Precise match |
| hyde | 1 | Medium | Better | Abstract Q |
| self_rag | 1 | Medium | High | High accuracy |
| crag | 1-3 | Medium | High | Complex Q |
| agentic | 2-5+ | High | Highest | Research |
| graph_rag | 1-2 | Medium | High | Relationships |
Strategy 1: basic -- Standard Vector Similarity #
The default and simplest strategy. The query is embedded and compared against all document chunk embeddings using cosine similarity. The top-K most similar chunks are returned.
knowledge BasicKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "basic"
top_k: 3
}
How it works:
- Embed the query: q_vec = embed("What is Neam?")
- Compute cosine similarity against every chunk embedding.
- Return the 3 highest-scoring chunks.
When to use: Simple factual questions where the answer is likely contained in a single chunk. This is the fastest strategy with zero additional LLM calls.
Configuration options:
| Option | Default | Description |
|---|---|---|
| top_k | 4 | Number of chunks to retrieve |
| relevance_threshold | 0.5 | Minimum similarity score (0.0-1.0) |
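The ranking step of the basic strategy is just cosine similarity plus a sort. A minimal Python sketch follows; it is exact (brute-force) rather than HNSW-approximate, and the function names and toy vectors are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def basic_retrieve(q_vec, chunk_vecs, chunks, top_k=3, relevance_threshold=0.5):
    """Score every chunk against the query, drop low scores, keep the top_k."""
    scored = [(cosine(q_vec, v), c) for v, c in zip(chunk_vecs, chunks)]
    scored = [(s, c) for s, c in scored if s >= relevance_threshold]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored[:top_k]]

docs = ["agents", "handoffs", "guardrails"]
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(basic_retrieve([1.0, 0.0], vecs, docs, top_k=2))  # ['agents', 'handoffs']
```

The `relevance_threshold` filter is why an orthogonal chunk ("guardrails" above, similarity 0.0) never reaches the prompt even when top_k would allow it.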
Strategy 2: mmr -- Maximal Marginal Relevance #
MMR balances relevance and diversity. After finding the most relevant chunks, it penalizes chunks that are too similar to already-selected chunks. This produces a more diverse set of results.
knowledge MMRKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "mmr"
top_k: 3
mmr_lambda: 0.7
}
How it works:
- Retrieve an initial candidate set (typically 2x top_k chunks).
- Select the first chunk (highest relevance to the query).
- For each remaining slot, choose the chunk that maximizes:
  MMR = lambda * similarity(chunk, query) - (1 - lambda) * max_similarity(chunk, selected_chunks)
- Repeat until top_k chunks are selected.
The mmr_lambda parameter:
- 1.0 = pure relevance (equivalent to the basic strategy)
- 0.0 = pure diversity (maximally different chunks)
- 0.5 = balanced (default)
- 0.7 = relevance-weighted but still diverse (recommended)
When to use: When your knowledge base contains many similar or overlapping passages and you want the retrieved context to cover different aspects of the topic.
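The greedy selection loop behind MMR can be sketched directly from the formula above. This Python illustration (helper names and toy vectors are invented for the example) shows how a near-duplicate chunk gets skipped in favor of a more diverse one:

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(q_vec, cand_vecs, cands, top_k=3, lam=0.7):
    """Greedy MMR: trade relevance to the query against redundancy with picks so far."""
    selected, selected_vecs = [], []
    remaining = list(range(len(cands)))
    while remaining and len(selected) < top_k:
        def score(i):
            relevance = _cos(q_vec, cand_vecs[i])
            redundancy = max((_cos(cand_vecs[i], v) for v in selected_vecs), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(cands[best])
        selected_vecs.append(cand_vecs[best])
    return selected

docs = ["chunk A", "chunk A'", "chunk B"]              # A and A' are near-duplicates
vecs = [[1.0, 0.0], [0.99, 0.01], [0.6, 0.8]]
print(mmr_select([0.9, 0.3], vecs, docs, top_k=2))     # ["chunk A'", "chunk B"]
```

With lam = 0.7, "chunk A" loses its second-round score to the redundancy penalty (it is almost identical to the already-selected "chunk A'"), so the more distinct "chunk B" is chosen instead.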
Strategy 3: hybrid -- Keyword + Vector Search #
Combines traditional keyword matching (BM25-style) with vector similarity search. This catches cases where semantically relevant documents use different vocabulary than the query.
knowledge HybridKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "hybrid"
top_k: 3
}
How it works:
- Run vector similarity search (same as basic).
- Run keyword/token matching against chunk text.
- Combine scores using reciprocal rank fusion.
- Return the top-K chunks from the fused ranking.
When to use: When queries contain specific technical terms, product names, error codes, or identifiers that should be matched exactly. Vector search might miss "ERR-4021" if no similar text exists in training data, but keyword search catches it.
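Reciprocal rank fusion, the score-combination step above, depends only on each document's rank in each ranking, not on the raw scores. A small Python sketch (the constant k = 60 is the commonly used default; document names are invented):

```python
def rrf_fuse(vector_ranking, keyword_ranking, k=60, top_k=3):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_k]

vector_hits = ["doc_semantics", "doc_overview", "doc_errors"]
keyword_hits = ["doc_errors", "doc_semantics", "doc_changelog"]
print(rrf_fuse(vector_hits, keyword_hits))
# ['doc_semantics', 'doc_errors', 'doc_overview']
```

Documents that appear in both rankings ("doc_semantics", "doc_errors") accumulate two reciprocal-rank terms and rise above documents that score well in only one ranking.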
Strategy 4: hyde -- Hypothetical Document Embeddings #
HyDE generates a hypothetical answer to the query, embeds that answer, and uses it for retrieval instead of the raw query. The intuition is that a hypothetical answer is closer in embedding space to the actual answer than the question itself.
knowledge HyDEKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./docs.md" } ]
retrieval_strategy: "hyde"
top_k: 3
num_hypothetical: 1
}
How it works:
- Send the query to the LLM: "Write a short passage that would answer: [query]"
- Embed the hypothetical answer (not the original query).
- Search the vector store using the hypothetical embedding.
- Return the top-K actual document chunks.
Configuration options:
| Option | Default | Description |
|---|---|---|
| num_hypothetical | 1 | Number of hypothetical documents to generate |
When to use: Abstract or conceptual queries where the question phrasing is very different from how the answer would appear in the documents. For example, "What should I do when a customer is upset?" retrieves better documents when the hypothetical answer ("When dealing with an upset customer, first acknowledge their frustration...") is used as the search vector.
Trade-off: HyDE requires one additional LLM call, adding latency and cost. The hypothetical answer may also steer retrieval in the wrong direction if the LLM generates an incorrect hypothesis.
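The HyDE flow is a one-line change to the basic pipeline: embed the generated hypothesis instead of the query. In this Python sketch, `generate_fn`, `embed_fn`, and `search_fn` are stubs standing in for the LLM, the embedding model, and the vector store; none of these names are Neam APIs:

```python
def hyde_retrieve(query, generate_fn, embed_fn, search_fn, top_k=3):
    """HyDE: embed a hypothetical answer rather than the raw query, then search."""
    hypothesis = generate_fn(
        f"Write a short passage that would answer: {query}"
    )                                    # 1. the one extra LLM call
    h_vec = embed_fn(hypothesis)         # 2. embed the hypothesis, not the query
    return search_fn(h_vec, top_k)       # 3-4. retrieve real document chunks

# Stubs that only demonstrate the data flow:
docs = hyde_retrieve(
    "What should I do when a customer is upset?",
    generate_fn=lambda prompt: "Acknowledge the customer's frustration first.",
    embed_fn=lambda text: [0.3, 0.7],
    search_fn=lambda vec, k: ["De-escalation policy, section 2"],
)
```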
Strategy 5: self_rag -- Self-Reflective RAG #
Self-RAG adds a relevance check after retrieval. The LLM evaluates whether each retrieved chunk is actually relevant to the query, filtering out false positives before generating the answer.
knowledge SelfRAGKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "self_rag"
top_k: 4
enable_relevance_check: true
enable_support_check: true
}
How it works:
- Retrieve top-K chunks (same as basic).
- For each chunk, ask the LLM: "Is this chunk relevant to the query? Rate 0-1."
- Filter chunks below the relevance threshold.
- Generate the answer using only the validated chunks.
- (Optional) Check whether the answer is supported by the retrieved chunks.
Configuration options:
| Option | Default | Description |
|---|---|---|
| enable_relevance_check | true | Check each chunk's relevance before use |
| enable_support_check | true | Verify the answer is supported by context |
When to use: High-stakes applications (medical, legal, financial) where using irrelevant context could lead to harmful or misleading answers. The relevance check acts as a guardrail on the retrieval step.
Trade-off: Adds one LLM call for the relevance check. Can discard too many chunks if the threshold is too aggressive, leaving insufficient context.
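The relevance-check step reduces to a filter over the retrieved chunks. A Python sketch with a stub grader in place of the LLM (the grader and its keyword heuristic are invented for illustration):

```python
def self_rag_filter(query, chunks, grade_fn, threshold=0.5):
    """Keep only chunks the grader judges relevant to the query (score in [0, 1])."""
    kept = []
    for chunk in chunks:
        score = grade_fn(query, chunk)  # in Self-RAG this is an LLM call per chunk
        if score >= threshold:
            kept.append(chunk)
    return kept

# Stub grader standing in for the LLM relevance check:
def fake_grader(query, chunk):
    return 0.9 if "agent" in chunk.lower() else 0.1

chunks = ["Agents have providers.", "Unrelated legal boilerplate."]
print(self_rag_filter("How do agents work?", chunks, fake_grader))
# ['Agents have providers.']
```

Raising `threshold` makes the guardrail stricter; as the trade-off above notes, set it too high and the filter may discard so much context that the model has nothing to ground on.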
Strategy 6: crag -- Corrective RAG #
CRAG (Corrective Retrieval Augmented Generation) adds query decomposition and iterative correction. If the initial retrieval does not produce confident results, CRAG decomposes the query into sub-queries and tries again.
knowledge CRAGKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "crag"
top_k: 3
enable_query_decomposition: true
max_corrections: 2
}
How it works:
- Retrieve top-K chunks for the original query.
- Evaluate retrieval confidence.
- If confidence is low:
  a. Decompose the query into sub-queries.
  b. Retrieve chunks for each sub-query.
  c. Merge and re-rank results.
- Repeat up to max_corrections times.
- Generate the answer using the refined context.
Configuration options:
| Option | Default | Description |
|---|---|---|
| enable_query_decomposition | true | Break complex queries into sub-queries |
| max_corrections | 2 | Maximum correction rounds |
| enable_web_fallback | false | Fall back to web search if local retrieval fails |
When to use: Complex, multi-part questions that cannot be answered by a single retrieval pass. For example: "Compare the performance characteristics of basic and agentic RAG strategies, and explain when to use each one."
Strategy 7: agentic -- Tool-Based Planning with Reflection #
Agentic RAG treats retrieval as an iterative research process. The LLM plans what information it needs, retrieves it, reflects on whether it has enough context, and repeats until satisfied.
knowledge AgenticKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./research/*.md" } ]
retrieval_strategy: "agentic"
top_k: 5
max_iterations: 3
enable_reflection: true
}
How it works:
- The LLM analyzes the query and generates a retrieval plan.
- Execute the first retrieval based on the plan.
- The LLM reflects: "Do I have enough context to answer? What is missing?"
- If more context is needed, refine the search query and retrieve again.
- Repeat up to max_iterations times.
- Generate the final answer using all accumulated context.
Configuration options:
| Option | Default | Description |
|---|---|---|
| max_iterations | 5 | Maximum retrieval-reflection cycles |
| enable_reflection | true | Enable self-reflection between iterations |
When to use: Research tasks, deep-dive questions, or scenarios where a single retrieval pass is unlikely to surface all necessary information. This is the most thorough strategy but also the most expensive.
Trade-off: Multiple LLM calls per query (2 to max_iterations * 2). Best reserved
for high-value queries where accuracy justifies the cost.
Strategy 8: graph_rag -- Knowledge Graph Retrieval #
Graph RAG builds a knowledge graph from your documents, extracting entities and relationships. Retrieval traverses the graph starting from entities mentioned in the query, producing richly connected context.
knowledge GraphKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./docs/architecture.md" } ]
retrieval_strategy: "graph_rag"
top_k: 5
}
How it works:
- At index time, extract entities and relationships from each chunk using an LLM.
- Build a graph with entity nodes, document nodes, and relationship edges.
- Optionally detect communities (clusters of related entities).
- At query time, extract entities from the query.
- Traverse the graph from matched entities, collecting related nodes up to a configurable depth.
- Include community summaries for broader context.
- Generate the answer using the graph-derived context.
When to use: Documents with rich relationships between concepts -- organizational charts, technical architectures, legal contracts with cross-references, scientific papers with citation networks.
The Neam standard library provides graph construction utilities in
std.rag.advanced.graph_rag:
import std::rag::advanced::graph_rag;
// Create entities
let neam = graph_rag::node_entity("e1", "Neam", "Language", {
"version": "0.5.0",
"paradigm": "agentic"
});
let vm = graph_rag::node_entity("e2", "Virtual Machine", "Component", {
"type": "bytecode interpreter"
});
// Create relationship
let runs_on = graph_rag::edge_related_to("e1", "e2", "runs_on", 0.95);
// Build graph
let graph = graph_rag::knowledge_graph();
graph = graph_rag::add_node(graph, neam);
graph = graph_rag::add_node(graph, vm);
graph = graph_rag::add_edge(graph, runs_on);
15.5 Strategy Selection Guide #
Choosing the right retrieval strategy depends on your use case. Here is a decision framework:
Start
|
v
Is the question simple and factual?
|--- Yes --> Use "basic"
|--- No
|
v
Do your docs have many overlapping passages?
|--- Yes --> Use "mmr"
|--- No
|
v
Does the query contain specific terms/codes?
|--- Yes --> Use "hybrid"
|--- No
|
v
Is the query abstract or conceptual?
|--- Yes --> Use "hyde"
|--- No
|
v
Is accuracy critical (medical/legal)?
|--- Yes --> Use "self_rag"
|--- No
|
v
Is the query multi-part or complex?
|--- Yes --> Use "crag"
|--- No
|
v
Is this a research/deep-dive task?
|--- Yes --> Use "agentic"
|--- No
|
v
Do docs have entity relationships?
|--- Yes --> Use "graph_rag"
|--- No --> Use "basic"
15.6 Practical Walkthrough: Building a Documentation QA Bot #
Let us build a complete documentation assistant that answers questions about a project using its README and documentation files.
Step 1: Prepare the Knowledge Base #
knowledge ProjectDocs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 250
chunk_overlap: 50
sources: [
{ type: "file", path: "./README.md" },
{ type: "file", path: "./docs/AGENT_HANDOFFS_GUIDE.md" },
{ type: "text", content: "To build Neam, run: mkdir -p build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release && cmake --build . --parallel" },
{ type: "text", content: "Neam supports four LLM providers: OpenAI, Anthropic, Google Gemini, and Ollama. Set the corresponding API key environment variable before running." }
]
retrieval_strategy: "hybrid"
top_k: 4
}
Step 2: Define the Agent #
agent DocsBot {
provider: "ollama"
model: "llama3.2:3b"
temperature: 0.3
system: "You are a documentation assistant for the Neam programming language.
Answer questions using ONLY the provided context. If the context does
not contain the answer, say 'I don't have that information in the docs.'
Be concise and cite specific details from the context."
connected_knowledge: [ProjectDocs]
}
Step 3: Build the Interactive Loop #
{
emit "=== Neam Documentation Assistant ===";
emit "Ask questions about the Neam language.";
emit "Type 'quit' to exit.";
emit "";
let running = true;
while (running) {
emit "Q: ";
let question = input();
if (question == "quit") {
running = false;
} else {
let answer = DocsBot.ask(question);
emit "A: " + answer;
emit "";
}
}
emit "Goodbye!";
}
Step 4: Compile and Run #
# Prerequisites: Ollama with the required models
ollama pull llama3.2:3b
ollama pull nomic-embed-text
# Compile
./neamc docs_qa.neam -o docs_qa.neamb
# Run (from the project root so file paths resolve correctly)
./neam docs_qa.neamb
Step 5: Test with Sample Questions #
=== Neam Documentation Assistant ===
Ask questions about the Neam language.
Type 'quit' to exit.
Q: How do I build Neam?
A: To build Neam, run the following commands:
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --parallel
Q: What LLM providers does Neam support?
A: Neam supports four LLM providers: OpenAI, Anthropic, Google Gemini,
and Ollama. You need to set the corresponding API key environment
variable before running.
Q: How do handoffs work?
A: Handoffs allow agents to transfer control to other agents. You use
the handoffs property in an agent declaration, and the runner
orchestrates the handoff flow with a max_turns limit.
15.7 Comparing All Strategies Side by Side #
The following program queries the same knowledge base with all strategies and compares the results. This is an excellent way to evaluate which strategy works best for your specific dataset.
knowledge BasicKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "basic"
top_k: 3
}
knowledge MMRKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "mmr"
top_k: 3
mmr_lambda: 0.7
}
knowledge HybridKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "hybrid"
top_k: 3
}
knowledge HyDEKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "hyde"
top_k: 3
num_hypothetical: 1
}
knowledge SelfRAGKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "self_rag"
top_k: 4
enable_relevance_check: true
}
knowledge CRAGKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "crag"
top_k: 3
enable_query_decomposition: true
}
knowledge AgenticKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "agentic"
top_k: 3
max_iterations: 2
enable_reflection: true
}
// One agent per strategy
agent BasicAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [BasicKB]
}
agent MMRAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [MMRKB]
}
agent HybridAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [HybridKB]
}
agent HyDEAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [HyDEKB]
}
agent SelfRAGAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [SelfRAGKB]
}
agent CRAGAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [CRAGKB]
}
agent AgenticAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [AgenticKB]
}
{
emit "===============================================";
emit " NEAM RAG STRATEGIES - COMPREHENSIVE TEST ";
emit "===============================================";
emit "";
let q = "What is Neam?";
emit "Question: " + q;
emit "";
emit "-----------------------------------------------";
emit "1. BASIC (standard similarity):";
emit " " + BasicAgent.ask(q);
emit "";
emit "2. MMR (diversity-focused):";
emit " " + MMRAgent.ask(q);
emit "";
emit "3. HYBRID (keyword + vector):";
emit " " + HybridAgent.ask(q);
emit "";
emit "4. HYDE (hypothetical document):";
emit " " + HyDEAgent.ask(q);
emit "";
emit "5. SELF-RAG (relevance-checked):";
emit " " + SelfRAGAgent.ask(q);
emit "";
emit "6. CRAG (query decomposition):";
emit " " + CRAGAgent.ask(q);
emit "";
emit "7. AGENTIC (iterative refinement):";
emit " " + AgenticAgent.ask(q);
emit "";
emit "===============================================";
emit " ALL STRATEGIES TEST COMPLETE ";
emit "===============================================";
}
15.8 Tuning Chunk Size and Overlap #
The chunk size and overlap parameters have a significant impact on retrieval quality. Here is a practical guide for tuning them:
Experiment Setup #
knowledge SmallChunks {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 100
chunk_overlap: 20
sources: [ { type: "file", path: "./docs/README.md" } ]
retrieval_strategy: "basic"
top_k: 5
}
knowledge MediumChunks {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 250
chunk_overlap: 50
sources: [ { type: "file", path: "./docs/README.md" } ]
retrieval_strategy: "basic"
top_k: 3
}
knowledge LargeChunks {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 500
chunk_overlap: 100
sources: [ { type: "file", path: "./docs/README.md" } ]
retrieval_strategy: "basic"
top_k: 2
}
Guidelines #
| Document Type | Recommended Chunk Size | Recommended Overlap | Reasoning |
|---|---|---|---|
| FAQ entries | 100-150 | 20-30 | Each entry is self-contained |
| Technical docs | 200-300 | 50-75 | Need enough context for code examples |
| Narrative text | 400-600 | 100-150 | Preserve paragraph coherence |
| Legal documents | 300-500 | 75-125 | Clauses often reference each other |
| API references | 150-250 | 30-50 | Each endpoint description is independent |
When in doubt, start with chunk_size: 200 and chunk_overlap: 50. These
values work well for most technical documentation. Adjust based on the quality of
retrieved results.
15.9 Advanced Pattern: RAG with Multiple Sources and Strategies #
A common production pattern uses different knowledge bases with different strategies for different types of content:
knowledge FAQs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 150
chunk_overlap: 30
sources: [
{ type: "file", path: "./docs/faq.md" }
]
retrieval_strategy: "hybrid"
top_k: 3
}
knowledge TechDocs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 300
chunk_overlap: 75
sources: [
{ type: "file", path: "./docs/architecture.md" },
{ type: "file", path: "./docs/api-reference.md" }
]
retrieval_strategy: "mmr"
top_k: 4
mmr_lambda: 0.6
}
knowledge Research {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 400
chunk_overlap: 100
sources: [
{ type: "file", path: "./papers/survey.md" }
]
retrieval_strategy: "agentic"
top_k: 5
max_iterations: 3
enable_reflection: true
}
agent ResearchAssistant {
provider: "openai"
model: "gpt-4o"
system: "You are a research assistant. Use the provided context to give
thorough, well-sourced answers. Cite specific sections when possible."
connected_knowledge: [FAQs, TechDocs, Research]
}
{
let answer = ResearchAssistant.ask(
"How does Neam's agentic RAG strategy compare to standard vector search?"
);
emit answer;
}
15.10 Performance Considerations #
Indexing Time #
Indexing happens at program startup. The time depends on:
- Number of source documents and total text size
- Chunk size (smaller chunks = more embeddings to compute)
- Embedding model speed (local Ollama inference)
For a 50-page document with chunk_size: 200, expect roughly 100-200 chunks and 2-5
seconds of indexing time on a modern machine.
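Chunk count follows directly from the sliding-window arithmetic: each chunk after the first advances the window by chunk_size - chunk_overlap characters. A quick Python estimator (the 30,000-character input is an arbitrary example, not a claim about any particular document):

```python
import math

def estimate_chunks(num_chars: int, chunk_size: int = 200, chunk_overlap: int = 50) -> int:
    """Estimate chunk count for a window advancing chunk_size - overlap chars."""
    if num_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return math.ceil((num_chars - chunk_size) / step) + 1

# e.g. a 30,000-character document with the default 200/50 settings:
print(estimate_chunks(30_000))  # 200 chunks
```

Halving the chunk size roughly doubles the number of embeddings to compute, which is why smaller chunks lengthen startup indexing.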
Query Latency by Strategy #
| Strategy | Typical Latency | LLM Calls | Notes |
|---|---|---|---|
| basic | 50-100ms | 0 extra | Vector search only |
| mmr | 50-100ms | 0 extra | Slightly more computation |
| hybrid | 50-150ms | 0 extra | Keyword + vector |
| hyde | 1-3s | 1 extra | One LLM call for hypothesis |
| self_rag | 1-3s | 1 extra | One LLM call for relevance check |
| crag | 2-8s | 1-3 extra | Decomposition + correction |
| agentic | 3-15s | 2-5+ extra | Iterative retrieval |
| graph_rag | 1-5s | 1-2 extra | Graph traversal + LLM |
Memory Usage #
The uSearch HNSW index resides in memory. For 1,000 chunks with 768-dimensional embeddings, expect approximately 3 MB of memory for the index. This scales linearly with the number of chunks.
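The 3 MB figure is straightforward to verify for the raw vectors, assuming 4-byte float32 components (the HNSW graph links add some overhead on top of this):

```python
# 1,000 chunks x 768 dimensions x 4 bytes per float32 component:
num_chunks, dims, bytes_per_float = 1_000, 768, 4
vector_bytes = num_chunks * dims * bytes_per_float
print(vector_bytes / 1_000_000)  # 3.072 -> about 3 MB before graph overhead
```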
15.11 Embedding Providers #
The examples throughout this chapter use nomic-embed-text via Ollama, but the Neam
standard library (std.ingest.embed) supports a wide range of embedding providers for
production deployments:
| Provider | Models | Dimensions | Notes |
|---|---|---|---|
| OpenAI | text-embedding-3-small, text-embedding-3-large, ada-002 | 1536-3072 | Cloud API, highest quality |
| Cohere | embed-english-v3.0, embed-multilingual-v3.0 | 1024 | Strong multilingual support |
| Voyage AI | voyage-2, voyage-large-2, voyage-code-2 | 1024-1536 | Specialized for code retrieval |
| Ollama | nomic-embed-text, mxbai-embed-large | 768-1024 | Free, local, no API key |
| Local | all-MiniLM-L6-v2 (sentence-transformers), bge-base, e5-base | 384-768 | Self-hosted models |
You can use the standard library's batch embedder for large document sets. It supports rate limiting and caching to manage API costs:
import std::ingest::embed;
let embedder = embed::create_embedder({
"provider": "openai",
"model": "text-embedding-3-small",
"batch_size": 100,
"cache_enabled": true
});
let vectors = embed::batch_embed(embedder, chunks);
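To illustrate what caching buys you, here is a hypothetical sketch of a caching batch embedder in Python: only cache misses are sent to the provider, in a single batched call. `embed_fn` and `fake_embed` are stand-ins for a real provider API, not library code:

```python
def batch_embed_cached(texts, embed_fn, cache):
    """Embed only cache misses in one batched call; serve the rest from cache."""
    misses = [t for t in texts if t not in cache]
    if misses:
        # One batched call for all misses keeps request count (and cost) low.
        for text, vector in zip(misses, embed_fn(misses)):
            cache[text] = vector
    return [cache[t] for t in texts]

calls = []
def fake_embed(batch):
    # Stand-in for a provider API; records how many texts each call embeds.
    calls.append(len(batch))
    return [[float(len(t))] for t in batch]  # toy 1-d "embeddings"

cache = {}
batch_embed_cached(["alpha", "beta"], fake_embed, cache)
batch_embed_cached(["alpha", "gamma"], fake_embed, cache)  # only "gamma" is a miss
```

Re-ingesting an unchanged corpus then costs nothing in API calls, which is exactly the scenario the cache is designed for.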
The library also provides similarity functions for comparing vectors directly:
- Cosine similarity -- Default, works well for normalized embeddings.
- Euclidean distance -- Useful when magnitude matters.
- Dot product -- Fastest computation, good for pre-normalized vectors.
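For illustration, the three measures written out in plain Python (these are the standard formulas, not the std library's implementations):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product scaled by both magnitudes; range [-1, 1].
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    # Straight-line distance; sensitive to vector magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 0.0], [1.0, 1.0]
cosine_similarity(a, b)   # ~0.7071
euclidean_distance(a, b)  # 1.0
dot(a, b)                 # 1.0
```

Note that for vectors normalized to unit length, dot product and cosine similarity give identical rankings, which is why dot product is the fast path for pre-normalized embeddings.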
For most applications, start with nomic-embed-text via Ollama during
development (free, local, no API key), then switch to text-embedding-3-small or
embed-english-v3.0 for production quality.
15.12 Vector Store Options #
The knowledge declaration uses uSearch (in-memory HNSW) by default, but the standard
library (std.ingest.store) provides connectors for production-grade vector databases:
| Store | Type | Scaling | Best For |
|---|---|---|---|
| uSearch | In-memory HNSW | Single process | Development, small corpora |
| Pinecone | Cloud managed | Serverless | Production SaaS, no ops |
| Qdrant | Self-hosted / Cloud | Horizontal | High-performance, filtering |
| Weaviate | Self-hosted / Cloud | Horizontal | Multi-modal, module ecosystem |
| Milvus | Self-hosted / Cloud | Horizontal | Massive-scale (billions) |
| ChromaDB | Embedded / Server | Single node | Prototyping, simple setup |
| PgVector | PostgreSQL extension | PostgreSQL cluster | Existing Postgres stack |
| SqliteVec | SQLite extension | Single file | Edge deployment, embedded apps |
To use a cloud vector store, configure your knowledge base with the appropriate store and connection settings. The standard library provides a unified filter expression syntax that works across all stores:
import std::ingest::store;
let qdrant = store::create_store({
"type": "qdrant",
"url": "http://localhost:6333",
"collection": "product_docs"
});
// Filter expressions work across all backends
let filter = store::filter_and([
store::filter_eq("category", "technical"),
store::filter_gte("updated_at", "2025-01-01")
]);
let results = store::search(qdrant, query_vector, 5, filter);
The knowledge declaration currently uses the built-in uSearch store
directly. To use external vector stores, work with the std.ingest.store module
in your program logic alongside the knowledge declaration.
15.13 Advanced Chunking Strategies #
The chunk_size / chunk_overlap fields in the knowledge declaration use
fixed-size character splitting. For finer control, the standard library
(std.ingest.chunk) provides nine chunking strategies:
| Strategy | Description | Best For |
|---|---|---|
| Fixed | Fixed character size with overlap | General-purpose (default) |
| Sentence | Split on sentence boundaries | Narrative text, articles |
| Paragraph | Split on paragraph boundaries | Well-structured documents |
| Recursive | Hierarchical splitting (text → paragraph → sentence → character) | Mixed content |
| Semantic | Split when embedding similarity drops | Maintaining topic coherence |
| Code | Split on function/class/block boundaries | Source code files |
| Markdown | Split on headers, preserve code blocks | Documentation, READMEs |
| Sliding window | Overlapping windows at fixed intervals | Dense information extraction |
| Hybrid | Multiple strategies with content-type matchers | Multi-format corpora |
import std::ingest::chunk;
// Semantic chunking groups text by topic similarity
let semantic_chunker = chunk::create_chunker({
"strategy": "semantic",
"similarity_threshold": 0.75,
"min_chunk_size": 100,
"max_chunk_size": 500
});
let chunks = chunk::split(semantic_chunker, document_text);
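The mechanism behind the semantic strategy can be sketched in a few lines of Python: embed each sentence and start a new chunk whenever similarity to the previous sentence drops below the threshold. The toy embedder below is a stand-in for a real embedding model, and this sketch omits the min/max size constraints the library applies:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_chunks(sentences, embed_fn, threshold=0.75):
    """Start a new chunk whenever consecutive sentence embeddings diverge."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed_fn(sentences[0])
    for sentence in sentences[1:]:
        vec = embed_fn(sentence)
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = []
        current.append(sentence)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks

# Toy embedder: a lookup table standing in for an embedding model.
vectors = {"Cats purr.": [1.0, 0.0], "Cats nap.": [0.9, 0.1], "Stocks fell.": [0.0, 1.0]}
chunks = semantic_chunks(list(vectors), vectors.get, threshold=0.75)
# The two cat sentences stay together; the finance sentence starts a new chunk.
```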
The recursive chunker is particularly effective for mixed-content documents. It tries progressively finer split points, starting with double newlines (paragraphs), then single newlines, then sentences, then characters:
let recursive_chunker = chunk::create_chunker({
"strategy": "recursive",
"separators": ["\n\n", "\n", ". ", " "],
"chunk_size": 300,
"chunk_overlap": 50
});
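The recursive strategy itself is compact enough to sketch in Python. This illustrative version (not the std library implementation) recurses into oversized pieces with progressively finer separators, then greedily re-merges adjacent pieces back up to the chunk size:

```python
def recursive_split(text, separators, chunk_size):
    """Split with the coarsest separator, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    units = []
    for p in pieces:
        if len(p) <= chunk_size:
            units.append(p)
        else:
            units.extend(recursive_split(p, rest, chunk_size))
    # Greedily merge adjacent units back together up to chunk_size.
    chunks, current = [], ""
    for u in units:
        candidate = current + sep + u if current else u
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = u
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph that is quite a bit longer. It has two sentences."
chunks = recursive_split(doc, ["\n\n", "\n", ". ", " "], 40)
```

Short paragraphs survive intact; only the oversized second paragraph gets split at sentence and then word boundaries, which is the behavior the separator ladder is designed to produce.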
15.14 Reranking for Better Results #
Initial retrieval returns an approximate set of relevant chunks. Reranking applies
a more expensive but more accurate model to re-score and re-order those results. The
standard library (std.rag.reranker) supports four reranking methods:
| Method | Approach | Quality | Latency |
|---|---|---|---|
| Cross-encoder | BERT-style pairwise scoring | Highest | High |
| ColBERT | Late-interaction token matching | High | Medium |
| Cohere | API-based neural reranking | High | Medium |
| LLM-based | Ask an LLM to rank results | Variable | High |
import std::rag::reranker;
let reranker_config = reranker::create_reranker({
"method": "cross_encoder",
"model": "cross-encoder/ms-marco-MiniLM-L-12-v2",
"top_k": 3,
"min_score": 0.5
});
// Rerank retrieved chunks before passing to the agent
let reranked = reranker::rerank(reranker_config, query, initial_chunks);
When to add reranking:
- Your initial retrieval returns many marginally relevant chunks.
- You use a high `top_k` (e.g., 10-20) and want to narrow down to the best 3-5.
- Accuracy is more important than latency (e.g., legal, medical, financial domains).
- You are using `basic` or `hybrid` retrieval and want better precision without switching to a more expensive strategy like `agentic`.
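Whatever scoring model sits underneath, the surrounding logic is the same: score each (query, chunk) pair, sort, filter by minimum score, truncate to `top_k`. A Python sketch with a stand-in token-overlap scorer (a real deployment would call a cross-encoder here):

```python
def rerank(query, chunks, score_fn, top_k=3, min_score=0.5):
    """Re-score chunks against the query; keep the best top_k above min_score."""
    scored = sorted(((score_fn(query, c), c) for c in chunks),
                    key=lambda pair: pair[0], reverse=True)
    return [(s, c) for s, c in scored if s >= min_score][:top_k]

def overlap_score(query, chunk):
    # Stand-in scorer: fraction of query tokens found in the chunk.
    tokens = lambda text: {w.strip(".,") for w in text.lower().split()}
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0

chunks = [
    "The HNSW index resides in memory.",
    "Reranking improves retrieval precision.",
    "Completely unrelated text about cooking.",
]
best = rerank("how does reranking improve precision", chunks, overlap_score,
              top_k=2, min_score=0.3)
```

The `min_score` cutoff matters as much as the ordering: it is what drops marginally relevant chunks entirely instead of merely demoting them.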
15.15 The Document Ingestion Pipeline #
For production systems with large, diverse document collections, the standard library
provides a complete ingestion pipeline (std.ingest.pipeline) that automates the
flow from raw documents to indexed vectors:

Source → Parser → Chunker → Embedder → Store

Each stage is independently configurable:
import std::ingest::pipeline;
import std::ingest::parser;
import std::ingest::chunk;
import std::ingest::embed;
import std::ingest::store;
let pipe = pipeline::create_pipeline({
"parser": parser::create_parser({ "type": "auto" }),
"chunker": chunk::create_chunker({
"strategy": "recursive",
"chunk_size": 300,
"chunk_overlap": 50
}),
"embedder": embed::create_embedder({
"provider": "openai",
"model": "text-embedding-3-small"
}),
"store": store::create_store({
"type": "qdrant",
"url": "http://localhost:6333",
"collection": "docs"
})
});
let result = pipeline::ingest(pipe, [
{ "type": "file", "path": "./docs/manual.pdf" },
{ "type": "file", "path": "./docs/api-reference.md" },
{ "type": "file", "path": "./data/faq.csv" }
]);
emit "Ingested " + str(result["chunks_created"]) + " chunks";
emit "Cost: $" + str(result["cost"]);
The pipeline supports incremental ingestion -- only new or modified documents are processed on subsequent runs. It tracks document fingerprints to detect changes.
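The fingerprinting idea can be sketched in Python. This is an illustrative model of change detection, not the pipeline's actual storage format:

```python
import hashlib

def fingerprint(content: bytes) -> str:
    # A content hash: any byte-level change produces a different fingerprint.
    return hashlib.sha256(content).hexdigest()

def select_changed(documents: dict, seen: dict) -> list:
    """Return paths whose content differs from the last recorded run.

    documents: path -> raw bytes; seen: path -> fingerprint from previous run.
    """
    changed = []
    for path, content in documents.items():
        fp = fingerprint(content)
        if seen.get(path) != fp:
            changed.append(path)
            seen[path] = fp  # record for the next run
    return changed

seen = {}
docs = {"manual.pdf": b"v1", "faq.csv": b"rows"}
first = select_changed(docs, seen)    # first run: everything is new
docs["manual.pdf"] = b"v2"            # manual.pdf edited between runs
second = select_changed(docs, seen)   # only manual.pdf needs re-ingestion
```

Only the changed document is re-parsed, re-chunked, and re-embedded, which is what keeps repeated ingestion runs cheap.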
Supported Document Parsers #
The std.ingest.parser module includes parsers for common document formats:
| Parser | Formats | Notes |
|---|---|---|
| Text | .txt, .md, .csv | Plain text extraction |
| PDF | .pdf | Text extraction with layout preservation |
| Office | .docx, .xlsx, .pptx | Microsoft Office formats |
| Code | .py, .js, .rs, .neam, etc. | Language-aware parsing |
| Image | .png, .jpg, .tiff | OCR-based text extraction |
| Audio | .mp3, .wav | Transcription-based parsing |
| HTML | .html | Content extraction, tag stripping |
Source Connectors #
Documents can be loaded from various sources beyond the local filesystem:
| Source | Description |
|---|---|
| File | Local filesystem paths |
| S3 | Amazon S3 buckets |
| GCS | Google Cloud Storage |
| Azure | Azure Blob Storage |
| HTTP | Remote URLs |
| Database | SQL query results |
Summary #
In this chapter you learned:
- What RAG is and why it reduces hallucination by grounding LLM responses in retrieved facts.
- How to declare knowledge bases in Neam using the `knowledge` keyword.
- Source types: inline `text` and `file` sources for building your document corpus.
- Embedding and chunking: how documents are split into chunks and converted to vectors using `nomic-embed-text` via Ollama.
- Eight retrieval strategies: from the simple `basic` vector search to the sophisticated `agentic` iterative retrieval and `graph_rag` knowledge graph approach.
- How to connect knowledge to agents using `connected_knowledge`.
- Practical patterns for building documentation QA bots and multi-source RAG systems.
- Tuning guidelines for chunk size, overlap, and strategy selection.
- Embedding providers beyond Ollama: OpenAI, Cohere, Voyage AI, and local sentence-transformers, with batch embedding and caching support.
- Vector store options: from in-memory uSearch for development to production stores like Pinecone, Qdrant, Weaviate, Milvus, ChromaDB, PgVector, and SqliteVec.
- Advanced chunking strategies: sentence, paragraph, recursive, semantic, code, markdown, sliding window, and hybrid chunkers for different document types.
- Reranking with cross-encoder, ColBERT, Cohere, and LLM-based methods to improve retrieval precision.
- The document ingestion pipeline: a full Source → Parser → Chunker → Embedder → Store workflow with incremental ingestion, multiple parsers, and cloud source connectors.
Exercises #
Exercise 15.1: Basic RAG Bot #
Create a knowledge base from three inline text sources about a topic you know well
(e.g., a programming language, a cooking recipe, a historical event). Connect it to an
agent using the basic strategy and test with five questions. Record which questions
are answered correctly and which are not.
Exercise 15.2: Strategy Comparison #
Take the knowledge base from Exercise 15.1 and create seven copies, each using a different retrieval strategy (basic, mmr, hybrid, hyde, self_rag, crag, agentic). Run the same five questions through all seven agents. Create a table comparing the quality of answers across strategies. Which strategy performed best for your dataset? Why?
Exercise 15.3: Chunk Size Experiment #
Using a single file source (at least 2000 characters), create three knowledge bases with different chunk sizes: 100, 250, and 500. Keep the overlap at 25% of the chunk size. Test with the same set of questions and observe how chunk size affects answer quality. Write a brief analysis of your findings.
Exercise 15.4: Multi-Source Documentation Bot #
Build a documentation QA bot that indexes at least three different files from a real
project (e.g., README, API docs, contribution guide). Use the hybrid strategy.
Include at least two inline text sources for information not covered in the files.
Demonstrate the bot answering questions that require information from different sources.
Exercise 15.5: Production RAG Architecture #
Design (on paper or in code) a RAG architecture for a customer support system with the following requirements:
- FAQ database (500 entries) -- needs fast, exact matching
- Product documentation (50 pages) -- needs broad coverage
- Recent support tickets (updated daily) -- needs freshness
For each document source, justify your choice of chunk size, overlap, and retrieval strategy. Explain how you would handle the daily update of support tickets.
Exercise 15.6: Graph RAG Exploration #
Using the std.rag.advanced.graph_rag module, build a small knowledge graph with at
least 5 entities and 8 relationships. Implement a function that, given an entity name,
returns all entities within 2 hops. Test your graph with queries that require
understanding relationships between entities.