Chapter 15: RAG and Knowledge Bases #
"The best way to reduce hallucination is to give the model something true to talk about."
Large language models are powerful, but they have a fundamental limitation: their knowledge is frozen at training time. Ask a model about your company's internal documentation, last week's policy changes, or a dataset that did not exist when the model was trained, and it will either refuse to answer or -- worse -- fabricate a confident, plausible-sounding response. This phenomenon is called hallucination, and it is one of the largest barriers to deploying LLMs in production.
Retrieval-Augmented Generation (RAG) solves this problem by injecting relevant documents into the prompt at query time. Instead of relying solely on parametric memory (the model's weights), a RAG system retrieves context from an external knowledge base and presents it alongside the user's question. The model then grounds its answer in the retrieved material.
Neam makes RAG a first-class language construct. You declare a knowledge block, connect
it to an agent, and the entire retrieval pipeline -- chunking, embedding, indexing,
querying, and context injection -- is handled by the runtime. This chapter teaches you
how to build knowledge-augmented agents from the ground up, starting with the simplest
configuration and progressing through all eight retrieval strategies.
15.1 What Is RAG? #
Retrieval-Augmented Generation was introduced by Lewis et al. (2020) as a technique that combines a retrieval component with a generative model. The core idea is straightforward:
- Index a corpus of documents into a searchable store.
- Retrieve the most relevant documents for a given query.
- Augment the LLM prompt with the retrieved documents.
- Generate an answer grounded in the retrieved context.
The key insight is that the retrieval step is non-parametric -- it does not depend on the model's training data. You can update the knowledge base at any time, and the next query will immediately reflect the new information.
Why RAG Matters #
| Problem Without RAG | How RAG Solves It |
|---|---|
| Hallucination on domain-specific questions | Grounds answers in retrieved facts |
| Knowledge frozen at training cutoff | Knowledge base can be updated in real time |
| No access to private/internal data | Index proprietary documents locally |
| Expensive fine-tuning for new domains | Swap knowledge bases without retraining |
| No source attribution | Retrieved chunks provide traceable citations |
RAG does not eliminate hallucination entirely. A model can still misinterpret retrieved context or generate unsupported inferences. However, RAG dramatically reduces the frequency and severity of hallucination compared to ungrounded generation.
15.2 Declaring a Knowledge Base #
In Neam, a knowledge base is declared with the knowledge keyword. Here is the minimal
configuration:
knowledge ProductDocs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [
{ type: "text", content: "Neam is a domain-specific language for AI agent orchestration." },
{ type: "text", content: "Neam compiles to bytecode and runs on a custom VM." }
]
retrieval_strategy: "basic"
top_k: 3
}
Let us break down each field:
vector_store #
Specifies the vector index implementation. Currently Neam supports:
| Value | Implementation | Description |
|---|---|---|
| "usearch" | uSearch HNSW | High-performance approximate nearest neighbor search |
uSearch uses the Hierarchical Navigable Small World (HNSW) algorithm, which provides sub-millisecond search latency even on large document collections. The index is built in-memory at program startup.
embedding_model #
The model used to convert text into dense vector representations. Neam uses Ollama to serve embedding models locally:
| Model | Dimensions | Context | Notes |
|---|---|---|---|
| nomic-embed-text | 768 | 8192 tokens | Default. Excellent general-purpose embeddings. |
Before using RAG, pull the embedding model:
ollama pull nomic-embed-text
The embedding model runs on your local machine via Ollama's embedding API
(http://localhost:11434/api/embeddings). No cloud API key is required for embeddings.
chunk_size and chunk_overlap #
Documents are split into chunks before embedding. These two parameters control how that splitting works:
- chunk_size: The maximum number of characters per chunk.
- chunk_overlap: The number of characters that overlap between consecutive chunks.
Choosing chunk size: Smaller chunks (100-200) produce more focused embeddings but may lose context. Larger chunks (500-1000) retain more context but may dilute the relevance signal. A chunk size of 200 with overlap of 50 is a good starting point for most use cases.
If your documents contain short, self-contained paragraphs (like FAQ entries), a smaller chunk size (100-150) works well. For narrative text or technical documentation, a larger chunk size (300-500) preserves more context.
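The Neam runtime performs this splitting automatically, but the sliding-window logic is easy to picture. The following Python sketch illustrates character-based chunking with overlap; it is an illustration of the technique, not the runtime's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap between neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # how far the window advances per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = chunk_text("a" * 500, chunk_size=200, chunk_overlap=50)
# windows start at 0, 150, 300 -> 3 chunks; adjacent chunks share 50 characters
```

Because each window advances by `chunk_size - chunk_overlap` characters, a larger overlap produces more chunks (and more embeddings to compute) for the same document.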
sources #
The list of documents to index. Neam supports two source types:
Inline Text Sources #
sources: [
{ type: "text", content: "Neam is a programming language for AI agents." },
{ type: "text", content: "Agents connect to LLM providers like OpenAI and Ollama." }
]
Use inline text for small, self-contained facts, FAQ entries, or test data.
File Sources #
sources: [
{ type: "file", path: "./docs/README.md" },
{ type: "file", path: "./data/product_catalog.txt" }
]
File sources read the file when the knowledge base is indexed at program startup and chunk its contents. The path is relative to the working directory where neam is executed.
File paths are resolved relative to the current working directory at
runtime, not relative to the .neam source file. If you run your program from a
different directory, ensure the paths still resolve correctly.
retrieval_strategy #
Specifies which retrieval algorithm to use when the agent queries the knowledge base. Neam supports eight strategies, covered in detail in Section 15.4.
top_k #
The number of document chunks to retrieve and include in the augmented prompt. The default is 4.
- Low top_k (1-3): Faster, less context. Good for simple factual questions.
- High top_k (5-10): More context, but increases prompt size and cost. Good for complex questions that require synthesizing multiple sources.
15.3 Connecting Knowledge to Agents #
A knowledge base becomes useful only when connected to an agent. Use the
connected_knowledge property:
knowledge ProductDocs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [
{ type: "text", content: "Neam is a programming language designed for AI agent orchestration. It supports agents, handoffs, guardrails, and runners. Neam compiles to bytecode and runs on a custom virtual machine." },
{ type: "text", content: "To create an agent in Neam, use the 'agent' keyword followed by a name and configuration block. Agents can have a provider (openai, ollama), model, temperature, and system prompt." },
{ type: "text", content: "Handoffs allow agents to transfer control to other agents. Use the handoffs property in an agent declaration. The runner orchestrates the handoff flow with a max_turns limit." }
]
retrieval_strategy: "basic"
top_k: 3
}
agent DocAssistant {
provider: "ollama"
model: "llama3.2:3b"
temperature: 0.3
system: "You are a documentation assistant. Answer questions using only the provided context. Be concise."
connected_knowledge: [ProductDocs]
}
{
let answer = DocAssistant.ask("How do I create an agent in Neam?");
emit "Q: How do I create an agent in Neam?";
emit "A: " + answer;
}
When the agent receives a query via .ask(), the runtime performs these steps
automatically:
- Embed the query using the knowledge base's embedding model.
- Search the vector store for the top_k most similar chunks.
- Inject the retrieved chunks into the prompt as context.
- Call the LLM with the augmented prompt.
- Return the response to the caller.
The agent's system prompt is preserved. The retrieved context is inserted between the system prompt and the user's message.
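The five runtime steps can be sketched as a single function. In this Python illustration, `embed_fn`, `search_fn`, and `llm_fn` are stand-ins for the embedding model, the vector store, and the chat model; all of the names here are illustrative, not Neam runtime APIs:

```python
def rag_ask(question, system_prompt, embed_fn, search_fn, llm_fn, top_k=3):
    """Minimal RAG flow: embed the query, retrieve chunks, inject context, call the LLM."""
    q_vec = embed_fn(question)             # 1. embed the query
    chunks = search_fn(q_vec, top_k)       # 2. top_k most similar chunks
    context = "\n\n".join(chunks)          # 3. context goes between the system
    prompt = (                             #    prompt and the user's message
        f"{system_prompt}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm_fn(prompt)                  # 4-5. call the LLM, return the answer

# Stubbed components that only demonstrate the data flow:
answer = rag_ask(
    "What is Neam?",
    "Answer from context only.",
    embed_fn=lambda q: [0.1, 0.2],
    search_fn=lambda vec, k: ["Neam is a DSL for agents."],
    llm_fn=lambda p: "Neam is a DSL for agents.",
)
```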
Connecting Multiple Knowledge Bases #
An agent can connect to multiple knowledge bases:
knowledge TechDocs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [
{ type: "file", path: "./docs/technical.md" }
]
retrieval_strategy: "basic"
top_k: 3
}
knowledge PolicyDocs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 150
chunk_overlap: 30
sources: [
{ type: "file", path: "./docs/policies.md" }
]
retrieval_strategy: "mmr"
top_k: 2
}
agent SupportAgent {
provider: "openai"
model: "gpt-4o"
system: "You are a customer support agent. Answer using the provided context."
connected_knowledge: [TechDocs, PolicyDocs]
}
When an agent is connected to multiple knowledge bases, the runtime queries each one independently and merges the retrieved chunks before injecting them into the prompt.
15.4 Retrieval Strategies #
Neam supports eight retrieval strategies. Each makes different trade-offs between accuracy, diversity, latency, and cost. The following table provides an overview:
| Strategy | LLM Calls | Latency | Accuracy | Best For |
|---|---|---|---|---|
| basic | 0 | Low | Good | Simple Q&A |
| mmr | 0 | Low | Good | Diverse docs |
| hybrid | 0 | Low | Better | Precise match |
| hyde | 1 | Medium | Better | Abstract Q |
| self_rag | 1 | Medium | High | High accuracy |
| crag | 1-3 | Medium | High | Complex Q |
| agentic | 2-5+ | High | Highest | Research |
| graph_rag | 1-2 | Medium | High | Relationships |
Strategy 1: basic -- Standard Vector Similarity #
The default and simplest strategy. The query is embedded and compared against all document chunk embeddings using cosine similarity. The top-K most similar chunks are returned.
knowledge BasicKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "basic"
top_k: 3
}
How it works:
- Embed the query: q_vec = embed("What is Neam?")
- Compute cosine similarity against every chunk embedding.
- Return the 3 highest-scoring chunks.
When to use: Simple factual questions where the answer is likely contained in a single chunk. This is the fastest strategy with zero additional LLM calls.
Configuration options:
| Option | Default | Description |
|---|---|---|
| top_k | 4 | Number of chunks to retrieve |
| relevance_threshold | 0.5 | Minimum similarity score (0.0-1.0) |
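The ranking step of the basic strategy is just cosine similarity plus a sort. A minimal Python sketch follows; it is exact (brute-force) rather than HNSW-approximate, and the function names and toy vectors are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def basic_retrieve(q_vec, chunk_vecs, chunks, top_k=3, relevance_threshold=0.5):
    """Score every chunk against the query, drop low scores, keep the top_k."""
    scored = [(cosine(q_vec, v), c) for v, c in zip(chunk_vecs, chunks)]
    scored = [(s, c) for s, c in scored if s >= relevance_threshold]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored[:top_k]]

docs = ["agents", "handoffs", "guardrails"]
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(basic_retrieve([1.0, 0.0], vecs, docs, top_k=2))  # ['agents', 'handoffs']
```

The `relevance_threshold` filter is why an orthogonal chunk ("guardrails" above, similarity 0.0) never reaches the prompt even when top_k would allow it.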
Strategy 2: mmr -- Maximal Marginal Relevance #
MMR balances relevance and diversity. After finding the most relevant chunks, it penalizes chunks that are too similar to already-selected chunks. This produces a more diverse set of results.
knowledge MMRKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "mmr"
top_k: 3
mmr_lambda: 0.7
}
How it works:
- Retrieve an initial candidate set (typically 2x top_k chunks).
- Select the first chunk (highest relevance to the query).
- For each remaining slot, choose the chunk that maximizes:
  MMR = lambda * similarity(chunk, query) - (1 - lambda) * max_similarity(chunk, selected_chunks)
- Repeat until top_k chunks are selected.
The mmr_lambda parameter:
- 1.0 = pure relevance (equivalent to the basic strategy)
- 0.0 = pure diversity (maximally different chunks)
- 0.5 = balanced (default)
- 0.7 = relevance-weighted but still diverse (recommended)
When to use: When your knowledge base contains many similar or overlapping passages and you want the retrieved context to cover different aspects of the topic.
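The greedy selection loop behind MMR can be sketched directly from the formula above. This Python illustration (helper names and toy vectors are invented for the example) shows how a near-duplicate chunk gets skipped in favor of a more diverse one:

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(q_vec, cand_vecs, cands, top_k=3, lam=0.7):
    """Greedy MMR: trade relevance to the query against redundancy with picks so far."""
    selected, selected_vecs = [], []
    remaining = list(range(len(cands)))
    while remaining and len(selected) < top_k:
        def score(i):
            relevance = _cos(q_vec, cand_vecs[i])
            redundancy = max((_cos(cand_vecs[i], v) for v in selected_vecs), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(cands[best])
        selected_vecs.append(cand_vecs[best])
    return selected

docs = ["chunk A", "chunk A'", "chunk B"]              # A and A' are near-duplicates
vecs = [[1.0, 0.0], [0.99, 0.01], [0.6, 0.8]]
print(mmr_select([0.9, 0.3], vecs, docs, top_k=2))     # ["chunk A'", "chunk B"]
```

With lam = 0.7, "chunk A" loses its second-round score to the redundancy penalty (it is almost identical to the already-selected "chunk A'"), so the more distinct "chunk B" is chosen instead.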
Strategy 3: hybrid -- Keyword + Vector Search #
Combines traditional keyword matching (BM25-style) with vector similarity search. This catches cases where semantically relevant documents use different vocabulary than the query.
knowledge HybridKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "hybrid"
top_k: 3
}
How it works:
- Run vector similarity search (same as basic).
- Run keyword/token matching against chunk text.
- Combine scores using reciprocal rank fusion.
- Return the top-K chunks from the fused ranking.
When to use: When queries contain specific technical terms, product names, error codes, or identifiers that should be matched exactly. Vector search might miss "ERR-4021" if no similar text exists in training data, but keyword search catches it.
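Reciprocal rank fusion, the score-combination step above, depends only on each document's rank in each ranking, not on the raw scores. A small Python sketch (the constant k = 60 is the commonly used default; document names are invented):

```python
def rrf_fuse(vector_ranking, keyword_ranking, k=60, top_k=3):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_k]

vector_hits = ["doc_semantics", "doc_overview", "doc_errors"]
keyword_hits = ["doc_errors", "doc_semantics", "doc_changelog"]
print(rrf_fuse(vector_hits, keyword_hits))
# ['doc_semantics', 'doc_errors', 'doc_overview']
```

Documents that appear in both rankings ("doc_semantics", "doc_errors") accumulate two reciprocal-rank terms and rise above documents that score well in only one ranking.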
Strategy 4: hyde -- Hypothetical Document Embeddings #
HyDE generates a hypothetical answer to the query, embeds that answer, and uses it for retrieval instead of the raw query. The intuition is that a hypothetical answer is closer in embedding space to the actual answer than the question itself.
knowledge HyDEKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./docs.md" } ]
retrieval_strategy: "hyde"
top_k: 3
num_hypothetical: 1
}
How it works:
- Send the query to the LLM: "Write a short passage that would answer: [query]"
- Embed the hypothetical answer (not the original query).
- Search the vector store using the hypothetical embedding.
- Return the top-K actual document chunks.
Configuration options:
| Option | Default | Description |
|---|---|---|
| num_hypothetical | 1 | Number of hypothetical documents to generate |
When to use: Abstract or conceptual queries where the question phrasing is very different from how the answer would appear in the documents. For example, "What should I do when a customer is upset?" retrieves better documents when the hypothetical answer ("When dealing with an upset customer, first acknowledge their frustration...") is used as the search vector.
Trade-off: HyDE requires one additional LLM call, adding latency and cost. The hypothetical answer may also steer retrieval in the wrong direction if the LLM generates an incorrect hypothesis.
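The HyDE flow is a one-line change to the basic pipeline: embed the generated hypothesis instead of the query. In this Python sketch, `generate_fn`, `embed_fn`, and `search_fn` are stubs standing in for the LLM, the embedding model, and the vector store; none of these names are Neam APIs:

```python
def hyde_retrieve(query, generate_fn, embed_fn, search_fn, top_k=3):
    """HyDE: embed a hypothetical answer rather than the raw query, then search."""
    hypothesis = generate_fn(
        f"Write a short passage that would answer: {query}"
    )                                    # 1. the one extra LLM call
    h_vec = embed_fn(hypothesis)         # 2. embed the hypothesis, not the query
    return search_fn(h_vec, top_k)       # 3-4. retrieve real document chunks

# Stubs that only demonstrate the data flow:
docs = hyde_retrieve(
    "What should I do when a customer is upset?",
    generate_fn=lambda prompt: "Acknowledge the customer's frustration first.",
    embed_fn=lambda text: [0.3, 0.7],
    search_fn=lambda vec, k: ["De-escalation policy, section 2"],
)
```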
Strategy 5: self_rag -- Self-Reflective RAG #
Self-RAG adds a relevance check after retrieval. The LLM evaluates whether each retrieved chunk is actually relevant to the query, filtering out false positives before generating the answer.
knowledge SelfRAGKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "self_rag"
top_k: 4
enable_relevance_check: true
enable_support_check: true
}
How it works:
- Retrieve top-K chunks (same as basic).
- For each chunk, ask the LLM: "Is this chunk relevant to the query? Rate 0-1."
- Filter chunks below the relevance threshold.
- Generate the answer using only the validated chunks.
- (Optional) Check whether the answer is supported by the retrieved chunks.
Configuration options:
| Option | Default | Description |
|---|---|---|
| enable_relevance_check | true | Check each chunk's relevance before use |
| enable_support_check | true | Verify the answer is supported by context |
When to use: High-stakes applications (medical, legal, financial) where using irrelevant context could lead to harmful or misleading answers. The relevance check acts as a guardrail on the retrieval step.
Trade-off: Adds one LLM call for the relevance check. Can discard too many chunks if the threshold is too aggressive, leaving insufficient context.
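The relevance-check step reduces to a filter over the retrieved chunks. A Python sketch with a stub grader in place of the LLM (the grader and its keyword heuristic are invented for illustration):

```python
def self_rag_filter(query, chunks, grade_fn, threshold=0.5):
    """Keep only chunks the grader judges relevant to the query (score in [0, 1])."""
    kept = []
    for chunk in chunks:
        score = grade_fn(query, chunk)  # in Self-RAG this is an LLM call per chunk
        if score >= threshold:
            kept.append(chunk)
    return kept

# Stub grader standing in for the LLM relevance check:
def fake_grader(query, chunk):
    return 0.9 if "agent" in chunk.lower() else 0.1

chunks = ["Agents have providers.", "Unrelated legal boilerplate."]
print(self_rag_filter("How do agents work?", chunks, fake_grader))
# ['Agents have providers.']
```

Raising `threshold` makes the guardrail stricter; as the trade-off above notes, set it too high and the filter may discard so much context that the model has nothing to ground on.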
Strategy 6: crag -- Corrective RAG #
CRAG (Corrective Retrieval Augmented Generation) adds query decomposition and iterative correction. If the initial retrieval does not produce confident results, CRAG decomposes the query into sub-queries and tries again.
knowledge CRAGKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "crag"
top_k: 3
enable_query_decomposition: true
max_corrections: 2
}
How it works:
- Retrieve top-K chunks for the original query.
- Evaluate retrieval confidence.
- If confidence is low:
  a. Decompose the query into sub-queries.
  b. Retrieve chunks for each sub-query.
  c. Merge and re-rank results.
- Repeat up to max_corrections times.
- Generate the answer using the refined context.
Configuration options:
| Option | Default | Description |
|---|---|---|
| enable_query_decomposition | true | Break complex queries into sub-queries |
| max_corrections | 2 | Maximum correction rounds |
| enable_web_fallback | false | Fall back to web search if local retrieval fails |
When to use: Complex, multi-part questions that cannot be answered by a single retrieval pass. For example: "Compare the performance characteristics of basic and agentic RAG strategies, and explain when to use each one."
Strategy 7: agentic -- Tool-Based Planning with Reflection #
Agentic RAG treats retrieval as an iterative research process. The LLM plans what information it needs, retrieves it, reflects on whether it has enough context, and repeats until satisfied.
knowledge AgenticKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./research/*.md" } ]
retrieval_strategy: "agentic"
top_k: 5
max_iterations: 3
enable_reflection: true
}
How it works:
- The LLM analyzes the query and generates a retrieval plan.
- Execute the first retrieval based on the plan.
- The LLM reflects: "Do I have enough context to answer? What is missing?"
- If more context is needed, refine the search query and retrieve again.
- Repeat up to max_iterations times.
- Generate the final answer using all accumulated context.
Configuration options:
| Option | Default | Description |
|---|---|---|
| max_iterations | 5 | Maximum retrieval-reflection cycles |
| enable_reflection | true | Enable self-reflection between iterations |
When to use: Research tasks, deep-dive questions, or scenarios where a single retrieval pass is unlikely to surface all necessary information. This is the most thorough strategy but also the most expensive.
Trade-off: Multiple LLM calls per query (2 to max_iterations * 2). Best reserved
for high-value queries where accuracy justifies the cost.
Strategy 8: graph_rag -- Knowledge Graph Retrieval #
Graph RAG builds a knowledge graph from your documents, extracting entities and relationships. Retrieval traverses the graph starting from entities mentioned in the query, producing richly connected context.
knowledge GraphKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./docs/architecture.md" } ]
retrieval_strategy: "graph_rag"
top_k: 5
}
How it works:
- At index time, extract entities and relationships from each chunk using an LLM.
- Build a graph with entity nodes, document nodes, and relationship edges.
- Optionally detect communities (clusters of related entities).
- At query time, extract entities from the query.
- Traverse the graph from matched entities, collecting related nodes up to a configurable depth.
- Include community summaries for broader context.
- Generate the answer using the graph-derived context.
When to use: Documents with rich relationships between concepts -- organizational charts, technical architectures, legal contracts with cross-references, scientific papers with citation networks.
The Neam standard library provides graph construction utilities in
std.rag.advanced.graph_rag:
import std::rag::advanced::graph_rag;
// Create entities
let neam = graph_rag::node_entity("e1", "Neam", "Language", {
"version": "0.5.0",
"paradigm": "agentic"
});
let vm = graph_rag::node_entity("e2", "Virtual Machine", "Component", {
"type": "bytecode interpreter"
});
// Create relationship
let runs_on = graph_rag::edge_related_to("e1", "e2", "runs_on", 0.95);
// Build graph
let graph = graph_rag::knowledge_graph();
graph = graph_rag::add_node(graph, neam);
graph = graph_rag::add_node(graph, vm);
graph = graph_rag::add_edge(graph, runs_on);
15.5 Strategy Selection Guide #
Choosing the right retrieval strategy depends on your use case. Here is a decision framework:
Start
|
v
Is the question simple and factual?
|--- Yes --> Use "basic"
|--- No
|
v
Do your docs have many overlapping passages?
|--- Yes --> Use "mmr"
|--- No
|
v
Does the query contain specific terms/codes?
|--- Yes --> Use "hybrid"
|--- No
|
v
Is the query abstract or conceptual?
|--- Yes --> Use "hyde"
|--- No
|
v
Is accuracy critical (medical/legal)?
|--- Yes --> Use "self_rag"
|--- No
|
v
Is the query multi-part or complex?
|--- Yes --> Use "crag"
|--- No
|
v
Is this a research/deep-dive task?
|--- Yes --> Use "agentic"
|--- No
|
v
Do docs have entity relationships?
|--- Yes --> Use "graph_rag"
|--- No --> Use "basic"
15.6 Practical Walkthrough: Building a Documentation QA Bot #
Let us build a complete documentation assistant that answers questions about a project using its README and documentation files.
Step 1: Prepare the Knowledge Base #
knowledge ProjectDocs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 250
chunk_overlap: 50
sources: [
{ type: "file", path: "./README.md" },
{ type: "file", path: "./docs/AGENT_HANDOFFS_GUIDE.md" },
{ type: "text", content: "To build Neam, run: mkdir -p build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release && cmake --build . --parallel" },
{ type: "text", content: "Neam supports four LLM providers: OpenAI, Anthropic, Google Gemini, and Ollama. Set the corresponding API key environment variable before running." }
]
retrieval_strategy: "hybrid"
top_k: 4
}
Step 2: Define the Agent #
agent DocsBot {
provider: "ollama"
model: "llama3.2:3b"
temperature: 0.3
system: "You are a documentation assistant for the Neam programming language.
Answer questions using ONLY the provided context. If the context does
not contain the answer, say 'I don't have that information in the docs.'
Be concise and cite specific details from the context."
connected_knowledge: [ProjectDocs]
}
Step 3: Build the Interactive Loop #
{
emit "=== Neam Documentation Assistant ===";
emit "Ask questions about the Neam language.";
emit "Type 'quit' to exit.";
emit "";
let running = true;
while (running) {
emit "Q: ";
let question = input();
if (question == "quit") {
running = false;
} else {
let answer = DocsBot.ask(question);
emit "A: " + answer;
emit "";
}
}
emit "Goodbye!";
}
Step 4: Compile and Run #
# Prerequisites: Ollama with the required models
ollama pull llama3.2:3b
ollama pull nomic-embed-text
# Compile
./neamc docs_qa.neam -o docs_qa.neamb
# Run (from the project root so file paths resolve correctly)
./neam docs_qa.neamb
Step 5: Test with Sample Questions #
=== Neam Documentation Assistant ===
Ask questions about the Neam language.
Type 'quit' to exit.
Q: How do I build Neam?
A: To build Neam, run the following commands:
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --parallel
Q: What LLM providers does Neam support?
A: Neam supports four LLM providers: OpenAI, Anthropic, Google Gemini,
and Ollama. You need to set the corresponding API key environment
variable before running.
Q: How do handoffs work?
A: Handoffs allow agents to transfer control to other agents. You use
the handoffs property in an agent declaration, and the runner
orchestrates the handoff flow with a max_turns limit.
15.7 Comparing All Strategies Side by Side #
The following program queries the same knowledge base with all strategies and compares the results. This is an excellent way to evaluate which strategy works best for your specific dataset.
knowledge BasicKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "basic"
top_k: 3
}
knowledge MMRKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "mmr"
top_k: 3
mmr_lambda: 0.7
}
knowledge HybridKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "hybrid"
top_k: 3
}
knowledge HyDEKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "hyde"
top_k: 3
num_hypothetical: 1
}
knowledge SelfRAGKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "self_rag"
top_k: 4
enable_relevance_check: true
}
knowledge CRAGKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "crag"
top_k: 3
enable_query_decomposition: true
}
knowledge AgenticKB {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 200
chunk_overlap: 50
sources: [ { type: "file", path: "./readme.md" } ]
retrieval_strategy: "agentic"
top_k: 3
max_iterations: 2
enable_reflection: true
}
// One agent per strategy
agent BasicAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [BasicKB]
}
agent MMRAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [MMRKB]
}
agent HybridAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [HybridKB]
}
agent HyDEAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [HyDEKB]
}
agent SelfRAGAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [SelfRAGKB]
}
agent CRAGAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [CRAGKB]
}
agent AgenticAgent {
provider: "ollama"
model: "qwen3:1.7b"
system: "Answer in exactly one sentence."
connected_knowledge: [AgenticKB]
}
{
emit "===============================================";
emit " NEAM RAG STRATEGIES - COMPREHENSIVE TEST ";
emit "===============================================";
emit "";
let q = "What is Neam?";
emit "Question: " + q;
emit "";
emit "-----------------------------------------------";
emit "1. BASIC (standard similarity):";
emit " " + BasicAgent.ask(q);
emit "";
emit "2. MMR (diversity-focused):";
emit " " + MMRAgent.ask(q);
emit "";
emit "3. HYBRID (keyword + vector):";
emit " " + HybridAgent.ask(q);
emit "";
emit "4. HYDE (hypothetical document):";
emit " " + HyDEAgent.ask(q);
emit "";
emit "5. SELF-RAG (relevance-checked):";
emit " " + SelfRAGAgent.ask(q);
emit "";
emit "6. CRAG (query decomposition):";
emit " " + CRAGAgent.ask(q);
emit "";
emit "7. AGENTIC (iterative refinement):";
emit " " + AgenticAgent.ask(q);
emit "";
emit "===============================================";
emit " ALL STRATEGIES TEST COMPLETE ";
emit "===============================================";
}
15.8 Tuning Chunk Size and Overlap #
The chunk size and overlap parameters have a significant impact on retrieval quality. Here is a practical guide for tuning them:
Experiment Setup #
knowledge SmallChunks {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 100
chunk_overlap: 20
sources: [ { type: "file", path: "./docs/README.md" } ]
retrieval_strategy: "basic"
top_k: 5
}
knowledge MediumChunks {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 250
chunk_overlap: 50
sources: [ { type: "file", path: "./docs/README.md" } ]
retrieval_strategy: "basic"
top_k: 3
}
knowledge LargeChunks {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 500
chunk_overlap: 100
sources: [ { type: "file", path: "./docs/README.md" } ]
retrieval_strategy: "basic"
top_k: 2
}
Guidelines #
| Document Type | Recommended Chunk Size | Recommended Overlap | Reasoning |
|---|---|---|---|
| FAQ entries | 100-150 | 20-30 | Each entry is self-contained |
| Technical docs | 200-300 | 50-75 | Need enough context for code examples |
| Narrative text | 400-600 | 100-150 | Preserve paragraph coherence |
| Legal documents | 300-500 | 75-125 | Clauses often reference each other |
| API references | 150-250 | 30-50 | Each endpoint description is independent |
When in doubt, start with chunk_size: 200 and chunk_overlap: 50. These
values work well for most technical documentation. Adjust based on the quality of
retrieved results.
15.9 Advanced Pattern: RAG with Multiple Sources and Strategies #
A common production pattern uses different knowledge bases with different strategies for different types of content:
knowledge FAQs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 150
chunk_overlap: 30
sources: [
{ type: "file", path: "./docs/faq.md" }
]
retrieval_strategy: "hybrid"
top_k: 3
}
knowledge TechDocs {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 300
chunk_overlap: 75
sources: [
{ type: "file", path: "./docs/architecture.md" },
{ type: "file", path: "./docs/api-reference.md" }
]
retrieval_strategy: "mmr"
top_k: 4
mmr_lambda: 0.6
}
knowledge Research {
vector_store: "usearch"
embedding_model: "nomic-embed-text"
chunk_size: 400
chunk_overlap: 100
sources: [
{ type: "file", path: "./papers/survey.md" }
]
retrieval_strategy: "agentic"
top_k: 5
max_iterations: 3
enable_reflection: true
}
agent ResearchAssistant {
provider: "openai"
model: "gpt-4o"
system: "You are a research assistant. Use the provided context to give
thorough, well-sourced answers. Cite specific sections when possible."
connected_knowledge: [FAQs, TechDocs, Research]
}
{
let answer = ResearchAssistant.ask(
"How does Neam's agentic RAG strategy compare to standard vector search?"
);
emit answer;
}
15.10 Performance Considerations #
Indexing Time #
Indexing happens at program startup. The time depends on:
- Number of source documents and total text size
- Chunk size (smaller chunks = more embeddings to compute)
- Embedding model speed (local Ollama inference)
For a 50-page document with chunk_size: 200, expect roughly 100-200 chunks and 2-5
seconds of indexing time on a modern machine.
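Chunk count follows directly from the sliding-window arithmetic: each chunk after the first advances the window by chunk_size - chunk_overlap characters. A quick Python estimator (the 30,000-character input is an arbitrary example, not a claim about any particular document):

```python
import math

def estimate_chunks(num_chars: int, chunk_size: int = 200, chunk_overlap: int = 50) -> int:
    """Estimate chunk count for a window advancing chunk_size - overlap chars."""
    if num_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return math.ceil((num_chars - chunk_size) / step) + 1

# e.g. a 30,000-character document with the default 200/50 settings:
print(estimate_chunks(30_000))  # 200 chunks
```

Halving the chunk size roughly doubles the number of embeddings to compute, which is why smaller chunks lengthen startup indexing.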
Query Latency by Strategy #
| Strategy | Typical Latency | LLM Calls | Notes |
|---|---|---|---|
| basic | 50-100ms | 0 extra | Vector search only |
| mmr | 50-100ms | 0 extra | Slightly more computation |
| hybrid | 50-150ms | 0 extra | Keyword + vector |
| hyde | 1-3s | 1 extra | One LLM call for hypothesis |
| self_rag | 1-3s | 1 extra | One LLM call for relevance check |
| crag | 2-8s | 1-3 extra | Decomposition + correction |
| agentic | 3-15s | 2-5+ extra | Iterative retrieval |
| graph_rag | 1-5s | 1-2 extra | Graph traversal + LLM |
Memory Usage #
The uSearch HNSW index resides in memory. For 1,000 chunks with 768-dimensional embeddings, expect approximately 3 MB of memory for the index. This scales linearly with the number of chunks.
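The 3 MB figure is straightforward to verify for the raw vectors, assuming 4-byte float32 components (the HNSW graph links add some overhead on top of this):

```python
# 1,000 chunks x 768 dimensions x 4 bytes per float32 component:
num_chunks, dims, bytes_per_float = 1_000, 768, 4
vector_bytes = num_chunks * dims * bytes_per_float
print(vector_bytes / 1_000_000)  # 3.072 -> about 3 MB before graph overhead
```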
15.11 Embedding Providers #
The examples throughout this chapter use nomic-embed-text via Ollama, but the Neam
standard library (std.ingest.embed) supports a wide range of embedding providers for
production deployments:
| Provider | Models | Dimensions | Notes |
|---|---|---|---|
| OpenAI | text-embedding-3-small, text-embedding-3-large, ada-002 | 1536-3072 | Cloud API, highest quality |
| Cohere | embed-english-v3.0, embed-multilingual-v3.0 | 1024 | Strong multilingual support |
| Voyage AI | voyage-2, voyage-large-2, voyage-code-2 | 1024-1536 | Specialized for code retrieval |
| Ollama | nomic-embed-text, mxbai-embed-large | 768-1024 | Free, local, no API key |
| Local | all-MiniLM-L6-v2 (sentence-transformers), bge-base, e5-base | 384-768 | Self-hosted models |
You can use the standard library's batch embedder for large document sets. It supports rate limiting and caching to manage API costs:
import std::ingest::embed;
let embedder = embed::create_embedder({
"provider": "openai",
"model": "text-embedding-3-small",
"batch_size": 100,
"cache_enabled": true
});
let vectors = embed::batch_embed(embedder, chunks);
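To illustrate what caching buys you, here is a hypothetical sketch of a caching batch embedder in Python: only cache misses are sent to the provider, in a single batched call. `embed_fn` and `fake_embed` are stand-ins for a real provider API, not library code:

```python
def batch_embed_cached(texts, embed_fn, cache):
    """Embed only cache misses in one batched call; serve the rest from cache."""
    misses = [t for t in texts if t not in cache]
    if misses:
        # One batched call for all misses keeps request count (and cost) low.
        for text, vector in zip(misses, embed_fn(misses)):
            cache[text] = vector
    return [cache[t] for t in texts]

calls = []
def fake_embed(batch):
    # Stand-in for a provider API; records how many texts each call embeds.
    calls.append(len(batch))
    return [[float(len(t))] for t in batch]  # toy 1-d "embeddings"

cache = {}
batch_embed_cached(["alpha", "beta"], fake_embed, cache)
batch_embed_cached(["alpha", "gamma"], fake_embed, cache)  # only "gamma" is a miss
```

Re-ingesting an unchanged corpus then costs nothing in API calls, which is exactly the scenario the cache is designed for.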
The library also provides similarity functions for comparing vectors directly:
- Cosine similarity -- Default, works well for normalized embeddings.
- Euclidean distance -- Useful when magnitude matters.
- Dot product -- Fastest computation, good for pre-normalized vectors.
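For illustration, the three measures written out in plain Python (these are the standard formulas, not the std library's implementations):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product scaled by both magnitudes; range [-1, 1].
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    # Straight-line distance; sensitive to vector magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 0.0], [1.0, 1.0]
cosine_similarity(a, b)   # ~0.7071
euclidean_distance(a, b)  # 1.0
dot(a, b)                 # 1.0
```

Note that for vectors normalized to unit length, dot product and cosine similarity give identical rankings, which is why dot product is the fast path for pre-normalized embeddings.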
For most applications, start with nomic-embed-text via Ollama during
development (free, local, no API key), then switch to text-embedding-3-small or
embed-english-v3.0 for production quality.
15.12 Vector Store Options #
The knowledge declaration uses uSearch (in-memory HNSW) by default, but the standard
library (std.ingest.store) provides connectors for production-grade vector databases:
| Store | Type | Scaling | Best For |
|---|---|---|---|
| uSearch | In-memory HNSW | Single process | Development, small corpora |
| Pinecone | Cloud managed | Serverless | Production SaaS, no ops |
| Qdrant | Self-hosted / Cloud | Horizontal | High-performance, filtering |
| Weaviate | Self-hosted / Cloud | Horizontal | Multi-modal, module ecosystem |
| Milvus | Self-hosted / Cloud | Horizontal | Massive-scale (billions) |
| ChromaDB | Embedded / Server | Single node | Prototyping, simple setup |
| PgVector | PostgreSQL extension | PostgreSQL cluster | Existing Postgres stack |
| SqliteVec | SQLite extension | Single file | Edge deployment, embedded apps |
To use a cloud vector store, configure your knowledge base with the appropriate store and connection settings. The standard library provides a unified filter expression syntax that works across all stores:
import std::ingest::store;
let qdrant = store::create_store({
"type": "qdrant",
"url": "http://localhost:6333",
"collection": "product_docs"
});
// Filter expressions work across all backends
let filter = store::filter_and([
store::filter_eq("category", "technical"),
store::filter_gte("updated_at", "2025-01-01")
]);
let results = store::search(qdrant, query_vector, 5, filter);
The knowledge declaration currently uses the built-in uSearch store
directly. To use external vector stores, work with the std.ingest.store module
in your program logic alongside the knowledge declaration.
15.13 Advanced Chunking Strategies #
The chunk_size / chunk_overlap fields in the knowledge declaration use
fixed-size character splitting. For finer control, the standard library
(std.ingest.chunk) provides nine chunking strategies:
| Strategy | Description | Best For |
|---|---|---|
| Fixed | Fixed character size with overlap | General-purpose (default) |
| Sentence | Split on sentence boundaries | Narrative text, articles |
| Paragraph | Split on paragraph boundaries | Well-structured documents |
| Recursive | Hierarchical splitting (text → paragraph → sentence → character) | Mixed content |
| Semantic | Split when embedding similarity drops | Maintaining topic coherence |
| Code | Split on function/class/block boundaries | Source code files |
| Markdown | Split on headers, preserve code blocks | Documentation, READMEs |
| Sliding window | Overlapping windows at fixed intervals | Dense information extraction |
| Hybrid | Multiple strategies with content-type matchers | Multi-format corpora |
import std::ingest::chunk;
// Semantic chunking groups text by topic similarity
let semantic_chunker = chunk::create_chunker({
"strategy": "semantic",
"similarity_threshold": 0.75,
"min_chunk_size": 100,
"max_chunk_size": 500
});
let chunks = chunk::split(semantic_chunker, document_text);
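The mechanism behind the semantic strategy can be sketched in a few lines of Python: embed each sentence and start a new chunk whenever similarity to the previous sentence drops below the threshold. The toy embedder below is a stand-in for a real embedding model, and this sketch omits the min/max size constraints the library applies:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_chunks(sentences, embed_fn, threshold=0.75):
    """Start a new chunk whenever consecutive sentence embeddings diverge."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed_fn(sentences[0])
    for sentence in sentences[1:]:
        vec = embed_fn(sentence)
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = []
        current.append(sentence)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks

# Toy embedder: a lookup table standing in for an embedding model.
vectors = {"Cats purr.": [1.0, 0.0], "Cats nap.": [0.9, 0.1], "Stocks fell.": [0.0, 1.0]}
chunks = semantic_chunks(list(vectors), vectors.get, threshold=0.75)
# The two cat sentences stay together; the finance sentence starts a new chunk.
```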
The recursive chunker is particularly effective for mixed-content documents. It tries progressively finer split points, starting with double newlines (paragraphs), then single newlines, then sentences, then characters:
let recursive_chunker = chunk::create_chunker({
"strategy": "recursive",
"separators": ["\n\n", "\n", ". ", " "],
"chunk_size": 300,
"chunk_overlap": 50
});
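The recursive strategy itself is compact enough to sketch in Python. This illustrative version (not the std library implementation) recurses into oversized pieces with progressively finer separators, then greedily re-merges adjacent pieces back up to the chunk size:

```python
def recursive_split(text, separators, chunk_size):
    """Split with the coarsest separator, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    units = []
    for p in pieces:
        if len(p) <= chunk_size:
            units.append(p)
        else:
            units.extend(recursive_split(p, rest, chunk_size))
    # Greedily merge adjacent units back together up to chunk_size.
    chunks, current = [], ""
    for u in units:
        candidate = current + sep + u if current else u
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = u
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph that is quite a bit longer. It has two sentences."
chunks = recursive_split(doc, ["\n\n", "\n", ". ", " "], 40)
```

Short paragraphs survive intact; only the oversized second paragraph gets split at sentence and then word boundaries, which is the behavior the separator ladder is designed to produce.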
15.14 Reranking for Better Results #
Initial retrieval returns an approximate set of relevant chunks. Reranking applies
a more expensive but more accurate model to re-score and re-order those results. The
standard library (std.rag.reranker) supports four reranking methods:
| Method | Approach | Quality | Latency |
|---|---|---|---|
| Cross-encoder | BERT-style pairwise scoring | Highest | High |
| ColBERT | Late-interaction token matching | High | Medium |
| Cohere | API-based neural reranking | High | Medium |
| LLM-based | Ask an LLM to rank results | Variable | High |
import std::rag::reranker;
let reranker_config = reranker::create_reranker({
"method": "cross_encoder",
"model": "cross-encoder/ms-marco-MiniLM-L-12-v2",
"top_k": 3,
"min_score": 0.5
});
// Rerank retrieved chunks before passing to the agent
let reranked = reranker::rerank(reranker_config, query, initial_chunks);
When to add reranking:
- Your initial retrieval returns many marginally relevant chunks.
- You use a high `top_k` (e.g., 10-20) and want to narrow down to the best 3-5.
- Accuracy is more important than latency (e.g., legal, medical, financial domains).
- You are using `basic` or `hybrid` retrieval and want better precision without switching to a more expensive strategy like `agentic`.
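Whatever scoring model sits underneath, the surrounding logic is the same: score each (query, chunk) pair, sort, filter by minimum score, truncate to `top_k`. A Python sketch with a stand-in token-overlap scorer (a real deployment would call a cross-encoder here):

```python
def rerank(query, chunks, score_fn, top_k=3, min_score=0.5):
    """Re-score chunks against the query; keep the best top_k above min_score."""
    scored = sorted(((score_fn(query, c), c) for c in chunks),
                    key=lambda pair: pair[0], reverse=True)
    return [(s, c) for s, c in scored if s >= min_score][:top_k]

def overlap_score(query, chunk):
    # Stand-in scorer: fraction of query tokens found in the chunk.
    tokens = lambda text: {w.strip(".,") for w in text.lower().split()}
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0

chunks = [
    "The HNSW index resides in memory.",
    "Reranking improves retrieval precision.",
    "Completely unrelated text about cooking.",
]
best = rerank("how does reranking improve precision", chunks, overlap_score,
              top_k=2, min_score=0.3)
```

The `min_score` cutoff matters as much as the ordering: it is what drops marginally relevant chunks entirely instead of merely demoting them.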
15.15 The Document Ingestion Pipeline #
For production systems with large, diverse document collections, the standard library
provides a complete ingestion pipeline (std.ingest.pipeline) that automates the
flow from raw documents to indexed vectors:

Source → Parser → Chunker → Embedder → Store

Each stage is independently configurable:
import std::ingest::pipeline;
import std::ingest::parser;
import std::ingest::chunk;
import std::ingest::embed;
import std::ingest::store;
let pipe = pipeline::create_pipeline({
"parser": parser::create_parser({ "type": "auto" }),
"chunker": chunk::create_chunker({
"strategy": "recursive",
"chunk_size": 300,
"chunk_overlap": 50
}),
"embedder": embed::create_embedder({
"provider": "openai",
"model": "text-embedding-3-small"
}),
"store": store::create_store({
"type": "qdrant",
"url": "http://localhost:6333",
"collection": "docs"
})
});
let result = pipeline::ingest(pipe, [
{ "type": "file", "path": "./docs/manual.pdf" },
{ "type": "file", "path": "./docs/api-reference.md" },
{ "type": "file", "path": "./data/faq.csv" }
]);
emit "Ingested " + str(result["chunks_created"]) + " chunks";
emit "Cost: $" + str(result["cost"]);
The pipeline supports incremental ingestion -- only new or modified documents are processed on subsequent runs. It tracks document fingerprints to detect changes.
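The fingerprinting idea can be sketched in Python. This is an illustrative model of change detection, not the pipeline's actual storage format:

```python
import hashlib

def fingerprint(content: bytes) -> str:
    # A content hash: any byte-level change produces a different fingerprint.
    return hashlib.sha256(content).hexdigest()

def select_changed(documents: dict, seen: dict) -> list:
    """Return paths whose content differs from the last recorded run.

    documents: path -> raw bytes; seen: path -> fingerprint from previous run.
    """
    changed = []
    for path, content in documents.items():
        fp = fingerprint(content)
        if seen.get(path) != fp:
            changed.append(path)
            seen[path] = fp  # record for the next run
    return changed

seen = {}
docs = {"manual.pdf": b"v1", "faq.csv": b"rows"}
first = select_changed(docs, seen)    # first run: everything is new
docs["manual.pdf"] = b"v2"            # manual.pdf edited between runs
second = select_changed(docs, seen)   # only manual.pdf needs re-ingestion
```

Only the changed document is re-parsed, re-chunked, and re-embedded, which is what keeps repeated ingestion runs cheap.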
Supported Document Parsers #
The std.ingest.parser module includes parsers for common document formats:
| Parser | Formats | Notes |
|---|---|---|
| Text | .txt, .md, .csv | Plain text extraction |
| PDF | .pdf | Text extraction with layout preservation |
| Office | .docx, .xlsx, .pptx | Microsoft Office formats |
| Code | .py, .js, .rs, .neam, etc. | Language-aware parsing |
| Image | .png, .jpg, .tiff | OCR-based text extraction |
| Audio | .mp3, .wav | Transcription-based parsing |
| HTML | .html | Content extraction, tag stripping |
Source Connectors #
Documents can be loaded from various sources beyond the local filesystem:
| Source | Description |
|---|---|
| File | Local filesystem paths |
| S3 | Amazon S3 buckets |
| GCS | Google Cloud Storage |
| Azure | Azure Blob Storage |
| HTTP | Remote URLs |
| Database | SQL query results |
Summary #
In this chapter you learned:
- What RAG is and why it reduces hallucination by grounding LLM responses in retrieved facts.
- How to declare knowledge bases in Neam using the `knowledge` keyword.
- Source types: inline `text` and `file` sources for building your document corpus.
- Embedding and chunking: how documents are split into chunks and converted to vectors using `nomic-embed-text` via Ollama.
- Eight retrieval strategies: from the simple `basic` vector search to the sophisticated `agentic` iterative retrieval and `graph_rag` knowledge graph approach.
- How to connect knowledge to agents using `connected_knowledge`.
- Practical patterns for building documentation QA bots and multi-source RAG systems.
- Tuning guidelines for chunk size, overlap, and strategy selection.
- Embedding providers beyond Ollama: OpenAI, Cohere, Voyage AI, and local sentence-transformers, with batch embedding and caching support.
- Vector store options: from in-memory uSearch for development to production stores like Pinecone, Qdrant, Weaviate, Milvus, ChromaDB, PgVector, and SqliteVec.
- Advanced chunking strategies: sentence, paragraph, recursive, semantic, code, markdown, sliding window, and hybrid chunkers for different document types.
- Reranking with cross-encoder, ColBERT, Cohere, and LLM-based methods to improve retrieval precision.
- The document ingestion pipeline: a full Source → Parser → Chunker → Embedder → Store workflow with incremental ingestion, multiple parsers, and cloud source connectors.
Exercises #
Exercise 15.1: Basic RAG Bot #
Create a knowledge base from three inline text sources about a topic you know well
(e.g., a programming language, a cooking recipe, a historical event). Connect it to an
agent using the basic strategy and test with five questions. Record which questions
are answered correctly and which are not.
Exercise 15.2: Strategy Comparison #
Take the knowledge base from Exercise 15.1 and create seven copies, each using a different retrieval strategy (basic, mmr, hybrid, hyde, self_rag, crag, agentic). Run the same five questions through all seven agents. Create a table comparing the quality of answers across strategies. Which strategy performed best for your dataset? Why?
Exercise 15.3: Chunk Size Experiment #
Using a single file source (at least 2000 characters), create three knowledge bases with different chunk sizes: 100, 250, and 500. Keep the overlap at 25% of the chunk size. Test with the same set of questions and observe how chunk size affects answer quality. Write a brief analysis of your findings.
Exercise 15.4: Multi-Source Documentation Bot #
Build a documentation QA bot that indexes at least three different files from a real
project (e.g., README, API docs, contribution guide). Use the hybrid strategy.
Include at least two inline text sources for information not covered in the files.
Demonstrate the bot answering questions that require information from different sources.
Exercise 15.5: Production RAG Architecture #
Design (on paper or in code) a RAG architecture for a customer support system with the following requirements:
- FAQ database (500 entries) -- needs fast, exact matching
- Product documentation (50 pages) -- needs broad coverage
- Recent support tickets (updated daily) -- needs freshness
For each document source, justify your choice of chunk size, overlap, and retrieval strategy. Explain how you would handle the daily update of support tickets.
Exercise 15.6: Graph RAG Exploration #
Using the std.rag.advanced.graph_rag module, build a small knowledge graph with at
least 5 entities and 8 relationships. Implement a function that, given an entity name,
returns all entities within 2 hops. Test your graph with queries that require
understanding relationships between entities.