
Retrieval-Augmented Thinking

How RAG principles apply to human knowledge work.

Last year I built a documentation chatbot for an internal knowledge base. The knowledge base had 12,000 pages of product docs, API references, runbooks, and tribal knowledge scattered across Confluence, Notion, and Google Docs. The chatbot’s job was simple: answer employee questions using this knowledge base as ground truth.

The first version was terrible. Not because the model was bad. The model was fine. The retrieval was terrible. The chatbot would retrieve the wrong documents, miss relevant context, or retrieve so much context that the model couldn’t figure out what mattered. A question about “how to reset a user’s password” would retrieve the API authentication docs, the password policy doc, the user management runbook, and seventeen other marginally related pages, then synthesize a confident-sounding answer that mixed up the admin reset flow with the user self-service flow.

Fixing the retrieval took three months of engineering. Not model engineering. Retrieval engineering. Chunking strategies, embedding model selection, hybrid search, reranking, metadata filtering, context window management. The model stayed the same. The retrieval got dramatically better. And the chatbot went from “annoying” to “indispensable.”

That experience taught me something about RAG that I think most teams learn the hard way: RAG is a retrieval problem that happens to involve generation. Get the retrieval right and even a mediocre model produces good answers. Get the retrieval wrong and even the best model produces garbage.

When to RAG, when to fine-tune, when to just use a big context window

Before diving into RAG architecture, let me address the first question every team asks: do we even need RAG? Maybe we should fine-tune. Maybe the context window is big enough now to just stuff everything in.

This decision has gotten more nuanced in 2026 as context windows have grown to 200K+ tokens and fine-tuning has gotten easier. Here’s my framework.

| Approach | Best for | Worst for | Cost profile |
| --- | --- | --- | --- |
| RAG | Dynamic data that changes frequently. Factual accuracy matters. Need to cite sources. Large knowledge bases (>100K tokens). | Tasks requiring deep behavioral changes. Style/tone consistency. | Low upfront, moderate ongoing (embedding + retrieval infra) |
| Fine-tuning | Consistent style/tone/format. Domain-specific terminology. Behavioral patterns ("always respond in JSON"). | Frequently changing data. Source attribution. Large knowledge bases. | High upfront (training), low ongoing (inference similar to base model) |
| Long context | Small knowledge bases (<100K tokens). One-off analysis of documents. Prototyping before committing to RAG. | Large knowledge bases (cost scales linearly with context). Real-time, high-volume applications. | Scales linearly with context length; expensive at volume |

The 2026 pattern that’s emerging in production: RAG plus selective fine-tuning. RAG handles the “what is true now” question (retrieving current, accurate information). Fine-tuning handles the “how to respond” question (consistent format, domain terminology, behavioral patterns). Very few production systems rely on pure fine-tuning without retrieval for fact-heavy tasks. The maintenance burden of retraining when facts change is too high.

And long context, while tempting, has a cost problem at scale. Stuffing 200K tokens into every query works fine for prototyping. It does not work when you’re handling 10,000 queries per day and paying per token. RAG lets you send only the relevant 2K-5K tokens per query, which is 40-100x cheaper.

The decision tree I use:

Is the data static and small (<50K tokens)?
  → Long context. Don't bother with RAG.

Does the data change frequently?
  → RAG. Fine-tuning can't keep up with daily changes.

Do you need source attribution?
  → RAG. Fine-tuned models can't point to where they got the information.

Do you need consistent style/format?
  → Fine-tuning (or very detailed prompting).

Is the knowledge base large (>100K tokens)?
  → RAG. Long context is too expensive at volume.

Do you need all of the above?
  → RAG + fine-tuning hybrid.
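The decision tree above can be sketched as a small helper function. The thresholds and return labels simply mirror the text; treat it as a checklist encoding, not a tool:

```python
def choose_approach(kb_tokens: int, changes_often: bool,
                    needs_citations: bool, needs_style: bool) -> str:
    """Encode the decision tree: RAG for dynamic/large/cited data,
    fine-tuning for style, long context for small static corpora."""
    needs_rag = changes_often or needs_citations or kb_tokens > 100_000
    if needs_rag and needs_style:
        return "RAG + fine-tuning hybrid"
    if needs_rag:
        return "RAG"
    if needs_style:
        return "fine-tuning (or detailed prompting)"
    if kb_tokens < 50_000:
        return "long context"
    return "RAG"
```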

The anatomy of a RAG system

A production RAG system has more moving parts than most people expect. Here’s the full architecture.

User Query
    │
    ▼
┌─────────────┐
│ Query        │  Expand, rephrase, decompose the query
│ Processing   │  for better retrieval
└─────┬───────┘
      │
      ▼
┌─────────────┐
│ Retrieval    │  BM25 (keyword) + Dense (semantic) search
│ (Hybrid)     │  against the vector store + search index
└─────┬───────┘
      │
      ▼
┌─────────────┐
│ Reranking    │  Cross-encoder reranks top-K results
│              │  by relevance to the original query
└─────┬───────┘
      │
      ▼
┌─────────────┐
│ Context      │  Select, order, and format the top
│ Assembly     │  chunks for the model's context window
└─────┬───────┘
      │
      ▼
┌─────────────┐
│ Generation   │  Model synthesizes answer from
│              │  retrieved context + system prompt
└─────┬───────┘
      │
      ▼
┌─────────────┐
│ Grounding    │  Verify answer is supported by
│ Check        │  retrieved context, add citations
└─────┬───────┘
      │
      ▼
Response with citations

Each of these stages has its own failure modes, its own optimization strategies, and its own evaluation metrics. Let me walk through the ones that matter most.

Chunking: where most RAG systems go wrong

Chunking is how you break your documents into pieces that can be embedded and retrieved independently. It sounds simple. It’s the stage where the most value is lost.

The problem: if your chunks are too small, they lose context. A paragraph about password reset procedures doesn’t mention that it applies to admin users only, because that context was in the previous paragraph. If your chunks are too big, retrieval precision drops. A 2000-token chunk about user management will be retrieved for any user-related query, even if only one sentence in the chunk is relevant.

Chunking strategies compared

| Strategy | How it works | Pros | Cons |
| --- | --- | --- | --- |
| Fixed-size | Split every N tokens with M-token overlap | Simple, predictable | Cuts mid-sentence, breaks semantic units |
| Recursive | Split by paragraphs, then sentences, then tokens if still too large | Respects document structure | Chunk sizes vary widely |
| Semantic | Split at topic boundaries detected by embedding similarity | Chunks are semantically coherent | Slower, requires tuning similarity threshold |
| Document-aware | Use document structure (headers, sections) as split points | Preserves logical sections | Only works for well-structured docs |
| Sliding window | Overlapping windows of fixed size | Good recall (same content in multiple chunks) | Increases storage and retrieval cost |

The research consensus in 2026: recursive chunking with 10-20% overlap consistently outperforms other approaches. NVIDIA’s research specifically found that recursive token-based chunking offers strong performance with minimal resource overhead.

But the strategy matters less than the implementation details.

The chunking details that actually matter

Chunk size sweet spot: 256-512 tokens for most use cases. Smaller chunks have better retrieval precision (the chunk is likely to be about one thing). Larger chunks have better context (the chunk has enough information to be useful). 256-512 is the empirically validated sweet spot for most text retrieval tasks.

Overlap prevents context loss. A 10-20% overlap means that if a key piece of context spans a chunk boundary, it appears in both chunks. Without overlap, you’ll have chunks that are individually incomplete and confusing to the model.
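A minimal sketch of fixed-size chunking with overlap. A production pipeline would count model tokens via a real tokenizer; this version splits on whitespace words purely to show the windowing logic:

```python
def chunk_with_overlap(text: str, size: int = 512, overlap: int = 64):
    """Fixed-size chunking with overlap. Each chunk repeats the last
    `overlap` units of the previous chunk so boundary-spanning context
    appears in both. Uses words as a stand-in for tokens."""
    words = text.split()
    step = size - overlap  # assumes size > overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already covers the end of the text
    return chunks
```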

Metadata enrichment is critical. Attach metadata to every chunk: source document title, section heading, page number, last updated date, document type. This metadata enables filtering before retrieval (only search product docs, not internal memos) and helps the model contextualize the chunk.

{
  "chunk_id": "doc_4521_chunk_12",
  "content": "To reset a user's password as an admin, navigate to...",
  "metadata": {
    "source": "Admin Runbook v3.2",
    "section": "User Management > Password Reset",
    "doc_type": "runbook",
    "last_updated": "2026-01-15",
    "audience": "admin",
    "product_area": "auth"
  }
}

Parent-child chunking for retrieval precision + generation context. This is the technique that most improved my documentation chatbot. Index small chunks (256 tokens) for retrieval. When a small chunk is retrieved, include its parent chunk (1024 tokens) in the context sent to the model. This gives you the precision of small chunks for retrieval and the context of large chunks for generation.

Document Section (2048 tokens)
├── Parent Chunk A (1024 tokens)
│   ├── Child Chunk A1 (256 tokens)  ← Retrieved by query
│   └── Child Chunk A2 (256 tokens)  ← Included as sibling context
└── Parent Chunk B (1024 tokens)
    ├── Child Chunk B1 (256 tokens)
    └── Child Chunk B2 (256 tokens)

Query matches Child Chunk A1
→ Include full Parent Chunk A in context
→ Model sees 1024 tokens of coherent context, not just 256
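The retrieval-time half of parent-child chunking is a simple lookup. A sketch, assuming you keep a child-to-parent index alongside the vector store:

```python
def expand_to_parents(retrieved_child_ids, child_to_parent, parent_text):
    """Parent-child expansion: small chunks are retrieved, but their
    parent chunks are what gets sent to the model. Dedupes parents
    (siblings share one) while preserving retrieval order."""
    seen, context = set(), []
    for child_id in retrieved_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parent_text[parent_id])
    return context
```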

Embedding models: the foundation layer

The embedding model converts text into vectors that capture semantic meaning. The quality of your embeddings determines the upper bound of your retrieval quality. No amount of reranking or prompt engineering can fix bad embeddings.

The current landscape

The embedding model space has consolidated around a few categories:

| Category | Examples | Dimensionality | Best for |
| --- | --- | --- | --- |
| General-purpose | OpenAI text-embedding-3-large, Cohere embed-v3 | 1024-3072 | Most RAG applications |
| Instruction-tuned | E5-large-instruct, GTE-large | 768-1024 | Tasks where query and document styles differ |
| Domain-specific | PubMedBERT, Legal-BERT | 768 | Specialized vocabularies |
| Multilingual | multilingual-e5-large, Cohere multilingual | 768-1024 | Non-English or mixed-language corpora |
| Lightweight | MiniLM, all-MiniLM-L6-v2 | 384 | Cost-sensitive, high-volume applications |

The single most impactful finding from chunking research: the choice of embedding model alone can swing retrieval quality by more than 35 percentage points. That’s comparable to the impact of chunking strategy. In other words, picking the right embedding model matters as much as picking the right chunking approach.

My recommendation for most teams: Start with OpenAI’s text-embedding-3-large or Cohere’s embed-v3 for general-purpose RAG. If you need to reduce costs, MiniLM variants are surprisingly competitive at 1/10th the dimensionality. If your domain has specialized vocabulary (medical, legal, scientific), test a domain-specific model, but verify with your actual data that it outperforms general-purpose models. Domain-specific models sometimes underperform on retrieval tasks because they weren’t optimized for retrieval.

Embedding gotchas

Asymmetric embedding matters. Your queries and your documents are different types of text. A query is a question: “how do I reset a password?” A document chunk is a statement: “To reset a password, navigate to Settings > Security.” Some embedding models are trained for symmetric similarity (comparing similar texts) and others for asymmetric similarity (comparing queries to documents). Use asymmetric models for RAG.

Instruction-tuned models handle this well. For E5-instruct models, you prepend "query: " to queries and "passage: " to documents. This tells the model to embed them in a compatible but appropriately different way.

Embedding drift is real. If you change your embedding model (even updating to a new version of the same model), you need to re-embed your entire corpus. Old embeddings and new embeddings are not compatible. Plan for periodic re-embedding in your infrastructure.

Dimensionality trade-offs. Higher-dimensional embeddings capture more nuance but cost more to store and search. For most applications, 768-1024 dimensions is sufficient. Going to 3072 dimensions provides marginal improvement for most text retrieval tasks while significantly increasing storage and compute costs.

Hybrid search: the retrieval architecture that works

Pure semantic search (dense retrieval) has a well-known weakness: it misses exact keyword matches. A user searching for “error code E4521” needs exact string matching, not semantic similarity. Dense retrieval might return chunks about error handling in general rather than the specific error code.

Pure keyword search (BM25) has the opposite weakness: it misses semantic similarity. A search for “how to restart the service” won’t find a document that says “to reboot the application” because the keywords don’t match.

Hybrid search combines both approaches. The typical architecture:

Query
├── BM25 search → top 50 keyword matches
├── Dense search → top 50 semantic matches
│
Merge (Reciprocal Rank Fusion)
│
Top 100 unique results
│
Reranker (cross-encoder)
│
Top 5-10 final results → sent to model

Reciprocal Rank Fusion (RRF) is the standard merge algorithm. For each result, compute 1/(k + rank) where k is a constant (typically 60). Sum scores across both retrieval methods. Sort by combined score. Results that rank highly in both keyword and semantic search get the highest scores.
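The RRF merge described above fits in a few lines. A sketch, assuming each retrieval method returns an ordered list of document IDs:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Merge ranked result lists with RRF: each document scores
    sum of 1/(k + rank) across the lists it appears in, rank starting
    at 1. Documents ranking high in several lists rise to the top."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Here a document ranked first in one list only (score 1/61) loses to one ranked first in one list and second in another (1/61 + 1/62), which is exactly the "highly ranked in both" behavior you want.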

BM25 captures exact keyword matches. Dense retrieval finds semantic similarities. The combination catches cases that either method alone would miss. In my documentation chatbot, hybrid search improved retrieval recall by 23% compared to dense-only search, with the biggest gains on queries containing specific identifiers (error codes, feature names, config parameters).

If your corpus is uniformly semantic (creative writing, general Q&A) and rarely contains specific identifiers or technical terms, dense-only search is fine. Hybrid adds complexity, and if the keyword component never helps, it’s wasted engineering.

If your corpus is uniformly keyword-oriented (log files, structured data), BM25-only search might be sufficient. Dense search adds value primarily when users express the same concept with different words.

For most knowledge bases that mix technical and natural language content, hybrid is the right default.

Reranking: the precision layer

Retrieval (whether keyword, dense, or hybrid) casts a wide net. It returns the top 50-100 results that are broadly relevant. Reranking narrows the results to the 5-10 that are most precisely relevant to the specific query.

The architecture is simple: a cross-encoder model takes (query, document) pairs and scores their relevance. Unlike embedding models (which encode query and document independently), cross-encoders process them together, which allows much more precise relevance judgments.

| Approach | How it works | Speed | Precision |
| --- | --- | --- | --- |
| Bi-encoder (embedding) | Encodes query and doc separately, compares vectors | Fast (ms) | Good |
| Cross-encoder (reranker) | Encodes query + doc together, produces relevance score | Slow (100ms+) | Excellent |
| LLM reranker | Uses an LLM to judge relevance | Very slow (seconds) | Very good but expensive |

The standard pipeline: use fast bi-encoder retrieval to get 50-100 candidates, then use a cross-encoder to rerank them and select the top 5-10.

How many candidates to rerank? Research suggests 50 documents is optimal for latency-sensitive applications. 100-200 for thoroughness-first applications. The sweet spot for most production systems is 50-75.

Which reranker? Cohere’s rerank API is the easiest to integrate and performs well across domains. For self-hosted options, cross-encoder models from the sentence-transformers library (ms-marco-MiniLM-L-6-v2 for speed, bge-reranker-large for quality) are the standard choices. If you need maximum quality and can tolerate the latency, an LLM-based reranker (ask the model “how relevant is this passage to this query?”) outperforms cross-encoders, but at 10-50x the cost.
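The rerank stage itself is scorer-agnostic. A sketch with the scoring function injected as a parameter; in practice `score_fn` might wrap a sentence-transformers CrossEncoder's `predict`, but any `(query, passage) -> float` callable works:

```python
def rerank(query, candidates, score_fn, top_k: int = 5):
    """Rerank retrieval candidates with a cross-encoder-style scorer.
    score_fn(query, passage) -> float is an injected assumption here,
    standing in for a real cross-encoder or LLM judge."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```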

Context assembly: the forgotten stage

You’ve retrieved and reranked your chunks. Now you need to assemble them into a context that the model can use effectively. This stage is where many RAG systems leave significant quality on the table.

Ordering matters

The order in which you present chunks to the model affects how well it uses them. Research has consistently shown that LLMs pay more attention to information at the beginning and end of the context, with less attention to the middle (the “lost in the middle” effect).

My approach: put the most relevant chunk first, the second most relevant chunk last, and less relevant chunks in the middle. This ensures the model pays maximum attention to the best context.
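This ordering heuristic is trivial to implement once chunks arrive sorted by reranker score:

```python
def order_for_attention(chunks_by_relevance):
    """Mitigate 'lost in the middle': best chunk first, second-best
    last, everything else in between. Input is already sorted by
    descending relevance."""
    if len(chunks_by_relevance) < 3:
        return list(chunks_by_relevance)
    first, second, *rest = chunks_by_relevance
    return [first, *rest, second]
```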

Deduplication

If you’re using overlapping chunks or parent-child chunking, you’ll have duplicate or near-duplicate content in your retrieved results. Sending duplicate content to the model wastes context window space and can confuse it (the model may overweight a point that appears multiple times).

Deduplicate before context assembly. Simple approach: if two chunks have >80% text overlap, keep only the higher-ranked one.
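A sketch of that rule using stdlib `difflib` for the overlap check. SequenceMatcher is quadratic in chunk length, so it is fine for reranked top-K lists; shingling or MinHash scales better for large candidate sets:

```python
from difflib import SequenceMatcher

def dedupe_chunks(ranked_chunks, threshold: float = 0.8):
    """Drop near-duplicates from a relevance-ranked chunk list: keep a
    chunk only if it is less than `threshold` similar to every
    higher-ranked chunk already kept."""
    kept = []
    for chunk in ranked_chunks:
        if all(SequenceMatcher(None, chunk, k).ratio() < threshold
               for k in kept):
            kept.append(chunk)
    return kept
```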

Context window budget

You have a finite context window. The system prompt takes some. The conversation history takes some. The user’s query takes some. What’s left is your budget for retrieved context.

For most applications, I allocate:

System prompt:        500-1000 tokens
Conversation history: 1000-3000 tokens (sliding window)
Retrieved context:    3000-8000 tokens
User query:           100-500 tokens
Generation budget:    1000-3000 tokens (for the response)
─────────────────────────────────────
Total:                ~16,000 tokens (fits in any modern model)

Stuffing the maximum possible context into the window is counterproductive. The model performs better with 5 highly relevant chunks than with 20 marginally relevant chunks. Quality over quantity.

Source formatting

How you format the retrieved context in the prompt affects the model’s ability to cite sources and stay grounded.

## Retrieved Context

[Source 1: Admin Runbook v3.2, Section: Password Reset, Updated: 2026-01-15]
To reset a user's password as an admin, navigate to Settings > Security >
User Management. Select the user and click "Reset Password"...

[Source 2: Security Policy v2.1, Section: Password Requirements, Updated: 2025-11-20]
All passwords must be at least 12 characters and include uppercase,
lowercase, number, and special character...

[Source 3: API Reference, Endpoint: POST /api/users/{id}/reset-password]
Resets the password for the specified user. Requires admin role.
Body: { "send_email": boolean, "temporary": boolean }...

Each source is clearly delineated with metadata. The model can reference sources by number. The metadata (section titles, dates) gives the model additional context for prioritization (newer docs might be more authoritative).
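Assembling that layout is mechanical once chunks carry the metadata shown earlier. A sketch, assuming chunks use the `content`/`metadata` shape from the chunking section:

```python
def format_sources(chunks):
    """Render retrieved chunks in the numbered, metadata-tagged layout
    above, so the model can cite sources as '[Source N]'."""
    lines = ["## Retrieved Context", ""]
    for i, chunk in enumerate(chunks, start=1):
        meta = chunk["metadata"]
        lines.append(f"[Source {i}: {meta['source']}, "
                     f"Section: {meta['section']}, "
                     f"Updated: {meta['last_updated']}]")
        lines.append(chunk["content"])
        lines.append("")  # blank line between sources
    return "\n".join(lines).rstrip()
```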

Evaluating RAG quality

RAG evaluation requires measuring both retrieval quality and generation quality. Measuring only generation quality (is the final answer good?) misses the root cause when things go wrong. Was the answer bad because the right information wasn’t retrieved, or because the model misused the retrieved information?

Retrieval metrics

| Metric | What it measures | How to compute |
| --- | --- | --- |
| Hit Rate | Is the correct chunk in the top-K results? | Binary: 1 if correct chunk in top-K, 0 otherwise |
| MRR (Mean Reciprocal Rank) | How high does the correct chunk rank? | 1/rank of first correct chunk, averaged |
| NDCG | Are highly relevant chunks ranked above less relevant ones? | Weighted by relevance grade and position |
| Recall@K | What fraction of relevant chunks are in the top-K? | (Relevant chunks retrieved) / (Total relevant chunks) |
| Precision@K | What fraction of top-K chunks are relevant? | (Relevant chunks in top-K) / K |

To compute these, you need a ground-truth dataset: questions paired with the chunks that contain the answer. Building this dataset is manual work, but it’s essential. Without it, you can’t measure retrieval quality and you can’t tell if a change to your chunking, embedding, or search pipeline made things better or worse.

Start with 50-100 question-chunk pairs. Have domain experts annotate them. This is your retrieval eval suite. Expand it over time with production failures.
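With that dataset in hand, the two simplest metrics take a few lines. A sketch, assuming each eval case pairs the retriever's ranked IDs with a single known-relevant chunk ID:

```python
def hit_rate_and_mrr(eval_set, k: int = 10):
    """Compute Hit Rate@k and MRR over (retrieved_ids, relevant_id)
    pairs, per the definitions in the metrics table: hit rate is the
    fraction of queries with the correct chunk in the top-k; MRR
    averages 1/rank of that chunk (0 when it is missing)."""
    hits, rr_sum = 0, 0.0
    for retrieved_ids, relevant_id in eval_set:
        top_k = retrieved_ids[:k]
        if relevant_id in top_k:
            hits += 1
            rr_sum += 1.0 / (top_k.index(relevant_id) + 1)
    n = len(eval_set)
    return hits / n, rr_sum / n
```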

Generation metrics

| Metric | What it measures | How to evaluate |
| --- | --- | --- |
| Faithfulness | Is the answer supported by retrieved context? | LLM-as-judge: check each claim against sources |
| Relevance | Does the answer address the user’s question? | LLM-as-judge + human eval |
| Completeness | Does the answer use all relevant retrieved info? | Coverage check: which sources were referenced? |
| Hallucination rate | Does the answer contain information not in the sources? | Entity extraction: compare answer entities to source entities |
| Citation accuracy | Are the citations correct? | Verify each citation points to supporting text |

The RAGAS framework provides automated computation of many of these metrics. It evaluates context relevance (are retrieved docs relevant?), faithfulness (is the answer grounded in context?), and answer relevance (does the answer address the query?). It’s a good starting point, though I supplement it with domain-specific checks.

The “answer from training” test

The most important RAG-specific evaluation: does the model answer from retrieved context or from its training data?

Create test cases where your knowledge base contains information that contradicts the model’s general knowledge. For example, if your internal docs say “the default session timeout is 30 minutes” but common industry practice is 15 minutes, the model should report 30 minutes (from your docs), not 15 minutes (from its training).

If the model answers from training rather than from the retrieved context, your RAG system isn’t working. The model is ignoring the context and generating from memory. This is the most common and hardest-to-detect RAG failure mode.

Advanced RAG patterns

Beyond the basic retrieve-then-generate pipeline, several advanced patterns have proven valuable in production.

Query decomposition

Complex questions often need information from multiple, unrelated chunks. “Compare our password reset process for admin and self-service users” requires retrieving chunks about admin password reset AND chunks about self-service password reset. A single query might only match one set.

Query decomposition breaks the original query into sub-queries, retrieves for each, and combines the results:

Original: "Compare admin vs self-service password reset"

Sub-query 1: "admin password reset process"
→ Retrieves: admin runbook chunks

Sub-query 2: "self-service user password reset"
→ Retrieves: user-facing help doc chunks

Combined context: both sets of chunks
→ Model can now compare both processes

This is simple to implement (use an LLM to decompose the query, then run retrieval for each sub-query) and dramatically improves answer quality for complex questions.
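A sketch of the wiring, with the LLM decomposition step and the retrieval pass injected as callables (both are assumptions standing in for real API calls):

```python
def decomposed_retrieve(query, decompose_fn, retrieve_fn):
    """Decompose-then-retrieve: decompose_fn(query) is assumed to be
    an LLM call returning sub-queries; retrieve_fn(sub_query) runs one
    retrieval pass. Results are merged with order-preserving dedupe."""
    seen, combined = set(), []
    for sub_query in decompose_fn(query):
        for chunk_id in retrieve_fn(sub_query):
            if chunk_id not in seen:
                seen.add(chunk_id)
                combined.append(chunk_id)
    return combined
```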

Hypothetical document embedding (HyDE)

Instead of embedding the query directly, generate a hypothetical answer to the query, then embed that hypothetical answer and use it for retrieval. The intuition: a hypothetical answer looks more like the actual document chunks than a question does, so it produces better embedding matches.

Query: "How do I reset a password?"

HyDE step: Generate hypothetical answer:
"To reset a password, navigate to the user settings page
and click the security tab. From there, select the password
reset option and follow the prompts to create a new password."

Embed the hypothetical answer (not the query)
→ Retrieve chunks similar to this hypothetical answer
→ Pass retrieved chunks + original query to model

HyDE works well for questions where the query style is very different from the document style. It adds one LLM call (generating the hypothetical answer), so it increases latency and cost. Use it when retrieval precision is poor and query-document style mismatch is the suspected cause.

Contextual retrieval

Anthropic published a technique called contextual retrieval that adds document-level context to each chunk before embedding. Instead of embedding a raw chunk, you prepend a brief context that explains where the chunk fits in the broader document:

# Before contextual retrieval
Chunk: "Click 'Reset Password' and enter a new password."

# After contextual retrieval
Chunk: "This section describes the admin password reset process
from the Admin Runbook v3.2. Click 'Reset Password' and enter
a new password."

The added context helps the embedding model distinguish between chunks that contain similar text but relate to different topics. In my testing, this reduced out-of-context retrieval errors by about 15%.
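The indexing-time transform is just string prepending. In Anthropic's technique an LLM writes the situating context from the full document; a metadata template, as sketched here, is a cheap approximation that already captures some of the benefit:

```python
def contextualize_chunk(chunk_text: str, doc_title: str,
                        section: str) -> str:
    """Prepend a short situating sentence before embedding, in the
    spirit of contextual retrieval. Template-based rather than
    LLM-generated, which is an assumption of this sketch."""
    prefix = f"This excerpt is from '{doc_title}', section '{section}'. "
    return prefix + chunk_text
```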

Agentic RAG

Instead of a fixed retrieve-then-generate pipeline, an agent decides when to retrieve, what to retrieve, and whether the retrieved context is sufficient. If the first retrieval doesn’t answer the question, the agent reformulates the query and retrieves again.

Agent: The user asks about password reset for enterprise SSO users.
Thought: Let me search for enterprise SSO password procedures.
Action: retrieve("enterprise SSO password reset")
Observation: No relevant results found.
Thought: Maybe it's under "federated authentication" or "SAML."
Action: retrieve("SAML federated authentication password")
Observation: Found relevant docs about SSO password flow.
Thought: Now I have the context to answer.
Action: generate(context + query)

Agentic RAG is more robust than fixed pipelines because it can recover from bad initial queries. It’s also more expensive (multiple retrieval calls, reasoning tokens) and harder to evaluate (the agent’s retrieval strategy is non-deterministic).

Use agentic RAG when your users ask complex or ambiguous questions that a single retrieval pass can’t handle. Use fixed-pipeline RAG when queries are predictable and the knowledge base is well-organized.
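The core of the trace above is a retrieve-with-retry loop. A minimal sketch, with the LLM reformulation step injected as a callable (an assumption; a real agent would also judge whether non-empty results actually answer the question):

```python
def agentic_retrieve(query, reformulate_fn, retrieve_fn, max_tries=3):
    """Bounded retrieval loop: if a pass returns nothing, ask the LLM
    (reformulate_fn) for an alternative phrasing and try again.
    Returns the results plus the query that finally worked."""
    current = query
    for _ in range(max_tries):
        results = retrieve_fn(current)
        if results:
            return results, current
        current = reformulate_fn(current)
    return [], current
```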

The human knowledge work connection

There’s an observation I keep returning to: the principles that make RAG systems work are the same principles that make human knowledge work effective.

Effective researchers don’t try to hold everything in their heads. They maintain organized collections of sources. They retrieve relevant material before formulating opinions. They synthesize from sources rather than generating from memory. They cite their sources so others can verify.

Ineffective researchers rely on memory, generate opinions without consulting sources, and can’t point to evidence for their claims.

The parallel is exact:

| RAG component | Human knowledge work equivalent |
| --- | --- |
| Chunking + indexing | Organizing notes, bookmarks, reference library |
| Query processing | Formulating a clear research question |
| Retrieval | Searching your notes, library, and references |
| Reranking | Evaluating which sources are most relevant |
| Context assembly | Gathering the key passages before writing |
| Generation | Synthesizing a response from your sources |
| Grounding check | Verifying claims against your sources |
| Citation | Citing your references |

This isn’t a coincidence. RAG works because it mirrors how reliable knowledge production actually works, whether done by humans or machines. The alternative, generating from memory (or from model weights), is faster but less reliable. Just as a human expert who cites sources is more trustworthy than one who speaks from memory alone, a RAG system that retrieves and cites is more trustworthy than a model that generates from training data.

The teams that build the best RAG systems are often the ones that think about retrieval not as a technical problem but as a knowledge management problem. How should this information be organized? What makes a good search query? How do you evaluate whether you’ve found the right sources? These questions are as old as libraries. The technology is new. The principles aren’t.

Build the retrieval right. The generation takes care of itself.
