
What Is Retrieval-Augmented Generation (RAG) and How Does It Work?
RAG for Everyone
The problem: a brilliant consultant with outdated knowledge
Imagine hiring a brilliant consultant. She has read millions of books, can write beautifully, and reasons about almost any topic. But there's a catch: she finished her reading two years ago, locked herself in a room, and hasn't seen anything new since. She has never read your company's internal documents, your product manuals, or yesterday's news.
That consultant is a large language model (LLM) — the technology behind tools like ChatGPT and Claude. These models are trained on enormous amounts of text, but their knowledge is frozen at the moment training ended. Ask them about your company's vacation policy or a contract you signed last month, and they have two options: admit they don't know, or — worse — make something up that sounds convincing. This "making things up" problem is called hallucination, and it's one of the biggest obstacles to using AI in serious work.
The solution: give her a library card
Retrieval-Augmented Generation, or RAG, solves this with an idea so simple it feels obvious in hindsight: before the AI answers your question, let it look things up.
Instead of relying only on memory, a RAG system works like an open-book exam:
You ask a question. "How many vacation days do new employees get?"
The system searches. Behind the scenes, it scans a collection of documents you've given it — HR policies, manuals, reports, whatever you choose — and pulls out the few passages most relevant to your question.
The AI reads, then answers. Those passages are handed to the AI along with your question, with an instruction that amounts to: "Answer using this material." The AI writes its response grounded in what it just read.
The answer you get is no longer based on a two-year-old memory. It's based on your documents, retrieved fresh, at the moment you asked.
Why this matters
Accuracy. When the AI answers from real documents rather than fuzzy memory, it's far less likely to invent facts. Many RAG systems even show their sources — "according to page 12 of the HR handbook" — so you can verify the answer yourself.
Freshness. Updating the AI's knowledge no longer requires retraining a giant model (slow and expensive). You just update the documents. Add this morning's report to the library, and the AI can use it this afternoon.
Privacy and specificity. Your private documents do not have to become part of the AI's training data. They stay in your library, consulted only when needed. This is how a general-purpose AI becomes an expert on your business overnight.
Cost. Teaching a model new knowledge through retraining costs enormous amounts of compute. Giving it a search tool costs almost nothing by comparison.
A real-world picture
Think of a customer-support chatbot for an airline. Without RAG, it might confidently quote a baggage policy from 2023 that no longer exists. With RAG, when you ask "Can I bring a stroller?", the system fetches the current baggage policy page, hands it to the AI, and the AI answers from that page. If the policy changes tomorrow, the answers change tomorrow too — no retraining, no waiting.
That's RAG in a sentence: a memory upgrade for AI, delivered as a library card instead of brain surgery.
RAG for Engineers
Architecture overview
A RAG pipeline has two phases: an offline indexing phase and an online retrieval + generation phase.
INDEXING (offline)
Documents → Chunking → Embedding → Vector store
QUERY TIME (online)
User query → Embed query → Similarity search → Top-k chunks
→ Prompt assembly (query + chunks) → LLM → Grounded answer
1. Ingestion and chunking
Raw documents (PDFs, HTML, Markdown, database rows) are parsed and split into chunks. Chunking is one of the most consequential design decisions in the pipeline:
Fixed-size chunking (e.g., 512 tokens with 10–20% overlap) is simple and predictable but can split sentences or tables mid-thought.
Semantic / structural chunking splits on headings, paragraphs, or semantic boundaries, preserving coherent units of meaning.
Trade-off: small chunks give precise retrieval but lose context; large chunks preserve context but dilute relevance scores and burn context-window tokens. A common pattern is parent-document retrieval: search over small chunks, but feed the LLM the larger parent section they came from.
Metadata (source, date, section, access permissions) is attached to each chunk — essential for filtering, citations, and security.
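The fixed-size strategy with overlap can be sketched in a few lines. This is a minimal illustration, not a production chunker: the function name `chunk_fixed` and the metadata keys are made up for the example, and whitespace-separated words stand in for real tokenizer tokens.

```python
def chunk_fixed(tokens, source, size=512, overlap=64):
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        # Attach metadata to every chunk so it can be filtered and cited later.
        chunks.append({
            "text": " ".join(window),
            "source": source,
            "start_token": start,
        })
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Words stand in for tokens here (an assumption made for readability).
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_fixed(words, source="example.txt", size=4, overlap=1)
for chunk in chunks:
    print(chunk["start_token"], chunk["text"])
```

With `size=4` and `overlap=1`, each chunk begins with the last word of the previous one, which is the property that keeps a sentence split across a boundary retrievable from either side.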
2. Embeddings
Each chunk is passed through an embedding model (e.g., OpenAI's text-embedding-3 models, Cohere Embed, or open-source models like bge or e5) that maps text to a dense vector, typically 384 to 3072 dimensions, where semantic similarity corresponds to geometric proximity. "How do I reset my password?" and "Steps for password recovery" land close together in this space despite sharing few words.
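To make "geometric proximity" concrete, here is a small sketch of cosine similarity. The three-dimensional vectors are hand-made stand-ins, not the output of any real embedding model; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors chosen by hand for illustration only.
reset_password = [0.9, 0.1, 0.0]      # "How do I reset my password?"
password_recovery = [0.8, 0.2, 0.1]   # "Steps for password recovery"
baggage_policy = [0.0, 0.1, 0.9]      # an unrelated topic

sim_related = cosine_similarity(reset_password, password_recovery)
sim_unrelated = cosine_similarity(reset_password, baggage_policy)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The two password questions score close to 1, while the unrelated vector scores close to 0, which is the behavior a retrieval system relies on.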
3. Vector storage and search
Vectors go into a vector database (pgvector, Pinecone, Weaviate, Qdrant, Milvus, or even an in-memory FAISS index). At query time, the user's question is embedded with the same model, and the store performs approximate nearest neighbor (ANN) search — algorithms like HNSW trade a tiny amount of recall for sub-millisecond lookups across millions of vectors. Similarity is usually cosine similarity or dot product.
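ANN indexes approximate the result of an exact search. The exact version is worth seeing as a baseline. The sketch below is a brute-force top-k search over an in-memory list; the names `exact_top_k` and `index` and the toy vectors are assumptions for the example, not any vector database's API. It normalizes vectors up front so that the dot product equals cosine similarity.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query_vec, index, k):
    """Brute-force search: score every stored vector and keep the k best."""
    q = normalize(query_vec)
    return heapq.nlargest(k, index, key=lambda item: dot(q, item[1]))


# Hypothetical chunk IDs with hand-made vectors standing in for embeddings.
index = [(chunk_id, normalize(vec)) for chunk_id, vec in [
    ("password-reset", [0.9, 0.1, 0.0]),
    ("account-recovery", [0.7, 0.3, 0.1]),
    ("baggage-policy", [0.0, 0.1, 0.9]),
]]

hits = exact_top_k([1.0, 0.2, 0.0], index, k=2)
hit_ids = [chunk_id for chunk_id, _ in hits]
print(hit_ids)
```

This scan costs time proportional to the corpus size per query, which is the cost an ANN index such as HNSW is designed to avoid.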
Pure vector search misses exact matches (product codes, names, acronyms), so production systems typically use hybrid search: combining dense vectors with classic lexical search (BM25), then merging results with reciprocal rank fusion.
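Reciprocal rank fusion is simple enough to sketch directly. Each document receives a score of 1 / (k + rank) from every ranked list it appears in, and the scores are summed. The constant `k=60` is a commonly used default and is treated here as an assumption; the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one list ordered by summed 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists from a dense (vector) search and a lexical (BM25) search.
dense_results = ["doc_a", "doc_b", "doc_c"]
lexical_results = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_results, lexical_results])
print(fused)
```

Because the method uses only ranks, it never has to compare a cosine score against a BM25 score, which live on unrelated scales.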
4. Reranking
The top-k candidates (say, 20–50) from the fast ANN search are often passed through a cross-encoder reranker (e.g., Cohere Rerank, bge-reranker). Unlike the embedding model, which encodes query and document independently, a cross-encoder reads them together, producing a much more accurate relevance score. It's too slow to run over the whole corpus, but cheap to run over 50 candidates. The best 3–10 chunks survive.
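The two-stage shape (cheap retrieval first, expensive scoring second) might look like the sketch below. `toy_pair_score` is a deliberately crude stand-in that counts shared words. A real cross-encoder is a neural model that reads the query and passage together, and nothing here reflects a specific reranker's API.

```python
def rerank(query, candidates, score_fn, keep=3):
    """Score every (query, candidate) pair jointly and keep the best `keep` candidates."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]


def toy_pair_score(query, text):
    """Stand-in for a cross-encoder: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


# Hypothetical candidates returned by the first-stage search.
candidates = [
    "baggage fees for international flights",
    "strollers can be checked at the gate for free",
    "how to check a stroller at the gate",
]

top = rerank("can i check a stroller", candidates, toy_pair_score, keep=2)
print(top)
```

The scoring function runs once per candidate, so its cost grows with the candidate count rather than the corpus size. That is why it is affordable after the first stage but not before it.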
5. Prompt assembly and generation
The surviving chunks are injected into the LLM prompt, typically structured like:
System: Answer the user's question using only the provided context.
If the context is insufficient, say so. Cite sources by ID.
Context:
[1] (hr-handbook.pdf, p.12) "New employees accrue 25 vacation days..."
[2] (policy-update-2026.md) "Effective March 2026, accrual begins..."
User: How many vacation days do new employees get?
The LLM generates an answer grounded in this context. Instructing the model to cite chunk IDs enables verifiable, linkable answers and makes hallucination detectable.
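Assembling that prompt is mostly string formatting. The sketch below assumes each retrieved chunk is a dict with `source` and `text` keys, a shape chosen for this example. The numbering gives the model stable IDs to cite.

```python
SYSTEM_INSTRUCTION = (
    "Answer the user's question using only the provided context.\n"
    "If the context is insufficient, say so. Cite sources by ID."
)


def build_prompt(question, chunks):
    """Number each chunk so the model can cite it, then append the user's question."""
    context_lines = [
        f'[{i}] ({chunk["source"]}) "{chunk["text"]}"'
        for i, chunk in enumerate(chunks, start=1)
    ]
    return (
        f"System: {SYSTEM_INSTRUCTION}\n"
        "Context:\n"
        + "\n".join(context_lines)
        + f"\nUser: {question}"
    )


# Chunks mirroring the example above; the dict shape is an assumption.
retrieved = [
    {"source": "hr-handbook.pdf, p.12", "text": "New employees accrue 25 vacation days..."},
    {"source": "policy-update-2026.md", "text": "Effective March 2026, accrual begins..."},
]

prompt = build_prompt("How many vacation days do new employees get?", retrieved)
print(prompt)
```

In a real system the resulting string (or an equivalent list of chat messages) would be sent to whichever LLM API is in use; that call is omitted here.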
6. Evaluation
RAG quality is measured along two axes:
Retrieval metrics: recall@k, precision@k, MRR — did the right chunks make it into the context?
Generation metrics: faithfulness (is the answer supported by the retrieved context?) and answer relevance (does it actually address the question?). Frameworks like RAGAS and tools like LangSmith or Arize automate this with LLM-as-judge scoring.
A useful debugging heuristic: most bad RAG answers are retrieval failures, not generation failures. Inspect what was retrieved before blaming the model.
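The retrieval metrics are straightforward to compute once you have, for each test question, the list of retrieved chunk IDs and the set of chunk IDs a human marked as relevant. The sketch below uses the standard definitions; the chunk IDs and the two-question test set are invented for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of the relevant chunks that appear in the top k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(retrieved, relevant, k):
    """Share of the top k results that are relevant."""
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k if k else 0.0


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per test question."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical evaluation data for two questions.
runs = [
    (["c3", "c7", "c1", "c9"], {"c1", "c2"}),
    (["c5", "c2"], {"c5"}),
]

print(recall_at_k(*runs[0], k=3))
print(precision_at_k(*runs[0], k=3))
print(mean_reciprocal_rank(runs))
```

Running these numbers before inspecting any generated answer follows the heuristic above: if recall@k is low, no prompt change will rescue the answer.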
Common failure modes and fixes
| Failure | Typical cause | Mitigation |
|---|---|---|
| Right answer exists, wrong chunks retrieved | Vocabulary mismatch between query and docs | Hybrid search, query rewriting/expansion (HyDE) |
| Retrieved chunks lack context | Chunks too small | Parent-document retrieval, larger overlap |
| Answer ignores context | Weak prompt, conflicting parametric knowledge | Stronger grounding instructions, citation enforcement |
| Stale answers | Index not refreshed | Incremental/event-driven re-indexing |
| Data leakage across users | Missing access control on chunks | Metadata-based permission filtering at query time |
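The last mitigation in the table, permission filtering, can be illustrated with a short sketch. It assumes each chunk carries an `allowed_groups` metadata field, a name invented for this example. Many vector stores can apply an equivalent filter inside the search itself, which is preferable because unauthorized chunks are then never scored at all.

```python
def filter_by_permission(chunks, user_groups):
    """Keep only chunks whose allowed groups overlap with the user's groups."""
    allowed = set(user_groups)
    return [chunk for chunk in chunks if allowed & set(chunk["allowed_groups"])]


# Hypothetical chunks with access metadata attached at indexing time.
chunks = [
    {"id": "hr-001", "text": "Vacation policy...", "allowed_groups": ["all-staff"]},
    {"id": "fin-042", "text": "Salary bands...", "allowed_groups": ["finance", "hr"]},
    {"id": "eng-007", "text": "Deploy runbook...", "allowed_groups": ["engineering"]},
]

visible = filter_by_permission(chunks, user_groups=["all-staff", "engineering"])
visible_ids = [chunk["id"] for chunk in visible]
print(visible_ids)
```

The important property is that filtering happens before prompt assembly, so a chunk the user may not see can never reach the LLM's context.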
Beyond vanilla RAG
The field has moved past the simple "embed → search → stuff prompt" pattern:
Query transformation: rewriting vague user queries, decomposing multi-part questions into sub-queries, or generating a hypothetical answer and searching with that (HyDE).
Agentic RAG: the LLM decides whether to retrieve, what to search for, and can iterate — search, read, refine the query, search again — rather than executing a fixed pipeline.
GraphRAG: building a knowledge graph from documents so retrieval can follow entity relationships, which helps for questions that span many documents ("summarize all disputes involving supplier X").
Long-context vs. RAG: as context windows grow to millions of tokens, you can sometimes just paste everything in. But RAG remains cheaper, faster, and more auditable at scale — and the two combine well (retrieve broadly, let the long context absorb more chunks).
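The agentic pattern in the list above replaces a fixed pipeline with a loop. The sketch below shows only the control flow: `search`, `is_sufficient`, and `refine` are hypothetical callables, implemented here as trivial stubs. In a real system the latter two would be decisions made by the LLM.

```python
def agentic_retrieve(question, search, is_sufficient, refine, max_rounds=3):
    """Search, check whether enough was found, refine the query, and repeat up to max_rounds."""
    query = question
    gathered = []
    for _ in range(max_rounds):
        gathered.extend(search(query))
        if is_sufficient(question, gathered):
            break
        query = refine(question, gathered)
    return gathered


# Toy stand-ins so the loop can run without an LLM or a search backend.
toy_corpus = {
    "supplier x disputes": ["Dispute 2024-03 with supplier X over late delivery."],
    "supplier x late delivery penalties": ["Contract clause 7: penalties for late delivery."],
}


def toy_search(query):
    return toy_corpus.get(query, [])


def toy_is_sufficient(question, passages):
    return len(passages) >= 2


def toy_refine(question, passages):
    return "supplier x late delivery penalties"


passages = agentic_retrieve(
    "supplier x disputes", toy_search, toy_is_sufficient, toy_refine
)
print(passages)
```

The `max_rounds` cap matters in practice: without it, a model that never judges the context sufficient would keep searching indefinitely.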
RAG vs. fine-tuning — a quick rule of thumb
Fine-tuning teaches a model new behavior (style, format, domain reasoning patterns). RAG gives it new knowledge (facts, documents, current data). Use fine-tuning when the model answers in the wrong way; use RAG when it answers with the wrong facts. Many production systems use both.
In short: RAG turns a closed-book exam into an open-book one. For users, it means accurate, current, verifiable answers. For engineers, it means knowledge updates become a data-pipeline problem instead of a model-training problem — and that changes everything about how fast you can ship.
Author: Mohamad Arnaout is a software professional at Devista with experience in software development, business analysis, and project management. He is passionate about designing practical technology solutions that improve processes, enhance collaboration, and deliver business value.
