Retrieval-Augmented Generation (RAG) is the most practical LLM architecture for enterprise use cases. It targets the two biggest problems with raw language models: hallucination and stale knowledge.
After building RAG systems for hospitals, law firms, and SaaS companies, I've learned what separates a demo that impresses in a pitch from a system that your team actually trusts at 2 AM.
What RAG Is (and Isn't)
RAG is not fine-tuning. Fine-tuning bakes knowledge into the model's weights — expensive, slow to update, and still prone to hallucination. RAG keeps knowledge external, retrieves relevant chunks at query time, and grounds the model's response in your actual documents.
The simplified flow:
1. User asks a question
2. The question is converted to a vector (embedding)
3. A similarity search finds the most relevant document chunks
4. Those chunks are injected into the LLM prompt as context
5. The LLM generates a response grounded in that context
The magic is in step 3. When it works well, the model says "I found this in document X, section Y." When it fails, the model confabulates confidently. Everything interesting is in making step 3 work reliably.
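The flow above can be sketched end to end in a few lines. This is a toy illustration, not a production recipe: the `embed` function is a bag-of-words stand-in for a real embedding model, the sample chunks are invented, and the final LLM call is left out, so the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Steps 2 and 3: embed the question, rank chunks by similarity, keep the top k."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Step 4: inject the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Invented sample chunks for illustration
chunks = [
    "The maximum dosage for adults is 40 mg per day.",
    "Visiting hours are 9 AM to 5 PM on weekdays.",
    "Parking is available in the north lot.",
]
question = "What is the maximum dosage for adults?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)  # step 5 would send this to the LLM
```

Swapping `embed` for a real model and the list scan for a vector store changes the plumbing, not the shape of the pipeline.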
The Components You Actually Need
Document Ingestion Pipeline
Before you can retrieve, you need to ingest. This is more complex than it looks:
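```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader

# Load the PDF into LangChain Document objects
loader = PyMuPDFLoader("clinical_protocol.pdf")
documents = loader.load()

# Try paragraph breaks first, then line breaks, sentences, and finally words
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # counted in characters by default, not tokens
    chunk_overlap=64,  # 12.5% of chunk_size
    separators=["\n\n", "\n", ".", " "],
)
chunks = splitter.split_documents(documents)
```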
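The key parameters are chunk_size and chunk_overlap. Too small and you lose context; too large and you retrieve too much noise. For most professional documents I use 400–600 tokens with 10–15% overlap. One caveat: RecursiveCharacterTextSplitter counts characters by default, so chunk_size=512 in the snippet above means 512 characters. To size chunks in tokens, pass a token-counting length_function or build the splitter with from_tiktoken_encoder.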
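To make the size/overlap trade-off concrete, here is a minimal sliding-window chunker in plain Python. It is a sketch under simplifying assumptions: the "tokens" are just list items (a real pipeline would use the model's tokenizer), and `chunk_tokens` is a hypothetical helper, not part of LangChain.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
# Each chunk starts with the last token of the previous one
```

With a chunk size of 512 and an overlap of 64, each new chunk advances 448 tokens, so roughly one token in eight is stored twice.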
The Vector Database
Your vector store is the retrieval engine. The main options:
| Database | Best for | Hosted |
|---|---|---|
| Pinecone | Simple, managed, reliable | Yes |
| Weaviate | Self-hosted, rich filtering | Both |
| pgvector | Already using Postgres | Self-hosted |
| Chroma | Local dev, prototyping | Local |
For production systems I default to Pinecone for its simplicity, or pgvector if the client already has Postgres — the operational overhead of a new database is often not worth it.
The Retrieval Step
Basic retrieval: cosine similarity search, return top-K chunks. This works for simple FAQs. For production, you need more:
Hybrid search combines vector similarity (semantic) with keyword search (BM25). Semantic search handles paraphrase; keyword search handles exact terminology. Most enterprise search requires both.
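One common way to merge the two result lists is reciprocal rank fusion (RRF). The article does not say which fusion method is used, so treat this as one option among several; the chunk IDs and the constant `k=60` (a conventional default) are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs; each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)

vector_hits = ["c3", "c1", "c7"]   # hypothetical semantic search results, best first
keyword_hits = ["c1", "c9", "c3"]  # hypothetical BM25 results, best first
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales.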
Metadata filtering lets you restrict retrieval to specific document types, dates, or departments before the vector search. For a hospital system, you'd filter by department first: {"department": "cardiology"}.
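Most vector stores accept a filter alongside the query, with syntax that varies by product. The idea itself is simple enough to show in memory; the chunk structure and the `filter_by_metadata` helper below are assumptions made for illustration.

```python
def filter_by_metadata(chunks: list[dict], filters: dict) -> list[dict]:
    """Keep only chunks whose metadata matches every key/value pair in `filters`."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in filters.items())
    ]

# Invented sample data
chunks = [
    {"id": "c1", "text": "Beta-blocker titration schedule", "metadata": {"department": "cardiology"}},
    {"id": "c2", "text": "Post-op wound care checklist", "metadata": {"department": "surgery"}},
    {"id": "c3", "text": "Anticoagulant dosing table", "metadata": {"department": "cardiology"}},
]

# Narrow the candidate set first; similarity search then runs only on what remains
candidates = filter_by_metadata(chunks, {"department": "cardiology"})
candidate_ids = [c["id"] for c in candidates]
```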
Re-ranking passes retrieved candidates through a cross-encoder that scores query-document relevance more precisely than the initial vector similarity. It adds latency but can dramatically improve precision.
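The two-stage shape looks roughly like this. To keep the sketch runnable without downloading a model, the pair scorer here is a crude word-overlap function standing in for the cross-encoder; in a real system `score_pair` would wrap a cross-encoder model's prediction for each (query, passage) pair. All names and sample passages are hypothetical.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Re-score each (query, candidate) pair and keep the top_n highest-scoring candidates."""
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]

def overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words found in the passage (NOT a cross-encoder)."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

# Pretend these came back from the first-stage vector search
candidates = [
    "dosage guidance for paediatric patients",
    "maximum dosage for patients over 65",
    "hospital parking and visitor policy",
]
best = rerank("maximum dosage over 65", candidates, overlap_score, top_n=1)
```

The usual pattern is to over-fetch in the first stage (say 20 to 50 candidates) and let the slower scorer pick the handful that reach the prompt.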
The Chunk Size Problem
This is where most RAG demos fail in production.
Small chunks (128 tokens) are precise for retrieval but lose surrounding context. The model might retrieve the answer to "what's the maximum dosage?" but without the sentence before it saying "for patients over 65..."
Large chunks (1024 tokens) preserve context but retrieve too much irrelevant text, degrading generation quality and wasting tokens.
The solution I use in production: hierarchical chunking. Store both summary-level and detail-level chunks. Use the summary for initial retrieval, then fetch the associated detail chunk for generation. More complex, but it solves the context problem at scale.
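A minimal sketch of that two-level structure, under stated assumptions: each section carries one summary plus its detail chunks, retrieval is matched against summaries with a toy word-overlap score (a real system would use embeddings), and the whole section's details are returned so that qualifying sentences travel with the answer. The data and function names are invented.

```python
import re

def words(text: str) -> set[str]:
    """Lowercased word set, used here as a toy similarity signal."""
    return set(re.findall(r"\w+", text.lower()))

# Invented sample corpus: one summary per section, with its detail chunks attached
sections = [
    {
        "summary": "Dosage limits for anticoagulants by patient age group",
        "details": [
            "For patients over 65, reduce the starting dose.",
            "The maximum dosage is 40 mg per day.",
        ],
    },
    {
        "summary": "Visitor policy and ward opening hours",
        "details": ["Visiting hours are 9 AM to 5 PM."],
    },
]

def retrieve_details(question: str, sections: list[dict]) -> list[str]:
    """Match the question against summaries, then return the winning section's detail chunks."""
    q = words(question)
    best = max(sections, key=lambda s: len(q & words(s["summary"])))
    return best["details"]

context = retrieve_details("What is the dosage limit for older patients?", sections)
```

Here the "over 65" qualifier and the dosage figure land in the prompt together, which is exactly what small standalone chunks tend to lose.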
Evaluating Your RAG System
"It seems to be working" is not a production standard. You need systematic evaluation before you ship.
The minimal evaluation set you should build:
- 50 representative questions that real users will ask
- The expected answer for each (the ground truth)
- The source document and section that contains the answer
Then evaluate three things:
- Retrieval recall — does the correct chunk appear in your top-K results? If it doesn't appear, the LLM can't answer correctly regardless of how good it is.
- Answer faithfulness — does the generated answer only use information from the retrieved context? Hallucination detection.
- Answer relevance — does the generated answer actually address the question asked?
Tools like RAGAS automate this evaluation. Use one.
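Retrieval recall is also simple enough to compute by hand before adopting a framework. The sketch below assumes each evaluation item records the ID of the chunk that contains the answer and that your retriever returns chunk IDs; `fake_retrieve` and the sample data are placeholders for your own pipeline.

```python
def recall_at_k(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of questions whose expected chunk appears in the top-k retrieved IDs."""
    if not eval_set:
        return 0.0
    hits = sum(
        1 for item in eval_set
        if item["expected_chunk_id"] in retrieve(item["question"], k)
    )
    return hits / len(eval_set)

# Invented evaluation items and canned retrieval results
eval_set = [
    {"question": "What is the maximum dosage?", "expected_chunk_id": "c4"},
    {"question": "When are visiting hours?", "expected_chunk_id": "c8"},
    {"question": "Who approves discharge?", "expected_chunk_id": "c2"},
    {"question": "Where is parking?", "expected_chunk_id": "c5"},
]
fake_results = {
    "What is the maximum dosage?": ["c4", "c1", "c3"],
    "When are visiting hours?": ["c6", "c8", "c1"],
    "Who approves discharge?": ["c9", "c7", "c2"],
    "Where is parking?": ["c1", "c3", "c6"],
}

def fake_retrieve(question: str, k: int) -> list[str]:
    """Placeholder retriever: returns canned chunk IDs, best first."""
    return fake_results[question][:k]

recall_top3 = recall_at_k(eval_set, fake_retrieve, k=3)  # 3 of 4 questions hit
recall_top1 = recall_at_k(eval_set, fake_retrieve, k=1)  # 1 of 4 questions hit
```

Tracking recall at several values of K shows whether a miss is a ranking problem (the chunk is at position 8) or an indexing problem (the chunk never comes back).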
What I Do Differently in Production
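Confidence thresholds: If the similarity score of the top retrieved chunk is below 0.75, don't answer; say "I don't have information about this." The right cutoff depends on the embedding model and similarity metric, so calibrate it against your evaluation set. Users trust a system that admits what it doesn't know far more than one that confidently makes things up.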
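As a sketch, the gate is a few lines in front of the generation call. The `generate` callable stands in for your LLM call, and `fake_generate` plus the sample hits are placeholders.

```python
FALLBACK = "I don't have information about this."

def answer_or_abstain(hits: list[tuple[str, float]], generate, threshold: float = 0.75) -> str:
    """hits: (chunk_text, similarity) pairs sorted best first. Abstain if the best hit is weak."""
    if not hits or hits[0][1] < threshold:
        return FALLBACK
    return generate([text for text, _ in hits])

def fake_generate(context: list[str]) -> str:
    """Placeholder for the LLM call."""
    return f"Answer based on {len(context)} chunk(s)."

confident = answer_or_abstain(
    [("Max dose is 40 mg.", 0.91), ("See section 4.", 0.62)], fake_generate
)
unsure = answer_or_abstain([("Parking is in the north lot.", 0.41)], fake_generate)
```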
Source citations: Every response includes the document name, section, and page number. This lets users verify answers and builds trust over time.
Query expansion: Before retrieval, generate 2–3 alternative phrasings of the question and retrieve for all of them. Users don't always phrase questions the way documents are written.
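A minimal sketch of the merge step, assuming the alternative phrasings have already been generated (typically by an LLM call, omitted here) and that the retriever returns (chunk ID, score) pairs. Keeping each chunk's best score across phrasings is one reasonable merge rule, not the only one; the index contents are invented.

```python
def retrieve_expanded(phrasings: list[str], retrieve, k: int = 3) -> list[str]:
    """Retrieve for every phrasing, keep each chunk's best score, return the top-k chunk IDs."""
    best_score: dict[str, float] = {}
    for phrasing in phrasings:
        for chunk_id, score in retrieve(phrasing):
            if score > best_score.get(chunk_id, float("-inf")):
                best_score[chunk_id] = score
    ranked = sorted(best_score.items(), key=lambda item: item[1], reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Canned results keyed by phrasing, standing in for a real vector search
fake_index = {
    "max dose for elderly patients": [("c4", 0.71), ("c2", 0.55)],
    "maximum dosage over 65": [("c4", 0.88), ("c9", 0.60)],
    "dose limit geriatric": [("c7", 0.80)],
}
results = retrieve_expanded(list(fake_index), lambda q: fake_index.get(q, []), k=3)
```

Note that chunk c7 only surfaces for the third phrasing, which is the case query expansion exists to catch.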
Caching: Identical or near-identical queries are common in enterprise settings. Cache both the retrieved chunks and the final response to reduce latency and cost.
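A minimal in-memory sketch: normalising case, whitespace, and punctuation catches the simplest near-duplicates. Matching paraphrases would need an embedding-based lookup, which is not shown. `ResponseCache` is a hypothetical helper, and a real deployment would also need expiry and invalidation when documents are re-indexed.

```python
import re

class ResponseCache:
    """Cache pipeline results keyed by a normalised form of the query."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse punctuation/whitespace so trivial variants share a key
        return re.sub(r"\W+", " ", query.lower()).strip()

    def get_or_compute(self, query: str, compute):
        key = self._key(query)
        if key not in self._store:
            self._store[key] = compute(query)
        return self._store[key]

calls = []

def slow_pipeline(query: str) -> str:
    """Placeholder for retrieval + generation; records each real invocation."""
    calls.append(query)
    return f"answer to: {query}"

cache = ResponseCache()
first = cache.get_or_compute("What is the max dosage?", slow_pipeline)
second = cache.get_or_compute("  what is the MAX dosage", slow_pipeline)  # served from cache
```

The cached value can be anything `compute` returns, so storing a (retrieved chunks, response) pair covers both halves of the advice above.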
When RAG Is the Wrong Choice
RAG is not the solution to every knowledge problem.
If your documents change faster than you can re-index, you'll serve stale results. If your documents are poorly structured or duplicate-heavy, retrieval quality will suffer. If the question requires complex multi-document reasoning across hundreds of chunks, a single RAG query won't get you there — you need agentic retrieval.
RAG works best when: your corpus is stable and well-organised, questions can be answered from a single document section, and you have the engineering capacity to evaluate retrieval quality continuously.
A Note on Prompt Design
The retrieval is only half the problem. The prompt that wraps your context chunks matters enormously:
```
You are a clinical knowledge assistant. Answer only using the provided context.
If the answer is not in the context, say "I don't have information about this."
Always cite the source document and section.

Context:
{retrieved_chunks}

Question: {user_question}
```
The instruction "answer only using the provided context" is not magic — models still hallucinate if the context is ambiguous or incomplete. But it significantly reduces hallucination compared to open-ended prompting.
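Filling the template above is plain string formatting. In this sketch each chunk is labelled with its source so the model has something concrete to cite; the chunk fields (`source`, `section`, `page`, `text`) and the label format are assumptions, not a fixed schema.

```python
PROMPT_TEMPLATE = """You are a clinical knowledge assistant. Answer only using the provided context.
If the answer is not in the context, say "I don't have information about this."
Always cite the source document and section.

Context:
{retrieved_chunks}

Question: {user_question}"""

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Prefix each chunk with a source label, then fill the template."""
    context = "\n\n".join(
        f"[{c['source']} | {c['section']} | p. {c['page']}]\n{c['text']}" for c in chunks
    )
    return PROMPT_TEMPLATE.format(retrieved_chunks=context, user_question=question)

# Invented sample chunk
prompt = build_prompt(
    "What is the maximum dosage?",
    [{"source": "clinical_protocol.pdf", "section": "4.2", "page": 12,
      "text": "The maximum dosage is 40 mg per day."}],
)
```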
Getting Started
If you're evaluating RAG for your business, the fastest path to a useful prototype is:
- Pick 20–30 representative documents from your corpus
- Build a basic ingestion pipeline with langchain and chroma (local, no cost)
- Create your 50-question evaluation set
- Measure retrieval recall before you do anything else
If retrieval recall is below 80%, no amount of prompt engineering will fix it. Fix retrieval first.
If you want to skip the learning curve and get a production-ready system built in weeks, let's talk.