Retrieval-Augmented Generation (RAG) has become the standard architecture for grounding LLM responses in real documents. But most teams discover the same thing after their first prototype: the demo works, production doesn’t. The difference almost always lives in the quality of three components — how you chunk documents, which embedding model you use, and how you retrieve results at query time.
Here’s a practical look at each stage of a production RAG pipeline with enough detail to make real implementation decisions.
The Core RAG Pipeline
Every RAG system has the same five stages: ingestion (load raw documents — PDFs, HTML, markdown, database rows), chunking (split documents into retrievable units), embedding (convert chunks into vector representations), indexing (store vectors and metadata in a vector store), and retrieval and generation (at query time, embed the query, fetch relevant chunks, and generate a grounded response).
When you first build it, stages 1 and 4 feel important. In production, stages 2 and 3 determine most of your outcome quality. Stage 5 — how you retrieve — is where you capture the last 20% of gains.
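To make the five stages concrete, here is a minimal sketch of the whole pipeline in plain Python. It assumes a toy word-count "embedding" and an in-memory list as the index; every function name here is hypothetical and stands in for a real loader, embedding model, and vector store.

```python
import math
from collections import Counter

def chunk(text: str, size: int = 8) -> list[str]:
    """Stage 2: split into windows of `size` words (no overlap, for brevity)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Stage 3: toy 'embedding' made of lowercase word counts. A real system calls a model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Stages 1-4: ingest, chunk, embed, and store (chunk, vector) pairs."""
    return [(c, embed(c)) for d in docs for c in chunk(d)]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Stage 5, retrieval half: embed the query and return the k closest chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

docs = [
    "Refunds are issued within 14 days of a return request.",
    "Shipping takes three to five business days within the EU.",
]
index = build_index(docs)
question = "how long do refunds take"
context = retrieve(index, question, k=1)

# Stage 5, generation half: the retrieved chunk is placed in the prompt sent to the LLM.
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {question}"
```

Note that the 8-word window cuts the first document after "14 days of a". Even this toy version shows why chunking decisions matter.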
Chunking Strategies
Chunking determines the units your retrieval system can return. A chunk that’s too small lacks context; too large and it contains irrelevant content that degrades generation quality and wastes tokens.
Fixed-Size Chunking
Split text every N tokens (typically 256–512), with an overlap of 10–20% so that content cut at a chunk boundary still appears with its surrounding context in the neighbouring chunk.
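from langchain.text_splitter import RecursiveCharacterTextSplitter

# The plain constructor measures chunk_size in characters. from_tiktoken_encoder
# measures it in tokens (requires the tiktoken package), matching the sizes above.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,    # tokens per chunk
    chunk_overlap=64,  # 12.5% overlap
    separators=["\n\n", "\n", ". ", " "],  # prefer paragraph breaks, then lines, sentences, words
)
chunks = splitter.split_documents(docs)  # docs: LangChain Document objects from the ingestion stage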
Use this for homogeneous documents with consistent content density — news articles, product descriptions. It’s fast to implement and easy to tune. The limitation: it ignores document structure, so a chunk may cut across a heading boundary and orphan context.
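If you want to see exactly what the overlap does, the sliding window is easy to write by hand. This is a minimal sketch, not the LangChain implementation: `fixed_size_chunks` is a hypothetical helper, and the `t0`, `t1`, … strings stand in for real model tokens.

```python
def fixed_size_chunks(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = fixed_size_chunks(tokens, chunk_size=4, overlap=1)
for c in chunks:
    print(c)
# ['t0', 't1', 't2', 't3']
# ['t3', 't4', 't5', 't6']
# ['t6', 't7', 't8', 't9']
```

Each chunk repeats the last token of the previous one, so a larger overlap means more duplicated storage in exchange for more shared context at the boundaries.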
Semantic Chunking
Group sentences by semantic similarity — keep sentences together when they discuss the same idea, split when the topic shifts.
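from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Splits where the embedding distance between adjacent sentences exceeds a percentile threshold.
# OpenAIEmbeddings reads OPENAI_API_KEY from the environment.
splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="percentile")
chunks = splitter.split_documents(docs)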
Better for long documents with varied topics — research papers, legal documents, technical manuals. Produces more coherent chunks at the cost of higher compute during ingestion.
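The idea behind percentile breakpoints can be sketched in a few lines. This is a simplified illustration, not SemanticChunker's actual code: it assumes you already have one embedding vector per sentence, and `semantic_chunks` and its helpers are hypothetical names.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def percentile(values: list[float], pct: float) -> float:
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def semantic_chunks(sentences: list[str], vectors: list[list[float]], pct: float = 95) -> list[list[str]]:
    """Start a new chunk wherever the distance to the previous sentence exceeds the pct-th percentile."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # topic shift: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks

sentences = ["Cats purr.", "Cats nap a lot.", "Tax is due in April.", "File your return early."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # made-up 2-d embeddings
result = semantic_chunks(sentences, vectors)
```

Here the only large jump in distance is between the second and third sentences, so the text splits into a cat chunk and a tax chunk.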
Hierarchical (Parent-Child) Chunking
Store large “parent” chunks for generation context and small “child” chunks for precision retrieval. Retrieve via child chunks, but pass the parent chunk to the LLM.
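# llama-index 0.10+ moved node parsers under llama_index.core;
# older releases used `from llama_index.node_parser import ...`
from llama_index.core.node_parser import HierarchicalNodeParser

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # parent, child, grandchild (sizes in tokens)
)
nodes = parser.get_nodes_from_documents(documents)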
This is often the strongest-performing strategy for enterprise RAG use cases. Small chunks give precise retrieval; large parent chunks give the model sufficient context to generate accurate, coherent answers. The extra storage cost is usually worth it, but confirm that against your own eval results.
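The "retrieve via child, generate from parent" step boils down to a lookup plus de-duplication. Here is a minimal sketch under assumed data structures: the `child_to_parent` and `parents` dictionaries and the `expand_to_parents` helper are hypothetical, and a framework would normally manage these mappings for you.

```python
def expand_to_parents(
    child_hits: list[str],
    child_to_parent: dict[str, str],
    parents: dict[str, str],
    max_parents: int = 3,
) -> list[str]:
    """Map ranked child hits to unique parent texts, preserving rank order."""
    seen: set[str] = set()
    context: list[str] = []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id in seen:
            continue  # several children of one parent matched; send the parent once
        seen.add(parent_id)
        context.append(parents[parent_id])
        if len(context) == max_parents:
            break
    return context

parents = {"p1": "Full text of section 1", "p2": "Full text of section 2"}
child_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2"}

# Suppose vector search over the small child chunks ranked them c2, c1, c3.
context = expand_to_parents(["c2", "c1", "c3"], child_to_parent, parents)
print(context)  # ['Full text of section 1', 'Full text of section 2']
```

De-duplicating matters because the best-matching children often cluster inside the same parent, and sending that parent twice wastes tokens.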
Embedding Model Selection
The embedding model determines how well your semantic search performs. Higher-dimensional models (1536d, 3072d) capture more nuance but cost more to store and query. For most production use cases, 768–1024 dimensions offer a strong quality/cost balance.
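The storage side of that trade-off is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats and counts only the raw vectors, so index structures and metadata come on top. The 10 million chunk figure is just an illustrative corpus size.

```python
def index_size_gb(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB: one float32 (4 bytes) per dimension per chunk."""
    return num_chunks * dims * bytes_per_value / 1e9

for dims in (768, 1024, 1536, 3072):
    print(f"{dims}d: {index_size_gb(10_000_000, dims):.1f} GB")
# 768d: 30.7 GB
# 1024d: 41.0 GB
# 1536d: 61.4 GB
# 3072d: 122.9 GB
```

Going from 768 to 3072 dimensions quadruples the footprint, which is why the extra nuance has to earn its place in your eval results.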
General-purpose models (OpenAI’s text-embedding-3-large, Cohere’s embed-english-v3.0) perform well on broad content. For legal, medical, or scientific domains, models fine-tuned on in-domain data often outperform them, and strong open models from the MTEB leaderboard such as BAAI/bge-large-en-v1.5 are a practical base for that fine-tuning.
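The MTEB (Massive Text Embedding Benchmark) leaderboard at huggingface.co/spaces/mteb/leaderboard is the best public starting point for comparing models on retrieval tasks. Filter by the task type closest to your use case, then confirm the shortlist on your own eval set, since leaderboard rankings don't always carry over to a specific corpus.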
Reasonable defaults for 2026: text-embedding-3-large (OpenAI) or voyage-3 (Voyage AI) if you want a hosted API, and BAAI/bge-m3 for self-hosted multilingual workloads.
Retrieval Strategies
Basic top-k cosine similarity retrieval is a starting point, not a destination.
Maximal Marginal Relevance (MMR)
Standard top-k retrieval returns the k most similar chunks, which often means k near-identical chunks that all say the same thing. MMR balances relevance against diversity, penalising chunks that are too similar to ones already selected.
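# vector_store: any LangChain vector store built from your embedded chunks
retriever = vector_store.as_retriever(
    search_type="mmr",
    # fetch the 20 nearest chunks, then pick 6 of them by MMR
    search_kwargs={"k": 6, "fetch_k": 20, "lambda_mult": 0.5}
)
# lambda_mult: 0 = maximum diversity, 1 = maximum relevance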
MMR is particularly valuable when documents have repeated content — a legal clause that appears in multiple contracts, for instance.
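To see how the relevance/diversity balance plays out, here is a small greedy MMR written from scratch. Treat it as a sketch of the technique rather than what any particular vector store runs internally. The vectors are made up, and `mmr` is a hypothetical helper whose `lambda_mult` follows the same convention as above.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec: list[float], doc_vecs: list[list[float]], k: int, lambda_mult: float = 0.5) -> list[int]:
    """Greedily pick k document indices, trading query relevance against redundancy."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: highest similarity to anything already selected (0 if nothing is).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 1.0]
docs = [
    [1.0, 0.20],  # 0: near-duplicate of doc 1
    [1.0, 0.25],  # 1: most relevant
    [0.15, 1.0],  # 2: slightly less relevant, but covers a different angle
]
print(mmr(query, docs, k=2, lambda_mult=1.0))  # [1, 0]  pure relevance: both near-duplicates
print(mmr(query, docs, k=2, lambda_mult=0.5))  # [1, 2]  the duplicate is penalised
```

With `lambda_mult=1.0` this degenerates to plain top-k. At 0.5, the second slot goes to the chunk that adds something new.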
Hybrid Search
Combine dense (vector) retrieval with sparse (BM25 keyword) retrieval and merge the ranked lists. Dense retrieval finds conceptually related content; sparse retrieval finds exact keyword matches. Together they handle both “find me documents about financial risk” and “find every occurrence of clause 14.2(b)”.
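from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # needs the rank_bm25 package

bm25_retriever = BM25Retriever.from_documents(chunks)                 # sparse keyword retriever
dense_retriever = vector_store.as_retriever(search_kwargs={"k": 6})  # dense vector retriever
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6]  # tune based on your eval results
)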
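Weaviate and Elasticsearch have hybrid search built in. With Pinecone or ChromaDB, the usual approach is to run BM25 separately (the rank_bm25 Python library) and merge the two ranked lists using Reciprocal Rank Fusion (RRF), which scores each chunk by summing 1/(k + rank) over every list it appears in. Check your store's current documentation first, since native sparse and hybrid support changes between releases.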
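If you do end up merging BM25 and dense results yourself, Reciprocal Rank Fusion is only a few lines. This sketch assumes each retriever hands back an ordered list of chunk IDs. The IDs are made up, and `k=60` is the constant commonly used with RRF rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["c3", "c1", "c7"]   # keyword matches, best first
dense_ranking = ["c1", "c5", "c3"]  # vector matches, best first
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['c1', 'c3', 'c5', 'c7']
```

RRF only looks at ranks, so you never have to reconcile BM25 scores with cosine similarities, which live on different scales.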
Reranking
Add a cross-encoder reranker as a final step: retrieve a large candidate set (top 20–50) with fast vector search, then rerank with a more accurate model.
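from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# CohereRerank reads COHERE_API_KEY from the environment
reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)  # keep the 5 best after reranking
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    # first-stage retriever, e.g. vector_store.as_retriever(search_kwargs={"k": 30})
    base_retriever=vector_retriever
)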
Cohere Rerank and cross-encoder/ms-marco-MiniLM-L-6-v2 (open-source) are the standard choices. Reranking typically adds 50–150ms of latency but can improve answer accuracy by 15–30% on complex queries. Both figures depend on candidate-set size, model, and hardware, so measure them on your own workload. It is worth it for most use cases.
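The two-stage pattern is independent of which reranker you pick. Below is a minimal sketch with a pluggable scoring function: `rerank` and `toy_overlap_scorer` are hypothetical names, and the word-overlap scorer is only a stand-in so the example runs without downloading a model.

```python
from typing import Callable, Sequence

def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, passage) pair with the reranker and keep the top_n passages."""
    pairs = [(query, text) for text in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]

def toy_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in scorer: fraction of query words that appear in the passage."""
    scores = []
    for query, passage in pairs:
        q_words = set(query.lower().split())
        scores.append(len(q_words & set(passage.lower().split())) / len(q_words))
    return scores

candidates = ["the cat sat", "termination clause notice period", "notice of termination"]
top = rerank("termination notice period", candidates, toy_overlap_scorer, top_n=2)
print(top)  # ['termination clause notice period', 'notice of termination']
```

With the sentence-transformers package, `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` takes the same list of pairs and could be passed as `score_pairs`.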
Evaluation Metrics
Don’t deploy a RAG system you haven’t evaluated. The three metrics that matter most are context recall (of the relevant chunks that exist in your corpus, what percentage are actually retrieved?), context precision (of the chunks retrieved, what percentage are actually relevant?), and answer faithfulness (does the generated answer stick to what the retrieved chunks actually say, or does the model hallucinate beyond the context?).
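The RAGAS library automates all three, and the example below also scores answer relevancy (whether the answer actually addresses the question asked):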
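from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

# eval_dataset: a Hugging Face Dataset. ragas 0.1.x expects the columns
# question, answer, contexts, and ground_truth (names differ in later releases).
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision]
)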
Build an eval set of at least 50–100 representative queries with known correct answers before touching production. Run evals every time you change chunking strategy, embedding model, or retrieval parameters. Without eval infrastructure, RAG “tuning” is guesswork.
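Before wiring up a full framework, you can sanity-check retrieval with the set-based definitions of context precision and recall given above. This sketch assumes you have hand-labelled the relevant chunk IDs for each query. RAGAS computes its versions differently (with an LLM judge), so expect the numbers to differ. The function name and IDs are hypothetical.

```python
def context_precision_recall(retrieved_ids: list[str], relevant_ids: list[str]) -> tuple[float, float]:
    """Precision: share of retrieved chunks that are relevant.
    Recall: share of relevant chunks that were retrieved."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# One eval query: the retriever returned four chunks, and three chunks were labelled relevant.
precision, recall = context_precision_recall(
    retrieved_ids=["c1", "c2", "c3", "c4"],
    relevant_ids=["c1", "c4", "c9"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Averaging these two numbers over the eval set gives a cheap baseline to compare against whenever you change chunking or retrieval parameters.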
Putting It Together
A production-ready RAG stack in 2026 typically looks like: hierarchical chunking with semantic splitting → text-embedding-3-large or voyage-3 → Pinecone or Weaviate with hybrid search → MMR retrieval → cross-encoder reranking → RAGAS-evaluated generation. Start simpler, and add complexity only when your eval results justify it.