RAG that holds up: connecting org data to LLMs
Most production RAG systems get to demo, then quietly underperform. The model isn't the bottleneck. The retrieval is. Treating RAG as "embed everything and hope" is the most common reason an internal AI assistant feels great in week one and useless in month three.
Here's what reliably moves the needle.
RAG is a data pipeline, not a model
A RAG system has seven stages, and quality compounds across them:
1. Ingestion: parse source documents into clean text.
2. Chunking: split the text into retrieval units.
3. Embedding: encode chunks as vectors.
4. Storage: index vectors and (ideally) keyword tokens.
5. Retrieval: given a query, return candidate chunks.
6. Reranking: re-order candidates with a more accurate scorer.
7. Generation: give the LLM the top chunks and a grounded prompt.
Most teams skip step 6, half-ass steps 2 and 5, and then blame the LLM at step 7.
Don't lose document structure
Chunking is where most quality is lost. A wiki page becomes a wall of disconnected 512-token segments, and the retriever no longer knows what document a chunk came from, where it sat in a hierarchy, or what its neighbors said.
Practices that pay back immediately:
- Chunk by structure first, length second. Use the document's own headings and sections.
- Attach metadata to every chunk: doc_id, breadcrumb, last-modified, owner, ACL. Filter on it at retrieval time.
- Overlap is not a chunking strategy. It's a band-aid for bad chunking. Fix the chunking instead.
- Tables and code don't chunk like prose. Keep them whole, or extract them separately and link back.
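To make "structure first, length second" concrete, here is a minimal sketch of a heading-aware chunker. It assumes the source has already been parsed to Markdown-style text with `#` headings; the `Chunk` class, the `chunk_by_headings` function, and the `max_chars` limit are hypothetical names chosen for illustration, and the sketch does not handle tables or fenced code, which need the separate treatment described above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Hypothetical chunk record; field names mirror the metadata listed above.
    doc_id: str
    chunk_id: int
    breadcrumb: list[str]  # heading path, e.g. ["Security", "Audit logs"]
    text: str
    metadata: dict = field(default_factory=dict)  # owner, ACL, last-modified, ...


HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(doc_id, markdown, metadata=None, max_chars=2000):
    """Split on headings first; split by length only inside an oversized section."""
    chunks = []
    path = []    # current heading path (the breadcrumb)
    buffer = []  # lines of the section being accumulated

    def emit(text):
        chunks.append(Chunk(doc_id, len(chunks), list(path), text, dict(metadata or {})))

    def flush():
        text = "\n".join(buffer).strip()
        buffer.clear()
        if not text:
            return
        # Length is the second criterion: break long sections on paragraph boundaries.
        # A single paragraph longer than max_chars is kept whole.
        piece = ""
        for para in text.split("\n\n"):
            if piece and len(piece) + len(para) + 2 > max_chars:
                emit(piece)
                piece = para
            else:
                piece = f"{piece}\n\n{para}" if piece else para
        emit(piece)

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own breadcrumb
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks
```

Every chunk carries its breadcrumb and metadata, so retrieval can filter on owner or ACL and the generator can see where a passage sat in the document.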
Hybrid search beats pure vector search
Pure dense retrieval misses on rare terms, jargon, identifiers, and short queries. A query like "the SOC2 control C7.3 audit log requirement" is half names and acronyms: exactly what BM25 handles well and what a generic embedding model handles poorly.
Run both. Fuse the two rankings with Reciprocal Rank Fusion, which needs only each chunk's rank in each list, not comparable scores. Retrieve maybe 20 candidates from each, fuse, take the top 10. This is the single biggest quality win in most RAG systems and it costs almost nothing to implement.
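Reciprocal Rank Fusion is short enough to sketch in full. Each chunk scores the sum of 1 / (k + rank) over the lists it appears in. The function name and the `top_n` parameter below are illustrative, and `k=60` is the commonly used default constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of chunk ids into one ranked list.

    rankings: iterable of lists, each ordered best-first (e.g. BM25 and dense).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties on chunk id so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_n]]
```

A chunk that appears in both lists outranks one that appears high in only one, which is the behavior you want for queries that mix identifiers with natural language.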
Use a reranker
Top-k from a retriever is not the same as top-k by relevance. A cross-encoder reranker, a small model that scores query+chunk pairs together, typically lifts retrieval precision substantially over the bare retriever.
Use one. Open-source options like bge-reranker are good defaults; hosted options like Cohere's reranker work too. Rerank the top 20-30 candidates down to the 5 you actually pass to the model.
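The reranking step itself is a small amount of glue. The sketch below is deliberately model-agnostic: `score_fn` is a hypothetical callable that takes a query and a chunk and returns a relevance score. In a real system it would wrap a cross-encoder such as bge-reranker; the `toy_overlap_score` function here is only a stand-in so the example runs without a model, and is not a substitute for one.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    keep: int = 5,
) -> list[tuple[str, float]]:
    """Score every (query, chunk) pair and return the best `keep`, highest first."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


def toy_overlap_score(query: str, chunk: str) -> float:
    """Stand-in scorer (assumption): fraction of query terms present in the chunk."""
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.lower().split())
    return len(query_terms & chunk_terms) / len(query_terms) if query_terms else 0.0
```

Returning the scores alongside the chunks makes it easy to apply a minimum-relevance cutoff before generation and to log the scores for debugging.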
Cite, then ground
Every chunk you pass to the LLM should carry a citation handle (doc_id + chunk_id) so the model can quote it. Two reasons:
- Trust: users can verify.
- Debugging: when an answer is wrong, you can see which chunks led to it. Without citations, every regression becomes an "it just hallucinated" mystery.
Prompt the model to refuse when the retrieved context doesn't answer the question, for example: "If the answer is not in the provided context, say you don't know." In practice, this single instruction often reduces fabrication more than an embedding model upgrade does.
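As a rough sketch of how citation handles and the refusal instruction can fit into one prompt: the `build_grounded_prompt` function, the `doc_id#chunk_id` handle format, and the exact instruction wording are assumptions for illustration, not a prescribed template.

```python
def build_grounded_prompt(question, chunks):
    """Assemble a prompt where every chunk carries a citation handle.

    chunks: list of dicts with 'doc_id', 'chunk_id' and 'text' keys (assumed shape).
    """
    # Prefix each chunk with a handle like [policy-12#3] so the model can cite it.
    context = "\n\n".join(
        f"[{c['doc_id']}#{c['chunk_id']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using only the context below.\n"
        "Cite the handle in square brackets after every claim you make.\n"
        "If the answer is not in the provided context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

Because the handles are plain text, the same identifiers can be parsed out of the answer later to check which chunks the model actually relied on.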
Eval the retrieval, not just the answer
Most teams eval the final LLM answer and stop there. That confounds two failure modes: bad retrieval (the right chunk wasn't there) and bad generation (the right chunk was there and the model still got it wrong). They're fixed in different ways.
Build a small golden set: 50-200 hand-curated question/expected-chunk pairs, each with an expected answer. Measure:
- Retrieval recall@k: was the right chunk in the top k?
- Answer faithfulness: does the answer only state things present in the retrieved chunks?
- Answer correctness: does the answer match the expected one?
The golden set is annoying to build and worth twice what it costs. It's the only way to tell whether your changes are actually better, and the only way to catch a model upgrade that quietly regressed your specific use case.
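Of the three metrics, retrieval recall@k is the easiest to automate. A minimal sketch, assuming the golden set is a list of (question, expected chunk id) pairs and `retrieve` is whatever function returns ranked chunk ids in your system (both names are hypothetical):

```python
def recall_at_k(golden, retrieve, k=10):
    """Fraction of golden questions whose expected chunk appears in the top k.

    golden:   list of (question, expected_chunk_id) pairs.
    retrieve: callable mapping a question to a best-first list of chunk ids.
    """
    if not golden:
        return 0.0
    hits = 0
    for question, expected_chunk_id in golden:
        if expected_chunk_id in retrieve(question)[:k]:
            hits += 1
    return hits / len(golden)
```

Run it before and after every change to chunking, embedding, or fusion. A drop in this number tells you the problem is in retrieval before you look at a single generated answer.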
Anti-patterns to avoid
A short list of things that look like progress and aren't:
- Re-embedding the corpus with a bigger model without re-evaluating retrieval recall.
- "Increasing the chunk size" as a fix for missing context. Almost always a chunking-strategy problem.
- Putting the entire retrieved context into the prompt, regardless of relevance. The LLM will use whatever it sees.
- Skipping observability on retrieval. Log every query, its top-k candidates, their scores, and the final selected chunks. You will need this data the first time the system is wrong.
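For the observability point, one structured record per query is usually enough to start with. The sketch below shows one possible shape; the function name and the field names (`request_id`, `top_k`, `selected`) are assumptions, not a standard schema.

```python
import json
import time
import uuid


def retrieval_log_record(query, candidates, selected_ids):
    """Serialize one retrieval event as a JSON line.

    candidates:   list of (chunk_id, score) pairs in retrieved order.
    selected_ids: chunk ids actually passed to the model.
    """
    record = {
        "request_id": str(uuid.uuid4()),  # lets you join this with the answer log
        "ts": time.time(),
        "query": query,
        "top_k": [
            {"rank": rank, "chunk_id": chunk_id, "score": score}
            for rank, (chunk_id, score) in enumerate(candidates, start=1)
        ],
        "selected": list(selected_ids),
    }
    return json.dumps(record)
```

Writing these as JSON lines to whatever log sink you already have is enough to answer "what did retrieval return for this query?" after the fact.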
The shape that works
A RAG system that holds up has good ingestion, structure-aware chunking, hybrid retrieval, a reranker, citations, refusal-on-no-context, and a maintained eval set. None of it is novel. All of it is unglamorous data engineering. That's the point: RAG is plumbing. Treat it like plumbing and it will hold up.
If your RAG system stalled out somewhere between demo and "actually useful," we'd like to hear about it.