MindTorc
Back to Blog
AI

RAG Is Not a Feature. It Is a System.

A breakdown of the retrieval, ranking, and evaluation layers that separate production RAG from demo RAG.

MindTorc AI Team·AI EngineeringMar 5, 202614 min read
RAG Is Not a Feature. It Is a System.

Every week someone asks us to "add RAG" to their product. They expect it to work like adding a database: you hook it up, throw your documents in, and it answers questions correctly. It does not work that way. RAG is a system with three distinct layers, and if any layer is broken, the whole thing fails in ways that are genuinely hard to debug from the outside.

Layer 1: Retrieval

Retrieval gets the most attention and the least rigor. Most teams use cosine similarity on embeddings, pick a vector database, and call it done. This works reasonably well for small, clean corpora with well-formed queries. It breaks down when:

  • Documents are long and relevant information is buried in paragraph 47
  • Queries are ambiguous ("pricing" could mean three different things)
  • The corpus is large with many documents that are slightly similar to each other
  • Users ask questions the way they think, not the way documents are written

Before you choose an embedding model or vector database, build an evaluation set. Fifty real question-answer pairs from actual users if you have them, synthetic if you do not. Run every retrieval change against this set. This sounds like work but you will thank yourself the first time a seemingly minor embedding model change destroys your hit rate in ways you would not have caught until users complained.

Layer 2: Ranking and filtering

Retrieved chunks often include noise: documents that contain some of the query terms but do not actually address the question. A cross-encoder reranker as a second pass significantly improves what the model actually sees. The reranker scores each chunk against the full query rather than just comparing embedding vectors, which catches relevance that semantic similarity misses.

Beyond relevance, you need filtering. Does the user have permission to see this content? Is the chunk from a document that is still current and valid? Building metadata indexing alongside your vector index from the start is much cheaper than retrofitting it after your corpus has grown to thousands of documents.

Layer 3: Generation

Given genuinely good context, modern LLMs generate reasonable answers most of the time. The system fails when the context does not actually contain the answer and the model hallucinates one anyway, when the context is too long and relevant information gets lost in the middle, or when the query is ambiguous and the model silently picks the wrong interpretation.

For the hallucination problem, the most reliable fix is explicit "I do not know" instructions in your system prompt combined with confidence calibration. Prompt the model to acknowledge uncertainty and cite sources rather than generate plausible-sounding answers from memory.

For the context length problem, do not just stuff chunks in document order. Order them by relevance score, use compression where the content allows it, and test systematically at different context lengths with your specific models. The sweet spot varies significantly by model.

The evaluation problem everyone underestimates

The hardest part of RAG in production is not the initial build. It is knowing when the system is actually working. You need three distinct metrics:

  • Hit rate: did retrieval return the right chunks at all?
  • Faithfulness: is the generated answer actually grounded in the retrieved context?
  • Correctness: is the answer right from the user's perspective?

Tools like Ragas and LangSmith help, but automated metrics are not substitutes for human evaluation on your specific domain. Budget for a monthly review of 100 production queries by someone who actually understands the content. This is the step that catches systematic failures that automated scoring misses, which in our experience is most of the failures that matter.

If you are planning to ship a RAG feature, budget for a proper evaluation harness before writing any retrieval code, two to three months of iteration on the retrieval and ranking layers, and ongoing human evaluation after launch. The teams that succeed treat RAG as a product discipline, not a configuration exercise.