
The Six-Layer Reference Architecture
An enterprise RAG system combines a large language model with a retrieval layer that pulls from authoritative internal sources before generating any answer. That description sounds simple. The architecture required to make it work at enterprise scale is not.
POCs that look 90% accurate on curated demo sets drop to 30-50% correct on real user traffic when any architectural layer is missing. A production system requires six distinct engineered layers: ingestion, indexing, query understanding, hybrid retrieval, generation with guardrails, and observability. Kapa.ai's published architecture for deployments at Docker, CircleCI, Reddit, and Monday.com makes this concrete. Their stack explicitly separates ingestion with delta processing, hybrid retrieval with metadata filtering, prompting with citation enforcement, and continuous evaluation. These layers are what allow a RAG system to survive contact with the messy, permissioned, multi-source corpora that exist inside real companies.
The failure pattern is consistent. Teams build a POC, retrieve from a curated document set, demo well, and ship to production. Real user traffic is different: ambiguous queries, documents with conflicting versions, permission boundaries, edge-case terminology. Accuracy collapses. The POC never had the layers to handle those cases, and nothing in the system logs which layer failed.
Missing any single layer is the strongest predictor of POC-to-production failure.
Industry best practices have converged on a recognizable stack: ingestion and preprocessing; vector indexing via stores like Pinecone, Weaviate, or Vespa; retrieval orchestration through frameworks like LangChain, LlamaIndex, or Haystack; generation; and evaluation with monitoring. The Manifold AI engineering talk, which walks through production RAG failures across multiple enterprise deployments, lays out the sequence directly: query understanding feeds hybrid retrieval, which feeds reranking, which feeds context filtering with fallback, which feeds an evaluation loop, which feeds monitoring. Teams that compress or skip steps in that chain get answers that look plausible but are wrong, with no instrumentation to trace why.
The most common failure across those deployments: teams treat the LLM as a single opaque endpoint. No observability, no fallback, no evaluation harness. When the system degrades, there is no signal that it has.
Why Frameworks Only Solve Part of the Problem
LangChain, LlamaIndex, Haystack, and RAGFlow are real engineering tools. Each abstracts meaningful complexity: retrieval orchestration, chain management, document loaders, embedding integrations. The error is mistaking framework adoption for architecture.
A framework handles the connective tissue between layers. It does not define your chunking strategy, enforce citation behavior at generation time, or instrument the retrieval pipeline for precision metrics. Those decisions belong to the engineering team and must be made explicitly. No framework makes them for you.
There is a fair pushback here: modular RAG with chunking and vector search is enough for most enterprise use cases, and adding layers prematurely creates complexity that slows shipping. There is real force to this. Starting simple is correct. Kapa.ai's own guidance recommends a modular approach, and teams should not add reranking or query rewriting before they have signal that those investments are needed.
But there is a difference between implementing each layer simply and skipping layers entirely. Even a v1 system should have observability and an eval harness. Those two components generate the signal that tells you what to build next. Ship without them and you are flying blind on whether retrieval quality is acceptable, whether hallucination rates are within tolerance, and whether a data source change just broke something upstream. The eval harness is cheap to add early. It is expensive to retrofit after a production incident.
The six layers are: ingestion, indexing, query understanding, hybrid retrieval, generation with guardrails, and observability. A v1 implementation of each can be minimal. The layers themselves cannot be absent. That is how you build an enterprise RAG system without painting yourself into a corner.
Ingestion, Chunking, and Delta Indexing
The six layers establish the skeleton. The next sections explain the engineering decisions inside the layers where most production systems break, starting with the one closest to the source data.
The ingestion layer determines the retrieval ceiling. The model cannot retrieve what was never ingested correctly, and it cannot reason over chunks that were split in ways that destroyed meaning. Source curation, semantic chunking, metadata enrichment, and delta indexing set the upper bound on what any retrieval or generation layer can produce. Choosing a better model does not raise that ceiling.
Kapa.ai's work across 100+ technical teams produced a clear data strategy: start with primary authoritative content, specifically technical documentation, API references, and verified support solutions. Expand to secondary sources only with strict filters on recency and authority. The reasoning is straightforward. A RAG system grounded in authoritative sources with version metadata returns answers that can be traced and audited. One that ingests everything available, including stale forum posts, deprecated documentation, and unverified wikis, produces answers that are harder to attribute and harder to correct when wrong.
That same data strategy treats the knowledge base as deployed software. Incremental refresh using Git-diff style delta processing replaces affected portions of the corpus on change, rather than reindexing from scratch. Full reindexes are expensive, slow, and create downtime windows where the index is either stale or unavailable. Delta processing keeps the index current at the document level without those costs.
Naive fixed-size chunking reduces top-k retrieval accuracy by 20 to 30 percentage points compared to semantic, boundary-aware chunking. That is a large number, and it has a concrete mechanism. Fixed-size chunks split tables mid-row, divide procedure steps across boundaries, and break argument structures in ways that leave each fragment without sufficient context to be retrieved correctly. Picture a user querying for a numeric value in a pricing table. The chunk boundary falls inside the table. The retrieved chunk contains the right column header but the wrong row. The model returns a wrong number confidently, and nothing in the response signals that the chunk was malformed.
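The pricing-table failure above can be reproduced in a few lines. This is a minimal sketch, assuming blank lines mark block boundaries in the source text. The function names and the sample document are illustrative and not taken from any particular library.

```python
import re

def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def boundary_aware_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole blocks (separated by blank lines) into chunks of up to max_chars.

    A single block longer than max_chars is kept whole rather than split.
    """
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the chunk at a block boundary
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Made-up sample document containing a small table.
table = "Plan | Seats | Price\nStarter | 5 | 49\nTeam | 25 | 199"
doc = (
    "Pricing overview for the hypothetical Acme plans.\n\n"
    + table
    + "\n\nContact sales for larger deployments."
)

naive = fixed_size_chunks(doc, 60)
aware = boundary_aware_chunks(doc, 100)
table_survives_naive = any(table in c for c in naive)
table_survives_aware = any(table in c for c in aware)
```

With these inputs the fixed-size splitter cuts through the table, while the boundary-aware splitter keeps the header and every row in one chunk. Real documents need a richer notion of "block" (headings, list items, table markup), but the packing logic stays the same.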
Google's published RAG best practices for Vertex AI Search show what the alternative looks like. Their retrieval pipeline tests chunk sizes of 400, 600, and 1,200 characters against the same query set, then measures retrieval performance at each size before choosing. Every chunk carries version metadata, source identifiers, and timestamps. Chunking becomes a measured engineering decision with a feedback loop, rather than a default setting that ships unchanged from the tutorial.
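A chunk-size sweep of that kind can be sketched as a small harness. This is an illustrative setup, not Google's pipeline: the corpus is synthetic, the scorer is a toy token-overlap function standing in for a real retriever, and names such as `hit_rate_at_k` are hypothetical.

```python
def fixed_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: number of query tokens that appear in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def hit_rate_at_k(text: str, size: int, labelled_queries, k: int = 3) -> float:
    """Fraction of queries whose expected snippet appears in the top-k chunks."""
    chunks = fixed_chunks(text, size)
    hits = 0
    for query, expected_snippet in labelled_queries:
        top = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)[:k]
        hits += any(expected_snippet in c for c in top)
    return hits / len(labelled_queries)

# Synthetic corpus and labelled queries, for illustration only.
corpus = " ".join(
    f"Setting {i} controls feature {i} and defaults to value {i * 7}."
    for i in range(200)
)
labelled = [
    ("what does setting 12 control", "Setting 12 controls"),
    ("what does setting 150 control", "Setting 150 controls"),
]

# Same query set, three candidate chunk sizes, one metric per size.
results = {size: hit_rate_at_k(corpus, size, labelled) for size in (400, 600, 1200)}
```

The useful part is the shape of the loop: hold the query set fixed, vary one parameter, and record a retrieval metric for each setting before picking a default.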
The metadata attached at chunk creation is what makes the system filterable downstream. Without source, version, and timestamp attached to each chunk, the retrieval layer cannot enforce access controls, cannot filter by product version, and cannot prefer recent content over stale content. Those filters require metadata present in the index from the moment the chunk was created.
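As a minimal sketch of what "metadata present from the moment the chunk was created" might look like, the record below carries source, version, and timestamp, plus an `allowed_groups` field that is an assumption added here to illustrate access control. The sample records are made up.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Chunk:
    text: str
    source: str                 # where the chunk came from
    version: str                # product or document version
    updated: date               # timestamp of the source revision
    allowed_groups: frozenset   # hypothetical ACL field, not named in the text above

def filter_chunks(chunks, user_groups, version, not_before):
    """Keep chunks the user may read, for the requested version, that are recent enough."""
    return [
        c for c in chunks
        if c.allowed_groups & user_groups   # at least one shared group
        and c.version == version
        and c.updated >= not_before
    ]

index = [
    Chunk("Rate limit is 100 req/s.", "api-ref", "v2", date(2025, 3, 1), frozenset({"eng"})),
    Chunk("Rate limit is 50 req/s.", "api-ref", "v1", date(2023, 6, 1), frozenset({"eng"})),
    Chunk("Salary bands for 2025.", "hr-wiki", "v2", date(2025, 1, 15), frozenset({"hr"})),
]

visible = filter_chunks(index, frozenset({"eng"}), "v2", date(2024, 1, 1))
```

In a production store the same predicate would be pushed down into the index query rather than applied in application code, which is why the schema has to match what the store can filter on.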
Embeddings, Vector Store Choice, and Delta Indexing in Practice
Embedding model selection matters, but vector store architecture matters more at scale. The choice of Pinecone, Weaviate, or Vespa determines what filtering, metadata handling, and hybrid retrieval capabilities are available to the retrieval layer above. Evaluate those vector store options before committing to an ingestion schema, because the chunk metadata structure must match what the store can index and filter against.
Delta indexing in practice means treating every document as a versioned artifact. When a source document changes, the system identifies the delta, reprocesses affected chunks, updates the index for those specific chunks, and leaves unchanged content alone. The index stays current. The compute cost stays proportional to what actually changed.
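One common way to identify the delta is content hashing: give every chunk an ID derived from its text, then compare the IDs already in the index against the IDs of the new version. The sketch below assumes that approach. `chunk_id` and `plan_delta` are hypothetical helper names, and the actual upsert and delete calls depend on the vector store in use.

```python
import hashlib

def chunk_id(doc_id: str, text: str) -> str:
    """Stable ID: document ID plus a short hash of the chunk content."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{doc_id}:{digest}"

def plan_delta(doc_id: str, indexed_ids: set[str], new_chunks: list[str]):
    """Return (chunks to embed and upsert, IDs to delete) for one changed document."""
    new_by_id = {chunk_id(doc_id, t): t for t in new_chunks}
    to_upsert = {cid: t for cid, t in new_by_id.items() if cid not in indexed_ids}
    to_delete = indexed_ids - set(new_by_id)
    return to_upsert, to_delete

# Illustrative document versions: only the middle chunk changes.
v1 = ["Install the CLI.", "Run `init`.", "Deploy with `push`."]
v2 = ["Install the CLI.", "Run `init --force`.", "Deploy with `push`."]

indexed = {chunk_id("guide", t) for t in v1}
to_upsert, to_delete = plan_delta("guide", indexed, v2)
```

Only one chunk is re-embedded and one stale chunk is removed. The two unchanged chunks are never touched, which is what keeps compute proportional to the size of the change.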
A familiar pushback from teams shipping on newer models: modern long-context LLMs can process full documents in a single pass, so chunking strategy is increasingly irrelevant. Just retrieve full documents and let the model sort out what is relevant.
Long context is a real capability. It does not solve the cost or relevance problem at enterprise scale. EnterpriseRAG-Bench uses a corpus of over 500,000 documents. At that scale, you must retrieve a manageable subset before generation. The cost of passing entire documents through the context window on every query is prohibitive, and retrieval quality without chunk-level filtering is lower, not higher, because there is no mechanism to enforce version, ACL, or recency constraints at document retrieval time. Chunk-level metadata remains the only practical path to those filters. Long context and good chunking operate at different stages of the same pipeline.
Query Understanding, Hybrid Retrieval, and Reranking
Generation can only be as good as the context it receives. Before asking what happens at generation time, and what stops the model from inventing answers around that context, there is a prior failure point that kills accuracy quietly: the retrieval layer itself.
Pure vector search is insufficient for enterprise workloads. Production retrieval requires an explicit query understanding layer, hybrid retrieval combining BM25 with dense embeddings, and a reranking stage. The industry is converging on this fast. VentureBeat-cited data shows enterprise intent to adopt hybrid retrieval tripled from 10.3% to 33.3% in a single quarter of 2025, driven by accuracy and reliability failures at scale. That tripling is a market correction.
The mechanism behind the failure is specific. Dense embeddings are weak for rare identifiers: part numbers, SKUs, code symbols, policy clause references, and exact phrase matches. A user querying "Policy 9.3.2 subsection (d)" gets back documents on similar policy topics, not the exact clause. The model then reasons from the wrong source and returns incorrect legal guidance. The retrieval step looked successful because it returned something relevant. It was not successful, because it did not return the right thing.
BM25 catches what dense embeddings miss. Keyword-based retrieval handles exact-match lookups precisely. Fusing the BM25 results with the vector similarity results at query time, using reciprocal rank fusion over the two rankings or a learned weight over the two scores, gives the system coverage across both semantic and lexical query types. Neither approach alone covers the full range of enterprise query patterns.
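Reciprocal rank fusion is simple enough to show in full. It works on rank positions rather than raw scores, so BM25 and cosine similarity never need to be normalized against each other. The sketch below uses the commonly cited constant k = 60. The document IDs and the two input rankings are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for the query "Policy 9.3.2 subsection (d)".
bm25_ranking = ["policy-9.3.2-d", "policy-9.3.1", "faq-12"]
dense_ranking = ["policy-overview", "policy-9.3.2-d", "faq-12"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

The exact clause ranks first after fusion because it sits near the top of both lists, even though the dense retriever alone preferred the general overview document.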
Query Rewriting and the Case for a Dedicated Query Understanding Layer
Hybrid retrieval solves part of the problem. Query understanding solves a different part.
Users do not write retrieval-optimized queries. They ask conversational questions, use pronouns that reference earlier turns in a conversation, omit context they consider obvious, and use internal terminology that does not match document language. A query that reaches the retrieval layer unchanged carries all of that ambiguity into the search.
A dedicated query understanding layer addresses this before retrieval runs. Query rewriting expands ambiguous terms, resolves coreference across conversation turns, and reformulates natural language questions into forms that match the document vocabulary in the index. Hypothetical document embedding, where the system generates a hypothetical answer and uses it as the retrieval query, is a separate technique that improves recall for question-answering workloads by anchoring the search in the answer space rather than the question space.
The results of adding this layer are measurable at the framework level. Atlan's 2026 benchmarks place LlamaIndex retrieval accuracy at approximately 92% versus LangChain at approximately 85%. Retrieval framework choice and the sophistication of the query processing pipeline both contribute to that gap. Teams that treat query processing as a pass-through step are leaving accuracy on the table.
Vectara and Contextual AI have drawn the same conclusion at the product level. Both ship hybrid retrieval and reranking as default behavior, not optional configuration flags. Vectara's HHEM hallucination evaluation model sits alongside its hybrid retrieval pipeline as a production component, not an add-on. Contextual AI treats cross-encoder reranking as a baseline requirement. When two managed retrieval stacks independently converge on the same architecture as their default, that is the production consensus speaking.
Reranking deserves its own emphasis. Initial retrieval optimizes for recall by definition: return everything likely relevant, then sort. A reranking stage, typically a cross-encoder model that scores query-document pairs jointly rather than independently, re-orders the retrieved set by true relevance before passing it to generation. The top-k chunks the model receives are the most relevant available, not just the nearest neighbors in embedding space. Skipping reranking is common. The accuracy cost shows up in generation.
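Structurally, a reranking stage is a second scoring pass over the retrieved candidates. The sketch below keeps the scorer pluggable. `toy_overlap_score` is a deliberately crude stand-in so the example runs without a model, and in practice the `score_pair` argument would be a cross-encoder that reads the query and passage together. All names here are illustrative.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_k: int,
) -> list[str]:
    """Score each (query, passage) pair jointly and keep the top_k passages.

    Ties keep the original retrieval order.
    """
    scored = [(score_pair(query, c), i, c) for i, c in enumerate(candidates)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [c for _, _, c in scored[:top_k]]

def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: share of query tokens found in the passage. Not a cross-encoder."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

retrieved = [
    "Rotating API keys is covered in the security guide.",
    "To reset a password open account settings.",
    "Password policy requires twelve characters.",
]
top = rerank("how to reset a password", retrieved, toy_overlap_score, top_k=2)
```

The first-stage order put the API-key passage first. After the second pass it is dropped, and the password-reset passage leads the context handed to generation.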
A second objection deserves a direct answer: adding a query understanding layer and reranking introduces latency, and most users will not tolerate an extra 500 milliseconds. This is the wrong frame.
The real latency problem runs the other direction. A vector-only system that returns a wrong answer fast sends users back to rephrase, clarify, or escalate to a human. Multiply that loop across a large user base and total time-to-answer is higher with the fast system than with the accurate one. Latency budgets should be measured end-to-end with correctness factored in, not per-stage in isolation. Weaviate and Vespa both demonstrate sub-100ms hybrid query latency at scale when tuned properly. The performance cost of doing retrieval correctly is lower than it appears when benchmarked against a vector-only baseline, and the accuracy gain is not marginal.
Vespa is also the only open-source platform identified in Atlan's benchmarks that combines retrieval and ML ranking at billions-of-documents scale without requiring a separate serving layer. For teams operating at that scale, the infrastructure simplification is itself an argument for the architecture.
The retrieval layer is where framework selection decisions have the largest downstream effect on accuracy, which is exactly the choice the next section breaks down. Choose the retrieval architecture before choosing the framework, not after.
Choosing Your RAG Stack: Platforms, Frameworks, and Specialists
The single most common mistake teams make when building an enterprise RAG system is shortlisting tools before choosing a direction. Settle on one of three paths first: a managed platform, an open framework, or a custom build with a specialist partner. The tool comparison only becomes useful after that decision.
Here is how the main options stack up.
LangChain. Best for teams that want maximum flexibility and a large ecosystem of integrations. The trade-off is real operational overhead: you own the glue code, the versioning, and the debugging.
LlamaIndex. Best for teams whose primary use case is document-heavy retrieval over structured and unstructured data. Its abstractions can obscure pipeline behavior, which makes production debugging harder than it looks in demos.
Haystack. Best for engineering teams that want a modular, pipeline-first architecture with strong support for hybrid search. The learning curve is steeper than LangChain for teams new to pipeline composition.
Vectara. Best for enterprises that want a fully managed RAG API with grounded generation and built-in hallucination controls. The managed model limits customization of the retrieval layer.
Ragie. Best for product teams that want a simple API to ship retrieval features fast without standing up infrastructure. Control over chunking strategy and re-ranking behavior is limited.
Glean. Best for enterprises that need workplace search across SaaS tools with permissions-aware retrieval out of the box. Glean is a closed platform, so it does not extend cleanly into custom generative workflows outside its scope.
Contextual AI. Best for regulated industries that need RAG systems with strong provenance tracking and audit trails. It sits at the enterprise tier, with pricing and onboarding complexity to match.
RAGFlow. Best for teams that want open-source document parsing and end-to-end RAG pipeline control. Production hardening is entirely your responsibility.
Meilisearch. Best for teams that need fast, typo-tolerant search as the retrieval backbone for a RAG pipeline. Vector search support is newer and less battle-tested than its lexical search core.
DSPy. Best for research-oriented teams that want to optimize prompt pipelines programmatically rather than by hand. DSPy requires ML fluency to use well, and it is a poor fit for teams looking to ship quickly.
The vector database layer that sits underneath any of these tools is compared in a separate guide. That choice is meaningfully different from the orchestration decision above.
Pick your direction before you pick your tools. Managed platforms reduce operational surface area but constrain customization. Open frameworks give you control but transfer the integration burden to your team. On engagements we have run, a custom build with a specialist partner makes sense when retrieval requirements are domain-specific enough that off-the-shelf pipelines will underperform, and you need someone accountable for the full system in production.
Generation Guardrails, Evaluation, and Observability
Retrieval quality sets the ceiling. The generation and observability layers determine whether the system stays at that ceiling or drifts downward unnoticed.
The generation layer must enforce retrieval-only answers with source citations and explicit "don't know" behavior. The observability and evaluation layer is the single clearest differentiator between RAG systems that improve over time and those that silently degrade. These layers separate an auditable, correctable system from one that fails in ways nobody can trace.
Synvestable reports that 70% of RAG systems still lack evaluation frameworks. That means most deployed enterprise RAG cannot detect quality regressions. The model itself is not a safety net: ChatGPT hallucinates 28.6% of the time on benchmarked questions when ungrounded. A RAG system without an eval harness has no mechanism to detect when retrieval quality drops, when a data source change introduces contradictory content, or when prompt changes shift the hallucination rate upward. The system keeps serving answers. Nobody knows whether those answers are getting worse.
On production engagements where we enforced retrieval-only prompting, citation requirements, and continuous evaluation, we cut hallucination rates from roughly 20% on a base model to under 2%, measured on groundedness and faithfulness. That reduction comes from the architecture around the model.
Guardrails, Citation Enforcement, and the Cost of Not Measuring
Guardrails at the generation layer have three jobs: restrict the model to retrieved context, require a source citation on every factual claim, and produce an explicit "I don't know" response when retrieved context is insufficient to answer the question. The third job is the one most systems skip.
A model that always attempts an answer will confabulate when retrieved context is weak. Explicit "don't know" behavior must be engineered through a deliberate prompt instruction and a retrieval confidence threshold below which the system declines to answer rather than guessing. It must then be measured to confirm it holds across query types.
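The two mechanisms named above, a confidence threshold and a prompt instruction, can be combined in one gate in front of the model. This is a sketch under stated assumptions: retrieval scores are taken to be normalized to [0, 1], the 0.5 threshold is a placeholder that would need tuning against real traffic, and the prompt wording is illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Retrieved:
    chunk_id: str
    text: str
    score: float  # assumed normalized to [0, 1], higher is more relevant

DONT_KNOW = "I don't know based on the available sources."

def build_prompt_or_abstain(
    question: str,
    retrieved: list[Retrieved],
    min_score: float = 0.5,   # placeholder threshold, tune on real data
    min_chunks: int = 1,
) -> Optional[str]:
    """Return a grounded prompt, or None when retrieval is too weak to answer."""
    usable = [r for r in retrieved if r.score >= min_score]
    if len(usable) < min_chunks:
        return None  # caller responds with DONT_KNOW without calling the model
    context = "\n".join(f"[{r.chunk_id}] {r.text}" for r in usable)
    return (
        "Answer using only the sources below. Cite the source id in brackets "
        "after every factual claim. If the sources do not contain the answer, "
        f"reply exactly: {DONT_KNOW}\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

weak = [Retrieved("doc-9", "Unrelated onboarding notes.", 0.12)]
strong = [Retrieved("doc-1", "The API rate limit is 100 requests per second.", 0.91)]

abstained = build_prompt_or_abstain("What is the rate limit?", weak)
prompt = build_prompt_or_abstain("What is the rate limit?", strong)
```

The threshold handles the case where retrieval found nothing useful. The prompt instruction handles the case where retrieval found something plausible that still does not answer the question. Both paths produce the same fixed string, which makes abstention easy to count in evaluation.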
Measurement operates across three distinct layers. Retrieval quality requires precision@k, recall@k, MRR, and nDCG. Generation quality requires faithfulness scores, answer relevance, citation coverage, and hallucination rate. End-to-end quality requires correctness, factuality, latency, cost, and policy violation rates. Both classical metrics and LLM-as-a-judge approaches apply at the generation and end-to-end layers. Running only one category of measurement misses failure modes that only appear at another layer.
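The retrieval metrics are short enough to implement directly, which helps when checking what a library reports. The sketch below uses binary relevance (a document is either relevant or not). The ranking and relevance labels are made up.

```python
import math

def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR: reciprocal rank averaged over a set of queries."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(ranked[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
rr = reciprocal_rank(ranked, relevant)
n3 = ndcg_at_k(ranked, relevant, 3)
```

For this ranking, one of the top three results is relevant, one of the two relevant documents is found, and the first relevant hit sits at rank 2. nDCG penalizes that hit for not being at rank 1 and for the second relevant document falling outside the cutoff.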
On a Meta engagement, we built a RAG system parsing unstructured vendor data across roughly 3.5 million records. The system delivered a 40% improvement in search precision. The internal tooling behind that engagement, Azumo RAG Primitive, is a no-code pipeline builder with modular blocks for parsing, chunking, embedding, and enrichment. We built it for the eval and observability layer: when we audit client systems, that layer is consistently where instrumentation is missing entirely. Teams have retrieval. They have generation. They have no signal on whether either is working.
One more objection, this time about budget: building a full evaluation harness with golden datasets and continuous monitoring is expensive overhead for a system serving a few hundred internal users. The budget and the user base do not justify it.
Counterintuitively, smaller user bases make evaluation more important, not less. A system with millions of queries per day generates enough organic feedback signal to surface regressions through user behavior. A system with a few hundred internal users does not. The only way to ship updates safely without that organic signal is a frozen golden dataset run on every change before deployment. Tools like Ragas, ARES, and LangSmith make the marginal cost of a baseline eval harness low. Not having one is what makes RAG failure points invisible, regardless of user count.
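A frozen golden dataset run can start as a very small regression gate. The sketch below is an assumption-heavy baseline: it checks answers with plain substring matching, which is far cruder than what tools like Ragas compute, and the questions, answers, and function names are all illustrative.

```python
from typing import Callable

def golden_pass_rate(golden: list[dict], answer_fn: Callable[[str], str]) -> float:
    """Share of golden cases whose answer contains every required fact."""
    passed = 0
    for case in golden:
        answer = answer_fn(case["question"]).lower()
        if all(fact.lower() in answer for fact in case["must_contain"]):
            passed += 1
    return passed / len(golden)

def safe_to_deploy(candidate_rate: float, baseline_rate: float, tolerance: float = 0.0) -> bool:
    """Block the release if the candidate scores below the baseline minus tolerance."""
    return candidate_rate >= baseline_rate - tolerance

# Frozen golden set: never edited alongside the change being tested.
GOLDEN = [
    {"question": "What is the API rate limit?", "must_contain": ["100 requests"]},
    {"question": "Which plan includes SSO?", "must_contain": ["enterprise"]},
]

def baseline_system(question: str) -> str:
    """Stand-in for the currently deployed pipeline."""
    return {
        "What is the API rate limit?": "The limit is 100 requests per second [api-ref].",
        "Which plan includes SSO?": "SSO is included in the Enterprise plan [pricing].",
    }[question]

def candidate_system(question: str) -> str:
    """Stand-in for the pipeline after a data source change broke one answer."""
    return {
        "What is the API rate limit?": "The limit is 50 requests per second [api-ref-v1].",
        "Which plan includes SSO?": "SSO is included in the Enterprise plan [pricing].",
    }[question]

baseline_rate = golden_pass_rate(GOLDEN, baseline_system)
candidate_rate = golden_pass_rate(GOLDEN, candidate_system)
release_ok = safe_to_deploy(candidate_rate, baseline_rate)
```

Here the candidate fails one of two cases, so the gate blocks the release before any of the few hundred users sees the wrong rate limit. Swapping the substring check for a faithfulness or correctness scorer changes the metric, not the shape of the gate.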
For any team past POC, the real question is whether they can afford the invisible failures that accumulate without continuous evaluation.
The Architecture Is the Product
Teams that treat RAG as "pick a vector DB and call an LLM" will keep shipping demos that fail at production scale. The six layers are the system.
The most common pattern we see in production RAG audits: strong retrieval work, weak or absent observability, no eval harness. The system shipped, it looked good in demo conditions, and then it slowly degraded with no instrumentation to detect it.
The next concrete step for any team past POC is specific: instrument every layer with eval hooks before adding a single new data source. Adding sources before measurement means compounding unknowns. A new data source changes retrieval behavior. Without a baseline, there is no way to know whether the change helped or hurt. The eval harness comes first. Pick the layer with the least instrumentation today, and put a metric on it this week. For teams ready to move from architecture design to production build, production RAG development is where that work starts.


