Enterprise RAG Service Companies for Production

Galileo's "Seven Failure Points" research studied production enterprise RAG deployments and found that retrieval-focused failures, not LLM quality, accounted for 10 to 30 percentage points of answer-quality degradation. Most teams responded by tuning prompts. The retrieval layer stayed broken.

That finding matters for every enterprise RAG vendor ranking you read, because almost none of them score retrieval quality. They score features, connectors, pricing tiers, and time-to-value. Those criteria predict demo performance. They predict almost nothing about whether a system answers correctly under production traffic.

The production gap is determined by retrieval engineering, evaluation rigor, and data hygiene. Those three disciplines do not appear in feature matrices, and rankings built on feature matrices mislead buyers into making the wrong cut.

The market's most-cited comparison makes this explicit. Atlan's 2026 Enterprise RAG Platforms Comparison ranks Vectara, Ragie, Pinecone, Weaviate, LangChain, LlamaIndex, and the cloud-native services on features, pricing, and operational model. The comparison is genuinely useful as a market map. But its own framing evaluates time-to-value and pipeline control, not whether a deployed system answers correctly under production traffic. A buyer using Atlan's list as a selection framework has no basis for predicting which vendor closes the production gap on their specific corpus. The list does its job. The problem is that buyers treat it as a decision, not a starting filter.

The failure modes that kill production RAG systems are invisible in a vendor demo. Binariks states directly: "Retrieval is the backbone of RAG, and it is where enterprise RAG most often fails." Galileo's seven failure points framework names the dominant modes: missing content, irrelevant context, and incorrect ranking. None of those surface during a polished demo over a curated corpus. They surface at 2 AM when a support engineer is staring at a ticket where the system confidently fabricated a troubleshooting procedure.

That fabrication scenario has a documented root cause. Audits of customer support RAG deployments found that 30% of the highest-volume error codes had no troubleshooting content in the corpus at all. The LLM had no choice but to fill the gap. No feature matrix asks a vendor how they audit corpus coverage before go-live, because corpus coverage is a data hygiene question, not a platform question.
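A coverage audit of that kind can be sketched in a few lines. The version below is a minimal illustration, not a documented audit method: the function name, the ticket counts, and the corpus are hypothetical, and plain substring matching is a deliberately crude stand-in for whatever matching a real audit would use.

```python
from typing import Dict, List


def uncovered_codes(ticket_counts: Dict[str, int], corpus: List[str], top_n: int) -> List[str]:
    """Return the top-N highest-volume error codes that no document mentions.

    Assumption: a code counts as "covered" if it appears verbatim in any
    document. This is crude; a short code can match inside a longer one.
    """
    top = sorted(ticket_counts, key=ticket_counts.get, reverse=True)[:top_n]
    return [code for code in top if not any(code in doc for doc in corpus)]


# Hypothetical support data: error code -> ticket volume.
ticket_counts = {"E-1042": 900, "E-2001": 750, "E-0007": 400, "E-3310": 20}
# Hypothetical knowledge-base documents.
corpus = [
    "Troubleshooting E-1042: power-cycle the unit and retry.",
    "E-3310 indicates a stale license file.",
]

gaps = uncovered_codes(ticket_counts, corpus, top_n=3)
coverage_gap = len(gaps) / 3  # share of top codes with no content at all
```

Running a check like this before go-live turns corpus coverage into a number a vendor can be asked about, rather than a failure discovered in a ticket queue.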

Snorkel puts the retrieval problem in sharper terms: "Some of the biggest error modes are in the retrieval step," where off-the-shelf embeddings and rankers cannot parse domain documents. A generalist embedding model trained on web text cannot tell the difference between two technical error codes that share surface vocabulary but describe entirely different failure states. The ranking column in a feature matrix records whether a vendor supports hybrid search. It does not record whether their default ranker has ever been tested on your document type.

The objection worth taking seriously: feature checks still matter. A vendor without SOC 2, without ACL support, or without connectors for your primary data sources is not a candidate regardless of their retrieval engineering quality. Feature parity is a real filter.

But the filter has been miscalibrated as the decision. Feature parity is the floor. Every vendor on a credible enterprise shortlist clears it. The question that separates systems that reach production from systems that stall in demo purgatory is what a vendor does after the connectors are wired and the embeddings are generated. Do they define recall@k targets before the first chunking decision? Do they bring a retrieval evaluation dataset, or do they bring a slide deck? Do they have a plan for permissions-aware retrieval, or do they treat ACLs as a checkbox?

Those questions are answerable before a contract is signed. The vendors who answer them well are not always the ones with the longest feature lists.

This article ranks nine enterprise RAG service companies for production deployment on the axis that actually predicts whether the system ships: retrieval engineering discipline, evaluation rigor, and data hygiene practice. For a deeper look at what that build process requires end to end, see our guide on how to build an enterprise RAG system. The tier structure that follows separates managed platforms, cloud-native services, vertical specialists, and implementation partners, and scores each category honestly on where the production gap closes and where it does not.

Feature matrices are how vendors sell. Production outcomes are how buyers should decide. Those two rankings are rarely the same, and this article ranks on the second.

Five Criteria That Predict Production Readiness

If platform features are the wrong axis, the next question is what the right axis looks like, and which observable signals separate vendors who clear the production bar from those who do not.

The five capabilities that correlate with RAG systems reaching and surviving in production are retrieval evaluation harnesses, embedding fine-tuning practice, permissions-aware retrieval, observability with human feedback loops, and structure-aware data preparation. A vendor's competence in each is observable before a contract is signed.

Most buyers never look.

Harvey's approach to selecting a vector database makes this concrete. Before standardizing on LanceDB Enterprise in production, Harvey evaluated multiple vector databases across scalability, ANN accuracy, filtering flexibility, ingestion throughput, and enterprise security. They did not pick a vendor from a feature matrix. They ran a structured evaluation against criteria they defined in advance, scored each candidate against those criteria, and selected the one that passed. That discipline, applied to a single infrastructure component, is the same discipline that predicts whether a vendor's overall RAG system makes it to production. Most enterprise buyers apply no equivalent discipline to the full stack.

recall@k, MRR, and the eval dataset test

The sharpest single predictor of production readiness is whether a vendor shows up with a retrieval evaluation harness. Leading teams use Ragas for offline RAG evaluation, measuring factual accuracy, answer relevance, and retrieval-question match against a ground-truth dataset. On the infrastructure side, Vertex AI's generative evaluation service extends this with test set creation, golden answer sets, automated metrics, and experiment tracking through Vertex AI Experiments. The pattern across every team that ships RAG to production is the same: the eval harness comes before the first chunking decision, not after the demo.

Ask a vendor to show you their retrieval evaluation dataset. Ask what recall@k and MRR targets they set before build begins. If they pivot to a demo, they have answered the question.
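For readers who have not computed these metrics before, here is a minimal sketch of recall@k and MRR over a ground-truth set. The query strings, chunk IDs, and function names are hypothetical; a real harness would load a labeled dataset built from the actual corpus.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int) -> Dict[str, float]:
    """Average recall@k and MRR over every query in the ground-truth set."""
    if not gold:
        return {"recall@k": 0.0, "mrr": 0.0}
    recalls = [recall_at_k(runs.get(q, []), rel, k) for q, rel in gold.items()]
    rrs = [reciprocal_rank(runs.get(q, []), rel) for q, rel in gold.items()]
    return {"recall@k": sum(recalls) / len(gold), "mrr": sum(rrs) / len(gold)}


# Hypothetical ground truth: query -> chunk IDs a human judged relevant.
gold = {
    "reset error E-1042": {"doc7#2", "doc7#3"},
    "rotate api key": {"doc12#1"},
}
# Hypothetical retriever output: query -> ranked chunk IDs.
runs = {
    "reset error E-1042": ["doc7#2", "doc3#1", "doc9#4", "doc7#3"],
    "rotate api key": ["doc4#1", "doc12#1", "doc2#2"],
}

scores = evaluate(runs, gold, k=3)
```

With this toy data, the first query finds one of its two relevant chunks in the top three and the second finds its only relevant chunk at rank two, so both averages come out to 0.75. The point of the exercise is that the targets exist as numbers before any chunking decision is made.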

Embedding fine-tuning is the second criterion, and it is where the gap between a vendor's claims and a vendor's results widens fastest. Snorkel states it plainly: "Generalist embedding models usually won't capture the semantic nuances of domain-specific data, and an off-the-shelf embedding model likely can't tell the difference." Their documented fix is embedding fine-tuning with triplets of query, relevant chunk, and irrelevant chunk, which produces substantial gains in top-k retrieval metrics on domain-specific corpora. A vendor who proposes a general-purpose embedding model for a legal, clinical, or regulatory corpus without a fine-tuning plan is telling you something important about their production track record.
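The triplet structure is easier to evaluate once it is concrete. The sketch below shows the data shape and a standard triplet margin loss over cosine distance. It is an illustration of the general technique, not Snorkel's implementation: the margin value, the two-dimensional toy vectors, and all names are assumptions, and a real fine-tuning run would use a training library and an actual embedding model.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Triplet:
    query: str
    relevant_chunk: str    # the positive example
    irrelevant_chunk: str  # the negative, ideally a near-miss from the same domain


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_margin_loss(q: Sequence[float], pos: Sequence[float],
                        neg: Sequence[float], margin: float = 0.2) -> float:
    """Zero once the positive is closer to the query than the negative by `margin`."""
    return max(0.0, cosine_distance(q, pos) - cosine_distance(q, neg) + margin)


# Hypothetical training example with two error codes that share vocabulary.
example = Triplet(
    query="E-1042 fan fault after firmware update",
    relevant_chunk="E-1042: fan controller fails to initialize after update.",
    irrelevant_chunk="E-1024: fan speed warning, no action required.",
)

# Toy embeddings standing in for model output.
well_separated = triplet_margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
confused = triplet_margin_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Training pushes the loss toward zero across many such triplets, which is what teaches the model to separate the two error codes a generalist embedding would conflate.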

Permissions-aware retrieval and the ACL question

Permissions-aware retrieval is the third criterion, and for regulated industries it is non-negotiable. Glean enforces fine-grained document ACL-based retrieval and hybrid keyword-plus-semantic search as core features, not configuration options. In legal, HR, and regulated industries, retrieving a document that the querying user is not authorized to see is not a retrieval quality problem. It is a compliance incident.

Ask a vendor how their retrieval layer respects document-level ACLs. Ask whether permissions are enforced at query time or at ingestion time. The answer tells you whether they have shipped in a regulated environment.
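To make the query-time option concrete, here is a minimal sketch of an ACL filter applied to retrieval candidates. The `Chunk` fields, group names, and scores are hypothetical, and no vendor's actual enforcement mechanism is implied.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    allowed_groups: FrozenSet[str]  # groups permitted to read the source document
    score: float                    # relevance score from the retriever


def filter_by_acl(candidates: List[Chunk], user_groups: FrozenSet[str], top_k: int) -> List[Chunk]:
    """Drop chunks the user cannot read, then keep the top_k by score.

    Filtering happens before truncation so that unauthorized chunks never
    occupy a result slot, and user_groups is resolved per query so that a
    revoked permission takes effect on the next request.
    """
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    visible.sort(key=lambda c: c.score, reverse=True)
    return visible[:top_k]


# Hypothetical candidates returned by the retriever for one query.
candidates = [
    Chunk("hr-salary#1", frozenset({"hr"}), 0.97),
    Chunk("eng-runbook#4", frozenset({"eng", "sre"}), 0.91),
    Chunk("wiki-onboarding#2", frozenset({"eng", "hr"}), 0.80),
]

results = filter_by_acl(candidates, user_groups=frozenset({"eng"}), top_k=2)
result_ids = [c.chunk_id for c in results]
```

The highest-scoring chunk is excluded for the engineering user even though it is the closest semantic match, which is the behavior a regulated buyer should ask a vendor to demonstrate.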

Observability with human feedback loops is the fourth criterion. A system that reaches production but has no mechanism for capturing retrieval failures is a system that degrades silently. The feedback loop, where engineers can inspect which retrieved chunks contributed to a wrong answer, is what makes iterative improvement possible after launch.
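A feedback loop of that kind does not need much machinery to start. The sketch below assumes a hypothetical trace record that stores the retrieved chunk IDs alongside a human verdict, then counts which chunks show up most often in answers marked wrong. The field names and data are illustrative only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RetrievalTrace:
    query: str
    retrieved_ids: List[str]
    answer: str
    feedback: Optional[bool] = None  # True = correct, False = wrong, None = unrated


def chunks_in_wrong_answers(traces: List[RetrievalTrace]) -> List[Tuple[str, int]]:
    """Count how often each chunk was retrieved for an answer a human marked wrong."""
    counts: Counter = Counter()
    for trace in traces:
        if trace.feedback is False:
            counts.update(trace.retrieved_ids)
    return counts.most_common()


# Hypothetical production traces.
traces = [
    RetrievalTrace("reset E-1042", ["c1", "c2"], "Reboot twice.", feedback=False),
    RetrievalTrace("fan warning", ["c2"], "Ignore it.", feedback=False),
    RetrievalTrace("rotate api key", ["c3"], "Use the console.", feedback=True),
    RetrievalTrace("license renewal", ["c4"], "Contact sales."),  # unrated
]

suspects = chunks_in_wrong_answers(traces)
```

A chunk that keeps appearing in wrong answers is a candidate for re-chunking, correction, or removal, which is the kind of iteration the feedback loop exists to support.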

Structure-aware data preparation is the fifth. AWS publishes prescriptive guidance on RAG document design because Bedrock Knowledge Bases' default ingestion pipeline cannot recover meaning from poorly structured inputs. That guidance is useful. It also confirms that the burden of preparing source documents for retrieval falls on the buyer, not the platform, unless the vendor explicitly owns it.

A sophisticated reader will push back here: these five criteria favor buyers who already know what to ask for. A less-mature team lacks the internal expertise to evaluate embedding fine-tuning or write a retrieval evaluation dataset from scratch. That tension is real.

The answer is not to pretend the disciplines do not exist. Less-mature teams have two credible paths. The first is buying a vertically integrated managed service like Glean, which encodes permissions-aware retrieval and hybrid ranking internally so the buyer does not have to specify them. The second is hiring an implementation partner who brings the eval harness, the fine-tuning practice, and the observability stack as part of the engagement scope. Both paths deliver the discipline. Neither path is to select a vendor on connector count and hope the retrieval layer works.

For LLM model evaluation services that bring this discipline to custom deployments, the eval harness is the deliverable, not an afterthought.

Score every vendor on this list against the five criteria before the feature comparison starts.

The Managed Platform Tier: Vectara, Personal AI, Nuclia, Ragie

Managed services are one production path. The cloud hyperscalers offer a different one, and the choice between them is more about organizational gravity than retrieval quality. But before examining that fork, the managed platform tier deserves an honest assessment, because it is where most enterprise RAG pilots begin and where the trade-off between speed and control becomes concrete fastest.

Vectara, Personal AI, Nuclia, and Ragie close the production gap fastest for narrow, well-bounded use cases. Each bundles ingestion, embedding, retrieval, and generation behind a single API. Each removes the infrastructure decisions that slow most pilot builds. The cost of that speed is pipeline control, and pipeline control is exactly what breaks down when a corpus outgrows the platform's default assumptions.

Vectara's market position makes the trade-off visible. Across vendor comparisons, it is positioned as "RAG in a box": ingestion, embedding, retrieval, hallucination filtering, and generation behind one API. For a customer support FAQ deployment over a curated 5,000-document corpus, that positioning is accurate and the fastest credible path to production. For a 50-million-document legal corpus with bespoke metadata filtering requirements, the abstraction starts hiding the levers a team needs to pull. The chunking strategy, the ranking weights, the embedding model selection: all of those become inaccessible at the point when domain-specific tuning matters most.

Atlan's 2026 comparison documents this explicitly. Vectara and Ragie "bundle ingestion, embedding, retrieval, and generation into a single API," optimized for fastest time-to-value with "least pipeline control." That framing is a description of the design choice, not a criticism. The trade-off is explicit. For some corpora it is the right trade. For others it is disqualifying.

Ragie follows the same architectural logic as Vectara. Both platforms prioritize developer experience and time-to-value over configurability. For a team that needs a working RAG endpoint in days rather than months, that is a genuine advantage. The production question is not whether you can get to a demo. It is whether the same API gets you to acceptable retrieval quality on your actual corpus at production volume.

Nuclia occupies a narrower but defensible position in the regulated industry segment. It is documented as SOC 2 Type II and ISO 27001 compliant, with automatic indexing across formats and languages and connectors for SharePoint and Google Drive. Those are meaningful production credentials. A compliance-constrained team that needs multi-language indexing over SharePoint without building a custom ingestion pipeline has a real argument for Nuclia. The compliance certifications are independently verifiable prerequisites for regulated industry deployments.

Personal AI addresses a different production constraint. It combines Small Language Models with RAG features, a unified ranker model designed to reduce hallucinations, and per-user knowledge graphs. The use case is narrow: privacy-sensitive workloads where per-user data isolation is a product requirement, not a preference. For that niche, Personal AI's architecture closes a production gap that general-purpose managed platforms do not address. Outside that niche, the SLM-plus-RAG approach trades retrieval breadth for privacy guarantees that most enterprise workloads do not need.

The objection that deserves direct engagement: the pipeline control critique is overstated, and most enterprise corpora are bounded enough that default chunking and ranking suffice. For sub-100K-document, single-domain corpora, that is largely accurate. A curated product documentation corpus with consistent formatting, a stable schema, and a single query pattern will often perform acceptably on a managed platform's defaults. The production bar is reachable without custom tuning. That describes a real class of enterprise RAG deployments.

The problem is that this class is smaller than vendors claim and smaller than buyers assume.

Snorkel's evidence on embedding fine-tuning is unambiguous: "Generalist embedding models usually won't capture the semantic nuances of domain-specific data." Their documented fix, fine-tuning on triplets of query, relevant chunk, and irrelevant chunk, produces substantial gains in top-k retrieval metrics on domain corpora. Managed APIs do not expose the embedding layer for fine-tuning. When a corpus contains domain-specific terminology, technical error codes, regulatory language, or anything that diverges from general web text, the default embedding model starts failing in ways that are invisible until production traffic arrives.

Galileo's seven failure points research identifies missing content, irrelevant context, and incorrect ranking as the dominant failure modes in production RAG. All three are retrieval failures. All three are addressable only at the retrieval layer. Managed platforms that abstract the retrieval layer behind a single API make those failure modes harder to diagnose and impossible to fix through configuration.

The abstraction that accelerates the pilot is the same abstraction that limits the fix when retrieval degrades.

The right framing for this tier is fit to corpus. A buyer evaluating Vectara, Ragie, Nuclia, or Personal AI should start with one question: does my corpus fit within the platform's default chunking and ranking assumptions, and do I have a way to verify that before signing? If the answer is yes, managed platforms offer the fastest credible path to production. If the answer is unknown, the platform's time-to-value advantage may be entirely consumed by the engineering work required after the first production failure.

For teams building toward more complex retrieval requirements, our custom RAG development services are designed around the retrieval engineering decisions that managed APIs do not expose.

The Cloud-Native Tier: AWS Bedrock, Azure AI Search, Vertex AI Search

AWS Bedrock Knowledge Bases, Azure AI Search, and Google Vertex AI Search reach production reliably when the corpus, identity model, and compliance boundary already live in that cloud. Outside that condition, they underperform on retrieval quality for specialized domains because the default rankers and chunking pipelines are tuned for general enterprise text, and the platforms do not expose the levers needed to fix that.

Atlan's 2026 comparison frames the value proposition precisely: these are "cloud-native RAG options that provide zero-ops RAG inside each cloud's ecosystem," and they are "best when your data already lives there." That framing is accurate. IAM, networking, and compliance controls are pre-integrated. A team that already runs its data warehouse, identity layer, and compliance boundary inside a single cloud gets genuine operational value from staying in-ecosystem. The retrieval system inherits the security posture and access controls that took years to build. That is a real production advantage.

AWS's own behavior reveals the first constraint. AWS publishes prescriptive guidance for RAG document design, covering headings, summaries, document splitting, and the avoidance of complex tables. The guidance exists for one reason: Bedrock Knowledge Bases' default ingestion pipeline cannot recover meaning from poorly structured inputs. AWS is telling buyers to prepare their documents before the pipeline touches them. That guidance is genuinely useful. It also means the burden of structure-aware data preparation falls entirely on the buyer, not the platform. A team with heterogeneous source documents, inconsistent formatting, and complex tables is not buying a zero-ops system. They are buying a system where the ops burden shifts from infrastructure to data preparation.
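What structure-aware preparation means in practice can be shown with a small heading-aware chunker. This is a minimal sketch under a strong assumption: the source documents have already been converted to Markdown-style headings. The function name and sample document are hypothetical, and this is not how any cloud platform's ingestion pipeline is documented to work.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown_text: str) -> List[Dict[str, str]]:
    """Split a document at headings and tag each chunk with its heading trail."""
    chunks: List[Dict[str, str]] = []
    path: List[str] = []  # headings from the top level down to the current section
    body: List[str] = []  # lines collected for the current section

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading_path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                      # close the previous section first
            level = len(match.group(1))
            del path[level - 1:]         # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical source document.
doc = """# Router X200
Intro text.
## Error E-1042
Power-cycle the unit.
## Error E-2001
Replace the fan.
"""

chunks = chunk_by_heading(doc)
```

Each chunk carries its heading trail, so a retrieved passage about one error code is never detached from the product and section it belongs to. Fixed-size splitting loses exactly that context.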

Vertex AI separates itself from AWS and Azure on one dimension that matters: evaluation infrastructure. Vertex AI ships a dedicated generative AI evaluation framework with test set creation, golden answer sets, automated metrics, and experiment tracking through Vertex AI Experiments. AWS and Azure leave that work to the buyer. For a team that knows how to use it, Vertex AI's evaluation tooling closes part of the production gap before retrieval failures reach end users. That is a meaningful differentiator, and it is the right kind of differentiator because it addresses one of the actual causes of production failures rather than adding another connector.

The retrieval quality problem, though, is not closed by evaluation tooling alone.

Snorkel's documented evidence on generalist embeddings applies directly to every cloud-native default embedding model: they are general-purpose, and they fail on domain-specific corpora in predictable ways. A regulatory corpus full of defined terms, a clinical corpus with procedure codes, a legal corpus with cross-references between statutes: none of those parse correctly through an embedding model trained on general web text. The ranking failures that follow are not marginal. Galileo's production deployment research shows retrieval-focused failures accounting for 10 to 30 percentage points of answer-quality degradation. Cloud-native default pipelines do not expose the embedding layer for fine-tuning. When the retrieval quality starts degrading, the configuration surface available to the buyer does not include the layer where the problem lives.

A sophisticated reader will object: zero-ops simplicity outweighs marginal retrieval quality losses for most enterprises. For general-purpose internal search workloads over well-structured corpora, that objection holds. A team searching structured HR documents, product FAQs with consistent formatting, or internal wiki content with a narrow query distribution will often reach acceptable retrieval quality on cloud-native defaults. The operational simplicity is real, and for that class of workload the cloud-native tier is a rational choice.

The disagreement is with the word "most." Legal, clinical, regulatory, and technical corpora are the highest-value enterprise AI use cases, the ones where answer quality has measurable business consequences and where retrieval failures translate directly into compliance risk or wrong decisions. Galileo grounds that evidence in production deployments, not benchmarks over curated test sets. The losses are real and measurable. Prompt engineering or chunking adjustments within a cloud-native default pipeline cannot recover them. Recovering the 10 to 30 percentage points of answer quality lost to retrieval failures requires embedding fine-tuning and re-ranking work that cloud-native services do not expose to the buyer.

The net assessment for this tier is conditional. If the corpus lives in the cloud, follows consistent structure, and fits a general query distribution, AWS, Azure, and Vertex AI offer a production path with genuine operational advantages. If the corpus is domain-specific, heterogeneous, or requires fine-grained retrieval tuning, the zero-ops framing obscures a real production constraint: the platform abstracts the layer where the failure lives, and the buyer has no way to reach it.

For teams evaluating vector infrastructure decisions that sit beneath these cloud-native services, the vector database solutions for RAG comparison covers the retrieval layer in detail.

The Vertical Specialist Tier: Harvey, Glean, and Domain-Tuned Stacks

Both managed and cloud-native tiers assume the buyer is deploying a horizontal capability. A different class of vendor wins when the workload is vertical and the corpus is domain-specific.

Vertical specialists like Harvey in legal and Glean in enterprise search reach production more reliably than horizontal platforms within their domain. The reason is specific: they have invested in domain-tuned embeddings, permissions-aware retrieval, and hybrid ranking. Those are precisely the three capabilities that horizontal platforms most often leave unaddressed. The production advantage is not architectural sophistication for its own sake. It is the result of building a system around the failure modes of one domain until those failure modes stop occurring.

Harvey's published architecture makes this concrete. Harvey states explicitly that they "primarily use LanceDB Enterprise" in production. The reasons they cite are low latency, high accuracy, ingestion throughput, and strong data-privacy and hosting controls. More importantly, Harvey separates retrieval infrastructure from generation models so each can evolve independently. That separation is an architectural discipline decision. It means the retrieval layer can be tuned, re-evaluated, and replaced without touching the generation layer. Most horizontal platforms do not allow that separation because the abstraction fuses the two. Harvey's approach treats retrieval as a first-class engineering problem rather than a solved dependency.

The lesson for buyers in clinical, financial, or regulatory verticals is not "use Harvey." It is "demand the same discipline from your chosen vendor."

Glean demonstrates what that discipline looks like in enterprise search. Glean enforces fine-grained permissions-aware retrieval respecting document ACLs and hybrid search combining keyword and semantic signals through a unified ranker. These are core features, not configuration options. A document the querying user is not authorized to see does not appear in retrieval results, regardless of semantic similarity. That enforcement happens at query time, not at ingestion time.
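For readers unfamiliar with how keyword and semantic rankings get combined, reciprocal rank fusion is one widely used method. The sketch below illustrates the general idea only: nothing here claims to be Glean's ranker, the document IDs are hypothetical, and the constant 60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the same query from two retrievers.
keyword_ranking = ["a", "b", "c"]   # e.g. exact match on an error code
semantic_ranking = ["b", "d", "a"]  # e.g. nearest neighbors in embedding space

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still makes the list. That is why hybrid search tends to hold up on queries mixing exact identifiers with natural language.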

The business outcomes Glean reports from that architecture are worth examining. Glean cites a 25 to 30% reduction in operational costs after implementing RAG-based enterprise search and knowledge systems. Separately, Glean reports approximately 40% faster information discovery for knowledge workers using RAG-augmented enterprise search versus legacy keyword search. Glean's own reporting produces those numbers. Treat them as directional; no independent audit has confirmed them. Even discounted, the directional signal is meaningful: permissions-aware retrieval and hybrid ranking are not quality-of-life features. They are the capabilities that determine whether a system is used after launch.

The objection that holds partial force here is real: vertical specialists lock customers into a single domain and command premium pricing. Horizontal platforms appear to offer better total cost of ownership across multiple use cases. That comparison deserves an honest answer.

The pricing premium is real. Lock-in is real. A horizontal platform purchased for legal search today can be redeployed for HR search tomorrow without renegotiating a contract. That flexibility has genuine value for organizations running multiple RAG workloads across different domains.

But the TCO calculation breaks down when retrieval quality fails. Picture a legal team that purchases a horizontal managed platform and runs a pilot over a curated sample corpus. The demo works. The contract closes. Then real case files enter the pipeline, with their complex cross-references and statutory citations, and retrieval quality collapses. The team spends months in a pilot that cannot reach production, then additional months diagnosing a retrieval failure it cannot fix through the platform's configuration surface, then makes a second vendor selection under time pressure. The specialist premium, paid upfront, is almost always cheaper than that sequence.

The horizontal platform's TCO advantage assumes retrieval quality parity. Galileo's production deployment research shows that assumption fails for domain-specific corpora. The failure modes, including missing content, irrelevant context, and incorrect ranking, appear precisely when the corpus diverges from general web text. Legal, clinical, and regulatory corpora diverge in structured and predictable ways.

Domain-tuned embeddings close that gap. A specialist vendor that has fine-tuned embeddings on legal documents has already done the work that a horizontal platform leaves to the buyer. That work is not optional. Snorkel's documented evidence is clear: generalist embedding models cannot capture semantic nuances of domain-specific data. The fine-tuning work either gets done by the vendor before the contract or by the buyer after the pilot fails.

This is where the vertical specialist's production advantage compounds. The discipline is already encoded in the product. The buyer is not purchasing raw infrastructure and embedding expertise. They are purchasing a system where the failure modes of their domain have already been identified and addressed by an engineering team that ships nothing else.

For buyers in domains that do not yet have a packaged vertical specialist, the implication is direct: ask every vendor candidate to walk through how their system handles domain-specific terminology, cross-reference structures, and ACL enforcement at query time. The answers distinguish vendors who have shipped in a specialized domain from vendors who have shipped over general corpora and assumed the rest would follow.

Agentic RAG patterns are extending this discipline into multi-step retrieval workflows, where the domain-tuning requirements compound across retrieval steps.

The Implementation Partner Tier: Where Discipline Has to Be Bought, Not Licensed

Vertical specialists encode discipline into a product. For corpora and workflows that do not fit a packaged vertical, the discipline has to come from an implementation partner, which introduces a different vendor category with a different risk profile and a different selection test.

When the workload is too specialized for a vertical platform and too production-critical for a managed API, the production gap closes only when an implementation partner brings retrieval evaluation, embedding fine-tuning, and observability discipline that the buyer's own team does not yet have. The right partner is identified by the eval harness they show up with on day one, not the slide deck they present in the sales call.

This is a narrower claim than it sounds. Most enterprise RAG pilots that stall do not stall because the buyer chose the wrong LLM or the wrong vector database. They stall because nobody defined recall@k targets before the first chunking decision, nobody built a retrieval evaluation dataset against the actual corpus, and nobody owned the observability layer when retrieval quality started degrading after go-live. Those are discipline failures. A platform cannot fix them. A partner either brings the discipline or does not.

Keyhole Software's 2026 ranking of best AI consulting companies for RAG development scores partners explicitly on "architect-governed RAG delivery," enterprise architecture governance, and custom RAG pipelines built on top of vector databases and LLMs. The fact that a published partner ranking uses those criteria confirms what the market has already figured out: implementation partners are a distinct vendor category, not a fallback for buyers who could not afford a platform license.

The work that category is hired to perform is specific. Galileo's production deployment research documents that addressing retrieval failure points produces 10 to 30 percentage point improvements in answer quality. That improvement is not achievable by configuring a managed API differently. It requires iterative error analysis, re-chunking experiments, embedding tuning against domain-specific triplets, and re-ranking adjustments validated against a ground-truth retrieval dataset. Aplyca and Intelliarts both publish enterprise RAG playbooks that describe this work in detail: start with simple architectures, iterate rapidly against evaluation metrics, treat platforms like LangChain, LlamaIndex, and vector databases including Pinecone, Weaviate, Qdrant, pgvector, and LanceDB as composable components, and treat evaluation and observability as continuous practices rather than launch milestones. That is the integrator's job description.

The Meta engagement illustrates what that work looks like in practice. We designed and built a generative AI-powered semantic search engine over Meta's 3.5 million-plus supplier records using ChatGPT, Node.js, Python, and React. The result was a 40%-plus improvement in search precision. That improvement did not come from swapping the generation model. It came from retrieval engineering decisions: embedding selection calibrated to supplier record structure, ranking tuned against real query patterns, and filter design that matched how procurement teams actually searched. The corpus was too large, too heterogeneous, and too structurally specific for a managed platform's defaults to handle. A vertical specialist did not exist for supplier search at that scale. An implementation partner was the only path to production, and the path ran through the retrieval layer, not the prompt layer. The full Meta semantic search case study covers the retrieval engineering decisions in detail.

The objection worth taking seriously: implementation partners introduce dependency risk and ongoing cost that buying a platform avoids. A partner engagement ends. A platform license persists. The discipline walks out the door when the engagement closes unless the buyer has internalized it.

That objection holds real force, and the answer to it is a contract term, not a counterargument. The alternative to an implementation partner for a workload that requires retrieval engineering discipline is building that discipline in-house from zero. That path costs more in elapsed time and in pilot failures before the team develops the competency. It also takes longer than most enterprise timelines allow. The implementation partner's deliverable should explicitly include knowledge transfer of the eval harness and observability stack so the buyer's team can operate independently after the engagement closes. A partner who resists that transfer is building dependency by design. A partner who structures knowledge transfer into the SOW is building a handoff. Ask for it in writing before the contract is signed.

The selection test for this tier follows from the main claim. Ask every implementation partner candidate to show you their retrieval evaluation dataset for a prior engagement. Ask what recall@k targets they defined before build began on their most recent domain-specific deployment. Ask to see the observability stack they shipped alongside the retrieval system. A partner with genuine production experience answers all three questions specifically. A partner without it pivots to a demo.

The demo is not the answer. The eval harness is.

The Vendor Diligence Checklist Most Buyers Skip

Whether the buyer chooses a managed platform, a cloud-native service, a vertical specialist, or an implementation partner, the same evaluation questions apply on the sales call. Most buyers never ask them.

Six questions, asked before signing, reveal whether any enterprise RAG service company on the shortlist will close the production gap. They cover retrieval evaluation, embedding strategy, permissions-aware retrieval, chunking, observability, and failure mode coverage. Most enterprise RAG sales processes never reach any of them. The conversation stays on connectors, pricing tiers, and demo performance. The production gap stays open.

On RAG engagements we have led, the single sharpest predictor of whether a pilot reaches production is whether the vendor's scoping document defines a retrieval evaluation dataset with recall@k and MRR targets before the first chunking decision is made. When the eval harness comes first, the rest of the build sequences itself. When it comes last, the pilot stalls in "it works in the demo" purgatory. That pattern repeats regardless of which platform sits underneath.

The diligence questions that follow are derived from documented failure modes. Galileo's seven failure points framework names every category of production RAG failure: missing content, outdated content, retrieval ranking errors, chunking failures, irrelevant context, prompt and format failures, and evaluation gaps. Asking a vendor to walk through how their system handles each one is a falsifiable test of production maturity. A vendor with genuine production experience answers each point specifically. A vendor without it generalizes, deflects to platform features, or demonstrates the demo corpus.

Six questions to ask before signing

The first question is about retrieval evaluation. Ask the vendor to show you the retrieval evaluation dataset they built for a prior domain-specific engagement. Ask what recall@k and MRR targets they defined before build began. Snorkel's research consistently identifies the absence of a retrieval evaluation dataset, specifically query-to-gold-document pairs scored with recall@k and MRR, as the clearest marker of a team that has not done the production work. A team that has shipped production RAG has this dataset. A team that has shipped demos does not.
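For readers who have not built one, a retrieval evaluation dataset of this kind can be very small to start. The sketch below is a minimal illustration, not any vendor's harness: it scores a retriever's ranked output against query-to-gold-document pairs using recall@k and MRR. The query IDs, document IDs, and the `GOLD` and `RUNS` dictionaries are hypothetical placeholders.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], gold: Set[str], k: int) -> float:
    """Fraction of gold documents that appear in the top-k retrieved results."""
    if not gold:
        return 0.0
    return len(set(retrieved[:k]) & gold) / len(gold)


def reciprocal_rank(retrieved: List[str], gold: Set[str]) -> float:
    """1 / rank of the first gold document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0


def evaluate(
    runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> Dict[str, float]:
    """Average recall@k and MRR over every query in the gold set."""
    if not gold:
        return {"recall@k": 0.0, "mrr": 0.0}
    n = len(gold)
    recall = sum(recall_at_k(runs.get(q, []), g, k) for q, g in gold.items()) / n
    mrr = sum(reciprocal_rank(runs.get(q, []), g) for q, g in gold.items()) / n
    return {"recall@k": recall, "mrr": mrr}


# Hypothetical eval set: two queries with hand-labelled gold documents.
GOLD = {"q1": {"d1", "d4"}, "q2": {"d7"}}
# Hypothetical retriever output for the same queries, best match first.
RUNS = {"q1": ["d4", "d2", "d1"], "q2": ["d3", "d7", "d9"]}

metrics = evaluate(RUNS, GOLD, k=2)
print(metrics)
```

The point of defining targets before build begins is that numbers like these become the acceptance criteria: every later chunking, embedding, or ranking change is judged by whether it moves them on the buyer's own queries.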

The second question is about embedding strategy. Ask whether they plan to use a general-purpose embedding model or fine-tune against your corpus. If fine-tuning, ask what triplet structure they use and how they measure improvement in top-k retrieval metrics. A vendor who proposes a general-purpose model for a specialized corpus without a fine-tuning plan is telling you something about their production track record.
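One common triplet structure, offered here as an assumption rather than as what any particular vendor uses, is (query, positive document, hard negative), where the hard negative is a document the baseline retriever ranked highly but that is not a correct answer. The sketch below mines such triplets from a gold set and a baseline run. All names and IDs are hypothetical, and a real pipeline would carry document text rather than IDs.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Triplet:
    anchor: str    # the user query
    positive: str  # a document labelled correct for the query
    negative: str  # a highly ranked document that is not correct


def mine_triplets(
    gold: Dict[str, Set[str]],
    baseline_runs: Dict[str, List[str]],
    negatives_per_positive: int = 1,
) -> List[Triplet]:
    """Pair each gold document with the top-ranked non-gold documents."""
    triplets: List[Triplet] = []
    for query, positives in gold.items():
        ranked = baseline_runs.get(query, [])
        # Hard negatives: what the baseline wrongly placed near the top.
        hard = [d for d in ranked if d not in positives][:negatives_per_positive]
        for positive in sorted(positives):
            for negative in hard:
                triplets.append(Triplet(query, positive, negative))
    return triplets


# Hypothetical gold labels and baseline retriever output (best match first).
GOLD = {"q1": {"d1", "d4"}, "q2": {"d7"}}
BASELINE = {"q1": ["d4", "d2", "d1"], "q2": ["d3", "d7", "d9"]}

triplets = mine_triplets(GOLD, BASELINE)
for t in triplets:
    print(t)
```

Measuring improvement then means scoring the same held-out queries with recall@k and MRR before and after fine-tuning, so the gain is attributable to the embedding change and nothing else.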

The third question is about permissions-aware retrieval. Ask how document ACLs are enforced: at ingestion time or at query time. The correct answer is query time. Ingestion-time enforcement cannot account for permission changes after a document is indexed. In legal, HR, and regulated environments, that gap is a compliance incident, not a retrieval quality issue.
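The difference is easy to see in a small sketch. The example below assumes a hypothetical live permission store (`ACL`) that is consulted on every query, so a permission change takes effect immediately even though the index itself is untouched. The `Hit` type, group names, and document IDs are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Hit:
    doc_id: str
    score: float


def filter_by_acl(
    hits: List[Hit],
    user_groups: FrozenSet[str],
    current_acl: Callable[[str], FrozenSet[str]],
    k: int,
) -> List[Hit]:
    """Keep the top-k hits whose current ACL shares a group with the user."""
    allowed = [h for h in hits if current_acl(h.doc_id) & user_groups]
    return sorted(allowed, key=lambda h: h.score, reverse=True)[:k]


# Hypothetical live permission store, read at query time.
ACL: Dict[str, FrozenSet[str]] = {
    "hr-policy": frozenset({"hr"}),
    "handbook": frozenset({"hr", "eng"}),
}


def lookup_acl(doc_id: str) -> FrozenSet[str]:
    """Deny by default: unknown documents have no allowed groups."""
    return ACL.get(doc_id, frozenset())


# Hypothetical raw retriever output, before any permission check.
HITS = [Hit("hr-policy", 0.9), Hit("handbook", 0.8)]

before = filter_by_acl(HITS, frozenset({"eng"}), lookup_acl, k=5)
ACL["handbook"] = frozenset({"hr"})  # access revoked after indexing
after = filter_by_acl(HITS, frozenset({"eng"}), lookup_acl, k=5)
print([h.doc_id for h in before], [h.doc_id for h in after])
```

Filtering after retrieval can return fewer than k results, so a production design would likely over-fetch or push the filter into the vector store's query. Which of those a vendor chose, and why, is a reasonable follow-up question.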

The fourth question is about chunking. Ask how their chunking strategy adapts to document structure variations in your corpus. A credible answer describes structure-aware segmentation tied to document type, not a fixed token window applied uniformly.
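As a rough illustration of the contrast, the sketch below splits a document on its headings first and falls back to a word window only when a section is too long. It assumes headings are lines starting with `#`, which is a stand-in for whatever structural signal a real corpus offers (HTML tags, PDF outlines, contract clause numbering). The function name and sample document are hypothetical.

```python
from typing import List, Tuple


def chunk_by_headings(text: str, max_words: int = 200) -> List[Tuple[str, str]]:
    """Split on heading lines, then window any section longer than max_words.

    Returns (heading, chunk_text) pairs so each chunk keeps its section context.
    """
    sections: List[Tuple[str, List[str]]] = [("", [])]  # text before any heading
    for line in text.splitlines():
        if line.startswith("#"):
            sections.append((line.lstrip("#").strip(), []))
        elif line.strip():
            sections[-1][1].append(line.strip())

    chunks: List[Tuple[str, str]] = []
    for heading, lines in sections:
        words = " ".join(lines).split()
        # Empty sections produce no chunks because the range is empty.
        for start in range(0, len(words), max_words):
            chunks.append((heading, " ".join(words[start:start + max_words])))
    return chunks


# Hypothetical two-section document.
DOC = """# Install
Run the installer.

# Troubleshoot
Check the logs. Restart the service."""

chunks = chunk_by_headings(DOC)
for heading, body in chunks:
    print(f"[{heading}] {body}")
```

A fixed token window applied to the same document could merge the end of one section with the start of the next, which is exactly the kind of chunk that retrieves well on keywords and answers badly.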

The fifth question is about observability. Ask what telemetry the system emits when retrieval quality degrades after go-live, and how that telemetry surfaces to the team. A production system without a human feedback loop degrades silently.
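What that telemetry might look like, in its simplest form, is a rolling record of retrieval events joined to human feedback. The sketch below is an assumption-laden illustration, not a description of any platform's built-in logging: the `RetrievalEvent` fields, the window size, and the 0.75 threshold are all hypothetical choices.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional, Tuple


@dataclass(frozen=True)
class RetrievalEvent:
    query_id: str
    retrieved_ids: Tuple[str, ...]  # which documents were served
    helpful: Optional[bool]         # human feedback, None if none was given


class RetrievalMonitor:
    """Rolling window of retrieval events with a simple degradation flag."""

    def __init__(self, window: int = 100, min_helpful_rate: float = 0.8) -> None:
        self.events: Deque[RetrievalEvent] = deque(maxlen=window)
        self.min_helpful_rate = min_helpful_rate

    def record(self, event: RetrievalEvent) -> None:
        self.events.append(event)

    def helpful_rate(self) -> Optional[float]:
        """Share of rated events marked helpful, or None with no ratings."""
        rated = [e.helpful for e in self.events if e.helpful is not None]
        return sum(rated) / len(rated) if rated else None

    def degraded(self) -> bool:
        rate = self.helpful_rate()
        return rate is not None and rate < self.min_helpful_rate

    def unhelpful(self) -> List[RetrievalEvent]:
        """Events to review: the query and the documents that failed it."""
        return [e for e in self.events if e.helpful is False]


monitor = RetrievalMonitor(window=4, min_helpful_rate=0.75)
monitor.record(RetrievalEvent("q1", ("d1",), True))
monitor.record(RetrievalEvent("q2", ("d2",), True))
monitor.record(RetrievalEvent("q3", ("d9",), None))
monitor.record(RetrievalEvent("q4", ("d9",), False))
print(monitor.helpful_rate(), monitor.degraded())
```

Recording the retrieved document IDs alongside the feedback is the design choice that matters here: it lets the team trace a bad answer back to the specific documents that were served, rather than only knowing that satisfaction dropped.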

The sixth question uses Galileo's framework directly. Ask the vendor to walk through how their system handles each of the seven failure points: missing content, outdated content, retrieval ranking errors, chunking failures, irrelevant context, prompt and format failures, and evaluation gaps. A vendor with production experience has an answer for each. The answers do not have to be perfect. They have to be specific.

What a credible answer sounds like

A credible vendor answers the retrieval evaluation question by producing documentation, not by describing their evaluation philosophy. They name the metrics. They name the target thresholds. They show how the eval dataset was constructed.

A credible vendor answers the embedding question by acknowledging that general-purpose models fail on domain-specific corpora and describing the fine-tuning process they use to address that. Tooling fluency is a related signal. Ragas, an open-source RAG evaluation toolkit, measures factual accuracy, answer relevance, and retrieval-question match. A vendor's familiarity with Ragas, Vertex AI Eval, or an equivalent framework is a fast proxy for whether they have shipped to production before. Vendors who have shipped know these tools. Vendors who have not will ask what you mean by recall@k.

A credible vendor answers the observability question by describing specific telemetry, not by citing the platform's built-in logging. The question is not whether the system produces logs. The question is whether retrieval failures are visible, attributable, and actionable in a production environment.

There is a counterargument that holds partial force: asking deep technical questions in a sales process favors vendors with strong pre-sales engineering, not necessarily strong delivery. A vendor can rehearse the right answers without having earned them.

That concern is valid. The mitigation is structural. Ask for a written eval plan as part of the statement of work before the contract is signed. Require access to a working production reference deployment with measurable retrieval metrics, not a slideware reference. A vendor with genuine production experience can point to a deployed system and share its recall@k and MRR figures. A vendor without that experience cannot, regardless of how well their pre-sales engineer answered the questions on the call.

The six questions are a starting filter, not a complete evaluation. But they separate the vendor population faster than any feature matrix.

Pick the vendor who owns the eval harness

The single best predictor of whether your enterprise RAG reaches production is whether the vendor arrives at the sales call, before any contract is signed, with a retrieval evaluation dataset, a metric definition for recall@k, and a plan for permissions-aware retrieval.

Vendors who lead with platform demos and connector counts have inverted priorities. The demo proves the system works on their corpus, under their query patterns, with their data preparation. Your corpus is different. Your query distribution is different. Your document structure is almost certainly different. The demo tells you nothing about whether retrieval quality holds when your data enters the pipeline.

Ask for the eval harness on day one. If the vendor cannot show you one from a prior engagement, that absence is the answer to every other question you were going to ask. The production gap on your deployment will be determined by retrieval engineering discipline. The only question is whether that discipline belongs to the vendor or stays with you. For a fuller framing of what to probe in the sales call, see our guide on questions to ask an AI development company.

About the Author:

VP of Technology | Software Engineer | Expert in Scalable Systems & Leadership | React, Node.js & Cloud Architect

Gonzalo Buszmicz, VP of Technology at Azumo, specializes in scalable systems, full-stack development, and cloud architecture, with over 15 years of experience leading teams.
