What is the difference between a basic RAG system and a production RAG system?

A basic system connects one document source to one LLM and returns answers in a chatbot. A production system adds hybrid search (vector + keyword), access control so users only see documents they're allowed to see, citation display so users can verify answers, observability to monitor answer quality over time, and feedback loops to improve retrieval quality. The production additions are 60–70% of the total engineering work.

How much does RAG development cost? (2026 breakdown)

Ashit Vora · Buyer's Guide · Aug 3, 2025 · 10 min read

Key Takeaways

A production RAG system costs $30,000–$60,000 to build. Data preparation and ingestion pipelines, not the LLM, are typically 40–50% of the total engineering work.
Ongoing infrastructure costs run $300–$2,500/month depending on vector database size, embedding compute, and query volume. At 10,000 queries/month, most deployments sit at $400–$800/month.
RAG answers are only as good as the source documents. Budget 20–30% of build time for data cleaning and chunking strategy. This is the step that determines whether the system actually works in production.
Off-the-shelf RAG tools handle simple cases. Custom RAG earns its cost when you have proprietary data, strict access control, or domain-specific accuracy requirements.

A RAG (retrieval-augmented generation) system lets an LLM answer questions using your private documents rather than its general training data. That includes your internal knowledge base, product documentation, legal contracts, or customer records.

The appeal is clear. The cost is less discussed. This breakdown covers what you're actually paying for, what the realistic price ranges are at each complexity tier, and what the ongoing costs look like once the system is live.

Gartner's 2024 AI Hype Cycle placed RAG firmly in the "Peak of Inflated Expectations" zone, meaning buyers are often sold generic tools that don't fit their data architecture. RaftLabs has shipped 15+ RAG systems to production across legal, SaaS, and healthcare clients, and the cost patterns are consistent enough that these ranges hold across most builds.

What a RAG system is actually made of

RAG isn't one thing. The cost varies because the architecture varies significantly depending on your use case.

The basic components:

1. Data ingestion pipeline: pulling documents from your sources (SharePoint, Google Drive, S3, databases, PDFs), cleaning them, chunking them into retrievable segments, and keeping them updated as documents change
2. Embedding model: converting text chunks into vector representations (OpenAI embeddings, Cohere, or open-source models)
3. Vector database: storing and indexing the embedded chunks for fast similarity search (Pinecone, Weaviate, Qdrant, pgvector)
4. Retrieval layer: the query logic for pure vector search, hybrid vector + keyword search, or re-ranking with a secondary model
5. LLM inference: the generation step where the retrieved context and the user's question go into the LLM, which synthesizes an answer
6. Application layer: the UI or API through which users interact, whether a chatbot, a search interface, a Slack bot, or an internal API your existing systems call
7. Access control: making sure users only retrieve documents they have permission to see
8. Observability: logging queries and retrieved documents, tracking answer quality, flagging hallucinations

A simple RAG system has components 1–6 in minimal form. A production system adds 7 and 8. An enterprise system adds fine-tuning, multi-tenant isolation, human feedback loops, and governance tooling.
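
To make the component list concrete, here is a minimal sketch of the first five pieces wired together in plain Python. Everything in it is a stand-in chosen for illustration: the bag-of-words `embed` function replaces a real embedding model, the `VectorIndex` class replaces a vector database, the file names are hypothetical, and the LLM call is left out so the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def chunk(text, size=12, overlap=4):
    """Ingestion, simplified: split text into overlapping word windows."""
    words = text.split()
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorIndex:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self.rows = []  # (source, chunk_text, vector)

    def add(self, source, text):
        for piece in chunk(text):
            self.rows.append((source, piece, embed(piece)))

    def search(self, query, k=2):
        """Retrieval layer: single-stage similarity search."""
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[2]), reverse=True)
        return [(source, piece) for source, piece, _ in ranked[:k]]


def build_prompt(question, hits):
    """Generation step input: retrieved context plus the user's question."""
    context = "\n".join(f"[{source}] {piece}" for source, piece in hits)
    return f"Answer using only the context below and cite sources.\n\n{context}\n\nQuestion: {question}"


index = VectorIndex()
index.add("hr-handbook.pdf", "Employees accrue 20 days of paid leave per year. "
                             "Leave requests go through the HR portal.")
index.add("it-policy.pdf", "Laptops must be encrypted. Password rotation happens every 90 days.")

question = "How many days of paid leave do employees get?"
hits = index.search(question, k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

Keeping the source name next to each chunk is what makes citation display possible later. Access control and observability are absent here, which is the point of the simple tier.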

Most RAG cost guides stop at the LLM and vector database, which are relatively cheap. The real cost is the data pipeline (ingestion, cleaning, chunking, updating) and the retrieval layer (hybrid search, re-ranking, access control). A team can wire up a basic chatbot with LangChain in a week. Getting it accurate, fast, and access-controlled in production takes 2–4 months. That's the gap most procurement conversations miss.

The three cost scenarios

Scenario 1: Simple document Q&A, $12,000–$30,000

What you get: A chatbot or search interface connected to a single document source, such as your product documentation, HR handbook, or support knowledge base. Users can ask natural language questions and get answers with citations. Basic access control, typically single-tenant where all users see all documents.

Who it's for: Teams that want to replace "search the docs" with "ask the docs." Internal tools for support teams, onboarding portals, documentation Q&A for customers, legal contract search within a single client's files.

Build time: 4–6 weeks

What's included:

Ingestion pipeline for one document type (PDFs, Notion, Confluence, or similar)
Embedding + vector storage setup
Basic retrieval (single-stage vector search)
Simple chat or search UI
Basic citation display

What's not included: Real-time document sync, hybrid retrieval, access control per user, answer quality monitoring.

Team cost at RaftLabs rates: 2 people × 4–6 weeks = $12,000–$18,000 + setup overhead → total $12,000–$30,000.

Scenario 2: Production multi-source RAG, $30,000–$60,000

What you get: A RAG system that ingests from 3–5 document sources, maintains a live-sync pipeline as documents update, uses hybrid search (vector + keyword) for better retrieval accuracy, enforces per-user or per-team access control, and includes citation display so users can verify answers. Includes a custom UI and basic observability (query logs, flagged answers).

Who it's for: Businesses that need to make internal knowledge genuinely searchable. Law firms (client documents, case history), SaaS companies (customer-specific documentation, product configs), healthcare providers (patient protocols, clinical guidelines). Any use case where "user A must not see user B's documents" is a hard requirement.

Build time: 8–14 weeks

The 8–14 week range is real, and the variance is almost entirely in the data. In one RaftLabs deployment for a legal services firm, documents were in 14 different formats across 3 storage systems: PDFs, Word docs, a legacy DMS with SOAP-only access, and scanned images requiring OCR. The data ingestion pipeline alone took 5 weeks. In another deployment, documents were clean PDFs in a single S3 bucket with a consistent naming convention. Ingestion took 2 weeks. Same system architecture, very different timelines.

What drives the cost:

Multi-source ingestion with live sync: 3–5 weeks
Hybrid retrieval implementation and tuning: 1–2 weeks
Per-user access control layer: 1–2 weeks
Custom UI with citation display: 2–3 weeks
Observability setup: 1 week
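
As a quick check, the line items above can be added up. The sketch below assumes the items run one after another with no overlap, which the article does not state; with three people, some of the work would likely run in parallel.

```python
# Week ranges copied from the list above: (low, high).
work_items = {
    "Multi-source ingestion with live sync": (3, 5),
    "Hybrid retrieval implementation and tuning": (1, 2),
    "Per-user access control layer": (1, 2),
    "Custom UI with citation display": (2, 3),
    "Observability setup": (1, 1),
}

low = sum(lo for lo, _ in work_items.values())
high = sum(hi for _, hi in work_items.values())
print(f"Itemized total: {low}-{high} weeks")
```

The itemized total comes to 8–13 weeks, which sits inside the stated 8–14 week build time. The remaining week at the top end is not itemized in the list.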

Team cost at RaftLabs rates: 3 people × 10–14 weeks = $36,000–$58,800 + overhead → total $30,000–$60,000.

Scenario 3: Enterprise RAG platform, $70,000–$120,000+

What you get: A multi-tenant RAG platform with strict tenant isolation (each client's data is siloed), fine-tuned retrieval for domain-specific terminology (legal, medical, financial), human feedback loops for answer improvement, governance tooling (audit logs, PII detection, hallucination flagging), API-first architecture that other internal systems consume, and full observability with alerting.

Who it's for: SaaS companies building RAG as a feature inside their product. Professional services firms with hundreds of clients and strict data isolation requirements. Regulated industries (healthcare, financial services, legal) that need audit trails and compliance documentation.

Build time: 14–22 weeks

Team cost at RaftLabs rates: 5 people × 14–18 weeks = $70,000–$112,000 → total $70,000–$120,000+ with compliance overhead.
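
When comparing a vendor quote against these three scenarios, it can help to reduce each one to person-weeks and an implied weekly rate per person. The sketch below does that arithmetic on the figures quoted in this article. The `implied_rate` helper and the dictionary layout are illustrative, and the resulting rates are derived from the stated numbers, not taken from a published rate card.

```python
# People, week ranges, and pre-overhead cost ranges as stated for each scenario.
scenarios = {
    "Simple document Q&A": {"people": 2, "weeks": (4, 6), "cost": (12_000, 18_000)},
    "Production multi-source": {"people": 3, "weeks": (10, 14), "cost": (36_000, 58_800)},
    "Enterprise platform": {"people": 5, "weeks": (14, 18), "cost": (70_000, 112_000)},
}


def implied_rate(people, weeks, cost):
    """Cost divided by person-weeks: the blended weekly rate per person."""
    return cost / (people * weeks)


rates = {}
for name, s in scenarios.items():
    low = implied_rate(s["people"], s["weeks"][0], s["cost"][0])
    high = implied_rate(s["people"], s["weeks"][1], s["cost"][1])
    rates[name] = (low, high)
    print(f"{name}: ${low:,.0f}-${high:,.0f} per person-week")
```

Run as written, this gives $1,500 per person-week for the simple scenario, $1,200–$1,400 for the production scenario, and roughly $1,000–$1,244 for the enterprise scenario. The article does not explain the difference; a different team mix per tier is one possible reading.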

What actually drives the cost

According to research from LlamaIndex on production RAG deployments, the most common reason RAG systems underperform in production is document quality, not model choice. The implication for budgeting is clear: allocate more to data preparation, not more to inference infrastructure.

Data quality and variety: the biggest unknown

The most common project overrun in RAG development is data preparation. The time it adds depends on the state of your documents:

Consistent PDFs with selectable text → fast
Mixed formats (PDF, Word, HTML, scanned images) → add 2–3 weeks
Documents stored in systems with poor APIs (legacy DMS, SharePoint on-prem) → add 2–4 weeks
Documents containing PII that needs redaction before ingestion → add 1–2 weeks + legal review time
Extremely long documents (200+ page contracts) → add 1–2 weeks for chunking strategy

Budget 20–30% of your total build timeline for data preparation regardless of how clean you think the data is. It's always messier than expected.
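
The adders above can be turned into a rough estimator. This is a sketch under two assumptions the article does not make explicit: that the adders stack additively when several conditions apply, and that legal review time for PII is tracked separately because it has no stated range. The condition keys are hypothetical names.

```python
# Extra weeks per data condition, copied from the list above: (low, high).
ADDERS = {
    "mixed_formats": (2, 3),
    "poor_api_sources": (2, 4),
    "pii_redaction": (1, 2),        # legal review time is extra and not quantified
    "very_long_documents": (1, 2),
}


def extra_weeks(conditions):
    """Sum the low and high adders for every condition that applies."""
    low = sum(ADDERS[c][0] for c in conditions)
    high = sum(ADDERS[c][1] for c in conditions)
    return low, high


def data_prep_budget(total_weeks):
    """The 20-30% data preparation reserve, in weeks."""
    return total_weeks * 20 / 100, total_weeks * 30 / 100


print(extra_weeks(["mixed_formats", "poor_api_sources"]))
print(data_prep_budget(10))
```

For a 10-week build, the 20–30% reserve is 2–3 weeks, while mixed formats plus a legacy source alone add 4–7 weeks. When the adders exceed the reserve like this, the timeline needs to grow instead of absorbing the difference.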

Retrieval accuracy requirements

A basic vector similarity search works well for clean, well-structured document sets. When accuracy matters more, as with legal research, medical protocol lookup, or financial compliance, you need hybrid retrieval (vector + BM25 keyword), re-ranking with a secondary model, and possibly fine-tuned embeddings. Each of these adds 1–3 weeks of engineering and tuning.
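
A minimal sketch of hybrid retrieval follows, assuming the two rankings are merged with reciprocal rank fusion (RRF). The article does not say how the keyword and vector results are combined, so RRF is a choice made here for illustration. The BM25 function is a common textbook variant written from scratch, the documents are invented, and `vector_scores` is a hard-coded placeholder for what an embedding model would return.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a standard BM25 variant."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def ranking(scores):
    """Document indices ordered best-first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first rankings; each contributes 1 / (k + rank)."""
    fused = Counter()
    for order in rankings:
        for rank, doc_id in enumerate(order, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]


docs = [
    "Clause 14.2 limits liability to fees paid in the prior twelve months.",
    "The supplier's exposure is capped at the amount invoiced during the previous year.",
    "Clause 9 covers termination for convenience.",
    "Office parking is available on level two.",
]
query = "liability cap clause 14.2"

keyword_order = ranking(bm25_scores(query, docs))
vector_scores = [0.78, 0.81, 0.40, 0.05]  # placeholder similarities, not real model output
vector_order = ranking(vector_scores)
fused_order = reciprocal_rank_fusion([keyword_order, vector_order])
print(keyword_order, vector_order, fused_order)
```

In this toy case, keyword search finds the exact clause number but misses the paraphrased clause, and the placeholder vector scores do the opposite. The fused ranking puts both relevant clauses ahead of the unrelated ones. A re-ranking model would be a further stage applied to the fused list.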

Access control complexity

Single-tenant, all-users-see-everything: minimal cost. Per-user row-level security across thousands of documents: significant engineering. If your access model mirrors a complex organizational hierarchy (department → team → role → individual), budget 2–4 weeks for access control alone.
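
One way to picture the hierarchical case is to give each user a set of principals (department, team, role, individual) and tag each chunk with the principals allowed to read it. The sketch below filters retrieved chunks on that basis before anything reaches the LLM. The class names, the `dept:`/`team:`/`user:` prefixes, and the sample data are all hypothetical. Many vector databases can apply a filter like this inside the query, but the details vary by product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed: frozenset  # principals permitted to read this chunk


@dataclass(frozen=True)
class User:
    name: str
    principals: frozenset  # department, team, role, and individual identifiers


def visible_chunks(chunks, user):
    """Keep only chunks that share at least one principal with the user."""
    return [c for c in chunks if c.allowed & user.principals]


chunks = [
    Chunk("Client A settlement terms", "client-a/terms.pdf", frozenset({"team:client-a"})),
    Chunk("Client B fee schedule", "client-b/fees.pdf", frozenset({"team:client-b"})),
    Chunk("Firm-wide holiday calendar", "hr/calendar.pdf", frozenset({"dept:all-staff"})),
]

alice = User("alice", frozenset({"dept:all-staff", "team:client-a", "user:alice"}))
bob = User("bob", frozenset({"dept:all-staff", "team:client-b", "user:bob"}))

print([c.source for c in visible_chunks(chunks, alice)])
print([c.source for c in visible_chunks(chunks, bob)])
```

The filter itself is short. The engineering time goes into keeping the `allowed` tags in sync with the source systems' permissions as documents and team memberships change.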

Real-time vs. batch sync

If documents update rarely and a nightly re-index is acceptable, the ingestion pipeline is straightforward. If users expect new documents to be searchable within minutes of upload, you need event-driven ingestion with webhook or queue integration. Add 1–2 weeks for the real-time pipeline.
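
The difference between the two modes can be sketched with Python's standard `queue` module. The event shape (`type`, `doc_id`, `text`) and the function names are assumptions made for illustration, and the dictionary stands in for the vector index. A real pipeline would chunk and embed each document and would run the worker continuously instead of draining once.

```python
import queue

events = queue.Queue()  # stand-in for a message queue fed by webhooks
index = {}              # stand-in for the vector index: doc_id -> text


def on_document_event(event):
    """Webhook handler: enqueue the change and return immediately."""
    events.put(event)


def drain(events, index):
    """Worker: apply every pending change to the index, one event at a time."""
    processed = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return processed
        if event["type"] == "deleted":
            index.pop(event["doc_id"], None)
        else:  # "created" or "updated": re-index only this document
            index[event["doc_id"]] = event["text"]
        processed += 1


def nightly_reindex(all_documents):
    """Batch alternative: rebuild the whole index from scratch."""
    return dict(all_documents)


on_document_event({"type": "created", "doc_id": "contract-17", "text": "Initial draft"})
on_document_event({"type": "updated", "doc_id": "contract-17", "text": "Signed version"})
on_document_event({"type": "created", "doc_id": "memo-3", "text": "Internal memo"})
on_document_event({"type": "deleted", "doc_id": "memo-3"})
handled = drain(events, index)
print(handled, index)
```

The event-driven path touches only the documents that changed, which is what makes minute-level freshness feasible. It also means handling ordering, retries, and deletes, which the nightly rebuild avoids.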

Ongoing infrastructure costs

Component	Monthly cost
Vector database (Pinecone/Qdrant hosted)	$50–$300
Embedding compute (OpenAI/Cohere)	$20–$200
LLM inference ($0.002–$0.02/query)	$20–$200 at 10K queries
Application hosting (cloud)	$100–$500
Document storage + ingestion compute	$50–$200
Observability tooling	$50–$200
Total	$290–$1,600
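
The table's total can be reproduced with a few lines of arithmetic. The sketch below treats LLM inference as the only line that scales with query volume, which is a simplification: the article notes that vector database and embedding costs also grow at larger scale, so this will understate costs well above 10,000 queries/month.

```python
# Monthly ranges from the table, in USD: (low, high). Inference is handled per query.
FIXED_COMPONENTS = {
    "Vector database (hosted)": (50, 300),
    "Embedding compute": (20, 200),
    "Application hosting": (100, 500),
    "Document storage + ingestion compute": (50, 200),
    "Observability tooling": (50, 200),
}
PER_QUERY_COST = (0.002, 0.02)  # LLM inference, USD per query


def monthly_cost(queries_per_month):
    """Return the (low, high) monthly total for a given query volume."""
    low = sum(lo for lo, _ in FIXED_COMPONENTS.values())
    high = sum(hi for _, hi in FIXED_COMPONENTS.values())
    low += queries_per_month * PER_QUERY_COST[0]
    high += queries_per_month * PER_QUERY_COST[1]
    return round(low), round(high)


print(monthly_cost(10_000))
```

At 10,000 queries/month the inference line works out to $20–$200, and the total matches the table's $290–$1,600.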

For large enterprise deployments with high query volume (100,000+ queries/month) or very large document sets (1M+ chunks), costs scale with vector database size and embedding compute. Plan for $2,000–$5,000/month at that scale.

Self-hosting the vector database and using open-source embedding models reduces costs by 40–60% but adds infrastructure management overhead. This is worth considering at scale but not for initial deployments.

Build custom vs. use an off-the-shelf tool

"The right question is not 'should we use RAG?' but 'do we have the data quality and access control requirements that justify custom RAG vs. an off-the-shelf tool?'" -- Andrej Karpathy, former Director of AI at Tesla, on evaluating AI architecture choices

Off-the-shelf options (Microsoft Copilot, Notion AI, Glean, Guru) make sense when:

Your documents are in formats they already support (SharePoint, Confluence, Google Workspace)
Access control maps to your existing SSO/directory structure
Generic retrieval quality is acceptable for your use case
You need to be live quickly and don't have engineering resources

Custom RAG earns its cost when:

You have proprietary data formats or sources those tools can't reach
Access control requirements are more complex than "department-level permissions"
Answer accuracy in a specific domain (legal, medical, financial) is critical to the product's value
You're building RAG as a feature inside your own product (SaaS companies)
You can't send your documents to a third-party hosted service for compliance reasons

For most internal knowledge base use cases, a hosted tool is the right starting point. Upgrade to custom when you've hit the ceiling of what the hosted tool can do.

For implementation specifics, see our guide to building a RAG pipeline and our overview of RAG pipeline development services. If you want a fully managed deployment rather than a custom build, see our RAG as a Service offering.

Frequently asked questions

How much does it cost to build a RAG system?

A simple single-source RAG system costs $12,000–$30,000. A production multi-source system with access control runs $30,000–$60,000. An enterprise platform with fine-tuning, multi-tenant isolation, and observability costs $70,000–$120,000+. The biggest cost variable is how complex and inconsistent your source data is.

How long does RAG development take?

A simple pipeline takes 4–6 weeks. A production system with multiple sources and a custom UI takes 8–14 weeks. An enterprise platform takes 14–22 weeks. Data preparation is the most common timeline risk. Inconsistent document formats can add 2–4 weeks.

What are the ongoing costs of running a RAG system?

Expect $300–$1,600/month for most production deployments at 10,000 queries/month. Main components are vector database hosting ($50–$300), embedding compute ($20–$200), LLM inference ($20–$200), and application infrastructure ($100–$500). Self-hosting vector infrastructure reduces costs 40–60% at the expense of management overhead.

When should I build custom RAG instead of using an off-the-shelf tool?

Build custom when you have proprietary data formats, strict per-user access control requirements, domain-specific accuracy needs (legal, medical, financial), or when you're embedding RAG inside your own product. Use off-the-shelf tools when your documents are in standard formats, access control maps to SSO, and generic accuracy is acceptable.

What is the most common reason RAG projects go over budget?

Data preparation. Source documents are almost always messier, more varied in format, or harder to access via API than initially assumed. Budget 20–30% of your total build timeline for data cleaning, format normalization, and ingestion pipeline work, even if the data looks clean at first glance.
