RAG Pipeline Builder

Design your retrieval-augmented generation architecture interactively. Select chunking strategies, embedding models, vector databases, retrieval parameters, and LLMs. Get cost estimates, latency projections, and architecture diagrams. Optimize chunk size for your specific document types.

No data leaves your browser

The Complete Guide to RAG Pipeline Architecture

Retrieval-Augmented Generation (RAG) is the dominant pattern for building AI applications that need access to private or current data. Instead of fine-tuning an LLM on your documents, RAG retrieves relevant passages at query time and includes them in the prompt as context. This approach combines the reasoning ability of LLMs with the accuracy of information retrieval, producing answers that are grounded in your actual data rather than the model's potentially outdated training knowledge.

RAG Pipeline Components

1. Document Ingestion

The ingestion pipeline transforms raw documents (PDFs, HTML, Word files, Markdown, etc.) into structured text. This step includes extraction (parsing format-specific content), cleaning (removing headers, footers, navigation elements, and boilerplate), and metadata extraction (dates, authors, document type, section headers). The quality of ingestion directly impacts retrieval quality — garbage in, garbage out. Libraries like Unstructured, LlamaIndex, and LangChain provide parsers for 50+ document formats.
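As a rough illustration of the cleaning step, the sketch below drops lines that repeat across most pages of a document, which is one simple way to catch running headers and footers. The function name, the `min_fraction` threshold, and the sample pages are all hypothetical; real parsers such as the libraries named above use richer layout information.

```python
from collections import Counter

def strip_repeated_lines(pages, min_fraction=0.6):
    """Remove lines that appear on at least min_fraction of the pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    # Require at least two occurrences so single-page documents are untouched.
    threshold = max(2, min_fraction * len(pages))
    boilerplate = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if l.strip() not in boilerplate]
        cleaned.append("\n".join(kept).strip())
    return cleaned

# Hypothetical three-page document with a running header and footer.
pages = [
    "ACME Handbook\nWelcome to the team.\nPage footer",
    "ACME Handbook\nVacation policy applies to all staff.\nPage footer",
    "ACME Handbook\nExpense reports are due monthly.\nPage footer",
]
cleaned = strip_repeated_lines(pages)
print(cleaned)
```

A frequency threshold like this is a heuristic: it can also remove legitimate lines that happen to repeat, so it is worth spot-checking the output on a sample of your own documents.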

2. Chunking

Chunking splits documents into passages small enough for the embedding model to process and specific enough for accurate retrieval. Chunk size is often the single most impactful parameter for RAG quality. Chunks that are too small lose context and produce fragmented retrieval. Chunks that are too large dilute the relevant information with irrelevant text, which reduces retrieval precision and increases LLM costs. Our builder lets you experiment with chunk sizes from 64 to 8192 tokens to find the optimal balance for your data.
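A minimal sketch of fixed-size chunking with overlap may make the tradeoff concrete. It assumes the text has already been tokenized; whitespace splitting stands in for a real tokenizer here, and the function name and parameters are illustrative rather than taken from any library.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Whitespace splitting is a stand-in for a model-specific tokenizer.
words = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With `chunk_size=4` and `overlap=1`, each chunk starts with the last token of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.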

3. Embedding

Each chunk is converted into a dense vector using an embedding model. The choice of embedding model affects retrieval quality, cost, and storage requirements. For most applications, text-embedding-3-small offers the best cost-to-quality ratio. For maximum retrieval accuracy, Voyage AI voyage-3-large ranks at or near the top of retrieval benchmarks at the time of writing. Open-source models like GTE-Qwen2 eliminate per-token costs for high-volume ingestion. The embedding model must match between ingestion and query time: vectors from different models live in different spaces, so you cannot mix them in one index.
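To see how the embedding choice drives storage, the sketch below estimates the raw size of the vectors alone and adds a small guard against mixing models. It assumes float32 values and the 1536-dimension default output of text-embedding-3-small; it ignores index overhead and metadata, and both function names are hypothetical.

```python
def raw_vector_bytes(num_chunks, dimensions, bytes_per_value=4):
    """Bytes needed for the raw vectors alone (float32 by default)."""
    return num_chunks * dimensions * bytes_per_value

def check_same_model(index_model, query_model):
    """Refuse to search an index with vectors from a different model."""
    if index_model != query_model:
        raise ValueError(
            f"index built with {index_model!r}, query embedded with {query_model!r}"
        )

# One million chunks at 1536 dimensions, stored as float32.
size_bytes = raw_vector_bytes(1_000_000, 1536)
print(f"{size_bytes / 1e9:.3f} GB of raw vectors")

check_same_model("text-embedding-3-small", "text-embedding-3-small")  # passes
```

Storing the model name alongside the index and checking it at query time is a cheap way to catch an accidental model swap before it silently degrades retrieval.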

4. Vector Storage

Vectors are stored in a vector database optimized for similarity search. The database must support fast approximate nearest neighbor (ANN) search at your scale, metadata filtering (to restrict search by date, source, user, etc.), and ideally hybrid search combining vector similarity with keyword matching. For production workloads processing more than 100 queries per second, managed databases like Pinecone provide the best operational experience. For development and smaller-scale applications, pgvector adds vector search to your existing PostgreSQL without additional infrastructure.
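The interaction between metadata filtering and similarity search can be sketched in a few lines. Note that this is an exact brute-force scan, not an ANN index, and the record layout (`id`, `vec`, `meta`) is an assumption made for the example rather than the schema of any particular database.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index, query_vec, top_k=3, where=None):
    """Filter records by exact metadata match, then rank by cosine similarity."""
    where = where or {}
    candidates = [
        rec for rec in index
        if all(rec["meta"].get(key) == value for key, value in where.items())
    ]
    scored = [(cosine(rec["vec"], query_vec), rec["id"]) for rec in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy two-dimensional index; real embeddings have hundreds of dimensions.
index = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"source": "wiki"}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"source": "blog"}},
    {"id": "c", "vec": [0.0, 1.0], "meta": {"source": "wiki"}},
]
hits = search(index, [1.0, 0.0], top_k=2, where={"source": "wiki"})
print(hits)
```

Filtering before scoring means record "b" never competes for a slot even though it is a close match, which is the behavior you want when the filter encodes access control or a date range.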

5. Retrieval and Reranking

At query time, the user's question is embedded using the same model and used to search the vector database for the most similar chunks. The top-k results are returned. A reranking step dramatically improves quality: instead of relying solely on embedding similarity (which can be noisy), a cross-encoder model scores each candidate passage against the query more carefully. This retrieve-then-rerank pattern typically improves answer quality by 10-20% at minimal additional cost.
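The retrieve-then-rerank pattern can be sketched with two pluggable scoring functions. In a real pipeline the first stage would be the vector search and the second a cross-encoder; the toy scorers below (word overlap and exact phrase match) are stand-ins chosen only so the example runs without any model, and all names are hypothetical.

```python
def retrieve_then_rerank(query, chunks, first_stage_score, rerank_score,
                         fetch_k=20, final_k=3):
    """Shortlist with a cheap scorer, then reorder the shortlist with a costlier one."""
    # Stage 1: cheap score over the whole corpus, keep fetch_k candidates.
    candidates = sorted(
        chunks, key=lambda c: first_stage_score(query, c), reverse=True
    )[:fetch_k]
    # Stage 2: more careful score, applied to the shortlist only.
    reranked = sorted(
        candidates, key=lambda c: rerank_score(query, c), reverse=True
    )
    return reranked[:final_k]

def overlap_score(query, chunk):
    """Toy first stage: number of words shared by query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def phrase_score(query, chunk):
    """Toy reranker: 1.0 if the query appears as an exact phrase, else 0.0."""
    return 1.0 if query.lower() in chunk.lower() else 0.0

chunks = [
    "password policy and password reset rules for admins",
    "to reset password open settings",
    "billing and invoices",
]
best = retrieve_then_rerank(
    "reset password", chunks, overlap_score, phrase_score, fetch_k=2, final_k=1
)
print(best)
```

Both of the first two chunks tie in the first stage, and only the second stage tells them apart. The design point is that the expensive scorer runs on `fetch_k` candidates rather than the whole corpus.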

6. Generation

The retrieved (and optionally reranked) chunks are formatted into the LLM prompt along with the user's question. The LLM generates an answer grounded in the provided context. The prompt template matters: explicitly instruct the model to base its answer only on the provided context and to acknowledge when the context does not contain enough information to answer. This reduces hallucination significantly compared to allowing the model to supplement context with its training knowledge.
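One possible shape for such a prompt template is sketched below. The wording of the instruction and the numbered-context layout are illustrative choices, not a prescribed format, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(question, chunks):
    """Format retrieved chunks and the question into a grounded prompt."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain enough information, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "How do I reset my password?",
    ["To reset your password, open Settings.", "Passwords expire after 90 days."],
)
print(prompt)
```

Numbering the chunks makes it easy to ask the model to cite which passage supports each claim, which in turn helps when evaluating faithfulness.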

Chunking Strategies Compared

Strategy            | Best For         | Chunk Quality | Implementation
--------------------|------------------|---------------|---------------
Fixed Token Size    | Homogeneous text | Medium        | Simple
Recursive Character | General purpose  | Good          | Moderate
Sentence-Level      | Q&A, support     | Good          | Moderate
Semantic (Section)  | Structured docs  | Excellent     | Complex
Markdown Header     | Documentation    | Excellent     | Simple
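The Markdown Header strategy from the table is simple enough to sketch directly: start a new chunk at every header line and carry the header text along as metadata. This is a simplified illustration with a hypothetical function name; among other things it does not skip `#` lines inside fenced code blocks.

```python
import re

def split_by_markdown_headers(text):
    """Split Markdown text into one section per header line."""
    sections = []
    current = {"header": None, "lines": []}
    for line in text.splitlines():
        if re.match(r"^#{1,6}\s+\S", line):
            # Close the previous section if it has a header or any content.
            if current["header"] is not None or current["lines"]:
                sections.append(current)
            current = {"header": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["header"] is not None or current["lines"]:
        sections.append(current)
    return [
        {"header": s["header"], "body": "\n".join(s["lines"]).strip()}
        for s in sections
    ]

doc = "# Install\nRun the installer.\n\n## Options\nUse --quiet.\n"
sections = split_by_markdown_headers(doc)
for section in sections:
    print(section)
```

Sections that are still too long after this split can be passed through a size-based splitter, keeping the header as metadata on every resulting chunk.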

Cost Optimization Strategies

RAG costs come from three sources: embedding (one-time per document), storage (ongoing monthly), and generation (per query). Generation dominates ongoing costs because every query pays for both retrieval and LLM inference. To optimize: use smaller, cheaper LLMs for simple queries and route complex queries to premium models. Cache frequent queries and their answers. Apply metadata filters to narrow the search space before the vector similarity search, which also reduces false positives. Implement prompt compression to reduce the number of context tokens sent to the LLM; techniques like LLMLingua can compress retrieved passages by 50% while preserving answer quality.
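The three cost sources can be put into a small model to see why generation tends to dominate. Every price and volume below is a made-up placeholder, not a quote from any provider, and `estimate_costs` is a hypothetical helper; substitute your own numbers.

```python
def estimate_costs(corpus_tokens, queries_per_month, context_tokens_per_query,
                   output_tokens_per_query, embed_price_per_mtok,
                   llm_input_price_per_mtok, llm_output_price_per_mtok,
                   storage_price_per_month):
    """Return one-time embedding cost and monthly storage and generation costs."""
    embedding_once = corpus_tokens / 1e6 * embed_price_per_mtok
    per_query = (
        context_tokens_per_query / 1e6 * llm_input_price_per_mtok
        + output_tokens_per_query / 1e6 * llm_output_price_per_mtok
    )
    return {
        "embedding_once": embedding_once,
        "storage_monthly": storage_price_per_month,
        "generation_monthly": queries_per_month * per_query,
    }

# Placeholder inputs for illustration only.
params = dict(
    corpus_tokens=10_000_000,
    queries_per_month=100_000,
    output_tokens_per_query=200,
    embed_price_per_mtok=0.02,
    llm_input_price_per_mtok=1.00,
    llm_output_price_per_mtok=4.00,
    storage_price_per_month=25.0,
)
baseline = estimate_costs(context_tokens_per_query=2_000, **params)
compressed = estimate_costs(context_tokens_per_query=1_000, **params)
print(baseline)
print(compressed)
```

Under these placeholder numbers, halving the context tokens per query (as a 50% prompt compression would) lowers the monthly generation cost from about 280 to about 180, while the one-time embedding cost stays well under a dollar.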

Frequently Asked Questions

What is the best chunk size for RAG?

The optimal chunk size depends on your document type. For general-purpose RAG, 512-1024 tokens with 50-100 token overlap works well. Smaller chunks (256) improve precision but lose context. Larger chunks (2048+) preserve context but reduce precision. Benchmark on your data to find the optimum.
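Overlap is not free: it increases the number of chunks you embed and store. The sketch below counts the chunks a sliding window produces for a single document, assuming each window advances by `chunk_size - overlap` tokens; the function name is hypothetical.

```python
def chunk_count(total_tokens, chunk_size, overlap=0):
    """Number of sliding-window chunks for one document of total_tokens tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if total_tokens <= 0:
        return 0
    step = chunk_size - overlap
    # Integer ceiling of (total_tokens - overlap) / step, with at least one chunk.
    return max(1, -(-(total_tokens - overlap) // step))

with_overlap = chunk_count(1_000_000, chunk_size=512, overlap=64)
without_overlap = chunk_count(1_000_000, chunk_size=512, overlap=0)
print(with_overlap, without_overlap)  # 2232 1954
```

For a one-million-token document, a 64-token overlap on 512-token chunks raises the chunk count from 1954 to 2232, roughly 14% more vectors to embed and store.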

Should I use fixed-size or semantic chunking?

Fixed-size is simpler and works for homogeneous content. Semantic chunking preserves meaning better for structured documents. Recursive character splitting is the best middle ground for most production systems.
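The idea behind recursive character splitting is to try coarse separators first (paragraphs), fall back to finer ones (lines, then words) only for pieces that are still too long, and then merge small neighbours back together. The following is a simplified sketch of that idea with no overlap handling; it is not the implementation used by any particular library, and the names and defaults are assumptions.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", " ")):
    """Split text on progressively finer separators until pieces fit max_chars."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to hard cuts.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) > max_chars:
            pieces.extend(recursive_split(part, max_chars, finer))
        elif part.strip():
            pieces.append(part)
    # Greedily merge neighbouring pieces while the result still fits.
    merged = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(sep) + len(piece) <= max_chars:
            merged[-1] = merged[-1] + sep + piece
        else:
            merged.append(piece)
    return merged

text = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta iota kappa lambda."
parts = recursive_split(text, max_chars=30)
for part in parts:
    print(repr(part))
```

The short first paragraph survives intact, while the long second paragraph is broken at word boundaries. That preference for natural boundaries is what makes this approach a reasonable middle ground between fixed-size and semantic chunking.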

How many chunks should I retrieve for RAG context?

Pass 3-5 chunks (top-k) to the LLM for most applications. With a retrieve-then-rerank pattern, fetch 20-50 candidates, rerank them, and pass only the top 3-5 to the LLM; this typically yields a 10-20% quality improvement.

What vector database should I use for RAG?

For prototyping: pgvector. For production: Pinecone (managed), Qdrant (open-source), or Weaviate (hybrid search). Key factors: query latency, filtering capabilities, and operational complexity.

How do I evaluate RAG pipeline quality?

Evaluate at three levels: retrieval quality (recall@k), generation quality (faithfulness and relevance), and end-to-end quality. Use frameworks like RAGAS or TruLens. Aim for >85% retrieval recall@5 and >90% faithfulness.
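Retrieval recall@k is straightforward to compute by hand once you have a labeled set of queries with their relevant chunk IDs. The sketch below is a plain-Python version with hypothetical names and toy data; it does not use the RAGAS or TruLens APIs.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top = set(retrieved_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs, k=5):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

# Toy evaluation set: each entry is (ranked retrieved IDs, set of relevant IDs).
runs = [
    (["d1", "d7", "d3"], {"d1", "d3"}),
    (["d9", "d2"], {"d4"}),
]
score = mean_recall_at_k(runs, k=2)
print(score)  # 0.25
```

With `k=2`, the first query finds one of its two relevant chunks and the second finds none, so the mean is 0.25. Tracking this number while you vary chunk size or embedding model isolates retrieval problems from generation problems.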

About the Author

Built by Michael Lip, a solo developer with 10+ years of experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ developer tools across the Zovo network. No tracking, no ads, no data collection.