RAG Pipeline Builder

Q: What is the best chunk size for RAG?

The optimal chunk size depends on your document type and retrieval model. For general-purpose RAG with text-embedding-3-small, 512-1024 tokens with 50-100 token overlap works well for most use cases. Smaller chunks (256 tokens) improve retrieval precision but may lose context. Larger chunks (2048+) preserve more context but reduce precision and increase LLM input costs. The best practice is to benchmark 3-4 chunk sizes on your specific data using a test set of 50+ queries, measuring both retrieval recall and final answer quality.

Q: Should I use fixed-size or semantic chunking?

Fixed-size chunking (splitting at token boundaries with overlap) is simpler, faster, and works well for homogeneous content like articles and documentation. Semantic chunking (splitting at natural boundaries like paragraphs, sections, or topic changes) preserves meaning better and produces higher retrieval quality for structured documents with clear sections. Recursive character splitting is a good middle ground — it tries to split at paragraph boundaries first, falling back to sentence and word boundaries. For most production systems, recursive splitting with 512-token target chunks outperforms both pure fixed-size and complex semantic approaches.
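The recursive idea can be sketched in a few lines. This is a minimal illustration, not any library's implementation: `recursive_split`, `count_tokens`, and `SEPARATORS` are hypothetical names, and a whitespace word count stands in for a real tokenizer.

```python
import re

# Separators tried in order: paragraph break, sentence end, then any whitespace.
SEPARATORS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]


def count_tokens(text: str) -> int:
    """Assumption: one token per whitespace-separated word (not a real tokenizer)."""
    return len(text.split())


def recursive_split(text: str, max_tokens: int, level: int = 0) -> list[str]:
    """Split text into chunks of at most max_tokens, preferring coarse boundaries."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens or level >= len(SEPARATORS):
        return [text]

    pieces = [p for p in re.split(SEPARATORS[level], text) if p.strip()]
    chunks: list[str] = []
    buffer = ""
    for piece in pieces:
        # Pieces are re-joined with a single space, so paragraph breaks are not kept.
        candidate = f"{buffer} {piece}" if buffer else piece
        if count_tokens(candidate) <= max_tokens:
            buffer = candidate  # keep packing pieces into the current chunk
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if count_tokens(piece) <= max_tokens:
            buffer = piece
        else:
            # Still too large: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(piece, max_tokens, level + 1))
    if buffer:
        chunks.append(buffer)
    return chunks


sample = (
    "Alpha beta gamma. Delta epsilon.\n\n"
    "Zeta eta theta iota kappa lambda mu. Nu xi."
)
sample_chunks = recursive_split(sample, max_tokens=5)
```

In this sample the first paragraph fits in one chunk, while the second is broken at the sentence boundary and then at word boundaries because its first sentence exceeds the limit.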

Q: How many chunks should I retrieve for RAG context?

Retrieve 3-5 chunks (top-k) for most applications. Too few chunks risk missing relevant information; too many dilute the signal with irrelevant content and increase LLM costs. A reranking step (using Cohere Rerank or a cross-encoder) dramatically improves quality: retrieve 20-50 chunks, rerank them, then pass the top 3-5 to the LLM. This retrieve-then-rerank pattern is the current best practice, improving answer quality by 10-20% compared to using raw vector similarity alone.
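The retrieve-then-rerank flow might look like the sketch below. It assumes an in-memory list of `(text, vector)` pairs instead of a vector database, and it takes the reranker as a plain callable. `word_overlap` is a toy stand-in for a cross-encoder or a hosted reranking API, and all function names are illustrative.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_text: str,
    query_vec: list[float],
    index: list[tuple[str, list[float]]],
    score_pair: Callable[[str, str], float],
    fetch_k: int = 20,
    final_k: int = 3,
) -> list[str]:
    """Stage 1: broad vector search. Stage 2: rescore candidates against the query text."""
    candidates = sorted(
        index, key=lambda item: cosine(query_vec, item[1]), reverse=True
    )[:fetch_k]
    reranked = sorted(
        candidates, key=lambda item: score_pair(query_text, item[0]), reverse=True
    )
    return [text for text, _ in reranked[:final_k]]


def word_overlap(query: str, passage: str) -> float:
    """Toy relevance score (assumption): share of query words found in the passage."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


toy_index = [
    ("refund policy lasts 30 days", [1.0, 0.0]),
    ("shipping takes 5 days", [0.9, 0.1]),
    ("our office dog is named Bo", [0.0, 1.0]),
]
best = retrieve_then_rerank(
    "how long is the refund policy", [0.8, 0.2], toy_index, word_overlap,
    fetch_k=2, final_k=1,
)
```

With these toy vectors the shipping passage ranks first on cosine similarity alone, and the rerank stage moves the refund passage to the top.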

Q: What vector database should I use for RAG?

For prototyping: use pgvector with your existing PostgreSQL database — it handles up to 5 million vectors with decent performance. For production at scale: Pinecone offers the best managed experience with automatic scaling. Qdrant provides the best open-source option with advanced filtering. Weaviate excels at hybrid search combining vector and keyword matching. Chroma is popular for local development. The key factors are: query latency (under 50ms for real-time), filtering capabilities (metadata filters are essential), and operational complexity (managed vs self-hosted).

Q: How do I evaluate RAG pipeline quality?

Evaluate RAG at three levels: retrieval quality (are the right chunks retrieved?), generation quality (is the answer correct and well-formed?), and end-to-end quality (does the user get a satisfactory answer?). For retrieval: measure recall@k — what percentage of relevant chunks appear in the top-k results. For generation: use LLM-as-judge to score faithfulness (does the answer stay grounded in retrieved context?) and relevance (does it answer the question?). Tools like RAGAS, TruLens, and Phoenix provide automated RAG evaluation frameworks. Aim for >85% retrieval recall@5 and >90% faithfulness.
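Recall@k is simple enough to compute by hand. The sketch below assumes each test query comes with a list of retrieved chunk IDs (best first) and a set of IDs labeled as relevant; the function names are illustrative and not taken from RAGAS, TruLens, or Phoenix.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("each test query needs at least one relevant chunk")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over a test set of (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


test_runs = [
    (["a", "b", "c"], {"a", "c"}),  # both relevant chunks are within the top 3
    (["x", "y", "z"], {"q"}),       # the relevant chunk was never retrieved
]
score_at_3 = mean_recall_at_k(test_runs, k=3)  # (1.0 + 0.0) / 2 = 0.5
```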

Design your retrieval-augmented generation architecture interactively. Select chunking strategies, embedding models, vector databases, retrieval parameters, and LLMs. Get cost estimates, latency projections, and architecture diagrams. Optimize chunk size for your specific document types.

The Complete Guide to RAG Pipeline Architecture

Retrieval-Augmented Generation (RAG) is the dominant pattern for building AI applications that need access to private or current data. Instead of fine-tuning an LLM on your documents, RAG retrieves relevant passages at query time and includes them in the prompt as context. This approach combines the reasoning ability of LLMs with the accuracy of information retrieval, producing answers that are grounded in your actual data rather than the model's potentially outdated training knowledge.

RAG Pipeline Components

1. Document Ingestion

The ingestion pipeline transforms raw documents (PDFs, HTML, Word files, Markdown, etc.) into structured text. This step includes extraction (parsing format-specific content), cleaning (removing headers, footers, navigation elements, and boilerplate), and metadata extraction (dates, authors, document type, section headers). The quality of ingestion directly impacts retrieval quality — garbage in, garbage out. Libraries like Unstructured, LlamaIndex, and LangChain provide parsers for 50+ document formats.

2. Chunking

Chunking splits documents into passages small enough for the embedding model to process and specific enough for accurate retrieval. The chunk size is the single most impactful parameter in RAG quality. Chunks that are too small lose context and produce fragmented retrieval. Chunks that are too large dilute the relevant information with irrelevant text, which both reduces retrieval precision and increases LLM costs. Our builder lets you experiment with chunk sizes from 64 to 8192 tokens to find the optimal balance for your data.
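To make the size and overlap parameters concrete, here is a minimal sliding-window sketch. It assumes the document has already been tokenized into a list; `chunk_tokens` is an illustrative name, and in a real pipeline the tokens would come from the embedding model's tokenizer.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size tokens, repeating `overlap` tokens between chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = [f"t{i}" for i in range(10)]
windows = chunk_tokens(tokens, chunk_size=4, overlap=1)
# windows: t0..t3, t3..t6, t6..t9 (each chunk starts with the previous chunk's last token)
```

A larger overlap lowers the chance that a sentence is cut in half at a chunk boundary, at the price of more chunks to embed and store.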

3. Embedding

Each chunk is converted into a dense vector using an embedding model. The choice of embedding model affects retrieval quality, cost, and storage requirements. For most applications, text-embedding-3-small offers the best cost-to-quality ratio. For maximum retrieval accuracy, Voyage AI voyage-3-large leads current benchmarks. Open-source models like GTE-Qwen2 eliminate per-token costs for high-volume ingestion. The embedding model must match between ingestion and query time — you cannot mix embeddings from different models.
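The storage side of that choice is straightforward arithmetic. The sketch below estimates raw vector size only, assuming 32-bit floats and ignoring index structures and metadata, which add overhead that varies by database. The figure of 1536 is the default output dimension of text-embedding-3-small; the function name is illustrative.

```python
def raw_vector_storage_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of the stored vectors: chunks x dimensions x bytes per float32 value."""
    return num_chunks * dimensions * bytes_per_value


# One million chunks at 1536 dimensions:
example_bytes = raw_vector_storage_bytes(1_000_000, 1536)  # 6_144_000_000 bytes
example_gb = example_bytes / 1e9                           # about 6.1 GB
```

Halving the dimension count or the chunk count halves this figure, which is one reason chunk size and model choice interact with storage cost.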

4. Vector Storage

Vectors are stored in a vector database optimized for similarity search. The database must support fast approximate nearest neighbor (ANN) search at your scale, metadata filtering (to restrict search by date, source, user, etc.), and ideally hybrid search combining vector similarity with keyword matching. For production workloads processing more than 100 queries per second, managed databases like Pinecone provide the best operational experience. For development and smaller-scale applications, pgvector adds vector search to your existing PostgreSQL without additional infrastructure.
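Metadata filtering combined with similarity search can be illustrated without a database. The sketch below uses a brute-force scan over an in-memory list, not ANN search, and the record layout (`id`, `vector`, `metadata`) is an assumption made for this example, not a schema from any particular product.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    query_vec: list[float],
    records: list[dict],
    metadata_filter: dict,
    top_k: int = 5,
) -> list[dict]:
    """Keep records whose metadata matches every filter key, then rank by similarity."""
    eligible = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    ranked = sorted(eligible, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:top_k]


records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"source": "handbook", "year": 2024}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"source": "blog", "year": 2024}},
    {"id": "c", "vector": [0.6, 0.8], "metadata": {"source": "handbook", "year": 2023}},
]
hits = filtered_search([1.0, 0.0], records, {"source": "handbook"})
hit_ids = [r["id"] for r in hits]  # record "b" is excluded despite its high similarity
```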

5. Retrieval and Reranking

At query time, the user's question is embedded with the same model used at ingestion, and the resulting vector is used to search the vector database for the most similar chunks. The top-k results are returned. A reranking step dramatically improves quality: instead of relying solely on embedding similarity (which can be noisy), a cross-encoder model scores each candidate passage against the query more carefully. This retrieve-then-rerank pattern typically improves answer quality by 10-20% at minimal additional cost.

6. Generation

The retrieved (and optionally reranked) chunks are formatted into the LLM prompt along with the user's question. The LLM generates an answer grounded in the provided context. The prompt template matters: explicitly instruct the model to base its answer only on the provided context and to acknowledge when the context does not contain enough information to answer. This reduces hallucination significantly compared to allowing the model to supplement context with its training knowledge.
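A grounded prompt template along these lines could be assembled as follows. The wording of the instruction and the `build_prompt` name are illustrative assumptions; the exact phrasing that works best depends on the model and should be tested.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Format retrieved chunks and the question into a context-only prompt."""
    # Number each chunk so the answer can refer back to its source passage.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain enough information to answer, "
        "say so instead of guessing.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How long is the refund window?",
    ["Refunds are accepted within 30 days of purchase.", "Shipping takes 5 days."],
)
```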

Chunking Strategies Compared

Strategy	Best For	Chunk Quality	Implementation
Fixed Token Size	Homogeneous text	Medium	Simple
Recursive Character	General purpose	Good	Moderate
Sentence-Level	Q&A, support	Good	Moderate
Semantic (Section)	Structured docs	Excellent	Complex
Markdown Header	Documentation	Excellent	Simple
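Markdown header chunking is listed as simple to implement, and a rough sketch shows why. The version below treats every line that starts with one to six `#` characters as a section boundary. It is an illustration under that assumption only: it does not skip `#` lines inside fenced code blocks, and `split_by_markdown_headers` is a hypothetical name.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_markdown_headers(markdown: str) -> list[dict]:
    """Return one {"title", "text"} section per Markdown header."""
    sections: list[dict] = []
    title = None
    lines: list[str] = []

    def flush() -> None:
        # Emit the current section unless it is an empty, untitled preamble.
        if title is not None or any(line.strip() for line in lines):
            sections.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections


doc = "Intro text\n# Setup\nInstall it.\n## Options\nUse flags.\n"
sections = split_by_markdown_headers(doc)
titles = [s["title"] for s in sections]  # [None, "Setup", "Options"]
```

Keeping the header as a `title` field lets it be stored as metadata or prepended to the chunk text before embedding.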

Cost Optimization Strategies

RAG costs come from three sources: embedding (one-time per document), storage (ongoing monthly), and generation (per query). Generation dominates ongoing costs because every query triggers both a retrieval and an LLM inference call. To optimize: use smaller, cheaper LLMs for simple queries and route complex queries to premium models. Cache frequent queries and their answers. Use metadata filters to narrow the search space before vector similarity (reducing false positives). Implement prompt compression to reduce the number of context tokens sent to the LLM: techniques like LLMLingua can compress retrieved passages by 50% while preserving answer quality.
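The effect of caching and compression on generation cost can be estimated with simple arithmetic. In the sketch below every price is a made-up placeholder (not a real vendor rate), cache hits are assumed to cost nothing, and `monthly_generation_cost` is an illustrative name.

```python
def monthly_generation_cost(
    monthly_queries: int,
    top_k: int,
    chunk_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    question_tokens: int = 50,
    cache_hit_rate: float = 0.0,
) -> float:
    """Estimated monthly LLM spend; assumes cached answers incur no LLM cost."""
    billable_queries = monthly_queries * (1.0 - cache_hit_rate)
    input_tokens = top_k * chunk_tokens + question_tokens
    cost_per_query = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return billable_queries * cost_per_query


# Hypothetical prices: $1.00 per million input tokens, $4.00 per million output tokens.
baseline = monthly_generation_cost(100_000, 5, 512, 300, 1.00, 4.00)
with_cache = monthly_generation_cost(100_000, 5, 512, 300, 1.00, 4.00, cache_hit_rate=0.25)
compressed = monthly_generation_cost(100_000, 5, 256, 300, 1.00, 4.00)
```

Under these placeholder numbers the baseline comes to about 381 per month, a 25% cache hit rate brings it to about 286, and halving the retrieved context brings it to about 253. Output tokens are unaffected by compression, so the saving is smaller than the compression ratio suggests.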

Frequently Asked Questions

What is the best chunk size for RAG?

The optimal chunk size depends on your document type. For general-purpose RAG, 512-1024 tokens with 50-100 token overlap works well. Smaller chunks (256) improve precision but lose context. Larger chunks (2048+) preserve context but reduce precision. Benchmark on your data to find the optimum.

Should I use fixed-size or semantic chunking?

Fixed-size is simpler and works for homogeneous content. Semantic chunking preserves meaning better for structured documents. Recursive character splitting is the best middle ground for most production systems.

How many chunks should I retrieve for RAG context?

Retrieve 3-5 chunks (top-k) for most applications. Use a retrieve-then-rerank pattern: fetch 20-50 chunks, rerank them, then pass the top 3-5 to the LLM for a 10-20% quality improvement.

What vector database should I use for RAG?

For prototyping: pgvector. For production: Pinecone (managed), Qdrant (open-source), or Weaviate (hybrid search). Key factors: query latency, filtering capabilities, and operational complexity.

How do I evaluate RAG pipeline quality?

Evaluate at three levels: retrieval quality (recall@k), generation quality (faithfulness and relevance), and end-to-end quality. Use frameworks like RAGAS or TruLens. Aim for >85% retrieval recall@5 and >90% faithfulness.

About the Author

Built by Michael Lip, a solo developer with 10+ years of experience. 140+ PRs merged into open source projects including Google Chrome and Axios. Creator of 20+ developer tools across the Zovo network. No tracking, no ads, no data collection.