
RAG Pipeline Architecture

Give your LLM access to your data without fine-tuning. RAG bridges the gap between general-purpose language models and domain-specific knowledge.

Category: AI / Data
Complexity: Advanced
Industries: Legal, Healthcare

When You Need This

You want to build an AI assistant that answers questions about your organization's documents — contracts, policies, knowledge bases, product documentation, medical records. Fine-tuning an LLM on your data is expensive, slow, and creates a model that's frozen at the point of training. You need an architecture where the LLM can access up-to-date, domain-specific information at query time, cite its sources, and avoid hallucinating facts that aren't in your documents. RAG (Retrieval-Augmented Generation) is how you get there.

Pattern Overview

RAG augments LLM generation with retrieved context from a knowledge base. At query time, the system converts the user's question into an embedding, searches a vector database for semantically similar document chunks, and includes the most relevant chunks as context in the LLM prompt. This grounds the model's response in actual documents, enables source citation, and keeps the knowledge base updatable without retraining. A production RAG pipeline handles ingestion (parsing, chunking, embedding), retrieval (vector search, reranking, hybrid search), and generation (prompt construction, streaming, guardrails).

Reference Architecture

The architecture has two pipelines. The ingestion pipeline processes documents through parsing (PDF, DOCX, HTML extraction), chunking (semantic or fixed-size with overlap), embedding (via embedding model), and storage (vector database + document store). The query pipeline takes a user question, generates a query embedding, retrieves candidate chunks from the vector database, reranks them for relevance, constructs a prompt with the top chunks as context, and streams the LLM response with source citations.
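The two pipelines can be sketched end to end in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names (`ingest`, `retrieve`, `build_prompt`) are hypothetical. Reranking and the LLM call itself are omitted.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(docs: dict[str, str]) -> list[dict]:
    """Ingestion pipeline: split into paragraph chunks, embed, and store."""
    index = []
    for source, text in docs.items():
        paragraphs = (p.strip() for p in text.split("\n\n") if p.strip())
        for chunk_id, chunk in enumerate(paragraphs):
            index.append({"source": source, "chunk_id": chunk_id,
                          "text": chunk, "vector": embed(chunk)})
    return index


def retrieve(index: list[dict], question: str, k: int = 2) -> list[dict]:
    """Query pipeline, step 1: embed the question and rank chunks by similarity."""
    q = embed(question)
    return sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)[:k]


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Query pipeline, step 2: number the chunks so the model can cite them."""
    context = "\n".join(f"[{i}] ({c['source']}) {c['text']}"
                        for i, c in enumerate(chunks, start=1))
    return ("Answer using only the context below and cite sources by number.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

The numbered `[1] (source)` prefixes are one simple way to make source citation possible: the model can refer to a chunk number, and the application maps it back to the document.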

Core Components
  • Document Ingestion Pipeline: Multi-format parser (Apache Tika, Unstructured, or custom) that extracts text from PDFs, DOCX, HTML, Markdown, and scanned images (OCR). The chunking strategy splits documents into retrievable units. MicrocosmWorks (MW) defaults to semantic chunking (splitting at paragraph and section boundaries) with a 512-token target size and a 50-token overlap
  • Embedding Service: Converts text chunks into vector embeddings. Uses models like OpenAI text-embedding-3-large, Cohere embed-v4, or open-source alternatives (BGE, E5). Batch processing for ingestion, single-query processing for search
  • Vector Database: Stores embeddings with metadata for filtered search. Supports approximate nearest neighbor (ANN) search at scale. See Scalable Vector Database Architecture for production-scale considerations
  • Retrieval & Reranking: Two-stage retrieval — fast ANN search returns top-50 candidates, then a cross-encoder reranker (Cohere Rerank, BGE Reranker, or ColBERT) scores each candidate against the query for precise relevance ranking. Top-5 chunks go to the LLM
  • Hybrid Search: Combines vector (semantic) search with keyword (BM25) search. This catches cases where vector search misses exact terminology (product codes, legal clauses, medical terms) that keyword search handles well. Reciprocal rank fusion merges the two result sets
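The chunking default described in the list above (paragraph boundaries, a target size, and an overlap) can be sketched as follows. This is an illustrative sketch under stated assumptions: tokens are approximated by whitespace-separated words rather than a model tokenizer, and `chunk_paragraphs` is a hypothetical name, not MW's implementation.

```python
def chunk_paragraphs(text: str, target_tokens: int = 512,
                     overlap_tokens: int = 50) -> list[str]:
    """Pack whole paragraphs into chunks of roughly target_tokens words.

    Assumption: one whitespace-separated word counts as one token.
    A paragraph longer than target_tokens is kept whole, so chunk sizes vary.
    """
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    fresh = 0  # number of words in `current` that are not carried-over overlap

    for words in paragraphs:
        # Close the chunk if this paragraph would push it past the target.
        if fresh and len(current) + len(words) > target_tokens:
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one.
            current = current[-overlap_tokens:] if overlap_tokens > 0 else []
            fresh = 0
        current = current + words
        fresh += len(words)

    if fresh:
        chunks.append(current)
    return [" ".join(c) for c in chunks]
```

Splitting only at paragraph boundaries keeps sentences intact, and the overlap gives each chunk a little of the preceding context at the cost of some duplicated storage.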

Design Decisions & Trade-offs

Chunking Strategy: Fixed-Size vs. Semantic vs. Document-Structure
Fixed-size chunking (split every N tokens) is simple but breaks mid-sentence and loses document structure. Semantic chunking (split at natural boundaries — paragraphs, sections, headers) preserves context but produces variable-size chunks. Document-structure chunking (respect the document's hierarchy — chapters, sections, subsections) is best for structured documents like legal contracts or technical manuals. MW defaults to semantic chunking and switches to document-structure for highly formatted sources.
Vector Search vs. Hybrid Search
Pure vector search works well for conversational queries ("how do I handle refunds?") but fails on exact-match queries ("what's clause 7.3.2?"). Hybrid search (vector + BM25 keyword) handles both. MW recommends hybrid search for any domain with specific terminology, codes, or identifiers — which is most enterprise domains. The 10-15% additional complexity is worth the significant relevance improvement.
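Reciprocal rank fusion, the merge step for hybrid search, needs only the rank positions from each retriever, not their raw scores. A minimal sketch follows; the chunk IDs are made up for illustration, and `k = 60` is the smoothing constant commonly used with this method rather than a value taken from this article.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so chunks ranked highly by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical results from the two retrievers for the same query.
vector_hits = ["c12", "c07", "c33"]
bm25_hits = ["c33", "c12", "c90"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because the method ignores raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on different scales.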
Reranking: Cross-Encoder vs. None
Cross-encoder reranking adds 100-300ms latency but dramatically improves retrieval precision — we've measured 15-25% improvement in top-5 relevance across legal and healthcare domains. MW includes reranking by default for any RAG system where answer quality matters more than sub-second latency. For chatbots where speed is critical, we skip reranking and compensate with better chunking and prompt engineering.
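The two-stage shape (a cheap, recall-oriented first pass followed by a more precise pairwise scorer) can be sketched like this. Both stages are toy stand-ins so the example runs without any model: term overlap stands in for ANN search, and Jaccard similarity stands in for a cross-encoder. In a real system the `scorer` argument would wrap a reranking model; every name here is hypothetical.

```python
import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def first_stage(query: str, corpus: list[str], top_n: int = 50) -> list[str]:
    """Stand-in for ANN search: rank by number of shared terms, keep top_n."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:top_n]


def toy_pair_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: Jaccard similarity of the (query, doc) pair."""
    q, d = tokenize(query), tokenize(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def retrieve(query: str, corpus: list[str],
             scorer: Callable[[str, str], float] = toy_pair_score,
             top_n: int = 50, top_k: int = 5) -> list[str]:
    """Stage 1 fetches top_n candidates; stage 2 rescores each and keeps top_k."""
    candidates = first_stage(query, corpus, top_n)
    reranked = sorted(candidates, key=lambda doc: scorer(query, doc), reverse=True)
    return reranked[:top_k]


corpus = [
    "Refund policy overview with contacts, office hours, and holiday days off.",
    "Refund window: 14 days.",
    "Password rotation policy.",
]
```

With the query "refund days", the first stage ties the first two documents and leaves the long overview on top, while the pairwise scorer promotes the short, focused one. The added latency comes from the second stage scoring every candidate against the query, which is why the candidate count is capped.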
Single-Vector vs. Multi-Vector (ColBERT-style)
Single-vector embeddings are simpler and cheaper to store/search. Multi-vector representations (one vector per token, late interaction scoring) capture more nuance but require specialized infrastructure. MW uses single-vector for most deployments and reserves multi-vector for domains where retrieval quality is the bottleneck and the document corpus exceeds 100K chunks.
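The difference between the two scoring schemes is easy to see in code. In this sketch the vectors are tiny hand-written examples rather than model output, and `maxsim_score` follows the late-interaction idea used by ColBERT-style models: each query token vector is matched to its best document token vector, and the maxima are summed.

```python
Vector = list[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def single_vector_score(query_vec: Vector, doc_vec: Vector) -> float:
    """One vector per query and per chunk: a single dot product."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_vecs: list[Vector], doc_vecs: list[Vector]) -> float:
    """Late interaction: for each query token, take its best-matching
    document token, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token vectors (illustrative values only).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b_tokens = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first
```

The storage trade-off follows directly: a multi-vector index holds one vector per token instead of one per chunk, and scoring costs one dot product per (query token, document token) pair.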
[Figure: RAG Pipeline Architecture, System Architecture Overview]

Technology Choices

| Layer | Technologies |
|---|---|
| Document Parsing | Unstructured, Apache Tika, LlamaParse, Docling, custom OCR (Tesseract, AWS Textract) |
| Embedding | OpenAI text-embedding-3-large, Cohere embed-v4, BGE-M3, E5-large-v2 |
| Vector Database | Milvus, Pinecone, Qdrant, Weaviate, pgvector (for small-scale) |
| Keyword Search | Elasticsearch, OpenSearch, PostgreSQL full-text search |
| Reranking | Cohere Rerank, BGE Reranker, ColBERT v2, FlashRank |
| LLM | Claude (via AI Gateway), GPT-4, Gemini; provider-agnostic via AI SDK |
| Orchestration | LangChain, LlamaIndex, or custom pipeline (MW preference for production) |

When to Use / When to Avoid

| Use When | Avoid When |
|---|---|
| Users need answers grounded in your organization's specific documents | The knowledge base is under 50 pages (just put it in the system prompt) |
| Documents are updated frequently and the AI needs current information | You need the model to learn a new skill/behavior, not access new facts (fine-tune instead) |
| Source citation and auditability are requirements (legal, compliance, healthcare) | The questions are purely conversational and don't require factual grounding |
| Multiple user groups need access to different document subsets (permission-filtered RAG) | You're building a creative writing tool where factual accuracy isn't the goal |

Our Approach

MW builds RAG pipelines from the retrieval quality outward — we benchmark retrieval precision before touching the LLM prompt. A RAG system with mediocre retrieval and a great LLM produces confident-sounding wrong answers. Our standard pipeline includes a retrieval evaluation harness: a set of test queries with known-relevant documents, measured by MRR@5 and NDCG@10. We iterate on chunking, embedding model, and reranking until retrieval metrics hit target thresholds before optimizing generation. We've built RAG systems across legal document review, healthcare knowledge bases, and multi-language customer support — and the common lesson is that retrieval quality accounts for 80% of answer quality.
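The two metrics named above can be computed with a few lines of code. This sketch assumes binary relevance labels (a chunk is either relevant or not); graded relevance would change the gain term in NDCG. The function names and the test-set layout are illustrative, not MW's harness.

```python
import math


def mrr_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Mean reciprocal rank of the first relevant hit within the top k,
    averaged over all queries (a query with no hit contributes 0)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """NDCG for one query with binary relevance: DCG of the actual ranking
    divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Two hypothetical test queries: retrieved IDs and known-relevant IDs.
results = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"q"}]
```

Here the first query finds its relevant chunk at rank 2 and the second finds nothing, so MRR@5 is (1/2 + 0) / 2 = 0.25. Tracking these numbers per change is what makes it possible to tune chunking, embeddings, and reranking before touching the prompt.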

Related Industry Guides

  • AI for Legal — RAG applications in contract review and legal research

Related Technologies
  • AI Development
  • SaaS Development

Frequently Asked Questions

How does the pipeline handle conflicting information in retrieved documents?
MicrocosmWorks implements conflict resolution in RAG pipelines through source authority ranking, timestamp-based recency weighting, and confidence scoring that evaluates how strongly each retrieved passage supports its claim. When conflicting passages are retrieved, our pipeline presents the highest-authority answer while transparently surfacing the disagreement and source citations so users can make informed decisions. We also build feedback loops where domain experts can flag incorrect resolutions, which improves the retrieval ranking over time.
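One way to combine the three signals is a simple product of authority, recency, and support confidence. The sketch below is purely illustrative: the multiplicative formula, the exponential half-life decay, the 0 to 1 score ranges, and all names are assumptions made for this example, not the scoring MicrocosmWorks uses.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Passage:
    source: str
    text: str
    authority: float  # assumed 0..1 rank of the source (e.g. policy > wiki)
    updated: date     # last-modified date of the source document
    support: float    # assumed 0..1 confidence that the passage supports its claim


def resolution_score(p: Passage, today: date, half_life_days: float = 365.0) -> float:
    """Authority x recency x support, with recency halving every half_life_days."""
    age_days = max((today - p.updated).days, 0)
    recency = 0.5 ** (age_days / half_life_days)
    return p.authority * recency * p.support


def resolve(passages: list[Passage], today: date) -> dict:
    """Pick the top-scoring passage but keep the others so the
    disagreement can be shown to the user alongside citations."""
    ranked = sorted(passages, key=lambda p: resolution_score(p, today), reverse=True)
    return {"answer": ranked[0], "dissenting": ranked[1:]}


passages = [
    Passage("team-wiki", "Refunds take 10 days.", 0.4, date(2025, 1, 1), 0.9),
    Passage("policy.pdf", "Refunds take 14 days.", 0.9, date(2024, 6, 1), 0.8),
]
outcome = resolve(passages, today=date(2025, 1, 1))
```

In this example the older but more authoritative policy document still wins, and the wiki passage is returned as a dissenting source rather than being silently dropped.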

How do you chunk documents that mix prose, tables, and code?
MicrocosmWorks uses content-aware chunking that applies different strategies based on document structure: semantic paragraph splitting for prose, row-level or section-level chunking for tables with header context preserved, and function-level chunking for code with import statements attached. We enrich each chunk with metadata including document title, section hierarchy, and content type so the retrieval stage can apply type-specific scoring. This approach consistently outperforms naive fixed-size chunking by 25-40% on retrieval relevance benchmarks in our client projects.

How do you evaluate the quality of a RAG pipeline?
MicrocosmWorks builds evaluation harnesses that test RAG pipelines across three dimensions: retrieval relevance (are the right chunks being found), answer faithfulness (does the generated answer actually reflect the retrieved content), and answer completeness (does it address the full question). We create golden test sets with domain experts that include known-answer queries, adversarial edge cases, and questions that require multi-document synthesis. This evaluation runs automatically in CI/CD so every pipeline change is benchmarked against baseline quality metrics before deployment.

Which vector database should we use?
MicrocosmWorks selects vector databases based on your scale, query pattern, and operational requirements: Pinecone for managed simplicity, Weaviate for hybrid keyword-vector search, pgvector for teams already invested in PostgreSQL, and Qdrant for high-throughput self-hosted deployments. At scales below 10 million vectors, most options deliver sub-100ms latency, but the differences become significant at hundreds of millions of vectors where index type, quantization, and sharding strategy matter enormously. We benchmark your actual embedding dimensions and query patterns against shortlisted options during our architecture design phase.

How do you keep the index current when source documents change?
MicrocosmWorks builds incremental ingestion pipelines that watch source document repositories for changes, re-chunk and re-embed only the modified sections, and update the vector store without requiring a full reindex. We implement document fingerprinting that detects content changes at the section level, so a single paragraph edit does not trigger reprocessing of an entire 200-page document. For clients with real-time freshness requirements, we add a live retrieval layer that queries the source system directly for recently modified documents and merges those results with vector search hits.
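Section-level fingerprinting can be sketched with a content hash per section. This is a minimal illustration under assumptions: sections are already split and keyed by a stable ID, whitespace differences are treated as non-changes, and the function names are hypothetical rather than taken from MicrocosmWorks' pipeline.

```python
import hashlib


def section_fingerprints(sections: dict[str, str]) -> dict[str, str]:
    """Hash each section's whitespace-normalized text with SHA-256."""
    return {
        section_id: hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
        for section_id, text in sections.items()
    }


def diff_sections(old: dict[str, str], new: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare two fingerprint maps.

    Returns (changed, removed): sections to re-chunk and re-embed
    (new or modified), and sections whose vectors should be deleted.
    """
    changed = sorted(s for s in new if old.get(s) != new[s])
    removed = sorted(s for s in old if s not in new)
    return changed, removed


v1 = {"1": "Intro text.", "2": "Refunds take 14 days.", "4": "Old appendix."}
v2 = {"1": "Intro   text.", "2": "Refunds take 10 days.", "3": "New section."}
changed, removed = diff_sections(section_fingerprints(v1), section_fingerprints(v2))
```

Only the sections listed in `changed` need to go back through chunking and embedding, which is what keeps a one-paragraph edit from reprocessing the whole document.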
