Local-First Document RAG System with Hybrid Search & Multi-Format Support
A team building developer tools needed a fully local, privacy-preserving document intelligence system that could ingest multiple file formats, build searchable knowledge bases, and answer natural language queries using Retrieval-Augmented Generation, all without sending any data to external APIs.
The Challenge
Existing RAG solutions had significant limitations for privacy-conscious and developer-focused use cases:
- External API Dependency: Most RAG tools required sending document content to cloud-based embedding APIs, violating privacy requirements
- Limited Format Support: Solutions typically handled only plain text or PDF, ignoring spreadsheets, Word docs, HTML, and Markdown
- Poor Chunking: Naive text splitting ignored document structure (pages, sheets, headings), creating context-poor chunks
- Keyword Gaps: Pure embedding-based search missed exact keyword matches that lexical search would catch
- Spreadsheet Blindness: RAG systems couldn't handle structured tabular data or answer filtering/aggregation queries
- No Reranking: First-pass retrieval often surfaced only partially relevant results without a second-pass quality filter
Our Solution
We built a complete local-first RAG system with multi-format document ingestion, structure-aware chunking, local embedding generation, a hybrid search pipeline (semantic + full-text + recency), cross-encoder reranking, and a web-based UI, all running entirely on the user's machine.
Architecture
- Document Loaders: Format-specific parsers for PDF, DOCX, XLSX, CSV, HTML, Markdown, and plain text
- Chunker: Structure-aware splitting that preserves page, sheet, and heading boundaries
- Embeddings: Local embedding model via Transformers.js (no external API calls)
- Vector Database: LanceDB (serverless, file-based) for embedding storage and similarity search
- Full-Text Search: Trigram-based indexing for lexical matching
- Reranker: Cross-encoder model for context-aware result scoring
- Query Analyzer: Intent detection routing between semantic and structured queries
- Web Server: Express.js API with project management and search endpoints
- Frontend: Web-based UI for document upload, management, and interactive search
Document Processing Pipeline
Multi-Format Loaders
A registry pattern auto-detects file type and routes to the appropriate parser:
- PDF: Text extraction with page-level segmentation
- Word (.docx/.doc): Heading-aware parsing preserving document hierarchy
- Excel/CSV: Sheet-by-sheet parsing with header detection and row-level content
- HTML: Tag-aware extraction with structure preservation
- Markdown: Heading-based section parsing
- Plain Text: Line-based segmentation
Each loader extracts metadata (title, author, creation date, page/sheet count, word count) alongside the content, producing structured sections with source references.
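A minimal sketch of how such a registry might look, assuming detection by file extension. The names `LoaderSection`, `Loader`, and `LoaderRegistry` and the two toy loaders are illustrative assumptions, not the project's actual code; real PDF, DOCX, and XLSX loaders would wrap dedicated parsing libraries.

```typescript
// Hypothetical shapes, introduced only for this illustration.
interface LoaderSection {
  content: string;
  source: string; // e.g. "heading Intro" or "line 4"
}

type Loader = (fileName: string, raw: string) => LoaderSection[];

class LoaderRegistry {
  private loaders = new Map<string, Loader>();

  // Register one loader for one or more extensions (without the dot).
  register(extensions: string[], loader: Loader): void {
    for (const ext of extensions) {
      this.loaders.set(ext.toLowerCase(), loader);
    }
  }

  // Detect the format from the extension and return the matching loader.
  resolve(fileName: string): Loader {
    const dot = fileName.lastIndexOf(".");
    const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
    const loader = this.loaders.get(ext);
    if (!loader) {
      throw new Error(`No loader registered for "${fileName}"`);
    }
    return loader;
  }

  load(fileName: string, raw: string): LoaderSection[] {
    return this.resolve(fileName)(fileName, raw);
  }
}

// Toy Markdown loader: one section per heading.
const markdownLoader: Loader = (_name, raw) => {
  const sections: LoaderSection[] = [];
  let heading = "(no heading)";
  let buffer: string[] = [];
  const flush = (): void => {
    const content = buffer.join("\n").trim();
    if (content) sections.push({ content, source: `heading ${heading}` });
    buffer = [];
  };
  for (const line of raw.split("\n")) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush();
      heading = (match[1] ?? "").trim();
    } else {
      buffer.push(line);
    }
  }
  flush();
  return sections;
};

// Toy plain-text loader: one section per non-empty line.
const textLoader: Loader = (_name, raw) =>
  raw
    .split("\n")
    .map((line, i) => ({ content: line.trim(), source: `line ${i + 1}` }))
    .filter((s) => s.content.length > 0);

const registry = new LoaderRegistry();
registry.register(["md", "markdown"], markdownLoader);
registry.register(["txt"], textLoader);
```

Calling `registry.load("notes.md", text)` routes to the Markdown loader, while an unregistered extension raises an error instead of silently falling back to plain text.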
Structure-Aware Chunking
Unlike naive text splitting, the chunker respects document boundaries:
- Preserves page breaks (PDFs), sheet boundaries (spreadsheets), and heading hierarchy (Word/Markdown)
- Token-based sizing with configurable chunk size and overlap
- Hierarchical fallback: splits by sections first, then paragraphs, then sentences
- Each chunk retains source metadata (page number, sheet name, heading) for attribution
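The hierarchical fallback can be sketched as follows. This is an illustration under stated assumptions: whitespace-separated words stand in for model tokens, sentence detection is a naive punctuation split, overlap is omitted for brevity, and all names (`SourceSection`, `chunkSections`, and so on) are hypothetical.

```typescript
interface SourceSection {
  text: string;
  source: string; // page number, sheet name, or heading
}

interface TextChunk {
  text: string;
  source: string;
}

// Assumption: word count approximates the tokenizer's token count.
const countTokens = (text: string): number =>
  text.split(/\s+/).filter((w) => w.length > 0).length;

// Greedily pack pieces into groups that stay within maxTokens.
// A single oversized piece ends up alone in its own group.
function pack(pieces: string[], maxTokens: number, joiner: string): string[] {
  const groups: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current ? current + joiner + piece : piece;
    if (current && countTokens(candidate) > maxTokens) {
      groups.push(current);
      current = piece;
    } else {
      current = candidate;
    }
  }
  if (current) groups.push(current);
  return groups;
}

function splitText(text: string, maxTokens: number): string[] {
  if (countTokens(text) <= maxTokens) return [text];

  // Fallback 1: paragraphs (separated by blank lines).
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  // Fallback 2: sentences (naive split after ., ! or ?).
  const sentences = (text.match(/[^.!?]*[.!?]+|[^.!?]+$/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const pieces = paragraphs.length > 1 ? paragraphs : sentences;
  if (pieces.length > 1) {
    const joiner = paragraphs.length > 1 ? "\n\n" : " ";
    const result: string[] = [];
    for (const group of pack(pieces, maxTokens, joiner)) {
      for (const part of splitText(group, maxTokens)) result.push(part);
    }
    return result;
  }

  // Last resort: hard split on word boundaries.
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = Math.max(1, maxTokens);
  const result: string[] = [];
  for (let i = 0; i < words.length; i += step) {
    result.push(words.slice(i, i + step).join(" "));
  }
  return result;
}

// Sections are never merged, so each chunk keeps exactly one source.
function chunkSections(sections: SourceSection[], maxTokens: number): TextChunk[] {
  const chunks: TextChunk[] = [];
  for (const section of sections) {
    const text = section.text.trim();
    if (!text) continue;
    for (const part of splitText(text, maxTokens)) {
      chunks.push({ text: part, source: section.source });
    }
  }
  return chunks;
}
```

Because splitting happens inside each section and never across sections, every chunk can be attributed to a single page, sheet, or heading.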
Embedding & Indexing
Local Embedding Model
- Runs entirely locally via Transformers.js, so no data leaves the machine
- Quantized model for performance optimization
- Batch embedding for efficient bulk processing
- Automatic truncation of over-long input at word boundaries, with L2 normalization of the resulting vectors
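The two post-processing steps in the last bullet can be sketched independently of the model. This is an illustration only: the character-based limit is an assumption (a real pipeline would more likely cap by token count), and `truncateAtWordBoundary` and `l2Normalize` are hypothetical helper names.

```typescript
// Cut text to at most maxChars without splitting a word.
function truncateAtWordBoundary(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  // Look one character past the limit so a word ending exactly at the limit survives.
  const window = text.slice(0, maxChars + 1);
  const lastSpace = window.lastIndexOf(" ");
  if (lastSpace <= 0) {
    // No usable space: fall back to a hard cut.
    return text.slice(0, maxChars);
  }
  return window.slice(0, lastSpace).replace(/\s+$/, "");
}

// Scale a vector to unit length (L2 norm of 1).
function l2Normalize(vector: number[]): number[] {
  let sumSquares = 0;
  for (const v of vector) sumSquares += v * v;
  const norm = Math.sqrt(sumSquares);
  if (norm === 0) return vector.slice(); // a zero vector cannot be normalized
  return vector.map((v) => v / norm);
}
```

With unit-length vectors, cosine similarity reduces to a plain dot product, which keeps similarity scores comparable across chunks of different lengths.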
Vector Storage
LanceDB provides serverless vector storage:
- File-based (no separate database server needed)
- Per-project isolation with independent indices
- SHA256-based cache keys for deduplication
- Metadata stored alongside vectors for filtered retrieval
Hybrid Search Pipeline
The retrieval pipeline combines three ranking signals for better results than any single approach:
Signal 1: Embedding Search (Semantic)
Vector similarity search finds chunks with related meaning even when different words are used. Handles paraphrasing, synonyms, and conceptual queries.
Signal 2: Full-Text Search (Lexical)
Trigram-based indexing with Jaccard similarity catches exact keyword matches that embedding search might miss, which matters for technical terms, names, and identifiers.
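As an illustration of the lexical signal, the sketch below extracts character trigrams and compares two trigram sets with Jaccard similarity (intersection size divided by union size). Lowercasing, whitespace collapsing, and the absence of padding are assumptions made here; the actual index may normalize text differently.

```typescript
// Collect the set of 3-character substrings of the normalized text.
function trigrams(text: string): Set<string> {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= normalized.length; i++) {
    grams.add(normalized.slice(i, i + 3));
  }
  return grams;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 0 for two empty sets.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  a.forEach((gram) => {
    if (b.has(gram)) intersection++;
  });
  const union = a.size + b.size - intersection;
  return intersection / union;
}

// Convenience wrapper for scoring a query against a chunk.
function lexicalScore(query: string, chunk: string): number {
  return jaccard(trigrams(query), trigrams(chunk));
}
```

For example, "abcd" and "abce" share one trigram ("abc") out of three distinct ones, giving a score of 1/3, while an identifier that appears verbatim in both query and chunk contributes all of its trigrams to the intersection.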
Signal 3: Recency Boost
Exponential decay weighting favors recently accessed or modified documents, ensuring up-to-date information surfaces first.
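One common way to express exponential decay is through a half-life: the score halves every time the document ages by that amount. The sketch below assumes this parameterization; the half-life value in the usage note and the function name `recencyScore` are hypothetical.

```typescript
// Score in (0, 1]: 1 for a document touched just now, 0.5 after one half-life,
// 0.25 after two, and so on. Negative ages are clamped to 0.
function recencyScore(ageDays: number, halfLifeDays: number): number {
  if (halfLifeDays <= 0) {
    throw new Error("halfLifeDays must be positive");
  }
  const age = Math.max(0, ageDays);
  return Math.exp((-Math.LN2 * age) / halfLifeDays);
}
```

With an assumed half-life of 30 days, a document modified 60 days ago would score about 0.25, so it still contributes but ranks below equally relevant fresh content.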
Score Combination
Signals are combined with configurable weights (default: 50% semantic, 25% lexical, 25% recency), normalized, and filtered by a minimum score threshold.
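A sketch of the combination step, assuming each signal is already scaled to the range 0 to 1 and that "normalized" means dividing by the sum of the weights so custom weights need not add up to 1. The interfaces and the `minScore` parameter name are illustrative assumptions.

```typescript
interface SignalScores {
  semantic: number;
  lexical: number;
  recency: number;
}

// Default weights taken from the text: 50% semantic, 25% lexical, 25% recency.
const DEFAULT_WEIGHTS: SignalScores = { semantic: 0.5, lexical: 0.25, recency: 0.25 };

// Weighted average of the three signals.
function combineSignals(signals: SignalScores, weights: SignalScores = DEFAULT_WEIGHTS): number {
  const total = weights.semantic + weights.lexical + weights.recency;
  if (total <= 0) return 0;
  const weighted =
    weights.semantic * signals.semantic +
    weights.lexical * signals.lexical +
    weights.recency * signals.recency;
  return weighted / total;
}

interface ScoredCandidate {
  id: string;
  signals: SignalScores;
}

interface RankedResult {
  id: string;
  score: number;
}

// Score every candidate, drop those below minScore, sort best first.
function rankCandidates(
  candidates: ScoredCandidate[],
  minScore: number,
  weights: SignalScores = DEFAULT_WEIGHTS
): RankedResult[] {
  return candidates
    .map((c) => ({ id: c.id, score: combineSignals(c.signals, weights) }))
    .filter((r) => r.score >= minScore)
    .sort((a, b) => b.score - a.score);
}
```

With the default weights, a chunk scoring 0.8 semantic, 0.4 lexical, and 1.0 recency combines to 0.4 + 0.1 + 0.25 = 0.75.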
Cross-Encoder Reranking
After initial retrieval, a cross-encoder model re-scores the top candidates:
- Context-aware scoring considers query-document pairs together (not independently)
- Keyword boost calculation for term overlap
- Blended scoring (cross-encoder + keyword signals)
- Produces a final ranked list with higher precision than first-pass retrieval alone
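The blending step might look like the sketch below. The cross-encoder itself is represented by a caller-supplied function (`PairScorer`, a hypothetical type assumed to return a relevance score between 0 and 1 for a query-passage pair); the 0.8 blend weight and the term-overlap definition of the keyword boost are likewise assumptions for illustration.

```typescript
// Hypothetical stand-in for the cross-encoder: scores a (query, passage) pair jointly.
type PairScorer = (query: string, passage: string) => number;

const splitTerms = (text: string): string[] =>
  text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);

// Fraction of distinct query terms that also appear in the passage.
function keywordBoost(query: string, passage: string): number {
  const queryTerms = new Set(splitTerms(query));
  if (queryTerms.size === 0) return 0;
  const passageTerms = new Set(splitTerms(passage));
  let hits = 0;
  queryTerms.forEach((term) => {
    if (passageTerms.has(term)) hits++;
  });
  return hits / queryTerms.size;
}

interface RerankedPassage {
  passage: string;
  score: number;
}

// Blend the cross-encoder score with the keyword boost and sort best first.
function rerank(
  query: string,
  passages: string[],
  scorePair: PairScorer,
  crossWeight: number = 0.8 // assumed default, not taken from the source
): RerankedPassage[] {
  return passages
    .map((passage) => ({
      passage,
      score:
        crossWeight * scorePair(query, passage) +
        (1 - crossWeight) * keywordBoost(query, passage),
    }))
    .sort((a, b) => b.score - a.score);
}
```

Keeping the scorer as a parameter means the same blending logic can be exercised with a stub in tests and with a real local model in production.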
Structured Data Support
For spreadsheet content, the system provides additional capabilities:
- Auto-detection of column types (numeric, date, boolean, string)
- Natural language filtering (e.g., "employees in engineering with salary above threshold")
- Aggregation support (count, sum, average, min, max)
- Query analyzer routes structured queries to a dedicated engine rather than embedding search
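To make the structured path concrete, here is a sketch of column type detection, filtering, and aggregation over rows parsed from a sheet. It assumes the natural language query has already been translated into `RowFilter` objects (that translation is not shown), that dates use ISO `YYYY-MM-DD` form, and that cells arrive as strings; every type and function name here is hypothetical.

```typescript
type ColumnType = "numeric" | "date" | "boolean" | "string";
type SheetRow = Record<string, string>;
type Aggregation = "count" | "sum" | "average" | "min" | "max";

// Infer a column type from its non-empty cell values.
function detectColumnType(values: string[]): ColumnType {
  const present = values.map((v) => v.trim()).filter((v) => v.length > 0);
  if (present.length === 0) return "string";
  if (present.every((v) => /^(true|false)$/i.test(v))) return "boolean";
  if (present.every((v) => /^-?\d+(\.\d+)?$/.test(v))) return "numeric";
  if (present.every((v) => /^\d{4}-\d{2}-\d{2}$/.test(v) && !isNaN(Date.parse(v)))) {
    return "date";
  }
  return "string";
}

interface RowFilter {
  column: string;
  op: "eq" | "gt" | "lt";
  value: string | number;
}

// Keep rows that satisfy every filter ("eq" is case-insensitive).
function applyFilters(rows: SheetRow[], filters: RowFilter[]): SheetRow[] {
  return rows.filter((row) =>
    filters.every((f) => {
      const cell = (row[f.column] ?? "").trim();
      if (f.op === "eq") return cell.toLowerCase() === String(f.value).toLowerCase();
      if (cell === "") return false;
      const n = Number(cell);
      const target = Number(f.value);
      return f.op === "gt" ? n > target : n < target;
    })
  );
}

// Aggregate one column; non-numeric and empty cells are skipped.
function aggregate(rows: SheetRow[], column: string, op: Aggregation): number {
  if (op === "count") return rows.length;
  const nums = rows
    .map((r) => (r[column] ?? "").trim())
    .filter((v) => v !== "")
    .map((v) => Number(v))
    .filter((n) => !isNaN(n));
  if (nums.length === 0) return NaN;
  const sum = nums.reduce((a, b) => a + b, 0);
  if (op === "sum") return sum;
  if (op === "average") return sum / nums.length;
  if (op === "min") return Math.min(...nums);
  return Math.max(...nums);
}
```

A query such as "average salary of employees in engineering" would then become a filter on the department column followed by an `average` aggregation on the salary column, with no embedding search involved.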
Web Interface
- Project Management: Create, update, and delete knowledge base projects
- Document Upload: Drag-and-drop file upload with format auto-detection
- Document Creation: Create documents from text directly in the UI
- Interactive Search: Natural language query interface with ranked results
- Statistics: Index size, document count, and format distribution per project
Key Features
- Fully Local: All processing on-device; no external API calls for embeddings or search
- 9 Input Formats: PDF, DOCX, DOC, XLSX, XLS, CSV, HTML, Markdown, plain text
- Structure-Aware Chunking: Preserves pages, sheets, and headings as chunk boundaries
- Hybrid Search: Combines semantic, lexical, and recency signals for better retrieval
- Cross-Encoder Reranking: Second-pass scoring for higher-precision results
- Structured Queries: Natural language filtering and aggregation on spreadsheet data
- Serverless Vector DB: LanceDB file-based storage with no infrastructure overhead
- Document Writing: Export capabilities for PDF, DOCX, and XLSX creation
- Project Isolation: Independent knowledge bases with separate indices
- Web UI: Complete interface for document management and interactive search