Local-First Document RAG System with Hybrid Search & Multi-Format Support
A team building developer tools needed a fully local, privacy-preserving document intelligence system that could ingest multiple file formats, build searchable knowledge bases, and answer natural language queries using Retrieval-Augmented Generation, all without sending any data to external APIs.
The Challenge
Existing RAG solutions had significant limitations for privacy-conscious and developer-focused use cases:
- External API Dependency: Most RAG tools required sending document content to cloud-based embedding APIs, violating privacy requirements
- Limited Format Support: Solutions typically handled only plain text or PDF, ignoring spreadsheets, Word docs, HTML, and Markdown
- Poor Chunking: Naive text splitting ignored document structure (pages, sheets, headings), creating context-poor chunks
- Keyword Gaps: Pure embedding-based search missed exact keyword matches that lexical search would catch
- Spreadsheet Blindness: RAG systems couldn't handle structured tabular data or answer filtering/aggregation queries
- No Reranking: First-pass retrieval often surfaced only partially relevant results without a second-pass quality filter
Our Solution
We built a complete local-first RAG system with multi-format document ingestion, structure-aware chunking, local embedding generation, a hybrid search pipeline (semantic + full-text + recency), cross-encoder reranking, and a web-based UI, all running entirely on the user's machine.
Architecture
- Document Loaders: Format-specific parsers for PDF, DOCX, XLSX, CSV, HTML, Markdown, and plain text
- Chunker: Structure-aware splitting that preserves page, sheet, and heading boundaries
- Embeddings: Local embedding model via Transformers.js (no external API calls)
- Vector Database: LanceDB (serverless, file-based) for embedding storage and similarity search
- Full-Text Search: Trigram-based indexing for lexical matching
- Reranker: Cross-encoder model for context-aware result scoring
- Query Analyzer: Intent detection routing between semantic and structured queries
- Web Server: Express.js API with project management and search endpoints
- Frontend: Web-based UI for document upload, management, and interactive search
Document Processing Pipeline
Multi-Format Loaders
A registry pattern auto-detects file type and routes to the appropriate parser:
- PDF: Text extraction with page-level segmentation
- Word (.docx/.doc): Heading-aware parsing preserving document hierarchy
- Excel/CSV: Sheet-by-sheet parsing with header detection and row-level content
- HTML: Tag-aware extraction with structure preservation
- Markdown: Heading-based section parsing
- Plain Text: Line-based segmentation
Each loader extracts metadata (title, author, creation date, page/sheet count, word count) alongside the content, producing structured sections with source references.
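The registry idea above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Section`, `Loader`, and `LoaderRegistry` names are hypothetical, detection by file extension is an assumption (the source only says the type is auto-detected), and only the two simplest loaders (Markdown and plain text) are shown.

```typescript
// Hypothetical types: the source does not publish its loader interface.
interface Section {
  content: string;
  source: string; // e.g. "heading Intro" or "line 12"
}

type Loader = (fileName: string, text: string) => Section[];

class LoaderRegistry {
  private loaders = new Map<string, Loader>();

  register(extensions: string[], loader: Loader): void {
    for (const ext of extensions) this.loaders.set(ext.toLowerCase(), loader);
  }

  // Assumption: the file type is inferred from the extension.
  resolve(fileName: string): Loader {
    const dot = fileName.lastIndexOf(".");
    const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
    const loader = this.loaders.get(ext);
    if (!loader) throw new Error(`No loader registered for "${fileName}"`);
    return loader;
  }
}

const registry = new LoaderRegistry();

// Markdown: start a new section at every heading line.
registry.register(["md", "markdown"], (_name, text) => {
  const sections: Section[] = [];
  let heading = "(preamble)";
  let buffer: string[] = [];
  const flush = (): void => {
    const content = buffer.join("\n").trim();
    if (content) sections.push({ content, source: `heading ${heading}` });
    buffer = [];
  };
  for (const line of text.split("\n")) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush();
      heading = (match[1] ?? "").trim();
    } else {
      buffer.push(line);
    }
  }
  flush();
  return sections;
});

// Plain text: one section per non-empty line.
registry.register(["txt"], (_name, text) =>
  text
    .split("\n")
    .map((line, i) => ({ content: line.trim(), source: `line ${i + 1}` }))
    .filter((section) => section.content.length > 0)
);
```

Calling `registry.resolve("notes.md")` returns the Markdown loader, and an unregistered extension fails loudly rather than falling back to a guess. Binary formats such as PDF or XLSX would plug into the same interface through whatever parsing library the project uses.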
Structure-Aware Chunking
Unlike naive text splitting, the chunker respects document boundaries:
- Preserves page breaks (PDFs), sheet boundaries (spreadsheets), and heading hierarchy (Word/Markdown)
- Token-based sizing with configurable chunk size and overlap
- Hierarchical fallback: splits by sections first, then paragraphs, then sentences
- Each chunk retains source metadata (page number, sheet name, heading) for attribution
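A rough sketch of the hierarchical fallback described above, under several assumptions: whitespace-separated words stand in for model tokens, the `Section` and `Chunk` shapes are hypothetical, and overlap between neighbouring chunks is left out to keep the example short. Sections are never merged with each other, so page, sheet, and heading boundaries survive.

```typescript
// Hypothetical shapes; the source does not publish its chunk format.
interface Section { content: string; source: string }
interface Chunk { text: string; source: string }

// Assumption: whitespace-separated words approximate model tokens.
const countTokens = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

// Coarse-to-fine splitters: paragraphs first, then sentences.
const splitters: Array<(text: string) => string[]> = [
  (text) => text.split(/\n{2,}/),
  (text) => text.match(/[^.!?]+[.!?]*/g) ?? [text],
];

// Recursively split until every piece fits the token budget.
function splitToFit(text: string, maxTokens: number, level = 0): string[] {
  const trimmed = text.trim();
  if (trimmed === "") return [];
  if (countTokens(trimmed) <= maxTokens) return [trimmed];
  const splitter = splitters[level];
  const pieces: string[] = [];
  if (!splitter) {
    // Last resort: hard split on word count.
    const words = trimmed.split(/\s+/);
    for (let i = 0; i < words.length; i += maxTokens) {
      pieces.push(words.slice(i, i + maxTokens).join(" "));
    }
    return pieces;
  }
  for (const part of splitter(trimmed)) {
    pieces.push(...splitToFit(part, maxTokens, level + 1));
  }
  return pieces;
}

// Greedily pack adjacent pieces of one section into chunks.
function chunkSections(sections: Section[], maxTokens: number): Chunk[] {
  if (maxTokens < 1) throw new RangeError("maxTokens must be at least 1");
  const chunks: Chunk[] = [];
  for (const section of sections) {
    let current = "";
    for (const piece of splitToFit(section.content, maxTokens)) {
      const candidate = current ? `${current} ${piece}` : piece;
      if (current && countTokens(candidate) > maxTokens) {
        chunks.push({ text: current, source: section.source });
        current = piece;
      } else {
        current = candidate;
      }
    }
    if (current) chunks.push({ text: current, source: section.source });
  }
  return chunks;
}
```

Each chunk carries the `source` of the section it came from, which is what makes attribution to a page, sheet, or heading possible at answer time.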
Embedding & Indexing
Local Embedding Model
- Runs entirely locally via Transformers.js, so no data leaves the machine
- Quantized model for performance optimization
- Batch embedding for efficient bulk processing
- Automatic truncation of over-long input at word boundaries, with L2 normalization of the resulting vectors
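The two preprocessing steps in the last bullet can be illustrated without the model itself. This is a sketch under assumptions: the function names are hypothetical, and the length limit is expressed in characters for simplicity, whereas the real system may well count tokens.

```typescript
// Cut text to at most maxChars without splitting a word in half.
// Assumption: a character budget stands in for the model's token limit.
function truncateAtWordBoundary(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  const cut = text.slice(0, maxChars);
  // If the cut already lands between two words, keep it as is.
  if (/\s/.test(text.charAt(maxChars))) return cut.replace(/\s+$/, "");
  const lastSpace = cut.lastIndexOf(" ");
  // A single over-long word has no boundary to fall back to.
  return lastSpace > 0 ? cut.slice(0, lastSpace).replace(/\s+$/, "") : cut;
}

// Scale a vector to unit length so dot product equals cosine similarity.
function l2Normalize(vector: number[]): number[] {
  const norm = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? vector.slice() : vector.map((x) => x / norm);
}
```

Normalizing once at indexing time means later similarity search can use a plain dot product, since unit-length vectors make it equivalent to cosine similarity.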
Vector Storage
LanceDB provides serverless vector storage:
- File-based (no separate database server needed)
- Per-project isolation with independent indices
- SHA256-based cache keys for deduplication
- Metadata stored alongside vectors for filtered retrieval
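One plausible way to realise the SHA256-based deduplication is to hash the chunk text and skip the embedding call when the key has been seen before. The sketch below uses Node's built-in `crypto` module; the key layout (model id plus whitespace-normalized text), the in-memory `Map`, and the function names are all assumptions made for illustration, and the `embed` callback stands in for the real model.

```typescript
import { createHash } from "node:crypto";

// Hypothetical key layout: model id plus whitespace-normalized chunk text.
function embeddingCacheKey(modelId: string, chunkText: string): string {
  const normalized = chunkText.replace(/\s+/g, " ").trim();
  return createHash("sha256")
    .update(modelId, "utf8")
    .update("\n")
    .update(normalized, "utf8")
    .digest("hex");
}

// Embed each distinct chunk once; repeats are served from the cache.
function embedWithCache(
  chunks: string[],
  modelId: string,
  cache: Map<string, number[]>,
  embed: (text: string) => number[] // stand-in for the real embedding call
): number[][] {
  return chunks.map((chunk) => {
    const key = embeddingCacheKey(modelId, chunk);
    let vector = cache.get(key);
    if (!vector) {
      vector = embed(chunk);
      cache.set(key, vector);
    }
    return vector;
  });
}
```

Including the model id in the key means that switching embedding models invalidates old entries instead of silently reusing vectors from a different vector space.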
Hybrid Search Pipeline
The retrieval pipeline combines three ranking signals for better results than any single approach:
Signal 1: Embedding Search (Semantic)
Vector similarity search finds chunks with related meaning even when different words are used. Handles paraphrasing, synonyms, and conceptual queries.
Signal 2: Full-Text Search (Lexical)
Trigram-based indexing with Jaccard similarity catches exact keyword matches that embedding search might miss, which matters for technical terms, names, and identifiers.
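As an illustration of the lexical signal, trigram Jaccard similarity can be computed as below. This is a minimal sketch: the function names are hypothetical, and the normalization (lowercasing, collapsing whitespace, no padding) is an assumption rather than the project's documented behaviour.

```typescript
// Break text into overlapping 3-character windows.
function trigrams(text: string): Set<string> {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= normalized.length; i++) {
    grams.add(normalized.slice(i, i + 3));
  }
  return grams;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, in the range [0, 1].
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  a.forEach((gram) => {
    if (b.has(gram)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}

function lexicalScore(query: string, chunk: string): number {
  return jaccard(trigrams(query), trigrams(chunk));
}
```

For example, `lexicalScore("abcd", "bcde")` compares `{abc, bcd}` with `{bcd, cde}`: one shared trigram out of three distinct ones, giving 1/3. Because the comparison works on character windows, an identifier with a small typo still scores well above zero.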
Signal 3: Recency Boost
Exponential decay weighting favors recently accessed or modified documents, ensuring up-to-date information surfaces first.
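Exponential decay of this kind is commonly written as a half-life formula: the score halves every fixed number of days since the document was last touched. The sketch below assumes that form; the function name and the 30-day default are hypothetical, since the source does not state its decay constant.

```typescript
const MS_PER_DAY = 86400000;

// Score in (0, 1]: 1 for a document touched right now, halving every
// halfLifeDays. The 30-day default is an illustrative assumption.
function recencyScore(
  lastTouchedMs: number,
  nowMs: number,
  halfLifeDays = 30
): number {
  // Clamp so that timestamps in the future do not score above 1.
  const ageDays = Math.max(0, nowMs - lastTouchedMs) / MS_PER_DAY;
  return Math.pow(0.5, ageDays / halfLifeDays);
}
```

With the default half-life, a document modified 30 days ago scores 0.5 and one modified 60 days ago scores 0.25, so old material is down-weighted but never excluded outright.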
Score Combination
Signals are combined with configurable weights (default: 50% semantic, 25% lexical, 25% recency), normalized, and filtered by a minimum score threshold.
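A weighted combination along these lines might look like the following. The default weights come from the text above; everything else is assumed for illustration: the type and function names are hypothetical, each signal is taken to be already in [0, 1], normalization is done by dividing by the sum of the weights, and the 0.2 threshold is a placeholder.

```typescript
interface SignalScores {
  id: string;
  semantic: number; // each signal assumed to lie in [0, 1]
  lexical: number;
  recency: number;
}

interface Weights { semantic: number; lexical: number; recency: number }

// Defaults as stated in the text: 50% semantic, 25% lexical, 25% recency.
const DEFAULT_WEIGHTS: Weights = { semantic: 0.5, lexical: 0.25, recency: 0.25 };

function combineScores(
  candidates: SignalScores[],
  weights: Weights = DEFAULT_WEIGHTS,
  minScore = 0.2 // hypothetical threshold
): Array<{ id: string; score: number }> {
  // Dividing by the weight total keeps the score in [0, 1]
  // even when custom weights do not sum to 1.
  const total = weights.semantic + weights.lexical + weights.recency;
  if (total <= 0) throw new RangeError("weights must sum to a positive number");
  return candidates
    .map((c) => ({
      id: c.id,
      score:
        (weights.semantic * c.semantic +
          weights.lexical * c.lexical +
          weights.recency * c.recency) / total,
    }))
    .filter((result) => result.score >= minScore)
    .sort((a, b) => b.score - a.score);
}
```

As a worked case, a chunk with semantic 0.8, lexical 0.4, and recency 1.0 scores 0.5 × 0.8 + 0.25 × 0.4 + 0.25 × 1.0 = 0.75, while one with 0.1, 0.0, and 0.2 scores 0.10 and is dropped by the threshold.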
Cross-Encoder Reranking
After initial retrieval, a cross-encoder model re-scores the top candidates:
- Context-aware scoring considers query-document pairs together (not independently)
- Keyword boost calculation for term overlap
- Blended scoring (cross-encoder + keyword signals)
- Produces a final ranked list with higher precision than first-pass retrieval alone
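The blending step in the list above can be sketched independently of any particular model. In this illustration the cross-encoder is passed in as a plain function (`scorePair`, a hypothetical stand-in kept synchronous for brevity; a real model call would normally be asynchronous), the keyword boost is the fraction of distinct query terms found in the passage, and the 0.2 blend weight is an assumption.

```typescript
// Stand-in for a cross-encoder: scores a (query, passage) pair in [0, 1].
type PairScorer = (query: string, passage: string) => number;

// ASCII-only tokenizer; a real system would need Unicode-aware splitting.
const tokenize = (text: string): string[] =>
  text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// Fraction of distinct query terms that also occur in the passage.
function keywordOverlap(query: string, passage: string): number {
  const queryTerms = Array.from(new Set(tokenize(query)));
  if (queryTerms.length === 0) return 0;
  const passageTerms = new Set(tokenize(passage));
  return queryTerms.filter((t) => passageTerms.has(t)).length / queryTerms.length;
}

// Blend the model score with the keyword boost and sort best-first.
function rerank(
  query: string,
  passages: string[],
  scorePair: PairScorer,
  keywordWeight = 0.2 // hypothetical blend weight
): Array<{ passage: string; score: number }> {
  return passages
    .map((passage) => ({
      passage,
      score:
        (1 - keywordWeight) * scorePair(query, passage) +
        keywordWeight * keywordOverlap(query, passage),
    }))
    .sort((a, b) => b.score - a.score);
}
```

Because the model sees the query and passage together, it can judge relevance that independent embeddings miss; the keyword term then acts as a tie-breaker in favour of passages that literally contain the query's terms.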
Structured Data Support
For spreadsheet content, the system provides additional capabilities:
- Auto-detection of column types (numeric, date, boolean, string)
- Natural language filtering (e.g., "employees in engineering with salary above threshold")
- Aggregation support (count, sum, average, min, max)
- Query analyzer routes structured queries to a dedicated engine rather than embedding search
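To make the column typing and aggregation concrete, here is a small sketch. It assumes cells arrive as strings (as they would from a CSV), recognises only ISO `YYYY-MM-DD` dates, and uses hypothetical names throughout; translating a natural language question into the filter shown at the end is the query analyzer's job and is not modelled here.

```typescript
type ColumnType = "numeric" | "date" | "boolean" | "string";

// Infer a column's type from its non-empty cells.
function detectColumnType(values: string[]): ColumnType {
  const cells = values.map((v) => v.trim()).filter((v) => v !== "");
  if (cells.length === 0) return "string";
  if (cells.every((v) => /^(true|false)$/i.test(v))) return "boolean";
  if (cells.every((v) => /^-?\d+(\.\d+)?$/.test(v))) return "numeric";
  // Assumption: only ISO dates (YYYY-MM-DD) are recognised.
  if (cells.every((v) => /^\d{4}-\d{2}-\d{2}$/.test(v) && !isNaN(Date.parse(v)))) {
    return "date";
  }
  return "string";
}

type Aggregate = "count" | "sum" | "average" | "min" | "max";

function aggregate(values: number[], op: Aggregate): number {
  if (op === "count") return values.length;
  if (values.length === 0) return NaN; // no meaningful result on empty input
  const sum = values.reduce((a, b) => a + b, 0);
  switch (op) {
    case "sum": return sum;
    case "average": return sum / values.length;
    case "min": return Math.min(...values);
    case "max": return Math.max(...values);
  }
}

// Illustrative data for "engineering employees with salary above 100000".
type Row = Record<string, string>;
const rows: Row[] = [
  { name: "Ada", department: "engineering", salary: "120000" },
  { name: "Lin", department: "engineering", salary: "95000" },
  { name: "Sam", department: "sales", salary: "130000" },
  { name: "Kim", department: "engineering", salary: "110000" },
];
const matches = rows.filter(
  (r) => r["department"] === "engineering" && Number(r["salary"]) > 100000
);
const averageSalary = aggregate(matches.map((r) => Number(r["salary"])), "average");
```

Here two rows match (Ada and Kim) and `averageSalary` is 115000. Answering this kind of question by exact filtering rather than by embedding similarity is what lets the system return correct counts and sums instead of approximate text matches.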
Web Interface
- Project Management: Create, update, and delete knowledge base projects
- Document Upload: Drag-and-drop file upload with format auto-detection
- Document Creation: Create documents from text directly in the UI
- Interactive Search: Natural language query interface with ranked results
- Statistics: Index size, document count, and format distribution per project
Key Features
- Fully Local: All processing on-device; no external API calls for embeddings or search
- 9 Input Formats: PDF, DOCX, DOC, XLSX, XLS, CSV, HTML, Markdown, plain text
- Structure-Aware Chunking: Preserves pages, sheets, and headings as chunk boundaries
- Hybrid Search: Combines semantic, lexical, and recency signals for better retrieval
- Cross-Encoder Reranking: Second-pass scoring for higher-precision results
- Structured Queries: Natural language filtering and aggregation on spreadsheet data
- Serverless Vector DB: LanceDB file-based storage with no infrastructure overhead
- Document Writing: Export capabilities for PDF, DOCX, and XLSX creation
- Project Isolation: Independent knowledge bases with separate indices
- Web UI: Complete interface for document management and interactive search