Document Intelligence · Published June 18, 2026 · Updated May 25, 2026

Local-First Document RAG System with Hybrid Search & Multi-Format Support

A team building developer tools needed a fully local, privacy-preserving document intelligence system that could ingest multiple file formats, build searchable knowledge bases, and answer natural language queries using Retrieval-Augmented Generation — without sending any data to external APIs.

The Challenge

Existing RAG solutions had significant limitations for privacy-conscious and developer-focused use cases:

External API Dependency — Most RAG tools required sending document content to cloud-based embedding APIs, violating privacy requirements
Limited Format Support — Solutions typically handled only plain text or PDF, ignoring spreadsheets, Word docs, HTML, and Markdown
Poor Chunking — Naive text splitting ignored document structure (pages, sheets, headings), creating context-poor chunks
Keyword Gaps — Pure embedding-based search missed exact keyword matches that lexical search would catch
Spreadsheet Blindness — RAG systems couldn't handle structured tabular data or answer filtering/aggregation queries
No Reranking — First-pass retrieval often surfaced only partially relevant results without a second-pass quality filter

Our Solution

We built a complete local-first RAG system with multi-format document ingestion, structure-aware chunking, local embedding generation, a hybrid search pipeline (semantic + full-text + recency), cross-encoder reranking, and a web-based UI — all running entirely on the user's machine.

Architecture

Document Loaders: Format-specific parsers for PDF, DOCX, XLSX, CSV, HTML, Markdown, and plain text
Chunker: Structure-aware splitting that preserves page, sheet, and heading boundaries
Embeddings: Local embedding model via Transformers.js (no external API calls)
Vector Database: LanceDB (serverless, file-based) for embedding storage and similarity search
Full-Text Search: Trigram-based indexing for lexical matching
Reranker: Cross-encoder model for context-aware result scoring
Query Analyzer: Intent detection routing between semantic and structured queries
Web Server: Express.js API with project management and search endpoints
Frontend: Web-based UI for document upload, management, and interactive search

Document Processing Pipeline

Multi-Format Loaders

A registry pattern auto-detects file type and routes to the appropriate parser:

PDF — Text extraction with page-level segmentation
Word (.docx/.doc) — Heading-aware parsing preserving document hierarchy
Excel/CSV — Sheet-by-sheet parsing with header detection and row-level content
HTML — Tag-aware extraction with structure preservation
Markdown — Heading-based section parsing
Plain Text — Line-based segmentation

Each loader extracts metadata (title, author, creation date, page/sheet count, word count) alongside the content, producing structured sections with source references.
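
The registry idea can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `Section`, `Loader`, and `LoaderRegistry` names are hypothetical, detection is assumed to be by file extension only, and the two toy loaders work on already-decoded text.

```typescript
// Hypothetical types: the case study does not publish its loader interface.
interface Section {
  content: string;
  source: string; // e.g. "heading Intro" or "line 4"
}

type Loader = (fileName: string, data: string) => Section[];

class LoaderRegistry {
  private loaders = new Map<string, Loader>();

  register(extensions: string[], loader: Loader): void {
    for (const ext of extensions) {
      this.loaders.set(ext.toLowerCase(), loader);
    }
  }

  // Assumption: the format is detected from the file extension alone.
  resolve(fileName: string): Loader {
    const dot = fileName.lastIndexOf(".");
    const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
    const loader = this.loaders.get(ext);
    if (!loader) throw new Error(`No loader registered for ".${ext}"`);
    return loader;
  }

  load(fileName: string, data: string): Section[] {
    return this.resolve(fileName)(fileName, data);
  }
}

// Toy Markdown loader: starts a new section at each heading line.
const markdownLoader: Loader = (_fileName, data) => {
  const sections: Section[] = [];
  let heading = "(preamble)";
  let lines: string[] = [];
  const flush = (): void => {
    const content = lines.join("\n").trim();
    if (content) sections.push({ content, source: `heading ${heading}` });
    lines = [];
  };
  for (const line of data.split("\n")) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush();
      heading = (match[1] ?? "").trim();
    } else {
      lines.push(line);
    }
  }
  flush();
  return sections;
};

// Toy plain-text loader: one section per non-empty line.
const plainTextLoader: Loader = (_fileName, data) =>
  data
    .split("\n")
    .map((line, i) => ({ content: line.trim(), source: `line ${i + 1}` }))
    .filter((section) => section.content.length > 0);

const registry = new LoaderRegistry();
registry.register(["md", "markdown"], markdownLoader);
registry.register(["txt"], plainTextLoader);
```

Adding a format then means registering one more loader, and callers only ever use `registry.load(fileName, data)`.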

Structure-Aware Chunking

Unlike naive text splitting, the chunker respects document boundaries:

Preserves page breaks (PDFs), sheet boundaries (spreadsheets), and heading hierarchy (Word/Markdown)
Token-based sizing with configurable chunk size and overlap
Hierarchical fallback: splits by sections first, then paragraphs, then sentences
Each chunk retains source metadata (page number, sheet name, heading) for attribution
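
A minimal sketch of the hierarchical fallback, under stated assumptions: whitespace-separated words stand in for model tokens, each input `Section` already corresponds to one page, sheet, or heading, and overlap is left out for brevity. All names here are hypothetical.

```typescript
interface Section { content: string; source: string; }
interface Chunk { text: string; source: string; }

// Assumption: word count approximates the tokenizer's token count.
const countTokens = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

// Splitters ordered from coarse to fine: paragraphs first, then sentences.
const splitters: Array<(text: string) => string[]> = [
  (text) => text.split(/\n{2,}/),
  (text) => text.match(/[^.!?]*[.!?]+|[^.!?]+$/g) ?? [text],
];

// Recursively split only the pieces that are still too large.
function splitToFit(text: string, maxTokens: number, level = 0): string[] {
  const trimmed = text.trim();
  if (!trimmed) return [];
  const split = splitters[level];
  if (!split || countTokens(trimmed) <= maxTokens) return [trimmed];
  const pieces: string[] = [];
  for (const part of split(trimmed)) {
    pieces.push(...splitToFit(part, maxTokens, level + 1));
  }
  return pieces;
}

// Greedily pack pieces into chunks without ever crossing a section boundary.
function chunkSection(section: Section, maxTokens: number): Chunk[] {
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let size = 0;
  for (const piece of splitToFit(section.content, maxTokens)) {
    const n = countTokens(piece);
    if (current.length > 0 && size + n > maxTokens) {
      chunks.push({ text: current.join(" "), source: section.source });
      current = [];
      size = 0;
    }
    current.push(piece);
    size += n;
  }
  if (current.length > 0) {
    chunks.push({ text: current.join(" "), source: section.source });
  }
  return chunks;
}
```

In this sketch a single sentence longer than `maxTokens` is emitted as one oversized chunk rather than being cut mid-sentence. A real chunker would need a policy for that case and for overlap.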

Embedding & Indexing

Local Embedding Model

Runs entirely locally via Transformers.js — no data leaves the machine
Quantized model for performance optimization
Batch embedding for efficient bulk processing
Automatic truncation at word boundaries with L2 normalization
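
The last item can be illustrated with two small helpers. This is a sketch only: it assumes the input limit is expressed in characters (a real model limit is in tokens), and the function names are hypothetical. The embedding call itself is omitted.

```typescript
// Cut text to at most maxChars without splitting a word.
function truncateAtWordBoundary(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  const head = text.slice(0, maxChars);
  // If the cut already falls on whitespace, the head ends on a whole word.
  if (/\s/.test(text.charAt(maxChars))) return head.replace(/\s+$/, "");
  // Otherwise drop the partial last word, if there is an earlier break.
  const lastBreak = head.search(/\s\S*$/);
  return lastBreak > 0 ? head.slice(0, lastBreak).replace(/\s+$/, "") : head;
}

// Scale a vector to unit length so that dot product equals cosine similarity.
function l2Normalize(vector: number[]): number[] {
  const norm = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? vector.slice() : vector.map((x) => x / norm);
}
```

With unit-length vectors, the similarity search can use a plain dot product, which is one common reason for normalizing at embedding time.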

Vector Storage

LanceDB provides serverless vector storage:

File-based (no separate database server needed)
Per-project isolation with independent indices
SHA256-based cache keys for deduplication
Metadata stored alongside vectors for filtered retrieval
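
One plausible shape for the SHA256-based deduplication is sketched below using Node's built-in `crypto` module. What goes into the key (here a model identifier plus the chunk text) and the in-memory `EmbeddingCache` class are assumptions for illustration; the LanceDB storage layer is not shown.

```typescript
import { createHash } from "node:crypto";

// Assumption: the key covers the model id and the exact chunk text,
// so switching models invalidates cached vectors.
function cacheKey(modelId: string, chunkText: string): string {
  return createHash("sha256")
    .update(modelId)
    .update("\0") // separator so ("ab", "c") and ("a", "bc") differ
    .update(chunkText)
    .digest("hex");
}

// Hypothetical in-memory cache: identical chunks are embedded only once.
class EmbeddingCache {
  private store = new Map<string, number[]>();

  getOrCompute(
    modelId: string,
    chunkText: string,
    embed: (text: string) => number[],
  ): number[] {
    const key = cacheKey(modelId, chunkText);
    const cached = this.store.get(key);
    if (cached) return cached;
    const vector = embed(chunkText);
    this.store.set(key, vector);
    return vector;
  }
}
```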

Hybrid Search Pipeline

The retrieval pipeline combines three ranking signals for better results than any single approach:

Signal 1: Embedding Search (Semantic)

Vector similarity search finds chunks with related meaning even when different words are used. Handles paraphrasing, synonyms, and conceptual queries.

Signal 2: Full-Text Search (Lexical)

Trigram-based indexing with Jaccard similarity catches exact keyword matches that embedding search might miss — important for technical terms, names, and identifiers.
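
As a rough illustration of the lexical signal, the sketch below builds character trigram sets and compares them with Jaccard similarity, |A ∩ B| / |A ∪ B|. The normalization (lowercasing and collapsing whitespace) is an assumption, and no index structure is shown.

```typescript
// Character trigrams of the normalized text, as a set.
function trigrams(text: string): Set<string> {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= normalized.length; i++) {
    grams.add(normalized.slice(i, i + 3));
  }
  return grams;
}

// Jaccard similarity: shared trigrams divided by all distinct trigrams.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((gram) => {
    if (b.has(gram)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

function lexicalScore(query: string, chunk: string): number {
  return jaccard(trigrams(query), trigrams(chunk));
}
```

Because the comparison is on character sequences, an identifier such as `SCTE-35` still matches when an embedding model treats it as an unfamiliar token.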

Signal 3: Recency Boost

Exponential decay weighting favors recently accessed or modified documents, so that up-to-date information tends to surface first.
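
A common way to express such a decay is with a half-life: the score halves every `halfLifeDays`. The sketch below assumes that formulation and a 30-day default. Neither the parameter name nor the value comes from the case study.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns 1 for a document touched just now and halves every halfLifeDays.
function recencyScore(lastTouched: Date, now: Date, halfLifeDays = 30): number {
  // Clamp so that timestamps in the future do not score above 1.
  const ageDays = Math.max(0, (now.getTime() - lastTouched.getTime()) / MS_PER_DAY);
  return Math.pow(0.5, ageDays / halfLifeDays);
}
```

The result always lies in (0, 1], which makes it straightforward to mix with other normalized signals.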

Score Combination

Signals are combined with configurable weights (default: 50% semantic, 25% lexical, 25% recency), normalized, and filtered by a minimum score threshold.
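
The combination step might look like the sketch below. It uses the default weights stated above, and assumes each signal is already scaled to [0, 1] and that "normalized" means dividing by the sum of the weights. The 0.3 threshold and all type names are hypothetical.

```typescript
interface SignalScores {
  id: string;
  semantic: number; // each signal assumed to be in [0, 1]
  lexical: number;
  recency: number;
}

interface Weights { semantic: number; lexical: number; recency: number; }

// Default weights as stated in the case study.
const DEFAULT_WEIGHTS: Weights = { semantic: 0.5, lexical: 0.25, recency: 0.25 };

function combineScores(
  candidates: SignalScores[],
  weights: Weights = DEFAULT_WEIGHTS,
  minScore = 0.3, // hypothetical threshold
): Array<{ id: string; score: number }> {
  const total = weights.semantic + weights.lexical + weights.recency;
  return candidates
    .map((c) => ({
      id: c.id,
      // Dividing by the weight total keeps the score in [0, 1]
      // even when custom weights do not sum to 1.
      score:
        (weights.semantic * c.semantic +
          weights.lexical * c.lexical +
          weights.recency * c.recency) / total,
    }))
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score);
}
```

For example, a chunk with semantic 0.8, lexical 0.4, and recency 1.0 scores 0.5 × 0.8 + 0.25 × 0.4 + 0.25 × 1.0 = 0.75 under the default weights.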

Cross-Encoder Reranking

After initial retrieval, a cross-encoder model re-scores the top candidates:

Context-aware scoring considers query-document pairs together (not independently)
Keyword boost calculation for term overlap
Blended scoring (cross-encoder + keyword signals)
Produces a final ranked list with higher precision than first-pass retrieval alone
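
The blending step can be sketched as below. The cross-encoder itself is passed in as a function (`scorePair`, hypothetical) that is assumed to return a relevance score in [0, 1] for a query-document pair. The 0.2 keyword weight and the ASCII-only tokenizer are also assumptions.

```typescript
// Stand-in for the cross-encoder: scores a (query, document) pair jointly.
type PairScorer = (query: string, document: string) => number;

interface Candidate { id: string; text: string; }

// Simplified tokenizer: lowercase ASCII letters and digits only.
const tokenize = (text: string): string[] =>
  text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Fraction of distinct query terms that also occur in the document.
function keywordOverlap(query: string, document: string): number {
  const queryTerms = Array.from(new Set(tokenize(query)));
  if (queryTerms.length === 0) return 0;
  const docTerms = new Set(tokenize(document));
  const hits = queryTerms.filter((term) => docTerms.has(term)).length;
  return hits / queryTerms.length;
}

function rerank(
  query: string,
  candidates: Candidate[],
  scorePair: PairScorer,
  keywordWeight = 0.2, // hypothetical blend factor
): Array<{ id: string; score: number }> {
  return candidates
    .map((c) => ({
      id: c.id,
      score:
        (1 - keywordWeight) * scorePair(query, c.text) +
        keywordWeight * keywordOverlap(query, c.text),
    }))
    .sort((a, b) => b.score - a.score);
}
```

Injecting the scorer keeps the blending logic testable without loading a model, and only the top candidates from first-pass retrieval need to be passed in.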

Structured Data Support

For spreadsheet content, the system provides additional capabilities:

Auto-detection of column types (numeric, date, boolean, string)
Natural language filtering (e.g., "employees in engineering with salary above threshold")
Aggregation support (count, sum, average, min, max)
Query analyzer routes structured queries to a dedicated engine rather than embedding search
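
To make the structured path concrete, here is a small sketch of column type detection and aggregation over string-valued rows. The detection rules (strict `true`/`false`, plain decimals, ISO `YYYY-MM-DD` dates) are simplifying assumptions, and the natural language parsing that would produce the filter is not shown.

```typescript
type ColumnType = "numeric" | "date" | "boolean" | "string";

// Classify a column by checking that every non-empty cell fits one pattern.
function detectColumnType(values: string[]): ColumnType {
  const cells = values.map((v) => v.trim()).filter((v) => v !== "");
  if (cells.length === 0) return "string";
  if (cells.every((v) => /^(true|false)$/i.test(v))) return "boolean";
  if (cells.every((v) => /^-?\d+(\.\d+)?$/.test(v))) return "numeric";
  if (cells.every((v) => /^\d{4}-\d{2}-\d{2}$/.test(v) && !Number.isNaN(Date.parse(v)))) {
    return "date";
  }
  return "string";
}

type Aggregation = "count" | "sum" | "average" | "min" | "max";

const aggregators: Record<Aggregation, (values: number[]) => number> = {
  count: (v) => v.length,
  sum: (v) => v.reduce((a, b) => a + b, 0),
  average: (v) => (v.length ? v.reduce((a, b) => a + b, 0) / v.length : Number.NaN),
  min: (v) => Math.min(...v),
  max: (v) => Math.max(...v),
};

type Row = Record<string, string>;

// Filter rows with a predicate, then aggregate one numeric column.
function aggregateColumn(
  rows: Row[],
  column: string,
  op: Aggregation,
  predicate: (row: Row) => boolean = () => true,
): number {
  const values = rows
    .filter(predicate)
    .map((row) => Number(row[column]))
    .filter((n) => !Number.isNaN(n));
  return aggregators[op](values);
}
```

A query like the filtering example above would, in this sketch, become a predicate such as `(r) => r.department === "engineering" && Number(r.salary) > 100000` passed to `aggregateColumn`.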

Web Interface

Project Management — Create, update, and delete knowledge base projects
Document Upload — Drag-and-drop file upload with format auto-detection
Document Creation — Create documents from text directly in the UI
Interactive Search — Natural language query interface with ranked results
Statistics — Index size, document count, and format distribution per project

Key Features

Fully Local — All processing on-device; no external API calls for embeddings or search
9 Input Formats — PDF, DOCX, DOC, XLSX, XLS, CSV, HTML, Markdown, plain text
Structure-Aware Chunking — Preserves pages, sheets, and headings as chunk boundaries
Hybrid Search — Combines semantic, lexical, and recency signals for better retrieval
Cross-Encoder Reranking — Second-pass scoring for higher precision results
Structured Queries — Natural language filtering and aggregation on spreadsheet data
Serverless Vector DB — LanceDB file-based storage with no infrastructure overhead
Document Writing — Export capabilities for PDF, DOCX, and XLSX creation
Project Isolation — Independent knowledge bases with separate indices
Web UI — Complete interface for document management and interactive search

Results

Search Latency: ~60ms for full hybrid search pipeline (semantic + FTS + reranking)

Embedding Speed: ~50ms per chunk (batch: ~2s for 100 chunks)

Format Coverage: 9 input formats handled natively without external converters

Technology Stack

TypeScript, Node.js, Express.js, Transformers.js, LanceDB, Vitest, pnpm, HTML/CSS/JS Frontend

Frequently Asked Questions

MicrocosmWorks built a local-first RAG system in which all document ingestion, embedding generation, vector storage, and LLM inference run entirely on your own infrastructure, without sending any data to external cloud APIs. This architecture is essential for organizations that handle classified documents, attorney-client privileged material, or sensitive intellectual property, where data sovereignty requirements prohibit any cloud processing, even with encryption.

MicrocosmWorks implemented a hybrid retrieval pipeline that runs BM25 keyword search and dense vector semantic search in parallel, then uses reciprocal rank fusion to merge and rerank the combined results before passing them to the LLM as context. This approach captures exact-match queries, such as product codes and legal citations, that semantic search misses, while also retrieving conceptually related content that keyword search would never find.
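
For reference, reciprocal rank fusion can be sketched in a few lines: each document receives 1 / (k + rank) from every ranked list it appears in, and the contributions are summed. The constant k = 60 is a commonly used default, not a value stated in this case study, and the function name is hypothetical.

```typescript
// Each inner array is one ranking (best first), e.g. BM25 results and vector results.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60, // conventional default; an assumption here
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale.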

MicrocosmWorks built format-specific parsers for PDF, DOCX, XLSX, PPTX, HTML, Markdown, and plain text, with an OCR pipeline that uses Tesseract for scanned PDFs and image-based documents. The system automatically detects whether a PDF contains selectable text or requires OCR, applies layout analysis to preserve table structures and reading order, and splits documents into chunks along semantic boundaries rather than arbitrary character limits to improve retrieval quality.

MicrocosmWorks implemented incremental indexing that tracks document checksums and reprocesses only the files that have changed since the last ingestion run. For updated documents, old chunks are removed and new chunks are inserted atomically, so the search index is never left in an inconsistent state. The system also supports versioned document retrieval, which lets users query historical versions of documents when needed for audit or compliance purposes.

MicrocosmWorks optimized the local RAG pipeline to run on modest hardware; the minimum recommended configuration is a machine with 32GB RAM, 8 CPU cores, and optionally a mid-range GPU for accelerated embedding generation. For organizations without GPU hardware, the system falls back to CPU-based embedding models with slightly higher latency, and the vector database is tuned for SSD storage to keep query response times under 200ms for corpora of up to one million document chunks.
