MicrocosmWorks: Innovating and Architecting the Digital Cosmos

Document Intelligence · Published June 18, 2026 · Updated May 25, 2026

Local-First Document RAG System with Hybrid Search & Multi-Format Support

A team building developer tools needed a fully local, privacy-preserving document intelligence system that could ingest multiple file formats, build searchable knowledge bases, and answer natural language queries using Retrieval-Augmented Generation — without sending any data to external APIs.

Domain: Document Intelligence
Technologies: 8
Key Results: 5
Status: Delivered

The Challenge

Existing RAG solutions had significant limitations for privacy-conscious and developer-focused use cases:

  • External API Dependency — Most RAG tools required sending document content to cloud-based embedding APIs, violating privacy requirements
  • Limited Format Support — Solutions typically handled only plain text or PDF, ignoring spreadsheets, Word docs, HTML, and Markdown
  • Poor Chunking — Naive text splitting ignored document structure (pages, sheets, headings), creating context-poor chunks
  • Keyword Gaps — Pure embedding-based search missed exact keyword matches that lexical search would catch
  • Spreadsheet Blindness — RAG systems couldn't handle structured tabular data or answer filtering/aggregation queries
  • No Reranking — First-pass retrieval often surfaced only partially relevant results without a second-pass quality filter

Our Solution

We built a complete local-first RAG system with multi-format document ingestion, structure-aware chunking, local embedding generation, a hybrid search pipeline (semantic + full-text + recency), cross-encoder reranking, and a web-based UI — all running entirely on the user's machine.

Architecture

  • Document Loaders: Format-specific parsers for PDF, Word (DOCX/DOC), Excel (XLSX/XLS), CSV, HTML, Markdown, and plain text
  • Chunker: Structure-aware splitting that preserves page, sheet, and heading boundaries
  • Embeddings: Local embedding model via Transformers.js (no external API calls)
  • Vector Database: LanceDB (serverless, file-based) for embedding storage and similarity search
  • Full-Text Search: Trigram-based indexing for lexical matching
  • Reranker: Cross-encoder model for context-aware result scoring
  • Query Analyzer: Intent detection routing between semantic and structured queries
  • Web Server: Express.js API with project management and search endpoints
  • Frontend: Web-based UI for document upload, management, and interactive search

Document Processing Pipeline

Multi-Format Loaders

A registry pattern auto-detects file type and routes to the appropriate parser:

  • PDF — Text extraction with page-level segmentation
  • Word (.docx/.doc) — Heading-aware parsing preserving document hierarchy
  • Excel/CSV — Sheet-by-sheet parsing with header detection and row-level content
  • HTML — Tag-aware extraction with structure preservation
  • Markdown — Heading-based section parsing
  • Plain Text — Line-based segmentation

Each loader extracts metadata (title, author, creation date, page/sheet count, word count) alongside the content, producing structured sections with source references.
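
As a rough illustration of the registry pattern described above, the sketch below maps file extensions to loader functions and returns sections with source references. All names here (`LoaderRegistry`, `Section`, `LoadedDocument`, the two sample loaders) are hypothetical, since the case study does not publish its interfaces, and only the two simplest formats are shown.

```typescript
// Hypothetical types: the real system's interfaces are not published.
interface Section {
  content: string;
  source: string; // reference back to where the text came from
}

interface LoadedDocument {
  format: string;
  sections: Section[];
  wordCount: number; // rough whitespace-based count
}

type Loader = (fileName: string, raw: string) => LoadedDocument;

const countWords = (text: string): number =>
  text.split(/\s+/).filter((w) => w.length > 0).length;

// Plain text: line-based segmentation.
const loadPlainText: Loader = (fileName, raw) => {
  const sections = raw
    .split(/\r?\n/)
    .map((line, i) => ({ content: line.trim(), source: `${fileName}:line ${i + 1}` }))
    .filter((s) => s.content.length > 0);
  return { format: "text", sections, wordCount: countWords(raw) };
};

// Markdown: heading-based section parsing.
const loadMarkdown: Loader = (fileName, raw) => {
  const sections: Section[] = [];
  let heading = "(preamble)";
  let buffer: string[] = [];
  const flush = (): void => {
    const content = buffer.join("\n").trim();
    if (content.length > 0) sections.push({ content, source: `${fileName}#${heading}` });
    buffer = [];
  };
  for (const line of raw.split(/\r?\n/)) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush();
      heading = (match[1] ?? "").trim();
    } else {
      buffer.push(line);
    }
  }
  flush();
  return { format: "markdown", sections, wordCount: countWords(raw) };
};

// The registry picks a loader from the file extension.
class LoaderRegistry {
  private loaders = new Map<string, Loader>();

  register(extensions: string[], loader: Loader): void {
    for (const ext of extensions) this.loaders.set(ext.toLowerCase(), loader);
  }

  load(fileName: string, raw: string): LoadedDocument {
    const dot = fileName.lastIndexOf(".");
    const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
    const loader = this.loaders.get(ext);
    if (!loader) throw new Error(`No loader registered for ".${ext}"`);
    return loader(fileName, raw);
  }
}

const registry = new LoaderRegistry();
registry.register(["txt"], loadPlainText);
registry.register(["md", "markdown"], loadMarkdown);

const doc = registry.load("notes.md", "# Intro\nHello world\n\n## Setup\nRun the installer");
```

Adding a format in this design means registering one more loader; the calling code does not change.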

Structure-Aware Chunking

Unlike naive text splitting, the chunker respects document boundaries:

  • Preserves page breaks (PDFs), sheet boundaries (spreadsheets), and heading hierarchy (Word/Markdown)
  • Token-based sizing with configurable chunk size and overlap
  • Hierarchical fallback: splits by sections first, then paragraphs, then sentences
  • Each chunk retains source metadata (page number, sheet name, heading) for attribution
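
The hierarchical fallback above can be sketched roughly as follows. This is an assumption-laden illustration, not the project's chunker: tokens are approximated by whitespace-separated words, the function names are hypothetical, and sections are never merged with each other so that page, sheet, and heading boundaries stay intact.

```typescript
interface Section { content: string; source: string; }
interface Chunk { text: string; source: string; tokens: number; }

// Assumption: whitespace tokens stand in for real model tokens.
const tokenize = (text: string): string[] =>
  text.split(/\s+/).filter((t) => t.length > 0);

// Split text into pieces of at most maxTokens, trying paragraphs (level 0),
// then sentences (level 1), then fixed token windows as a last resort.
function splitToFit(text: string, maxTokens: number, level: number = 0): string[] {
  const trimmed = text.trim();
  if (trimmed.length === 0) return [];
  if (tokenize(trimmed).length <= maxTokens) return [trimmed];
  let parts: string[];
  if (level === 0) {
    parts = trimmed.split(/\n\s*\n/);
  } else if (level === 1) {
    parts = trimmed.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [trimmed];
  } else {
    const tokens = tokenize(trimmed);
    const windows: string[] = [];
    for (let i = 0; i < tokens.length; i += maxTokens) {
      windows.push(tokens.slice(i, i + maxTokens).join(" "));
    }
    return windows;
  }
  const out: string[] = [];
  for (const part of parts) out.push(...splitToFit(part, maxTokens, level + 1));
  return out;
}

// Pack pieces greedily into chunks of at most maxTokens. Each new chunk
// starts with the last overlapTokens tokens of the previous one.
function chunkSections(sections: Section[], maxTokens: number, overlapTokens: number): Chunk[] {
  if (overlapTokens < 0 || overlapTokens >= maxTokens) {
    throw new Error("overlapTokens must be in [0, maxTokens)");
  }
  const chunks: Chunk[] = [];
  for (const section of sections) {
    let current: string[] = [];
    const emit = (): void => {
      if (current.length > 0) {
        chunks.push({ text: current.join(" "), source: section.source, tokens: current.length });
      }
    };
    // Pieces are sized so that overlap + piece never exceeds maxTokens.
    for (const piece of splitToFit(section.content, maxTokens - overlapTokens)) {
      const pieceTokens = tokenize(piece);
      if (current.length > 0 && current.length + pieceTokens.length > maxTokens) {
        emit();
        current = overlapTokens > 0 ? current.slice(-overlapTokens) : [];
      }
      current = current.concat(pieceTokens);
    }
    emit();
  }
  return chunks;
}
```

Because every chunk carries the `source` of the section it came from, attribution to a page, sheet, or heading survives the split.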

Embedding & Indexing

Local Embedding Model

  • Runs entirely locally via Transformers.js — no data leaves the machine
  • Quantized model for performance optimization
  • Batch embedding for efficient bulk processing
  • Automatic truncation of over-length input at word boundaries, with L2 normalization of the output vectors
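
The last bullet combines two small steps that are easy to show in isolation. The sketch below assumes a character budget for truncation (the source does not say whether the limit is in characters or tokens) and uses hypothetical function names; the model call itself is omitted.

```typescript
// Cut text to at most maxChars without splitting a word.
function truncateAtWordBoundary(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  const slice = text.slice(0, maxChars);
  const nextChar = text.charAt(maxChars);
  // A cut that lands on whitespace is already clean.
  if (/\s/.test(nextChar)) return slice.replace(/\s+$/, "");
  // Otherwise back up to the last space so the partial word is dropped.
  const lastSpace = slice.lastIndexOf(" ");
  return lastSpace > 0 ? slice.slice(0, lastSpace).replace(/\s+$/, "") : slice;
}

// Scale a vector to unit length (L2 norm of 1).
function l2Normalize(vector: number[]): number[] {
  let sumSquares = 0;
  for (const v of vector) sumSquares += v * v;
  const norm = Math.sqrt(sumSquares);
  return norm === 0 ? vector.slice() : vector.map((v) => v / norm);
}

const unit = l2Normalize([3, 4]); // [0.6, 0.8]
```

With unit-length vectors, cosine similarity reduces to a plain dot product, which is one common reason to normalize at embedding time.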

Vector Storage

LanceDB provides serverless vector storage:

  • File-based (no separate database server needed)
  • Per-project isolation with independent indices
  • SHA256-based cache keys for deduplication
  • Metadata stored alongside vectors for filtered retrieval
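
One way SHA256-based cache keys can deduplicate embedding work is sketched below using Node's built-in `crypto` module. The `EmbeddingCache` class, the in-memory `Map`, and the choice to include a model identifier in the key are assumptions for illustration; the actual system stores vectors in LanceDB.

```typescript
import { createHash } from "crypto";

// Key depends on both the model and the exact chunk text (an assumption:
// the source only says the keys are SHA256-based).
function cacheKey(modelId: string, chunkText: string): string {
  return createHash("sha256")
    .update(modelId, "utf8")
    .update("\u0000", "utf8") // separator so "ab"+"c" and "a"+"bc" differ
    .update(chunkText, "utf8")
    .digest("hex");
}

class EmbeddingCache {
  private entries = new Map<string, number[]>();
  public hits = 0;
  public misses = 0;

  // Returns the cached vector, or computes and stores it on first sight.
  getOrCompute(modelId: string, text: string, embed: (text: string) => number[]): number[] {
    const key = cacheKey(modelId, text);
    const cached = this.entries.get(key);
    if (cached !== undefined) {
      this.hits++;
      return cached;
    }
    this.misses++;
    const vector = embed(text);
    this.entries.set(key, vector);
    return vector;
  }
}
```

Identical chunks, for example a repeated page footer, are then embedded once per model no matter how many documents contain them.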

Hybrid Search Pipeline

The retrieval pipeline combines three ranking signals for better results than any single approach:

Signal 1: Embedding Search (Semantic)

Vector similarity search finds chunks with related meaning even when different words are used. Handles paraphrasing, synonyms, and conceptual queries.
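
For intuition, the semantic signal amounts to ranking chunks by cosine similarity to the query vector. LanceDB performs this search internally; the brute-force scan below is only a sketch of the scoring, with hypothetical type and function names.

```typescript
interface IndexedChunk { id: string; vector: number[]; }
interface Hit { id: string; score: number; }

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Score every chunk against the query and keep the k best.
function topK(query: number[], chunks: IndexedChunk[], k: number): Hit[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```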

Signal 2: Full-Text Search (Lexical)

Trigram-based indexing with Jaccard similarity catches exact keyword matches that embedding search might miss — important for technical terms, names, and identifiers.
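
A minimal sketch of trigram Jaccard similarity follows. The normalization choices (lowercasing, collapsing whitespace) are assumptions; the source only names the technique.

```typescript
// Collect the set of 3-character substrings of the normalized text.
function trigrams(text: string): Set<string> {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= normalized.length; i++) {
    grams.add(normalized.slice(i, i + 3));
  }
  return grams;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, in the range [0, 1].
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  a.forEach((gram) => {
    if (b.has(gram)) intersection++;
  });
  const union = a.size + b.size - intersection;
  return intersection / union;
}
```

For example, "abcd" and "abce" share one trigram ("abc") out of three distinct ones, giving a similarity of 1/3, while an identifier typed in a different letter case still scores 1.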

Signal 3: Recency Boost

Exponential decay weighting favors recently accessed or modified documents, ensuring up-to-date information surfaces first.
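
Exponential decay can be expressed with a half-life: the score halves every fixed number of days. The 30-day default below is an assumed value for illustration, as the source does not state the decay rate.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns 1 for a document touched right now and halves every halfLifeDays.
function recencyScore(lastTouched: Date, now: Date, halfLifeDays: number = 30): number {
  const ageDays = Math.max(0, (now.getTime() - lastTouched.getTime()) / MS_PER_DAY);
  return Math.exp((-Math.LN2 * ageDays) / halfLifeDays);
}
```

With a 30-day half-life, a document last modified 30 days ago scores 0.5 and one modified 60 days ago scores 0.25.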

Score Combination

Signals are combined with configurable weights (default: 50% semantic, 25% lexical, 25% recency), normalized, and filtered by a minimum score threshold.
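
The combination step might look like the sketch below. It uses the default weights stated above and assumes each signal is already in the range [0, 1]; dividing by the weight total is one possible reading of "normalized", and the 0.3 threshold is an illustrative value, not one taken from the project.

```typescript
interface SignalScores { id: string; semantic: number; lexical: number; recency: number; }
interface Weights { semantic: number; lexical: number; recency: number; }
interface Combined { id: string; score: number; }

// Defaults from the text: 50% semantic, 25% lexical, 25% recency.
const DEFAULT_WEIGHTS: Weights = { semantic: 0.5, lexical: 0.25, recency: 0.25 };

function combineScores(
  candidates: SignalScores[],
  weights: Weights = DEFAULT_WEIGHTS,
  minScore: number = 0.3 // assumed threshold
): Combined[] {
  const total = weights.semantic + weights.lexical + weights.recency;
  if (total <= 0) throw new Error("Weights must sum to a positive value");
  return candidates
    .map((c) => ({
      id: c.id,
      // Weighted sum, divided by the weight total so the result stays in [0, 1].
      score:
        (weights.semantic * c.semantic +
          weights.lexical * c.lexical +
          weights.recency * c.recency) / total,
    }))
    .filter((r) => r.score >= minScore)
    .sort((a, b) => b.score - a.score);
}
```

A chunk with signals (0.9, 0.2, 0.5) scores 0.45 + 0.05 + 0.125 = 0.625 under the defaults, while one with (0.1, 0.0, 0.2) scores 0.10 and is dropped by the threshold.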

Cross-Encoder Reranking

After initial retrieval, a cross-encoder model re-scores the top candidates:

  • Context-aware scoring considers query-document pairs together (not independently)
  • Keyword boost calculation for term overlap
  • Blended scoring (cross-encoder + keyword signals)
  • Produces a final ranked list with higher precision than first-pass retrieval alone
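
The blended scoring in the list above can be sketched as follows. The cross-encoder is passed in as a function because running a real model is out of scope here. The sigmoid squashing, the 0.8/0.2 blend, and all names are assumptions rather than the project's actual values.

```typescript
// Hypothetical signature: scores a (query, passage) pair jointly, returning a raw logit.
type CrossEncoder = (query: string, passage: string) => number;

interface Candidate { id: string; text: string; }
interface Reranked { id: string; score: number; }

const terms = (text: string): string[] => text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Fraction of distinct query terms that also appear in the passage.
function keywordOverlap(query: string, passage: string): number {
  const queryTerms = Array.from(new Set(terms(query)));
  if (queryTerms.length === 0) return 0;
  const passageTerms = new Set(terms(passage));
  const matched = queryTerms.filter((t) => passageTerms.has(t)).length;
  return matched / queryTerms.length;
}

const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

// Blend the model score with the keyword boost; alpha is the model's share.
function rerank(
  query: string,
  candidates: Candidate[],
  crossEncoder: CrossEncoder,
  alpha: number = 0.8
): Reranked[] {
  return candidates
    .map((c) => ({
      id: c.id,
      score:
        alpha * sigmoid(crossEncoder(query, c.text)) +
        (1 - alpha) * keywordOverlap(query, c.text),
    }))
    .sort((a, b) => b.score - a.score);
}
```

Only the top candidates from first-pass retrieval are passed through this step, since the model has to score each query-passage pair individually.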

Structured Data Support

For spreadsheet content, the system provides additional capabilities:

  • Auto-detection of column types (numeric, date, boolean, string)
  • Natural language filtering (e.g., "employees in engineering with salary above threshold")
  • Aggregation support (count, sum, average, min, max)
  • Query analyzer routes structured queries to a dedicated engine rather than embedding search
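
A minimal sketch of the structured path is shown below: column type inference, filtering, and aggregation over parsed rows. Translating a natural language question into `Condition` objects is not shown. The types, the ISO-only date check, and the function names are all assumptions for illustration.

```typescript
type ColumnType = "numeric" | "date" | "boolean" | "string";
type Row = Record<string, string>;
type Aggregate = "count" | "sum" | "average" | "min" | "max";

// Hypothetical filter shape that a query analyzer might produce.
interface Condition { column: string; op: "eq" | "gt" | "lt"; value: string | number; }

// Infer a column type from its non-empty cell values.
function detectColumnType(values: string[]): ColumnType {
  const present = values.map((v) => v.trim()).filter((v) => v.length > 0);
  if (present.length === 0) return "string";
  if (present.every((v) => /^(true|false)$/i.test(v))) return "boolean";
  if (present.every((v) => /^-?\d+(\.\d+)?$/.test(v))) return "numeric";
  // Assumption: only ISO dates (YYYY-MM-DD) are recognized in this sketch.
  if (present.every((v) => /^\d{4}-\d{2}-\d{2}$/.test(v) && !isNaN(Date.parse(v)))) return "date";
  return "string";
}

function applyFilter(rows: Row[], conditions: Condition[]): Row[] {
  return rows.filter((row) =>
    conditions.every((c) => {
      const cell = row[c.column] ?? "";
      if (c.op === "eq") return cell.toLowerCase() === String(c.value).toLowerCase();
      const n = Number(cell);
      return c.op === "gt" ? n > Number(c.value) : n < Number(c.value);
    })
  );
}

function aggregate(rows: Row[], column: string, op: Aggregate): number {
  if (op === "count") return rows.length;
  const nums: number[] = [];
  for (const row of rows) {
    const raw = (row[column] ?? "").trim();
    if (raw !== "" && !isNaN(Number(raw))) nums.push(Number(raw));
  }
  if (nums.length === 0) return NaN;
  const sum = nums.reduce((a, b) => a + b, 0);
  if (op === "sum") return sum;
  if (op === "average") return sum / nums.length;
  return op === "min" ? Math.min(...nums) : Math.max(...nums);
}
```

Under these assumptions, "employees in engineering with salary above threshold" would become two conditions, an `eq` on a department column and a `gt` on a salary column, with the threshold supplied by the query.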

Web Interface

  • Project Management — Create, update, and delete knowledge base projects
  • Document Upload — Drag-and-drop file upload with format auto-detection
  • Document Creation — Create documents from text directly in the UI
  • Interactive Search — Natural language query interface with ranked results
  • Statistics — Index size, document count, and format distribution per project

Key Features

  1. Fully Local — All processing on-device; no external API calls for embeddings or search
  2. 9 Input Formats — PDF, DOCX, DOC, XLSX, XLS, CSV, HTML, Markdown, plain text
  3. Structure-Aware Chunking — Preserves pages, sheets, and headings as chunk boundaries
  4. Hybrid Search — Combines semantic, lexical, and recency signals for better retrieval
  5. Cross-Encoder Reranking — Second-pass scoring for higher precision results
  6. Structured Queries — Natural language filtering and aggregation on spreadsheet data
  7. Serverless Vector DB — LanceDB file-based storage with no infrastructure overhead
  8. Document Writing — Export capabilities for PDF, DOCX, and XLSX creation
  9. Project Isolation — Independent knowledge bases with separate indices
  10. Web UI — Complete interface for document management and interactive search

Results

Search Latency: ~60ms for the full hybrid search pipeline (semantic + full-text search + reranking)
Embedding Speed: ~50ms per chunk (batch: ~2s for 100 chunks)
Format Coverage: 9 input formats handled natively without external converters
Privacy: Zero data transmitted externally; complete local processing
Memory Footprint: ~100MB for embedding model, ~1MB per 1,000 indexed chunks

Technology Stack

TypeScript, Node.js, Express.js, Transformers.js, LanceDB, Vitest, pnpm, HTML/CSS/JS Frontend

Frequently Asked Questions

MicrocosmWorks built a local-first RAG system where all document ingestion, embedding generation, vector storage, and LLM inference run entirely on your infrastructure without sending any data to external cloud APIs. This architecture is essential for organizations handling classified documents, attorney-client privileged materials, or sensitive intellectual property where data sovereignty requirements prohibit any cloud processing, even with encryption.

MicrocosmWorks implemented a hybrid retrieval pipeline that runs BM25 keyword search and dense vector semantic search in parallel, then uses reciprocal rank fusion to merge and re-rank the combined results before passing them to the LLM as context. This approach catches exact-match queries like product codes and legal citations that semantic search misses, while also retrieving conceptually related content that keyword search would never find.
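
Reciprocal rank fusion, named in the answer above, merges ranked lists using only rank positions: each document receives 1 / (k + rank) from every list it appears in. The sketch below uses k = 60, a commonly used constant; the function name and input shape are assumptions.

```typescript
interface Fused { id: string; score: number; }

// rankings: one array of document ids per retriever, best first.
function reciprocalRankFusion(rankings: string[][], k: number = 60): Fused[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

Because only ranks are used, keyword scores and vector similarities never need to be put on a common scale before merging.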

MicrocosmWorks built format-specific parsers for PDF, DOCX, XLSX, PPTX, HTML, Markdown, and plain text, with an OCR pipeline using Tesseract for scanned PDFs and image-based documents. The system automatically detects whether a PDF contains selectable text or requires OCR, applies layout analysis to preserve table structures and reading order, and chunks documents using semantic boundaries rather than arbitrary character limits to improve retrieval quality.

MicrocosmWorks implemented incremental indexing that tracks document checksums and only re-processes files that have changed since the last ingestion run. Updated documents have their old chunks removed and new chunks inserted atomically, so the search index is never in an inconsistent state. The system also supports versioned document retrieval, allowing users to query against historical versions of documents when needed for audit or compliance purposes.
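
The checksum comparison behind incremental indexing can be sketched as a pure planning step. The map shape and names below are hypothetical, and the atomic swap of old and new chunks is left to the storage layer.

```typescript
// Maps a document path to the checksum recorded for it.
type ChecksumMap = Record<string, string>;

interface IndexPlan {
  toIndex: string[];   // new or modified documents: drop old chunks, insert new ones
  toRemove: string[];  // documents that no longer exist: drop their chunks
  unchanged: string[]; // identical checksum: skip entirely
}

function planIncrementalIndex(previous: ChecksumMap, current: ChecksumMap): IndexPlan {
  const plan: IndexPlan = { toIndex: [], toRemove: [], unchanged: [] };
  for (const path of Object.keys(current)) {
    if (previous[path] === current[path]) plan.unchanged.push(path);
    else plan.toIndex.push(path);
  }
  for (const path of Object.keys(previous)) {
    if (!(path in current)) plan.toRemove.push(path);
  }
  return plan;
}
```

Only the documents in `toIndex` need to be re-parsed, re-chunked, and re-embedded, so the cost of an ingestion run depends on what changed rather than on the size of the corpus.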

MicrocosmWorks optimized the local RAG pipeline to run on modest hardware, with the minimum recommended configuration being a machine with 32GB RAM, 8 CPU cores, and optionally a mid-range GPU for accelerated embedding generation. For organizations without GPU hardware, the system falls back to CPU-based embedding models with slightly higher latency, and the vector database is tuned for SSD storage to keep query response times under 200ms for corpora up to 1 million document chunks.