Scalable Vector Database Architecture

Q: At what data scale does a dedicated vector database become necessary instead of using pgvector in PostgreSQL?

MicrocosmWorks generally recommends pgvector for projects with up to roughly 5-10 million vectors where the team already uses PostgreSQL, as it avoids introducing a new infrastructure component and supports hybrid SQL-plus-vector queries natively. Beyond 10 million vectors, or when you need sub-50ms p99 latency at high concurrency, a purpose-built vector database like Qdrant, Weaviate, or Milvus provides significantly better performance through optimized indexing algorithms and, in some engines, GPU-accelerated search. We help clients make this decision during architecture review by benchmarking their actual query patterns and growth projections.

Q: How do you handle vector database sharding when the dataset grows beyond what a single node can serve?

MicrocosmWorks designs vector database clusters with either hash-based sharding, which spreads vectors evenly across nodes, or metadata-based sharding, which keeps semantically related data co-located so that a search can target fewer shards. We implement query routing layers that fan out search requests to relevant shards and merge results using a global top-K aggregation, maintaining sub-100ms latency even across dozens of shards. Our monitoring dashboards track shard balance, query distribution, and replication lag to prevent hotspots as your dataset scales.
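The fan-out and global top-K merge described above can be sketched as follows. This is a minimal illustration, not the routing layer itself: the `ShardSearch` callable, `make_shard`, and the in-memory brute-force shards are hypothetical stand-ins for real shard clients, and the fan-out runs sequentially where a production router would issue the requests concurrently.

```python
import heapq
from typing import Callable, Iterable

# A hit is (distance, vector_id); a smaller distance means a closer match.
Hit = tuple[float, str]
# Hypothetical shard client: takes (query, k) and returns that shard's top-k hits.
ShardSearch = Callable[[list[float], int], list[Hit]]


def global_top_k(query: list[float], shards: Iterable[ShardSearch], k: int) -> list[Hit]:
    """Fan the query out to every shard and keep the k closest hits overall."""
    # Each shard returns its own top k; asking for fewer could drop a globally good hit.
    per_shard = [search(query, k) for search in shards]
    return heapq.nsmallest(k, (hit for hits in per_shard for hit in hits))


def make_shard(vectors: dict[str, list[float]]) -> ShardSearch:
    """In-memory stand-in for a shard, using brute-force squared L2 distance."""
    def search(query: list[float], k: int) -> list[Hit]:
        scored = [
            (sum((a - b) ** 2 for a, b in zip(query, vec)), vid)
            for vid, vec in vectors.items()
        ]
        return heapq.nsmallest(k, scored)
    return search


shard_a = make_shard({"a1": [0.0, 0.0], "a2": [5.0, 5.0]})
shard_b = make_shard({"b1": [1.0, 0.0], "b2": [0.0, 0.5]})
top = global_top_k([0.0, 0.0], [shard_a, shard_b], k=2)
```

Because every shard returns its own k best hits, the merged result is the same as a search over the unsharded data, provided each shard's local search is exact.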

Q: What quantization techniques can reduce vector storage costs without significantly degrading search quality?

MicrocosmWorks applies scalar quantization (reducing float32 to int8) and product quantization to compress vector storage by 4-8x with typically less than 2% degradation in recall, which we validate through A/B testing on your actual query workload before deploying to production. We also implement a two-stage retrieval approach where quantized vectors serve the initial candidate retrieval and full-precision vectors are used only for final re-ranking of the top results. This hybrid strategy lets clients store hundreds of millions of vectors at a fraction of the cost while keeping search quality close to that of uncompressed operation.
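A rough sketch of scalar quantization combined with two-stage retrieval is shown below. It assumes a single global min/max range and unsigned 8-bit codes for simplicity (engines commonly calibrate per dimension or per segment), and the function names and the `oversample` parameter are illustrative choices rather than any engine's API.

```python
import heapq


def fit_range(vectors):
    """Global min/max over all components (an assumption made for brevity)."""
    flat = [x for vec in vectors for x in vec]
    return min(flat), max(flat)


def quantize(vec, lo, hi):
    """Map each float to an integer code in [0, 255]: 1 byte instead of 4."""
    scale = (hi - lo) / 255 or 1.0  # avoid division by zero for constant data
    return bytes(min(255, max(0, round((x - lo) / scale))) for x in vec)


def sq_dist(a, b):
    """Squared L2 distance; works on float lists and on byte codes alike."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_stage_search(query, vectors, codes, lo, hi, k, oversample=4):
    """Stage 1 ranks by distance between 8-bit codes; stage 2 re-ranks at full precision."""
    q_code = quantize(query, lo, hi)
    candidates = heapq.nsmallest(
        k * oversample, range(len(codes)), key=lambda i: sq_dist(q_code, codes[i])
    )
    return heapq.nsmallest(k, candidates, key=lambda i: sq_dist(query, vectors[i]))


vectors = [[0.10, 0.20], [0.11, 0.19], [0.90, 0.80], [0.50, 0.50]]
lo, hi = fit_range(vectors)
codes = [quantize(vec, lo, hi) for vec in vectors]
result = two_stage_search([0.10, 0.20], vectors, codes, lo, hi, k=1, oversample=2)
```

The oversampling factor controls how many candidates survive the cheap first pass; raising it trades some latency for recall, which is the knob a workload-specific validation would tune.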

Q: How does MicrocosmWorks ensure high availability for vector databases serving real-time AI applications?

MicrocosmWorks deploys vector databases in multi-replica configurations with synchronous replication for write durability and read replicas distributed across availability zones for fault tolerance and load balancing. We configure automated failover with health-check-driven leader election so that a node failure results in less than 10 seconds of read unavailability and zero data loss. Our infrastructure-as-code templates include pre-configured backup schedules, point-in-time recovery, and disaster recovery runbooks tailored to each vector database engine.

Q: Can we use a single vector database to serve multiple AI applications with different embedding models and dimensions?

MicrocosmWorks architects multi-collection vector database deployments where each application or embedding model gets its own isolated collection with appropriate index configurations, while sharing the underlying cluster infrastructure for cost efficiency. We implement a unified query gateway that routes requests to the correct collection based on application context and applies collection-specific pre-processing like query embedding with the matching model. This multi-tenant vector database approach typically reduces infrastructure costs by 40-60% compared to running separate clusters per application.
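The gateway's routing step might look something like the sketch below. The `CollectionRoute` and `QueryGateway` names, the application IDs, and the fake embedding functions are all hypothetical; a real gateway would call the actual embedding model and then forward the vector to the vector database.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CollectionRoute:
    collection: str                      # target collection name
    dim: int                             # embedding dimension the collection expects
    embed: Callable[[str], list[float]]  # embedding function for the matching model


class QueryGateway:
    """Resolves an application ID to its collection and embeds the query accordingly."""

    def __init__(self, routes: dict[str, CollectionRoute]):
        self._routes = routes

    def prepare(self, app_id: str, text: str) -> tuple[str, list[float]]:
        route = self._routes.get(app_id)
        if route is None:
            raise KeyError(f"no collection registered for application {app_id!r}")
        vector = route.embed(text)
        # Guard against querying a collection with the wrong model's embedding.
        if len(vector) != route.dim:
            raise ValueError(f"expected {route.dim} dimensions, got {len(vector)}")
        return route.collection, vector


# Placeholder embedders standing in for two different embedding models.
def fake_embed_small(text: str) -> list[float]:
    return [float(len(text)), 0.0, 0.0]


def fake_embed_large(text: str) -> list[float]:
    return [float(len(text))] * 4


gateway = QueryGateway({
    "support-bot": CollectionRoute("support_docs", 3, fake_embed_small),
    "code-search": CollectionRoute("code_snippets", 4, fake_embed_large),
})
collection, vector = gateway.prepare("support-bot", "reset password")
```

Keeping the dimension check in the gateway means a misrouted request fails fast instead of reaching the cluster.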

Design Decisions & Trade-offs

Milvus vs. Pinecone vs. Qdrant vs. pgvector

pgvector is fine for fewer than 1M vectors when you already run PostgreSQL and can tolerate ~200ms latency. Pinecone suits teams that want zero operational burden and can accept the pricing (it scales well but gets expensive past 10M vectors). Qdrant offers a clean API with good single-node performance. Milvus targets serious scale: of the options compared here, it is the open-source one that combines a distributed architecture, tiered storage, and production-grade sharding. MicrocosmWorks (MW) defaults to Milvus for more than 5M vectors and to Pinecone for teams that prioritize managed simplicity.

HNSW vs. IVF_FLAT vs. IVF_PQ

HNSW (Hierarchical Navigable Small World) gives the best recall at low latency but uses the most memory (full vectors in RAM). IVF_FLAT clusters vectors and searches only the relevant clusters, giving a good balance of speed and memory. IVF_PQ (Product Quantization) compresses vectors for massive memory savings but typically reduces recall by 3-8% before any refinement. MW uses HNSW for collections under 10M vectors and switches to IVF_PQ with a refinement step (re-scoring the top candidates against full vectors) for larger collections where memory cost matters.
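The selection rule above can be written down as a small helper. The dictionary layout and parameter names (`M`, `efConstruction`, `nlist`, `m`, `nbits`) follow Milvus-style index parameters, but the specific values are illustrative assumptions, not tuned recommendations, and `choose_index` is a hypothetical function.

```python
def choose_index(num_vectors: int, dim: int, hnsw_limit: int = 10_000_000) -> dict:
    """Pick HNSW below the size limit, IVF_PQ at or above it."""
    if num_vectors < hnsw_limit:
        return {
            "index_type": "HNSW",
            "metric_type": "L2",
            "params": {"M": 16, "efConstruction": 200},  # illustrative values
        }
    # Product quantization splits each vector into m sub-vectors,
    # so m must divide the dimension evenly.
    m = next(c for c in (64, 48, 32, 16, 8, 4, 2, 1) if dim % c == 0)
    return {
        "index_type": "IVF_PQ",
        "metric_type": "L2",
        "params": {"nlist": 4096, "m": m, "nbits": 8},  # illustrative values
    }


small = choose_index(1_000_000, 768)
large = choose_index(50_000_000, 1536)
```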

Write Isolation

Concurrent writes during ingestion degrade query latency in most vector databases. MW separates the write path: new vectors are buffered in a write-ahead log, periodically flushed into sealed segments, and merged into the searchable index during low-traffic windows. For systems requiring real-time ingestion (e.g., live document processing), we deploy separate ingestion and query node pools with different resource allocations.
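A toy model of that write path is sketched below: inserts land in a growing buffer, full buffers are sealed, and sealed segments only become searchable when a merge runs. The `SegmentedStore` class and its fields are hypothetical, and the sketch ignores durability, which the write-ahead log provides in a real deployment.

```python
class SegmentedStore:
    """Buffers writes into segments and merges them into the index on demand."""

    def __init__(self, segment_size: int):
        self.segment_size = segment_size
        self.growing = []   # buffered writes, not yet sealed
        self.sealed = []    # full, immutable segments waiting to be merged
        self.indexed = {}   # stand-in for the searchable index

    def insert(self, vid: str, vec: list) -> None:
        """Append to the growing buffer; seal it once it reaches segment_size."""
        self.growing.append((vid, vec))
        if len(self.growing) >= self.segment_size:
            self.sealed.append(tuple(self.growing))
            self.growing = []

    def merge_sealed(self) -> int:
        """Merge all sealed segments into the index; meant for low-traffic windows."""
        merged = 0
        while self.sealed:
            for vid, vec in self.sealed.pop(0):
                self.indexed[vid] = vec
                merged += 1
        return merged


store = SegmentedStore(segment_size=2)
for i in range(5):
    store.insert(f"v{i}", [float(i)])
sealed_before = len(store.sealed)
merged_count = store.merge_sealed()
```

Queries only ever read `indexed`, so ingestion bursts do not touch the structure that serves searches until the merge is scheduled.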

Cost Optimization

Vector databases are memory-hungry. A 100M-vector collection with 1536-dimensional float32 embeddings needs ~600GB of RAM in HNSW mode for the raw vectors alone (100M × 1536 × 4 bytes ≈ 614GB), before counting the graph structure. MW optimizes cost through: (a) dimensionality reduction where feasible (Matryoshka embeddings, PCA), (b) quantization (scalar or product quantization), (c) tiered storage to push cold segments off RAM, and (d) right-sizing embedding dimensions, since 768 dimensions is often sufficient when 1536 is overkill.
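The memory figure can be checked with simple arithmetic. The helper below counts only raw vector storage (index structures such as the HNSW graph add more on top), and `raw_vector_bytes` is a hypothetical name used for illustration.

```python
GB = 10**9  # decimal gigabytes


def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_component: int = 4) -> int:
    """Bytes needed to hold the vectors themselves, excluding any index overhead."""
    return num_vectors * dim * bytes_per_component


full_precision_gb = raw_vector_bytes(100_000_000, 1536) / GB     # float32, 1536 dims
int8_gb = raw_vector_bytes(100_000_000, 1536, 1) / GB            # scalar-quantized to 1 byte
half_dim_gb = raw_vector_bytes(100_000_000, 768) / GB            # float32, 768 dims
```

Under these assumptions the float32 collection comes to 614.4 GB, scalar quantization brings it to 153.6 GB, and halving the dimension to 768 brings it to 307.2 GB.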

System Architecture Overview

[Figure: Scalable Vector Database Architecture, system architecture diagram]

Technology Choices

Layer	Technologies
Vector Database	Milvus (distributed), Qdrant (single-node/small-cluster), Pinecone (managed)
Storage Backend	MinIO / S3 (segment storage), SSD (warm tier), RAM (hot tier)
Coordination	etcd (Milvus metadata), Pulsar/Kafka (write-ahead log)
Embedding Models	OpenAI text-embedding-3-large, Cohere embed-v4, BGE-M3, E5-large-v2
Infrastructure	Kubernetes (EKS/GKE) with GPU nodes for embedding, memory-optimized nodes for query
Monitoring	Grafana + Milvus metrics exporter, custom P99/recall dashboards

When to Use / When to Avoid

Use When	Avoid When
Vector count exceeds 5M and growing, requiring horizontal scaling	You have < 1M vectors — pgvector on your existing PostgreSQL is sufficient
Sub-100ms P99 query latency is a hard requirement	Query latency of 500ms+ is acceptable — simpler options work
Multiple applications/tenants share the vector infrastructure	A single application with a single collection — use a managed service
Cost optimization requires tiered storage (not everything in RAM)	Budget allows fully managed services and the vendor's pricing works at your scale

