
Scalable Vector Database Architecture

Embedding search is easy at 10K vectors. At 100M vectors with sub-100ms P99, it's an infrastructure problem — and that's what this pattern solves.

May 2, 2026
Category: AI / Data
Complexity: Enterprise
Industries: AI/ML, E-Commerce

When You Need This

Your RAG pipeline or recommendation system works beautifully in development with a few thousand vectors. Now you have 50 million embeddings, queries need sub-100ms latency, the index keeps growing, and you're burning through memory. You need a vector database architecture that scales horizontally, manages memory efficiently (not everything needs to live in RAM), handles concurrent writes during ingestion without degrading query performance, and doesn't cost $10K/month in infrastructure for what is fundamentally a search index.

Pattern Overview

Scalable vector database architecture addresses the challenges of operating vector search at production scale: index partitioning across nodes (sharding), tiered storage (hot segments in memory, warm on SSD, cold on S3), query routing with load balancing, and autoscaling based on query load and index size. The pattern covers deployment topology, capacity planning, write/read isolation, and cost optimization. It's the infrastructure layer that makes RAG and recommendation systems viable at scale.

Reference Architecture

The architecture deploys vector database nodes in a clustered topology with separation between query nodes (read path) and data nodes (write path). An ingestion pipeline handles embedding generation and batch upserts with write buffering to avoid impacting query latency. A query router distributes searches across read replicas with shard-level parallelism. Tiered storage moves infrequently accessed segments from memory to SSD to S3, with transparent query-time loading. Autoscaling adjusts replica count based on query QPS and P99 latency.
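The read path above can be sketched as a router that fans a query out to every shard in parallel, load-balances across replicas, and merges the per-shard results into a global top-k. This is a minimal illustrative sketch with in-memory stand-ins for shards; the names (`QueryRouter`, `SHARDS`) and the round-robin replica picker are assumptions, not a real engine's API.

```python
import heapq
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for shard contents: (id, similarity) pairs that a
# real deployment would compute with ANN search on each shard replica.
SHARDS = {
    "shard-0": [("doc-1", 0.91), ("doc-2", 0.55)],
    "shard-1": [("doc-3", 0.87), ("doc-4", 0.62)],
}

class QueryRouter:
    """Fan a search out to all shards in parallel, then merge the top-k."""

    def __init__(self, shards, replicas_per_shard=2):
        self.shards = shards
        # Round-robin counters emulate load balancing across read replicas.
        self._rr = {name: itertools.cycle(range(replicas_per_shard))
                    for name in shards}

    def _search_shard(self, name, query, k):
        replica = next(self._rr[name])  # chosen replica (unused in this stub)
        # Real code would issue the ANN query against that replica here.
        return sorted(self.shards[name], key=lambda p: -p[1])[:k]

    def search(self, query, k=3):
        with ThreadPoolExecutor() as pool:
            partials = pool.map(lambda n: self._search_shard(n, query, k),
                                self.shards)
        # Merge the per-shard top-k lists into a single global top-k.
        return heapq.nlargest(k, (hit for part in partials for hit in part),
                              key=lambda p: p[1])

router = QueryRouter(SHARDS)
print(router.search("ignored-query", k=2))  # two best hits across both shards
```

The merge step is why shard-level parallelism works: each shard returns only k candidates, so the router's work is O(shards × k) regardless of total index size.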

Core Components
  • Cluster Management: Milvus (our default for scale) with etcd for metadata coordination, MinIO/S3 for segment storage, and Pulsar/Kafka for write-ahead logging. Alternatively, managed services (Pinecone, Zilliz Cloud) when operational simplicity outweighs cost
  • Shard & Partition Strategy: Logical partitions aligned to data boundaries (per-tenant, per-document-collection, per-time-window). Each partition is independently searchable, enabling filtered queries without scanning the full index. Shards distributed across nodes for parallel query execution
  • Tiered Storage Engine: Hot tier (in-memory HNSW/IVF index) for frequently queried collections. Warm tier (memory-mapped SSD) for large collections with moderate query load. Cold tier (S3-backed) for archival collections that are searchable but tolerate higher latency. Segment-level promotion/demotion based on access patterns
  • Autoscaling Controller: Horizontal pod autoscaler (HPA) on Kubernetes that scales query nodes based on QPS and P99 latency metrics. Scale-up on latency breach, scale-down on sustained low utilization. Separate scaling for ingestion workers to handle burst uploads without affecting query performance
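The tier placement described above reduces, at its simplest, to thresholding a segment's access rate. The thresholds below (queries per hour) are illustrative assumptions; production systems also weigh segment size, recency, and promotion/demotion hysteresis.

```python
# Illustrative tier thresholds in queries per hour (assumed values).
HOT_QPH, WARM_QPH = 100, 10

def assign_tier(queries_per_hour: float) -> str:
    """Place a segment in RAM, SSD, or S3 based on how often it is queried."""
    if queries_per_hour >= HOT_QPH:
        return "hot(ram)"
    if queries_per_hour >= WARM_QPH:
        return "warm(ssd)"
    return "cold(s3)"

print([assign_tier(q) for q in (500, 25, 1)])  # hot, warm, cold
```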

Design Decisions & Trade-offs

Milvus vs. Pinecone vs. Qdrant vs. pgvector
pgvector is fine for < 1M vectors where you already have PostgreSQL and can tolerate ~200ms latency. Pinecone for teams that want zero operational burden and can accept the pricing (scales well but gets expensive past 10M vectors). Qdrant for a clean API with good single-node performance. Milvus for serious scale — it's the only open-source option with true distributed architecture, tiered storage, and production-grade sharding. MW defaults to Milvus for >5M vectors and Pinecone for teams that prioritize managed simplicity.
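The selection logic above can be condensed into a small heuristic. The 1M and 5M thresholds come from the text; the function shape and remaining branch order are illustrative assumptions, not a complete decision procedure.

```python
def pick_vector_db(n_vectors: int, managed_ok: bool, has_postgres: bool) -> str:
    """Rough engine-selection heuristic mirroring the trade-offs above."""
    if n_vectors < 1_000_000 and has_postgres:
        return "pgvector"   # reuse existing PostgreSQL; ~200ms latency is fine
    if managed_ok:
        return "Pinecone"   # zero ops burden; watch pricing past 10M vectors
    if n_vectors > 5_000_000:
        return "Milvus"     # distributed sharding plus tiered storage
    return "Qdrant"         # clean API, strong single-node performance

print(pick_vector_db(50_000_000, managed_ok=False, has_postgres=True))  # Milvus
```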
HNSW vs. IVF_FLAT vs. IVF_PQ
HNSW (Hierarchical Navigable Small World) gives the best recall at low latency but uses the most memory (full vectors in RAM). IVF_FLAT clusters vectors and searches only relevant clusters — good balance of speed and memory. IVF_PQ (Product Quantization) compresses vectors for massive memory savings but reduces recall by 3-8%. MW uses HNSW for collections under 10M vectors and switches to IVF_PQ with PQ refinement (re-score top candidates against full vectors) for larger collections where memory cost matters.
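The PQ-refinement step mentioned above can be sketched as follows: the quantized index (not shown) proposes candidate IDs cheaply, and a second pass re-scores those candidates against the full-precision vectors to recover ranking quality. The store and function names here are hypothetical.

```python
import math

def cosine(a, b):
    """Exact cosine similarity between two full-precision vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hypothetical store: full-precision vectors kept on SSD/object storage,
# while the compressed PQ index lives in RAM and proposes candidates.
FULL_VECTORS = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}

def refine(query, candidate_ids, k):
    """Re-score PQ candidates against full vectors to recover lost recall."""
    scored = [(cid, cosine(query, FULL_VECTORS[cid])) for cid in candidate_ids]
    return sorted(scored, key=lambda p: -p[1])[:k]

# The quantized index may return candidates in only approximate order;
# refinement fixes the final ranking with exact distances.
print(refine([1.0, 0.0, 0.0], ["c", "b", "a"], k=2))
```

Refinement costs one exact distance computation per candidate, so it stays cheap as long as the candidate set is small relative to the collection.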
Write Isolation
Concurrent writes during ingestion degrade query latency in most vector databases. MW separates the write path: new vectors are buffered in a write-ahead log, periodically flushed into sealed segments, and merged into the searchable index during low-traffic windows. For systems requiring real-time ingestion (e.g., live document processing), we deploy separate ingestion and query node pools with different resource allocations.
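The buffer-then-seal flow above can be sketched as a minimal class: upserts accumulate in a growing buffer and are periodically flushed into immutable sealed segments, keeping bulk work off the query path. This is a sketch under stated assumptions; a real system appends to a WAL (Pulsar/Kafka) and merges sealed segments into the index during low-traffic windows.

```python
class WriteBuffer:
    """Buffer upserts and flush them into sealed, immutable segments."""

    def __init__(self, flush_threshold=1000):
        self.flush_threshold = flush_threshold
        self.pending = []           # growing segment, not yet indexed
        self.sealed_segments = []   # immutable segments awaiting index merge

    def upsert(self, vector_id, vector):
        self.pending.append((vector_id, vector))
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Seal the current buffer; the index merge happens out of band."""
        if self.pending:
            self.sealed_segments.append(tuple(self.pending))
            self.pending = []

buf = WriteBuffer(flush_threshold=2)
buf.upsert("v1", [0.1, 0.2])
buf.upsert("v2", [0.3, 0.4])   # hits the threshold -> one sealed segment
buf.upsert("v3", [0.5, 0.6])   # stays pending until the next flush
print(len(buf.sealed_segments), len(buf.pending))  # 1 1
```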
Cost Optimization
Vector databases are memory-hungry. A 100M-vector collection with 1536-dimensional embeddings needs ~600GB of RAM in HNSW mode. MW optimizes cost through: (a) dimensionality reduction where feasible (Matryoshka embeddings, PCA), (b) quantization (scalar or product quantization), (c) tiered storage to push cold segments off RAM, and (d) right-sizing embedding dimensions — 768 dimensions is often sufficient when 1536 is overkill.
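The RAM figure above follows from simple arithmetic: vectors × dimensions × 4 bytes per float32, plus graph overhead for HNSW. The 10% overhead factor below is an assumption; real overhead depends on HNSW's M parameter and engine bookkeeping.

```python
def hnsw_ram_gb(n_vectors: int, dim: int, bytes_per_float: int = 4,
                graph_overhead: float = 1.1) -> float:
    """Estimate RAM (GB) for an in-memory HNSW index: raw vectors + graph."""
    return n_vectors * dim * bytes_per_float * graph_overhead / 1e9

# The 100M x 1536-dim case from the text: ~614 GB raw, more with the graph.
print(round(hnsw_ram_gb(100_000_000, 1536)))
# Halving the dimensions to 768 halves the bill:
print(round(hnsw_ram_gb(100_000_000, 768)))
```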
[System architecture diagram: Scalable Vector Database Architecture overview]

Technology Choices

Layer | Technologies
Vector Database | Milvus (distributed), Qdrant (single-node/small-cluster), Pinecone (managed)
Storage Backend | MinIO / S3 (segment storage), SSD (warm tier), RAM (hot tier)
Coordination | etcd (Milvus metadata), Pulsar/Kafka (write-ahead log)
Embedding Models | OpenAI text-embedding-3-large, Cohere embed-v4, BGE-M3, E5-large-v2
Infrastructure | Kubernetes (EKS/GKE) with GPU nodes for embedding, memory-optimized nodes for query
Monitoring | Grafana + Milvus metrics exporter, custom P99/recall dashboards

When to Use / When to Avoid

Use When | Avoid When
Vector count exceeds 5M and growing, requiring horizontal scaling | You have < 1M vectors — pgvector on your existing PostgreSQL is sufficient
Sub-100ms P99 query latency is a hard requirement | Query latency of 500ms+ is acceptable — simpler options work
Multiple applications/tenants share the vector infrastructure | A single application with a single collection — use a managed service
Cost optimization requires tiered storage (not everything in RAM) | Budget allows fully managed services and the vendor's pricing works at your scale

Our Approach

MW designs vector database infrastructure with a "right-size from day one, scale when measured" approach. We start with capacity planning based on vector count, dimensionality, index type, and target latency — not guesswork. Our Milvus deployments on Kubernetes include Grafana dashboards tracking segment count, memory utilization, query latency percentiles, and recall estimates. We've implemented autoscaling Milvus clusters that handle 10x traffic spikes during business hours and scale down overnight, reducing infrastructure cost by 40-60% compared to static provisioning.
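The scale-up-on-latency-breach, scale-down-on-idle policy described above reduces to a small decision function. All thresholds here are illustrative assumptions except the 100ms SLO from the text; a production HPA would use smoothed metrics and stabilization windows to avoid flapping.

```python
def scale_decision(p99_ms: float, qps_per_replica: float, replicas: int,
                   latency_slo_ms: float = 100.0, low_util_qps: float = 50.0,
                   min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Return the target query-node count for one autoscaling tick."""
    if p99_ms > latency_slo_ms and replicas < max_replicas:
        return replicas + 1   # SLO breach: add a read replica
    if qps_per_replica < low_util_qps and replicas > min_replicas:
        return replicas - 1   # sustained low utilization: shed a replica
    return replicas           # within bounds: hold steady

print(scale_decision(p99_ms=140, qps_per_replica=300, replicas=4))  # 5
```

Keeping ingestion workers on a separate scaling loop, as noted earlier, means a burst upload never triggers query-node churn.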


Need Help Implementing This Architecture?

Our architects can help design and build systems using this pattern for your specific requirements.
