Contextual Encryption for LLM and Vector Database Pipelines
An enterprise AI platform needed to enable LLM-powered features (chat, search, document analysis) while ensuring sensitive data (PII, financial records, healthcare information) remained encrypted throughout the pipeline, including when stored as vector embeddings in a vector database.
The Challenge
Using LLMs and vector databases with sensitive data introduced novel security risks:
- Embedding Inversion Attacks: Research showed that vector embeddings could be reverse-engineered to reconstruct original text, exposing PII stored in vector DBs
- LLM Context Leakage: Sensitive data sent to LLMs could appear in responses to other users if not properly isolated
- Compliance Requirements: GDPR, HIPAA, and SOC2 demanded encryption at rest and in transit, but vector databases stored mathematical representations, not traditional text fields
- Search Functionality: Encrypting text before embedding destroyed semantic meaning, making similarity search useless
- Key Management: Per-tenant encryption keys needed rotation without re-embedding entire datasets
- Audit Trail: Every access to decrypted sensitive data needed logging for compliance
Our Solution
We implemented a contextual encryption architecture that selectively encrypts sensitive fields before storage while preserving semantic searchability through a layered approach: PII in metadata is encrypted, while sanitized, non-sensitive content stays available for embedding.
Architecture
- Encryption Engine: AES-256-GCM with per-tenant encryption keys
- Key Management: AWS KMS for key generation, rotation, and access control
- PII Detection: NER-based (Named Entity Recognition) PII classifier
- Vector Database: Milvus for similarity search on sanitized embeddings
- LLM Layer: Sanitized context sent to LLM, sensitive fields re-injected post-generation
- Audit System: Every decryption event logged with user, timestamp, and purpose
- Database: PostgreSQL for encrypted metadata
Contextual Encryption Strategy
Data Classification
Before any data enters the pipeline, a PII classifier categorizes each field by sensitivity level:
- Highly Sensitive (e.g., government IDs, financial account numbers, medical IDs): Encrypted, never embedded, never sent to LLM
- Sensitive PII (e.g., full names, email addresses, phone numbers): Encrypted at rest, placeholder-replaced before embedding
- Contextual (e.g., job titles, company names): Encrypted at rest, available for embedding with consent
- Non-Sensitive (e.g., product descriptions, public information): Stored and embedded as-is
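The four levels above can be read as a per-field handling policy. The following is a minimal sketch of that idea, not the platform's actual classifier: the names `Sensitivity`, `HandlingPolicy`, `POLICIES`, and `text_for_embedding` are hypothetical, and excluding a contextual field when consent is missing is an assumption the source does not state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Sensitivity(Enum):
    HIGHLY_SENSITIVE = "highly_sensitive"  # government IDs, account numbers
    SENSITIVE_PII = "sensitive_pii"        # names, emails, phone numbers
    CONTEXTUAL = "contextual"              # job titles, company names
    NON_SENSITIVE = "non_sensitive"        # product descriptions


@dataclass(frozen=True)
class HandlingPolicy:
    encrypt_at_rest: bool
    embed: str  # one of "never", "placeholder", "with_consent", "as_is"


# Hypothetical policy table mirroring the four levels in the list above.
POLICIES = {
    Sensitivity.HIGHLY_SENSITIVE: HandlingPolicy(True, "never"),
    Sensitivity.SENSITIVE_PII: HandlingPolicy(True, "placeholder"),
    Sensitivity.CONTEXTUAL: HandlingPolicy(True, "with_consent"),
    Sensitivity.NON_SENSITIVE: HandlingPolicy(False, "as_is"),
}


def text_for_embedding(
    value: str,
    level: Sensitivity,
    placeholder: str = "[REDACTED]",
    consent: bool = False,
) -> Optional[str]:
    """Return the text that may be embedded, or None to exclude the field."""
    mode = POLICIES[level].embed
    if mode == "as_is":
        return value
    if mode == "placeholder":
        return placeholder
    if mode == "with_consent":
        # Assumption: without consent the field is left out entirely.
        return value if consent else None
    return None  # "never"
```

Keeping the policy in one table makes it easy to review which levels can ever reach the embedding model.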
Encryption Layers
Layer 1: Field-Level Encryption at Rest
Sensitive fields are encrypted with AES-256-GCM before storage. Each tenant gets a dedicated data encryption key (DEK) managed through a key hierarchy via AWS KMS. Shadow fields store searchable hashes for exact-match lookups without requiring decryption.
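One way a shadow field for exact-match lookup might be computed is with a keyed hash. This is a sketch under assumptions: the source only says "searchable hashes", so the use of HMAC-SHA256, the separate `index_key`, and the normalization step are illustrative choices, and `shadow_hash` is a hypothetical name. The AES-256-GCM encryption of the field itself is omitted here.

```python
import hashlib
import hmac
import unicodedata


def shadow_hash(value: str, index_key: bytes) -> str:
    """Keyed hash of a normalized value, usable as an exact-match index."""
    # Normalize so that trivially different spellings hash to the same value.
    normalized = unicodedata.normalize("NFKC", value).strip().lower()
    digest = hmac.new(index_key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Usage: store the hash next to the ciphertext, then compare hashes on lookup.
index_key = b"\x01" * 32  # placeholder key for illustration only
stored = shadow_hash("Jane@Example.com", index_key)
query = shadow_hash("  jane@example.com ", index_key)
match = hmac.compare_digest(stored, query)
```

A keyed hash is chosen in this sketch because an unkeyed hash of low-entropy values such as phone numbers can be guessed by enumeration.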
Layer 2: Sanitization Before Embedding
PII is detected and replaced with type-preserving placeholders before text is sent to the embedding model. This preserves semantic meaning for similarity search while removing identifiable information. The original-to-placeholder mapping is stored encrypted alongside the vector record.
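The placeholder replacement in Layer 2 can be sketched as follows. Two regular expressions stand in for the NER-based classifier purely for illustration; the placeholder format `[TYPE_n]` and the names `PATTERNS` and `sanitize` are assumptions, not the system's actual conventions.

```python
import re

# Assumption: simple regexes stand in for the NER-based PII classifier.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def sanitize(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected PII with typed placeholders.

    Returns the sanitized text and a placeholder-to-original mapping,
    which would be encrypted before being stored with the vector record.
    """
    mapping: dict[str, str] = {}   # placeholder -> original value
    seen: dict[str, str] = {}      # original value -> placeholder
    counters: dict[str, int] = {}  # per-type running index

    for label, pattern in PATTERNS.items():
        def replace(match: re.Match, label: str = label) -> str:
            original = match.group(0)
            if original not in seen:
                counters[label] = counters.get(label, 0) + 1
                placeholder = f"[{label}_{counters[label]}]"
                seen[original] = placeholder
                mapping[placeholder] = original
            # A repeated value reuses its placeholder, keeping references consistent.
            return seen[original]

        text = pattern.sub(replace, text)
    return text, mapping


clean, mapping = sanitize("Contact jane@example.com or +1 415-555-0100.")
# clean == "Contact [EMAIL_1] or [PHONE_1]."
```

Because the placeholder keeps the entity type, the embedded sentence still reads as "contact an email or a phone number", which is what similarity search relies on.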
Layer 3: Context Injection After LLM Generation
The LLM receives sanitized context with placeholders for generating responses. After generation, the system re-injects actual values from encrypted storage into the response. This prevents sensitive data from entering LLM training data or being cached by the provider.
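A minimal sketch of the re-injection step, assuming placeholders of the form `[TYPE_n]` and a mapping that has already been decrypted from storage. Both the format and the `reinject` function are hypothetical; leaving unknown placeholders untouched is a design assumption.

```python
import re

# Assumed placeholder shape, e.g. "[EMAIL_1]" or "[PERSON_12]".
PLACEHOLDER = re.compile(r"\[[A-Z]+_\d+\]")


def reinject(response: str, mapping: dict[str, str]) -> str:
    """Swap placeholders in LLM output back to their original values.

    Placeholders missing from the mapping are left as-is rather than guessed.
    """
    return PLACEHOLDER.sub(lambda m: mapping.get(m.group(0), m.group(0)), response)


final = reinject(
    "Please email [EMAIL_1] today.",
    {"[EMAIL_1]": "jane@example.com"},
)
# final == "Please email jane@example.com today."
```

Since the substitution happens after generation, the original values only ever exist on the platform side of the LLM call.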
Vector Database Security
Collection Design
Vector collections store sanitized embeddings alongside encrypted original metadata. Tenant isolation is enforced via partition keys, with each tenant's metadata encrypted using their own key. The API layer validates tenant ownership before any decryption operation.
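The ownership check before decryption might look like the sketch below. The record layout and the names `VectorRecord`, `TenantMismatchError`, and `authorize_decrypt` are hypothetical and do not reflect the Milvus schema used in the project; the sketch only shows the check preceding any decryption call.

```python
from dataclasses import dataclass


class TenantMismatchError(PermissionError):
    """Raised when a caller asks to decrypt another tenant's record."""


@dataclass(frozen=True)
class VectorRecord:
    record_id: str
    tenant_id: str                 # assumed to double as the partition key
    embedding: tuple[float, ...]   # sanitized embedding
    encrypted_metadata: bytes      # ciphertext under the tenant's own key


def authorize_decrypt(record: VectorRecord, caller_tenant_id: str) -> VectorRecord:
    """Return the record only if the caller owns it; decryption happens afterwards."""
    if record.tenant_id != caller_tenant_id:
        raise TenantMismatchError(
            f"tenant {caller_tenant_id!r} may not decrypt record {record.record_id!r}"
        )
    return record
```

Failing before the key is ever fetched keeps a cross-tenant request from touching key material at all.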
Key Management & Rotation
Key Hierarchy
A multi-level key hierarchy is used: a master key in AWS KMS wraps per-tenant key encryption keys, which in turn wrap per-tenant data encryption keys used for field-level encryption. This enables efficient key rotation without re-encrypting the entire key chain.
Key Rotation Process
- New DEK Generated: New data encryption key created under the existing key encryption key
- New Writes: All new data encrypted with the new key; the old key remains valid for reads
- Background Re-encryption: Batch job re-encrypts existing records with the new key
- Old DEK Retirement: Once all records migrated, old key marked inactive
- Audit Log: Rotation event logged with timestamps and affected record counts
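The rotation steps above can be sketched as a small state model. It tracks key versions only: the actual decrypt and re-encrypt calls are left as a comment, and `KeyRing`, `reencrypt_batch`, and the `key_version` record field are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KeyRing:
    active_version: int = 1
    readable_versions: set[int] = field(default_factory=lambda: {1})

    def rotate(self) -> int:
        """Create a new active version; older versions stay readable."""
        self.active_version += 1
        self.readable_versions.add(self.active_version)
        return self.active_version

    def retire(self, version: int, records: list[dict]) -> None:
        """Mark an old version inactive once no record depends on it."""
        if version == self.active_version:
            raise ValueError("cannot retire the active key version")
        if any(r["key_version"] == version for r in records):
            raise ValueError("records still encrypted under this version")
        self.readable_versions.discard(version)


def reencrypt_batch(records: list[dict], ring: KeyRing, batch_size: int = 100) -> int:
    """Migrate up to batch_size records to the active version; return the count."""
    migrated = 0
    for record in records:
        if migrated >= batch_size:
            break
        if record["key_version"] != ring.active_version:
            # Real code would decrypt with the old DEK and encrypt with the new one here.
            record["key_version"] = ring.active_version
            migrated += 1
    return migrated
```

The count returned by each batch is the kind of figure the rotation audit entry would record.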
Audit & Compliance
Decryption Audit Log
Every decryption event captures who requested it, what was decrypted, when, why (request context), and which key was used, providing a complete compliance trail.
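A decryption audit entry with those five elements could be shaped as below. The field names, the JSON-lines sink, and the `log_decryption` helper are assumptions made for this sketch; the source does not describe the actual log schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DecryptionEvent:
    user_id: str             # who requested the decryption
    record_id: str           # what was decrypted
    fields: tuple[str, ...]  # which fields of the record
    purpose: str             # why (request context)
    key_version: int         # which key was used
    timestamp: str           # when, ISO 8601 in UTC


def log_decryption(
    sink: list,
    user_id: str,
    record_id: str,
    fields: tuple[str, ...],
    purpose: str,
    key_version: int,
    now: Optional[datetime] = None,
) -> DecryptionEvent:
    """Append one JSON line per decryption to the sink and return the event."""
    moment = now or datetime.now(timezone.utc)
    event = DecryptionEvent(
        user_id, record_id, fields, purpose, key_version, moment.isoformat()
    )
    sink.append(json.dumps(asdict(event), sort_keys=True))
    return event
```

The `now` parameter exists only so the timestamp can be fixed in tests; callers would normally omit it.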
GDPR Right to Erasure
The system supports full data deletion across both the relational database and vector database, with optional key rotation to cryptographically ensure no residual access. All deletion operations are logged in a GDPR audit trail.
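Erasure across both stores can be sketched with in-memory dictionaries standing in for PostgreSQL and the vector database. The `subject_id` field, the `erase_subject` function, and the audit entry layout are hypothetical; the optional key rotation step is not shown.

```python
from datetime import datetime, timezone


def erase_subject(subject_id: str, relational: dict, vectors: dict, audit_log: list) -> dict:
    """Delete one data subject's rows and vectors, then log the deletion counts."""
    row_ids = [rid for rid, row in relational.items() if row["subject_id"] == subject_id]
    vec_ids = [vid for vid, vec in vectors.items() if vec["subject_id"] == subject_id]

    for rid in row_ids:
        del relational[rid]
    for vid in vec_ids:
        del vectors[vid]

    entry = {
        "action": "gdpr_erasure",
        "subject_id": subject_id,
        "rows_deleted": len(row_ids),
        "vectors_deleted": len(vec_ids),
        "at": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(entry)
    return entry
```

Collecting the IDs before deleting avoids mutating a dictionary while iterating over it.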
Key Features
- Field-Level Encryption: AES-256-GCM on sensitive fields, not entire records
- PII Sanitization: Placeholders preserve semantic meaning for embeddings
- Post-LLM Re-injection: Sensitive data never sent to LLM providers
- Per-Tenant Keys: Isolated encryption keys with AWS KMS management
- Key Rotation: Zero-downtime rotation with background re-encryption
- Embedding Safety: Sanitized embeddings prevent inversion attacks on PII
- Audit Trail: Every decryption logged for compliance reporting
- GDPR Compliance: Automated erasure across encrypted stores and vector DB
Results
Technology Stack