Vector-store rot: the data quality crisis nobody talks about.

Stale embeddings, leaked PII, and the surprisingly hard problem of safely deleting a row from a vector index.

May 13, 2026

Sculley et al. (2015), "Hidden Technical Debt in Machine Learning Systems," Google's now-canonical NeurIPS paper, argues that the ML code itself is only a small fraction of a production ML system, and that data dependencies and pipeline glue account for much of the long-term maintenance cost. A decade later, the rise of RAG has moved much of that debt into a place it did not previously live: the vector store.

Three failure modes dominate. The first is duplication from re-ingestion: when a document gets re-chunked under a new pipeline, the old vectors usually stay. Studies of production Pinecone and Qdrant indexes report 8–22% near-duplicate rates at cosine ≥ 0.95, enough to fill several top-k slots with copies of the same passage and bias retrieval toward whatever was re-ingested most recently.
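
A minimal sketch of what a near-duplicate audit looks like, assuming you can export (id, vector) pairs from the index. The function names and the sample ids are hypothetical, and the brute-force pairwise loop is only practical for small samples; against a real index you would more likely query each vector's nearest neighbors through the store itself.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def near_duplicates(records, threshold=0.95):
    """Return (id_a, id_b, score) for every pair with cosine >= threshold.

    records is a list of (vector_id, vector) tuples. This is O(n^2),
    so run it on a sample rather than on a full production index.
    """
    pairs = []
    for i in range(len(records)):
        id_i, vec_i = records[i]
        for j in range(i + 1, len(records)):
            id_j, vec_j = records[j]
            score = cosine(vec_i, vec_j)
            if score >= threshold:
                pairs.append((id_i, id_j, score))
    return pairs

# Hypothetical sample: the same chunk written by two pipeline versions,
# plus one unrelated chunk.
records = [
    ("doc1#pipeline-v1", [1.0, 0.0, 0.0]),
    ("doc1#pipeline-v2", [0.99, 0.05, 0.0]),
    ("doc2#pipeline-v1", [0.0, 1.0, 0.0]),
]
dupes = near_duplicates(records)
```

Flagged pairs are candidates for review, not an automatic delete list: two chunks can be nearly identical and both legitimate (boilerplate clauses, repeated disclaimers).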

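The second is embedding drift. When you switch from OpenAI's text-embedding-ada-002 to text-embedding-3-large, or from MiniLM to BGE, the vectors you wrote yesterday live in a different embedding space than the queries you embed today. If the dimensions differ (ada-002 produces 1,536-dimensional vectors, text-embedding-3-large 3,072 by default), the store will usually reject the query outright, which is the lucky case. If the dimensions happen to match, the cosine similarities still compute, but they no longer mean what they used to. Most teams discover this only when relevance silently degrades.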
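
One way to make drift visible instead of silent is to store the embedding model's name alongside every vector and refuse to score across models. The sketch below assumes such a metadata field exists; `StoredVector`, `embedding_model`, and `split_by_model` are hypothetical names, not any particular store's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredVector:
    vector_id: str
    embedding_model: str  # hypothetical metadata field written at ingest time
    values: tuple

def split_by_model(stored_vectors, query_model, query_dim):
    """Separate vectors that are safe to score from ones that need re-embedding.

    A vector is comparable only if it was produced by the same model as the
    query and has the same dimensionality.
    """
    comparable, stale = [], []
    for sv in stored_vectors:
        if sv.embedding_model == query_model and len(sv.values) == query_dim:
            comparable.append(sv)
        else:
            stale.append(sv)
    return comparable, stale

# Placeholder model labels and toy two-dimensional vectors.
stored = [
    StoredVector("chunk-1", "model-old", (0.1, 0.2)),
    StoredVector("chunk-2", "model-new", (0.3, 0.4)),
]
comparable, stale = split_by_model(stored, query_model="model-new", query_dim=2)
```

The stale list doubles as a re-embedding backlog: its length tells you how much of the index is still in the old space.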

The third, and the one regulators care about, is sensitive-data leakage. The Samsung incident reported in April 2023, in which engineers pasted proprietary source code into ChatGPT, made headlines, but the steadier problem is unintentional ingestion of PII into RAG indexes. GDPR Article 17 (right to erasure) and, for high-risk systems, the EU AI Act's record-keeping requirements (Article 12) can both reach vector representations of personal data, and "we cannot delete a single embedding" is not a defensible answer.
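
Being able to answer an erasure request presupposes knowing which vectors belong to which person. A minimal sketch, assuming you maintain a side index from data-subject id to vector ids at ingest time; `SubjectIndex` and `erase_subject` are hypothetical names, and a plain dict stands in for the vector store.

```python
from collections import defaultdict

class SubjectIndex:
    """Side index mapping a data-subject id to the vector ids derived from their data."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def record(self, subject_id, vector_id):
        self._by_subject[subject_id].add(vector_id)

    def vectors_for(self, subject_id):
        return sorted(self._by_subject.get(subject_id, set()))

    def forget(self, subject_id):
        """Drop the subject from the index and return the vector ids it pointed to."""
        return sorted(self._by_subject.pop(subject_id, set()))

def erase_subject(store, index, subject_id):
    """Delete every vector tied to subject_id.

    Returns (removed, missing): ids that were deleted, and ids the index
    listed but the store no longer held.
    """
    removed, missing = [], []
    for vector_id in index.forget(subject_id):
        if store.pop(vector_id, None) is not None:
            removed.append(vector_id)
        else:
            missing.append(vector_id)
    return removed, missing

# Toy data: two chunks from one person, one chunk from another.
store = {"v1": [0.1, 0.2], "v2": [0.3, 0.4], "v3": [0.5, 0.6]}
index = SubjectIndex()
index.record("subject-a", "v1")
index.record("subject-a", "v2")
index.record("subject-b", "v3")
removed, missing = erase_subject(store, index, "subject-a")
```

The returned lists can serve as the raw material for an erasure receipt. The sketch covers only the live index; backups and snapshots that contain the same vectors need their own retention policy.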

Cleaning blindly is the dangerous part. Vector stores rarely have foreign-key constraints, and a deletion that looks safe at the row level can break a downstream cluster that retrieval relied on. The reversibility literature is thin (Pinecone published its first official restore-from-snapshot guide only in late 2025), and most teams default to "never delete, only overwrite," which is itself a compliance problem, because data subject to an erasure request never actually leaves the index.

Current best practice is to (1) snapshot before mutation, (2) require human confirmation on bulk deletes, and (3) keep an undo window measured in weeks, not seconds. Whether you build that yourself, use a managed vector DB with restore points (Pinecone, Qdrant Cloud), or layer something like Buzo on top, the underlying invariant is the same: every delete is a hypothesis, and you should be able to roll it back.
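
The three rules above can be sketched in a few dozen lines. This is an in-memory illustration under stated assumptions, not a production design: `GuardedStore`, the 100-row bulk threshold, and the three-week window are all hypothetical choices, and the "snapshot" here copies only the rows about to be deleted rather than the whole index.

```python
import copy
import time

UNDO_WINDOW_SECONDS = 21 * 24 * 3600  # assumed undo window: three weeks
BULK_THRESHOLD = 100                   # assumed cutoff for a "bulk" delete

class GuardedStore:
    """Toy vector store that snapshots rows before deleting them."""

    def __init__(self, clock=time.time):
        self._rows = {}
        self._snapshots = []  # list of (timestamp, {vector_id: values})
        self._clock = clock

    def upsert(self, vector_id, values):
        self._rows[vector_id] = values

    def count(self):
        return len(self._rows)

    def delete(self, vector_ids, confirmed=False):
        """Delete rows and return a handle that undo() accepts."""
        ids = [v for v in vector_ids if v in self._rows]
        # Rule 2: bulk deletes need an explicit human confirmation flag.
        if len(ids) >= BULK_THRESHOLD and not confirmed:
            raise PermissionError("bulk delete needs human confirmation")
        # Rule 1: snapshot the affected rows before mutating anything.
        snapshot = {v: copy.deepcopy(self._rows[v]) for v in ids}
        self._snapshots.append((self._clock(), snapshot))
        for v in ids:
            del self._rows[v]
        return len(self._snapshots) - 1

    def undo(self, handle):
        """Restore the rows removed by one delete, if still inside the window."""
        taken_at, snapshot = self._snapshots[handle]
        # Rule 3: the undo window is long, but it does expire.
        if self._clock() - taken_at > UNDO_WINDOW_SECONDS:
            raise TimeoutError("undo window has expired")
        for v, values in snapshot.items():
            # Do not clobber a row that was re-written after the delete.
            self._rows.setdefault(v, values)
        return len(snapshot)
```

Passing the clock in as a parameter keeps the window testable without waiting three weeks. One design tension is worth stating: deletes made to honor an erasure request should not stay restorable indefinitely, so a real system would likely need a separate, shorter path for those.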