
The retrieval blind spot: why RAG systems fail silently.

What the literature, the benchmarks, and three years of production incidents tell us about catching bad answers before users do.

In "Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., 2023, arXiv:2307.03172), the authors show that accuracy on multi-document question answering can drop by roughly 20 percentage points when the relevant passage sits in the middle of a long retrieval window instead of at the edges. That is not a model problem you can prompt-engineer away. It is a retrieval problem your system inherits silently, because nothing in the answer tells you where the relevant passage sat in the context.
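One way to make the position effect visible in your own logs is to break answer accuracy down by where the relevant passage landed in the prompt. The sketch below is illustrative only: the function names, the three-way bucketing, and the record shape `(gold_index, total_passages, was_correct)` are assumptions, not anything defined in the paper.

```python
from collections import defaultdict


def position_bucket(index: int, total: int) -> str:
    """Classify a passage's slot in the prompt as 'start', 'middle', or 'end'.

    The split into thirds is an arbitrary choice for illustration.
    """
    if total <= 0 or not 0 <= index < total:
        raise ValueError("index must lie in [0, total)")
    if total < 3:
        # Too few passages for a meaningful middle.
        return "start" if index == 0 else "end"
    third = total / 3
    if index < third:
        return "start"
    if index >= total - third:
        return "end"
    return "middle"


def accuracy_by_position(records):
    """Aggregate accuracy per bucket.

    records: iterable of (gold_index, total_passages, was_correct) tuples,
    where gold_index is the 0-based slot of the known-relevant passage.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for gold_index, total, was_correct in records:
        bucket = position_bucket(gold_index, total)
        counts[bucket] += 1
        hits[bucket] += int(was_correct)
    return {bucket: hits[bucket] / counts[bucket] for bucket in counts}
```

If the "middle" bucket scores clearly lower than the other two on your own evaluation set, the effect is present in your stack. This only works for queries where you know which passage is the relevant one, so it belongs in an offline evaluation rather than in live traffic.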

Production RAG stacks compound this. The Ragas evaluation framework (Es et al., 2023, arXiv:2309.15217), together with its open-source library, scores a pipeline along four separate axes: faithfulness, answer relevance, context precision, and context recall. Measured against held-out user questions, it is common for live systems to score below 0.7 on at least two of them. The dominant failure mode is not hallucination; it is the model dutifully answering from the wrong fragment.
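To make the two retrieval-side axes concrete, here is a deliberately simplified, ID-based sketch. It is not how Ragas computes these metrics (Ragas uses LLM-judged, statement-level scoring); it assumes you have a hand-labeled set of relevant chunk IDs per question, and the function names are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of distinct retrieved chunks that are labeled relevant."""
    retrieved = list(dict.fromkeys(retrieved_ids))  # de-duplicate, keep order
    if not retrieved:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of labeled-relevant chunks that the retriever returned."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one labeled relevant chunk")
    return len(relevant & set(retrieved_ids)) / len(relevant)


# Example: four chunks retrieved, three labeled relevant, two overlap.
retrieved = ["a", "b", "c", "d"]
labeled = ["a", "c", "z"]
precision = context_precision(retrieved, labeled)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, labeled)        # 2 of 3 relevant were retrieved
```

Low precision means the model was handed distracting fragments; low recall means the fragment it needed never arrived. Both can be bad while the generated answer still reads fluently, which is why they need to be measured separately from answer quality.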

Teams almost never instrument for this. Anthropic's 2024 retrieval-grounded report, Stanford's HELM benchmark, and the OpenAI Evals corpus all converge on the same observation: the bug is rarely in the model. It is in what the model was handed. And the handoff happens inside infrastructure most teams treat as a black box.

The minimum useful instrumentation is one record per request that joins four fields: query, retrieved_ids, generated_answer, and cited_ids. Without that join you cannot ask the only question that matters when a user complains: which fragment caused this answer? With it, you can replay the failure against the exact fragments the model saw, and you can score production queries against the four Ragas axes nightly (context recall additionally needs a reference answer, so it applies only to queries you have labeled).
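A minimal sketch of such a record, using only the Python standard library. The four payload fields come from the text above; the class name, the `trace_id` field, the helper methods, and the example IDs are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RagTrace:
    """One request's retrieval-to-answer chain (hypothetical schema)."""

    trace_id: str
    query: str
    retrieved_ids: Tuple[str, ...]  # in the order handed to the model
    generated_answer: str
    cited_ids: Tuple[str, ...]      # fragments the answer claims to rely on

    def cited_fragments(self) -> List[Tuple[str, Optional[int]]]:
        """Pair each cited ID with its retrieval rank (None if never retrieved)."""
        rank = {}
        for position, chunk_id in enumerate(self.retrieved_ids):
            rank.setdefault(chunk_id, position)  # keep first occurrence
        return [(chunk_id, rank.get(chunk_id)) for chunk_id in self.cited_ids]

    def to_json(self) -> str:
        """Serialize for a log line or an event attribute."""
        return json.dumps(asdict(self), sort_keys=True)


trace = RagTrace(
    trace_id="t-001",
    query="What is the refund window?",
    retrieved_ids=("doc-7#2", "doc-3#0", "doc-9#4"),
    generated_answer="Refunds are accepted within 30 days.",
    cited_ids=("doc-9#4", "doc-1#1"),
)

# Which fragment caused this answer, and where did it sit in the context?
fragments = trace.cited_fragments()
```

In this example `fragments` pairs `doc-9#4` with rank 2 and `doc-1#1` with `None`. A `None` rank is itself a finding: the answer cites something the retriever never returned. Storing the serialized record alongside the fragment text (or a content hash) is what makes replay possible after the index has been rebuilt.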

A few teams build this themselves on top of OpenTelemetry. Most do not: building it competes with feature work, and the value only shows up the first time a customer reports a bad answer. Tools like Ragas, TruLens, and Buzo cover this layer out of the box; pick whichever fits your stack, but pick something. Catching one regression before it ships pays the integration cost back many times over.

The deeper point: retrieval observability is not a metric; it is a habit. The teams that recover fastest from a bad-answer incident are the ones whose dashboards already show every retrieval-to-answer chain, not the ones who add logging after the fire.