In Liu et al. (2023), "Lost in the Middle: How Language Models Use Long Contexts" (arXiv:2307.03172), the authors show that accuracy on multi-document question answering can drop by more than 20% in the worst case when the relevant passage sits in the middle of a long input context instead of near the beginning or end. That is not a model problem you can prompt-engineer away. It is a retrieval problem your system inherits silently.
Production RAG stacks compound this. The Ragas evaluation framework (introduced in Es et al., 2023, arXiv:2309.15217) scores RAG quality along four commonly used axes: faithfulness, answer relevance, context precision, and context recall. Live systems often score poorly on more than one of them when measured against held-out user questions. The dominant failure mode is not hallucination; it is the model dutifully answering from the wrong fragment.
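To make the two retrieval-side axes concrete, here is a minimal sketch of ID-based proxies for context recall and context precision. It assumes you have a hand-labeled set of relevant fragment IDs per question and that retrieved IDs are unique. Ragas itself relies on LLM judgments rather than labeled IDs, so this is an illustration of the idea, not the library's implementation. The function names and sample IDs are hypothetical.

```python
from typing import Sequence, Set


def context_recall(retrieved_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Fraction of the labeled-relevant fragments that were actually retrieved."""
    if not relevant_ids:
        return 0.0
    hits = relevant_ids.intersection(retrieved_ids)
    return len(hits) / len(relevant_ids)


def context_precision(retrieved_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Rank-aware precision: mean of precision@k over the ranks k that hold
    a relevant fragment. Relevant fragments ranked higher score better."""
    hits = 0
    precision_sum = 0.0
    for k, frag_id in enumerate(retrieved_ids, start=1):
        if frag_id in relevant_ids:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits if hits else 0.0


# Hypothetical example: four fragments retrieved, three labeled relevant.
retrieved = ["d7", "d2", "d9", "d4"]
relevant = {"d2", "d4", "d5"}
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant fragments found
precision = context_precision(retrieved, relevant)  # relevant hits at ranks 2 and 4
```

A low recall with high precision suggests the retriever is missing fragments entirely; the reverse suggests the right fragments are present but buried under irrelevant ones.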
Teams almost never instrument for this. Anthropic's 2024 retrieval-grounded report, Stanford's HELM benchmark, and the OpenAI Evals corpus all converge on the same observation: the bug is rarely in the model. It is in what the model was handed. And the handoff happens inside infrastructure most teams treat as a black box.
The minimum useful instrumentation is the join between (query, retrieved_ids, generated_answer, cited_ids). Without that join you cannot ask the only question that matters when a user complains: which fragment caused this answer? With it, you can replay the failure deterministically and you can score every production query against the four Ragas axes nightly.
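One way to picture that join is a single record per production query plus a small helper that splits fragment IDs by what happened to them. This is a sketch under assumptions: the `RetrievalTrace` class, the `explain` function, and the sample IDs are hypothetical names chosen for illustration, and it presumes your generation step can report which fragment IDs the answer cites.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RetrievalTrace:
    """One logged retrieval-to-answer chain (hypothetical schema)."""
    query: str
    retrieved_ids: List[str]   # in the order handed to the model
    generated_answer: str
    cited_ids: List[str]       # fragments the answer claims to rely on


def explain(trace: RetrievalTrace) -> Dict[str, List[str]]:
    """Split fragment IDs into the three buckets that matter in a post-mortem."""
    retrieved = set(trace.retrieved_ids)
    cited = set(trace.cited_ids)
    return {
        # Fragments that plausibly caused the answer.
        "cited_and_retrieved": [i for i in trace.retrieved_ids if i in cited],
        # Fragments that were handed over but not used.
        "retrieved_not_cited": [i for i in trace.retrieved_ids if i not in cited],
        # Citations with no retrieved source: a sign of a fabricated reference.
        "cited_not_retrieved": [i for i in trace.cited_ids if i not in retrieved],
    }


trace = RetrievalTrace(
    query="What is the refund window?",
    retrieved_ids=["doc-12", "doc-3", "doc-40"],
    generated_answer="Refunds are accepted within 30 days.",
    cited_ids=["doc-3", "doc-99"],
)
buckets = explain(trace)
```

Storing the record as-is keeps replay simple: the same `query` and `retrieved_ids` can be fed back to the model, and the `cited_and_retrieved` bucket answers the question of which fragment caused the answer.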
A few teams ship this themselves on top of OpenTelemetry. Most do not — building it competes with feature work and the value only shows up the first time a customer reports a bad answer. Tools like Ragas, TruLens, and Buzo cover this layer out of the box; pick whichever fits your stack, but pick something. The cost of catching one regression before it ships pays the integration back many times over.
The deeper point: retrieval observability is not a metric, it is a habit. The teams that recover fastest from a bad-answer incident are the ones whose dashboards already show every retrieval-to-answer chain — not the ones who add logging after the fire.
