docs: extend RAG Failure Mode Checklist with advanced failures #20760
AstraBert merged 5 commits into run-llama:main from
Conversation
Follow-up to run-llama#20702 and run-llama#20721. This PR keeps the existing RAG Failure Mode Checklist and extends it with a small set of system-level failure families that often show up in production, without changing any of the current recommendations.
Summary of changes
- Keep sections 1–9 as-is (single-query failures: retrieval, chunking, embeddings, query understanding, synthesis).
- Add section 10 “Embedding Metric Mismatch (Cosine Score ≠ True Meaning)” to cover cases where the distance metric or normalization does not match how meaning is distributed in the data.
- Add section 11 “Session and Cache Memory Breaks” for cross-session instability caused by stateless indices, cache keys, or environment changes.
- Add section 12 “Observability Gaps ("Black-Box Debugging")” to highlight that many issues cannot be fixed before basic traces and logs are in place.
- Add section 13 “Index Lifecycle and Deployment Ordering” to capture failures caused by empty or half-built indices, wrong snapshot routing, or deployment ordering bugs.
- Slightly update the introduction and the Quick Diagnostic Flowchart so they point to the new sections when issues appear only in production or after deploys.
All new content is written in a project-native way (no external dependencies or naming schemes) and is based on recurring failure patterns seen in real-world RAG deployments.
Happy to adjust wording, scope, or numbering if you would prefer a slimmer version or a separate “advanced” doc instead of extending this page.
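For reviewers who want a concrete picture of the section 10 failure, here is a minimal sketch of how the choice of metric can flip a ranking when vectors are not normalized. It is plain Python with made-up two-dimensional vectors; the document names and the helpers `dot`, `cosine`, and `rank` are hypothetical and are not part of the checklist or of any llama_index API.

```python
import math


def dot(a, b):
    """Raw inner product; sensitive to vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Inner product of the length-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank(query, docs, score):
    """Return document names sorted from best to worst under `score`."""
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)


# Toy data (assumption): one short on-topic vector, one long off-topic vector.
query = [1.0, 0.0]
docs = {
    "on_topic_short": [0.9, 0.1],  # nearly parallel to the query, small norm
    "off_topic_long": [3.0, 4.0],  # points elsewhere, but has a large norm
}

dot_ranking = rank(query, docs, dot)
cosine_ranking = rank(query, docs, cosine)

print("dot product:", dot_ranking)
print("cosine:     ", cosine_ranking)
```

With these toy vectors the raw dot product prefers the long off-topic vector, while cosine similarity prefers the on-topic one. A store configured for one metric and fed embeddings prepared for the other can therefore return confident but wrong neighbors.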
Head branch was pushed to by a user without write access
I pushed a follow-up commit to fix the trailing whitespace that the linter reported.
Linting is failing. Run the following from the root folder of the llama_index repo:
uv pip install pre-commit
pre-commit install
pre-commit run -a
Please do not merge main into this branch until the current CI run has finished, so that we can merge this PR without problems. Otherwise I have to re-approve the workflows for every commit pushed.
Thanks a lot for the review and merge.
Description
This is a documentation-only change that expands the existing RAG Failure Mode Checklist with several additional failure families that commonly appear in production systems (embedding metric issues, cross-session instability, observability gaps, and index lifecycle / deployment ordering problems).
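As an illustration of the cross-session instability family mentioned above, the sketch below shows one way a cache key can be built so that a changed embedding model or rebuilt index does not silently reuse stale entries. Everything here is an assumption made for the example: the function `cache_key`, its parameters, the version strings, and the choice to lowercase and trim the query are hypothetical and do not describe how llama_index caches anything.

```python
import hashlib
import json


def cache_key(query, embed_model, index_version, top_k):
    """Build a deterministic key from everything that can change the answer.

    Hypothetical helper: the field names are illustrative only.
    """
    payload = json.dumps(
        {
            "query": query.strip().lower(),  # assumed normalization
            "embed_model": embed_model,
            "index_version": index_version,
            "top_k": top_k,
        },
        sort_keys=True,  # stable field order, so the hash is reproducible
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_before = cache_key("What is RAG?", "model-a", "2024-01-01", 5)
key_same = cache_key("  what is rag? ", "model-a", "2024-01-01", 5)
key_after_reindex = cache_key("What is RAG?", "model-a", "2024-02-01", 5)

print(key_before == key_same)           # same logical request
print(key_before == key_after_reindex)  # index was rebuilt
```

The design choice is that the key covers the environment as well as the query text. A key derived from the query alone would keep serving results computed against an older index or a different embedding model.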
Related issues: #20702, #20721 (docs follow-up; does not close new issues).
New Package?
Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?
Version Bump?
Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)
Type of Change
How Has This Been Tested?
This is a documentation-only change; no code paths were modified, so no additional tests were added.
Suggested Checklist
uv run make format; uv run make lint to appease the lint gods