
docs: extend RAG Failure Mode Checklist with advanced failures #20760

Merged
AstraBert merged 5 commits into run-llama:main from onestardao:main on Feb 23, 2026

@onestardao
Contributor

Follow-up to #20702 and #20721.

This PR keeps the existing RAG Failure Mode Checklist and extends it with a small set of system-level failure families that often show up in production, without changing any of the current recommendations.

Summary of changes

  • Keep sections 1–9 as-is (single-query failures: retrieval, chunking, embeddings, query understanding, synthesis).
  • Add section 10 “Embedding Metric Mismatch (Cosine Score ≠ True Meaning)” to cover cases where the distance metric or normalization does not match how meaning is distributed in the data.
  • Add section 11 “Session and Cache Memory Breaks” for cross-session instability caused by stateless indices, cache keys, or environment changes.
  • Add section 12 “Observability Gaps ("Black-Box Debugging")” to highlight that many issues cannot be fixed before basic traces and logs are in place.
  • Add section 13 “Index Lifecycle and Deployment Ordering” to capture failures caused by empty or half-built indices, wrong snapshot routing, or deployment ordering bugs.
  • Slightly update the introduction and the Quick Diagnostic Flowchart so they point to the new sections when issues appear only in production or after deploys.
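As a rough illustration of the metric mismatch described for section 10, the sketch below ranks two toy documents against one query using raw dot product and then cosine similarity. The vectors and document names are made up for the example and are not taken from the checklist. The point is only that unnormalized embeddings can rank differently depending on the metric the index uses.

```python
import math

def dot(a, b):
    """Raw inner product; sensitive to vector length."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Inner product of the length-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical 2-D "embeddings", chosen only to make the effect visible.
query = [1.0, 0.0]
docs = {
    "doc_aligned": [0.9, 0.1],  # almost the same direction, small norm
    "doc_long": [3.0, 3.0],     # 45 degrees off, large norm
}

def top_hit(score_fn):
    """Return the name of the best-scoring document under score_fn."""
    return max(docs, key=lambda name: score_fn(query, docs[name]))

top_by_dot = top_hit(dot)
top_by_cosine = top_hit(cosine)
print("dot product picks:", top_by_dot)
print("cosine picks:", top_by_cosine)
```

If the two metrics disagree like this on real data, it may be worth checking whether the embedding model expects normalized vectors and whether the vector store is configured with the matching metric.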

All new content is written in a project-native way (no external dependencies or naming schemes) and is based on recurring failure patterns seen in real-world RAG deployments.

Happy to adjust wording, scope, or numbering if you would prefer a slimmer version or a separate “advanced” doc instead of extending this page.

Description

This is a documentation-only change that expands the existing RAG Failure Mode Checklist with several additional failure families that commonly appear in production systems (embedding metric issues, cross-session instability, observability gaps, and index lifecycle / deployment ordering problems).
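One way to picture the cross-session instability mentioned above is a response cache whose key ignores the environment. The sketch below is a minimal example under stated assumptions: `cache_key`, `embed_model`, `index_version`, and `top_k` are hypothetical names chosen for illustration, not part of LlamaIndex or of the checklist itself. It folds the embedding model name and an index version into the key so that a redeploy or re-index does not silently serve stale answers.

```python
import hashlib
import json

def cache_key(query: str, embed_model: str, index_version: str, top_k: int) -> str:
    """Build a deterministic cache key from the query and the retrieval environment.

    All parameter names are illustrative assumptions, not a library API.
    """
    payload = json.dumps(
        {
            "query": query.strip().lower(),  # light normalization of the query text
            "embed_model": embed_model,
            "index_version": index_version,
            "top_k": top_k,
        },
        sort_keys=True,  # stable field order so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key_before = cache_key("What is RAG?", "model-a", "2026-02-20", 5)
key_same = cache_key("  what is rag? ", "model-a", "2026-02-20", 5)
key_after_reindex = cache_key("What is RAG?", "model-a", "2026-02-23", 5)
print(key_before == key_same, key_before == key_after_reindex)
```

With a key like this, a cached answer produced against an older index is a cache miss after re-indexing rather than a stale hit.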

Related issues: #20702, #20721 (docs follow-up; does not close new issues).

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

This is a documentation-only change; no code paths were modified, so no additional tests were added.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist

  • I have performed a self-review of my own changes
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

@dosubot (bot) added the size:L label (This PR changes 100-499 lines, ignoring generated files.) on Feb 20, 2026
@dosubot (bot) added the lgtm label (This PR has been approved by a maintainer) on Feb 20, 2026
@logan-markewich logan-markewich enabled auto-merge (squash) February 20, 2026 23:15
auto-merge was automatically disabled February 21, 2026 00:55

Head branch was pushed to by a user without write access

@onestardao
Contributor Author

I pushed a follow-up commit to fix the trailing whitespace that the linter reported.
From my side the checklist doc should now be lint-clean.
Happy to tweak any wording if you’d like.

@AstraBert
Member

Linting is failing because of prettier. In order to be sure that everything is linted correctly, please run:

uv pip install pre-commit
pre-commit install 
pre-commit run -a

Run these from the root folder of the llama_index repo.

@AstraBert AstraBert enabled auto-merge (squash) February 23, 2026 11:57
@AstraBert
Member

Please do not merge main into this branch until the current CI run has finished, so that we can merge this PR without problems. Otherwise I have to keep re-approving the workflows for every commit pushed.

@AstraBert AstraBert merged commit a281640 into run-llama:main Feb 23, 2026
12 checks passed
@onestardao
Contributor Author

Thanks a lot for the review and merge.
And got it on the CI process — will keep that in mind next time.
