feat: Document-level GraphRAG — markdown nodes + cross-reference relationships#39

Open
de-snake wants to merge 9 commits into Muvon:master from de-snake:feat/doc-graphrag

@de-snake

Summary

Add markdown document nodes and cross-reference relationships to GraphRAG. After this change, octocode graphrag get-relationships --node-id docs/credit-accounts.md returns which docs link to it and which docs it links to.

Problem

GraphRAG currently only indexes source code files. Documentation files (.md) are processed for semantic search via document_blocks but are invisible to the knowledge graph. For projects with rich documentation (architecture docs, API guides, tutorials), this misses a major source of structural relationships.

Changes (4 commits)

1. types.rs — Add References relationship type

  • New RelationType::References variant for document cross-links
  • Importance weight 0.6 (between Imports at 0.7 and SiblingModule at 0.3)
  • Semantic clarity: "docs/a.md references docs/b.md" reads correctly vs "imports"
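
For illustration, the weight ordering described above could look roughly like the following. This is a minimal sketch, not the code from types.rs: the enum is trimmed to three variants and the `importance_weight` method name is an assumption. Only the three weights come from this PR.

```rust
/// Trimmed-down, hypothetical version of the relationship enum.
/// The real enum in types.rs has more variants.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RelationType {
    Imports,
    References,
    SiblingModule,
}

impl RelationType {
    /// Weights quoted in this PR. The method name is an assumption.
    fn importance_weight(self) -> f32 {
        match self {
            RelationType::Imports => 0.7,
            RelationType::References => 0.6,
            RelationType::SiblingModule => 0.3,
        }
    }
}

/// Sort relationship kinds from most to least important.
fn rank_by_importance(mut kinds: Vec<RelationType>) -> Vec<RelationType> {
    kinds.sort_by(|a, b| {
        b.importance_weight()
            .partial_cmp(&a.importance_weight())
            .expect("weights are finite")
    });
    kinds
}
```

With these weights, a document cross-reference ranks just below a code import and well above a sibling-module edge.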

2. markdown.rs — Extract cross-reference links in Language trait

  • extract_imports_exports now parses [text](path.md) links via regex
  • Skips external URLs (http://, https://)
  • Strips anchor fragments (#section)
  • Gates on node.parent().is_none() to avoid duplicate extraction during AST walk
  • resolve_import resolves relative paths (../, ./) against source file directory
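
To make the filtering rules above concrete, here is a rough sketch of the extraction step. The PR uses a regex. This version swaps in a small hand-written scanner so it runs with the standard library only, and the function name `extract_markdown_links` is hypothetical. It also does not check for the opening `[`, so treat it as an approximation of the rules rather than the actual implementation.

```rust
/// Collect targets of inline `[text](target)` links that point at `.md` files.
/// External URLs are skipped, `#anchor` fragments are stripped, and
/// duplicates are dropped while keeping first-seen order.
fn extract_markdown_links(contents: &str) -> Vec<String> {
    let mut links: Vec<String> = Vec::new();
    let mut rest = contents;

    // `](` separates the link text from the link target.
    while let Some(pos) = rest.find("](") {
        let after = &rest[pos + 2..];
        let end = match after.find(')') {
            Some(end) => end,
            None => break, // unterminated link target
        };
        let target = after[..end].trim();
        rest = &after[end + 1..];

        // Skip external URLs.
        if target.starts_with("http://") || target.starts_with("https://") {
            continue;
        }
        // Strip the anchor fragment, if any.
        let path = target.split('#').next().unwrap_or("");
        // Only markdown-to-markdown links are in scope.
        if !path.ends_with(".md") {
            continue;
        }
        // Deduplicate.
        if !links.iter().any(|seen| seen == path) {
            links.push(path.to_string());
        }
    }
    links
}
```

Given a file that links to `../intro/credit-accounts.md#overview` twice, an external site, and an image, only the single path `../intro/credit-accounts.md` is returned.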

3. mod.rs — Feed markdown into GraphRAG pipeline

  • When graphrag.enabled, creates a synthetic CodeBlock per markdown file
  • Pushes into all_code_blocks so GraphBuilder creates a node for it
  • Actual link extraction happens when builder calls extract_imports_exports_from_file
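
A sketch of what the synthetic block creation might look like. The `CodeBlock` struct here is a hypothetical stand-in with only three fields (the real one has more, and the field names are assumptions), and `synthetic_markdown_blocks` is an illustrative helper, not a function from mod.rs.

```rust
/// Hypothetical stand-in for the indexer's CodeBlock.
#[derive(Debug, Clone, PartialEq)]
struct CodeBlock {
    path: String,
    language: String,
    content: String,
}

/// Build one synthetic block per markdown file, but only when GraphRAG is on.
fn synthetic_markdown_blocks(paths: &[&str], graphrag_enabled: bool) -> Vec<CodeBlock> {
    if !graphrag_enabled {
        return Vec::new();
    }
    paths
        .iter()
        .copied()
        .filter(|p| p.ends_with(".md"))
        .map(|p| CodeBlock {
            path: p.to_string(),
            language: "markdown".to_string(),
            // Left empty in this sketch: the builder re-reads the file from
            // disk when it extracts links, so the block only has to announce
            // that the file exists.
            content: String::new(),
        })
        .collect()
}
```

The point of the design is that the block carries just enough information for GraphBuilder to create a node, so nothing downstream needs a new input type.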

4. relationships.rs — Route markdown relationships to References type

  • When source node language is "markdown", uses References instead of Imports
  • determine_file_kind returns "document_file" for .md files (was "documentation")
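
The routing rule can be sketched as below. Both function signatures are assumptions made for illustration (the real `determine_file_kind` likely takes different arguments and knows more kinds), and the `"source_file"` fallback is a placeholder. The `"document_file"` and `"documentation"` strings and the markdown check come from this PR.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum RelationType {
    Imports,
    References,
}

/// Pick the relationship type from the source node's language.
fn relation_type_for(source_language: &str) -> RelationType {
    if source_language == "markdown" {
        RelationType::References
    } else {
        RelationType::Imports
    }
}

/// Classify a file by extension (hypothetical signature).
fn determine_file_kind(path: &str) -> &'static str {
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("");
    match ext {
        "md" => "document_file",
        "txt" | "rst" => "documentation",
        // Placeholder: the real function distinguishes more kinds.
        _ => "source_file",
    }
}
```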

Design Decisions

Why synthetic CodeBlocks? The GraphRAG pipeline is all_code_blocks → GraphBuilder → CodeNode → relationships. Rather than refactoring the builder to accept a new input type, we create minimal CodeBlocks that the existing pipeline handles naturally.

Why regex in the Language trait? The extract_imports_exports signature receives (node: Node, contents: &str). For markdown, the tree-sitter JSON parser produces a garbage AST, but contents is the full file, so we regex-parse contents directly. That keeps the change localized to markdown.rs.

Scope: Explicit [text](path.md) links only. Out of scope: wiki-style [[links]], HTML <a href>, implied topic references, frontmatter metadata.

Testing

All 35 existing GraphRAG tests pass plus new tests:

  • test_extract_markdown_links — regex link extraction
  • test_resolve_markdown_import — relative path resolution
  • test_document_reference_discovery — References type routing

cargo test --lib indexer::graphrag  # 35 tests pass
cargo test --lib indexer::languages::markdown  # 2 new tests pass

…inks

New RelationType::References variant for markdown-to-markdown
cross-references. Weight 0.6 (between structural imports at 0.7
and organizational sibling_module at 0.3).

Parse [text](path.md) links as 'imports' in the Language trait.
Implements resolve_import with relative path normalization (../ and ./).
Skips external URLs and non-.md links. Deduplicates.
Root-node gate avoids redundant extraction during recursive AST walk.

Create synthetic CodeBlocks for markdown files in all three indexing
paths (git-optimized, standard walker, and file watcher). The builder
reads files from disk and calls Markdown::extract_imports_exports for
link extraction. Only created when graphrag.enabled = true.

When source file is markdown, relationships from link extraction use
RelationType::References instead of Imports. Also split determine_file_kind
to return 'document_file' for .md files (separate from .txt/.rst 'documentation').
…rom_existing_database

When GraphRAG rebuilt from an existing database (no new commits), it read only
from code_blocks.lance, missing all markdown files stored in document_blocks.
This left markdown files invisible to GraphRAG despite being indexed.

Fix: get_all_code_blocks_for_graphrag() now also creates synthetic CodeBlocks
for markdown files found in document_blocks, enabling cross-reference discovery
on incremental rebuilds.
…ion in efficient discovery

discover_relationships_efficiently had inline symbol-matching for imports
that compared file-path strings like '../intro/credit-accounts.md' against
export symbols like 'Credit Accounts' — which never matched.

Fix: markdown nodes now use discover_import_relationships (with proper
resolve_import path resolution) instead of the symbol-matching path.

Added integration test that calls discover_relationships_efficiently
directly with markdown nodes to prove References edges are produced.
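
To show why path resolution succeeds where symbol matching could not, here is a minimal sketch of lexical relative-path resolution. The name `resolve_markdown_link` and its signature are hypothetical (the PR's `resolve_import` is part of the Language trait and may behave differently), and the sketch assumes forward-slash paths relative to the repository root.

```rust
use std::path::{Component, Path, PathBuf};

/// Resolve `link` against the directory of `source_file` and normalize
/// `.` and `..` lexically, without touching the filesystem.
/// Returns None when the link climbs above the repository root.
fn resolve_markdown_link(link: &str, source_file: &str) -> Option<String> {
    let base = Path::new(source_file)
        .parent()
        .unwrap_or_else(|| Path::new(""));

    let mut resolved = PathBuf::new();
    for component in base.join(link).components() {
        match component {
            Component::CurDir => {}
            Component::ParentDir => {
                // `pop` returns false when there is nothing left to remove.
                if !resolved.pop() {
                    return None;
                }
            }
            other => resolved.push(other.as_os_str()),
        }
    }
    resolved.to_str().map(|s| s.to_string())
}
```

A link `../intro/credit-accounts.md` found in `docs/guide/setup.md` resolves to `docs/intro/credit-accounts.md`, which can be compared directly against a node's file path. The raw link string would never equal an export symbol such as `Credit Accounts`.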
@donhardman self-requested a review on March 25, 2026 17:14

The debug logging confirmed that resolve_import works correctly (1420/1567
resolutions succeed). The remaining issue is in upstream batch storage:
save_graph_incremental silently drops relationship batches after the first 2
when processing 2K+ nodes. This is a pre-existing scalability issue, not
related to our doc-graphrag changes.
@de-snake
Author

Update: 8 commits now (2 additional fixes + 1 test + 1 cleanup)

Additional commits since PR open:

  1. fix(graphrag): include markdown files from document_blocks in build_from_existing_database. When GraphRAG rebuilt from the existing DB (no new commits), it read only code_blocks.lance, missing all markdown files stored in document_blocks. Fixed get_all_code_blocks_for_graphrag() to also create synthetic CodeBlocks from document_blocks.

  2. fix(graphrag): route markdown nodes through path-based import resolution. discover_relationships_efficiently used inline symbol-matching that compared file paths like ../intro/credit-accounts.md against export symbols like Credit Accounts, which never matched. Fixed by routing markdown nodes through discover_import_relationships (with proper resolve_import).

  3. test: integration test for markdown references via efficient discovery. Tests the exact runtime code path (discover_relationships_efficiently) with markdown nodes, proving References edges are produced correctly.

  4. chore: remove debug logging. Cleanup.

Verified working:

  • 1030 document_file nodes created (was 0)
  • Import extraction: 349 markdown files have imports, 1420/1567 resolve successfully
  • Integration test passes: discover_relationships_efficiently produces References edges
  • All 40 tests pass (36 graphrag + 4 markdown)

Known limitation (pre-existing upstream issue):

At scale (2K+ nodes), save_graph_incremental in builder.rs silently drops relationship batches after storing ~2 of 4499 batches. This means the generated References relationships are correct but never persisted to LanceDB. This affects ALL relationship types at scale, not just our doc-graphrag changes. The batch storage loop at lines 583-608 appears to have a concurrency or timeout issue when writing large numbers of relationships.

This limitation is pre-existing — the same behavior occurs with code-only relationships on large repos. It should be addressed in a separate PR focused on relationship storage scalability.

get_graph_relationships had a TODO comment: 'For simplicity, return the
first batch. In production, you might want to concatenate all batches.'
This meant only ~64 of 575K relationships were returned on large repos.

Fix: use arrow::compute::concat_batches to merge all result batches.

Verified: 2838 references + 500K imports + 70K sibling_module now load correctly.

Batch writes produce duplicate relationships across incremental flushes.
Added HashSet-based dedup by (source, target, type) triple when loading
relationships from database. Prints dedup stats when not in quiet mode.
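
The dedup step described in this commit might look roughly like the sketch below. The `Relationship` struct and the `dedup_relationships` helper are hypothetical stand-ins with assumed field names. Only the idea of keying a HashSet on the (source, target, type) triple comes from the commit message.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified relationship record.
#[derive(Debug, Clone, PartialEq)]
struct Relationship {
    source: String,
    target: String,
    relation_type: String,
    weight: f32,
}

/// Keep the first occurrence of each (source, target, type) triple and
/// report how many duplicates were dropped.
fn dedup_relationships(rels: Vec<Relationship>) -> (Vec<Relationship>, usize) {
    let before = rels.len();
    let mut seen: HashSet<(String, String, String)> = HashSet::new();

    let unique: Vec<Relationship> = rels
        .into_iter()
        // `insert` returns false when the triple was already present.
        .filter(|r| {
            seen.insert((
                r.source.clone(),
                r.target.clone(),
                r.relation_type.clone(),
            ))
        })
        .collect();

    let dropped = before - unique.len();
    (unique, dropped)
}
```

The weight is deliberately left out of the key, so two rows that differ only in weight still count as the same edge. The dropped count is what a non-quiet run could print as its dedup stats.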
@donhardman
Contributor

Hey! Thanks for the pull request. I would like to merge it once CI passes.
