feat: Document-level GraphRAG — markdown nodes + cross-reference relationships #39
de-snake wants to merge 9 commits into
…inks New RelationType::References variant for markdown-to-markdown cross-references. Weight 0.6 (between structural imports at 0.7 and organizational sibling_module at 0.3).
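A minimal sketch of how the new variant's weight could sit between the two existing ones. Only the variant names and the three weights come from the commit message; the enum shape and the `weight` method name are assumptions for illustration, not the actual `types.rs` definition.

```rust
/// Hypothetical subset of the relation types; the real enum likely has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum RelationType {
    Imports,
    References,
    SiblingModule,
}

impl RelationType {
    /// Edge weight for ranking relationships (method name is an assumption).
    pub fn weight(self) -> f32 {
        match self {
            RelationType::Imports => 0.7,       // structural
            RelationType::References => 0.6,    // markdown-to-markdown cross-reference
            RelationType::SiblingModule => 0.3, // organizational
        }
    }
}
```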
Parse [text](path.md) links as 'imports' in the Language trait. Implements resolve_import with relative path normalization (../ and ./). Skips external URLs and non-.md links. Deduplicates. Root-node gate avoids redundant extraction during recursive AST walk.
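The filtering rules above can be sketched with the standard library alone. The PR uses a regex; this version swaps in a hand-written scan for `](` so the example has no dependencies. The function name is hypothetical, and stripping a trailing `#fragment` before the extension check is an assumption about how anchored links are handled.

```rust
use std::collections::HashSet;

/// Collect unique relative `.md` link targets from markdown text.
/// Simplified: link targets containing nested parentheses are not handled.
pub fn extract_markdown_links(contents: &str) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut links = Vec::new();
    let mut rest = contents;

    // `](` separates the link text from the link target.
    while let Some(start) = rest.find("](") {
        let after = &rest[start + 2..];
        let Some(end) = after.find(')') else { break };
        let target = after[..end].trim();
        rest = &after[end + 1..];

        // Skip external URLs.
        if target.starts_with("http://") || target.starts_with("https://") {
            continue;
        }
        // Assumption: drop a trailing `#fragment` before checking the extension.
        let path = target.split('#').next().unwrap_or("");
        // Skip non-.md links.
        if !path.ends_with(".md") {
            continue;
        }
        // Deduplicate while preserving first-seen order.
        if seen.insert(path.to_string()) {
            links.push(path.to_string());
        }
    }
    links
}
```

A repeated link to the same file yields a single entry, so one source document contributes at most one edge per target.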
Create synthetic CodeBlocks for markdown files in all three indexing paths (git-optimized, standard walker, and file watcher). The builder reads files from disk and calls Markdown::extract_imports_exports for link extraction. Only created when graphrag.enabled = true.
When source file is markdown, relationships from link extraction use RelationType::References instead of Imports. Also split determine_file_kind to return 'document_file' for .md files (separate from .txt/.rst 'documentation').
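The routing described here might look roughly like the sketch below. The `"document_file"` and `"documentation"` strings and the `References`/`Imports` split come from the commit message; the function names, their signatures, and the `"source_file"` fallback are placeholders, not the code in `relationships.rs`.

```rust
use std::path::Path;

#[derive(Debug, PartialEq, Eq)]
pub enum RelationType {
    Imports,
    References,
}

/// Classify a file by extension (hypothetical helper).
pub fn file_kind_for(path: &str) -> &'static str {
    match Path::new(path).extension().and_then(|e| e.to_str()) {
        Some("md") => "document_file",
        Some("txt") | Some("rst") => "documentation",
        _ => "source_file", // placeholder for every other kind
    }
}

/// Pick the relation type for links extracted from a file in `source_language`.
pub fn relation_type_for(source_language: &str) -> RelationType {
    if source_language == "markdown" {
        RelationType::References
    } else {
        RelationType::Imports
    }
}
```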
…rom_existing_database When GraphRAG rebuilds from an existing database (no new commits), it read only from code_blocks.lance and missed all markdown files stored in document_blocks. As a result, markdown files were invisible to GraphRAG despite being indexed. Fix: get_all_code_blocks_for_graphrag() now also creates synthetic CodeBlocks for markdown files found in document_blocks, enabling cross-reference discovery on incremental rebuilds.
…ion in efficient discovery discover_relationships_efficiently had inline symbol-matching for imports that compared file-path strings like '../intro/credit-accounts.md' against export symbols like 'Credit Accounts' — which never matched. Fix: markdown nodes now use discover_import_relationships (with proper resolve_import path resolution) instead of the symbol-matching path. Added integration test that calls discover_relationships_efficiently directly with markdown nodes to prove References edges are produced.
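To illustrate why path resolution succeeds where symbol matching could not: a link such as `../intro/credit-accounts.md` only becomes comparable once it is normalized against the source file's directory and looked up among known file paths. The sketch below does this lexically with `std::path`; both function names are hypothetical and this is not the PR's `resolve_import`.

```rust
use std::collections::HashSet;
use std::path::{Component, Path, PathBuf};

/// Resolve `link` against the directory of `source_file`, collapsing `.` and `..`
/// without touching the filesystem. Returns None if the link escapes the root.
pub fn resolve_relative_link(source_file: &str, link: &str) -> Option<String> {
    let base = Path::new(source_file).parent().unwrap_or(Path::new(""));
    let mut out = PathBuf::new();
    for comp in base.join(link).components() {
        match comp {
            Component::CurDir => {}
            Component::ParentDir => {
                // Nothing left to pop means the link points above the root.
                if !out.pop() {
                    return None;
                }
            }
            other => out.push(other.as_os_str()),
        }
    }
    Some(out.to_string_lossy().replace('\\', "/"))
}

/// A References edge is only produced when the resolved path is a known node.
pub fn reference_target(source_file: &str, link: &str, known: &HashSet<String>) -> Option<String> {
    resolve_relative_link(source_file, link).filter(|p| known.contains(p))
}
```

Comparing the raw link string against an export symbol like `Credit Accounts` can never match, whereas the resolved `docs/intro/credit-accounts.md` can be checked directly against the set of indexed files.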
The debug logging confirmed that resolve_import works correctly (1420/1567 resolutions succeed). The remaining issue is in upstream batch storage: save_graph_incremental silently drops relationship batches after the first 2 when processing 2K+ nodes. This is a pre-existing scalability issue, not related to our doc-graphrag changes.
Update: 8 commits now (3 additional fixes + 1 test + 1 cleanup)
Additional commits since PR open:
Verified working:
Known limitation (pre-existing upstream issue): At scale (2K+ nodes), save_graph_incremental silently drops relationship batches after the first 2. This limitation is pre-existing: the same behavior occurs with code-only relationships on large repos. It should be addressed in a separate PR focused on relationship storage scalability.
get_graph_relationships had a TODO comment: 'For simplicity, return the first batch. In production, you might want to concatenate all batches.' This meant only ~64 of 575K relationships were returned on large repos. Fix: use arrow::compute::concat_batches to merge all result batches. Verified: 2838 references + 500K imports + 70K sibling_module now load correctly.
Batch writes produce duplicate relationships across incremental flushes. Added HashSet-based dedup by (source, target, type) triple when loading relationships from database. Prints dedup stats when not in quiet mode.
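A minimal sketch of HashSet-based dedup keyed on the (source, target, type) triple. The `Relationship` struct and its field names are assumptions made for the example; the real record type in the project may differ.

```rust
use std::collections::HashSet;

/// Hypothetical relationship record; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Relationship {
    pub source: String,
    pub target: String,
    pub relation_type: String,
    pub weight: f32,
}

/// Keep the first relationship for each (source, target, type) triple.
/// Returns the deduplicated list and the number of duplicates dropped.
pub fn dedup_relationships(rels: Vec<Relationship>) -> (Vec<Relationship>, usize) {
    let total = rels.len();
    let mut seen: HashSet<(String, String, String)> = HashSet::new();
    let unique: Vec<Relationship> = rels
        .into_iter()
        // `insert` returns false when the triple was already present.
        .filter(|r| seen.insert((r.source.clone(), r.target.clone(), r.relation_type.clone())))
        .collect();
    let dropped = total - unique.len();
    (unique, dropped)
}
```

The dropped count is what a caller could print as dedup stats when not in quiet mode. Two edges between the same files with different types are kept as distinct relationships.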
Hey! Thanks for the pull request. I would like to merge it once CI passes.
Summary

Add markdown document nodes and cross-reference relationships to GraphRAG. After this change, `octocode graphrag get-relationships --node-id docs/credit-accounts.md` returns which docs link to it and which docs it links to.

Problem
GraphRAG currently only indexes source code files. Documentation files (`.md`) are processed for semantic search via `document_blocks` but are invisible to the knowledge graph. For projects with rich documentation (architecture docs, API guides, tutorials), this misses a major source of structural relationships.

Changes (4 commits)
1. `types.rs`: Add `References` relationship type
   - `RelationType::References` variant for document cross-links
   - Weight 0.6 (between `Imports` at 0.7 and `SiblingModule` at 0.3)
2. `markdown.rs`: Extract cross-reference links in Language trait
   - `extract_imports_exports` now parses `[text](path.md)` links via regex
   - Skips external URLs (`http://`, `https://`)
   - … (`#section`)
   - … `node.parent().is_none()` to avoid duplicate extraction during AST walk
   - `resolve_import` resolves relative paths (`../`, `./`) against source file directory
3. `mod.rs`: Feed markdown into GraphRAG pipeline
   - When `graphrag.enabled`, creates a synthetic `CodeBlock` per markdown file
   - … `all_code_blocks` so `GraphBuilder` creates a node for it
   - … `extract_imports_exports_from_file`
4. `relationships.rs`: Route markdown relationships to `References` type
   - … `"markdown"`, uses `References` instead of `Imports`
   - `determine_file_kind` returns `"document_file"` for `.md` files (was `"documentation"`)

Design Decisions
Why synthetic CodeBlocks? The GraphRAG pipeline is `all_code_blocks → GraphBuilder → CodeNode → relationships`. Rather than refactoring the builder to accept a new input type, we create minimal CodeBlocks that the existing pipeline handles naturally.

Why regex in the Language trait? The `extract_imports_exports` signature receives `(node: Node, contents: &str)`. For markdown, the tree-sitter JSON parser produces a garbage AST, but `contents` is the full file. We regex-parse `contents` directly, which keeps the change localized to `markdown.rs`.

Scope: Explicit `[text](path.md)` links only. Out of scope: wiki-style `[[links]]`, HTML `<a href>`, implied topic references, frontmatter metadata.

Testing
All 35 existing GraphRAG tests pass plus new tests:
- `test_extract_markdown_links`: regex link extraction
- `test_resolve_markdown_import`: relative path resolution
- `test_document_reference_discovery`: References type routing