
de-snake · 2026-03-24T10:10:56Z

Summary

Add markdown document nodes and cross-reference relationships to GraphRAG. After this change, octocode graphrag get-relationships --node-id docs/credit-accounts.md returns which docs link to it and which docs it links to.

Problem

GraphRAG currently only indexes source code files. Documentation files (.md) are processed for semantic search via document_blocks but are invisible to the knowledge graph. For projects with rich documentation (architecture docs, API guides, tutorials), this misses a major source of structural relationships.

Changes (4 commits)

1. `types.rs` — Add `References` relationship type

New RelationType::References variant for document cross-links
Importance weight 0.6 (between Imports at 0.7 and SiblingModule at 0.3)
Semantic clarity: "docs/a.md references docs/b.md" reads correctly vs "imports"
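
As a rough illustration of the weighting described above, the sketch below models only the three relationship kinds this PR mentions. The method name `importance_weight` and the derive list are assumptions for the example; the real enum in `types.rs` has more variants and may expose weights differently.

```rust
/// Trimmed-down model of the relationship kinds named in this PR.
/// The real enum has more variants; only these three are shown.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum RelationType {
    Imports,
    References,
    SiblingModule,
}

impl RelationType {
    /// Hypothetical accessor. The numbers are the weights quoted in the PR.
    pub fn importance_weight(self) -> f32 {
        match self {
            RelationType::Imports => 0.7,
            // Document cross-links sit between imports and sibling modules.
            RelationType::References => 0.6,
            RelationType::SiblingModule => 0.3,
        }
    }
}
```

Keeping `References` as its own variant, rather than reusing `Imports` with a lower weight, lets queries filter document links separately from code imports.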

2. `markdown.rs` — Extract cross-reference links in Language trait

extract_imports_exports now parses [text](path.md) links via regex
Skips external URLs (http://, https://)
Strips anchor fragments (#section)
Gates on node.parent().is_none() to avoid duplicate extraction during AST walk
resolve_import resolves relative paths (../, ./) against source file directory
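
The relative-path resolution in the last item could look roughly like the following. This is a minimal sketch using only `std::path`; the function name `resolve_relative_link` and its signature are assumptions, and the actual `resolve_import` presumably also checks the result against the set of indexed files, which is omitted here.

```rust
use std::path::{Component, Path, PathBuf};

/// Resolve `link` against the directory of `source_file`, collapsing `.` and `..`.
/// Returns `None` when the link climbs above the root of the relative path.
pub fn resolve_relative_link(source_file: &str, link: &str) -> Option<String> {
    // Directory that contains the linking document ("" for a top-level file).
    let base = Path::new(source_file)
        .parent()
        .unwrap_or_else(|| Path::new(""));

    let mut resolved = PathBuf::new();
    for component in base.join(link).components() {
        match component {
            // "./" contributes nothing.
            Component::CurDir => {}
            // "../" removes the last directory, if there is one.
            Component::ParentDir => {
                if !resolved.pop() {
                    return None;
                }
            }
            other => resolved.push(other.as_os_str()),
        }
    }
    // Normalize separators so graph node ids look the same on every platform.
    Some(resolved.to_string_lossy().replace('\\', "/"))
}
```

With this sketch, a link `../intro/credit-accounts.md` found in `docs/guides/setup.md` resolves to `docs/intro/credit-accounts.md`.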

3. `mod.rs` — Feed markdown into GraphRAG pipeline

When graphrag.enabled, creates a synthetic CodeBlock per markdown file
Pushes into all_code_blocks so GraphBuilder creates a node for it
Actual link extraction happens when builder calls extract_imports_exports_from_file

4. `relationships.rs` — Route markdown relationships to `References` type

When source node language is "markdown", uses References instead of Imports
determine_file_kind returns "document_file" for .md files (was "documentation")
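
The routing described in this section can be sketched as below. The signatures are assumptions made for the example (the real functions take richer inputs), and the `"source_file"` fallback is a placeholder, not a value taken from the codebase.

```rust
use std::path::Path;

#[derive(Debug, PartialEq, Eq)]
pub enum RelationType {
    Imports,
    References,
}

/// Pick the edge type from the source node's language (hypothetical helper).
pub fn relation_type_for(source_language: &str) -> RelationType {
    if source_language == "markdown" {
        RelationType::References
    } else {
        RelationType::Imports
    }
}

/// Simplified stand-in for `determine_file_kind`, keyed on the file extension.
pub fn determine_file_kind(path: &str) -> &'static str {
    match Path::new(path).extension().and_then(|ext| ext.to_str()) {
        // Markdown gets its own kind so document nodes can be queried directly.
        Some("md") => "document_file",
        Some("txt") | Some("rst") => "documentation",
        // Placeholder fallback for this sketch only.
        _ => "source_file",
    }
}
```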

Design Decisions

Why synthetic CodeBlocks? The GraphRAG pipeline is all_code_blocks → GraphBuilder → CodeNode → relationships. Rather than refactoring the builder to accept a new input type, we create minimal CodeBlocks that the existing pipeline handles naturally.
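
To make the idea concrete, here is a hedged sketch of what creating such a block might involve. The `CodeBlock` struct below is a hypothetical, trimmed-down stand-in (the real type has different and more numerous fields), and `synthetic_markdown_block` is an illustrative name.

```rust
/// Hypothetical, minimal stand-in for the indexer's `CodeBlock`.
#[derive(Debug, Clone, PartialEq)]
pub struct CodeBlock {
    pub path: String,
    pub language: String,
    pub content: String,
    pub symbols: Vec<String>,
}

/// Build one placeholder block per markdown file, only when GraphRAG is enabled.
pub fn synthetic_markdown_block(path: &str, graphrag_enabled: bool) -> Option<CodeBlock> {
    if !graphrag_enabled || !path.ends_with(".md") {
        return None;
    }
    Some(CodeBlock {
        path: path.to_string(),
        language: "markdown".to_string(),
        // Left empty on purpose: link extraction happens later, when the
        // builder reads the file itself.
        content: String::new(),
        symbols: Vec::new(),
    })
}
```

The block carries just enough identity (path and language) for the builder to create a node; everything else is derived downstream.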

Why regex in the Language trait? The extract_imports_exports signature receives (node: Node, contents: &str). For markdown, the tree-sitter JSON parser produces a garbage AST, but contents is the full file. We regex-parse contents directly, which keeps the change localized to markdown.rs.
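
The sketch below shows the kind of scan over `contents` this implies. Note that it swaps the PR's regex for a hand-written string search so that it runs on the standard library alone; the function name `extract_markdown_links` is an assumption, and the scan is deliberately simplified (it does not verify the opening `[` or handle link titles).

```rust
/// Collect the targets of inline `[text](path.md)` links in markdown source.
/// Simplified stand-in for the regex-based extraction described in the PR.
pub fn extract_markdown_links(contents: &str) -> Vec<String> {
    let mut links: Vec<String> = Vec::new();
    let mut rest = contents;

    // Every inline link has "](" between its text and its target.
    while let Some(open) = rest.find("](") {
        let after = &rest[open + 2..];
        let Some(close) = after.find(')') else { break };
        let raw = after[..close].trim();
        rest = &after[close + 1..];

        // Skip external URLs.
        if raw.starts_with("http://") || raw.starts_with("https://") {
            continue;
        }
        // Strip the anchor fragment, e.g. "#section".
        let path = raw.split('#').next().unwrap_or("");
        // Keep only markdown targets, without duplicates.
        if path.ends_with(".md") && !links.iter().any(|known| known == path) {
            links.push(path.to_string());
        }
    }
    links
}
```

Because the function only needs `contents`, it can ignore the `node` argument entirely, which is what makes the root-node gate sufficient to avoid repeated work.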

Scope: Explicit [text](path.md) links only. Out of scope: wiki-style [[links]], HTML <a href>, implied topic references, frontmatter metadata.

Testing

All 35 existing GraphRAG tests pass plus new tests:

test_extract_markdown_links — regex link extraction
test_resolve_markdown_import — relative path resolution
test_document_reference_discovery — References type routing

cargo test --lib indexer::graphrag  # 35 tests pass
cargo test --lib indexer::languages::markdown  # 2 new tests pass

…inks

New RelationType::References variant for markdown-to-markdown cross-references. Weight 0.6 (between structural imports at 0.7 and organizational sibling_module at 0.3).

Parse [text](path.md) links as 'imports' in the Language trait. Implements resolve_import with relative path normalization (../ and ./). Skips external URLs and non-.md links. Deduplicates. Root-node gate avoids redundant extraction during recursive AST walk.

Create synthetic CodeBlocks for markdown files in all three indexing paths (git-optimized, standard walker, and file watcher). The builder reads files from disk and calls Markdown::extract_imports_exports for link extraction. Only created when graphrag.enabled = true.

When source file is markdown, relationships from link extraction use RelationType::References instead of Imports. Also split determine_file_kind to return 'document_file' for .md files (separate from .txt/.rst 'documentation').

…rom_existing_database

When GraphRAG rebuilds from existing database (no new commits), it only read from code_blocks.lance, missing all markdown files stored in document_blocks. This caused markdown files to be invisible to GraphRAG despite being indexed. Fix: get_all_code_blocks_for_graphrag() now also creates synthetic CodeBlocks for markdown files found in document_blocks, enabling cross-reference discovery on incremental rebuilds.

…ion in efficient discovery

discover_relationships_efficiently had inline symbol-matching for imports that compared file-path strings like '../intro/credit-accounts.md' against export symbols like 'Credit Accounts', which never matched. Fix: markdown nodes now use discover_import_relationships (with proper resolve_import path resolution) instead of the symbol-matching path. Added integration test that calls discover_relationships_efficiently directly with markdown nodes to prove References edges are produced.

The debug logging confirmed that resolve_import works correctly (1420/1567 resolutions succeed). The remaining issue is in upstream batch storage: save_graph_incremental silently drops relationship batches after the first 2 when processing 2K+ nodes. This is a pre-existing scalability issue, not related to our doc-graphrag changes.

de-snake · 2026-03-26T15:33:06Z

Update: 8 commits now (2 additional fixes + 1 test + 1 cleanup)

Additional commits since PR open:

fix(graphrag): include markdown files from document_blocks in build_from_existing_database — When GraphRAG rebuilds from existing DB (no new commits), it only read code_blocks.lance, missing all markdown files stored in document_blocks. Fixed get_all_code_blocks_for_graphrag() to also create synthetic CodeBlocks from document_blocks.
fix(graphrag): route markdown nodes through path-based import resolution — discover_relationships_efficiently used inline symbol-matching that compared file paths like ../intro/credit-accounts.md against export symbols like Credit Accounts — which never matched. Fixed by routing markdown nodes through discover_import_relationships (with proper resolve_import).
test: integration test for markdown references via efficient discovery — Tests the exact runtime code path (discover_relationships_efficiently) with markdown nodes, proving References edges are produced correctly.
chore: remove debug logging — Cleanup.

Verified working:

1030 document_file nodes created (was 0)
Import extraction: 349 markdown files have imports, 1420/1567 resolve successfully
Integration test passes: discover_relationships_efficiently produces References edges
All 40 tests pass (36 graphrag + 4 markdown)

Known limitation (pre-existing upstream issue):

At scale (2K+ nodes), save_graph_incremental in builder.rs silently drops relationship batches after storing ~2 of 4499 batches. This means the generated References relationships are correct but never persisted to LanceDB. This affects ALL relationship types at scale, not just our doc-graphrag changes. The batch storage loop at lines 583-608 appears to have a concurrency or timeout issue when writing large numbers of relationships.

This limitation is pre-existing — the same behavior occurs with code-only relationships on large repos. It should be addressed in a separate PR focused on relationship storage scalability.

get_graph_relationships had a TODO comment: 'For simplicity, return the first batch. In production, you might want to concatenate all batches.' This meant only ~64 of 575K relationships were returned on large repos. Fix: use arrow::compute::concat_batches to merge all result batches. Verified: 2838 references + 500K imports + 70K sibling_module now load correctly.

Batch writes produce duplicate relationships across incremental flushes. Added HashSet-based dedup by (source, target, type) triple when loading relationships from database. Prints dedup stats when not in quiet mode.
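
A minimal sketch of the triple-based dedup described above, assuming a simplified `Relationship` struct (the real one has more fields) and a hypothetical function name. It keeps the first occurrence of each (source, target, type) triple and reports how many rows were dropped.

```rust
use std::collections::HashSet;

/// Simplified stand-in for a stored relationship row.
#[derive(Debug, Clone, PartialEq)]
pub struct Relationship {
    pub source: String,
    pub target: String,
    pub relation_type: String,
    pub weight: f32,
}

/// Drop rows whose (source, target, type) triple was already seen.
/// Returns the unique rows in their original order plus the number removed.
pub fn dedup_relationships(rels: Vec<Relationship>) -> (Vec<Relationship>, usize) {
    let before = rels.len();
    let mut seen: HashSet<(String, String, String)> = HashSet::new();

    let unique: Vec<Relationship> = rels
        .into_iter()
        // `insert` returns false when the triple is already present.
        .filter(|r| {
            seen.insert((
                r.source.clone(),
                r.target.clone(),
                r.relation_type.clone(),
            ))
        })
        .collect();

    let dropped = before - unique.len();
    (unique, dropped)
}
```

The weight is intentionally left out of the key, so two flushes of the same edge collapse even if a weight was recomputed in between.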

donhardman · 2026-03-28T15:47:28Z

Hey! Thanks for the pull request. I would like to merge it once CI passes.

de-snake added 6 commits March 24, 2026 12:35

feat(graphrag): add References relationship type for document cross-l…

b5ce920


donhardman self-requested a review March 25, 2026 17:14

de-snake added 2 commits March 26, 2026 19:10

feat: Document-level GraphRAG — markdown nodes + cross-reference relationships#39

de-snake wants to merge 9 commits into Muvon:master from de-snake:feat/doc-graphrag
