feat(read_file): add DOCX, XLSX, PPTX office document support #3336

Merged
Re-bin merged 1 commit into HKUDS:main from aiguozhi123456:feat/read-office-documents
Apr 21, 2026

Conversation

@aiguozhi123456 commented Apr 20, 2026

Contributor

Summary

Extend read_file tool to support reading office documents (DOCX, XLSX, PPTX) by connecting to the existing extract_text() utility in nanobot/utils/document.py.

Context

Wire up the existing office document extractors in document.py to
ReadFileTool by adding an extension guard and _read_office_doc() method
that follows the established PDF pattern. Handles missing libraries,
corrupt files, empty documents, and 128K truncation consistently.

Intentionally minimal: does not refactor the document.py error protocol or
unify truncation footers, to avoid behavioral changes in the existing PDF path.

Why

The ReadFileTool supported UTF-8 text, images, and PDFs, but rejected .docx, .xlsx, and .pptx files with "Error: Cannot read binary file". Meanwhile, nanobot/utils/document.py already contained complete extraction logic (_extract_docx(), _extract_xlsx(), _extract_pptx()) with proper library guards, error handling, and truncation — it was simply not wired up to the tool layer.

Solution

Connected the two layers by adding an extension guard in execute() (matching the existing PDF pattern) and a new _read_office_doc() method that delegates to document.extract_text().
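
The ordering of the checks described above can be sketched as follows. This is a minimal illustration, not the code from filesystem.py: `read_dispatch`, `OFFICE_EXTENSIONS`, the returned labels, and the lowercasing of the suffix are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical constant; the PR describes a guard for these three suffixes.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def read_dispatch(path: str) -> str:
    """Return a label for the reader that would handle `path` (illustrative only)."""
    suffix = Path(path).suffix.lower()  # assumption: case-insensitive match
    if suffix == ".pdf":
        return "pdf"
    # The office check runs after the PDF check and before any raw-bytes read,
    # so the zip container of a .docx is never treated as generic binary.
    if suffix in OFFICE_EXTENSIONS:
        return "office"
    return "raw"
```

Placing the office branch before the raw read is the point of the sketch: a later position would let the binary-file rejection fire first.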

Changes

| File | Change |
| --- | --- |
| nanobot/agent/tools/filesystem.py:173-175 | Add extension guard for .docx/.xlsx/.pptx after PDF check, before raw bytes read |
| nanobot/agent/tools/filesystem.py:312-329 | Add _read_office_doc() — delegates to extract_text(), handles errors, empty docs, and 128K truncation |
| nanobot/agent/tools/filesystem.py:139-146 | Update tool description to advertise document format support |
| tests/tools/test_read_enhancements.py | Add 12 new tests: happy paths (DOCX/XLSX/PPTX), errors (missing lib, corrupt, unsupported), truncation, empty doc, description |

Tests

Completed (29 passed, 2 skipped — pre-existing PDF skips):

  • DOCX returns extracted text
  • XLSX returns extracted text with sheet headers
  • PPTX returns extracted text with slide headers
  • Missing library returns actionable error message
  • Corrupt file returns error (not unhandled exception)
  • Unsupported extension returns error
  • Empty document returns descriptive message (consistent with PDF)
  • Large document (>128K) is truncated with footer
  • Small document is not truncated
  • Error responses are not truncated
  • Tool description mentions document support
  • Tool description no longer says "cannot read"
  • All existing tests (text, PDF, image, dedup, device blacklist) continue to pass

Not yet done:

  • Boundary test at exactly _MAX_CHARS (off-by-one guard)
  • Error path tests for .xlsx and .pptx (currently only .docx tested)
  • Integration test with real office document libraries (current tests mock extract_text)

Enhancement Directions

  1. Pagination for office documents — PDF has a pages parameter; office docs lack equivalent. Large XLSX with many sheets truncates at 128K with no way to continue. Consider adding sheet/slide-level pagination in a follow-up.

  2. Structured error protocol — Error detection relies on startswith("[error:") string sniffing. A shared constant or discriminated return type from extract_text() would eliminate the fragile coupling between filesystem.py and document.py.

  3. Dead code cleanup — The result is None guard in _read_office_doc() is unreachable because the extension gate already filters to .docx/.xlsx/.pptx. Harmless but misleading to future maintainers.
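
One possible shape for the discriminated return type mentioned in item 2 is sketched below. Nothing here exists in document.py: `Extracted`, `ExtractionError`, `Unsupported`, and `describe` are hypothetical names, and the message strings are placeholders.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Extracted:
    """Successful extraction; text may be empty for an empty document."""
    text: str


@dataclass(frozen=True)
class ExtractionError:
    """Missing library, corrupt file, or similar failure."""
    reason: str


@dataclass(frozen=True)
class Unsupported:
    """The suffix is not handled by any extractor."""
    suffix: str


ExtractResult = Union[Extracted, ExtractionError, Unsupported]


def describe(result: ExtractResult) -> str:
    """Map each variant to tool output without inspecting string prefixes."""
    if isinstance(result, Extracted):
        return result.text or "(empty document)"
    if isinstance(result, ExtractionError):
        return f"Error: {result.reason}"
    return f"Error: unsupported format {result.suffix}"
```

With a type like this, a document whose real text happens to begin with "[error:" would still be treated as content, which prefix sniffing cannot guarantee.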



@Re-bin left a comment

Collaborator

LGTM — clean delegation, mirrors the existing PDF branch exactly as advertised.

What I verified locally:

  • Merged origin/main — clean (main just picked up #3353's xlsx close fix).
  • pytest tests/tools/test_read_enhancements.py tests/test_document_parsing.py → 51 passed.
  • pytest tests/tools/ full sweep → 295 passed, no regressions.
  • 8/8 CI matrix green.

Why the shape is right:

  • Extension gate at filesystem.py:173-175 sits right after the PDF branch and before fp.read_bytes() — the correct insertion point, otherwise the zip bytes of a .docx would be handled as generic binary.
  • _read_office_doc() handles all four possible returns from document.extract_text() one-for-one: None (unsupported ext), "[error:..." (missing lib / corrupt), "" (empty doc), non-empty success → optional 128K truncate. I walked through document.py::extract_text to confirm the contract matches.
  • Test strategy is correctly layered: mocking extract_text to test the wiring is right — the extractors themselves are already covered with real libraries in tests/test_document_parsing.py. No need to duplicate.
  • The "[error:" prefix sniff matches the pre-existing convention in document.py::extract_documents (line 287), so the PR conforms to the existing protocol rather than inventing a new one.
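
The four-way handling described in the list above can be sketched as a pure function. This is an illustration under assumptions, not the merged _read_office_doc(): `format_office_result`, the message wording, the footer text, and the value of `MAX_CHARS` (the PR only says "128K") are all placeholders.

```python
from typing import Optional

MAX_CHARS = 128_000  # assumption: stands in for _MAX_CHARS; exact value not stated


def format_office_result(
    result: Optional[str], suffix: str, max_chars: int = MAX_CHARS
) -> str:
    """Turn one of the four extract_text() outcomes into tool output."""
    if result is None:
        # Unsupported extension.
        return f"Error: Unsupported document format: {suffix}"
    if result.startswith("[error:"):
        # Missing library or corrupt file; errors are returned untruncated.
        return f"Error: {result}"
    if not result:
        # Empty document.
        return f"({suffix} document contains no extractable text)"
    if len(result) > max_chars:
        # Strictly greater than: text of exactly max_chars is left untouched.
        return result[:max_chars] + "\n\n(truncated)"
    return result
```

Checking the error prefix before the length check is what keeps error responses from being truncated, as the test list in the PR body requires.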

On the self-flagged "not yet done" items:

  • Boundary test at exactly _MAX_CHARS: len > _MAX_CHARS has no off-by-one room; trusting Python's > is fine.
  • .xlsx/.pptx error path tests: _read_office_doc uses the same code for all three extensions (only fp.suffix differs in the formatted string). The .docx error test covers the logic. Adding parallel tests for .xlsx/.pptx would be coverage padding.
  • Real-library integration — already covered one layer down. Not this PR's job.

On the "result is None is unreachable" note in the PR body: I'd actually keep it. The outer gate and extract_text's supported-extension set are maintained in two files; the None guard is cheap defensive alignment so that if someone adds e.g. .odt to the outer gate without touching extract_text, we return a clean error instead of crashing. test_unsupported_extension pins that invariant.
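
The drift scenario above can be reproduced in a few lines. The sketch uses stand-ins only: `OUTER_GATE`, `INNER_SUPPORTED`, `extract_text_stub`, and `read_office_doc` are hypothetical names, and the error strings are placeholders rather than the tool's actual messages.

```python
# Hypothetical suffix sets, maintained separately like the two files in the PR.
OUTER_GATE = {".docx", ".xlsx", ".pptx", ".odt"}  # .odt was added here only
INNER_SUPPORTED = {".docx", ".xlsx", ".pptx"}


def extract_text_stub(suffix):
    """Stand-in for extract_text(): None signals an unsupported suffix."""
    if suffix not in INNER_SUPPORTED:
        return None
    return f"text from {suffix}"


def read_office_doc(suffix):
    """Outer gate plus the defensive None guard."""
    if suffix not in OUTER_GATE:
        return "Error: Cannot read binary file"
    result = extract_text_stub(suffix)
    if result is None:
        # Reached only when the two sets have drifted apart.
        return f"Error: Unsupported document format: {suffix}"
    return result
```

Without the guard, the `.odt` case would pass `None` on to string operations such as `startswith` and fail with an exception instead of a readable error.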

Ship it.

@Re-bin merged commit 53ba410 into HKUDS:main on Apr 21, 2026
8 checks passed
@aiguozhi123456 deleted the feat/read-office-documents branch on April 25, 2026 15:49