aiguozhi123456 · 2026-04-20T15:58:35Z

Summary

Extend read_file tool to support reading office documents (DOCX, XLSX, PPTX) by connecting to the existing extract_text() utility in nanobot/utils/document.py.

Context

Wire up the existing office document extractors in document.py to
ReadFileTool by adding an extension guard and _read_office_doc() method
that follows the established PDF pattern. Handles missing libraries,
corrupt files, empty documents, and 128K truncation consistently.

Intentionally minimal: this PR does not refactor the document.py error protocol or unify truncation footers, so the existing PDF path keeps its current behavior.

Why

The ReadFileTool supported UTF-8 text, images, and PDFs, but rejected .docx, .xlsx, and .pptx files with "Error: Cannot read binary file". Meanwhile, nanobot/utils/document.py already contained complete extraction logic (_extract_docx(), _extract_xlsx(), _extract_pptx()) with proper library guards, error handling, and truncation — it was simply not wired up to the tool layer.

Solve

Connected the two layers by adding an extension guard in execute() (matching the existing PDF pattern) and a new _read_office_doc() method that delegates to document.extract_text().
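The ordering of the guard can be sketched as a small dispatch function. This is only an illustration of the branch order described above, not the code in filesystem.py: the function name `route`, the constant `_OFFICE_EXTENSIONS`, and the case-insensitive suffix match are all assumptions.

```python
from pathlib import Path

# Hypothetical constant; the PR names these three extensions.
_OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def route(path: str) -> str:
    """Return which reader branch would handle the path (illustrative only)."""
    suffix = Path(path).suffix.lower()  # lowercasing is an assumption
    if suffix == ".pdf":
        return "pdf"      # existing PDF branch runs first
    if suffix in _OFFICE_EXTENSIONS:
        return "office"   # new guard: delegate to the office reader
    return "raw"          # fall through to the raw bytes read


if __name__ == "__main__":
    for name in ("report.pdf", "notes.docx", "data.xlsx", "main.py"):
        print(name, "->", route(name))
```

Placing the office check before the fallthrough matters because a .docx is a zip archive, which the raw-bytes path would otherwise report as a binary file.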

Changes

| File | Change |
| --- | --- |
| `nanobot/agent/tools/filesystem.py:173-175` | Add extension guard for `.docx/.xlsx/.pptx` after PDF check, before raw bytes read |
| `nanobot/agent/tools/filesystem.py:312-329` | Add `_read_office_doc()`: delegates to `extract_text()`, handles errors, empty docs, and 128K truncation |
| `nanobot/agent/tools/filesystem.py:139-146` | Update tool description to advertise document format support |
| `tests/tools/test_read_enhancements.py` | Add 12 new tests: happy paths (DOCX/XLSX/PPTX), errors (missing lib, corrupt, unsupported), truncation, empty doc, description |

Tests

Completed (29 passed, 2 skipped — pre-existing PDF skips):

Not yet done:

- Boundary test at exactly _MAX_CHARS (off-by-one guard)
- Error path tests for .xlsx and .pptx (currently only .docx tested)
- Integration test with real office document libraries (current tests mock extract_text)

Enhancement Directions

- Pagination for office documents: PDF has a pages parameter; office docs lack an equivalent. A large XLSX with many sheets truncates at 128K with no way to continue. Consider adding sheet/slide-level pagination in a follow-up.
- Structured error protocol: error detection relies on startswith("[error:") string sniffing. A shared constant or discriminated return type from extract_text() would eliminate the fragile coupling between filesystem.py and document.py.
- Dead code cleanup: the result is None guard in _read_office_doc() is unreachable because the extension gate already filters to .docx/.xlsx/.pptx. Harmless but misleading to future maintainers.
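One possible shape for the discriminated return type mentioned under "Structured error protocol" is sketched below. Nothing here exists in the repository: the class names `Extracted`, `ExtractError`, and `Unsupported` and the adapter `from_legacy` are hypothetical, and the trailing `]` on the legacy error string is an assumption about its format.

```python
from dataclasses import dataclass
from typing import Optional, Union

_ERROR_PREFIX = "[error:"  # the prefix the PR says is currently sniffed


@dataclass(frozen=True)
class Extracted:
    text: str


@dataclass(frozen=True)
class ExtractError:
    reason: str


@dataclass(frozen=True)
class Unsupported:
    suffix: str


ExtractResult = Union[Extracted, ExtractError, Unsupported]


def from_legacy(raw: Optional[str], suffix: str) -> ExtractResult:
    """Adapt the current str-or-None return into a tagged result."""
    if raw is None:
        return Unsupported(suffix)
    if raw.startswith(_ERROR_PREFIX):
        # Assumes the legacy string looks like "[error: <reason>]".
        reason = raw[len(_ERROR_PREFIX):].rstrip("]").strip()
        return ExtractError(reason)
    return Extracted(raw)
```

With a type like this, callers would branch with `isinstance` instead of matching on a string prefix, so a document whose text happens to begin with "[error:" could no longer be mistaken for a failure.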

Re-bin

LGTM — clean delegation, mirrors the existing PDF branch exactly as advertised.

What I verified locally:

- Merged origin/main: clean (main just picked up #3353's xlsx close fix).
- pytest tests/tools/test_read_enhancements.py tests/test_document_parsing.py → 51 passed.
- pytest tests/tools/ full sweep → 295 passed, no regressions.
- 8/8 CI matrix green.

Why the shape is right:

- Extension gate at filesystem.py:173-175 sits right after the PDF branch and before fp.read_bytes(). That is the correct insertion point; otherwise the zip bytes of a .docx would be handled as generic binary.
- _read_office_doc() handles all four possible returns from document.extract_text() one-for-one: None (unsupported ext), "[error:..." (missing lib / corrupt), "" (empty doc), non-empty success → optional 128K truncate. I walked through document.py::extract_text to confirm the contract matches.
- Test strategy is correctly layered: mocking extract_text to test the wiring is right, because the extractors themselves are already covered with real libraries in tests/test_document_parsing.py. No need to duplicate.
- The "[error:" prefix sniff matches the pre-existing convention in document.py::extract_documents (line 287), so the PR conforms to the existing protocol rather than inventing a new one.

On the self-flagged "not yet done" items:

- Boundary test at exactly _MAX_CHARS: len > _MAX_CHARS has no off-by-one room; trusting Python's > is fine.
- .xlsx/.pptx error path tests: _read_office_doc uses the same code for all three extensions (only fp.suffix differs in the formatted string). The .docx error test covers the logic. Adding parallel tests for .xlsx/.pptx would be coverage padding.
- Real-library integration: already covered one layer down. Not this PR's job.

On the "result is None is unreachable" note in the PR body: I'd actually keep it. The outer gate and extract_text's supported-extension set are maintained in two files; the None guard is cheap defensive alignment so that if someone adds e.g. .odt to the outer gate without touching extract_text, we return a clean error instead of crashing. test_unsupported_extension pins that invariant.

Ship it.

Re-bin approved these changes Apr 21, 2026

Re-bin merged commit 53ba410 into HKUDS:main Apr 21, 2026
8 checks passed

github-actions Bot mentioned this pull request Apr 22, 2026

🦞 OpenClaw Ecosystem Daily 2026-04-22 gsscsd/big_model_radar#226

Open

aiguozhi123456 deleted the feat/read-office-documents branch April 25, 2026 15:49

This was referenced Apr 26, 2026

🦞 OpenClaw Ecosystem Daily 2026-04-26 gsscsd/big_model_radar#246

Open

🦞 OpenClaw Ecosystem Daily 2026-04-26 borq168/radar-forge#29

Open

🦞 OpenClaw Ecosystem Digest 2026-04-26 borq168/radar-forge#32

Open

feat(read_file): add DOCX, XLSX, PPTX office document support #3336

Re-bin merged 1 commit into HKUDS:main from aiguozhi123456:feat/read-office-documents
