feat(read_file): add DOCX, XLSX, PPTX office document support #3336
Merged
Re-bin (Collaborator) approved these changes on Apr 21, 2026 and left a comment:
LGTM — clean delegation, mirrors the existing PDF branch exactly as advertised.
What I verified locally:
- Merged `origin/main`: clean (main just picked up #3353's xlsx close fix).
- `pytest tests/tools/test_read_enhancements.py tests/test_document_parsing.py` → 51 passed.
- `pytest tests/tools/` full sweep → 295 passed, no regressions.
- 8/8 CI matrix green.
Why the shape is right:
- Extension gate at `filesystem.py:173-175` sits right after the PDF branch and before `fp.read_bytes()`: the correct insertion point, otherwise the zip bytes of a .docx would be handled as generic binary.
- `_read_office_doc()` handles all four possible returns from `document.extract_text()` one-for-one: `None` (unsupported ext), `"[error:..."` (missing lib / corrupt), `""` (empty doc), non-empty success → optional 128K truncate. I walked through `document.py::extract_text` to confirm the contract matches.
- Test strategy is correctly layered: mocking `extract_text` to test the wiring is right, since the extractors themselves are already covered with real libraries in `tests/test_document_parsing.py`. No need to duplicate.
- The `"[error:"` prefix sniff matches the pre-existing convention in `document.py::extract_documents` (line 287), so the PR conforms to the existing protocol rather than inventing a new one.
On the self-flagged "not yet done" items:
- Boundary test at exactly `_MAX_CHARS`: `len > _MAX_CHARS` has no off-by-one room; trusting Python's `>` is fine.
- .xlsx/.pptx error path tests: `_read_office_doc` uses the same code for all three extensions (only `fp.suffix` differs in the formatted string). The .docx error test covers the logic. Adding parallel tests for .xlsx/.pptx would be coverage padding.
- Real-library integration: already covered one layer down. Not this PR's job.
On the "result is None is unreachable" note in the PR body: I'd actually keep it. The outer gate and extract_text's supported-extension set are maintained in two files; the None guard is cheap defensive alignment so that if someone adds e.g. .odt to the outer gate without touching extract_text, we return a clean error instead of crashing. test_unsupported_extension pins that invariant.
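The drift scenario can be reproduced in a few lines. The sketch below is hypothetical throughout: `document` is a stand-in namespace rather than the real `nanobot/utils/document.py`, `OFFICE_EXTENSIONS` and `read_file` are invented names, and the gate has deliberately been given `.odt` to simulate the two files falling out of sync. It also shows the mock-the-extractor style of wiring test mentioned above, using `unittest.mock.patch.object`.

```python
from pathlib import Path
from types import SimpleNamespace
from unittest.mock import patch

# Hypothetical stand-in for the document module; it only knows .docx here.
document = SimpleNamespace(
    extract_text=lambda path: "real text" if Path(path).suffix == ".docx" else None
)

# Hypothetical outer gate that has drifted: .odt was added here only.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".odt"}


def read_file(path: str) -> str:
    fp = Path(path)
    if fp.suffix.lower() not in OFFICE_EXTENSIONS:
        return "Error: Cannot read binary file"
    result = document.extract_text(fp)
    if result is None:  # the defensive guard under discussion
        return f"Error: Unsupported document format: {fp.suffix}"
    return result


def test_unsupported_extension() -> None:
    # The gate lets .odt through, but the extractor does not know it.
    assert read_file("notes.odt").startswith("Error: Unsupported")


def test_wiring_with_mock() -> None:
    # Replace the extractor to check only that the tool delegates to it.
    with patch.object(document, "extract_text", return_value="mocked") as m:
        assert read_file("report.xlsx") == "mocked"
    m.assert_called_once_with(Path("report.xlsx"))
```

Without the `None` check, the `.odt` case would hand `None` back to the caller or fail on the next string operation, depending on what follows.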
Ship it.
This was referenced Apr 26, 2026
Summary
Extend `read_file` tool to support reading office documents (DOCX, XLSX, PPTX) by connecting to the existing `extract_text()` utility in `nanobot/utils/document.py`.
Context
Wire up the existing office document extractors in document.py to
ReadFileTool by adding an extension guard and _read_office_doc() method
that follows the established PDF pattern. Handles missing libraries,
corrupt files, empty documents, and 128K truncation consistently.
Intentionally minimal: does not refactor document.py error protocol or
unify truncation footers to avoid behavioral changes in existing PDF path.
Why
The `ReadFileTool` supported UTF-8 text, images, and PDFs, but rejected `.docx`, `.xlsx`, and `.pptx` files with `"Error: Cannot read binary file"`. Meanwhile, `nanobot/utils/document.py` already contained complete extraction logic (`_extract_docx()`, `_extract_xlsx()`, `_extract_pptx()`) with proper library guards, error handling, and truncation; it was simply not wired up to the tool layer.
Solve
Connected the two layers by adding an extension guard in `execute()` (matching the existing PDF pattern) and a new `_read_office_doc()` method that delegates to `document.extract_text()`.
Changes
- `nanobot/agent/tools/filesystem.py:173-175`: `.docx`/`.xlsx`/`.pptx` after PDF check, before raw bytes read
- `nanobot/agent/tools/filesystem.py:312-329`: `_read_office_doc()`, which delegates to `extract_text()` and handles errors, empty docs, and 128K truncation
- `nanobot/agent/tools/filesystem.py:139-146`
- `tests/tools/test_read_enhancements.py`
Tests
Completed (29 passed, 2 skipped — pre-existing PDF skips):
Not yet done:
- Boundary test at exactly `_MAX_CHARS` (off-by-one guard)
- Error path tests for `.xlsx` and `.pptx` (currently only `.docx` tested)
- Real-library integration (… `extract_text`)
Enhancement Directions
- Pagination for office documents: PDF has a `pages` parameter; office docs lack an equivalent. A large XLSX with many sheets truncates at 128K with no way to continue. Consider adding sheet/slide-level pagination in a follow-up.
- Structured error protocol: error detection relies on `startswith("[error:")` string sniffing. A shared constant or discriminated return type from `extract_text()` would eliminate the fragile coupling between `filesystem.py` and `document.py`.
- Dead code cleanup: the `result is None` guard in `_read_office_doc()` is unreachable because the extension gate already filters to `.docx`/`.xlsx`/`.pptx`. Harmless but misleading to future maintainers.
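As one possible shape for the structured error protocol direction, the sketch below replaces the string prefix with a small discriminated union. Nothing here exists in the codebase: `Extracted`, `ExtractError`, `Unsupported`, `ExtractResult`, and `render` are hypothetical names, and the message strings and the `max_chars` default are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Extracted:
    text: str  # may be empty for a document with no text


@dataclass(frozen=True)
class ExtractError:
    reason: str  # e.g. missing library or corrupt file


@dataclass(frozen=True)
class Unsupported:
    suffix: str


ExtractResult = Union[Extracted, ExtractError, Unsupported]


def render(result: ExtractResult, max_chars: int = 128_000) -> str:
    """Turn a typed extraction result into the tool's string output (sketch)."""
    if isinstance(result, Unsupported):
        return f"Error: Unsupported document format: {result.suffix}"
    if isinstance(result, ExtractError):
        return f"Error: {result.reason}"
    if not result.text:
        return "(document contains no extractable text)"
    return result.text[:max_chars]
```

With this shape, a document whose own text happens to begin with `[error:` is returned as content rather than being mistaken for a failure, which the prefix check cannot guarantee.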