[feat] Support Excel files in Knowledge #5862
Follow-up: addressed whitespace preservation for Excel ingestion.
Note: we still trim trailing empty cells to avoid producing long |
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
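The trimming mentioned in the note above might look roughly like the sketch below. The function name and the exact definition of "empty" (None or an empty string, with whitespace-only cells kept) are assumptions made for illustration, not the PR's actual helper.

```python
from typing import Any, List, Sequence


def trim_trailing_empty_cells(row: Sequence[Any]) -> List[Any]:
    """Drop empty cells from the right end of a row, keeping interior gaps.

    Hypothetical helper: a cell counts as empty only if it is None or "".
    Whitespace-only cells are kept, in line with whitespace preservation.
    """
    end = len(row)
    # Walk backwards while the last remaining cell is empty.
    while end > 0 and (row[end - 1] is None or row[end - 1] == ""):
        end -= 1
    return list(row[:end])
```

Only the right-hand tail is trimmed, so an empty cell between two populated columns still occupies its position and column alignment is unchanged.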
3b93681 to e91a129
Add .xlsx and .xls support to FieldLabeledCSVReader, matching the capability added to CSVReader. Each Excel row becomes a Document with field-labeled content (e.g., "Name: Alice\nAge: 30").
Changes:
- Add _read_xlsx() and _read_xls() methods to FieldLabeledCSVReader
- Reuse helper functions from csv_reader.py for consistency
- Add async support using asyncio.to_thread()
- Add xlwt to dev dependencies for .xls test file generation
- Add 15 new tests for Excel support in FieldLabeledCSVReader
Co-Authored-By: Claude <[email protected]>
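As a rough illustration of the field-labeled format and the asyncio.to_thread() approach described in this commit, here is a minimal sketch. The function names are hypothetical, and treating the first row as headers and rendering None as an empty value are assumptions, not confirmed behavior of FieldLabeledCSVReader.

```python
import asyncio
from typing import Any, List, Sequence


def row_to_field_labeled_text(headers: Sequence[str], row: Sequence[Any]) -> str:
    """Render one spreadsheet row as "Header: value" lines."""
    lines: List[str] = []
    for header, value in zip(headers, row):
        text = "" if value is None else str(value)  # assumption: None -> empty
        lines.append(f"{header}: {text}")
    return "\n".join(lines)


def rows_to_texts(rows: Sequence[Sequence[Any]]) -> List[str]:
    """Treat the first row as headers and label every following row."""
    if not rows:
        return []
    headers = [str(h) for h in rows[0]]
    return [row_to_field_labeled_text(headers, row) for row in rows[1:]]


async def rows_to_texts_async(rows: Sequence[Sequence[Any]]) -> List[str]:
    """Run the blocking conversion in a worker thread (Python 3.9+)."""
    return await asyncio.to_thread(rows_to_texts, rows)
```

Calling `row_to_field_labeled_text(["Name", "Age"], ["Alice", 30])` yields the "Name: Alice\nAge: 30" shape quoted in the commit message.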
- Remove _EXCEL_SUFFIXES constant, inline the check
- Use ContentType enum directly (no .value needed for str enum)
Co-Authored-By: Claude <[email protected]>
@claude review this
Claude finished @Mustafa-Esoofally's task in 2m 3s
Code Review for PR #5862
Reviewing the Excel file support implementation for Knowledge.
Review Checklist
Summary
This PR successfully adds first-class Excel support (.xlsx/.xls) to the Knowledge system by routing Excel files through the existing CSV reader. The implementation is clean, well-tested, and follows the existing patterns in the codebase. The author's decision to reuse
Strengths
Issues & Suggestions
1. Inconsistent boolean formatting (csv_reader.py:58-60)
Issue: Booleans are converted to their Python string representation.
    # Current behavior in tests (line 266-267):
    assert lines[3] == "boolean_true, True"  # Python repr
    assert lines[4] == "boolean_false, False"  # Python repr
Suggestion: Consider converting to lowercase or numeric (0/1) for consistency with CSV conventions:
    if isinstance(value, bool):
        return "1" if value else "0"  # or "true"/"false"
Why: Most spreadsheet applications and CSV processors use lowercase or numeric representations for booleans. The current approach might cause parsing issues downstream.
2. Missing type hints for private functions (csv_reader.py:22-98)
Issue: The helper functions lack complete type hints:
    def _get_workbook_name(file: Union[Path, IO[Any]], name: Optional[str]) -> str:
    def _excel_rows_to_documents(...) -> List[Document]:  # Missing parameter types
Suggestion: Add complete type hints to all helper functions for better IDE support and type checking.
3. Potential memory issue with large Excel files
Issue: Both
Code location: csv_reader.py:289-341
Suggestion: For large Excel files, consider implementing streaming:
Note: This is a minor issue since
4. Silent failure on corrupted files (csv_reader.py:184-187)
Issue: When reading corrupted Excel files, errors are caught and logged, but an empty list is returned without distinguishing between "empty file" and "corrupted file".
    except Exception as e:
        file_desc = getattr(file, "name", str(file)) if isinstance(file, IO) else file
        log_error(f"Error reading {file_desc}: {e}")
        return []
Suggestion: Consider differentiating error types:
This would help users debug issues rather than silently failing.
5. Hardcoded delimiter in output (csv_reader.py:69)
Issue: The
    return ", ".join(values)
Suggestion: Consider making this configurable or consistent with the CSV reader's delimiter parameter. Currently, actual CSV files respect the delimiter parameter, but Excel files always use comma-space.
Impact: This inconsistency means Excel-derived content looks different from CSV-derived content when using custom delimiters.
6. Missing docstrings for public API changes
Issue: The
Code location:
Suggestion: Update docstrings to mention Excel support:
    """
    Read a CSV file or Excel workbook (.xlsx/.xls).

    For Excel files: each sheet becomes a separate Document with sheet metadata.
    For CSV files: the entire file is processed as a single Document (unless chunked).

    Args:
        file: Path to CSV or Excel file, or file-like object
        delimiter: CSV delimiter (Excel files ignore this)
        quotechar: CSV quote character (Excel files ignore this)
        name: Optional name override for the document

    Returns:
        List of Document objects
    """
7. Dependency fix is correct but unrelated
Issue: The change from
Suggestion: While the fix is valid and needed, consider mentioning in the PR description that this was discovered during CI testing and is a separate fix. This helps with git archaeology later.
Nitpicks
Architecture Assessment
Design decision to route Excel through CSVReader is sound:
✅ Pros:
Code Quality
Performance Notes
Breaking Changes
None - this is purely additive functionality.
Recommendation
Approve with minor suggestions. The implementation is solid and well-tested. The issues identified are minor and can be addressed in follow-up PRs if desired. The core functionality works as designed and integrates cleanly with the existing Knowledge system.
The whitespace preservation fix (follow-up comment) shows good attention to detail and consistency with CSV behavior.
Optional Improvements for Future PRs
- Add excel_reader.py cookbook demonstrating Excel knowledge integration
- Add sample_products.xlsx with multi-sheet product catalog data
- Remove banner comments and BUG-007 references from test file
@claude review
Claude finished @Mustafa-Esoofally's task in 2m 47s
Code Review for PR #5862
Reviewing Excel file support implementation for Knowledge.
Review Checklist
Summary
This PR successfully adds Excel (.xlsx/.xls) file support to the Knowledge system by routing Excel files through the existing CSV reader infrastructure. The implementation is well-designed, thoroughly tested, and follows repository conventions. The author made a pragmatic architectural decision to reuse
Strengths
1. Excellent Test Coverage (511 lines)
The test suite in
2. Smart Architectural Decision
Routing Excel through
3. Clean Implementation
Helper functions are well-designed:
4. Proper Metadata Handling
Each sheet becomes a
5. Repository Conventions Followed
Issues & Recommendations
1. Minor: Inconsistent Boolean Formatting (csv_reader.py:48-62)
Location: csv_reader.py:58-60 in
Issue: Booleans are converted to their Python string representation.
    # Current behavior (test line 266-267):
    assert lines[3] == "boolean_true, True"  # Python repr
    assert lines[4] == "boolean_false, False"  # Python repr
Suggestion: Consider using lowercase or numeric representation for consistency with spreadsheet applications:
    if isinstance(value, bool):
        return "1" if value else "0"  # or "true"/"false"
Impact: Low - current behavior works, but might cause parsing issues with downstream systems expecting standard CSV boolean formats.
2. Minor: Hardcoded Delimiter (csv_reader.py:69)
Issue:
    return ", ".join(values)
This creates inconsistency: actual CSV files respect the delimiter parameter, but Excel-derived content always uses comma-space.
Suggestion: Consider making this configurable or documenting this difference.
Impact: Low - affects consistency when mixing Excel and CSV with custom delimiters.
3. Documentation: Missing Docstring Updates
Locations:
Issue: The public API now supports Excel files, but docstrings don't mention this.
Suggestion: Update docstrings to reflect Excel support:
    """
    Read a CSV file or Excel workbook (.xlsx/.xls).

    For Excel files: each sheet becomes a separate Document with sheet metadata.
    For CSV files: the entire file is processed as a single Document (unless chunked).

    Args:
        file: Path to CSV or Excel file, or file-like object
        delimiter: CSV delimiter (ignored for Excel files)
        quotechar: CSV quote character (ignored for Excel files)
        name: Optional name override for the document

    Returns:
        List of Document objects
    """
Impact: Medium - users reading the code may not realize Excel is supported.
4. Code Quality: Silent Failure on Corrupted Files (csv_reader.py:184-187)
Issue: When reading corrupted Excel files, errors are caught and logged, but an empty list is returned without distinguishing between "empty file" and "corrupted file":
    except Exception as e:
        file_desc = getattr(file, "name", str(file)) if isinstance(file, IO) else file
        log_error(f"Error reading {file_desc}: {e}")
        return []
Suggestion: Consider differentiating error types:
Impact: Medium - users may not realize their files are corrupted vs. empty.
5. Nitpick: Missing Type Hints (csv_reader.py:22-98)
Issue: Helper functions lack complete type hints:
    def _excel_rows_to_documents(
        *,
        workbook_name: str,
        sheets: Iterable[Tuple[str, Iterable[Sequence[Any]]]],
    ) -> List[Document]:  # Good!
    def _get_workbook_name(file: Union[Path, IO[Any]], name: Optional[str]) -> str:  # Good!
Most are complete, but consistency would help with type checking.
Impact: Very Low - minor code quality improvement.
6. Note: Unrelated Dependency Fix (pyproject.toml:123)
Observation: The change from
Suggestion: This is fine to include, but worth noting it's a separate fix discovered during testing.
Architecture Assessment
The design decision to route Excel through
✅ Pros:
Potential future enhancements:
Test Quality
The test suite demonstrates exceptional quality:
Minor suggestion: Some tests could use parametrization to reduce duplication:
    @pytest.mark.parametrize("extension,writer", [(".xlsx", write_xlsx), (".xls", write_xls)])
    def test_reader_handles_format(extension, writer, tmp_path):
        # Test both formats with same logic
        ...
Code Quality Checklist
Dependency Changes
The changes to
    csv = ["aiofiles", "openpyxl", "xlrd"]
Performance Notes
Breaking Changes
None - this is purely additive functionality.
Missing: Cookbook Example
Observation: The PR description mentions
Recommendation: According to CLAUDE.md, "every pattern should have an example". Consider adding a cookbook example showing:
This would help users discover and understand the feature.
Recommendation
✅ APPROVE with minor suggestions
The implementation is solid, well-tested, and production-ready. The issues identified are minor and mostly documentation/polish items that can be addressed in follow-up PRs if desired. The core functionality is excellent.
Priority for follow-up:
The whitespace preservation fix (follow-up comment by @liqiongyu) shows excellent attention to detail and consistency with CSV behavior.
Great work! This is a well-executed feature addition that maintains code quality and follows repository standards.
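One way the "empty file" versus "corrupted file" distinction raised in both reviews could be made visible to callers is sketched below. `ReadResult` and `read_with_status` are hypothetical names introduced only for this illustration; nothing here is the PR's code, and the choice of exception types is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ReadResult:
    """Hypothetical result type: documents plus an optional error message."""

    documents: List[str] = field(default_factory=list)
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def read_with_status(loader: Callable[[], List[str]], source: str) -> ReadResult:
    """Call a loader and report failures instead of returning a bare []."""
    try:
        return ReadResult(documents=loader())
    except FileNotFoundError as exc:
        # Checked first because FileNotFoundError is a subclass of OSError.
        return ReadResult(error=f"{source}: file not found ({exc})")
    except (ValueError, OSError) as exc:
        return ReadResult(error=f"{source}: could not parse workbook ({exc})")
```

With this shape, an empty workbook produces `ok == True` with no documents, while a parse failure produces `ok == False` with a message, so the two cases no longer look identical to the caller.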
- Add _convert_xls_cell_value() to properly convert xlrd booleans (1/0 → True/False) and dates (serial numbers → datetime)
- Normalize line endings (\n, \r, \r\n) to spaces in _stringify_spreadsheet_cell_value() to preserve row integrity when cells contain multiline content
- Apply stringify to CSV path so embedded newlines in CSV cells are also normalized
- Use list+join pattern for CSV content (no trailing newline, consistent with Excel)
- Add comprehensive tests for all edge cases across CSV, XLSX, and XLS formats
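The two conversions described in this commit can be sketched with the standard library alone. The function names below are illustrative stand-ins, not the PR's helpers; the cell-type codes are defined locally and are assumed to mirror xlrd's XL_CELL_DATE and XL_CELL_BOOLEAN values, and the date arithmetic assumes the 1900 date system.

```python
import re
from datetime import datetime, timedelta
from typing import Any

# Assumed to match xlrd's cell-type codes; defined locally for illustration.
XL_CELL_DATE = 3
XL_CELL_BOOLEAN = 4

# "\r\n" is listed first so a Windows line ending becomes one space, not two.
_LINE_BREAKS = re.compile(r"\r\n|\r|\n")


def stringify_cell(value: Any) -> str:
    """Convert a cell to text, replacing embedded line breaks with spaces."""
    if value is None:
        return ""
    return _LINE_BREAKS.sub(" ", str(value))


def convert_xls_cell(value: Any, cell_type: int) -> Any:
    """Turn raw .xls values into Python booleans and datetimes."""
    if cell_type == XL_CELL_BOOLEAN:
        return bool(value)  # stored as 1/0 in the raw cell
    if cell_type == XL_CELL_DATE:
        # 1900 date system; serials below 61 are off by one because of the
        # historical 1900 leap-year quirk and are not handled here.
        return datetime(1899, 12, 30) + timedelta(days=float(value))
    return value
```

Normalizing inside the stringify step means every row still maps to exactly one output line, which is what row-oriented chunking relies on.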
6a60329 to 4b8dc79
Summary
Adds first-class Excel (.xlsx/.xls) file ingestion for Knowledge by routing spreadsheets to the CSV reader and parsing workbooks per-sheet.
Rationale
Excel is treated as a spreadsheet/tabular source similar to CSV for Knowledge ingestion.
CSVReader already provides row-oriented text extraction and integrates with existing chunking strategies (e.g. RowChunking), so routing .xlsx/.xls through it keeps behavior consistent and avoids introducing a new reader key / API surface. If we later need richer spreadsheet semantics (formulas, formatting, table region detection), we can extract a dedicated ExcelReader.
Changes
Core Excel Support
- .xlsx via openpyxl and .xls via xlrd in CSVReader
- Sheet metadata (sheet_name, sheet_index)
- .xlsx/.xls (+ common MIME types) to the csv reader in ReaderFactory
- agno[csv] extra
Bug Fixes for Edge Cases
- 1/0 integers; added _convert_xls_cell_value() to convert to proper True/False
- Line endings (\n, \r, \r\n) would break row parsing; now normalized to spaces
- "\n".join(lines) pattern (no trailing newline)
Other
- agno[memori] extra to depend on memori==3.0.5 (instead of memorisdk==3.0.5)
Test Plan
- ./scripts/format.sh - passes
- ./scripts/validate.sh - passes (no new mypy errors)
- pytest libs/agno/tests/unit/knowledge/test_excel_reader.py -v - 30 tests pass
- pytest libs/agno/tests/unit/reader/test_csv_reader.py -v - 14 tests pass
- pytest libs/agno/tests/unit/reader/test_csv_field_label_reader.py -v - 52 tests pass
New Tests Added
- test_csv_reader_xls_boolean_cells: True/False not 1/0
- test_csv_reader_xls_multiline_content_preserved_as_space
- test_csv_reader_xlsx_multiline_content_preserved_as_space
- test_csv_reader_xlsx_carriage_return_normalized
- test_csv_reader_xls_carriage_return_normalized
- test_csv_reader_csv_multiline_cells_normalized
- test_csv_reader_csv_carriage_return_normalized
- test_read_xls_datetime_handling
- test_read_xls_boolean_handling
- test_read_csv_carriage_return_normalized
- test_read_xlsx_carriage_return_normalized
Fixes #4872
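The per-sheet parsing summarized in this description can be pictured with the sketch below. `SheetDocument` is a stand-in for the project's Document class, the function loosely follows the `_excel_rows_to_documents(workbook_name=..., sheets=...)` signature quoted in the review, and the 1-based `sheet_index` and the skipping of sheets without rows are assumptions rather than confirmed behavior.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Sequence, Tuple


@dataclass
class SheetDocument:
    """Stand-in for the real Document type (hypothetical)."""

    name: str
    content: str
    meta_data: Dict[str, Any] = field(default_factory=dict)


def sheets_to_documents(
    workbook_name: str,
    sheets: Iterable[Tuple[str, Iterable[Sequence[Any]]]],
) -> List[SheetDocument]:
    """Build one document per sheet, one comma-space joined line per row."""
    documents: List[SheetDocument] = []
    for index, (sheet_name, rows) in enumerate(sheets, start=1):
        lines = [", ".join("" if c is None else str(c) for c in row) for row in rows]
        if not lines:
            continue  # assumption: sheets with no rows produce no document
        documents.append(
            SheetDocument(
                name=workbook_name,
                content="\n".join(lines),  # no trailing newline
                meta_data={"sheet_name": sheet_name, "sheet_index": index},
            )
        )
    return documents
```

Keeping the sheet name and index in metadata lets downstream chunking or retrieval trace a row back to its sheet without a separate reader type.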