@liqiongyu
Contributor

@liqiongyu liqiongyu commented Dec 26, 2025

Summary

Adds first-class Excel (.xlsx/.xls) file ingestion for Knowledge by routing spreadsheets to the CSV reader and parsing workbooks per-sheet.

Rationale

Excel is treated as a spreadsheet/tabular source similar to CSV for Knowledge ingestion. CSVReader already provides row-oriented text extraction and integrates with existing chunking strategies (e.g. RowChunking), so routing .xlsx/.xls through it keeps behavior consistent and avoids introducing a new reader key / API surface. If we later need richer spreadsheet semantics (formulas, formatting, table region detection), we can extract a dedicated ExcelReader.

Changes

Core Excel Support

  • Parse .xlsx via openpyxl and .xls via xlrd in CSVReader
  • Each sheet becomes a separate Document with metadata (sheet_name, sheet_index)
  • Rows become content lines in CSV-like format
  • Route .xlsx/.xls (+ common MIME types) to the csv reader in ReaderFactory
  • Add openpyxl/xlrd to the existing agno[csv] extra
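
The per-sheet mapping listed above can be sketched roughly as follows. This is a minimal illustration working on in-memory rows, not the PR's actual code: the `Document` dataclass, the `sheets_to_documents` name, and the 1-based `sheet_index` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Sequence, Tuple


@dataclass
class Document:
    """Hypothetical stand-in for the library's Document type."""

    name: str
    content: str
    meta_data: Dict[str, Any] = field(default_factory=dict)


def sheets_to_documents(
    workbook_name: str,
    sheets: Iterable[Tuple[str, Iterable[Sequence[Any]]]],
) -> List[Document]:
    """Turn (sheet_name, rows) pairs into one Document per non-empty sheet."""
    documents: List[Document] = []
    # Assumption: sheet_index is 1-based and also counts sheets that get skipped.
    for sheet_index, (sheet_name, rows) in enumerate(sheets, start=1):
        lines: List[str] = []
        for row in rows:
            values = ["" if value is None else str(value) for value in row]
            if any(values):  # skip rows where every cell is empty
                lines.append(", ".join(values))
        if not lines:
            continue  # skip sheets without any content
        documents.append(
            Document(
                name=workbook_name,
                content="\n".join(lines),  # joined, so no trailing newline
                meta_data={"sheet_name": sheet_name, "sheet_index": sheet_index},
            )
        )
    return documents
```

Feeding it the rows of a real workbook (for example from openpyxl's `iter_rows(values_only=True)`) would give one Document per sheet that has data.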

Bug Fixes for Edge Cases

  • Boolean handling: xlrd returns booleans as 1/0 integers; added _convert_xls_cell_value() to convert to proper True/False
  • Multiline content: Cells with embedded newlines (\n, \r, \r\n) would break row parsing; now normalized to spaces
  • CSV consistency: Applied same normalization to CSV path (cells can have embedded newlines too)
  • Trailing newline: Both CSV and Excel paths now use "\n".join(lines) pattern (no trailing newline)
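
A rough sketch of the boolean and newline handling described above, under stated assumptions: the function names are illustrative rather than the PR's exact helpers, and the cell-type constant is defined locally to mirror xlrd's `XL_CELL_BOOLEAN` code so the example runs without xlrd installed.

```python
from typing import Any

# Assumption: mirrors xlrd's cell-type code for boolean cells (XL_CELL_BOOLEAN).
XL_CELL_BOOLEAN = 4


def convert_xls_cell_value(value: Any, cell_type: int) -> Any:
    """Map xlrd's 1/0 boolean cells to real Python booleans."""
    if cell_type == XL_CELL_BOOLEAN:
        return bool(value)
    return value


def stringify_cell(value: Any) -> str:
    """Render a cell as single-line text so one row stays on one line."""
    if value is None:
        return ""
    text = str(value)
    # Replace CRLF first so it collapses to one space rather than two.
    return text.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")


def rows_to_content(rows) -> str:
    """Join rows with newlines; the join leaves no trailing newline."""
    return "\n".join(", ".join(stringify_cell(v) for v in row) for row in rows)
```

With this shape, a cell holding `"line1\nline2"` ends up as `"line1 line2"`, so it cannot be mistaken for two rows.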

Other

  • Fix agno[memori] extra to depend on memori==3.0.5 (instead of memorisdk==3.0.5)

Test Plan

  • ./scripts/format.sh - passes
  • ./scripts/validate.sh - passes (no new mypy errors)
  • pytest libs/agno/tests/unit/knowledge/test_excel_reader.py -v - 30 tests pass
  • pytest libs/agno/tests/unit/reader/test_csv_reader.py -v - 14 tests pass
  • pytest libs/agno/tests/unit/reader/test_csv_field_label_reader.py -v - 52 tests pass
  • End-to-end test with LanceDB knowledge base insertion and Agent queries

New Tests Added

| Test | Verifies |
| --- | --- |
| test_csv_reader_xls_boolean_cells | True/False not 1/0 |
| test_csv_reader_xls_multiline_content_preserved_as_space | LF → space in .xls |
| test_csv_reader_xlsx_multiline_content_preserved_as_space | LF → space in .xlsx |
| test_csv_reader_xlsx_carriage_return_normalized | CR, CRLF → space in .xlsx |
| test_csv_reader_xls_carriage_return_normalized | CR, CRLF → space in .xls |
| test_csv_reader_csv_multiline_cells_normalized | LF → space in .csv |
| test_csv_reader_csv_carriage_return_normalized | CR, CRLF → space in .csv |
| test_read_xls_datetime_handling | ISO 8601 format in FieldLabeled |
| test_read_xls_boolean_handling | Boolean fix in FieldLabeled |
| test_read_csv_carriage_return_normalized | CR/CRLF in FieldLabeled CSV |
| test_read_xlsx_carriage_return_normalized | CR/CRLF in FieldLabeled xlsx |

Fixes #4872

@liqiongyu
Contributor Author

Follow-up: addressed whitespace preservation for Excel ingestion.

  • Removed the final .strip() in _row_values_to_csv_line() so leading/trailing spaces in the first/last cell are preserved when chunk=False (closer to the CSV path).
  • Added a unit test to cover this behavior.

Note: we still trim trailing empty cells to avoid producing long ", , ," tails when Excel sheets have a large max_column with mostly empty cells.
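
As an illustration of the behavior described in this follow-up, here is a minimal sketch. The function name is a placeholder, and treating only exactly-empty strings as "empty cells" is an assumption, not something confirmed by the PR.

```python
from typing import List, Sequence


def row_values_to_line(values: Sequence[str]) -> str:
    """Join cell strings, dropping trailing empty cells but keeping whitespace."""
    trimmed: List[str] = list(values)
    # Drop trailing empty cells so a wide, mostly empty sheet does not yield ", , ," tails.
    while trimmed and trimmed[-1] == "":
        trimmed.pop()
    # No final .strip(): leading/trailing spaces in the first/last cell survive.
    return ", ".join(trimmed)
```

Empty cells in the middle of a row are kept, so column positions before the last non-empty cell stay aligned.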

@Mustafa-Esoofally force-pushed the feat/4872-excel-knowledge branch from 3b93681 to e91a129 on January 20, 2026 18:57
Mustafa-Esoofally and others added 2 commits January 20, 2026 16:06
Add .xlsx and .xls support to FieldLabeledCSVReader, matching the
capability added to CSVReader. Each Excel row becomes a Document
with field-labeled content (e.g., "Name: Alice\nAge: 30").

Changes:
- Add _read_xlsx() and _read_xls() methods to FieldLabeledCSVReader
- Reuse helper functions from csv_reader.py for consistency
- Add async support using asyncio.to_thread()
- Add xlwt to dev dependencies for .xls test file generation
- Add 15 new tests for Excel support in FieldLabeledCSVReader

Co-Authored-By: Claude <[email protected]>
- Remove _EXCEL_SUFFIXES constant, inline the check
- Use ContentType enum directly (no .value needed for str enum)

Co-Authored-By: Claude <[email protected]>
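
The field-labeled format mentioned in the first commit message above ("Name: Alice\nAge: 30") can be sketched as below. The helper name is hypothetical and the real reader may handle missing headers or empty cells differently.

```python
from typing import Any, List, Sequence


def row_to_field_labeled_text(headers: Sequence[str], row: Sequence[Any]) -> str:
    """Pair each header with its cell value, one 'Header: value' line per field."""
    lines: List[str] = []
    for header, value in zip(headers, row):
        lines.append(f"{header}: {value}")
    return "\n".join(lines)


# One Document's content per data row, using the first row as headers.
rows = [("Name", "Age"), ("Alice", 30), ("Bob", 41)]
headers, data_rows = rows[0], rows[1:]
contents = [row_to_field_labeled_text(headers, row) for row in data_rows]
```
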
@Mustafa-Esoofally
Contributor

@claude review this

@claude

claude bot commented Jan 20, 2026

Claude finished @Mustafa-Esoofally's task in 2m 3s


Code Review for PR #5862

Reviewing the Excel file support implementation for Knowledge.

Review Checklist

  • Read repository guidelines (CLAUDE.md)
  • Review core implementation (csv_reader.py, field_labeled_csv_reader.py)
  • Review reader factory changes
  • Review test coverage and quality
  • Review dependency changes
  • Assess design decisions and architecture
  • Provide comprehensive feedback

Summary

This PR successfully adds first-class Excel support (.xlsx/.xls) to the Knowledge system by routing Excel files through the existing CSV reader. The implementation is clean, well-tested, and follows the existing patterns in the codebase. The author's decision to reuse CSVReader is pragmatic and avoids API surface expansion.

Strengths

  1. Excellent test coverage - 511 lines of comprehensive tests covering:

    • Both .xlsx and .xls formats
    • Edge cases (empty sheets, trailing cells, special characters, unicode)
    • Data type handling (floats, booleans, datetime, None)
    • Async operations
    • Error conditions (missing dependencies, corrupted files)
    • Whitespace preservation (follow-up fix)
  2. Smart design decision - Routing Excel through CSVReader keeps behavior consistent and reuses existing chunking strategies (RowChunking, etc.) without introducing a new reader key.

  3. Proper separation of concerns - Helper functions are well-named and focused:

    • _get_workbook_name() - Extract workbook name with clear priority
    • _infer_file_extension() - Consistent extension detection
    • _stringify_spreadsheet_cell_value() - Type-aware cell conversion
    • _row_values_to_csv_line() - Row formatting with trailing cell trimming
  4. Both sync and async support - Follows repository conventions by providing both variants.

  5. Proper metadata handling - Each sheet becomes a Document with sheet_name, sheet_index metadata, enabling per-sheet processing and tracking.

  6. FieldLabeledCSVReader support - Excel support extends to both CSV readers, maintaining consistency across the Knowledge API.

Issues & Suggestions

1. Inconsistent boolean formatting (csv_reader.py:58-60)

Issue: Booleans are converted to their Python string representation ("True", "False"), which is inconsistent with how other data types are handled.

# Current behavior in tests (line 266-267):
assert lines[3] == "boolean_true, True"   # Python repr
assert lines[4] == "boolean_false, False" # Python repr

Suggestion: Consider converting to lowercase or numeric (0/1) for consistency with CSV conventions:

if isinstance(value, bool):
    return "1" if value else "0"  # or "true"/"false"

Why: Most spreadsheet applications and CSV processors use lowercase or numeric representations for booleans. The current approach might cause parsing issues downstream.

2. Missing type hints for private functions (csv_reader.py:22-98)

Issue: The helper functions lack complete type hints:

def _get_workbook_name(file: Union[Path, IO[Any]], name: Optional[str]) -> str:
def _excel_rows_to_documents(...) -> List[Document]:  # Missing parameter types

Suggestion: Add complete type hints to all helper functions for better IDE support and type checking.

3. Potential memory issue with large Excel files

Issue: Both _read_xlsx() and _read_xls() load entire sheets into memory before processing. The xls reader (csv_reader.py:331-341) creates a generator but immediately consumes it in _excel_rows_to_documents().

Code location: csv_reader.py:289-341

Suggestion: For large Excel files, consider implementing streaming:

  • Use openpyxl.load_workbook(filename, read_only=True) with iter_rows() (already done ✓)
  • Process rows in batches when chunking is enabled
  • Add a note in docstrings about memory usage for large files

Note: This is a minor issue since read_only=True is already used for .xlsx, which optimizes memory usage.

4. Silent failure on corrupted files (csv_reader.py:184-187)

Issue: When reading corrupted Excel files, errors are caught and logged, but an empty list is returned without distinguishing between "empty file" and "corrupted file".

except Exception as e:
    file_desc = getattr(file, "name", str(file)) if isinstance(file, IO) else file
    log_error(f"Error reading {file_desc}: {e}")
    return []

Suggestion: Consider differentiating error types:

  • Empty/no-data: Return []
  • File corruption: Raise exception or add metadata to indicate corruption
  • Import errors: Re-raise with helpful message (already done ✓)

This would help users debug issues rather than silently failing.
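
One way the "empty vs. corrupted" distinction could look, as a sketch only: it relies on the fact that .xlsx files are ZIP archives, and the `CorruptedWorkbookError` type, the function names, and the `parse` callback are all hypothetical.

```python
import io
import zipfile
from typing import Callable, List


class CorruptedWorkbookError(ValueError):
    """Hypothetical error type for workbooks that cannot be opened."""


def check_xlsx_container(data: bytes) -> None:
    """Raise if the bytes cannot be an .xlsx file, which is a ZIP archive."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        raise CorruptedWorkbookError("not a valid .xlsx (ZIP) container")


def read_documents(data: bytes, parse: Callable[[bytes], List[str]]) -> List[str]:
    """Return [] only for readable-but-empty workbooks; raise on corruption."""
    check_xlsx_container(data)
    return parse(data)  # parse is a stand-in for the real sheet parser
```

A caller can then tell "nothing to ingest" (an empty list) apart from "file could not be read" (an exception) without inspecting logs.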

5. Hardcoded delimiter in output (csv_reader.py:69)

Issue: The _row_values_to_csv_line() function hardcodes ", " (comma-space) as the delimiter:

return ", ".join(values)

Suggestion: Consider making this configurable or consistent with the CSV reader's delimiter parameter. Currently, actual CSV files respect the delimiter parameter, but Excel files always use comma-space.

Impact: This inconsistency means Excel-derived content looks different from CSV-derived content when using custom delimiters.
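
A minimal sketch of what a configurable separator might look like. The `separator` parameter is an assumption made for illustration; how the CSV path actually renders a custom delimiter is not shown in this thread.

```python
from typing import Sequence


def join_row(values: Sequence[str], separator: str = ", ") -> str:
    """Join cell strings with a caller-supplied separator (default matches ', ')."""
    return separator.join(values)
```

Keeping `", "` as the default would leave existing Excel output unchanged while letting callers opt into something else.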

6. Missing docstrings for public API changes

Issue: The read() and async_read() methods in CSVReader now support Excel files, but their docstrings haven't been updated to reflect this.

Code location:

  • csv_reader.py:123-125
  • csv_reader.py:189-207

Suggestion: Update docstrings to mention Excel support:

"""
Read a CSV file or Excel workbook (.xlsx/.xls).

For Excel files: each sheet becomes a separate Document with sheet metadata.
For CSV files: the entire file is processed as a single Document (unless chunked).

Args:
    file: Path to CSV or Excel file, or file-like object
    delimiter: CSV delimiter (Excel files ignore this)
    quotechar: CSV quote character (Excel files ignore this)
    name: Optional name override for the document
    
Returns:
    List of Document objects
"""

7. Dependency fix is correct but unrelated

Issue: The change from memorisdk==3.0.5 to memori>=3.0.5 (pyproject.toml:123) is correct but unrelated to Excel support.

Suggestion: While the fix is valid and needed, consider mentioning in the PR description that this was discovered during CI testing and is a separate fix. This helps with git archaeology later.

Nitpicks

  1. Test organization: The test file test_excel_reader.py is well-structured, but some tests could use parametrization to reduce duplication:

    @pytest.mark.parametrize("extension,writer", [(".xlsx", write_xlsx), (".xls", write_xls)])
    def test_reader_handles_format(extension, writer, tmp_path):
        # Test both formats with same logic
  2. Magic numbers: In _row_values_to_csv_line(), the trailing cell trimming could be explained with a comment:

    # Trim trailing empty cells to avoid "a, , , , , " when Excel reports large max_column
    while values and values[-1] == "":
        values.pop()
  3. Import organization: The try-except block for aiofiles (csv_reader.py:10-12) raises immediately. Consider moving imports to where they're actually needed, or document why eager failure is desired.

Architecture Assessment

Design decision to route Excel through CSVReader is sound:

Pros:

  • Reuses existing chunking infrastructure (RowChunking, CodeChunking, etc.)
  • No new reader key or API surface
  • Consistent behavior between CSV and Excel
  • Minimal code changes to Knowledge class

⚠️ Future considerations:
The PR description mentions "if we later need richer spreadsheet semantics (formulas, formatting, table region detection), we can extract a dedicated ExcelReader." This is a good exit strategy. Consider documenting where you'd add this if needed:

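  • Formulas: openpyxl returns formula strings with data_only=False (the default) and the last cached values with data_only=True; it does not evaluate formulas itself
  • Formatting: Would need a custom reader, though formatting is likely not relevant for Knowledge ingestion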
  • Table regions: openpyxl has worksheet.tables API

Code Quality

  • ✅ Follows .cursorrules patterns (no emojis in code, proper error handling)
  • ✅ Both sync and async variants provided
  • ✅ Proper exception handling with helpful error messages
  • ✅ Tests follow pytest conventions and are well-named
  • ⚠️ Could use more type hints on private functions
  • ⚠️ Docstrings need updating for API changes

Performance Notes

  • The use of read_only=True in openpyxl is good for memory efficiency
  • Chunking support works correctly and preserves sheet metadata
  • Async operations use asyncio.to_thread() appropriately since openpyxl/xlrd are synchronous
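
The `asyncio.to_thread()` pattern noted above, in a small self-contained sketch. `read_rows_sync` is a hypothetical blocking function standing in for an openpyxl or xlrd call.

```python
import asyncio
from typing import List


def read_rows_sync(path: str) -> List[str]:
    """Hypothetical blocking reader; stands in for a synchronous parser call."""
    return [f"row from {path}"]


async def read_rows_async(path: str) -> List[str]:
    """Run the blocking reader in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(read_rows_sync, path)


result = asyncio.run(read_rows_async("book.xlsx"))
```

`asyncio.to_thread()` requires Python 3.9 or newer.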

Breaking Changes

None - this is purely additive functionality.

Recommendation

Approve with minor suggestions. The implementation is solid and well-tested. The issues identified are minor and can be addressed in follow-up PRs if desired. The core functionality works as designed and integrates cleanly with the existing Knowledge system.

The whitespace preservation fix (follow-up comment) shows good attention to detail and consistency with CSV behavior.

Optional Improvements for Future PRs

  1. Add a cookbook example showing Excel ingestion (following CLAUDE.md guidelines)
  2. Consider adding support for specific sheet selection (e.g., read(file, sheets=["Sheet1", "Data"]))
  3. Add benchmarks for large Excel files vs CSV files to document performance characteristics
  4. Consider adding a warning when an Excel file has many empty sheets
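
The sheet-selection idea in item 2 could be sketched as a filter over (sheet_name, rows) pairs. The `select_sheets` helper and its `wanted` parameter are hypothetical; no such option exists in the PR as described.

```python
from typing import Any, Iterable, List, Optional, Sequence, Tuple

Sheet = Tuple[str, Iterable[Sequence[Any]]]


def select_sheets(sheets: Iterable[Sheet], wanted: Optional[Sequence[str]] = None) -> List[Sheet]:
    """Keep only the named sheets, in workbook order; None means keep all."""
    if wanted is None:
        return list(sheets)
    wanted_names = set(wanted)
    return [(name, rows) for name, rows in sheets if name in wanted_names]
```
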

- Add excel_reader.py cookbook demonstrating Excel knowledge integration
- Add sample_products.xlsx with multi-sheet product catalog data
- Remove banner comments and BUG-007 references from test file
@Mustafa-Esoofally
Contributor

@claude review

@claude

claude bot commented Jan 21, 2026

Claude finished @Mustafa-Esoofally's task in 2m 47s


Code Review for PR #5862

Reviewing Excel file support implementation for Knowledge.

Review Checklist

  • Read repository guidelines (CLAUDE.md)
  • Review core implementation changes
  • Review test coverage and quality
  • Check dependency changes
  • Assess design decisions
  • Provide comprehensive feedback

Summary

This PR successfully adds Excel (.xlsx/.xls) file support to the Knowledge system by routing Excel files through the existing CSV reader infrastructure. The implementation is well-designed, thoroughly tested, and follows repository conventions. The author made a pragmatic architectural decision to reuse CSVReader rather than creating a separate Excel reader, which maintains API consistency and leverages existing chunking strategies.

Strengths

1. Excellent Test Coverage (511 lines)

The test suite in test_excel_reader.py is comprehensive and well-structured:

  • ✅ Both formats (.xlsx and .xls) covered
  • ✅ Edge cases: empty sheets, trailing cells, special characters, unicode
  • ✅ Data type handling (floats, booleans, datetime, None)
  • ✅ Async operations tested
  • ✅ Error conditions (missing dependencies, corrupted files)
  • ✅ Whitespace preservation (follow-up fix)
  • ✅ Chunking behavior validated
  • ✅ BytesIO and Path inputs tested

2. Smart Architectural Decision

Routing Excel through CSVReader is pragmatic:

  • Reuses existing chunking infrastructure (RowChunking, etc.)
  • No new reader key or API surface
  • Consistent behavior between CSV and Excel
  • Clear exit strategy if richer Excel semantics are needed later

3. Clean Implementation

Helper functions are well-designed:

  • _get_workbook_name() - Clear priority order for name resolution
  • _infer_file_extension() - Consistent extension detection
  • _stringify_spreadsheet_cell_value() - Type-aware cell conversion
  • _row_values_to_csv_line() - Proper trailing cell trimming
  • _excel_rows_to_documents() - Per-sheet document generation

4. Proper Metadata Handling

Each sheet becomes a Document with:

  • sheet_name - enables filtering and tracking
  • sheet_index - maintains sheet order
  • Sheet metadata preserved through chunking operations
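
To illustrate how sheet metadata can survive chunking, here is a simplified row-based chunker. It is not the library's RowChunking strategy; the dict-based chunk shape and the `chunk` key are assumptions for the example.

```python
from typing import Any, Dict, List


def chunk_rows(content: str, meta_data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Split sheet content into one chunk per row, copying the sheet metadata."""
    chunks: List[Dict[str, Any]] = []
    for chunk_number, line in enumerate(content.split("\n"), start=1):
        chunks.append(
            {
                "content": line,
                # Copy the parent's metadata so each chunk still knows its sheet.
                "meta_data": {**meta_data, "chunk": chunk_number},
            }
        )
    return chunks
```

Because each chunk carries `sheet_name` and `sheet_index`, a retrieval result can be traced back to the sheet it came from.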

5. Repository Conventions Followed

  • ✅ Both sync and async variants provided (CLAUDE.md requirement)
  • ✅ No emojis in code (.cursorrules)
  • ✅ Proper error handling with helpful messages
  • ✅ FieldLabeledCSVReader also supports Excel (consistency)

Issues & Recommendations

1. Minor: Inconsistent Boolean Formatting (csv_reader.py:48-62)

Location: csv_reader.py:58-60 in _stringify_spreadsheet_cell_value()

Issue: Booleans are converted to their Python string representation ("True", "False"), which differs from CSV conventions:

# Current behavior (test line 266-267):
assert lines[3] == "boolean_true, True"   # Python repr
assert lines[4] == "boolean_false, False" # Python repr

Suggestion: Consider using lowercase or numeric representation for consistency with spreadsheet applications:

if isinstance(value, bool):
    return "1" if value else "0"  # or "true"/"false"

Impact: Low - current behavior works, but might cause parsing issues with downstream systems expecting standard CSV boolean formats.

2. Minor: Hardcoded Delimiter (csv_reader.py:69)

Issue: _row_values_to_csv_line() hardcodes ", " (comma-space):

return ", ".join(values)

This creates inconsistency: actual CSV files respect the delimiter parameter, but Excel-derived content always uses comma-space.

Suggestion: Consider making this configurable or documenting this difference.

Impact: Low - affects consistency when mixing Excel and CSV with custom delimiters.

3. Documentation: Missing Docstring Updates

Locations:

  • csv_reader.py:123-125 (read() method)
  • csv_reader.py:189-207 (async_read() method)

Issue: The public API now supports Excel files, but the docstrings don't mention this.

Suggestion: Update docstrings to reflect Excel support:

"""
Read a CSV file or Excel workbook (.xlsx/.xls).

For Excel files: each sheet becomes a separate Document with sheet metadata.
For CSV files: the entire file is processed as a single Document (unless chunked).

Args:
    file: Path to CSV or Excel file, or file-like object
    delimiter: CSV delimiter (ignored for Excel files)
    quotechar: CSV quote character (ignored for Excel files)
    name: Optional name override for the document
    
Returns:
    List of Document objects
"""

Impact: Medium - users reading the code may not realize Excel is supported.

4. Code Quality: Silent Failure on Corrupted Files (csv_reader.py:184-187)

Issue: When reading corrupted Excel files, errors are caught and logged, but an empty list is returned without distinguishing between "empty file" and "corrupted file":

except Exception as e:
    file_desc = getattr(file, "name", str(file)) if isinstance(file, IO) else file
    log_error(f"Error reading {file_desc}: {e}")
    return []

Suggestion: Consider differentiating error types:

  • Empty/no-data: Return []
  • File corruption: Raise exception or add metadata indicating corruption
  • This helps users debug issues rather than silently failing

Impact: Medium - users may not realize their files are corrupted vs. empty.

5. Nitpick: Missing Type Hints (csv_reader.py:22-98)

Issue: Helper functions lack complete type hints:

def _excel_rows_to_documents(
    *,
    workbook_name: str,
    sheets: Iterable[Tuple[str, Iterable[Sequence[Any]]]],
) -> List[Document]:  # Good!

def _get_workbook_name(file: Union[Path, IO[Any]], name: Optional[str]) -> str:  # Good!

Most are complete, but consistency would help with type checking.

Impact: Very Low - minor code quality improvement.

6. Note: Unrelated Dependency Fix (pyproject.toml:123)

Observation: The change from memorisdk==3.0.5 to memori>=3.0.5 is correct and necessary for CI, but unrelated to Excel support.

Suggestion: This is fine to include, but worth noting it's a separate fix discovered during testing.

Architecture Assessment

The design decision to route Excel through CSVReader is sound:

Pros:

  • Reuses existing chunking infrastructure
  • No API surface expansion
  • Consistent behavior
  • Minimal code changes

⚠️ Future Considerations (as mentioned in PR description):
The PR author correctly identifies that if richer spreadsheet semantics are needed (formulas, formatting, table regions), a dedicated ExcelReader can be extracted. This is a good exit strategy.

Potential future enhancements:

  • Formulas: openpyxl returns formula strings with data_only=False (the default) and the last cached values with data_only=True; it does not evaluate formulas itself
  • Cell formatting: Would need custom reader (likely not relevant for Knowledge ingestion)
  • Table regions: openpyxl has worksheet.tables API
  • Specific sheet selection: Could add sheets=["Sheet1", "Data"] parameter

Test Quality

The test suite demonstrates exceptional quality:

  • ✅ Uses pytest fixtures effectively
  • ✅ Tests both sync and async paths
  • ✅ Covers error conditions with proper assertions
  • ✅ Uses pytest.importorskip() for optional dependencies
  • ✅ Creates real Excel files (not mocked) for integration testing
  • ✅ Tests BytesIO and Path inputs
  • ✅ Validates metadata propagation through chunking

Minor suggestion: Some tests could use parametrization to reduce duplication:

@pytest.mark.parametrize("extension,writer", [(".xlsx", write_xlsx), (".xls", write_xls)])
def test_reader_handles_format(extension, writer, tmp_path):
    # Test both formats with same logic

Code Quality Checklist

  • ✅ Follows .cursorrules patterns (no emojis, proper error handling)
  • ✅ Both sync and async variants provided (CLAUDE.md requirement)
  • ✅ Proper exception handling with helpful error messages
  • ✅ Tests follow pytest conventions
  • ⚠️ Docstrings need updating for API changes (see issue #3)
  • ✅ Type hints mostly complete
  • ✅ Performance: read_only=True used for memory efficiency
  • ✅ Async operations use asyncio.to_thread() appropriately

Dependency Changes

The changes to pyproject.toml are appropriate:

csv = ["aiofiles", "openpyxl", "xlrd"]
  • openpyxl for .xlsx files (modern format)
  • xlrd for .xls files (legacy format)
  • ✅ Both added to existing agno[csv] extra (logical grouping)
  • memori fix is correct (separate issue)

Performance Notes

  • read_only=True in openpyxl optimizes memory usage
  • ✅ Chunking support works correctly
  • ✅ Trailing cell trimming prevents "a, , , , ," issues
  • ✅ Empty sheets and rows are skipped efficiently

Breaking Changes

None - this is purely additive functionality.

Missing: Cookbook Example

Observation: The PR description mentions cookbook/07_knowledge/readers/excel_reader.py as ADDED, but it's not present in the git diff.

Recommendation: According to CLAUDE.md, "every pattern should have an example". Consider adding a cookbook example showing:

  • Basic Excel ingestion
  • Multi-sheet workbook handling
  • Excel with Knowledge and RAG
  • Comparison with CSV behavior

This would help users discover and understand the feature.

Recommendation

✅ APPROVE with minor suggestions

The implementation is solid, well-tested, and production-ready. The issues identified are minor and mostly documentation/polish items that can be addressed in follow-up PRs if desired. The core functionality is excellent.

Priority for follow-up:

  1. Update docstrings to mention Excel support (Medium)
  2. Add cookbook example per CLAUDE.md guidelines (Medium)
  3. Consider boolean formatting consistency (Low)
  4. Improve error handling for corrupted files (Low)

The whitespace preservation fix (follow-up comment by @liqiongyu) shows excellent attention to detail and consistency with CSV behavior.

Great work! This is a well-executed feature addition that maintains code quality and follows repository standards.


- Add _convert_xls_cell_value() to properly convert xlrd booleans (1/0 → True/False)
  and dates (serial numbers → datetime)
- Normalize line endings (\n, \r, \r\n) to spaces in _stringify_spreadsheet_cell_value()
  to preserve row integrity when cells contain multiline content
- Apply stringify to CSV path so embedded newlines in CSV cells are also normalized
- Use list+join pattern for CSV content (no trailing newline, consistent with Excel)
- Add comprehensive tests for all edge cases across CSV, XLSX, and XLS formats
@Mustafa-Esoofally force-pushed the feat/4872-excel-knowledge branch from 6a60329 to 4b8dc79 on January 21, 2026 14:48
@kausmeows kausmeows merged commit dc64057 into agno-agi:main Jan 21, 2026
6 of 8 checks passed

Development

Successfully merging this pull request may close these issues.

[Feature Request] Support for excel files on knowledge

3 participants