Add a check for complex pdfs #4268

Open
aadland6 wants to merge 6 commits into main from check-complex-pdf

Conversation

@aadland6
Contributor

This checks whether a PDF file is likely a complex document that is mostly vector graphics, like mini-holistic-3-v1-Eng_Civil-Structural-Drawing_p001.pdf, by computing the ratio of vector graphics to text elements.

To limit the per-file overhead, the check only runs on files above a minimum size.
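
As a rough illustration of the idea described above (not the code in this PR), the check can be sketched as counting operators in a decoded content stream. The pattern names, the operator subsets, and both helper functions below are assumptions made for this sketch; the PR's own GRAPHICS_OPS_PATTERN and TEXT_OPS_PATTERN may look different.

```python
import re

# Hypothetical patterns for illustration only.
# Path-construction operators (m, l, c, re) that follow a numeric operand.
SKETCH_GRAPHICS_OPS = re.compile(rb"(?<=[\d.])\s+(?:m|l|c|re)\b")
# Text-showing operators (Tj, TJ) that follow a string or array operand.
SKETCH_TEXT_OPS = re.compile(rb"(?<=[)\]>])\s*(?:Tj|TJ)\b")


def graphics_to_text_ratio(stream: bytes) -> float:
    """Return graphics operators per text operator in a decoded content stream."""
    num_graphics_ops = len(SKETCH_GRAPHICS_OPS.findall(stream))
    num_text_ops = len(SKETCH_TEXT_OPS.findall(stream))
    # max(..., 1) avoids dividing by zero when a stream has no text at all.
    return num_graphics_ops / max(num_text_ops, 1)


def looks_too_complex(
    stream: bytes,
    file_size_bytes: int,
    min_file_size_bytes: int = 1024 * 1024,
    min_graphics_to_text_ratio: float = 20.0,
) -> bool:
    """Skip small files early, then compare the ratio against the threshold."""
    if file_size_bytes < min_file_size_bytes:
        return False
    return graphics_to_text_ratio(stream) >= min_graphics_to_text_ratio
```

The early return on file size is what keeps the cost close to zero for ordinary small documents; only larger files pay for the regex scan.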

@aadland6 aadland6 marked this pull request as ready for review February 26, 2026 19:42
@aadland6 aadland6 requested a review from qued February 26, 2026 19:43
@aadland6 aadland6 force-pushed the check-complex-pdf branch from dc5683d to 59e77ce Compare March 2, 2026 14:03
@aadland6 aadland6 requested a review from qued March 2, 2026 18:56
@qued
Contributor

qued commented Mar 2, 2026

@claude, could you review this? Additionally, could you suggest how to unit test the function is_pdf_too_complex?

@claude

claude bot commented Mar 2, 2026

Claude finished @qued's task


Code Review and Testing Suggestions

Todo List:

  • Read and analyze the changed files
  • Review the is_pdf_too_complex function implementation
  • Address previous review comments about regex compilation
  • Address formatting/line length issues
  • Suggest comprehensive unit testing approach
  • Provide final code review feedback

Code Review Feedback

Strengths

  • Good performance optimization: Early exits based on file size and stream size minimize overhead
  • Clear algorithm: Well-documented approach using graphics-to-text ratio for complexity detection
  • Robust error handling: Proper exception handling with logging and cursor position restoration
  • Defensive programming: Handles various file input types (bytes, file objects, filenames)

🔧 Issues to Address

1. Regex Compilation Performance (Previous Review Comment)

Location: unstructured/partition/pdf.py:95-105

The regex patterns GRAPHICS_OPS_PATTERN and TEXT_OPS_PATTERN are compiled at module level (lines 95-105), not inside the function, which is correct. The earlier review comment about compiling them outside the function therefore appears to be already addressed.

2. Line Length Formatting Issues (Previous Review Comment)

Location: Multiple locations in unstructured/partition/pdf.py

Several lines exceed the 100-character limit specified in your pyproject.toml. Key examples:

  • Line 598: min_graphics_to_text_ratio: float = 20.0, (part of function signature that's too long)
  • Line 599: min_file_size_bytes: int = int(1 * 1024 * 1024), # 1 MB
  • Lines 707-713: The logging message is split across lines sensibly but could be formatted more consistently

3. Magic Numbers Could Be Constants

Location: unstructured/partition/pdf.py:599-601

Consider defining module-level constants:

DEFAULT_MIN_FILE_SIZE_BYTES = 1 * 1024 * 1024  # 1 MB
DEFAULT_MIN_RAW_STREAM_BYTES = 100_000         # 100 KB

4. Type Hints Enhancement

Location: unstructured/partition/pdf.py:594

The function signature could benefit from more explicit typing for the file parameter:

from typing import IO, Optional, Union

def is_pdf_too_complex(
    filename: str = "",
    file: Optional[Union[bytes, IO[bytes]]] = None,
    # ... remaining parameters unchanged
) -> bool:
    ...

5. Potential Division by Zero Edge Case

Location: unstructured/partition/pdf.py:704

While max(num_text_ops, 1) prevents division by zero, consider logging when num_text_ops == 0 as this might indicate an unusual PDF structure.
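
A minimal sketch of that optional logging, assuming the counts are already available as integers. The function name safe_ratio and the logger name are hypothetical and not taken from the PR.

```python
import logging

logger = logging.getLogger("complexity_sketch")  # hypothetical logger name


def safe_ratio(num_graphics_ops: int, num_text_ops: int) -> float:
    """Compute the graphics-to-text ratio, noting when no text operators exist."""
    if num_text_ops == 0:
        # With zero text operators the ratio degenerates to the raw graphics count.
        logger.debug(
            "No text operators found; using graphics count %d as the ratio.",
            num_graphics_ops,
        )
    return num_graphics_ops / max(num_text_ops, 1)
```

Logging at debug level keeps the message out of normal output while still leaving a trace for unusual PDFs.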

🧪 Comprehensive Unit Testing Strategy

Here's a detailed testing approach for the is_pdf_too_complex function:

Test Categories

  1. Basic Functionality Tests

    def test_simple_text_pdf_not_complex():
        """Test that a simple text PDF is not flagged as complex"""
        # Use a minimal PDF with mostly text content
        
    def test_complex_vector_pdf_detected():
        """Test that a CAD/vector-heavy PDF is correctly identified"""
        # Create or use a PDF with high graphics-to-text ratio
  2. Edge Case Tests

    def test_small_file_size_skipped():
        """Test that files below min_file_size_bytes are skipped"""
        # Create tiny PDF, ensure function returns False without analysis
        
    def test_empty_pdf_handled():
        """Test handling of empty or corrupted PDF files"""
        
    def test_pdf_with_no_content_streams():
        """Test PDFs where pages have no /Contents"""
        
    def test_small_content_streams_skipped():
        """Test that pages with small content streams are skipped"""
  3. Input Type Tests

    def test_filename_input():
        """Test function with filename string parameter"""
        
    def test_bytes_input():
        """Test function with bytes parameter"""
        
    def test_file_object_input():
        """Test function with file-like object parameter"""
        
    def test_file_cursor_position_preserved():
        """Ensure file cursor position is restored after analysis"""
  4. Threshold Boundary Tests

    def test_graphics_ops_threshold_boundary():
        """Test behavior at max_graphics_ops threshold boundaries"""
        # Test with exactly max_graphics_ops, max_graphics_ops+1, etc.
        
    def test_ratio_threshold_boundary():
        """Test behavior at ratio threshold boundaries"""
        # Test with ratios just above/below min_graphics_to_text_ratio
  5. Error Handling Tests

    def test_corrupted_pdf_handling():
        """Test graceful handling of corrupted PDF files"""
        
    def test_password_protected_pdf():
        """Test behavior with password-protected PDFs"""
        
    def test_file_access_errors():
        """Test handling when file cannot be accessed"""
  6. Performance Tests

    def test_large_file_performance():
        """Verify performance optimizations work on large files"""
        # Ensure function completes quickly on large files
        
    def test_many_pages_performance():
        """Test performance with PDFs containing many pages"""

Test Data Strategies

  1. Generated Test PDFs: Create minimal PDFs programmatically with controlled graphics/text ratios
  2. Real-world Samples: Include sanitized examples of CAD drawings, technical diagrams, and text documents
  3. Synthetic Content: Use pypdf to create PDFs with specific operator counts for precise threshold testing
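
One possible way to get controlled operator counts without any PDF library is to build the raw content-stream bytes directly. This is only a sketch: build_content_stream is a hypothetical helper, it produces a bare content stream rather than a complete PDF file, and it would still need to be wrapped in a real document (for example with pypdf) before being passed to is_pdf_too_complex.

```python
def build_content_stream(num_graphics_ops: int, num_text_ops: int) -> bytes:
    """Build content-stream bytes with an exact number of `re` and `Tj` operators."""
    lines = []
    for i in range(num_graphics_ops):
        # One rectangle path operator per line: x y width height re
        lines.append(b"%d %d 1 1 re\n" % (i, i))
    for i in range(num_text_ops):
        # One text-showing operator per line, wrapped in a text object.
        lines.append(b"BT (t%d) Tj ET\n" % i)
    return b"".join(lines)
```

Because the counts are exact, a test can place the ratio just above or just below a threshold, for instance 40 rectangles and 2 text runs for a ratio of 20.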

Mock Testing for Isolation

def test_regex_pattern_matching():
    """Test regex patterns match expected PDF operators"""
    # Mock the content stream bytes and verify regex matches
    
def test_file_size_calculation():
    """Test file size determination logic across input types"""
    # Mock different input types and verify size calculation
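
For the file-size case, a sketch of what size determination across input types might look like, with the cursor restored for file objects. The helper input_size_bytes is an assumption for illustration; the PR may compute the size differently.

```python
import io
import os
from typing import IO, Optional, Union


def input_size_bytes(
    filename: str = "",
    file: Optional[Union[bytes, IO[bytes]]] = None,
) -> int:
    """Return the input size in bytes for raw bytes, a file object, or a path."""
    if isinstance(file, bytes):
        return len(file)
    if file is not None:
        position = file.tell()
        try:
            # Seeking to the end returns the absolute offset, i.e. the size.
            return file.seek(0, io.SEEK_END)
        finally:
            # Restore the caller's cursor position whatever happens.
            file.seek(position)
    return os.path.getsize(filename)
```

A test against this shape can use io.BytesIO in place of a mock, seek to an arbitrary offset first, and then assert that both the size and the offset are what was expected.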

Integration Testing Considerations

  • Test integration with the main partition_pdf_or_image function
  • Verify logging messages are appropriate and informative
  • Test that complex PDFs properly fall back to hi_res strategy

📋 Summary

The is_pdf_too_complex function is well-designed with good performance optimizations. The main issues are formatting-related (line length) and minor improvements around constants and logging. The testing strategy should focus on boundary conditions, different input types, and real-world PDF complexity scenarios.

Priority fixes:

  1. Address line length issues for consistency with project standards
  2. Consider extracting magic numbers to constants
  3. Add comprehensive unit tests as outlined above

Contributor

@qued qued left a comment

To my eyes this looks good, but we should add unit tests.

I had Claude take a pass. I agree with Claude's points 3 (Magic Numbers Could Be Constants) and 4 (Type Hints Enhancement). I'd consider Claude's suggestion 5 (Potential Division by Zero Edge Case) optional.

For the unit tests, I think categories 1 and 2 suggested by Claude would be good, but the others are overkill.

@aadland6 aadland6 requested a review from qued March 2, 2026 21:05