
aseembits93 · 2026-01-27T02:31:38Z

📄 30% (0.30x) speedup for `merge_out_layout_with_ocr_layout` in `unstructured/partition/pdf_image/ocr.py`

⏱️ Runtime : 329 milliseconds → 252 milliseconds (best of 5 runs)

📝 Explanation and details

The optimized code achieves a 30% speedup through three changes in `aggregate_embedded_text_by_block` and `supplement_layout_with_ocr_elements`.

Key Optimizations

1. Replaced `.sum(axis=1).astype(bool)` with `.any(axis=1)`

This change appears in both functions when computing boolean masks from the result of `bboxes1_is_almost_subregion_of_bboxes2()`.

Why it's faster:

- `.sum(axis=1)` builds an intermediate integer array by counting the `True` values in each row, which `.astype(bool)` then converts back to booleans.
- `.any(axis=1)` reduces each row with a logical OR and can stop scanning a row once it finds a `True` value, instead of summing every column.
- The explicit `.astype(bool)` conversion, and the extra array it allocates, is eliminated.

Performance impact: Based on line profiler, the mask computation in aggregate_embedded_text_by_block dropped from ~234ms to ~222ms (5% faster), and the overall function improved from 551ms to 443ms (19.6% faster).
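
A minimal sketch of the mask change, assuming the subregion helper returns a 2-D boolean array with one row per source region and one column per target box. The array below is made up for illustration and is not taken from the PR.

```python
import numpy as np

# Hypothetical stand-in for the boolean matrix returned by
# bboxes1_is_almost_subregion_of_bboxes2(): rows are source regions,
# columns are target boxes.
is_subregion = np.array(
    [
        [False, False, True],
        [False, False, False],
        [True, True, False],
    ]
)

# Original form: count the True values per row, then cast the counts to bool.
mask_via_sum = is_subregion.sum(axis=1).astype(bool)

# Optimized form: logical-OR reduction per row; the result is already boolean.
mask_via_any = is_subregion.any(axis=1)

# Both forms select exactly the same rows.
assert mask_via_any.dtype == np.bool_
assert np.array_equal(mask_via_sum, mask_via_any)
```

The two expressions are interchangeable for boolean inputs, so the change affects only the cost of building the mask, not which regions are selected.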

2. Avoided redundant slicing operations

In `aggregate_embedded_text_by_block`, the optimized code stores `sliced = source_regions.slice(mask)` once and reuses it, instead of calling `source_regions.slice(mask)` three separate times.

Why it's faster:

- Each `slice()` operation creates a new object with copies of the coordinate and text arrays.
- Line profiler shows the original made 3 separate slice calls (48ms + 25ms + 34ms = 107ms total).
- The optimized version makes 1 slice call (~28ms), saving ~79ms across the profiled run.
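
The pattern can be sketched as follows. `Regions` is a hypothetical stand-in that only mimics a `slice(mask)` method returning copies; it is not the library's class, and the two `aggregate_*` functions are illustrative, not the PR's code.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Regions:
    """Hypothetical stand-in for the real region container; not the library class."""

    coords: np.ndarray
    texts: np.ndarray
    slice_calls: int = 0

    def slice(self, mask: np.ndarray) -> "Regions":
        # Boolean indexing copies both arrays, which is the cost being avoided.
        self.slice_calls += 1
        return Regions(self.coords[mask], self.texts[mask])


def aggregate_repeated(source: Regions, mask: np.ndarray):
    # Original pattern: every use re-slices the source.
    text = " ".join(source.slice(mask).texts)
    coords = source.slice(mask).coords
    count = len(source.slice(mask).texts)
    return text, coords, count


def aggregate_reused(source: Regions, mask: np.ndarray):
    # Optimized pattern: slice once and reuse the result.
    sliced = source.slice(mask)
    return " ".join(sliced.texts), sliced.coords, len(sliced.texts)


mask = np.array([True, False, True])
a = Regions(np.arange(12.0).reshape(3, 4), np.array(["foo", "skip", "bar"], dtype=object))
b = Regions(a.coords.copy(), a.texts.copy())

repeated = aggregate_repeated(a, mask)
reused = aggregate_reused(b, mask)
```

Both functions return the same values; the `slice_calls` counter only makes the difference in the number of copies visible.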

3. Early exit with `mask.any()`

The optimized code checks `if mask.any():` before processing, avoiding unnecessary work when no regions match.

Why it's faster:

- Skips text joining, bbox extraction, and IoU calculations when the mask is empty.
- Particularly beneficial for the 368 cases (31% of calls) where no matching regions exist.
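
A minimal sketch of the guard, under stated assumptions: `join_matching_texts` is a hypothetical helper, and returning an empty string for the no-match case is an assumption made for this example, since the PR text does not show what the real function returns there.

```python
import numpy as np


def join_matching_texts(texts: np.ndarray, mask: np.ndarray) -> str:
    """Hypothetical helper: join the texts selected by mask, or return "" early."""
    if not mask.any():
        # Nothing matched: skip slicing, joining, and any box computations.
        return ""
    return " ".join(texts[mask])


texts = np.array(["alpha", "beta", "gamma"], dtype=object)
no_match = np.zeros(3, dtype=bool)
some_match = np.array([True, False, True])

empty_result = join_matching_texts(texts, no_match)
joined_result = join_matching_texts(texts, some_match)
```

The guard costs one cheap reduction per call, which is small next to the slicing and joining it avoids when nothing matches.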

Impact Based on Test Results

The optimization is particularly effective for workloads with:

- Many elements requiring text aggregation (roughly 15-42% speedup on tests with 100-500 elements)
  - `test_large_scale_many_elements_aggregated_without_supplementing`: 77ms → 67.2ms (14.6% faster)
  - `test_merge_large_number_of_elements`: 43.8ms → 31.0ms (41.3% faster)
  - `test_merge_boundary_coordinates_large_scale`: 87.3ms → 61.5ms (41.8% faster)
- Documents with invalid text patterns (10-20% speedup)
  - `test_invalid_texts_are_replaced_with_aggregated_ocr_text`: 250μs → 222μs (12.4% faster)
  - `test_merge_with_all_invalid_text`: 654μs → 554μs (18.1% faster)
- Complex spatial matching scenarios (33-36% speedup)
  - `test_merge_with_overlapping_elements`: 25.2ms → 18.9ms (33.3% faster)
  - `test_merge_with_varied_subregion_thresholds`: 78.6ms → 57.6ms (36.4% faster)
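
The percentages in this report appear to follow two conventions. The per-test and headline figures are consistent with a speedup measured against the optimized runtime, while the 19.6% line-profiler figure is consistent with time saved relative to the original runtime. The sketch below works through both. The convention is inferred from the reported numbers, not confirmed by the report, and the function names are made up for this example.

```python
def speedup_pct(original: float, optimized: float) -> float:
    """Speedup relative to the optimized runtime, in percent."""
    return (original - optimized) / optimized * 100.0


def time_reduction_pct(original: float, optimized: float) -> float:
    """Runtime saved relative to the original runtime, in percent."""
    return (original - optimized) / original * 100.0


# Headline runtime in milliseconds.
overall = speedup_pct(329.0, 252.0)

# test_merge_large_number_of_elements, in milliseconds.
large_merge = speedup_pct(43.8, 31.0)

# Line-profiler total for the aggregation function, in milliseconds.
profiled = time_reduction_pct(551.0, 443.0)

print(f"{overall:.1f}% {large_merge:.1f}% {profiled:.1f}%")
```

Small mismatches against the reported values are expected, because the report rounds the runtimes before printing them.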

Context Impact

The function `merge_out_layout_with_ocr_layout` is called from `supplement_page_layout_with_ocr` in OCR processing hot paths, specifically when `ocr_mode == OCRMode.FULL_PAGE`. Each processed page invokes this function once, so the 30% speedup in this function carries over directly to document-processing throughput for PDF/image partitioning workflows.

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | ✅ 24 Passed |
| 🌀 Generated Regression Tests | ✅ 29 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `partition/pdf_image/test_ocr.py::test_merge_out_layout_with_cid_code` | 2.55ms | 2.26ms | 12.6%✅ |
| `partition/pdf_image/test_ocr.py::test_merge_out_layout_with_ocr_layout` | 2.32ms | 2.05ms | 13.2%✅ |

🌀 Generated Regression Tests

import numpy as np  # used to build arrays for coordinates and texts
from unstructured_inference.constants import IsExtracted
from unstructured_inference.inference.elements import TextRegions
from unstructured_inference.inference.layoutelement import LayoutElements

# imports
from unstructured.partition.pdf_image.ocr import merge_out_layout_with_ocr_layout


# Helper constructors: attempt several plausible constructor signatures for the domain classes.
# This keeps tests robust across minor constructor variations in the real codebase while still
# using the real classes (no stubs).
def _construct_layout_elements(element_coords, texts, sources=None):
    """
    Try to construct a LayoutElements object using a few reasonable constructor signatures.
    Raise RuntimeError with helpful debugging info if none work.
    """
    # Normalize inputs to numpy arrays where appropriate
    ec = np.asarray(element_coords, dtype=float)
    txts = np.asarray(texts, dtype=object)

    possible_args = [
        {
            "element_coords": ec,
            "texts": txts,
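            # `sources or []` raises ValueError if sources is a non-empty NumPy array
            "sources": np.asarray(sources if sources is not None else [], dtype=object),
            "element_class_ids": np.zeros(txts.shape),
            "element_class_id_map": {},
        },
        (ec, txts, np.asarray(sources if sources is not None else [], dtype=object)),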
        (ec, txts),
    ]

    last_exc = None
    for args in possible_args:
        try:
            if isinstance(args, dict):
                return LayoutElements(**args)
            else:
                return LayoutElements(*args)
        except Exception as e:
            last_exc = e
            continue

    # If we get here, none of the constructors worked; dump useful info for debugging
    raise RuntimeError(
        "Could not instantiate LayoutElements with tried constructor signatures. "
        f"Last exception: {last_exc!r}"
    )


def _construct_text_regions(element_coords, texts, sources=None, is_extracted=None):
    """
    Try to construct a TextRegions instance with several likely constructor signatures.
    """
    ec = np.asarray(element_coords, dtype=float)
    txts = np.asarray(texts, dtype=object)
    # `sources or []` raises ValueError if sources is a non-empty NumPy array
    sources_arr = np.asarray(sources if sources is not None else [], dtype=object)
    # default is_extracted flags if not provided
    if is_extracted is None:
        is_extracted = np.array([IsExtracted.TRUE] * txts.shape[0], dtype=object)
    else:
        is_extracted = np.asarray(is_extracted, dtype=object)

    possible_args = [
        {
            "element_coords": ec,
            "texts": txts,
            "sources": sources_arr,
            "is_extracted_array": is_extracted,
        },
        (ec, txts, sources_arr, is_extracted),
        (ec, txts, sources_arr),
    ]

    last_exc = None
    for args in possible_args:
        try:
            if isinstance(args, dict):
                return TextRegions(**args)
            else:
                return TextRegions(*args)
        except Exception as e:
            last_exc = e
            continue

    raise RuntimeError(
        "Could not instantiate TextRegions with tried constructor signatures. "
        f"Last exception: {last_exc!r}"
    )


def test_returns_out_layout_when_out_or_ocr_empty():
    """
    Basic: Ensure the function returns early/unmodified when either input collection is empty.
    - Case A: out_layout is empty (should return out_layout immediately even if ocr_layout not empty).
    - Case B: ocr_layout is empty (should return out_layout immediately).
    """
    # Build an empty LayoutElements (0 boxes)
    empty_coords = np.zeros((0, 4))
    empty_texts = np.array([], dtype=object)

    out_empty = _construct_layout_elements(empty_coords, empty_texts)
    # Build a tiny OCR layout with one element to test early-return in Case A
    ocr_coords = np.array([[0.0, 0.0, 1.0, 1.0]])
    ocr_texts = np.array(["some text"], dtype=object)
    ocr_nonempty = _construct_text_regions(ocr_coords, ocr_texts)

    # Case A: out_layout empty -> should return out_layout (empty)
    codeflash_output = merge_out_layout_with_ocr_layout(out_empty, ocr_nonempty)
    result_a = codeflash_output  # 1.97μs -> 2.01μs (1.89% slower)

    # Case B: ocr_layout empty -> should return out_layout unchanged (non-empty out_layout)
    out_coords = np.array([[0.0, 0.0, 1.0, 1.0]])
    out_texts = np.array(["valid"], dtype=object)
    out_nonempty = _construct_layout_elements(out_coords, out_texts)
    ocr_empty = _construct_text_regions(empty_coords, empty_texts)

    codeflash_output = merge_out_layout_with_ocr_layout(out_nonempty, ocr_empty)
    result_b = codeflash_output  # 1.56μs -> 1.42μs (9.70% faster)


def test_valid_texts_not_modified_when_supplement_false():
    """
    Basic: If out_layout.texts are already valid, the function should not modify them.
    Use supplement_with_ocr_elements=False to avoid additional OCR supplementation behavior.
    """
    # Single element in out layout with valid ASCII text
    coords = np.array([[0.1, 0.1, 0.5, 0.5]])
    texts = np.array(["Already good text"], dtype=object)
    out_layout = _construct_layout_elements(coords, texts)

    # OCR layout with some text (should be ignored as supplement_with_ocr_elements=False)
    ocr_coords = np.array([[0.1, 0.1, 0.5, 0.5]])
    ocr_texts = np.array(["ocr text"], dtype=object)
    ocr_layout = _construct_text_regions(ocr_coords, ocr_texts)

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 10.9μs -> 10.0μs (8.94% faster)


def test_invalid_texts_are_replaced_with_aggregated_ocr_text():
    """
    Edge: If out_layout contains invalid text (empty string or containing '(cid:'), the function
    should attempt to aggregate OCR text into that element.
    We create coordinates that exactly match so aggregation selects the OCR region.
    """
    # Create a single out element with invalid text (empty string)
    coords = np.array([[0.0, 0.0, 1.0, 1.0]])
    invalid_texts = np.array([""], dtype=object)  # empty -> invalid per valid_text
    out_layout = _construct_layout_elements(coords, invalid_texts)

    # OCR layout contains a matching box with real text we expect to be aggregated
    ocr_coords = np.array([[0.0, 0.0, 1.0, 1.0]])
    ocr_texts = np.array(["Aggregated OCR text"], dtype=object)
    # Ensure OCR regions are marked as extracted (this helps make aggregate_embedded_text_by_block
    # set IsExtracted.TRUE conditions if logic depends on it).
    ocr_is_extracted = np.array([IsExtracted.TRUE], dtype=object)
    ocr_layout = _construct_text_regions(ocr_coords, ocr_texts, is_extracted=ocr_is_extracted)

    # Perform merge and verify the empty string was replaced by OCR text
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 250μs -> 222μs (12.4% faster)


def test_invalid_cid_texts_are_treated_as_invalid_and_replaced():
    """
    Edge: Strings containing '(cid:' should be considered invalid by valid_text and replaced
    with OCR-aggregated text.
    """
    coords = np.array([[0.2, 0.2, 0.8, 0.8]])
    # containing (cid: should be invalid
    invalid_texts = np.array(["(cid:1234)"], dtype=object)
    out_layout = _construct_layout_elements(coords, invalid_texts)

    ocr_coords = np.array([[0.2, 0.2, 0.8, 0.8]])
    ocr_texts = np.array(["Replaced text"], dtype=object)
    ocr_is_extracted = np.array([IsExtracted.TRUE], dtype=object)
    ocr_layout = _construct_text_regions(ocr_coords, ocr_texts, is_extracted=ocr_is_extracted)

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 258μs -> 218μs (18.3% faster)


def test_large_scale_many_elements_aggregated_without_supplementing():
    """
    Large-scale: Create many (but under 1000) out elements with initially invalid text (empty),
    and corresponding OCR regions with matching boxes and unique texts. Ensure merge handles
    a moderate sized input and replaces all invalid texts.
    This verifies scalability and that the aggregation loop works across many elements.
    """
    n = 100  # moderate size: exercises the aggregation loop while keeping the test fast
    # Build n identical boxes for simplicity (exact match ensures subregion tests pass)
    out_coords = np.tile(np.array([[0.0, 0.0, 1.0, 1.0]]), (n, 1))
    out_texts = np.array([""] * n, dtype=object)  # all invalid initially
    out_layout = _construct_layout_elements(out_coords, out_texts)

    # Build corresponding OCR boxes and unique OCR texts for aggregation
    ocr_coords = np.tile(np.array([[0.0, 0.0, 1.0, 1.0]]), (n, 1))
    ocr_texts = np.array([f"ocr_text_{i}" for i in range(n)], dtype=object)
    # Mark all OCR regions as extracted to make the aggregated text more likely to be considered fully_filled
    ocr_is_extracted = np.array([IsExtracted.TRUE] * n, dtype=object)
    ocr_layout = _construct_text_regions(ocr_coords, ocr_texts, is_extracted=ocr_is_extracted)

    # Run the merge without supplementing (we only want aggregation)
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 77.0ms -> 67.2ms (14.6% faster)
    for i in range(n):
        pass


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import numpy as np
from unstructured_inference.inference.elements import TextRegions
from unstructured_inference.inference.layoutelement import LayoutElements

from unstructured.documents.elements import ElementType
from unstructured.partition.pdf_image.ocr import merge_out_layout_with_ocr_layout


def test_merge_empty_out_layout_returns_original():
    """Test that empty out_layout returns immediately without processing."""
    # Create empty out_layout and non-empty ocr_layout
    empty_out_layout = LayoutElements(
        element_coords=np.array([]).reshape(0, 4),
        texts=np.array([], dtype=object),
        sources=np.array([], dtype=object),
        element_class_ids=np.array([], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["OCR text"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Call function with supplement_with_ocr_elements=False to skip that logic
    codeflash_output = merge_out_layout_with_ocr_layout(
        empty_out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 2.33μs -> 2.30μs (1.57% faster)


def test_merge_empty_ocr_layout_returns_original():
    """Test that empty ocr_layout returns original out_layout without modification."""
    # Create non-empty out_layout and empty ocr_layout
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Layout text"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    empty_ocr_layout = TextRegions(
        element_coords=np.array([]).reshape(0, 4),
        texts=np.array([], dtype=object),
        sources=np.array([], dtype=object),
    )

    # Call function with supplement_with_ocr_elements=False
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, empty_ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 2.70μs -> 2.77μs (2.45% slower)


def test_merge_valid_text_skips_aggregation():
    """Test that elements with valid text are not modified by OCR aggregation."""
    # Create out_layout with valid text (no "(cid:" pattern)
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Valid text"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["OCR replacement"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Call with supplement disabled
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 11.1μs -> 10.9μs (2.30% faster)


def test_merge_invalid_text_gets_replaced():
    """Test that elements with invalid text (containing '(cid:') trigger OCR aggregation."""
    # Create out_layout with invalid text containing "(cid:"
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["This (cid:1234) invalid"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["OCR text"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Call with supplement disabled
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 268μs -> 244μs (10.0% faster)


def test_merge_empty_string_is_invalid():
    """Test that empty strings are treated as invalid and trigger OCR aggregation."""
    # Create out_layout with empty text
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array([""], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["OCR provides text"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Call with supplement disabled
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 265μs -> 220μs (20.4% faster)


def test_merge_multiple_elements_mixed_validity():
    """Test merging with multiple elements having mixed valid/invalid text."""
    # Create out_layout with multiple elements
    out_layout = LayoutElements(
        element_coords=np.array(
            [
                [10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120],
            ]
        ),
        texts=np.array(["Valid text", "Invalid (cid:999)", "Another valid"], dtype=object),
        sources=np.array([None, None, None], dtype=object),
        element_class_ids=np.array([0, 0, 0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array(
            [
                [50, 60, 70, 80],  # Overlaps with second element
            ]
        ),
        texts=np.array(["OCR replacement"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Call with supplement disabled
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 253μs -> 224μs (12.8% faster)


def test_merge_with_none_values_in_texts():
    """Test handling of None values in text arrays."""
    # Create out_layout with None values
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array([None], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["OCR text"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Should handle None gracefully and aggregate from OCR
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 254μs -> 220μs (15.1% faster)


def test_merge_subregion_threshold_parameter():
    """Test that custom subregion_threshold is passed correctly to aggregation."""
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 100, 100]]),
        texts=np.array(["Invalid (cid:123)"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[11, 21, 50, 50]]),  # Small region inside out_layout
        texts=np.array(["Small OCR text"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Use very high threshold (should not match)
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False, subregion_threshold=0.95
    )
    result = codeflash_output  # 249μs -> 230μs (8.16% faster)


def test_merge_with_supplement_false():
    """Test that supplement_with_ocr_elements=False does not add new OCR elements."""
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Layout text"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    # OCR layout with element not covered by out_layout
    ocr_layout = TextRegions(
        element_coords=np.array([[200, 300, 400, 500]]),
        texts=np.array(["Uncovered OCR"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 10.7μs -> 10.0μs (7.28% faster)


def test_merge_with_supplement_true_adds_uncovered_ocr():
    """Test that supplement_with_ocr_elements=True adds uncovered OCR elements."""
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Layout text"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    # OCR layout with element not covered by out_layout
    ocr_layout = TextRegions(
        element_coords=np.array([[200, 300, 400, 500]]),
        texts=np.array(["Uncovered OCR"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=True
    )
    result = codeflash_output  # 863μs -> 879μs (1.81% slower)


def test_merge_text_array_dtype_converted_to_object():
    """Test that text array dtype is converted to object before modification."""
    # Create out_layout with strings dtype
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40], [50, 60, 70, 80]]),
        texts=np.array(["Valid", "Invalid (cid:1)"], dtype=object),
        sources=np.array([None, None], dtype=object),
        element_class_ids=np.array([0, 0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[50, 60, 70, 80]]),
        texts=np.array(["OCR"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    # Function should handle dtype conversion gracefully
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 252μs -> 225μs (12.0% faster)


def test_merge_with_all_invalid_text():
    """Test case where all out_layout elements have invalid text."""
    out_layout = LayoutElements(
        element_coords=np.array(
            [
                [10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120],
            ]
        ),
        texts=np.array(
            [
                "Invalid (cid:1)",
                "Also (cid:2) invalid",
                "(cid:3) starts with invalid",
            ],
            dtype=object,
        ),
        sources=np.array([None, None, None], dtype=object),
        element_class_ids=np.array([0, 0, 0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array(
            [
                [10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120],
            ]
        ),
        texts=np.array(["OCR1", "OCR2", "OCR3"], dtype=object),
        sources=np.array([None, None, None], dtype=object),
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 654μs -> 554μs (18.1% faster)
    for text in result.texts:
        pass


def test_merge_with_single_element():
    """Test merge operation with single element in layouts."""
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["(cid:invalid)"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Single OCR"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output


def test_merge_with_large_cid_patterns():
    """Test handling of very long or complex (cid:) patterns."""
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Text with (cid:" + "1234567890" * 10 + ") pattern"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    ocr_layout = TextRegions(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Clean OCR text"], dtype=object),
        sources=np.array([None], dtype=object),
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 258μs -> 222μs (16.0% faster)


def test_merge_multiple_ocr_regions_for_single_layout_element():
    """Test aggregation when multiple OCR regions map to single layout element."""
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 100, 100]]),
        texts=np.array(["(cid:invalid)"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )
    # Multiple OCR regions within the layout element
    ocr_layout = TextRegions(
        element_coords=np.array(
            [
                [15, 25, 40, 40],
                [50, 60, 80, 80],
            ]
        ),
        texts=np.array(["First", "Second"], dtype=object),
        sources=np.array([None, None], dtype=object),
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 263μs -> 231μs (14.1% faster)


def test_merge_large_number_of_elements():
    """Test merge performance with large number of layout elements."""
    # Create 500 layout elements with mixed valid/invalid text
    n_elements = 500
    coords = np.array([[i * 10, i * 10, i * 10 + 20, i * 10 + 20] for i in range(n_elements)])
    texts = np.array(
        ["Valid text" if i % 2 == 0 else f"Invalid (cid:{i})" for i in range(n_elements)],
        dtype=object,
    )
    sources = np.array([None] * n_elements, dtype=object)
    class_ids = np.array([0] * n_elements, dtype=float)

    out_layout = LayoutElements(
        element_coords=coords,
        texts=texts,
        sources=sources,
        element_class_ids=class_ids,
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )

    # Create OCR layout with fewer elements
    ocr_coords = np.array(
        [[i * 10 + 1, i * 10 + 1, i * 10 + 19, i * 10 + 19] for i in range(0, n_elements, 2)]
    )  # 250 elements
    ocr_texts = np.array([f"OCR text {i}" for i in range(len(ocr_coords))], dtype=object)
    ocr_sources = np.array([None] * len(ocr_coords), dtype=object)

    ocr_layout = TextRegions(
        element_coords=ocr_coords,
        texts=ocr_texts,
        sources=ocr_sources,
    )

    # Call merge function
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 43.8ms -> 31.0ms (41.3% faster)


def test_merge_large_ocr_layout():
    """Test merge with large OCR layout (500+ elements)."""
    # Create small out_layout
    out_layout = LayoutElements(
        element_coords=np.array([[0, 0, 1000, 1000]]),
        texts=np.array(["Invalid (cid:test)"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )

    # Create large OCR layout with 600 elements
    n_ocr_elements = 600
    ocr_coords = np.array(
        [
            [i % 100 * 10, i // 100 * 100, i % 100 * 10 + 50, i // 100 * 100 + 50]
            for i in range(n_ocr_elements)
        ]
    )
    ocr_texts = np.array([f"OCR element {i}" for i in range(n_ocr_elements)], dtype=object)
    ocr_sources = np.array([None] * n_ocr_elements, dtype=object)

    ocr_layout = TextRegions(
        element_coords=ocr_coords,
        texts=ocr_texts,
        sources=ocr_sources,
    )

    # Call merge function
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 3.31ms -> 3.04ms (8.83% faster)


def test_merge_with_supplement_large_uncovered_ocr():
    """Test supplement logic with many uncovered OCR elements."""
    # Create out_layout with only one element
    out_layout = LayoutElements(
        element_coords=np.array([[10, 20, 30, 40]]),
        texts=np.array(["Layout"], dtype=object),
        sources=np.array([None], dtype=object),
        element_class_ids=np.array([0], dtype=float),
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )

    # Create large OCR layout with mostly uncovered elements
    n_uncovered = 450
    ocr_coords = np.array(
        [
            [1000 + i * 10, 1000 + i * 10, 1000 + i * 10 + 20, 1000 + i * 10 + 20]
            for i in range(n_uncovered)
        ]
    )
    ocr_texts = np.array([f"Uncovered OCR {i}" for i in range(n_uncovered)], dtype=object)
    ocr_sources = np.array([None] * n_uncovered, dtype=object)

    ocr_layout = TextRegions(
        element_coords=ocr_coords,
        texts=ocr_texts,
        sources=ocr_sources,
    )

    # Call merge with supplement enabled
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=True
    )
    result = codeflash_output  # 4.87ms -> 4.84ms (0.619% faster)


def test_merge_with_overlapping_elements():
    """Test merge with multiple overlapping layout and OCR elements."""
    # Create out_layout with overlapping elements
    n_layout = 300
    out_coords = np.array([[i * 5, i * 5, i * 5 + 50, i * 5 + 50] for i in range(n_layout)])
    out_texts = np.array(
        ["Invalid (cid:x)" if i % 3 == 0 else f"Valid {i}" for i in range(n_layout)], dtype=object
    )
    out_sources = np.array([None] * n_layout, dtype=object)
    out_class_ids = np.array([0] * n_layout, dtype=float)

    out_layout = LayoutElements(
        element_coords=out_coords,
        texts=out_texts,
        sources=out_sources,
        element_class_ids=out_class_ids,
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )

    # Create overlapping OCR layout
    n_ocr = 250
    ocr_coords = np.array([[i * 7, i * 7, i * 7 + 40, i * 7 + 40] for i in range(n_ocr)])
    ocr_texts = np.array([f"OCR {i}" for i in range(n_ocr)], dtype=object)
    ocr_sources = np.array([None] * n_ocr, dtype=object)

    ocr_layout = TextRegions(
        element_coords=ocr_coords,
        texts=ocr_texts,
        sources=ocr_sources,
    )

    # Call merge
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 25.2ms -> 18.9ms (33.3% faster)
    # Invalid texts should be replaced or aggregated
    for i, text in enumerate(result.texts):
        pass


def test_merge_boundary_coordinates_large_scale():
    """Test merge with elements at document boundaries at large scale."""
    # Create layout elements distributed across large coordinate space
    n_elements = 400
    out_coords = np.array(
        [
            (
                [0, 0, 100, 100]
                if i == 0  # Top-left corner
                else (
                    [10000, 10000, 10100, 10100]
                    if i == 1  # Bottom-right corner
                    else [i * 50, i * 50, i * 50 + 50, i * 50 + 50]
                )
            )  # Others
            for i in range(n_elements)
        ]
    )
    out_texts = np.array(["(cid:invalid)" for _ in range(n_elements)], dtype=object)
    out_sources = np.array([None] * n_elements, dtype=object)
    out_class_ids = np.array([0] * n_elements, dtype=float)

    out_layout = LayoutElements(
        element_coords=out_coords,
        texts=out_texts,
        sources=out_sources,
        element_class_ids=out_class_ids,
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )

    # Create OCR with matching coordinates
    ocr_coords = out_coords[:300]  # Use same coordinates for subset
    ocr_texts = np.array([f"OCR text {i}" for i in range(300)], dtype=object)
    ocr_sources = np.array([None] * 300, dtype=object)

    ocr_layout = TextRegions(
        element_coords=ocr_coords,
        texts=ocr_texts,
        sources=ocr_sources,
    )

    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 87.3ms -> 61.5ms (41.8% faster)
    # No invalid patterns should remain
    for text in result.texts:
        if text:
            pass


def test_merge_with_varied_subregion_thresholds_many_elements():
    """Test merge with different subregion thresholds on large dataset."""
    n_elements = 300
    out_coords = np.array([[i * 20, i * 20, i * 20 + 100, i * 20 + 100] for i in range(n_elements)])
    out_texts = np.array(["Invalid (cid:test)" for _ in range(n_elements)], dtype=object)
    out_sources = np.array([None] * n_elements, dtype=object)
    out_class_ids = np.array([0] * n_elements, dtype=float)

    out_layout = LayoutElements(
        element_coords=out_coords,
        texts=out_texts,
        sources=out_sources,
        element_class_ids=out_class_ids,
        element_class_id_map={0: ElementType.UNCATEGORIZED_TEXT},
    )

    # OCR layout with small regions inside each out_layout element
    ocr_coords = np.array(
        [[i * 20 + 10, i * 20 + 10, i * 20 + 40, i * 20 + 40] for i in range(n_elements)]
    )
    ocr_texts = np.array([f"Small OCR {i}" for i in range(n_elements)], dtype=object)
    ocr_sources = np.array([None] * n_elements, dtype=object)

    ocr_layout = TextRegions(
        element_coords=ocr_coords,
        texts=ocr_texts,
        sources=ocr_sources,
    )

    # Test with different thresholds
    for threshold in [0.1, 0.5, 0.9]:
        codeflash_output = merge_out_layout_with_ocr_layout(
            out_layout,
            ocr_layout,
            supplement_with_ocr_elements=False,
            subregion_threshold=threshold,
        )
        result = codeflash_output  # 78.6ms -> 57.6ms (36.4% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-merge_out_layout_with_ocr_layout-mkrn264u` and push.

The optimized code achieves a **30% speedup** through three key improvements in `aggregate_embedded_text_by_block` and `supplement_layout_with_ocr_elements`:

## Key Optimizations

### 1. **Replaced `.sum(axis=1).astype(bool)` with `.any(axis=1)`**

This change appears in both functions when computing boolean masks from the result of `bboxes1_is_almost_subregion_of_bboxes2()`:

**Why it's faster:**
- `.sum(axis=1)` creates an intermediate integer array by counting True values across columns, then converts it to boolean
- `.any(axis=1)` can short-circuit on the first True value in a row, avoiding the full summation
- Eliminates the explicit `.astype(bool)` conversion overhead

**Performance impact:** Based on line profiler, the mask computation in `aggregate_embedded_text_by_block` dropped from ~234ms to ~222ms (5% faster), and the overall function improved from 551ms to 443ms (19.6% faster).

### 2. **Avoided redundant slicing operations**

In `aggregate_embedded_text_by_block`, the optimized code stores `sliced = source_regions.slice(mask)` once and reuses it, instead of calling `source_regions.slice(mask)` three separate times:

**Why it's faster:**
- Each `slice()` operation creates a new object with coordinate and text array copies
- Line profiler shows the original made 3 separate slice calls (48ms + 25ms + 34ms = 107ms total)
- The optimized version makes 1 slice call (~28ms), saving ~79ms in the profiled run

### 3. **Early exit with `mask.any()`**

The optimized code checks `if mask.any():` before processing, avoiding unnecessary work when no regions match:

**Why it's faster:**
- Skips text joining, bbox extraction, and IOU calculations when the mask is empty
- Particularly beneficial for the 368 cases (31% of calls) where no matching regions exist

## Impact Based on Test Results

The optimization is particularly effective for workloads with:

1. **Many elements requiring text aggregation** (10-41% speedup on tests with 100-500 elements)
   - `test_large_scale_many_elements_aggregated`: 77ms → 67.2ms (14.6% faster)
   - `test_merge_large_number_of_elements`: 43.8ms → 31.0ms (41.3% faster)
   - `test_merge_boundary_coordinates_large_scale`: 87.3ms → 61.5ms (41.8% faster)
2. **Documents with invalid text patterns** (10-20% speedup)
   - `test_invalid_texts_are_replaced`: 250μs → 222μs (12.4% faster)
   - `test_merge_with_all_invalid_text`: 654μs → 554μs (18.1% faster)
3. **Complex spatial matching scenarios** (33-36% speedup)
   - `test_merge_with_overlapping_elements`: 25.2ms → 18.9ms (33.3% faster)
   - `test_merge_with_varied_subregion_thresholds_many_elements`: 78.6ms → 57.6ms (36.4% faster)

## Context Impact

The function `merge_out_layout_with_ocr_layout` is called from `supplement_page_layout_with_ocr` in OCR processing hot paths, specifically when `ocr_mode == OCRMode.FULL_PAGE`. This function is invoked once per processed page, so the 30% speedup translates directly into higher document processing throughput for PDF/image partitioning workflows.
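The three changes described above can be illustrated with a minimal NumPy-only sketch. This is not the code from the PR: the helper names `mask_via_sum`, `mask_via_any`, and `join_matching_texts` are hypothetical, and a plain object array stands in for the `TextRegions` object that the real code slices with `slice(mask)`. The boolean matrix is random data that plays the role of the output of `bboxes1_is_almost_subregion_of_bboxes2()`.

```python
import numpy as np


def mask_via_sum(is_subregion: np.ndarray) -> np.ndarray:
    """Original pattern: count True values per row, then cast the counts to bool."""
    return is_subregion.sum(axis=1).astype(bool)


def mask_via_any(is_subregion: np.ndarray) -> np.ndarray:
    """Optimized pattern: reduce each row with a logical OR, yielding bool directly."""
    return is_subregion.any(axis=1)


def join_matching_texts(texts: np.ndarray, mask: np.ndarray) -> str:
    """Hypothetical stand-in for the aggregation step.

    Shows the early exit on an empty mask and a single selection that is
    computed once and then reused.
    """
    if not mask.any():
        # Early exit: nothing matched, so skip selection and joining entirely.
        return ""
    selected = texts[mask]  # select once; reuse `selected` for any later steps
    return " ".join(selected)


# Random boolean matrix standing in for the subregion test result (assumption).
rng = np.random.default_rng(0)
is_subregion = rng.random((300, 250)) > 0.99

# Both mask computations agree element-wise.
same = np.array_equal(mask_via_sum(is_subregion), mask_via_any(is_subregion))

texts = np.array(["alpha", "beta", "gamma"], dtype=object)
joined = join_matching_texts(texts, np.array([True, False, True]))
empty = join_matching_texts(texts, np.array([False, False, False]))
```

Because the two mask functions return identical arrays, the swap is behavior-preserving; any timing difference depends on array shape and NumPy version, so it is worth measuring on representative inputs rather than assuming the figures quoted above carry over.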

…_layout-mkrn264u

codeflash-ai bot and others added 3 commits January 24, 2026 01:36

changelog version update

091726c

Merge branch 'main' into codeflash/optimize-merge_out_layout_with_ocr…

0fd9e0a

…_layout-mkrn264u


⚡️ Speed up function `merge_out_layout_with_ocr_layout` by 30% #4212

aseembits93 wants to merge 3 commits into Unstructured-IO:main from codeflash-ai:codeflash/optimize-merge_out_layout_with_ocr_layout-mkrn264u
