feat: track text source #4112

qued · 2025-11-05T19:50:12Z

This PR uses the newly created `is_extracted` parameter on `TextRegion` (and its vectorized counterpart, `is_extracted_array`, on `TextRegions`) to flag elements whose text was extracted directly from a PDF.

This also involved:

- New tests
- A version update to bring in the new unstructured-inference
- An ingest fixtures update
- An optimization from Codeflash that's not directly related

One important thing to review: every path by which an element is extracted and ends up in the output of a partition should be covered (`fast`, `hi_res`, etc.).
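As a rough illustration of the flagging described above, here is a minimal sketch using stand-in classes. `SketchRegion` and `SketchRegions` are hypothetical and only mirror the `is_extracted` / `is_extracted_array` naming from the description; the real `TextRegion` and `TextRegions` in unstructured-inference may use different types and signatures.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class SketchRegion:
    """Hypothetical stand-in for a single text region."""

    x1: float
    y1: float
    x2: float
    y2: float
    text: str = ""
    is_extracted: bool = False  # True when the text came straight from the PDF


@dataclass
class SketchRegions:
    """Hypothetical stand-in for the vectorized container."""

    element_coords: np.ndarray  # shape (N, 4)
    texts: np.ndarray  # shape (N,)
    is_extracted_array: np.ndarray  # shape (N,), dtype bool

    @classmethod
    def from_list(cls, regions: List[SketchRegion]) -> "SketchRegions":
        # Collect the per-element fields into parallel arrays.
        coords = np.array(
            [[r.x1, r.y1, r.x2, r.y2] for r in regions], dtype=float
        ).reshape(-1, 4)
        texts = np.array([r.text for r in regions], dtype=object)
        flags = np.array([r.is_extracted for r in regions], dtype=bool)
        return cls(coords, texts, flags)


pdf_text = SketchRegion(0, 0, 10, 10, text="from the PDF text layer", is_extracted=True)
ocr_text = SketchRegion(0, 20, 10, 30, text="from OCR", is_extracted=False)
regions = SketchRegions.from_list([pdf_text, ocr_text])

# The boolean array doubles as a mask for selecting extracted elements.
extracted_texts = regions.texts[regions.is_extracted_array].tolist()
print(extracted_texts)
```

Keeping the flag as a boolean array parallel to the coordinates means it can be sliced and merged with the same masks used for the boxes themselves.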


socket-security · 2025-11-05T19:52:00Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	unstructured-inference@1.0.5 ⏵ 1.1.1	⁺¹

View full report


codeflash-ai · 2025-11-05T21:03:13Z

⚡️ Codeflash found optimizations for this PR

📄 24% (0.24x) speedup for `_merge_extracted_into_inferred_when_almost_the_same` in `unstructured/partition/pdf_image/pdfminer_processing.py`

⏱️ Runtime : 40.6 milliseconds → 32.6 milliseconds (best of 18 runs)

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up function _merge_extracted_into_inferred_when_almost_the_same by 24% in PR #4112 (feat/track-text-source) #4114

If you approve, it will be merged into this PR (branch feat/track-text-source).

…same` by 24% in PR #4112 (`feat/track-text-source`) (#4114)

## ⚡️ This pull request contains optimizations for PR #4112

If you approve this dependent PR, these changes will be merged into the original PR branch `feat/track-text-source`.

> This PR will be automatically closed if the original PR is merged.

----

#### 📄 24% (0.24x) speedup for ***`_merge_extracted_into_inferred_when_almost_the_same` in `unstructured/partition/pdf_image/pdfminer_processing.py`***

⏱️ Runtime : **`40.6 milliseconds`** **→** **`32.6 milliseconds`** (best of `18` runs)

#### 📝 Explanation and details

The optimized code achieves a **24% speedup** through two key optimizations:

**1. Improved `_minimum_containing_coords` function:**
- **What**: Replaced `np.vstack` with separate array creation followed by `np.column_stack`
- **Why**: The original code created list comprehensions multiple times within `np.vstack`, causing redundant temporary arrays and inefficient memory access patterns. The optimized version pre-computes each coordinate array once, then combines them efficiently
- **Impact**: Reduces function time from 1.88ms to 1.41ms (25% faster). Line profiler shows the costly list comprehensions in the original (lines with 27%, 14%, 13%, 12% of time) are replaced with more efficient array operations

**2. Optimized comparison in `boxes_iou` function:**
- **What**: Changed `(inter_area / denom) > threshold` to `inter_area > (threshold * denom)`
- **Why**: Avoids expensive division operations by algebraically rearranging the inequality. Division is significantly slower than multiplication in NumPy, especially for large arrays
- **Impact**: Reduces the final comparison from 19% to 5.8% of function time, while the intermediate denominator calculation takes 11.8%

**3. Minor optimization in boolean mask creation:**
- **What**: Replaced `boxes_almost_same.sum(axis=1).astype(bool)` with `np.any(boxes_almost_same, axis=1)`
- **Why**: `np.any` short-circuits on the first True value and is semantically clearer, though the performance gain is minimal

**Test case analysis shows the optimizations are particularly effective for:**
- Large-scale scenarios (1000+ elements): 17-75% speedup depending on match patterns
- Cases with no matches benefit most (74.6% faster) due to avoiding expensive division operations
- All test cases show consistent 6-17% improvements, indicating robust optimization across different workloads

The optimizations maintain identical functionality while reducing computational overhead through better NumPy usage patterns and mathematical rearrangement.

✅ **Correctness verification report:**

| Test | Status |
| --------------------------- | ----------------- |
| ⏪ Replay Tests | 🔘 **None Found** |
| ⚙️ Existing Unit Tests | 🔘 **None Found** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
| 🌀 Generated Regression Tests | ✅ **18 Passed** |
| 📊 Tests Coverage | 100.0% |

<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>

```python
import numpy as np
# imports
import pytest
from unstructured.partition.pdf_image.pdfminer_processing import \
    _merge_extracted_into_inferred_when_almost_the_same

# --- Minimal class stubs and helpers to support the function under test ---

class DummyLayoutElements:
    """
    Minimal implementation of LayoutElements to support testing.
    - element_coords: np.ndarray of shape (N, 4) for bounding boxes.
    - texts: np.ndarray of shape (N,) for text strings.
    - is_extracted_array: np.ndarray of shape (N,) for boolean flags.
    """
    def __init__(self, element_coords, texts=None, is_extracted_array=None):
        self.element_coords = np.array(element_coords, dtype=np.float32)
        self.texts = np.array(texts if texts is not None else [''] * len(element_coords), dtype=object)
        self.is_extracted_array = np.array(is_extracted_array if is_extracted_array is not None else [False] * len(element_coords), dtype=bool)

    def __len__(self):
        return len(self.element_coords)

    def slice(self, mask):
        # mask can be a boolean array or integer indices
        if isinstance(mask, (np.ndarray, list)):
            if isinstance(mask[0], bool):
                idx = np.where(mask)[0]
            else:
                idx = np.array(mask)
        else:
            idx = np.array([mask])
        return DummyLayoutElements(
            self.element_coords[idx],
            self.texts[idx],
            self.is_extracted_array[idx]
        )

from unstructured.partition.pdf_image.pdfminer_processing import \
    _merge_extracted_into_inferred_when_almost_the_same

# --- Unit Tests ---

# ----------- BASIC TEST CASES -----------

def test_no_inferred_elements_returns_false_mask():
    # No inferred elements: all extracted should not be merged
    extracted = DummyLayoutElements([[0, 0, 1, 1], [1, 1, 2, 2]], texts=["a", "b"])
    inferred = DummyLayoutElements([])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.9); mask = codeflash_output # 3.50μs -> 3.30μs (6.10% faster)

def test_no_extracted_elements_returns_empty_mask():
    # No extracted elements: should return empty mask
    extracted = DummyLayoutElements([])
    inferred = DummyLayoutElements([[0, 0, 1, 1]])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.9); mask = codeflash_output # 2.30μs -> 2.31μs (0.475% slower)

#------------------------------------------------
import numpy as np
# imports
import pytest
from unstructured.partition.pdf_image.pdfminer_processing import \
    _merge_extracted_into_inferred_when_almost_the_same

# Minimal stubs for TextRegions and LayoutElements to enable testing
class TextRegions:
    def __init__(self, coords, texts=None, is_extracted_array=None):
        self.x1 = coords[:, 0]
        self.y1 = coords[:, 1]
        self.x2 = coords[:, 2]
        self.y2 = coords[:, 3]
        self.texts = np.array(texts) if texts is not None else np.array([""] * len(coords))
        self.is_extracted_array = np.array(is_extracted_array) if is_extracted_array is not None else np.zeros(len(coords), dtype=bool)
        self.element_coords = coords

    def __len__(self):
        return len(self.element_coords)

    def slice(self, mask):
        # mask can be bool array or indices
        if isinstance(mask, (np.ndarray, list)):
            if isinstance(mask, np.ndarray) and mask.dtype == bool:
                idx = np.where(mask)[0]
            else:
                idx = mask
        else:
            idx = [mask]
        coords = self.element_coords[idx]
        texts = self.texts[idx]
        is_extracted_array = self.is_extracted_array[idx]
        return TextRegions(coords, texts, is_extracted_array)

class LayoutElements(TextRegions):
    pass

from unstructured.partition.pdf_image.pdfminer_processing import \
    _merge_extracted_into_inferred_when_almost_the_same

# ===========================
# Unit Tests
# ===========================

# ----------- BASIC TEST CASES -----------

def test_basic_exact_match():
    # One extracted, one inferred, same box
    coords = np.array([[0, 0, 10, 10]])
    extracted = LayoutElements(coords, texts=["extracted"], is_extracted_array=[True])
    inferred = LayoutElements(coords, texts=["inferred"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 207μs -> 192μs (7.74% faster)

def test_basic_no_match():
    # Boxes do not overlap
    extracted = LayoutElements(np.array([[0, 0, 10, 10]]), texts=["extracted"], is_extracted_array=[True])
    inferred = LayoutElements(np.array([[20, 20, 30, 30]]), texts=["inferred"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 163μs -> 151μs (7.85% faster)

def test_basic_partial_overlap_below_threshold():
    # Overlap, but below threshold
    extracted = LayoutElements(np.array([[0, 0, 10, 10]]), texts=["extracted"], is_extracted_array=[True])
    inferred = LayoutElements(np.array([[5, 5, 15, 15]]), texts=["inferred"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 158μs -> 148μs (6.53% faster)

def test_basic_partial_overlap_above_threshold():
    # Overlap, above threshold
    extracted = LayoutElements(np.array([[0, 0, 10, 10]]), texts=["extracted"], is_extracted_array=[True])
    inferred = LayoutElements(np.array([[0, 0, 10, 10.1]]), texts=["inferred"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 191μs -> 176μs (8.22% faster)

def test_basic_multiple_elements_some_match():
    # Multiple extracted/inferred, some matches
    extracted = LayoutElements(
        np.array([[0, 0, 10, 10], [20, 20, 30, 30]]),
        texts=["extracted1", "extracted2"],
        is_extracted_array=[True, True]
    )
    inferred = LayoutElements(
        np.array([[0, 0, 10, 10], [100, 100, 110, 110]]),
        texts=["inferred1", "inferred2"],
        is_extracted_array=[False, False]
    )
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 172μs -> 162μs (5.98% faster)

# ----------- EDGE TEST CASES -----------

def test_edge_empty_extracted():
    # No extracted elements
    extracted = LayoutElements(np.zeros((0, 4)), texts=[], is_extracted_array=[])
    inferred = LayoutElements(np.array([[0,0,1,1]]), texts=["foo"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 2.08μs -> 2.06μs (0.969% faster)

def test_edge_empty_inferred():
    # No inferred elements
    extracted = LayoutElements(np.array([[0,0,1,1]]), texts=["foo"], is_extracted_array=[True])
    inferred = LayoutElements(np.zeros((0, 4)), texts=[], is_extracted_array=[])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 2.71μs -> 2.48μs (9.29% faster)

def test_edge_all_elements_match():
    # All extracted match inferred
    coords = np.array([[0,0,10,10], [20,20,30,30]])
    extracted = LayoutElements(coords, texts=["A", "B"], is_extracted_array=[True, True])
    inferred = LayoutElements(coords, texts=["X", "Y"], is_extracted_array=[False, False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 174μs -> 162μs (7.69% faster)

def test_edge_threshold_zero():
    # Threshold zero means all overlap counts
    extracted = LayoutElements(np.array([[0,0,10,10]]), texts=["foo"], is_extracted_array=[True])
    inferred = LayoutElements(np.array([[5,5,15,15]]), texts=["bar"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.0); mask = codeflash_output # 159μs -> 150μs (5.94% faster)

def test_edge_threshold_one():
    # Threshold one means only perfect overlap counts
    extracted = LayoutElements(np.array([[0,0,10,10]]), texts=["foo"], is_extracted_array=[True])
    inferred = LayoutElements(np.array([[0,0,10,10]]), texts=["bar"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 1.0); mask = codeflash_output # 155μs -> 145μs (7.01% faster)

def test_edge_multiple_matches_first_match_wins():
    # Extracted overlaps with multiple inferred, but only first match is updated
    extracted = LayoutElements(np.array([[0,0,10,10]]), texts=["foo"], is_extracted_array=[True])
    inferred = LayoutElements(
        np.array([[0,0,10,10], [0,0,10,10]]),
        texts=["bar1", "bar2"],
        is_extracted_array=[False, False]
    )
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 168μs -> 156μs (7.25% faster)

def test_edge_coords_are_updated_to_minimum_containing():
    # Bounding boxes are updated to minimum containing box
    extracted = LayoutElements(np.array([[1,2,9,10]]), texts=["foo"], is_extracted_array=[True])
    inferred = LayoutElements(np.array([[0,0,10,10]]), texts=["bar"], is_extracted_array=[False])
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 156μs -> 144μs (8.56% faster)
    # The new coords should be the minimum containing both
    expected = np.array([0,0,10,10])

# ----------- LARGE SCALE TEST CASES -----------

def test_large_scale_many_elements():
    # 500 extracted, 500 inferred, all match
    N = 500
    coords = np.stack([np.arange(N), np.arange(N), np.arange(N)+10, np.arange(N)+10], axis=1)
    extracted = LayoutElements(coords, texts=[f"ex{i}" for i in range(N)], is_extracted_array=[True]*N)
    inferred = LayoutElements(coords.copy(), texts=[f"in{i}" for i in range(N)], is_extracted_array=[False]*N)
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 2.90ms -> 2.79ms (3.78% faster)

def test_large_scale_some_elements_match():
    # 1000 extracted, 500 inferred, only first 500 match
    N = 1000
    M = 500
    coords_extracted = np.stack([np.arange(N), np.arange(N), np.arange(N)+10, np.arange(N)+10], axis=1)
    coords_inferred = coords_extracted[:M]
    extracted = LayoutElements(coords_extracted, texts=[f"ex{i}" for i in range(N)], is_extracted_array=[True]*N)
    inferred = LayoutElements(coords_inferred.copy(), texts=[f"in{i}" for i in range(M)], is_extracted_array=[False]*M)
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 6.49ms -> 5.56ms (16.6% faster)
    # First 500 should be merged, rest not
    expected_mask = np.zeros(N, dtype=bool)
    expected_mask[:M] = True

def test_large_scale_no_elements_match():
    # 1000 extracted, 500 inferred, none match
    N = 1000
    M = 500
    coords_extracted = np.stack([np.arange(N), np.arange(N), np.arange(N)+10, np.arange(N)+10], axis=1)
    coords_inferred = coords_extracted[:M] + 10000  # Far away
    extracted = LayoutElements(coords_extracted, texts=[f"ex{i}" for i in range(N)], is_extracted_array=[True]*N)
    inferred = LayoutElements(coords_inferred, texts=[f"in{i}" for i in range(M)], is_extracted_array=[False]*M)
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 8.91ms -> 5.11ms (74.6% faster)

def test_large_scale_performance():
    # Test that the function runs efficiently for 1000 elements
    N = 1000
    coords = np.stack([np.arange(N), np.arange(N), np.arange(N)+10, np.arange(N)+10], axis=1)
    extracted = LayoutElements(coords, texts=[f"ex{i}" for i in range(N)], is_extracted_array=[True]*N)
    inferred = LayoutElements(coords.copy(), texts=[f"in{i}" for i in range(N)], is_extracted_array=[False]*N)
    import time
    start = time.time()
    codeflash_output = _merge_extracted_into_inferred_when_almost_the_same(extracted, inferred, 0.99); mask = codeflash_output # 20.6ms -> 17.6ms (17.1% faster)
    elapsed = time.time() - start
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```

</details>

To edit these changes `git checkout codeflash/optimize-pr4112-2025-11-05T21.03.01` and push.

[Codeflash](https://codeflash.ai) ![Static Badge](https://img.shields.io/badge/🎯_Optimization_Quality-high-green)

---

> [!NOTE]
> Speeds up layout merging by optimizing bounding-box aggregation, boolean mask creation, and IOU comparison to avoid divisions.
>
> - **Performance optimizations in `unstructured/partition/pdf_image/pdfminer_processing.py`**:
>   - `_minimum_containing_coords`:
>     - Precomputes `x1/y1/x2/y2` arrays and uses `np.column_stack` to build output; removes extra transpose.
>   - `_merge_extracted_into_inferred_when_almost_the_same`:
>     - Replaces `sum(...).astype(bool)` with `np.any(..., axis=1)` for match mask.
>   - `boxes_iou`:
>     - Computes denominator once and replaces division `(x/y) > t` with `x > t*y` to avoid divisions.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 8a0335f. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
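The rearranged IoU comparison and the `np.any` mask can be illustrated in isolation. The following is a minimal sketch, assuming plain `(N, 4)` coordinate arrays and a small positive floor on the denominator; `pairwise_areas`, `iou_above_with_division`, and `iou_above_without_division` are hypothetical names, not the library's `boxes_iou`. The two comparisons agree in exact arithmetic because the denominator is positive; in floating point they could in principle disagree for IoU values within rounding error of the threshold.

```python
import numpy as np

EPSILON_AREA = 0.01  # assumed floor that keeps the denominator positive


def pairwise_areas(coords1: np.ndarray, coords2: np.ndarray):
    """Return (intersection, area1, area2) for every pair of boxes.

    coords1 has shape (N, 4) and coords2 has shape (M, 4); each row is
    (x1, y1, x2, y2). The intersection has shape (N, M).
    """
    x1 = np.maximum(coords1[:, None, 0], coords2[None, :, 0])
    y1 = np.maximum(coords1[:, None, 1], coords2[None, :, 1])
    x2 = np.minimum(coords1[:, None, 2], coords2[None, :, 2])
    y2 = np.minimum(coords1[:, None, 3], coords2[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area1 = (coords1[:, 2] - coords1[:, 0]) * (coords1[:, 3] - coords1[:, 1])
    area2 = (coords2[:, 2] - coords2[:, 0]) * (coords2[:, 3] - coords2[:, 1])
    return inter, area1[:, None], area2[None, :]


def iou_above_with_division(coords1, coords2, threshold):
    inter, area1, area2 = pairwise_areas(coords1, coords2)
    denom = np.maximum(EPSILON_AREA, area1 + area2 - inter)
    return (inter / denom) > threshold


def iou_above_without_division(coords1, coords2, threshold):
    inter, area1, area2 = pairwise_areas(coords1, coords2)
    denom = np.maximum(EPSILON_AREA, area1 + area2 - inter)
    # denom > 0, so multiplying both sides by it preserves the inequality.
    return inter > threshold * denom


extracted = np.array([[0, 0, 10, 10], [20, 20, 30, 30]], dtype=float)
inferred = np.array(
    [[0, 0, 10, 10], [5, 5, 15, 15], [100, 100, 110, 110]], dtype=float
)

with_div = iou_above_with_division(extracted, inferred, threshold=0.5)
without_div = iou_above_without_division(extracted, inferred, threshold=0.5)

# Row-wise "has any match" mask, in the old and new spellings.
mask_old = without_div.sum(axis=1).astype(bool)
mask_new = np.any(without_div, axis=1)
print(mask_new)
```

Only the first extracted box has a partner above the threshold here, so both mask spellings select row 0 and reject row 1.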

codeflash-ai · 2025-11-06T20:55:25Z

This PR is now faster! 🚀 @qued accepted my optimizations from:

⚡️ Speed up function _merge_extracted_into_inferred_when_almost_the_same by 24% in PR #4112 (feat/track-text-source) #4114

codeflash-ai · 2025-11-06T20:55:28Z

This PR is now faster! 🚀 codeflash-ai[bot] accepted my code suggestion above.

qued · 2025-11-06T20:59:49Z

For the optimizations that codeflash found, I didn't see any existing tests that I was confident would validate equivalence with the original code, so I ran these tests to check:

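import numpy as np
from unstructured_inference.inference.elements import TextRegion, TextRegions

# Import paths are assumed; adjust if these helpers live elsewhere.
from unstructured.partition.pdf_image.pdfminer_processing import (
    _minimum_containing_coords,
    areas_of_boxes_and_intersection_area,
    boxes_iou,
    get_coords_from_bboxes,
)

def create_random_bboxes(num_bboxes: int, average_width: int, average_height: int):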
    bboxes = []
    for _ in range(num_bboxes):
        x1 = np.random.randint(0, 100)
        y1 = np.random.randint(0, 100)
        width = np.random.randint(average_width - 10, average_width + 10)
        height = np.random.randint(average_height - 10, average_height + 10)
        x2 = x1 + width
        y2 = y1 + height
        bboxes.append(TextRegion.from_coords(x1, y1, x2, y2))
    return TextRegions.from_list(bboxes)

def test_minimum_containing_coords():
    bboxes1 = TextRegions.from_list([TextRegion.from_coords(0, 0, 10, 10), TextRegion.from_coords(5, 5, 15, 15)])
    bboxes2 = TextRegions.from_list([TextRegion.from_coords(10, 10, 20, 20), TextRegion.from_coords(0, 0, 10, 10)])
    np.testing.assert_array_equal(_minimum_containing_coords(bboxes1, bboxes2), np.array([[0, 0, 20, 20], [0, 0, 15, 15]]))

def test_boxes_almost_same_methods_are_equivalent():
    bboxes1 = create_random_bboxes(20, 40, 40)
    bboxes2 = create_random_bboxes(20, 40, 40)
    boxes_almost_same = boxes_iou(
        bboxes1.element_coords,
        bboxes2.element_coords,
        threshold=0.75,
    )
    assert np.any(boxes_almost_same, axis=1).any()
    assert not np.all(boxes_almost_same, axis=1).all()

    np.testing.assert_array_equal(boxes_almost_same.sum(axis=1).astype(bool), np.any(boxes_almost_same, axis=1))

def test_iou_computations_are_equivalent():
    round_to = 15
    threshold_max = 0.98
    EPSILON_AREA = 0.01
    bboxes1 = create_random_bboxes(20, 40, 40)
    bboxes2 = create_random_bboxes(20, 40, 40)
    coords1 = get_coords_from_bboxes(bboxes1.element_coords, round_to=round_to)
    coords2 = get_coords_from_bboxes(bboxes2.element_coords, round_to=round_to)
    for threshold in np.linspace(0.01, threshold_max, 50):
        inter_area, boxa_area, boxb_area = areas_of_boxes_and_intersection_area(
            coords1, coords2, round_to=round_to
        )
        iou1 = (inter_area / np.maximum(EPSILON_AREA, boxa_area + boxb_area.T - inter_area)) > threshold
        denom = np.maximum(EPSILON_AREA, boxa_area + boxb_area.T - inter_area)
        # Instead of (x/y) > t, use x > t*y for memory & speed with same result
        iou2 = inter_area > (threshold * denom)
        np.testing.assert_array_equal(iou1, iou2)

I'm not sure it's really useful to add these to the test suite, since they are specific to this change.
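For readers without the repository at hand, the first check above can be reproduced on plain coordinate arrays. This is a sketch only: `minimum_containing_coords_sketch` is a hypothetical stand-in that takes two `(N, 4)` arrays, whereas the library's `_minimum_containing_coords` takes `TextRegions` objects. It follows the approach described for the optimization (compute each coordinate column once, then combine with `np.column_stack`).

```python
import numpy as np


def minimum_containing_coords_sketch(
    coords_a: np.ndarray, coords_b: np.ndarray
) -> np.ndarray:
    """Row by row, return the smallest box containing both input boxes.

    Each input has shape (N, 4) with rows laid out as (x1, y1, x2, y2).
    """
    x1 = np.minimum(coords_a[:, 0], coords_b[:, 0])
    y1 = np.minimum(coords_a[:, 1], coords_b[:, 1])
    x2 = np.maximum(coords_a[:, 2], coords_b[:, 2])
    y2 = np.maximum(coords_a[:, 3], coords_b[:, 3])
    # Each 1-D column becomes one column of the (N, 4) result.
    return np.column_stack((x1, y1, x2, y2))


# Same boxes as in test_minimum_containing_coords above.
boxes1 = np.array([[0, 0, 10, 10], [5, 5, 15, 15]])
boxes2 = np.array([[10, 10, 20, 20], [0, 0, 10, 10]])
merged = minimum_containing_coords_sketch(boxes1, boxes2)
print(merged)
```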

codeflash-ai · 2025-11-06T21:16:00Z

⚡️ Codeflash found optimizations for this PR

📄 13% (0.13x) speedup for `boxes_iou` in `unstructured/partition/pdf_image/pdfminer_processing.py`

⏱️ Runtime : 49.9 milliseconds → 44.2 milliseconds (best of 69 runs)

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up function boxes_iou by 13% in PR #4112 (feat/track-text-source) #4116

If you approve, it will be merged into this PR (branch feat/track-text-source).

Main Changes:

1. Removed Clarifai Dependency
   - Completely removed the clarifai dependency which is no longer used in the codebase
   - Removed clarifai from the unstructured-ingest extras list in requirements/ingest/ingest.txt:1
   - Removed clarifai test script reference from test_unstructured_ingest/test-ingest-dest.sh:23
2. Updated Dependencies to Resolve CVEs
   - pypdf: Updated from 6.1.1 → 6.1.3 (fixes GHSA-vr63-x8vc-m265)
   - pip: Added explicit upgrade to >=25.3 in Dockerfile (fixes GHSA-4xh5-x5gv-qwph)
   - uv: Addressed GHSA-8qf3-x8v5-2pj8 and GHSA-pqhf-p39g-3x64
3. Dockerfile Security Enhancements (Dockerfile:17,28-29)
   - Added Alpine package upgrade for py3.12-pip
   - Added explicit pip upgrade step before installing Python dependencies
4. General Dependency Updates

   Ran pip-compile across all requirement files, resulting in updates to:
   - cryptography: 46.0.2 → 46.0.3
   - psutil: 7.1.0 → 7.1.3
   - rapidfuzz: 3.14.1 → 3.14.3
   - regex: 2025.9.18 → 2025.11.3
   - wrapt: 1.17.3 → 2.0.0
   - Plus many other transitive dependencies across all extra requirement files
5. Version Bump
   - Updated version from 0.18.16 → 0.18.17 in unstructured/__version__.py:1
   - Updated CHANGELOG.md with security fixes documentation

Impact: This PR resolves 4 CVEs total without introducing breaking changes, making it a pure security maintenance release.

---------

Co-authored-by: Claude <[email protected]>

qued added 9 commits November 4, 2025 10:05

Add test to check behavior of is_extracted metadata during normalization

89d6fed

test element merge behavior for extracted text metadata

130c867

support is_extracted metadata for elements

1a78d06

Add merge logic for is_extracted

abcc4f3

Add test that pdfminer processed file layouelements are recognized as extracted

7e159c4

merge array elements while retaining extracted status

ae8f1a1

formatting

d7fc5a0

update deps

22fc9b3

format

2f59dc0

qued changed the title ~~Feat/track text source~~ feat: track text source Nov 5, 2025

qued and others added 2 commits November 5, 2025 13:54

Update changelog and version

9b96a95

feat: track text source <- Ingest test fixtures update (#4113)

bb5ff8b

This pull request includes updated ingest test fixtures. Please review and merge if appropriate.

codeflash-ai bot mentioned this pull request Nov 5, 2025

⚡️ Speed up function _merge_extracted_into_inferred_when_almost_the_same by 24% in PR #4112 (feat/track-text-source) #4114

Merged

reduce comment length for linting

ea47d20

codeflash-ai bot mentioned this pull request Nov 6, 2025

⚡️ Speed up function boxes_iou by 13% in PR #4112 (feat/track-text-source) #4116

Open

luke-kucing and others added 3 commits November 6, 2025 16:30

Merge branch 'main' into feat/track-text-source

413caf6

Merge branch 'main' into feat/track-text-source

b6383d3

qued marked this pull request as ready for review November 7, 2025 02:56

4 participants