
Comparing changes

base repository: Unstructured-IO/unstructured
base: 0.17.2
head repository: Unstructured-IO/unstructured
compare: main
  • 14 commits
  • 117 files changed
  • 13 contributors

Commits on Mar 21, 2025

  1. Matches prefix to verify presence of DOCX, PPTX, XLSX files instead of standard file names (#3959)
    
    Instead of looking for the presence of `word/document.xml`,
    `ppt/presentation.xml` and `xl/workbook.xml` to identify DOCX, PPTX and
    XLSX files, we look for the prefixes `word/document*.xml`,
    `ppt/presentation*.xml` and `xl/workbook*.xml`, because some files
    generated by Office 365 contain members with different names.
    Fixes #3937
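
A minimal sketch of the prefix check described above, assuming the file is inspected as a ZIP archive. The function name `detect_office_type` and the `_PREFIXES` mapping are illustrative and are not taken from the repository.

```python
import io
import zipfile
from typing import Optional

# Prefixes come from the commit message; the mapping itself is an assumption.
_PREFIXES = {
    "docx": "word/document",
    "pptx": "ppt/presentation",
    "xlsx": "xl/workbook",
}


def detect_office_type(file_bytes: bytes) -> Optional[str]:
    """Return "docx", "pptx" or "xlsx" if a matching archive member exists."""
    if not zipfile.is_zipfile(io.BytesIO(file_bytes)):
        return None
    with zipfile.ZipFile(io.BytesIO(file_bytes)) as archive:
        names = archive.namelist()
    for file_type, prefix in _PREFIXES.items():
        # Accept "word/document.xml" as well as variants such as "word/document2.xml".
        if any(n.startswith(prefix) and n.endswith(".xml") for n in names):
            return file_type
    return None
```

An exact-name check would reject an archive whose main part is named `word/document2.xml`; the prefix check accepts it.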
    
    ---------
    
    Co-authored-by: Yao You <[email protected]>
    srisudarsan and badGarnet authored Mar 21, 2025
    3497281

Commits on Mar 25, 2025

  1. manual trigger of workflows to publish new image and new vers tag in quay (#3965)
    
    There were some open CVEs in the base image. Those are now resolved, so
    this triggers a workflow with an updated version tag.
    luke-kucing authored Mar 25, 2025
    347a4e5

Commits on Mar 26, 2025

  1. chore: deprecate stage_for_label_studio (#3968)

    This PR is to address [a
    CVE](GHSA-rgv9-w7jp-m23g) that appeared in
    a recent scan.
    
    The CVE has to do with the package `label_studio_sdk`. This relates to
    the tool Label Studio, a data labeling platform. We built a staging
    function that takes a list of elements and converts it to a format
    suitable for passing to the LabelStudio platform.
    
    We don't use the package with the vulnerability in the actual function,
    we only use it to test the output of the function against the Label
    Studio API schema.
    
    Even the test where we use it is sort of questionable in value, since
    it's really testing the schema against an old version of the LabelStudio
    API (we are testing against a recording of the Label Studio API's
    responses stored using `vcrpy`).
    
    Label Studio has fixed the vulnerability as of version 1.0.10 of their
    SDK, but we're stuck on 1.0.5 because 1.0.6 and above require
    `numpy<2.0.0`.
    
    This leaves us with several choices of resolution, some of which are:
    1. Downgrade `numpy` to upgrade `label_studio_sdk` to >=1.0.10 to
    resolve the CVE
    2. Drop `label_studio_sdk` by either removing or rewriting the test.
    3. Drop test and dev dependencies from the `unstructured` image.
    
    We've decided to do 2. _and_ 3. This PR handles 2., with 3. to be a
    follow-on PR.
    
    Here we add a deprecation notice to `stage_for_label_studio` and remove
    the offending test. Normally good practice would be to add a warning of
    future deprecation to the function for a reasonable amount of time, but
    in order to address the CVE immediately, we're deprecating it right
    away.
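
As a rough illustration of what an immediate deprecation notice can look like, here is a sketch using the standard `warnings` module. The function `stage_for_label_studio_sketch` is a hypothetical stand-in; the real function's signature, body, and warning text are not shown in this commit message.

```python
import warnings


def stage_for_label_studio_sketch(elements):
    """Hypothetical stand-in for the deprecated staging function."""
    warnings.warn(
        "stage_for_label_studio is deprecated and may be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this function
    )
    # Placeholder body: the real conversion to Label Studio format is omitted.
    return list(elements)
```

The function keeps working for existing callers while the warning is emitted on each call.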
    
    ### Testing
    Install the dependencies (`make install`) into a fresh environment, and
    `pip list | grep label` should have no results. The scan artifact in CI
    should contain no "high" or "critical" CVEs.
    qued authored Mar 26, 2025
    3f07840

Commits on Mar 27, 2025

  1. build: remove test and dev deps from docker image (#3969)

    Removed the dependencies contained in `test.txt`, `dev.txt`, and
    `constraints.txt` from the things that get installed in the docker
    image. In order to keep testing the image (running the tests), I added a
    step to the `docker-test` make target to install `test.txt` and
    `dev.txt`. Thus we presumably get a smaller image (probably not much
    smaller), reduce the dependency chain of our images, and have less
    exposure to vulnerabilities while still testing as robustly as before.
    
    Incidentally, I removed the `Dockerfile` for our ubuntu image, since it
    made reference to non-existent make targets, which tells me it's stale
    and wasn't being used.
    
    ### Review:
    - Reviewer should ensure the dev and test dependencies are not being
    installed in the docker image. One way to check is to check the logs in
    CI, and note, e.g. that
    [this](https://github.com/Unstructured-IO/unstructured/actions/runs/14112971425/job/39536304012#step:3:1700)
    is the first reference to `pytest` in the docker build and test logs,
    after the image build is completed.
    - Reviewer should ensure docker image is still being tested in CI and is
    passing.
    qued authored Mar 27, 2025
    9a239fa

Commits on Mar 31, 2025

  1. feat: convenience unstructured-get-json.sh update (#3971)

    * script now supports:
       * the --vlm flag, to process the document with the VLM strategy
       * optionally takes --vlm-model, --vlm-provider args
    * optionally also writes .html outputs by converting unstructured .json
    output
       * optionally opens those .html outputs in a browser
       
    Tested with:
    ```
    unstructured-get-json.sh --write-html --open-html --fast layout-parser-paper-p2.pdf
    unstructured-get-json.sh --write-html --open-html --hi-res layout-parser-paper-p2.pdf
    unstructured-get-json.sh --write-html --open-html --ocr-only layout-parser-paper-p2.pdf
    unstructured-get-json.sh --write-html --open-html --vlm layout-parser-paper-p2.pdf
    unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider openai --vlm-model gpt-4o layout-parser-paper-p2.pdf
    unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider vertexai --vlm-model gemini-2.0-flash-001 layout-parser-paper-p2.pdf
    unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider anthropic --vlm-model claude-3-5-sonnet-20241022 layout-parser-paper-p2.pdf
    ```
    
    [layout-parser-paper-p2.pdf](https://github.com/user-attachments/files/19514007/layout-parser-paper-p2.pdf)
    cragwolfe authored Mar 31, 2025
    19fc1fc

Commits on Apr 1, 2025

  1. c6b8ed4

Commits on Apr 3, 2025

  1. chore: add html path to ingest-test-fixtures-update-pr (#3977)

    This should allow the `Ingest Test Fixtures Update PR` workflow to also
    update expected html outputs.
    
    E.g., before the change, the .html files would be left unmodified:
    
    ![image](https://github.com/user-attachments/assets/fa14c1a5-39bd-4e32-b4b9-9552eb312de1)
    
    
    https://github.com/Unstructured-IO/unstructured/actions/runs/14234877547/job/39892334672
    cragwolfe authored Apr 3, 2025
    8fc4181

Commits on Apr 4, 2025

  1. dfa17bd

Commits on Apr 7, 2025

  1. Fix sort_page_element. ensures that sorting is stable and not random. (#3978)
    
    `sort_page_element()` uses the element id to sort the elements.
    Two executions of the same code on the same file produce different
    results: the order of the elements is random.
    This makes it impossible to write stable unit tests, for example, or to
    obtain reproducible results.
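
The sketch below illustrates the general problem and one way to avoid it. It is not the change made in #3978: `FakeElement` and `sort_by_position` are hypothetical names, and the `top` coordinate is an assumed sort key.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FakeElement:
    text: str
    top: float
    # Random on every run, like an auto-generated element id.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def sort_by_position(elements):
    # sorted() is stable: elements with equal keys keep their input order,
    # so the result never depends on the random ids.
    return sorted(elements, key=lambda e: e.top)
```

Using `key=lambda e: (e.top, e.id)` instead would break ties by the random id, so two runs over the same input could return different orders.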
    pprados authored Apr 7, 2025
    d570f46

Commits on Apr 8, 2025

  1. Update pdfminer_utils.py (#3974)

    Fix for 'PSSyntaxError' import error:
    "cannot import name 'PSSyntaxError' from 'pdfminer.pdfparser'"
    
    Latest pdfminer-six doesn't import PSSyntaxError into
    `pdfminer.pdfparser` anymore. It must now be directly imported from its
    source (`pdfminer.psexceptions`)
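
One way to stay compatible with both old and new layouts is to try the import locations in order. The helper below is a generic sketch; `import_first` is a hypothetical name, and only the two module paths in the trailing comment come from the commit message.

```python
import importlib


def import_first(name, module_names):
    """Return attribute `name` from the first listed module that provides it."""
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing in this environment, try the next one
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"cannot import name {name!r} from any of {module_names!r}")


# With pdfminer.six installed, the lookup described above would read:
# PSSyntaxError = import_first(
#     "PSSyntaxError", ["pdfminer.psexceptions", "pdfminer.pdfparser"]
# )
```

The commit itself imports directly from `pdfminer.psexceptions`; the fallback order here is only useful if older pdfminer-six versions must also be supported.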
    Nathan-GoSupply authored Apr 8, 2025
    27f503c

Commits on Apr 29, 2025

  1. fix cve (#3989)

    Fix critical CVE for h11. Supposedly 0.16.0 fixes it.
    
    ---------
    
    Co-authored-by: Yao You <[email protected]>
    Co-authored-by: Austin Walker <[email protected]>
    Co-authored-by: ryannikolaidis <[email protected]>
    Co-authored-by: badGarnet <[email protected]>
    5 people authored Apr 29, 2025
    fd9d796
  2. fix: Add missing diffstat command to test_json_to_html CI job (#3992)

    Removed some additional html fixtures. The original json fixtures from
    which the html ones were generated were removed some time ago.
    mpolomdeepsense authored Apr 29, 2025
    b585df1
  3. fix: failing build (#3993)

    Successful build and test:
    https://github.com/Unstructured-IO/unstructured/actions/runs/14730300234/job/41342657532
    
    Failing test_json_to_html CI job fix here:
    #3992
    mpolomdeepsense authored Apr 29, 2025
    604c4a7

Commits on May 5, 2025

  1. fix: properly handle the case when an element's text is None (#3995)

    Some elements, like `Image`, can have `None` as the value of their `text`
    attribute. In that case the current chunking logic fails because it expects
    the field to always have a length and to be splittable. The fix is to use
    `element.text or ""` when checking length and to add flow control
    that exits early to avoid calling split on `None`.
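
A small sketch of the two guards described above. The helper names `text_len` and `split_words` are hypothetical, and whitespace splitting stands in for the real chunking logic, which is not shown here.

```python
from types import SimpleNamespace


def text_len(element) -> int:
    # `element.text or ""` maps None to an empty string before measuring.
    return len(element.text or "")


def split_words(element):
    text = element.text
    if not text:
        # Early exit: never call .split() on None (or on an empty string).
        return []
    return text.split()


image_like = SimpleNamespace(text=None)  # stand-in for an element without text
print(text_len(image_like), split_words(image_like))
```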
    badGarnet authored May 5, 2025
    b814ece