Comparing changes
base repository: Unstructured-IO/unstructured
base: 0.17.2
head repository: Unstructured-IO/unstructured
compare: main
- 14 commits
- 117 files changed
- 13 contributors
Commits on Mar 21, 2025
Matches prefix to verify presence of DOCX,PPTX,XLSX files instead of standard file names (#3959)
Instead of checking for the exact entries `word/document.xml`, `ppt/presentation.xml`, and `xl/workbook.xml` to identify DOCX, PPTX, and XLSX files, we now match the patterns `word/document*.xml`, `ppt/presentation*.xml`, and `xl/workbook*.xml`, because some files generated by Office 365 use different names for these entries.
Fixes #3937
Co-authored-by: Yao You <[email protected]>
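The prefix check described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the three patterns come from the commit message, while `OOXML_PATTERNS`, `detect_ooxml_type`, and `make_zip` are hypothetical names introduced here. Note that `fnmatch` lets `*` match any characters, including `/`, so this sketch is slightly more permissive than a strict file-name prefix test.

```python
import fnmatch
import io
import zipfile
from typing import Optional

# Patterns taken from the commit message. The mapping and the functions
# below are hypothetical and exist only for illustration.
OOXML_PATTERNS = {
    "docx": "word/document*.xml",
    "pptx": "ppt/presentation*.xml",
    "xlsx": "xl/workbook*.xml",
}


def detect_ooxml_type(data: bytes) -> Optional[str]:
    """Return "docx", "pptx", or "xlsx" if a ZIP entry matches, else None."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return None
    with zipfile.ZipFile(buffer) as archive:
        names = archive.namelist()
    for kind, pattern in OOXML_PATTERNS.items():
        # Case-sensitive glob match against every entry name in the archive.
        if any(fnmatch.fnmatchcase(name, pattern) for name in names):
            return kind
    return None


def make_zip(entry_name: str) -> bytes:
    """Build a tiny in-memory ZIP with a single entry, for demonstration."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as archive:
        archive.writestr(entry_name, "<xml/>")
    return out.getvalue()
```

With this sketch, an archive whose main part is named `word/document2.xml` is still recognized as DOCX, which an exact-name check would miss.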
Commit 3497281
Commits on Mar 25, 2025
manual trigger of workflows to publish new image and new vers tag in quay (#3965)
There were some open CVEs in the base image. Those are now resolved, so this commit triggers a workflow with an updated version tag.
Commit 347a4e5
Commits on Mar 26, 2025
chore: deprecate stage_for_label_studio (#3968)
This PR is to address [a CVE](GHSA-rgv9-w7jp-m23g) that appeared in a recent scan. The CVE has to do with the package `label_studio_sdk`. This relates to the tool Label Studio, a data labeling platform. We built a staging function that takes a list of elements and converts it to a format suitable for passing to the LabelStudio platform.

We don't use the package with the vulnerability in the actual function, we only use it to test the output of the function against the Label Studio API schema. Even the test where we use it is sort of questionable in value, since it's really testing the schema against an old version of the LabelStudio API (we are testing against a recording of the Label Studio API's responses stored using `vcrpy`).

Label Studio has fixed the vulnerability as of version 1.0.10 of their SDK, but we're stuck on 1.0.5 because 1.0.6 and above require `numpy<2.0.0`. This leaves us with several choices of resolution, some of which are:

1. Downgrade `numpy` to upgrade `label_studio_sdk` to >=1.0.10 to resolve the CVE
2. Drop `label_studio_sdk` by either removing or rewriting the test.
3. Drop test and dev dependencies from the `unstructured` image.

We've decided to do 2. _and_ 3. This PR handles 2., with 3. to be a follow-on PR. Here we add a deprecation notice to `stage_for_label_studio` and remove the offending test. Normally good practice would be to add a warning of future deprecation to the function for a reasonable amount of time, but in order to address the CVE immediately, we're deprecating it right away.

### Testing
Install the dependencies (`make install`) into a fresh environment, and `pip list | grep label` should have no results. The scan artifact in CI should contain no "high" or "critical" CVEs.
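A deprecation notice of the kind mentioned above is commonly implemented with the standard `warnings` module. The sketch below shows one way this might look; the `deprecated` decorator and `stage_for_label_studio_sketch` are hypothetical stand-ins, and the returned payload is a simplified placeholder rather than the real Label Studio format.

```python
import functools
import warnings


def deprecated(reason: str):
    """Hypothetical decorator that emits a DeprecationWarning on each call."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated: {reason}",
                category=DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated("it will be removed in a future release")
def stage_for_label_studio_sketch(elements):
    """Stand-in for the real staging function; returns a simplified payload."""
    return [{"data": {"text": text}} for text in elements]
```

Callers keep getting a result, but anyone running with warnings enabled (for example under a test runner) sees the notice at the call site.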
Commit 3f07840
Commits on Mar 27, 2025
build: remove test and dev deps from docker image (#3969)
Removed the dependencies contained in `test.txt`, `dev.txt`, and `constraints.txt` from the things that get installed in the docker image. In order to keep testing the image (running the tests), I added a step to the `docker-test` make target to install `test.txt` and `dev.txt`. Thus we presumably get a smaller image (probably not much smaller), reduce the dependency chain of our images, and have less exposure to vulnerabilities while still testing as robustly as before.

Incidentally, I removed the `Dockerfile` for our ubuntu image, since it made reference to non-existent make targets, which tells me it's stale and wasn't being used.

### Review:
- Reviewer should ensure the dev and test dependencies are not being installed in the docker image. One way to check is to check the logs in CI, and note, e.g. that [this](https://github.com/Unstructured-IO/unstructured/actions/runs/14112971425/job/39536304012#step:3:1700) is the first reference to `pytest` in the docker build and test logs, after the image build is completed.
- Reviewer should ensure docker image is still being tested in CI and is passing.
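Besides reading the CI logs, a reviewer could check from inside the image whether specific packages are installed. The following is a small sketch using only the standard library; `installed_subset` is a hypothetical helper, and the list of package names to check is up to the reviewer.

```python
from importlib import metadata
from typing import Iterable, List


def installed_subset(package_names: Iterable[str]) -> List[str]:
    """Return the names from `package_names` that are installed in this environment."""
    found = []
    for name in package_names:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            continue
        found.append(name)
    return found
```

Run inside the built image, `installed_subset(["pytest"])` would be expected to return an empty list before the `docker-test` target installs the test requirements.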
Commit 9a239fa
Commits on Mar 31, 2025
feat: convenience unstructured-get-json.sh update (#3971)
* script now supports:
  * the --vlm flag, to process the document with the VLM strategy
  * optionally takes --vlm-model, --vlm-provider args
  * optionally also writes .html outputs by converting unstructured .json output
  * optionally opens those .html outputs in a browser

Tested with:
```
unstructured-get-json.sh --write-html --open-html --fast layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --hi-res layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --ocr-only layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider openai --vlm-model gpt-4o layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider vertexai --vlm-model gemini-2.0-flash-001 layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider anthropic --vlm-model claude-3-5-sonnet-20241022 layout-parser-paper-p2.pdf
```
[layout-parser-paper-p2.pdf](https://github.com/user-attachments/files/19514007/layout-parser-paper-p2.pdf)
Commit 19fc1fc
Commits on Apr 1, 2025
Commit c6b8ed4
Commits on Apr 3, 2025
chore: add html path to ingest-test-fixtures-update-pr (#3977)
This should allow the `Ingest Test Fixtures Update PR` workflow to also update the expected HTML outputs. Before this change, the .html files were left unmodified, e.g.: https://github.com/Unstructured-IO/unstructured/actions/runs/14234877547/job/39892334672
Commit 8fc4181
Commits on Apr 4, 2025
Commit dfa17bd
Commits on Apr 7, 2025
Fix sort_page_element. ensures that sorting is stable and not random. (#3978)
`sort_page_element()` uses the element id to sort the elements. As a result, two executions of the same code on the same file produce different results: the order of the elements is random. This makes it impossible to write stable unit tests, for example, or to obtain reproducible results.
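The reproducibility problem can be illustrated with a small sketch. It assumes, purely for illustration, that element ids are random (`uuid4`) and that a position on the page is available as a sort key; the commit message states neither how ids are generated nor what key the fix uses, and `PageElement`, `build_page`, `sort_by_id`, and `sort_by_position` are hypothetical names.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageElement:
    """Hypothetical element: a random id plus a position on the page."""

    text: str
    top: float
    left: float
    element_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def build_page() -> List[PageElement]:
    """Simulate parsing the same page; ids differ on every call (assumption)."""
    return [
        PageElement("Body B", top=10, left=50),
        PageElement("Footer", top=90, left=0),
        PageElement("Dup 1", top=50, left=0),
        PageElement("Title", top=0, left=0),
        PageElement("Dup 2", top=50, left=0),
        PageElement("Body A", top=10, left=0),
    ]


def sort_by_id(elements: List[PageElement]) -> List[PageElement]:
    """Order depends on the random ids, so it can change between runs."""
    return sorted(elements, key=lambda e: e.element_id)


def sort_by_position(elements: List[PageElement]) -> List[PageElement]:
    """Order depends only on content; sorted() is stable, so ties keep input order."""
    return sorted(elements, key=lambda e: (e.top, e.left))
```

Calling `sort_by_position(build_page())` twice yields the same text order both times, whereas `sort_by_id` offers no such guarantee under the random-id assumption.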
Commit d570f46
Commits on Apr 8, 2025
Update pdfminer_utils.py (#3974)
Fix for the 'PSSyntaxError' import error: "cannot import name 'PSSyntaxError' from 'pdfminer.pdfparser'". The latest pdfminer-six no longer imports `PSSyntaxError` into `pdfminer.pdfparser`, so it must now be imported directly from its source module (`pdfminer.psexceptions`).
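The commit imports the name directly from its new location. For code that has to run against both older and newer releases, a fallback lookup is one possible variation, sketched below. `import_first` and `load_ps_syntax_error` are hypothetical helpers; the two module names are the ones given in the commit message, and `load_ps_syntax_error` only works where pdfminer-six is installed.

```python
import importlib
from typing import Any, Sequence


def import_first(attribute: str, module_names: Sequence[str]) -> Any:
    """Return `attribute` from the first module in `module_names` that provides it."""
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing in this environment; try the next one
        if hasattr(module, attribute):
            return getattr(module, attribute)
    raise ImportError(
        f"cannot import name {attribute!r} from any of {list(module_names)}"
    )


def load_ps_syntax_error() -> Any:
    """Look up PSSyntaxError in the new location first, then the old one."""
    return import_first(
        "PSSyntaxError",
        ["pdfminer.psexceptions", "pdfminer.pdfparser"],
    )
```

The direct import used in the commit is simpler and is the better choice when only recent pdfminer-six versions need to be supported.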
Commit 27f503c
Commits on Apr 29, 2025
fix critical cve for h11. supposedly 0.16.0 fixes it.
Co-authored-by: Yao You <[email protected]>
Co-authored-by: Austin Walker <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: badGarnet <[email protected]>
Commit fd9d796
fix: Add missing diffstat command to test_json_to_html CI job (#3992)
Removed some additional HTML fixtures. The original JSON fixtures from which they were generated were removed some time ago.
Commit b585df1
Successful build and test: https://github.com/Unstructured-IO/unstructured/actions/runs/14730300234/job/41342657532
Failing test_json_to_html CI job fix here: #3992
Commit 604c4a7
Commits on May 5, 2025
fix: properly handle the case when an element's text is None (#3995)
Some elements, like `Image`, can have `None` as the value of their `text` attribute. In that case the current chunking logic fails, because it expects the field to always have a length and to be splittable. The fix is to use `element.text or ""` when checking the length, and to add flow control that exits early so that split is never called on `None`.
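The `element.text or ""` idiom and the early exit can be sketched as follows. This is an illustration only: `Element`, `text_length`, and `split_text` are hypothetical and much simpler than the library's chunking code, and fixed-width slicing stands in for whatever splitting the real chunker performs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical element whose text may be None (as for an image)."""

    text: Optional[str]


def text_length(element: Element) -> int:
    """Treat a missing text as empty when measuring length."""
    return len(element.text or "")


def split_text(element: Element, max_chars: int) -> List[str]:
    """Split text into pieces of at most `max_chars`; exit early when there is none."""
    text = element.text or ""
    if not text:
        return []  # early exit: nothing to split, and no method call on None
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
```

Both helpers accept `Element(text=None)` without raising, which is the failure mode the commit describes.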
Commit b814ece
You can try running this command locally to see the comparison on your machine:
git diff 0.17.2...main