Tags: leSullivan/unstructured-api
version `0.0.81`; bump to `unstructured==0.15.13` (Unstructured-IO#463)

### Summary

Bumps to `unstructured==0.15.13` to apply security patches.

version 0.0.80; bump to unstructured 0.15.10 (Unstructured-IO#458)

### Summary

Bumps to `unstructured==0.15.10`.

version 0.0.79; bump to unstructured 0.0.79 (Unstructured-IO#454)

### Summary

Bumps to `unstructured==0.15.7`.
feat: enhance API filetype detection (Unstructured-IO#445)

# Use the library for filetype detection

Mimetype detection in the API has always been naive: it relies on the file extension. If the user doesn't include a filename, we return the error `Filetype None is not supported`. The library has a `detect_filetype` function that inspects the file bytes, so let's reuse it.

# Add a `content_type` param to override filetype detection

Add an optional `content_type` param that lets the user override filetype detection. We'll use this value if it's set; otherwise we take `file.content_type`, which is based on the multipart `Content-Type` header. This provides an alternative when clients are unable to modify the header.

# Testing

The important thing is that `test_happy_path_all_types` passes in the docker smoke test, since it covers every filetype we want the API to support.

To test manually, send files to the server with and without the filename/content_type defined. Check out this branch and run `make run-web-app`.

Example: sending a file with no extension in the filename. This correctly processes a PDF.

```
import requests

filename = "sample-docs/layout-parser-paper-fast.pdf"
url = "http://localhost:8000/general/v0/general"

with open(filename, "rb") as f:
    # Upload under a name that has no file extension
    files = {"files": ("sample-doc", f)}
    response = requests.post(url, files=files)

print(response.text)
```

For the new param, try overriding the content type for a text-based file. Verify that you can change the `metadata.filetype` of the response using the new param:

```
curl --location 'http://localhost:8000/general/v0/general' \
  --form 'files=@"sample-docs/family-day.eml"' \
  --form 'content_type="text/plain"'

[
  {
    "type": "UncategorizedText",
    "element_id": "5cafe1ce2b0a96f8e3eba232e790db19",
    "text": "MIME-Version: 1.0 Date: Wed, 21 Dec 2022 10:28:53 -0600 Message-ID: <CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com> Subject: Family Day From: Mallori Harrell <[email protected]> To: Mallori Harrell <[email protected]> Content-Type: multipart/alternative; boundary=\"0000000000005c115405f0590ce4\"",
    "metadata": {
      "filename": "family-day.eml",
      "languages": [
        "eng"
      ],
      "filetype": "text/plain"
    }
  },
  ...
]
```
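The precedence described in this change can be sketched as below. This is a minimal illustration, not the API's actual code: the function names `resolve_mimetype` and `sniff_mimetype` are hypothetical, the byte check is a stand-in for the library's `detect_filetype` (it only recognizes the PDF magic number), and falling back to byte inspection when neither value is supplied is an assumption about how the pieces fit together.

```python
from typing import Callable, Optional


def sniff_mimetype(data: bytes) -> Optional[str]:
    """Hypothetical stand-in for byte-based detection.

    Only recognizes the PDF magic number; the real library inspects
    many more formats.
    """
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    return None


def resolve_mimetype(
    data: bytes,
    content_type_param: Optional[str] = None,
    header_content_type: Optional[str] = None,
    sniff: Callable[[bytes], Optional[str]] = sniff_mimetype,
) -> Optional[str]:
    """Pick a mimetype using the precedence described in the change."""
    # 1. An explicit `content_type` form parameter always wins.
    if content_type_param:
        return content_type_param
    # 2. Otherwise use the Content-Type of the multipart file part.
    if header_content_type:
        return header_content_type
    # 3. Assumed fallback: inspect the file bytes.
    return sniff(data)


if __name__ == "__main__":
    # The override beats the header, mirroring the .eml -> text/plain example.
    print(resolve_mimetype(b"MIME-Version: 1.0", "text/plain", "message/rfc822"))
    # No filename, no content type: detection falls back to the bytes.
    print(resolve_mimetype(b"%PDF-1.7 ..."))
```

Passing the detector in as a parameter keeps the sketch testable without the real library installed.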
build(deps): remove dependency constraint on `safetensors` (Unstructured-IO#443)

### Summary

Removes a constraint on `safetensors` from version `0.0.38` that was preventing us from resolving a low-severity CVE in `transformers`.

build: bump to `0.0.74`; bump dependencies (Unstructured-IO#442)

### Summary

Bumps dependencies and prepares files for the `0.0.74` release.

build(deps): bump to `unstructured==0.14.10` (Unstructured-IO#438)

### Summary

Bumps to `unstructured==0.14.10`.
fix/Fix MS Office filetype errors and harden docker smoketest (Unstructured-IO#436)

# Changes

**Fix for docx and other office files returning `{"detail":"File type None is not supported."}`**

After moving to the wolfi base image, the `mimetypes` lib no longer knows about these file extensions. To avoid issues like this, let's add an explicit mapping for all the file extensions we care about. I added a `filetypes.py` and moved `get_validated_mimetype` over. When this file is imported, we'll call `mimetypes.add_type` for all file extensions we support.

**Update smoke test coverage**

This bug snuck past because we were already providing the mimetype in the docker smoke test. I updated `test_happy_path` to test against the container with and without passing `content_type`. I added some missing filetypes, and sorted the test params by extension so we can see when new types are missing.

# Testing

The new smoke test will verify that all filetypes are working. You can also `make docker-build && make docker-start-api`, and test out the docx in the sample docs dir. On `main`, this file will give you the error above.

```
curl 'http://localhost:8000/general/v0/general' \
  --form 'files=@"fake.docx"'
```
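The explicit-mapping approach described in this fix can be sketched with the standard-library `mimetypes` module. This is an illustrative sketch, not the contents of the project's `filetypes.py`: the dictionary name `EXTENSION_TO_MIMETYPE`, the helper `guess_mimetype`, and the three extensions chosen are assumptions for the example.

```python
import mimetypes

# Illustrative subset only; the real list of supported extensions
# is defined by the API, not by this sketch.
EXTENSION_TO_MIMETYPE = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# Register every mapping at import time, so lookups no longer depend on
# whatever mime database the base image happens to ship.
for extension, mimetype in EXTENSION_TO_MIMETYPE.items():
    mimetypes.add_type(mimetype, extension)


def guess_mimetype(filename: str):
    """Return the mimetype for a filename, or None if the extension is unknown."""
    mimetype, _encoding = mimetypes.guess_type(filename)
    return mimetype


if __name__ == "__main__":
    print(guess_mimetype("fake.docx"))
```

Registering the types explicitly makes the result independent of the host's mime database, which is the failure mode the move to a new base image exposed.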
build(deps): bump dependency versions (Unstructured-IO#434)

### Summary

Bumps dependency versions for the API. Closes Unstructured-IO#432.

build(deps): version bumps for maintenance (Unstructured-IO#424)

### Summary

Version bumps for regular maintenance and to address moderate CVEs from security scans.

- bump `unstructured` to `0.14.6`
- bump `unstructured-inference` to `0.7.35`