@dependabot dependabot bot commented on behalf of github Jul 22, 2024

Bumps unstructured from 0.14.10 to 0.15.0.

Release notes

Sourced from unstructured's releases.

0.15.0

Enhancements

  • Improve text clearing process in email partitioning. Updated the email partitioner to remove both =\n and =\r\n characters during the clearing process. Previously, only =\n characters were removed.
  • Bump unstructured.paddleocr to 2.8.0.1.
  • Refine HTML parser to accommodate block element nested in phrasing. HTML parser no longer raises on a block element (e.g. <p>, <div>) nested inside a phrasing element (e.g. <strong> or <cite>). Instead it breaks the phrasing run (and therefore element) at the block-item start and begins a new phrasing run after the block-item. This is consistent with how the browser determines element boundaries in this situation.
  • Install rewritten HTML parser to fix 12 existing bugs and provide headroom for refinement and growth. A rewritten HTML parser resolves a collection of outstanding bugs with HTML partitioning and provides a firm foundation for further elaborating that important partitioner.
  • CI check for dependency licenses. Adds a CI check to ensure dependencies are appropriately licensed.
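
To illustrate the email clearing change listed above: `=\n` and `=\r\n` are the soft line breaks of quoted-printable encoding, so removing both forms rejoins words that were wrapped with either line ending. This is a minimal sketch of the idea, not the partitioner's actual code; the helper name `strip_soft_line_breaks` is hypothetical.

```python
import re

# Hypothetical helper for illustration; not part of unstructured's API.
# Matches "=" followed by LF or CRLF (a quoted-printable soft line break).
_SOFT_BREAK = re.compile(r"=\r?\n")


def strip_soft_line_breaks(text: str) -> str:
    """Remove both `=\\n` and `=\\r\\n` sequences from `text`."""
    return _SOFT_BREAK.sub("", text)


if __name__ == "__main__":
    raw = "Hello wor=\nld, this is a te=\r\nst."
    print(strip_soft_line_breaks(raw))  # Hello world, this is a test.
```

A pattern that only matched `=\n` would leave the `=\r\n` break in place, which is the gap the release note describes.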

Features

  • Add support for specifying OCR language to partition_pdf(). Extend language specification capability to PaddleOCR in addition to TesseractOCR. Users can now specify OCR languages for both OCR engines when using partition_pdf().
  • Add AstraDB source connector. Adds support for ingesting documents from AstraDB.
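
As a rough sketch of the OCR language feature above, the snippet below passes a language list to `partition_pdf()`. It assumes `unstructured` is installed with its PDF extras, that the `languages` and `strategy` keyword arguments behave as in the library's documentation, and that the language codes shown suit your OCR engine. Which engine runs (Tesseract or PaddleOCR) depends on how the installation is configured and is not shown here. The helper names are hypothetical.

```python
from typing import Any


def build_ocr_kwargs(filename: str, languages: list[str]) -> dict[str, Any]:
    """Assemble keyword arguments for an OCR-based partition_pdf() call.

    Hypothetical helper for illustration only.
    """
    if not languages:
        raise ValueError("at least one OCR language is required")
    return {
        "filename": filename,
        "strategy": "ocr_only",  # force OCR so the language list is used
        "languages": list(languages),
    }


def partition_pdf_with_languages(filename: str, languages: list[str]):
    """Partition a PDF with the given OCR languages (needs unstructured's PDF extras)."""
    # Imported lazily so the helper above works without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(**build_ocr_kwargs(filename, languages))
```

For example, `partition_pdf_with_languages("scan.pdf", ["eng", "deu"])` would request English and German; the file name and codes are placeholders.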

Fixes

  • Remedy error on Windows when nltk binaries are downloaded. Work around a quirk in the Windows implementation of tempfile.NamedTemporaryFile where accessing the temporary file by name raises PermissionError.
  • Move Astra embedded_dimension to write config
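
The Windows quirk mentioned in the first fix can be sketched as follows. On Windows, a file created by `tempfile.NamedTemporaryFile` generally cannot be reopened by name while the original handle is still open. One common workaround, shown here as an assumption about the general approach rather than the exact change made in unstructured, is to create the file with `delete=False`, close it, reopen it by name, and clean it up manually. The function name is hypothetical.

```python
import os
import tempfile


def write_then_reopen(data: bytes) -> bytes:
    """Write `data` to a temp file, then read it back by file name."""
    # delete=False keeps the file on disk after close() so it can be reopened.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp.write(data)
        tmp.close()  # release the handle before reopening by name
        with open(tmp.name, "rb") as f:
            return f.read()
    finally:
        tmp.close()  # safe to call again if an error occurred before close()
        os.unlink(tmp.name)  # manual cleanup, since delete=False was used


if __name__ == "__main__":
    print(write_then_reopen(b"nltk data"))  # b'nltk data'
```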
Commits
  • ec59abf enhancement: improve text clearing process in email partitioning (#3422)
  • 1df7908 feat: save file id for all fsspec connectors if present (#3405)
  • 0eb461a refactor: restructure PDF/Image example document organization (#3410)
  • 5d38703 bugfix: google drive connector metadata safegaurds (#3407)
  • e99e5a8 rfctr(file): make FileType enum a file-type descriptor (#3411)
  • 35ee6bf bugfix: conform all connectors to be added to registry (#3408)
  • a5c9a36 rfctr(file): improve file-type auto-detect (#3409)
  • 48bdf94 feat: partition_pdf() support language specification for PaddleOCR (#3400)
  • 6b1d5f2 rfctr: move astra arg (#3383)
  • 56ca39c rfctr(file): improve filetype tests (#3402)
  • Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot will merge this PR once it's up-to-date and CI passes on it, as requested by @DonnieBLT.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot dependabot bot added the dependencies Pull requests that update a dependency file label Jul 22, 2024
github-actions bot previously approved these changes Jul 22, 2024
DonnieBLT previously approved these changes Jul 22, 2024
@DonnieBLT DonnieBLT left a comment

@dependabot merge

@dependabot dependabot bot dismissed stale reviews from DonnieBLT and github-actions[bot] via 538bb9d July 22, 2024 01:00
@dependabot dependabot bot force-pushed the dependabot/pip/unstructured-0.15.0 branch from 1ccc23b to 538bb9d Compare July 22, 2024 01:00
github-actions bot previously approved these changes Jul 22, 2024
DonnieBLT previously approved these changes Jul 22, 2024
@DonnieBLT DonnieBLT left a comment

@dependabot merge

Bumps [unstructured](https://github.com/Unstructured-IO/unstructured) from 0.14.10 to 0.15.0.
- [Release notes](https://github.com/Unstructured-IO/unstructured/releases)
- [Changelog](https://github.com/Unstructured-IO/unstructured/blob/main/CHANGELOG.md)
- [Commits](Unstructured-IO/unstructured@0.14.10...0.15.0)

---
updated-dependencies:
- dependency-name: unstructured
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot dismissed stale reviews from DonnieBLT and github-actions[bot] via 0b1f982 July 22, 2024 01:10
@dependabot dependabot bot force-pushed the dependabot/pip/unstructured-0.15.0 branch from 538bb9d to 0b1f982 Compare July 22, 2024 01:10
@DonnieBLT DonnieBLT left a comment

@dependabot merge

@dependabot dependabot bot merged commit c3f6403 into main Jul 22, 2024
@dependabot dependabot bot deleted the dependabot/pip/unstructured-0.15.0 branch July 22, 2024 01:17