Conversation

@eshwarprasadS
Contributor

This PR bumps docling from docling>=2.4.2,<=2.8.3 to docling>=2.18.0. The bump brings in the fix for the docling chunking failure on Markdown files with unescaped special characters (docling-project/docling#823).

The primary changes are:

  • updates to requirements.txt
  • updates to CI environment handling in tox.ini and chunkers.py
  • removal of legacy patterns that use bare docling.parse in taxonomy.py, since the parsed PDF content does not need to be passed to DocumentChunker
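
To illustrate what the requirements.txt change means in practice, the sketch below checks a few version strings against the old and new ranges. It is a simplified stand-in for pip's specifier handling (plain numeric x.y.z versions only), and the helper names parse_version, satisfies_old, and satisfies_new are hypothetical, not part of this PR.

```python
def parse_version(text):
    """Turn a plain 'x.y.z' version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def satisfies_old(version):
    # Old constraint: docling>=2.4.2,<=2.8.3
    return parse_version("2.4.2") <= parse_version(version) <= parse_version("2.8.3")


def satisfies_new(version):
    # New constraint from this PR: docling>=2.18.0
    return parse_version(version) >= parse_version("2.18.0")


# Print each candidate with its result under the old and new constraints.
for candidate in ["2.4.2", "2.8.3", "2.17.0", "2.18.0", "2.20.1"]:
    print(candidate, satisfies_old(candidate), satisfies_new(candidate))
```

No version satisfies both ranges, so any environment pinned under the old constraint has to upgrade.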

@mergify mergify bot added the CI/CD, testing, ci-failure, and dependencies labels Mar 18, 2025
Signed-off-by: eshwarprasadS <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Mar 18, 2025
@mergify mergify bot removed the ci-failure label Mar 18, 2025
Member

@aakankshaduggal aakankshaduggal left a comment

@mergify mergify bot added the one-approval label Mar 19, 2025
Contributor

@bbrowning bbrowning left a comment

I cloned this and ran it locally with some Markdown files that previously errored out due to a Docling bug around unescaped headings. With this updated Docling version, those files now chunk properly.

Also, I looked at the change to the chunking test and it looks reasonable.

The only thing I'd ask, which could be done as a follow-up PR, is that we add a simplified example of markdown that failed with our previous docling version and that will pass with this new docling version. That's just to prevent regression here, but we don't have to hold up merging this PR itself for that unless that's quick and easy.

@mergify mergify bot removed the one-approval label Mar 19, 2025
Member

@khaledsulayman khaledsulayman left a comment

Looks great. Just a few nits, but I won't block on these. Thanks!

Signed-off-by: eshwarprasadS <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Mar 20, 2025
@mergify mergify bot added ci-failure and removed ci-failure labels Mar 21, 2025
Signed-off-by: Khaled Sulayman <[email protected]>
@ktdreyer
Contributor

@khaledsulayman noticed e2e fails here. Thanks @courtneypacheco for looking into this.

Since @eshwarprasadS created this PR from his fork, GitHub will not take the changes to .github/workflows/e2e-nvidia-t4-x1.yml into account.

To get the e2e tests to run with proper credentials on this PR's changes:

  1. Create a new work-in-progress branch within this repo (not a fork). The wip branch should be based on main. You can name it docling-version-bump.
  2. Merge this PR's contents to docling-version-bump.
  3. Open a new PR from docling-version-bump to main.

Then you should be able to run the CI changes in this PR before merging to main.
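
The branch-and-merge steps above could be scripted roughly as follows. This is a sketch that only builds the git command list and does not run anything. The remote name origin, the local branch name pr-574, and the helper branch_plan are assumptions; the PR number 574 is taken from the backport commit title in this thread, and pull/<number>/head is the ref GitHub exposes for pull requests.

```python
def branch_plan(pr_number, branch="docling-version-bump", remote="origin"):
    """Return git commands (as argument lists) for creating the in-repo branch.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    pr_ref = f"pull/{pr_number}/head"   # GitHub's read-only ref for the PR head
    local_pr = f"pr-{pr_number}"        # assumed local branch name for the PR
    return [
        # Fetch the latest main and the PR head from the upstream repo.
        ["git", "fetch", remote, "main", f"{pr_ref}:{local_pr}"],
        # Create the work-in-progress branch based on main.
        ["git", "switch", "-c", branch, f"{remote}/main"],
        # Merge the PR's contents into the new branch.
        ["git", "merge", "--no-ff", local_pr],
        # Publish the branch so a PR can be opened from within the repo.
        ["git", "push", "-u", remote, branch],
    ]


for command in branch_plan(574):
    print(" ".join(command))
```

Opening the new PR from docling-version-bump to main is then done through GitHub as usual.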

@bbrowning
Contributor

We should be able to remove the constraints.txt and workflow changes here, as there was a bug in Python SetupTools 77.0.3 with DeepSpeed that is now resolved with a newer Python SetupTools that our most recent CI builds are picking up. See deepspeedai/DeepSpeed#7165 for other reports of this, but I've since seen our CI pass as the jobs are now picking up SetupTools 78.x.
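
One way to check whether a given environment has the affected SetupTools release is to read the installed version. This is a minimal sketch using the standard library's importlib.metadata; the helper is_affected is hypothetical, and it treats only the exact version named above as affected, which is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version

AFFECTED_SETUPTOOLS = "77.0.3"  # the release named in the comment above


def is_affected(installed):
    """Return True if the given setuptools version string is the affected release."""
    return installed == AFFECTED_SETUPTOOLS


try:
    installed = version("setuptools")
    print(f"setuptools {installed}: affected={is_affected(installed)}")
except PackageNotFoundError:
    print("setuptools is not installed in this environment")
```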

@mergify
Contributor

mergify bot commented Mar 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. @eshwarprasadS please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added needs-rebase and removed ci-failure labels Mar 26, 2025
Signed-off-by: Eshwar Prasad Sivaramakrishnan <[email protected]>
@mergify mergify bot removed the needs-rebase label Mar 26, 2025
@eshwarprasadS eshwarprasadS merged commit 2cc9889 into instructlab:main Mar 26, 2025
28 checks passed
@bbrowning
Contributor

@Mergifyio backport release-v0.7

@mergify
Contributor

mergify bot commented Mar 31, 2025

backport release-v0.7

✅ Backports have been created

bbrowning added a commit that referenced this pull request Mar 31, 2025
Update Docling version and improve OCR options handling with new docling ver. (backport #574)

Labels

  • CI/CD: Affects CI/CD configuration
  • dependencies: Pull requests that update a dependency file
  • testing: Relates to testing

6 participants