@dmartinol dmartinol commented Feb 25, 2025

Adds metadata to document chunks, following the guidance of the docling-haystack package.
Reference code here.
Note: we cannot integrate the package as-is because it depends on docling = "^2.9.0", while instructlab-sdg restricts us to docling>=2.4.2,<=2.8.3.
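
To make the conflict concrete: Poetry's caret constraint `^2.9.0` is shorthand for `>=2.9.0,<3.0.0`, which does not overlap with `>=2.4.2,<=2.8.3`. The following is a minimal sketch using plain tuples; the helper names and the sample version strings are illustrative only, and it handles nothing beyond simple `X.Y.Z` versions.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn "2.8.3" into (2, 8, 3); only plain X.Y.Z versions are handled."""
    return tuple(int(part) for part in version.split("."))


def allowed_by_sdg(version: str) -> bool:
    # docling>=2.4.2,<=2.8.3 (the range quoted above for instructlab-sdg)
    return parse("2.4.2") <= parse(version) <= parse("2.8.3")


def allowed_by_docling_haystack(version: str) -> bool:
    # Poetry caret "^2.9.0" is shorthand for >=2.9.0,<3.0.0
    return parse("2.9.0") <= parse(version) < parse("3.0.0")


# Arbitrary sample version strings, chosen only to exercise both ranges.
candidates = ["2.4.2", "2.8.3", "2.9.0", "2.15.1"]
compatible = [
    v for v in candidates
    if allowed_by_sdg(v) and allowed_by_docling_haystack(v)
]
print(compatible)  # []
```

No version can satisfy both ranges, because the upper bound of the first (2.8.3) lies below the lower bound of the second (2.9.0).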

Metadata fields:
All the DocMeta fields apart from: schema_name, version and doc_items
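
As an illustration of the field selection described above (not the code from this PR), here is a sketch that works on a plain dictionary. The function name, the constant, and the sample values are hypothetical; the real implementation operates on docling's `DocMeta` objects.

```python
from typing import Any

# Fields named above as excluded from the stored chunk metadata.
EXCLUDED_FIELDS = {"schema_name", "version", "doc_items"}


def chunk_metadata(doc_meta: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of doc_meta without the excluded fields."""
    return {k: v for k, v in doc_meta.items() if k not in EXCLUDED_FIELDS}


# Hypothetical input shaped like a serialized DocMeta; values are placeholders.
sample = {
    "schema_name": "placeholder-schema",
    "version": "0.0.0",
    "doc_items": [],
    "headings": ["InstructLab 🐶 (ilab)"],
    "origin": {"mimetype": "text/markdown", "filename": "README.md"},
}
print(sorted(chunk_metadata(sample)))  # ['headings', 'origin']
```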

Issue resolved by this Pull Request:
Closes #3192

Verifying the generated schema:
Sequence of commands to validate the schema of the default in-memory store:

ilab rag ingest --input-dir _YOUR_DOCS_DIR_
cat $(ilab config show -k rag.document_store.uri) | jq > embeddings.json
jq -r 'paths | map(if type=="number" then "[*]" else tostring end) | join(".")' embeddings.json | sort -u

Sample output (edited to show only the relevant fields):

documents.[*].content
documents.[*].embedding
documents.[*].embedding.[*]
documents.[*].id
documents.[*].meta
documents.[*].meta.headings
documents.[*].meta.headings.[*]
documents.[*].meta.origin
documents.[*].meta.origin.binary_hash
documents.[*].meta.origin.filename
documents.[*].meta.origin.mimetype
documents.[*].score
documents.[*].sparse_embedding
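
For readers without jq, the path listing above can be approximated in Python. This is a sketch that mirrors the filter's logic (every array index collapses to `[*]`, and paths are de-duplicated and sorted); the function names are made up for this example, and the ordering may differ from `sort -u` depending on locale.

```python
import json
from typing import Any, Iterator


def paths(node: Any, prefix: tuple[str, ...] = ()) -> Iterator[tuple[str, ...]]:
    """Yield the path to every nested value, with list indexes shown as [*]."""
    if isinstance(node, dict):
        items = ((str(key), value) for key, value in node.items())
    elif isinstance(node, list):
        items = (("[*]", value) for value in node)
    else:
        return  # scalars have no children
    for key, value in items:
        path = prefix + (key,)
        yield path
        yield from paths(value, path)


def schema(document: Any) -> list[str]:
    """Return the unique dotted paths of a JSON document, sorted."""
    return sorted({".".join(path) for path in paths(document)})


# Tiny made-up document; replace with json.load() on the real store file.
sample = json.loads(
    '{"documents": [{"id": "a", "meta": {"headings": ["h1", "h2"]}}]}'
)
for line in schema(sample):
    print(line)
```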

A snippet of a chunk's metadata from the JSON document:

...
  "documents": [
    {
      "id": "794e8193117c68bb07ad5c58ef55f6111ecb7d8e449ec3f968769ebbfb4b5321",
      "content": "InstructLab 🐶 (ilab)\n❓ What is InstructLab Core\ngraph TD;\n download-->chat\n chat[Chat with the LLM]-->add\n add[Add new knowledge<br/>or skill to taxonomy]-->generate[generate new<br/>synthetic training data]\n generate-->train\n train[Re-train]-->|Chat with<br/>the re-trained LLM<br/>to see the results|chat\nFor an overview of the full workflow, see the workflow diagram.\n[!IMPORTANT]",
      "dataframe": null,
      "blob": null,
      "meta": {
        "headings": [
          "InstructLab 🐶 (ilab)",
          "❓ What is InstructLab Core"
        ],
        "origin": {
          "mimetype": "text/markdown",
          "binary_hash": 10151905962323472718,
          "filename": "README.md"
        }
      },
      "score": null,
      "embedding": [
        -0.04702334851026535,
    ...
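
A quick way to check the ingested store for chunks that lack this metadata could look like the sketch below. It assumes the store is a JSON object with a top-level `documents` list, as in the snippet above; the function name and the choice to require all three `origin` keys are assumptions made for illustration, and `headings` is deliberately not required because a chunk may not sit under any heading.

```python
from typing import Any

# Keys seen under meta.origin in the sample output above.
REQUIRED_ORIGIN_KEYS = {"mimetype", "binary_hash", "filename"}


def missing_origin(store: dict[str, Any]) -> list[str]:
    """Return the ids of documents whose meta.origin is absent or incomplete."""
    incomplete = []
    for doc in store.get("documents", []):
        meta = doc.get("meta") or {}
        origin = meta.get("origin") or {}
        if not REQUIRED_ORIGIN_KEYS.issubset(origin):
            incomplete.append(doc.get("id", "<no id>"))
    return incomplete


# Made-up store with one complete and one incomplete document.
store = {
    "documents": [
        {
            "id": "doc-1",
            "meta": {
                "headings": ["Intro"],
                "origin": {
                    "mimetype": "text/markdown",
                    "binary_hash": 1,
                    "filename": "README.md",
                },
            },
        },
        {"id": "doc-2", "meta": {}},
    ]
}
print(missing_origin(store))  # ['doc-2']
```

To run it against a real store, load the file at the path printed by `ilab config show -k rag.document_store.uri` with `json.load` and pass the result to `missing_origin`.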

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the
    conventional commits.
  • Changelog updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Functional tests have been added, if necessary.
  • E2E Workflow tests have been added, if necessary.

@dmartinol dmartinol marked this pull request as ready for review February 25, 2025 14:49
@jwm4 jwm4 requested a review from a team February 25, 2025 14:50
@courtneypacheco

Hey @dmartinol!

In reference to your comment about docling versioning: Is this PR waiting for instructlab/sdg#557 to get merged? That PR updates docling to docling>=2.9.0.

@dmartinol

> Hey @dmartinol!
>
> In reference to your comment about docling versioning: Is this PR waiting for instructlab/sdg#557 to get merged? That PR updates docling to docling>=2.9.0.

Thanks for pointing this out @courtneypacheco !
IMO, once the SDG PR is merged and the new version is integrated in ilab, we'll need a separate issue to integrate the docling-haystack package and review the RAG implementation to adapt to the SDG changes.
See this comment, for instance.
I would open a separate issue to track this need, and then we can prioritize it. WDYT?

@mergify mergify bot added ci-failure PR has at least one CI failure and removed ci-failure PR has at least one CI failure labels Feb 26, 2025
@mergify mergify bot added CI/CD Affects CI/CD configuration documentation Improvements or additions to documentation testing Relates to testing dependencies Relates to dependencies labels Feb 26, 2025
@dmartinol dmartinol force-pushed the json-store branch 2 times, most recently from 1ca00f7 to db7ba31 Compare February 26, 2025 15:24
@mergify mergify bot added the ci-failure PR has at least one CI failure label Feb 26, 2025
@mergify mergify bot removed the ci-failure PR has at least one CI failure label Feb 26, 2025
@dmartinol

@jwm4 updated the list of excluded fields as per your request

@mergify mergify bot added the ci-failure PR has at least one CI failure label Feb 26, 2025
@mergify mergify bot removed the ci-failure PR has at least one CI failure label Feb 26, 2025

@jwm4 jwm4 left a comment

This seems like a substantial improvement. It would be nice to get this in as soon as possible.

@booxter booxter self-requested a review February 27, 2025 18:49

@cdoern cdoern left a comment

Looks good to me. Do we want a test for this? Or is that already covered?

@mergify mergify bot added the one-approval PR has one approval from a maintainer label Mar 6, 2025
@nathan-weinberg nathan-weinberg added the hold In-progress PR. Tag should be removed before merge. label Mar 6, 2025
@mergify mergify bot removed the one-approval PR has one approval from a maintainer label Mar 6, 2025
@booxter booxter removed their request for review March 6, 2025 16:18
Signed-off-by: Daniele Martinoli <[email protected]>
@dmartinol

> Looks good to me. Do we want a test for this? Or is that already covered?

added UT, thanks

@nathan-weinberg nathan-weinberg removed the hold In-progress PR. Tag should be removed before merge. label Mar 7, 2025
@mergify mergify bot merged commit 66c7265 into instructlab:main Mar 7, 2025
29 checks passed
Labels

CI/CD Affects CI/CD configuration dependencies Relates to dependencies documentation Improvements or additions to documentation testing Relates to testing

Development

Successfully merging this pull request may close these issues.

Include metadata in ingested document chunks

6 participants