Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

@HenryL27
Copy link
Collaborator

@HenryL27 HenryL27 commented May 2, 2025

Intent: add ability to materialize from a list of doc_ids

doc_ids = ["doc1", "doc2", ...]
ctx.read.materialize(path={"root": "materializedir", "filter": DocIdFilter(doc_ids)})

Resulting changes:

  1. Group functions to translate between document, doc_id, materialize filename to what's compatible with what.
  2. Plumb it through materialize and mrr
  3. Make mrr work with a pre-existing materialize filter
  4. Make mrr keep metadata documents (unrelated to this particular thing but I think the correct behavior)
  5. A couple more hacks

@HenryL27 HenryL27 requested a review from dhruvkaliraman7 May 2, 2025 20:58
HenryL27 added 2 commits May 2, 2025 14:17
Signed-off-by: Henry Lindeman <[email protected]>
Signed-off-by: Henry Lindeman <[email protected]>
Copy link
Contributor

@dhruvkaliraman7 dhruvkaliraman7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few nits, nice work!

assert len(self.metadata["lineage_links"]["from_ids"]) > 0

self.data["doc_id"] = mkdocid()
if "doc_id" not in self.data:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

o/w metadata documents get a new docid every time they're materialized (as materialize call the constructor) which I'm pretty sure is the incorrect behavior


# Create a function that fails for specific documents
def failing_map(doc):
# logger.info(doc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

ctx = sycamore.init(exec_mode=self.exec_mode)

with tempfile.TemporaryDirectory() as tmpdir1, tempfile.TemporaryDirectory() as tmpdir2:
docs = make_docs(10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the end of this test can we assert that we have 11 docs? The make_docs should give you 10 Documents and 1 MetadataDocument

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depending on how docs get batched through noop I can't guarantee that there will be 11 but I can assert that there are at least one

return d


logger = logging.getLogger(__name__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

ret = []
count = 0
for fi in self._fshelper.list_files(self._root):
# logger.info(fi)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Signed-off-by: Henry Lindeman <[email protected]>
@HenryL27 HenryL27 merged commit b6ba67d into main May 5, 2025
11 of 15 checks passed
@HenryL27 HenryL27 deleted the hml-materialize-docids branch August 30, 2025 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants