fix: ignore NotFound error of the non-first list during iter dir #4891
Greptile Summary
This PR fixes a race condition in the S3-like object store directory iteration functionality within Daft's I/O layer. The change modifies the iter_dir method in src/daft-io/src/object_io.rs to handle NotFound errors that can occur during paginated listing operations.
The core issue stems from S3's ListObjectsV2 API behavior. When dealing with large object stores that have experienced many delete operations, the API may return continuation tokens even when no more objects remain to be listed. This happens because S3's ListObjectsV2 was designed to prevent timeouts by returning partial results within time limits, but the presence of "delete tombstones" (markers for deleted objects) can cause subsequent requests with continuation tokens to return empty results and throw NotFound errors.
The fix wraps the continuation token-based listing calls in error handling logic that specifically catches NotFound errors during pagination and treats them as an indication that there are no more objects to list, rather than propagating them as failures. This allows the directory iteration to complete successfully by breaking out of the pagination loop when encountering these race conditions.
This change is part of Daft's object storage abstraction layer, which provides unified access to various cloud storage systems. The modification ensures that glob operations and other directory-based file discovery operations remain robust when working with S3-compatible storage systems that exhibit this specific API behavior.
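The pagination pattern described above can be sketched in miniature. This is a simplified, self-contained model of the fix, not Daft's actual code: the names `ls_page`, `ListPage`, and `ListError` are illustrative placeholders, and the real `iter_dir` is an async stream rather than a synchronous loop.

```rust
// Hypothetical minimal model of the fix; names are illustrative, not Daft's real types.
#[derive(Debug)]
enum ListError {
    NotFound,
}

struct ListPage {
    files: Vec<String>,
    continuation_token: Option<String>,
}

// Mock listing: the first page returns files plus a continuation token; the
// follow-up request with that token errors with NotFound, simulating the
// S3 ListObjectsV2 race this PR handles.
fn ls_page(token: Option<&str>) -> Result<ListPage, ListError> {
    match token {
        None => Ok(ListPage {
            files: vec!["a.parquet".to_string(), "b.parquet".to_string()],
            continuation_token: Some("tok-1".to_string()),
        }),
        Some(_) => Err(ListError::NotFound),
    }
}

fn iter_dir() -> Result<Vec<String>, ListError> {
    let mut out = Vec::new();
    // NotFound on the *first* request still propagates to the caller.
    let first = ls_page(None)?;
    out.extend(first.files);
    let mut continuation_token = first.continuation_token;
    while continuation_token.is_some() {
        match ls_page(continuation_token.as_deref()) {
            Ok(page) => {
                out.extend(page.files);
                continuation_token = page.continuation_token;
            }
            // NotFound during pagination is treated as "no more objects"
            // instead of being propagated as a failure.
            Err(ListError::NotFound) => continuation_token = None,
        }
    }
    Ok(out)
}

fn main() {
    let files = iter_dir().unwrap();
    println!("{files:?}");
}
```

The key asymmetry is that only continuation-token requests swallow NotFound; an initial listing that finds nothing still surfaces the error.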
PR Description Notes:
- Minor typo: "contine token" should be "continue token"
- Missing space in "listv1" and "listv2" (should be "list v1" and "list v2")
Confidence score: 4/5
- This is a targeted fix for a well-understood S3 API behavior issue with clear error handling logic.
- The change appropriately handles the specific race condition without affecting normal operation paths.
- The error handling logic is conservative and only ignores NotFound errors during continuation token requests, not initial requests.
1 file reviewed, 1 comment
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
desmondcheongzx
left a comment
Thanks for the fix! I have some thoughts here. I want to be careful that we're not just papering over some incorrect code we have at a lower level.
},
Err(err) => {
    if matches!(err, super::Error::NotFound { .. }) {
        continuation_token = None;
I did a little digging to see where the NotFound error originates from. It seems we're throwing it ourselves in S3LikeSource's ls method.
What strikes me as a little odd is that we ignore the possibility that the request could have a continuation token when the error is thrown. There's no reason that we can't have a series of empty responses with continuation tokens.
Here's another proposal. Let's handle all continuation tokens to ensure proper pagination. After all pagination is handled, if we didn't hit a result, we return a NotFound error. So something like:
diff --git a/src/daft-io/src/object_io.rs b/src/daft-io/src/object_io.rs
index 06ce45b7f..d879f246f 100644
--- a/src/daft-io/src/object_io.rs
+++ b/src/daft-io/src/object_io.rs
@@ -227,9 +227,11 @@ pub trait ObjectSource: Sync + Send {
io_stats: Option<IOStatsRef>,
) -> super::Result<BoxStream<super::Result<FileMetadata>>> {
let uri = uri.to_string();
+ let mut found_any = false;
let s = stream! {
let lsr = self.ls(&uri, posix, None, page_size, io_stats.clone()).await?;
for fm in lsr.files {
+ found_any = true;
yield Ok(fm);
}
@@ -238,9 +240,17 @@ pub trait ObjectSource: Sync + Send {
let lsr = self.ls(&uri, posix, continuation_token.as_deref(), page_size, io_stats.clone()).await?;
continuation_token.clone_from(&lsr.continuation_token);
for fm in lsr.files {
+ found_any = true;
yield Ok(fm);
}
}
+
+ if !found_any {
+ yield Err(super::Error::NotFound {
+ path: uri,
+ source: Box::new(std::io::Error::new(std::io::ErrorKind::NotFound, "Path not found")),
+ });
+ }
};
Ok(s.boxed())
}
diff --git a/src/daft-io/src/s3_like.rs b/src/daft-io/src/s3_like.rs
index 2b6aa15b7..376d38e7a 100644
--- a/src/daft-io/src/s3_like.rs
+++ b/src/daft-io/src/s3_like.rs
@@ -1276,7 +1269,7 @@ impl ObjectSource for S3LikeSource {
is.mark_list_requests(1);
}
- if lsr.files.is_empty() && key.contains(S3_DELIMITER) {
+ if lsr.files.is_empty() && lsr.continuation_token.is_none() && key.ends_with(S3_DELIMITER) {
let permit = self
.connection_pool_sema
.acquire()
@@ -1301,11 +1294,6 @@ impl ObjectSource for S3LikeSource {
}
let target_path = format!("{scheme}://{bucket}/{key}");
lsr.files.retain(|f| f.filepath == target_path);
-
- if lsr.files.is_empty() {
- // Isn't a file or a directory
- return Err(Error::NotFound { path: path.into() }.into());
- }
Ok(lsr)
} else {
Ok(lsr)
Would this resolve the issue you're seeing?
Thanks @desmondcheongzx for reviewing this PR.
I think your proposal is a workable solution, but it depends on the intended semantics/protocol of the ls method of ObjectSource, which we lack clear documentation for right now, especially for the case where the path does not exist.
Judging from the current implementation of each object store, the semantics of ls appear to be: return a NotFound error if the directory/file does not exist, and leave it to the caller to handle that variant of the Result.
If we change the semantics of ls to return an empty ListResult instead, every object source implementation would need to change its logic, and we should then document the semantics/protocol of each method of ObjectSource.
@desmondcheongzx may I check what you think: should we change the semantics of the ls method to return an empty ListResult instead of a NotFound error?
@stayrascal ideally I'd like to fix ls's behaviour. But I think you're right that because of the lack of documentation and the different object sources, this is a larger task. And I don't want this to keep blocking you from getting the fix you need (I do apologize for the delay here).
How about this - tbh I think the current change is safe enough to merge as is. We can create another ticket to solve the deeper issue, but let's unblock you for now.
Cool, thanks a lot. Yeah, we can improve the ls behavior of all object sources later and add clear documentation.
desmondcheongzx
left a comment
As mentioned in the discussion above, this fix looks safe. I do think there's a deeper issue to fix, which we can track here (#4982), but let's not keep blocking this fix.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## main #4891 +/- ##
==========================================
- Coverage 79.46% 76.71% -2.75%
==========================================
Files 896 918 +22
Lines 125473 127428 +1955
==========================================
- Hits 99702 97758 -1944
- Misses 25771 29670 +3899
Changes Made
Ignore the NotFound error during iter dir when we get an empty response after the first list operation.
Related Issues
The NotFound error was thrown during the second list request, which used the continuation token from the previous list response. That request got an empty response, so a NotFound error was thrown and propagated downstream, causing the glob operation to fail.
The reason the second list request (with a continuation token from the previous response) can return an empty response is that the list v2 API of S3-like object stores was designed to solve the timeout problem of list v1's listing behavior.
Assume we are trying to list 1000 keys among a huge number of objects. If many delete operations happened before the listing, delete tombstones can create "delete holes" that degrade list performance.
A next continuation token together with an is_truncated=true flag indicates that the response is truncated and the client should continue listing the remaining objects. However, list v2 does not guarantee that any remaining objects exist, because the underlying key-value store scan was cut short, so a later list request may return an empty response.
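The truncation behavior described above can be modeled with a small sketch. This is an assumed, simplified simulation of ListObjectsV2-style paging (the `Page` struct and `list_v2` function are hypothetical): a page may carry zero keys yet still set `is_truncated = true`, so a correct client keeps paginating until the flag clears.

```rust
// Hypothetical model of S3 ListObjectsV2 truncation; types and the page
// contents are illustrative, not a real S3 client.
struct Page {
    keys: Vec<String>,
    is_truncated: bool,
    next_token: Option<usize>,
}

// Simulated server: page 1 has keys, page 2 is empty because the scan's time
// budget was spent walking a "delete hole" of tombstones, page 3 has the rest.
fn list_v2(token: Option<usize>) -> Page {
    match token {
        None => Page { keys: vec!["k1".into()], is_truncated: true, next_token: Some(1) },
        // Empty page, but still truncated: the client must not stop here.
        Some(1) => Page { keys: vec![], is_truncated: true, next_token: Some(2) },
        _ => Page { keys: vec!["k2".into()], is_truncated: false, next_token: None },
    }
}

fn list_all() -> Vec<String> {
    let mut keys = Vec::new();
    let mut token = None;
    loop {
        let page = list_v2(token);
        keys.extend(page.keys);
        if !page.is_truncated {
            break;
        }
        token = page.next_token;
    }
    keys
}

fn main() {
    println!("{:?}", list_all());
}
```

In this model, treating the empty middle page as "not found" would drop `k2`; the flip side, which the PR addresses, is that a truncated page may also be the last one that ever returns data, so an empty continuation response must not be a hard error either.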