csi: skip not-ready cephfs subvolumes in ls and summarize errors #399

OdedViner · 2025-10-19T13:08:27Z

avoid aborting when Ceph reports EAGAIN for subvolume info
list healthy entries; skip not-ready ones and keep going
collect skipped items and print a brief summary warning
tidy error text in getSubvolumeState to reduce extra noise
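
For readers skimming the description, the skip-and-summarize behavior can be pictured roughly as follows. This is only a sketch under assumptions, not the code in this PR: subvolumeInfo, infoFunc, listReady, and fakeInfo are hypothetical names, and the not-ready check is reduced to errors.Is(err, syscall.EAGAIN).

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// subvolumeInfo is a hypothetical stand-in for the state read from
// `ceph fs subvolume info`.
type subvolumeInfo struct {
	Name  string
	State string
}

// infoFunc is a hypothetical lookup hook so the sketch runs without a cluster.
type infoFunc func(name string) (subvolumeInfo, error)

// listReady returns info for every subvolume that could be queried, plus the
// names of those skipped because the lookup failed with EAGAIN. Any other
// error still aborts the listing.
func listReady(names []string, info infoFunc) (ready []subvolumeInfo, skipped []string, err error) {
	for _, name := range names {
		sv, infoErr := info(name)
		if infoErr != nil {
			if errors.Is(infoErr, syscall.EAGAIN) {
				skipped = append(skipped, name)
				continue
			}
			return nil, nil, fmt.Errorf("failed to get info for subvolume %q: %w", name, infoErr)
		}
		ready = append(ready, sv)
	}
	return ready, skipped, nil
}

// fakeInfo simulates a cluster in which "clone-1" is not ready yet.
func fakeInfo(name string) (subvolumeInfo, error) {
	if name == "clone-1" {
		return subvolumeInfo{}, fmt.Errorf("subvolume info: %w", syscall.EAGAIN)
	}
	return subvolumeInfo{Name: name, State: "complete"}, nil
}

func main() {
	ready, skipped, err := listReady([]string{"vol-a", "clone-1", "vol-b"}, fakeInfo)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, sv := range ready {
		fmt.Println(sv.Name, sv.State)
	}
	if len(skipped) > 0 {
		// One summary line instead of one warning per skipped entry.
		fmt.Printf("warning: skipped %d not-ready subvolume(s): %v\n", len(skipped), skipped)
	}
}
```

Collecting the skipped names and warning once at the end keeps the listing output readable when many clones are still pending.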

Checklist:

Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
Reviewed the developer guide on Submitting a Pull Request
Documentation has been updated, if necessary.
Unit tests have been added, if necessary.
Integration tests have been added, if necessary.

BlaineEXE · 2025-10-21T18:33:13Z

pkg/filesystem/subvolume.go

-						logging.Warning("Found pending clone: %q", sv.Name)
-						logging.Warning("Please delete the pending pv if any before deleting the subvolume %s", sv.Name)
-						logging.Warning("To avoid stale resources, please scale down the cephfs deployment before deleting the subvolume.")
+					if isSubvolumeNotReady(err) || errors.Is(err, syscall.EAGAIN) {


Is there a reason not to move the errors.Is(err, syscall.EAGAIN) into the isSubvolumeNotReady() function? It seems like it still makes sense as part of the not-ready detection.

BlaineEXE · 2025-10-21T18:57:48Z

pkg/filesystem/subvolume.go

+	if strings.Contains(msg, "Error EAGAIN") {
+		return true
+	}
+	if strings.Contains(msg, "is not ready for operation") {
+		return true
+	}
+	if strings.Contains(msg, "exit code 11") {
+		// Ceph often maps EAGAIN to exit code 11 for subvolume 'info'
+		return true
+	}


I have a couple thoughts related here:

As a general statement, error handling is fragile when it relies on specific error message text, and we should avoid it where it's not required

Ceph doesn't have very good error code returns, and so there are times when Rook does need to do message text checking

Overall, what I would prefer to see is us checking return code ints or error types as much as possible. When we can't, we should try string matching for specific err code names that are not likely to change across Ceph versions or with different localized Ceph builds in a different language.

If we include the outer errors.Is(err, syscall.EAGAIN), we actually have the start of a good pattern

if errors.Is(err, syscall.EAGAIN) {
	return true
}
if strings.Contains(msg, "EAGAIN") { // I removed "Error" since "EAGAIN" is the critical string
	return true
}

In this pattern, we do our best to identify the error using the error type using errors.Is(). If that somehow fails, we can fall back to looking for the EAGAIN string.

Also of note, EAGAIN is code 11 -- sometimes output as -11 by Ceph.

In order to make Rook as robust as possible, we should try to do as much non-string checking as possible before resorting to string checking.

Things we can do better here:

Leave string equality checking for last always

Try to figure out what error type(s) is actually returned. This isn't always straightforward. When repro'ing an issue, I print the type of the returned error using %T. That will give one data point. I also inspect the called code (even dependencies) and see what errors they build. I tried that here but didn't see anything very helpful. With all that, it's possible to have a reasonable idea of what different error types might be getting returned. If any of them codify the integer form of the return code, we should try to get it. For example.
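
One way to do the %T inspection described above is to walk the wrap chain and print the type of every link rather than only the outermost error. The sketch below is illustrative: errorChainTypes and sampleError are hypothetical names, and the sample error is made up rather than captured from a real Ceph call.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// errorChainTypes walks err with errors.Unwrap and records the dynamic type
// of each link, outermost first. It does not follow errors that wrap several
// errors at once (those expose Unwrap() []error instead).
func errorChainTypes(err error) []string {
	var types []string
	for err != nil {
		types = append(types, fmt.Sprintf("%T", err))
		err = errors.Unwrap(err)
	}
	return types
}

// sampleError builds a made-up wrapped error for the demonstration.
func sampleError() error {
	return fmt.Errorf("failed to get subvolume info: %w", syscall.EAGAIN)
}

func main() {
	for i, t := range errorChainTypes(sampleError()) {
		fmt.Printf("%d: %s\n", i, t)
	}
}
```

Running something like this once while reproducing an issue shows which concrete types are available to match with errors.Is or errors.As.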

At the end, the rough flow might look something like:

1. If errors.Is(err, syscall.EAGAIN)

2. If error.(type).Code == 11 || error.(type).Code == -11 (if code extraction is possible)

3. If error.Error() string contains "EAGAIN"

4. If error.Error() string contains "exit code 11"

1 and 2 are the most ideal checks we can do. 3 is a decent fallback. 4 isn't great, but I am guessing that exit code %d is a wrapped message from inside a dependency somewhere that isn't likely to change over time.

I would suggest removing the string check for is not ready for operation unless we really, really need that fallback. That text seems somewhat likely to change over enough time.
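
The four-step flow listed above could look roughly like this in Go. It is a sketch under assumptions rather than the final helper: isEAGAIN, exitStatuser, and fakeExitError are hypothetical names, the numeric check assumes some error in the chain exposes an ExitStatus() int method, the sample messages are invented, and the "is not ready for operation" check is left out as suggested.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"syscall"
)

// exitStatuser is an assumption: it matches any error type that exposes a
// numeric code through an ExitStatus() int method.
type exitStatuser interface {
	ExitStatus() int
}

// fakeExitError is a test double that carries both a message and a code.
type fakeExitError struct {
	msg  string
	code int
}

func (e fakeExitError) Error() string   { return e.msg }
func (e fakeExitError) ExitStatus() int { return e.code }

// isEAGAIN reports whether err looks like EAGAIN, trying typed checks first
// and falling back to message text only at the end.
func isEAGAIN(err error) bool {
	if err == nil {
		return false
	}
	// 1. Typed match anywhere in the wrap chain.
	if errors.Is(err, syscall.EAGAIN) {
		return true
	}
	// 2. Numeric code, when an error in the chain carries one.
	// 11 is the EAGAIN value quoted in this review; Ceph may report -11.
	var es exitStatuser
	if errors.As(err, &es) {
		if code := es.ExitStatus(); code == 11 || code == -11 {
			return true
		}
	}
	// 3 and 4. Text matching, kept for last. Note that "exit code 11" is a
	// substring match and would also hit a message such as "exit code 110".
	msg := err.Error()
	return strings.Contains(msg, "EAGAIN") || strings.Contains(msg, "exit code 11")
}

func main() {
	samples := []error{
		fmt.Errorf("subvolume info: %w", syscall.EAGAIN),
		fakeExitError{msg: "command terminated with exit code 11", code: 11},
		fakeExitError{msg: "Error EAGAIN: clone in progress", code: 1},
		fakeExitError{msg: "command terminated with exit code 22", code: 22},
	}
	for _, err := range samples {
		fmt.Printf("%q -> %v\n", err.Error(), isEAGAIN(err))
	}
}
```

Keeping the string checks in a single final return makes it easy to see, and later delete, the fragile part of the detection.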

OdedViner · 2025-10-23

I manually tested the exitCodeFromError function.

To force a non-existent subcommand and verify exit-code extraction, I changed the args:

// Before
args := []string{"fs", "subvolume", "info", fsName, SubVol, SubvolumeGroup, "--format", "json"}

// After (intentional typo to trigger failure)
args := []string{"fs", "subvolume", "info1", fsName, SubVol, SubvolumeGroup, "--format", "json"}

Result: the helper correctly extracted exit code 22.
errvol (trimmed):

no valid command found; 10 closest matches:
fs subvolume ls <vol_name> [<group_name>]
fs subvolume create <vol_name> <sub_name> ...
...
failed to run command. command terminated with exit code 22 (underlying: k8s.io/client-go/util/exec.CodeExitError)

ExitStatus():

k8s.io/client-go/util/exec.CodeExitError{Code:22}

Parsed code:
22

This demonstrates exitCodeFromError handles wrapped errors (e.g., CodeExitError) and reliably returns the numeric exit code.
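
For reference, extracting a code from a wrapped error of this kind can be sketched as below. This is a hypothetical reimplementation, not the helper from the PR: codeExitError is a local stand-in modeled on the k8s.io/client-go/util/exec.CodeExitError value shown above so that the example has no external dependency, and wrappedExit22 is an invented fixture that mimics the exit code 22 case.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// codeExitError is a local stand-in for an error type that carries an exit
// code and exposes it through ExitStatus().
type codeExitError struct {
	Err  error
	Code int
}

func (e codeExitError) Error() string   { return e.Err.Error() }
func (e codeExitError) ExitStatus() int { return e.Code }

// exitCodeFromError (hypothetical sketch) searches the wrap chain for an
// error that carries an exit code and reports whether one was found.
func exitCodeFromError(err error) (int, bool) {
	// Errors produced directly by os/exec.
	var ee *exec.ExitError
	if errors.As(err, &ee) {
		return ee.ExitCode(), true
	}
	// Any other type exposing ExitStatus() int.
	var es interface{ ExitStatus() int }
	if errors.As(err, &es) {
		return es.ExitStatus(), true
	}
	return 0, false
}

// wrappedExit22 builds a wrapped error similar in shape to the manual test.
func wrappedExit22() error {
	inner := codeExitError{Err: errors.New("command terminated with exit code 22"), Code: 22}
	return fmt.Errorf("failed to run command. %w", inner)
}

func main() {
	code, ok := exitCodeFromError(wrappedExit22())
	fmt.Println(code, ok)
}
```

Because errors.As walks the chain, the extra fmt.Errorf wrapping does not hide the code, which matches the behavior observed in the manual test.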

avoid aborting when Ceph reports EAGAIN for subvolume info
list healthy entries; skip not-ready ones and keep going
collect skipped items and print a brief summary warning
tidy error text in getSubvolumeState to reduce extra noise

Signed-off-by: Oded Viner <[email protected]>

BlaineEXE

The error handling code here looks much more robust. I am wondering if you were able to determine which code path is followed when running the command in manual testing. That isn't critical as long as we know it works, but it would be good to have a note of it.

I also see this comment describing the testing. Very thorough notes; thank you.

LGTM

OdedViner requested review from subhamkrai and yati1998 October 19, 2025 13:08

BlaineEXE reviewed Oct 21, 2025

OdedViner force-pushed the skip_subvol_get_info_err branch 2 times, most recently from db2bcfd to 6bd9130 Compare October 23, 2025 13:28

OdedViner force-pushed the skip_subvol_get_info_err branch from 6bd9130 to 8430d50 Compare October 23, 2025 13:29

BlaineEXE reviewed Oct 23, 2025
