@OdedViner
Contributor

  • avoid aborting when ceph reports eagain for subvolume info
  • list healthy entries; skip not-ready ones and keep going
  • collect skipped items and print a brief summary warning
  • tidy error text in getSubvolumeState to reduce extra noise
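A rough sketch of the skip-and-continue listing described in the bullets above. The names `errNotReady`, `listReady`, and `fakeGetInfo` are hypothetical and are not taken from this PR. Aborting on any error other than not-ready is an assumption made for the sketch, not confirmed behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotReady is a hypothetical sentinel standing in for "Ceph reported EAGAIN".
var errNotReady = errors.New("subvolume not ready")

// fakeGetInfo is a hypothetical stand-in for the real subvolume info call.
func fakeGetInfo(name string) error {
	if name == "sv-cloning" {
		return fmt.Errorf("info for %q: %w", name, errNotReady)
	}
	return nil
}

// listReady queries every name, keeps the ones that succeed, and collects
// the not-ready ones instead of aborting. Any other error stops the loop
// (an assumption made for this sketch).
func listReady(names []string, getInfo func(string) error) (ready, skipped []string, err error) {
	for _, n := range names {
		infoErr := getInfo(n)
		switch {
		case infoErr == nil:
			ready = append(ready, n)
		case errors.Is(infoErr, errNotReady):
			skipped = append(skipped, n)
		default:
			return ready, skipped, fmt.Errorf("failed to get info for %q: %w", n, infoErr)
		}
	}
	return ready, skipped, nil
}

func main() {
	ready, skipped, err := listReady([]string{"sv-a", "sv-cloning", "sv-b"}, fakeGetInfo)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("ready:", ready)
	if len(skipped) > 0 {
		// One brief summary instead of a warning per skipped entry.
		fmt.Printf("warning: skipped %d not-ready subvolume(s): %v\n", len(skipped), skipped)
	}
}
```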

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

logging.Warning("Found pending clone: %q", sv.Name)
logging.Warning("Please delete the pending pv if any before deleting the subvolume %s", sv.Name)
logging.Warning("To avoid stale resources, please scale down the cephfs deployment before deleting the subvolume.")
if isSubvolumeNotReady(err) || errors.Is(err, syscall.EAGAIN) {
Member

Is there a reason not to move the errors.Is(err, syscall.EAGAIN) into the isSubvolumeNotReady() function? It seems like it still makes sense as part of the not-ready detection.

Contributor Author

done

Comment on lines 348 to 358
if strings.Contains(msg, "Error EAGAIN") {
   return true
}
if strings.Contains(msg, "is not ready for operation") {
   return true
}
if strings.Contains(msg, "exit code 11") {
   // Ceph often maps EAGAIN to exit code 11 for subvolume 'info'
   return true
}
Member

I have a couple of related thoughts here:

  1. As a general statement, error handling is fragile when it relies on specific error message text, and we should avoid it where it's not required
  2. Ceph doesn't have very good error code returns, and so there are times when Rook does need to do message text checking

Overall, what I would prefer to see is us checking return code ints or error types as much as possible. When we can't, we should try string matching for specific err code names that are not likely to change across Ceph versions or with different localized Ceph builds in a different language.

If we include the outer errors.Is(err, syscall.EAGAIN), we actually have the start of a good pattern:

if errors.Is(err, syscall.EAGAIN) {
   return true
}
if strings.Contains(msg, "EAGAIN") { // I removed "Error" since "EAGAIN" is the critical string
   return true
}

In this pattern, we do our best to identify the error by its type with errors.Is(). If that somehow fails, we can fall back to looking for the EAGAIN string.

Also of note, EAGAIN is code 11 -- sometimes output as -11 by Ceph.

In order to make Rook as robust as possible, we should try to do as much non-string checking as possible before resorting to string checking.

Things we can do better here:

  1. Always leave string matching for last
  2. Try to figure out what error type(s) are actually returned. This isn't always straightforward. When reproducing an issue, I print the type of the returned error using %T. That will give one data point. I also inspect the called code (even dependencies) and see what errors they build. I tried that here but didn't see anything very helpful. With all that, it's possible to have a reasonable idea of what different error types might be getting returned. If any of them codify the integer form of the return code, we should try to get it. For example.
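As a small illustration of the %T inspection mentioned in item 2, the sketch below walks an error's wrap chain and records the dynamic type at each level. The helper name `describeErrorChain` and the sample error are hypothetical. The wrapped syscall.EAGAIN only simulates what a failing command might return.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// sampleErr simulates a wrapped error; the real error returned by the
// command may have a different shape.
var sampleErr = fmt.Errorf("failed to get subvolume info: %w", syscall.EAGAIN)

// describeErrorChain returns one "<type>: <message>" line per error in the
// wrap chain, outermost first. It follows single-error Unwrap() only;
// errors that wrap several errors at once are not expanded.
func describeErrorChain(err error) []string {
	var lines []string
	for e := err; e != nil; e = errors.Unwrap(e) {
		lines = append(lines, fmt.Sprintf("%T: %v", e, e))
	}
	return lines
}

func main() {
	for _, line := range describeErrorChain(sampleErr) {
		fmt.Println(line)
	}
}
```

Each printed type is a candidate for an errors.As target, which is how a numeric code could be pulled out instead of matching on message text.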

At the end, the rough flow might look something like:

  1. If errors.Is(err, syscall.EAGAIN)
  2. If err.(type).Code == 11 || err.(type).Code == -11 (if code extraction is possible)
  3. If the err.Error() string contains "EAGAIN"
  4. If the err.Error() string contains "exit code 11"

Checks 1 and 2 are the best we can do. Check 3 is a decent fallback. Check 4 isn't great, but I am guessing that "exit code %d" is a wrapped message from inside a dependency somewhere that isn't likely to change over time.
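The four-step flow above could be sketched roughly as follows. This is not the code merged in this PR: `codeError` is a hypothetical stand-in for whatever error type carries the numeric exit code, and the body of `isSubvolumeNotReady` is only one possible arrangement of the checks.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"syscall"
)

// codeError is a hypothetical stand-in for an error type that records the
// exit code of a failed command.
type codeError struct {
	Code int
}

func (e codeError) Error() string {
	return fmt.Sprintf("command terminated with exit code %d", e.Code)
}

// isSubvolumeNotReady applies the checks from most to least reliable.
func isSubvolumeNotReady(err error) bool {
	if err == nil {
		return false
	}
	// 1. Typed check: works when the errno survives in the wrap chain.
	if errors.Is(err, syscall.EAGAIN) {
		return true
	}
	// 2. Numeric check: EAGAIN is 11, sometimes reported as -11.
	var ce codeError
	if errors.As(err, &ce) && (ce.Code == 11 || ce.Code == -11) {
		return true
	}
	// 3. String fallback on the error code name.
	msg := err.Error()
	if strings.Contains(msg, "EAGAIN") {
		return true
	}
	// 4. Last resort. Note that a plain substring match would also accept
	// longer codes such as "exit code 110".
	return strings.Contains(msg, "exit code 11")
}

func main() {
	wrapped := fmt.Errorf("subvolume info failed: %w", codeError{Code: 11})
	fmt.Println(isSubvolumeNotReady(wrapped))
	fmt.Println(isSubvolumeNotReady(codeError{Code: 22}))
	fmt.Println(isSubvolumeNotReady(errors.New("Error EAGAIN: clone in progress")))
}
```

Keeping the string checks at the bottom means they only decide the outcome when no typed or numeric information is available.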

I would suggest removing the string check for "is not ready for operation" unless we really, really need that fallback. That text seems somewhat likely to change over enough time.

Contributor Author

@OdedViner OdedViner Oct 23, 2025

I manually tested the exitCodeFromError function.

To force a non-existent subcommand and verify exit-code extraction, I changed the args:

// Before
args := []string{"fs", "subvolume", "info", fsName, SubVol, SubvolumeGroup, "--format", "json"}

// After (intentional typo to trigger failure)
args := []string{"fs", "subvolume", "info1", fsName, SubVol, SubvolumeGroup, "--format", "json"}

Result: the helper correctly extracted exit code 22.
errvol (trimmed):

no valid command found; 10 closest matches:
fs subvolume ls <vol_name> [<group_name>]
fs subvolume create <vol_name> <sub_name> ...
...
failed to run command. command terminated with exit code 22
(underlying: k8s.io/client-go/util/exec.CodeExitError)

ExitStatus():

k8s.io/client-go/util/exec.CodeExitError{Code:22}

Parsed code:
22

This demonstrates exitCodeFromError handles wrapped errors (e.g., CodeExitError) and reliably returns the numeric exit code.
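The implementation of exitCodeFromError is not shown in this thread. A minimal sketch of how such a helper might work, assuming the underlying error exposes an ExitStatus() int method as the CodeExitError output above suggests, is given below. `exitStatuser` and `fakeExitError` are hypothetical names used so the example needs no external dependency.

```go
package main

import (
	"errors"
	"fmt"
)

// exitStatuser matches any error that can report a numeric exit status.
// This is a hypothetical local interface, not one taken from the PR.
type exitStatuser interface {
	error
	ExitStatus() int
}

// fakeExitError is a hypothetical stand-in for an error such as
// k8s.io/client-go/util/exec.CodeExitError.
type fakeExitError struct {
	Code int
}

func (e fakeExitError) Error() string {
	return fmt.Sprintf("command terminated with exit code %d", e.Code)
}

func (e fakeExitError) ExitStatus() int { return e.Code }

// exitCodeFromError searches the wrap chain for an error carrying an exit
// status. The boolean is false when no such error is found.
func exitCodeFromError(err error) (int, bool) {
	var es exitStatuser
	if errors.As(err, &es) {
		return es.ExitStatus(), true
	}
	return 0, false
}

func main() {
	wrapped := fmt.Errorf("failed to run command: %w", fakeExitError{Code: 22})
	code, ok := exitCodeFromError(wrapped)
	fmt.Println(code, ok)

	_, ok = exitCodeFromError(errors.New("plain error without a code"))
	fmt.Println(ok)
}
```

Because errors.As walks the wrap chain, the helper finds the code even when the command error has been wrapped with extra context.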

@OdedViner OdedViner force-pushed the skip_subvol_get_info_err branch 2 times, most recently from db2bcfd to 6bd9130 Compare October 23, 2025 13:28
avoid aborting when ceph reports eagain for subvolume info
list healthy entries; skip not-ready ones and keep going
collect skipped items and print a brief summary warning
tidy error text in getSubvolumeState to reduce extra noise

Signed-off-by: Oded Viner <[email protected]>
@OdedViner OdedViner force-pushed the skip_subvol_get_info_err branch from 6bd9130 to 8430d50 Compare October 23, 2025 13:29
Member

@BlaineEXE BlaineEXE left a comment

The error handling code here looks much more robust. I am wondering if you were able to determine which code path is followed when running the command in manual testing. That isn't of critical importance as long as we know it works, but it could be good to have a note of.

I also see this comment describing the testing. Very thorough notes; thank you.

LGTM
