csi: skip not-ready cephfs subvolumes in ls and summarize errors #399
pkg/filesystem/subvolume.go
logging.Warning("Found pending clone: %q", sv.Name)
logging.Warning("Please delete the pending pv if any before deleting the subvolume %s", sv.Name)
logging.Warning("To avoid stale resources, please scale down the cephfs deployment before deleting the subvolume.")
if isSubvolumeNotReady(err) || errors.Is(err, syscall.EAGAIN) {
Is there a reason not to move the errors.Is(err, syscall.EAGAIN) into the isSubvolumeNotReady() function? It seems like it still makes sense as part of the not-ready detection.
done
pkg/filesystem/subvolume.go
if strings.Contains(msg, "Error EAGAIN") {
	return true
}
if strings.Contains(msg, "is not ready for operation") {
	return true
}
if strings.Contains(msg, "exit code 11") {
	// Ceph often maps EAGAIN to exit code 11 for subvolume 'info'
	return true
}
I have a couple thoughts related here:
- As a general statement, error handling is fragile when it relies on specific error message text, and we should avoid it where it's not required
- Ceph doesn't have very good error code returns, and so there are times when Rook does need to do message text checking
Overall, what I would prefer to see is us checking return code ints or error types as much as possible. When we can't, we should string-match on specific error code names, which are unlikely to change across Ceph versions or in Ceph builds localized to another language.
If we include the outer errors.Is(err, syscall.EAGAIN), we actually have the start of a good pattern:
if errors.Is(err, syscall.EAGAIN) {
	return true
}
if strings.Contains(msg, "EAGAIN") { // I removed "Error" since "EAGAIN" is the critical string
	return true
}
In this pattern, we do our best to identify the error by its type using errors.Is(). If that somehow fails, we can fall back to looking for the EAGAIN string.
Also of note, EAGAIN is code 11 -- sometimes output as -11 by Ceph.
In order to make Rook as robust as possible, we should try to do as much non-string checking as possible before resorting to string checking.
Things we can do better here:
- Always leave string checking for last
- Try to figure out what error type(s) are actually returned. This isn't always straightforward. When repro'ing an issue, I print the type of the returned error using %T. That will give one data point. I also inspect the called code (even dependencies) and see what errors they build. I tried that here but didn't see anything very helpful. With all that, it's possible to have a reasonable idea of what different error types might be getting returned. If any of them codify the integer form of the return code, we should try to get it.
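As an illustration of the %T technique described above, here is a minimal Go sketch that prints the dynamic type of every error in a wrap chain. It is not code from this PR: `describeChain` and `sampleError` are hypothetical names, and the sample error only assumes that EAGAIN might arrive wrapped with `%w`.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// describeChain returns one line per error in the chain, outermost first,
// showing the dynamic type (%T) and the message of each layer.
// Note: errors.Unwrap only follows single-error chains; errors built with
// errors.Join expose Unwrap() []error and would stop the walk.
func describeChain(err error) []string {
	var lines []string
	for e := err; e != nil; e = errors.Unwrap(e) {
		lines = append(lines, fmt.Sprintf("%T: %v", e, e))
	}
	return lines
}

// sampleError builds a hypothetical failure: EAGAIN wrapped once with %w.
func sampleError() error {
	return fmt.Errorf("failed to get subvolume info: %w", syscall.EAGAIN)
}

func main() {
	for _, line := range describeChain(sampleError()) {
		fmt.Println(line)
	}
}
```

Each printed type is one data point about what `errors.Is` or `errors.As` could match against.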
At the end, the rough flow might look something like:
1. If error Is(err, syscall.EAGAIN)
2. If error.(type).Code == 11 || error.(type).Code == -11 (if code extraction is possible)
3. If error.Error() string contains "EAGAIN"
4. If error.Error() string contains "exit code 11"
1 and 2 are the most ideal checks we can do. 3 is a decent fallback. 4 isn't great, but I am guessing that exit code %d is a wrapped message from inside a dependency somewhere that isn't likely to change over time.
I would suggest removing the string check for is not ready for operation unless we really, really need that fallback. That text seems somewhat likely to change over enough time.
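The four-step flow above could be sketched in Go roughly as follows. This is an assumption-laden illustration, not the PR's final code: `isNotReady`, `exitStatuser`, and `fakeExitError` are hypothetical names, and the sketch assumes the underlying error exposes its code through an `ExitStatus() int` method.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"syscall"
)

// eagainCode is the numeric value the review gives for EAGAIN.
const eagainCode = 11

// exitStatuser is a hypothetical local interface for errors that carry
// a numeric exit code.
type exitStatuser interface {
	ExitStatus() int
}

// fakeExitError is a stand-in type used only to exercise the sketch.
type fakeExitError struct{ code int }

func (e fakeExitError) Error() string {
	return fmt.Sprintf("command terminated with exit code %d", e.code)
}
func (e fakeExitError) ExitStatus() int { return e.code }

// isNotReady applies the checks from most to least robust.
func isNotReady(err error) bool {
	if err == nil {
		return false
	}
	// 1. Typed check.
	if errors.Is(err, syscall.EAGAIN) {
		return true
	}
	// 2. Numeric code, if one can be extracted (11, or -11 as Ceph sometimes prints).
	var es exitStatuser
	if errors.As(err, &es) {
		if c := es.ExitStatus(); c == eagainCode || c == -eagainCode {
			return true
		}
	}
	// 3. Error code name in the message.
	msg := err.Error()
	if strings.Contains(msg, "EAGAIN") {
		return true
	}
	// 4. Last resort: wrapped exit-code text.
	return strings.Contains(msg, "exit code 11")
}

func main() {
	samples := []error{
		fmt.Errorf("subvolume info: %w", syscall.EAGAIN),
		fakeExitError{code: -11},
		errors.New("Error EAGAIN: try again"),
		fakeExitError{code: 22},
	}
	for _, err := range samples {
		fmt.Printf("%v\t%v\n", isNotReady(err), err)
	}
}
```

One caveat with step 4: a plain substring match on "exit code 11" would also match text such as "exit code 110", which is another reason to keep it last.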
I manually tested the exitCodeFromError function.
To force a non-existent subcommand and verify exit-code extraction, I changed the args:
// Before
args := []string{"fs", "subvolume", "info", fsName, SubVol, SubvolumeGroup, "--format", "json"}
// After (intentional typo to trigger failure)
args := []string{"fs", "subvolume", "info1", fsName, SubVol, SubvolumeGroup, "--format", "json"}
Result: the helper correctly extracted exit code 22.
errvol (trimmed):
no valid command found; 10 closest matches:
fs subvolume ls <vol_name> [<group_name>]
fs subvolume create <vol_name> <sub_name> ...
...
failed to run command. command terminated with exit code 22
(underlying: k8s.io/client-go/util/exec.CodeExitError)
ExitStatus():
k8s.io/client-go/util/exec.CodeExitError{Code:22}
Parsed code:
22
This demonstrates exitCodeFromError handles wrapped errors (e.g., CodeExitError) and reliably returns the numeric exit code.
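The body of exitCodeFromError is not shown in this thread. As a hypothetical sketch of how such a helper could recover the code from a wrapped error, the following uses `errors.As` against an `ExitStatus() int` method. The names `extractExitCode`, `codeExitError`, and `wrappedSample` are illustrative; `codeExitError` is a local stand-in shaped like the CodeExitError value printed above, not the real type.

```go
package main

import (
	"errors"
	"fmt"
)

// codeExitError is a local stand-in with a Code field and an ExitStatus
// method, mirroring the shape of the value printed in the test notes.
type codeExitError struct {
	Err  error
	Code int
}

func (e codeExitError) Error() string   { return e.Err.Error() }
func (e codeExitError) ExitStatus() int { return e.Code }

// extractExitCode walks the wrap chain and returns the numeric exit code
// of the first error that exposes ExitStatus(). The bool reports whether
// a code was found at all.
func extractExitCode(err error) (int, bool) {
	var es interface{ ExitStatus() int }
	if errors.As(err, &es) {
		return es.ExitStatus(), true
	}
	return 0, false
}

// wrappedSample builds a hypothetical failure wrapped one level deep.
func wrappedSample(code int) error {
	inner := codeExitError{
		Err:  fmt.Errorf("command terminated with exit code %d", code),
		Code: code,
	}
	return fmt.Errorf("failed to run command. %w", inner)
}

func main() {
	code, ok := extractExitCode(wrappedSample(22))
	fmt.Println(code, ok)
}
```

Returning a separate bool keeps "no code available" distinct from a genuine exit code of 0.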
Force-pushed from db2bcfd to 6bd9130.
Commit message:
- avoid aborting when ceph reports eagain for subvolume info
- list healthy entries; skip not-ready ones and keep going
- collect skipped items and print a brief summary warning
- tidy error text in getSubvolumeState to reduce extra noise
Signed-off-by: Oded Viner <[email protected]>
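The skip-and-summarize behavior in this commit message could look roughly like the Go sketch below. It is an illustration under assumptions, not the PR's code: `listSubvolumes`, `skippedSummary`, `fakeInfo`, and the `errNotReady` sentinel are hypothetical, and aborting on any other error is assumed rather than confirmed by the thread.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errNotReady is a hypothetical sentinel standing in for the not-ready detection.
var errNotReady = errors.New("subvolume not ready")

// fakeInfo simulates the per-subvolume info call; names ending in
// "-cloning" are reported as not ready.
func fakeInfo(name string) (string, error) {
	if strings.HasSuffix(name, "-cloning") {
		return "", fmt.Errorf("subvolume %q: %w", name, errNotReady)
	}
	return "complete", nil
}

// listSubvolumes returns the healthy entries and the names that were
// skipped because they were not ready. Any other error aborts the listing.
func listSubvolumes(names []string, info func(string) (string, error)) ([]string, []string, error) {
	var ready, skipped []string
	for _, name := range names {
		state, err := info(name)
		switch {
		case errors.Is(err, errNotReady):
			skipped = append(skipped, name) // keep going
		case err != nil:
			return nil, nil, fmt.Errorf("failed to get info for %q: %w", name, err)
		default:
			ready = append(ready, name+" ("+state+")")
		}
	}
	return ready, skipped, nil
}

// skippedSummary builds one brief warning line, or "" if nothing was skipped.
func skippedSummary(skipped []string) string {
	if len(skipped) == 0 {
		return ""
	}
	return fmt.Sprintf("skipped %d not-ready subvolume(s): %s",
		len(skipped), strings.Join(skipped, ", "))
}

func main() {
	ready, skipped, err := listSubvolumes([]string{"sv-a", "sv-b-cloning", "sv-c"}, fakeInfo)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range ready {
		fmt.Println(r)
	}
	if s := skippedSummary(skipped); s != "" {
		fmt.Println("warning:", s)
	}
}
```

Collecting the skipped names and warning once at the end avoids one warning per entry.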
Force-pushed from 6bd9130 to 8430d50.
The error handling code here looks much more robust. I am wondering if you were able to determine which code path is followed when running the command in manual testing. That isn't of critical importance as long as we know it works, but it would be good to have a note of it.
I also see this comment describing the testing. Very thorough notes; thank you.
LGTM