zedagent: refactor LPS handling into LocalCmdAgent component #5191
@christoph-zededa By the way, in my refactoring I added a return statement here: https://github.com/milan-zededa/eve/blob/zedagent-lps-refactor/pkg/pillar/cmd/zedagent/localcommand.go#L43
Sounds good. Once it is merged, I will do a backport of that fix.
issue has been found here: lf-edge#5191 (comment) Signed-off-by: Christoph Ostarek <[email protected]>
New docs are super useful, thanks! Could you please link them in https://github.com/lf-edge/eve/blob/master/docs/mkdocs/mkdocs.yml ?
Sure, added a link to mkdocs.yml.
	throttled bool
}

// newTaskTicker creates a new taskTicker with a randomized firing interval.
The research paper from the 1980s says you need to randomize the interval by a factor of 0.5 to 1.5 to avoid synchronization.
I can put anything there; in this PR I just kept what was already there:
- https://github.com/lf-edge/eve/blob/master/pkg/pillar/cmd/zedagent/localinfo.go#L49-L50
- https://github.com/lf-edge/eve/blob/master/pkg/pillar/cmd/zedagent/localinfo.go#L521-L522
- etc. (at least it is now only in one place for all LPS tasks, so we do not need to change it in many places)
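As an illustration of the randomization being discussed, here is a minimal sketch of drawing a firing interval uniformly from a range of factors around a base period. The function name `jitteredInterval` and its parameters are hypothetical; the factors and ticker type actually used in pillar are the ones in the linked lines, not the ones shown here.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredInterval (hypothetical helper) returns a duration drawn uniformly
// from [minFactor*base, maxFactor*base). Spreading the firing times this way
// keeps many devices from polling in lockstep.
func jitteredInterval(base time.Duration, minFactor, maxFactor float64) time.Duration {
	factor := minFactor + rand.Float64()*(maxFactor-minFactor)
	return time.Duration(float64(base) * factor)
}

func main() {
	base := time.Minute
	// Each call yields a different interval between 30s and 90s.
	for i := 0; i < 3; i++ {
		fmt.Println(jitteredInterval(base, 0.5, 1.5))
	}
}
```

Keeping the factors as parameters means the range can be changed in one place for all periodic tasks.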
if lc.beforeStart != nil {
	lc.beforeStart()
}
locked := lc.taskMx.TryRLock()
Did we hold a (read) lock across http invocations in the old/current implementation?
Means the read lock can be held for minutes.
We didn't (but then the old implementation had some potential race conditions).
I have now enhanced this so that Pause is not blocked by an HTTP request. I added this function:
// runInterruptible temporarily releases the task lock to allow Pause() to proceed,
// runs the provided callback, then re-acquires the lock. Returns true if a pause
// was triggered while the callback was running, indicating the caller should
// discard or retry the operation.
func (lc *taskControl) runInterruptible(callback func()) (wasPaused bool) {
...
This is now used to run HTTP operations without the lock being held, and to discard the results if there was a Pause while the HTTP request was running.
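The body of `runInterruptible` is elided above. The following is only a sketch of how such a function could work, assuming that tasks hold a read lock, that `Pause()` takes the write lock, and that a pause is detected through a generation counter. The `pauseGen` field and this `Pause` implementation are assumptions made for illustration, not the code from the PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// taskControl is a simplified, hypothetical stand-in for the type in the PR.
type taskControl struct {
	taskMx   sync.RWMutex  // read-locked by running tasks, write-locked by Pause
	pauseGen atomic.Uint64 // assumed: incremented on every Pause
}

// Pause waits until no task holds the read lock and returns a resume function.
func (tc *taskControl) Pause() (resume func()) {
	tc.taskMx.Lock()
	tc.pauseGen.Add(1)
	return tc.taskMx.Unlock
}

// runInterruptible must be called with the read lock held. It drops the lock
// while callback runs, re-acquires it, and reports whether a Pause happened
// in between.
func (tc *taskControl) runInterruptible(callback func()) (wasPaused bool) {
	before := tc.pauseGen.Load() // read while still holding the lock
	tc.taskMx.RUnlock()
	callback()
	tc.taskMx.RLock()
	return tc.pauseGen.Load() != before
}

func main() {
	tc := &taskControl{}
	tc.taskMx.RLock() // a task runs with the read lock held
	paused := tc.runInterruptible(func() {
		// Stand-in for a slow HTTP request; Pause can proceed meanwhile.
		resume := tc.Pause()
		resume()
	})
	tc.taskMx.RUnlock()
	fmt.Println("discard result:", paused)
}
```

Because the counter is read before the lock is released and again after it is re-acquired, any Pause that ran during the callback is observed, and the caller can drop the stale HTTP result.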
// GetLocalAppRestartCmd returns the most recent locally issued restart
// command for the given app, or an empty command if none exists.
func (lc *LocalCmdAgent) GetLocalAppRestartCmd(appUUID uuid.UUID) types.AppInstanceOpsCmd {
	lc.appCommandsMx.RLock()
Are the callers ok with this potentially blocking for minutes when the http call needs to time out?
This one is not locked during HTTP call execution.
Only taskMx was, but I changed it to avoid Pause being blocked by HTTP operations: #5191 (comment)
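To illustrate why a getter like this does not block on network I/O, here is a small sketch of a command store guarded by its own mutex, held only for the duration of a map access. The types `opsCmd` and `appCommands` and the string app ID are simplified, hypothetical stand-ins, not the actual pillar types.

```go
package main

import (
	"fmt"
	"sync"
)

// opsCmd is a hypothetical, simplified stand-in for types.AppInstanceOpsCmd.
type opsCmd struct {
	Counter uint32
}

// appCommands keeps the latest restart command per app behind a dedicated
// mutex that is never held across an HTTP call.
type appCommands struct {
	mx   sync.RWMutex
	cmds map[string]opsCmd
}

// setRestartCmd records the latest command for the given app.
func (a *appCommands) setRestartCmd(appID string, cmd opsCmd) {
	a.mx.Lock()
	defer a.mx.Unlock()
	if a.cmds == nil {
		a.cmds = make(map[string]opsCmd)
	}
	a.cmds[appID] = cmd
}

// getRestartCmd returns the stored command, or the zero value if none exists.
func (a *appCommands) getRestartCmd(appID string) opsCmd {
	a.mx.RLock()
	defer a.mx.RUnlock()
	return a.cmds[appID]
}

func main() {
	a := &appCommands{}
	a.setRestartCmd("app1", opsCmd{Counter: 3})
	fmt.Println(a.getRestartCmd("app1").Counter, a.getRestartCmd("app2").Counter)
}
```

Separating this mutex from the task lock means readers wait only for other short map operations.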
// from the primary controller is being applied. Or vice versa.
getconfigCtx.sideController.Lock()
defer getconfigCtx.sideController.Unlock()
resume := getconfigCtx.localCmdAgent.Pause()
Can't this block for minutes if a local task is waiting for http to time out while holding the read lock (from startTask)?
Good point -- fixed: #5191 (comment)
LGTM, but please test with a config where the HTTP POST/GET times out due to unreachable/blocked connectivity to the server.
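One way to exercise the timeout path in isolation is to point a client at a server that never answers. The sketch below uses only the Go standard library; the helper name `getTimesOut` is hypothetical, and it does not reproduce the actual eden or LPS test setup.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// getTimesOut (hypothetical helper) issues a GET against a local server that
// never responds and reports whether the client gave up with a timeout error.
func getTimesOut(timeout time.Duration) bool {
	release := make(chan struct{})
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-release // simulate a server that is reachable but unresponsive
	}))
	defer srv.Close()
	defer close(release) // runs before srv.Close so the handler can return

	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(srv.URL)
	if err == nil {
		resp.Body.Close()
		return false
	}
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

func main() {
	fmt.Println("timed out:", getTimesOut(100*time.Millisecond))
}
```

A real validation would still need a device whose connectivity to the server is blocked, as requested in the comment above.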
Hmmm... an LPS/LOC test fails: --- FAIL: TestEdenScripts/dev_local_info (1549.20s)
LOC/LPS tests are failing. Requesting a change to prevent accidental merging.
I'm unable to reproduce these failures locally, so I temporarily added a new commit with some extra log messages that might help. Could you please restart the eden tests?
kicking the tests!
Run the tests
@uncleDecart / @OhmSpectator When you get a chance, please restart the eden tests again. I submitted a small change which should fix the failing tests (but for now I am keeping the temporary commit with extra logs).
They should start once the build is done.
Looks like the failing test is fixed, so I removed the extra troubleshooting logs.
- Refactored the LPS-handling code inside zedagent into a separate component, LocalCmdAgent. This helps mitigate the complexity and unorganized structure of the already large zedagent package.
- No functional changes were introduced; the code was reorganized, cleaned up, and better documented.
- Improved thread-safety for variables that previously lacked proper protection, reducing potential race conditions.
- Added user-facing documentation for the Local Profile Server (LPS).
- Added initial developer-facing documentation for zedagent, with focus on LPS handling.
Signed-off-by: Milan Lenco <[email protected]>
@OhmSpectator Finally everything is green. When you get a chance, please merge.
Done.
Description
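- Refactored the LPS-handling code inside zedagent into a separate component, LocalCmdAgent. This helps mitigate the complexity and unorganized structure of the already large zedagent package.
- No functional changes were introduced; the code was reorganized, cleaned up, and better documented.
- Improved thread-safety for variables that previously lacked proper protection, reducing potential race conditions.
- Added user-facing documentation for the Local Profile Server (LPS).
- Added initial developer-facing documentation for zedagent, with focus on LPS handling.
How to test and validate this PR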
No functional changes were made, but the entire LPS functionality should be retested.
Deploy and configure LPS, then test all the endpoints: https://github.com/lf-edge/eve-api/blob/main/PROFILE.md
The same commands should also be tested with the Local Operator Console (LOC) in an air-gapped environment.
Note that I performed all these tests and additionally we have a small test suite in eden for LPS: https://github.com/lf-edge/eden/blob/master/tests/workflow/lps-loc.tests.txt
Changelog notes
zedagent microservice into a dedicated LocalCmdAgent component.
PR Backports
This is just refactoring and does not need to be backported.