Fix single APP panic in kernel scheduler #3498

qazwsxedcrfvtg14 · 2023-06-21T05:11:25Z

Pull Request Overview

This pull request fixes a bug where the kernel crashes if there is only one app and that app is not ready.

In that case, the scheduler panics when calling next.unwrap().

Testing Strategy

This pull request was tested by running several combinations of apps, as well as single-app test cases, on our platform (ti50).

TODO or Help Wanted

N/A

Documentation Updated

Updated the relevant files in /docs, or no updates are required.

Formatting

Ran make prepush.

We should always use the head of the queue instead of the next pointer of the current node, because we change the location of the current node.

lschuermann

I'm trying to understand the intricacies of this change.

processes = List {
    head: ListLink(Some(RoundRobinProcessNode {
        proc: Some($PROCESS),
        next: ListLink(None)
    })),
}

In the current version, using for node in self.processes.iter() { ... }, we:

1. Create a ListIterator with cur: Some(RoundRobinProcessNode { proc: Some($PROCESS), next: ListLink(None) }).
2. (Implicitly) pull the first element from the ListIterator: Iterator through Iterator::next(). This sets cur: None and returns Some(RoundRobinProcessNode { proc: Some($PROCESS), next: ListLink(None) }) for the loop iteration as node.
3. Given that first_head is None, set first_head = Some(node).
4. Check node.proc, which is Some(proc). However, proc.ready() = false. We perform self.processes.push_tail(self.processes.pop_head().unwrap()), which will first set head of the list to None, then set next of proc to None, and finally set head of the list to proc.
   This translates to the list effectively not being changed.
5. Do step 2 & 3.
6. Given that first_head is Some(first_head), and first_head == node, break out of the loop, returning SchedulingDecision::TrySleep.

In contrast, the newly proposed changes would do the following:

1. Run self.processes.head(), which returns Some(RoundRobinProcessNode { proc: Some($PROCESS), next: ListLink(None) }) as Some(node).
2. Do step 3 & 4 from above.
3. Do step 1.
4. Do step 6 from above.

From reading over this code, I can't really make out the difference between the two approaches. I suspect I may be missing something, which probably has to do with the list being mutated while we iterate over it.

Could you perhaps try to clarify exactly what is going on here?

In general though, this code is very complicated to understand. I tried to address this some time ago, along with the inefficiencies we have here through our use of push_tail: #2845. Maybe that's worth going back to? The fact that this scheduling code is of quadratic complexity is pretty crazy.

bradjc

Good find, and this really emphasizes why we should just not allow code that uses .unwrap(), no matter how much we trust the programmer. It's just too hard to write correct code!

I think what is happening here is that today, if there is only one entry in the processes array, the .iter() loop executes only once, so the loop ends with both a) next == None and b) return Sleep never having been reached.

With this patch, the loop will run twice when the processes array is 1 item long, so the return will be hit the second time.
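To make the difference concrete, here is a minimal, self-contained model of the two loops. It is a sketch, not the Tock code: Node, List, ListIter, Decision, next_with_iter and next_with_head are hypothetical stand-ins written for this illustration, and next_with_iter returns None at the point where the real scheduler would call next.unwrap().

```rust
use std::cell::Cell;
use std::ptr;

// Simplified stand-in for the scheduler's decision type (assumption).
#[derive(Debug, PartialEq)]
enum Decision {
    Run(usize),
    TrySleep,
}

// Hypothetical intrusive list node: a process id, a ready flag, a next link.
struct Node<'a> {
    id: usize,
    ready: bool,
    next: Cell<Option<&'a Node<'a>>>,
}

impl<'a> Node<'a> {
    fn new(id: usize, ready: bool) -> Self {
        Node { id, ready, next: Cell::new(None) }
    }
}

struct List<'a> {
    head: Cell<Option<&'a Node<'a>>>,
}

// The iterator keeps its own cursor, independent of later list mutations.
struct ListIter<'a> {
    cur: Option<&'a Node<'a>>,
}

impl<'a> Iterator for ListIter<'a> {
    type Item = &'a Node<'a>;
    fn next(&mut self) -> Option<Self::Item> {
        let node = self.cur?;
        // The cursor advances using the link as it is *right now*.
        self.cur = node.next.get();
        Some(node)
    }
}

impl<'a> List<'a> {
    fn new() -> Self {
        List { head: Cell::new(None) }
    }
    fn head(&self) -> Option<&'a Node<'a>> {
        self.head.get()
    }
    fn iter(&self) -> ListIter<'a> {
        ListIter { cur: self.head.get() }
    }
    fn pop_head(&self) -> Option<&'a Node<'a>> {
        let node = self.head.get()?;
        self.head.set(node.next.get());
        Some(node)
    }
    // Linear-time append, mirroring the push_tail shape quoted in this thread.
    fn push_tail(&self, node: &'a Node<'a>) {
        node.next.set(None);
        match self.iter().last() {
            Some(last) => last.next.set(Some(node)),
            None => self.head.set(Some(node)),
        }
    }
}

// Iterator-based loop. A None result marks where next.unwrap() would panic.
fn next_with_iter<'a>(list: &List<'a>) -> Option<Decision> {
    let mut first_head: Option<&'a Node<'a>> = None;
    for node in list.iter() {
        match first_head {
            None => first_head = Some(node),
            Some(first) if ptr::eq(first, node) => return Some(Decision::TrySleep),
            Some(_) => {}
        }
        if node.ready {
            return Some(Decision::Run(node.id));
        }
        // Rotate the not-ready node to the back while the iterator is live.
        if let Some(head) = list.pop_head() {
            list.push_tail(head);
        }
    }
    None
}

// Head-based loop: always re-reads the head, so a rotated node is seen again.
fn next_with_head<'a>(list: &List<'a>) -> Option<Decision> {
    let mut first_head: Option<&'a Node<'a>> = None;
    while let Some(node) = list.head() {
        match first_head {
            None => first_head = Some(node),
            Some(first) if ptr::eq(first, node) => return Some(Decision::TrySleep),
            Some(_) => {}
        }
        if node.ready {
            return Some(Decision::Run(node.id));
        }
        if let Some(head) = list.pop_head() {
            list.push_tail(head);
        }
    }
    None
}
```

With a single not-ready node, the iterator's cursor is already None after the first element, so next_with_iter falls out of the loop with nothing selected, while next_with_head sees the same node a second time and returns Decision::TrySleep. With two or more nodes, both variants behave the same in this model.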

Merging this seems good to me, but I of course would be in favor of removing that .unwrap().
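One possible shape for dropping the .unwrap(), sketched with hypothetical names: Decision is a simplified stand-in for the scheduler's decision type, and decide is a helper invented for this example. Falling back to TrySleep when no process was selected is an assumption about the desired behavior, not something this PR settles.

```rust
// Simplified stand-in for the scheduler's decision type (assumption).
#[derive(Debug, PartialEq)]
enum Decision {
    Run(usize),
    TrySleep,
}

// Hypothetical helper: turn the loop's result into a decision without
// unwrapping. A missing candidate becomes TrySleep instead of a panic.
fn decide(next: Option<usize>) -> Decision {
    match next {
        Some(id) => Decision::Run(id),
        None => Decision::TrySleep,
    }
}
```

The match forces the None case to be handled explicitly, so an unexpected loop exit degrades to sleeping rather than crashing the kernel.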

qazwsxedcrfvtg14 · 2023-06-22T17:34:39Z

IIUC, the "original" idea of the scheduler is to loop over the queued elements "forever" and break out of the loop when it sees an element twice.

This patch is just making sure the code aligns with the original idea.

For the quadratic complexity part, I think the quickest fix is to change push_tail in kernel/src/collections/list.rs:

    pub fn push_tail(&self, node: &'a T) {
        node.next().0.set(None);
        match self.iter().last() {
            Some(last) => last.next().0.set(Some(node)),
            None => self.push_head(node),
        }
    }

IIUC, the complexity of self.iter().last() is linear, but that can be fixed if we store an extra reference to the current tail.
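A minimal sketch of the tail-reference idea, assuming a simplified intrusive list. Node, TailList and the ids helper are hypothetical names for this illustration and are not the types in list.rs.

```rust
use std::cell::Cell;

// Hypothetical node type: an id plus an intrusive next link.
struct Node<'a> {
    id: usize,
    next: Cell<Option<&'a Node<'a>>>,
}

impl<'a> Node<'a> {
    fn new(id: usize) -> Self {
        Node { id, next: Cell::new(None) }
    }
}

// List that remembers its tail so appending does not walk the list.
struct TailList<'a> {
    head: Cell<Option<&'a Node<'a>>>,
    tail: Cell<Option<&'a Node<'a>>>,
}

impl<'a> TailList<'a> {
    fn new() -> Self {
        TailList { head: Cell::new(None), tail: Cell::new(None) }
    }

    // Constant-time append: link the old tail to the node, then move the tail.
    fn push_tail(&self, node: &'a Node<'a>) {
        node.next.set(None);
        match self.tail.get() {
            Some(last) => last.next.set(Some(node)),
            None => self.head.set(Some(node)),
        }
        self.tail.set(Some(node));
    }

    // Constant-time removal; the tail must be cleared when the list empties.
    fn pop_head(&self) -> Option<&'a Node<'a>> {
        let node = self.head.get()?;
        self.head.set(node.next.get());
        if self.head.get().is_none() {
            self.tail.set(None);
        }
        node.next.set(None);
        Some(node)
    }

    // Helper for inspection only: collect ids from head to tail.
    fn ids(&self) -> Vec<usize> {
        let mut out = Vec::new();
        let mut cur = self.head.get();
        while let Some(node) = cur {
            out.push(node.id);
            cur = node.next.get();
        }
        out
    }
}
```

The cost is one extra reference per list plus the invariant that tail is None exactly when head is None, which every mutating method has to maintain.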

Another possibility is replacing the list with a static buffer, and storing a reference/index to the current head. (a simple ring buffer.)
But this method would not benefit the mlfq scheduler.
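The ring-buffer alternative could look roughly like the following sketch. Slot, Ring, next and yield_current are hypothetical names, process readiness is reduced to a bool, and Decision is a simplified stand-in for the scheduler's decision type.

```rust
// Simplified stand-in for the scheduler's decision type (assumption).
#[derive(Debug, PartialEq)]
enum Decision {
    Run(usize),
    TrySleep,
}

// Hypothetical process slot: an id and a ready flag.
#[derive(Clone, Copy)]
struct Slot {
    id: usize,
    ready: bool,
}

// Fixed-size buffer plus an index to the current head.
struct Ring<const N: usize> {
    slots: [Option<Slot>; N],
    head: usize,
}

impl<const N: usize> Ring<N> {
    // Visit each slot at most once, starting at head. "Rotating" a not-ready
    // or empty slot to the back is just advancing the index.
    fn next(&mut self) -> Decision {
        for _ in 0..N {
            if let Some(slot) = self.slots[self.head] {
                if slot.ready {
                    return Decision::Run(slot.id);
                }
            }
            self.head = (self.head + 1) % N;
        }
        Decision::TrySleep
    }

    // Move past the current process once it has used its turn.
    fn yield_current(&mut self) {
        if N > 0 {
            self.head = (self.head + 1) % N;
        }
    }
}
```

Because the scan is bounded by N, the loop cannot spin forever when nothing is ready, and no element is ever unlinked or relinked.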

Prevent using iterator and pop at the same time

1b70a22


github-actions bot added the kernel label Jun 21, 2023

lschuermann reviewed Jun 21, 2023

bradjc approved these changes Jun 21, 2023

lschuermann approved these changes Jun 23, 2023

bradjc added this pull request to the merge queue Jun 23, 2023

Merged via the queue into tock:master with commit f989880 Jun 23, 2023
