RFC-15: Support CHERI/Morello in seL4 #21
See https://sel4.atlassian.net/browse/RFC-15 Signed-off-by: Gerwin Klein <[email protected]>
|
Moving the higher-level discussion here, we agreed at the seL4 summit that I submit a hybrid kernel then re-iterate.
I am copying the main questions from the PR here to discuss:
|
|
I would like to raise an additional higher-level issue that so far has not been clear to me from the RFC description and was only clarified by the description in the PR seL4/seL4#1344: The proposal seems to be to only support userspace in CHERI C (purecap) mode. It is unclear to me what the value would be of running seL4 in such a setting. I can see value in seL4 isolating traditional non-CHERI user-space processes (written in any programming language) from purecap user-space processes that want more fine-grained security. But if everything in user space is purecap anyway, you don't really need seL4, you don't even need a kernel, you could just run a threading library and have very similar properties. Conversely, requiring everything in userspace to be purecap is a massive restriction on what languages, libraries and applications can be run on top of seL4. Certainly not something I think the TSC should endorse for seL4. It makes sense to explore the purecap-only option as a research project, because that is likely the harder part to implement, but for an RFC to change seL4, supporting only purecap user space is a non-starter for me. A proposal to change seL4 must support both to be useful. Happy to be convinced otherwise if there are good arguments in the other direction. |
|
Hi Gerwin, Thanks for raising that. We totally agree with you that supporting "legacy" code (e.g., unmodified non-CHERI source code and/or binaries) makes sense, so that it can run side-by-side with purecap CHERI tasks. However, we don't think supporting "hybrid", as in allowing users to manually annotate pointers, adds much value. We're happy to add "legacy" support to the RFC and investigate the implementation effort this may require as well. Is that what you're looking for or do you mean something else? The whole reason to run this project is to allow system builders to have access to both formally verified separation for tasks/VMs and strong memory safety in tasks and guest VMs, i.e., we very much see seL4 and CHERI as complementary technologies. This is one of the reasons that we also see legacy task support as essential, since you might well want (for example) a configuration in which you have formally verified isolation between a legacy 64-bit Arm VM and a hardened CHERI-enabled task or VM. |
|
I am trying to modify the RFC file to integrate adding support for legacy code. However, since this is just a PR, I can't fork and submit PRs to modify this file. What's the recommended way if I want to modify the RFC proposal? |
|
Given that we're happy to add legacy support to this RFC, it would be great if we can define the "blockers" for this RFC so that we aim to address them separately from implementation. i.e., what's currently preventing this RFC from getting accepted, if we ignore the PRs? |
|
I think the main thing missing is a sane security model for how CHERI is supposed to be managed by seL4 user space. |
Does it let you make a pr against the branch that this PR is from: 0150-morello-support? If you check out this branch, commit some changes and push to your fork of this repo and then when you open a PR it should let you change the target branch from main to 0150-morello-support |
Yes, that should work, and when we then merge, the changes will show up here. Apologies for the awkwardness, this is an artefact of me copying things over from Jira. It'll not be necessary for new RFCs. |
Indeed, that works for me. I submitted a PR here #28 |
This is now a generic RFC for CHERI that includes both Morello and CHERI-RISC-V. Signed-off-by: Hesham Almatary <[email protected]>
Address comments from Kent and Gerwin on having to support legacy non-CHERI code side-by-side with CHERI C purecap code. Signed-off-by: Hesham Almatary <[email protected]>
I’ve been giving some thought to a simple, restrictive starting-point security policy/scenario for the purpose of this RFC, and I’d like to ask for your opinions on it. Microkit seems like an intuitive system for that purpose.
TL;DR
The monitor has full access to the CHERI capabilities during boot time. It derives/constructs CHERI capabilities for each protection domain. After the monitor finishes its bootstrapping, no CHERI capabilities are allowed to propagate across protection domains, and the monitor invalidates its “almighty” CHERI capabilities. Further, no IPC buffer is allowed to hold valid CHERI capabilities after boot.
Description
Microkit-based CHERI userspace policy
Who’s trusted?
The loader, kernel, and monitor (until it has finished initialization) are trusted.
What’s an almighty root CHERI capability?
An almighty capability is a valid CHERI capability that has full permissions and full address range (e.g., from 0 to 2^64 - 1 on 64-bit systems). CHERI hardware starts running (on reset) with 2 almighty capabilities: DDC and PCC. Both are used to derive and create further capabilities with fewer permissions and smaller address space ranges. Typically there’s one or two when systems like: firmware, hypervisor, OS, linkers/loaders, and userspace. Note that no almighty CHERI capability can bypass an existing MMU protection.
Kernel: Since the kernel is trusted, it could boot and keep almighty CHERI capabilities. During boot time, it creates one (or more) root capabilities for the user’s root task (monitor). The kernel does not need to keep almighty capabilities afterwards for a static system like Microkit.
Who gets almighty CHERI capabilities?
The loader, kernel and root task only. Only the kernel could keep them. Almighty capabilities are used to create/derive other capabilities for different address spaces, but only during bootstrapping time.
Who can create/derive CHERI capabilities from an almighty user capability for inter-AS protection domains?
Only the monitor task, only during bootstrapping time.
Monitor scenario
Please let me know what you think. |
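To make the derivation rule in the policy above concrete, here is a minimal sketch in C of a toy capability model. It is not the real CHERI encoding and not seL4 or Microkit code: the type `toy_cap_t`, the permission bits, and the functions `toy_almighty`, `toy_derive`, and `toy_invalidate` are all hypothetical names invented for illustration. It only models the two properties the policy relies on: a derived capability can never exceed its parent's bounds or permissions, and clearing the tag on one copy does not affect capabilities already derived from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission bits for the toy model. */
enum {
    TOY_PERM_LOAD    = 1u << 0,
    TOY_PERM_STORE   = 1u << 1,
    TOY_PERM_EXECUTE = 1u << 2,
    TOY_PERM_ALL     = (1u << 3) - 1
};

/* Toy capability: inclusive address range, permissions, validity tag. */
typedef struct {
    uint64_t base;   /* first address covered */
    uint64_t last;   /* last address covered (inclusive) */
    uint32_t perms;  /* bitwise OR of TOY_PERM_* */
    bool     tag;    /* true if the capability is valid */
} toy_cap_t;

/* An "almighty" capability: full range, full permissions. */
toy_cap_t toy_almighty(void)
{
    toy_cap_t c = { 0, UINT64_MAX, TOY_PERM_ALL, true };
    return c;
}

/* Derive a child capability. The result is tagged only if the parent is
 * tagged and the request is a subset of the parent's range and perms. */
toy_cap_t toy_derive(toy_cap_t parent, uint64_t base, uint64_t last,
                     uint32_t perms)
{
    toy_cap_t c = { base, last, perms, false };
    bool in_bounds = base <= last && base >= parent.base && last <= parent.last;
    bool perms_ok  = (perms & ~parent.perms) == 0;
    c.tag = parent.tag && in_bounds && perms_ok;
    return c;
}

/* Clear the tag of one particular copy; other copies are unaffected. */
void toy_invalidate(toy_cap_t *c)
{
    assert(c != NULL);
    c->tag = false;
}
```

In this model, a monitor would call `toy_derive` once per protection domain and then `toy_invalidate` on its own almighty copy. Capabilities derived earlier stay tagged, which is also why any leftover copies of the almighty capability would need to be invalidated individually.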
|
(I'm answering your questions with a more dynamic system in mind than Microkit.)
How does it clean up derived CHERI pointers? After startup, the stack and maybe the heap will be littered with stale, but valid CHERI pointers.
How is the initialisation problem solved on BSD/Linux? Does the dynamic loader have unbound DDC/PCC? If so, how does it assure that all copies are gone after init? For that matter, how is the heap implemented? Does the heap manager have a reserved virtual address range which it can use and one pointer for the whole range?
How are tasks that handle another task's cross-address-space pointers, e.g. debuggers, implemented on such platforms? Does ptrace pass valid CHERI pointers or does it construct them? This BSD ptrace page says they are constructed, why can't we do the same for seL4? That would solve the whole security problem of having CHERI pointers in the wrong address space. What is the added value of handling valid CHERI pointers in a task with a different address space, instead of having a syscall that creates a valid CHERI pointer in a task's register? It only seems to have security downsides, at best it's just more convenient.
The concern here is that it can be attacked to exploit (stale) CHERI pointers meant for a different address space (mostly in a scenario where you do more dynamic stuff after init like restarting tasks or reloading programs). That might sound paranoid, but I think it's the right level of wariness people using CHERI on seL4 would have.
I'm getting more and more convinced that handling CHERI pointers for another address space with valid CHERI pointers is a broken design. I think everything is much simpler and safer if you add a syscall that can construct valid CHERI caps in a task's register from non-CHERI integer arguments. It would require a TCB cap and perhaps also a VSpace cap. That would make it possible to do everything you need, without all the complications. seL4's changes would actually be minimal, both code-wise and semantically. For completeness you also want a syscall that can read the CHERI pointers in a safe, deconstructed way. These syscalls can then safely be used to manipulate the new CHERI system registers and create both non-CHERI and CHERI-enabled user space tasks, without the task doing that being able to gain "almighty" DDC/PCC for itself. Cross-address space CHERI pointer passing by accident would be impossible; it could only be done explicitly by tasks with the right permissions via the syscalls, or by enabling the CHERI PTE bits on shared memory. The problem of using PTE bits for IPC buffers is that it doesn't give explicit control, it's an all-or-nothing solution. If tasks A and B need to pass CHERI pointers, and tasks A and C do too, then very quickly B and C need to be able to pass CHERI pointers as well. Worse, the stale pointers for B or C stay dormant in A's IPC buffer and might be used by attackers to gain access to A's address space. If B or C is the attacker, then they probably have control over that pointer value. My other concern about using CHERI PTE bits on IPC buffers as a policy mechanism is that you will just enable them all everywhere to simplify user space porting, creating a huge security hole. 
As for passing CHERI pointers via IPC calls between threads in the same address space, I think that would be okay if we limit it to registers only (so no IPC buffer passing), preferably via explicit syscall wrappers, to make it an explicit operation by both the receiver and the sender. If user space wants to pass more than 4 CHERI pointers at once, it can do that via memory itself. |
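The syscall idea in this comment can be sketched as a toy model in C. This is not seL4 code and not a proposed API: `toy_tcb_t`, `toy_vspace_t`, `toy_tcb_write_cheri_reg`, and `toy_tcb_read_cheri_reg` are hypothetical names, the capability metadata is reduced to one opaque integer, and seL4 capabilities are stood in for by plain pointers. The one design assumption worth flagging is that a write without VSpace authority stores the values but leaves the register untagged, which is one possible reading of "it would require a TCB cap and perhaps also a VSpace cap".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_REGS 4

/* Stand-in for a VSpace object; holding a pointer models holding the cap. */
typedef struct { int id; } toy_vspace_t;

/* Stand-in for a TCB with CHERI-width registers split into plain integers. */
typedef struct {
    toy_vspace_t *vspace;             /* address space the thread runs in */
    uint64_t value[TOY_NUM_REGS];     /* address part of each register */
    uint64_t meta[TOY_NUM_REGS];      /* bounds/permissions, opaque here */
    bool     tag[TOY_NUM_REGS];       /* validity tag of each register */
} toy_tcb_t;

/* Hypothetical write: the caller passes integers only. The register becomes
 * tagged only if the caller also presents authority over the target's own
 * VSpace. Returns 0 on success, -1 on error. */
int toy_tcb_write_cheri_reg(toy_tcb_t *tcb, const toy_vspace_t *vspace_auth,
                            unsigned reg, uint64_t value, uint64_t meta)
{
    if (tcb == NULL || reg >= TOY_NUM_REGS)
        return -1;
    if (vspace_auth != NULL && vspace_auth != tcb->vspace)
        return -1;                    /* authority for the wrong address space */
    tcb->value[reg] = value;
    tcb->meta[reg]  = meta;
    tcb->tag[reg]   = (vspace_auth != NULL);
    return 0;
}

/* Hypothetical read: returns the register in deconstructed form, so the
 * caller never receives a usable capability, only integers and a flag. */
int toy_tcb_read_cheri_reg(const toy_tcb_t *tcb, unsigned reg,
                           uint64_t *value, uint64_t *meta, bool *tag)
{
    assert(value != NULL && meta != NULL && tag != NULL);
    if (tcb == NULL || reg >= TOY_NUM_REGS)
        return -1;
    *value = tcb->value[reg];
    *meta  = tcb->meta[reg];
    *tag   = tcb->tag[reg];
    return 0;
}
```

The point of the sketch is that the calling task only ever handles integers, so it cannot end up with a valid capability for the other address space in its own registers or memory.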
|
Hi Indan, Thanks for replying.
Sure, happy to discuss your concerns. I just wanted to mainly stick with a simple current static system that exists and is deployed/used, for the purpose of this RFC and its implementation/evaluation, and just for a start. All your other concerns are definitely valid, but they may take some effort and time to think about, evaluate and reach an agreement on.
I imagine that's the same with the current seL4 root tasks (without CHERI). It will still currently hold all capabilities that the kernel gave it. How does it currently address this issue? I don't know what it does in Microkit at the moment. But I imagine it could suspend/delete itself and become a zombie after it has finished bootstrapping the system? This could also work for CHERI, if there's no way to reach/use those capabilities after the root task is suspended/deleted. Hence, if the root task isn't part of the TCB, I could think of some solutions here:
CheriBSD currently keeps almighty/super capabilities in the kernel to construct further capabilities for the user if needed. It doesn't currently revoke or invalidate them and their copies after init. For example, when
Debuggers/tracers don't pass valid pointers, yes. It could (ask the kernel to) construct them using some system call. We can do the same in seL4. My colleagues working on CheriBSD told me there are currently 3 modes for the kernel to construct/write a capability to a target process using sysctl:
That's a valid concern. I agree. Legit question though, how is the root task currently viewed in the currently implemented seL4 systems? Is it part of the TCB? Is it designed and implemented to be part of the attack vector and could be attacked after it boots?
I totally agree with you: for a complete secure design with dynamic seL4 systems, this should (and could) be prevented. Hence I discussed solutions like sealing etc. before. I still think that, from a practical/implementation PoV, a static system like Microkit with a hybrid/legacy root task that itself doesn't care about or need CHERI's memory safety, but just creates CHERI caps for protection domains (which never propagate/receive valid CHERI caps after boot and never hold inter-AS CHERI caps), is also a secure implementation.
That's a good suggestion, and I've been keeping it in the backlog for a while, but now I've started experimenting with its implementation and I'd like your feedback. I imagine that would only be one system call to construct+write valid CHERI pointers that doesn't use the IPC buffer at all, like:
All current "ReadRegs" syscalls just return untagged capabilities. I also suggest that, for each VSpace seL4 capability, new permission bits could be added analogous to the above. What do you think?
|
The main way of dealing with this in seL4 is by modularisation and delegation so that tasks don't have more capabilities than strictly necessary. In seL4, if you delete a capability, all corresponding objects get destroyed, so you need to keep references somewhere. But those references are in CSpace, not littered on the stack and heap. If the CSpace slot becomes invalid, all stale CPath values on stack and heap are unusable. Another difference is that to misuse seL4 capabilities you need to trick a trusted task to execute specific syscalls on specific capabilities, which is much harder to exploit than arbitrary memory locations. If it handles a bunch of TCBs, its own TCB cap will not be in that list. With CHERI you can use any cheri pointer meant for another task to attack the trusted task itself, because the domain it applies to is not stored in the cheri pointer itself. For seL4 capabilities that would be like being able to use any TCB cap to manipulate the root task itself.
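The contrast drawn here between CSpace references and CHERI pointers can be illustrated with a toy model in C. This is only a sketch: `toy_cspace_t`, `toy_cptr_t`, `toy_cheri_ptr_t`, and the functions below are hypothetical names, and the model ignores any revocation mechanism a real system might add. It shows that a stale slot index becomes useless once the slot is deleted, while a copied tagged pointer stays usable after the original is invalidated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CSPACE_SLOTS 8

/* One CSpace slot: the authority lives here, not in the index. */
typedef struct {
    bool valid;
    int  object_id;
} toy_slot_t;

typedef struct {
    toy_slot_t slots[TOY_CSPACE_SLOTS];
} toy_cspace_t;

/* A "CPtr" is just an index; copies of it carry no authority themselves. */
typedef unsigned toy_cptr_t;

/* Resolve an index through the CSpace; -1 if the slot is empty or invalid. */
int toy_cspace_lookup(const toy_cspace_t *cs, toy_cptr_t cptr)
{
    assert(cs != NULL);
    if (cptr >= TOY_CSPACE_SLOTS || !cs->slots[cptr].valid)
        return -1;
    return cs->slots[cptr].object_id;
}

/* Deleting the slot invalidates every stale copy of the index at once. */
void toy_cspace_delete(toy_cspace_t *cs, toy_cptr_t cptr)
{
    assert(cs != NULL);
    if (cptr < TOY_CSPACE_SLOTS)
        cs->slots[cptr].valid = false;
}

/* Toy CHERI pointer: the authority (tag) travels with every copy. */
typedef struct {
    uint64_t address;
    bool     tag;
} toy_cheri_ptr_t;

bool toy_cheri_usable(toy_cheri_ptr_t p)
{
    return p.tag;
}
```

In the seL4-style half of the model there is a single place to revoke, whereas in the CHERI-style half each copy left on a stack or heap has to be found and cleared separately.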
Probably it doesn't do much after init. Even restarting faulting tasks is currently not done as far as I know.
That would work, yes.
Talking about trusted computing base is the wrong way of looking at it: The question isn't whether it's trusted, the question is whether it's attackable, not only now, but also in the future. People generally put too much trust into their own software.
Fine, but perhaps impractical.
Theoretically sound, practically very hard to get right if the software gets slightly more complicated, as it will need valid cheri tags to function.
Doesn't add anything to 1) if the memory isn't shared. If the memory is shared, then the problem spreads to all the tasks sharing the memory.
Agreed. But ask yourself: Why are you using a CHERI system to begin with, if your most privileged task is not protected by it? That is fine to do if you delegate the real work to less privileged tasks, but then those have the same stale CHERI pointer problem.
So user space can't create arbitrary valid CHERI pointers for different address spaces directly. Then why would we do that in seL4? I guess CheriBSD only support a
The target process could do that itself though, why would another task need to do that? What is the use case?
Makes sense for BSD/Linux. In seL4 user space has more direct control over memory mappings, so the added value is less there. The check could be replaced with requiring a vspace cap instead, that avoids the need for a relatively complicated page table walk.
In seL4 this would require a vspace cap, as that's the closest thing to full control over a task's address space.
That depends on the system. For very simple systems that don't care about restarting faulting tasks or logging any debug info when it happens, there is no way to interact with the root task, and hence it's secure. It can be as buggy and insecure software as you like, but it can't be used to attack the system as it's unreachable. I don't know whether you would call it part of the trusted compute base or not. If you do restart tasks, or if you have some communication channel, then things become more precarious. Usually the root task won't do that itself directly, but it would create other threads or processes and delegate to them. At that point the problem moves from the root task to those other tasks. In that sense there is nothing special about the root task, it's just the task that has access to most capabilities. That's why I keep using a fault handler task as example: Doesn't matter whether it's the root task or something else, if it wants to restart or debug it needs to set or get CHERI pointers. It has interaction with potentially untrusted tasks, so it can be attacked. Same for VM managers. But with seL4 all the attack vectors are self-induced, you get what you create and there shouldn't be any unexpected interactions (well, ignoring HW based attacks like rowhammer). The main way of securing it is by keeping it as simple as functionally possible.
Good point about not needing vspace for writing invalid cheri pointers, which could be useful for other reasons.
I would probably split this up in two word_t width parameters: One for the register value and one for the CHERI capability value. That avoids any confusion whether this is a valid CHERI pointer or not and would reduce the seL4 kernel changes.
We can do this by not passing a vspace.
If we add this, I'd prefer an explicit source register argument instead of trying to guess in the kernel.
I think having a vspace cap is enough, no need to also check whether there is a compatible virtual mapping. User space should know what that mapping is already and could check it in user space. I'd rather not walk the whole page table and check all entries for compatibility. (Does CheriBSD simplify this by limiting the range to one page or something?) It also makes it possible to create the necessary CHERI pointers before the mappings are done. We can't invalidate existing CHERI pointers if a mapping disappears either.
Could be replaced with tcb + vspace.
I would split this up in two word_t values too, to keep it consistent.
This is making things unnecessarily complicated with very little gain. Keep in mind you need both a tcb and a vspace cap already. If we need more fine grained permissions, it's on tcb caps, not vspace caps.
It's finally moving in the right direction. |
You'll still be able to enhance your system's security by creating less privileged memory-safe Microkit protection domains ("children of the root task"). If the root task gets attacked after bootstrapping, best it could do is violate its own memory-safety (if it had/required it in the first place, being purecap), but that won't give it any extra powers (beyond what seL4 caps give it) over the purecap protection domains after boot, given we completely prevent CHERI caps propagation between protection domains in a system like Microkit.
I am not suggesting to do so if we introduce the new system call.
Debuggers: to set a target process'
I think it has use cases (especially less restrictive ones) in seL4 as well.
My colleagues also said CheriBSD doesn't do PTW for it, but rather walk the vm_map_entry structures/reservation
Yeah but imagine a purecap root task or another task having both TCB and VSpace capabilities to themselves. You still might not want them to call
Yeah, I thought it'd be useful for backward compatibility as well (e.g., non-CHERI tasks that might want to use this syscall).
It doesn't really matter if the user passes it tagged or not. It'll always get untagged by the kernel. This is just to 1) enforce CHERI ABI with hardware-width registers and format actually being of
Sure. Just to confirm a source register index in the target process, right? If so, do we still want to give the user the option to iterate over GPRs if the source register isn't valid or big enough? Or if the user doesn't want to or know what source arg to choose? We could probably add a flag to
It's just an extra option/guarantee to only construct a capability if there's a backing page mapping for it. It could be useful, as I mentioned above, if the current target process' GPR snapshot doesn't happen to have a capability to derive from, but the process still has a mapping for it. It could reduce the privilege a debugger needs, so that it doesn't have to request a root capability at all (the confinement problem?) that would allow it to construct arbitrary capabilities in the target process. I don't have a strong bias to add/support that at the moment, so maybe just 1 and 2. But I could imagine (and I think it's actually necessary if we completely ditch inter-AS super capabilities in user space) a more sophisticated fine-grained option to allow the user to pass a set of seL4 page capabilities to this syscall and construct a single CHERI capability out of them. This could solve an issue for a user-level
Yeah but let me know about the situation above where a purecap task might have a tcb + vspace itself but you still don't want it to create arbitrary super CHERI capabilities?
I'd like to stick with enforcing the CHERI ABI by returning an untagged CHERI-width register, and probably return an additional byte for the tag and extra info if needed. This is what CheriBSD does in ptrace.
I provided some reasons why those permissions might be necessary above, please let me know what you think. I also thought these new permissions could just be in the tcb cap as well, but then I thought it makes more sense to bind them to a VSpace, since a CHERI capability is really bound to an AS/VSpace rather than a TCB.
|
|
I anticipate if we decide to implement this new system call, we will also need a similar one (or integrate it in the above syscall) for writing tagged capabilities in the target process' AS memory, besides writing to registers. This is, for example, to enable user-level ELF loaders and process creators to initialise CHERI's capability table for each process/ELF/protection domain before it starts executing. We will also likely need to stick those permissions to VSpace and probably also pass page capabilities to those system calls in scenarios like:
Giving a "forging" permission to a VSpace to arbitrarily manufacture/write any CHERI capability (i.e., option #3 above, even if they have a VSpace cap) could be considered less secure. This is compared to restricting "deriving" new CHERI capabilities to having existing GPRs to derive valid CHERI caps from (which doesn't apply in the previous two scenarios), or from VSpace/page capabilities. |
Fair enough. Any examples where you would want to use a different source register than the destination one? If not, we could make this the default behaviour if no vspace cap is given.
I'm not convinced it's worth it.
Yes, that's exactly the part that user space is supposed to handle in seL4. So if you want something similar, you have to implement it in user space. The seL4 kernel has no additional metadata for mappings, other than what's stored in page caps. For the kernel to do a range check, it needs to do a depth first walk of the page table and check all permissions along the way.
Not really. Everything you say also applies to other vspace operations, I don't see why CHERI would be an exception here.
Ideally the new syscalls would be generic individual register read/write syscalls, with enough flexibility to implement what CHERI needs. That is, the non-CHERI generic syscalls should be added too. The only difference would be additional arguments for CHERI. That's another reason to keep everything the same with non-tagged pointer arguments.
This is exactly what I'm trying to avoid.
None of this is performance critical and one or two cycles extra won't make any difference anyway.
Of course.
No? Do you mean you want to combine multiple overlapping/adjacent CHERI pointers into one that covers both ranges? Does CheriBSD support that? That seems very obscure functionality.
The user can retrieve all register values and iterate over them itself if it wants to, it doesn't need the kernel to do that.
This is overcomplicating things unnecessarily, which is what I am trying to avoid.
Again, overcomplication with not much gain. For user-level
Then I'll probably complain about it when I review the code, except if you manage to keep it very simple. |
My initial reaction is: Absolutely not. (If we do this, it's by adding generic cross-AS memory read/write syscalls, again with an extension for CHERI.)
How is this problem solved by CheriBSD and why can't we use the same solution?
This can easily be solved by passing those via registers during task startup, no need for cross-AS memory writes. You can pass quite a few registers at launch and then run special code that does all the memory initialisation you need before normal startup. This can be special code that gets mapped before at load and unmapped when launch is done, making it fully invisible to normal user space code, having any ABI you want there. Alternatively, you have a launcher task with almighty CHERI permission, which can write the memory itself directly. As all it does is loading ELF files, it has no direct interaction with untrusted tasks. Even if it does get compromised, there is nothing sensitive in its own address space, nor anything to exploit, as all it does is loading an ELF file and you can already control its behaviour via the ELF file you feed it. This is probably closest to how CheriBSD does this. |
No examples come to mind. But I am thinking what if the caller just wants to invalidate the tag of the target register while keeping its value.
No that's not what I meant. It's just to find a big enough CHERI GPR to construct the new capability from.
Not trying to avoid vspace. It will still require TCB, VSpace, and page caps; you can still have a VSpace cap but without page caps to map to this vspace. My argument is that having a VSpace cap might not be enough (of a privilege) to construct valid arbitrary CHERI caps that cover a big range of contiguous virtual memory. In other words, can an seL4 thread (e.g., a debugger) that has a VSpace cap (and only that) for a different thread/AS read/write any of its memory without page mappings/caps? Similarly for constructing CHERI caps only with a VSpace. In any case, we need a way to construct valid CHERI capabilities for bigger-than-a-page mappings, for both a debugger and a target thread (e.g., when creating a new protection domain from scratch, without having any existing valid CHERI GPRs).
|
Yeah that's exactly what I meant. I don't mean for a thread to hold inter-AS capabilities, but just exactly the same as constructing valid register caps for a target thread, we construct valid CHERI caps for a target thread and save it to its memory. I don't care if it's the same syscall or a separate one.
ELF loading is done by the kernel on fork/exec, unlike here. The kernel has root caps that it constructs CHERI caps from (for each ELF segment), then passes them to the user in its stack as expected by a UNIX process'
|
That seems obscure functionality. But you can already do this by writing the register twice, both times with no vspace given: Once to put it out of range and hence invalidate it, second time to restore the original value. Or the valid tag can be an extra parameter to the new syscall.
Of course not. VSpace gives control over someone else's virtual memory space, it doesn't automatically give you access to it yourself. It's for managing another task's memory; having that right doesn't automatically grant you access to the same memory. This was one of my main arguments against managing cross-AS CHERI pointers the way you originally proposed. And this is also why adding cross-task memory read/write syscalls is not straightforward, as it has the same problem. But having a tcb and vspace cap only gives you the right to manage another task's CHERI register tags, it doesn't give you access to valid CHERI pointers. It is the tcb cap that gives you read and write access to the other task's registers; the vspace cap is an additional requirement we are going to add to manage CHERI tags. But you're still managing registers, not memory. Perhaps, as an exception, we could allow tasks to create CHERI pointers for themselves if they have a vspace cap but no tcb cap. But I'm not sure that is a good idea.
Of course, but we have the new syscall for that, and if you have a tcb and vspace cap you can create any arbitrary ranges, it's not limited to page sizes. What's missing is a good reason to add fine-grained permissions to vspaces for CHERI, or to limit tag creation to the page table mapping. Debuggers are a non-issue, as they would need tcb caps already. Anything creating new processes already has tcb and vspace caps. So I don't see a strong argument in favour of complicating things without much gain. (I mean on the kernel side, I'm fine with complicating user space startup if it keeps the kernel simpler.)
What about dynamically linked applications? I know Linux starts the linker up in that case instead of loading all libraries itself.
Nothing about seL4 is portable, whatever you do you'll need seL4 specific launch code. And like I said, if it doesn't fit in registers, you can always have a launcher that has super/root pointers that creates what's necessary in its own address space, just like monolithic kernels do. Or add a launch protocol where the init code can request more Cheri pointers when it's done with the current set and needs more to complete init. There are many (admittedly annoying) ways of solving this in user space. |
Original Jira issue and discussion.