Conversation

@lsf37
Member

@lsf37 lsf37 commented Jun 14, 2024

@heshamelmatary

Moving the higher-level discussion here: we agreed at the seL4 summit that I would submit a hybrid kernel and then iterate.
The discussion so far on the hybrid kernel PR can be summarised as follows:

  1. We concluded that we globally, by default, forbid passing tagged pointers inter-AS.
  2. I suggested we also give the user a config option (and PTE bits) to control passing tagged pointers in the IPC buffer for different threat models, given that they know what they're doing, at their own risk.
  3. Even if the user allows 2, they still have the option of strict CHERI security, avoiding pointer aliasing by using non-overlapping address spaces for mutually distrusting (and untrustworthy) compartments.

I am copying the main questions from the PR here to discuss:

  1. Any CHERI design blockers that must be addressed before this RFC/PR gets accepted.
  2. Whether the verification-visible changes in non-CHERI shared code are minimal enough to be tolerated.
  3. Future optional CHERI design trade-offs that aren't blockers for this PR to get merged, but important to think about.
  4. Anything else?

@lsf37
Member Author

lsf37 commented Feb 20, 2025

I would like to raise an additional higher-level issue that so far has not been clear to me from the RFC description and only clarified from the description in the PR seL4/seL4#1344:

The proposal seems to be to only support userspace in CHERI C (purecap) mode.

It is unclear to me what the value would be of running seL4 in such a setting. I can see value in seL4 isolating traditional non-CHERI user-space processes (written in any programming language) from purecap user-space processes that want more fine-grained security. But if everything in user space is purecap anyway, you don't really need seL4, you don't even need a kernel; you could just run a threading library and have very similar properties. Conversely, requiring everything in userspace to be purecap is a massive restriction on what languages, libraries and applications can be run on top of seL4. Certainly not something I think the TSC should endorse for seL4.

It makes sense to explore the purecap-only option as a research project, because that is likely the harder part to implement, but for an RFC to change seL4, supporting only purecap user space is a non-starter for me. A proposal to change seL4 must support both to be useful.

Happy to be convinced otherwise if there are good arguments in the other direction.

@heshamelmatary

Hi Gerwin,

Thanks for raising that. We totally agree with you that supporting "legacy" (e.g., unmodified non-CHERI source-code and/or binaries) makes sense to run side-by-side with purecap CHERI tasks. However, we don't think supporting "hybrid" as in allowing users to manually annotate pointers, adds much value. We're happy to include "legacy" support to the RFC and investigate the implementation efforts this may require as well. Is that what you're looking for or do you mean something else?

The whole reason to run this project is to allow system builders to have access to both formally verified separation for tasks/VMs and strong memory safety in tasks and guest VMs, i.e., we very much see seL4 and CHERI as complementary technologies. This is one of the reasons that we also see legacy task support as essential, since you might well want (for example) a configuration in which you have formally verified isolation between a legacy 64-bit Arm VM and a hardened CHERI-enabled task or VM.

@heshamelmatary

I am trying to modify the RFC file to integrate adding support for legacy code. However, since this is just a PR, I can't fork and submit PRs to modify this file. What's the recommended way if I want to modify the RFC proposal?

@heshamelmatary

Given that we're happy to add legacy support to this RFC, it would be great if we can define the "blockers" for this RFC so that we aim to address them separately from implementation. i.e., what's currently preventing this RFC from getting accepted, if we ignore the PRs?

@Indanz
Contributor

Indanz commented Mar 3, 2025

I think the main thing missing is a sane security model for how CHERI is supposed to be managed by seL4 user space.

@kent-mcleod
Member

I am trying to modify the RFC file to integrate adding support for legacy code. However, since this is just a PR, I can't fork and submit PRs to modify this file. What's the recommended way if I want to modify the RFC proposal?

Does it let you make a PR against the branch that this PR is from: 0150-morello-support? If you check out this branch, commit some changes and push to your fork of this repo, then when you open a PR it should let you change the target branch from main to 0150-morello-support.

@lsf37
Member Author

lsf37 commented Mar 3, 2025

Does it let you make a pr against the branch that this PR is from: 0150-morello-support? If you check out this branch, commit some changes and push to your fork of this repo and then when you open a PR it should let you change the target branch from main to 0150-morello-support

Yes, that should work, and when we then merge, the changes will show up here. Apologies for the awkwardness, this is an artefact of me copying things over from Jira. It'll not be necessary for new RFCs.

@heshamelmatary

Does it let you make a pr against the branch that this PR is from: 0150-morello-support? If you check out this branch, commit some changes and push to your fork of this repo and then when you open a PR it should let you change the target branch from main to 0150-morello-support

Yes, that should work, and when we then merge, the changes will show up here. Apologies for the awkwardness, this is an artefact of me copying things over from Jira. It'll not be necessary for new RFCs.

Indeed, that works for me. I submitted a PR here: #28

This is now a generic RFC for CHERI that includes both Morello and CHERI-RISC-V.

Signed-off-by: Hesham Almatary <[email protected]>
Address comments from Kent and Gerwin on having to support legacy non-CHERI
code side-by-side with CHERI C purecap code.

Signed-off-by: Hesham Almatary <[email protected]>
@heshamelmatary

I think the main thing missing is a sane security model for how CHERI is supposed to be managed by seL4 user space.

I’ve been giving some thought to a simple, restrictive starting-point security policy/scenario for the purpose of this RFC, and I’d like to ask for your opinions on it. Microkit seems like an intuitive system for that purpose.

TL;DR

The monitor has full access to the CHERI capabilities during boot time. It derives/constructs CHERI capabilities for each protection domain. After the monitor finishes its bootstrapping, no CHERI capabilities are allowed to propagate across protection domains, and the monitor invalidates its “almighty” CHERI capabilities. Further, no IPC buffer is allowed to hold valid CHERI capabilities after boot.

Description

Microkit-based CHERI userspace policy

Who’s trusted?

The loader, kernel, and monitor (until it has finished initialization) are trusted.

What’s an almighty root CHERI capability?

An almighty capability is a valid CHERI capability that has full permissions and covers the full address range (e.g., from 0 to 2^64 - 1 on 64-bit systems). CHERI hardware starts running (on reset) with two almighty capabilities: DDC and PCC. Both are used to derive further capabilities with fewer permissions and narrower address ranges. Typically there are one or two for each layer of the system, such as firmware, hypervisor, OS, linkers/loaders, and userspace. Note that no almighty CHERI capability can bypass an existing MMU protection.

Kernel: Since the kernel is trusted, it could boot and keep almighty CHERI capabilities. During boot time, it creates one (or more) root capabilities for the user’s root task (monitor). The kernel does not need to keep almighty capabilities afterwards for a static system like Microkit.
User: The kernel (and/or the loader) constructs root user CHERI capabilities that lack the ASR (Access System Registers) permission and cover the entire user address space; these are handed over to the monitor.

Who gets almighty CHERI capabilities?

loader, kernel and the root task only. Only the kernel could keep them. Almighty capabilities are used to create/derive other capabilities for different address spaces, but only during bootstrapping time.

Who can create/derive CHERI capabilities from an almighty user capability for inter-AS protection domains?

Only the monitor task, only during bootstrapping time.

Monitor scenario

  • The root task (monitor) is given full CHERI capability permissions/range covering the entire user address space. It’s trusted to properly create a security policy and derive further CHERI capabilities for each protection domain.
  • It creates CHERI capabilities for each “mr” mapping.
  • The monitor/root-task itself has permissions to write valid CHERI capabilities to its IPC buffer to perform protection domain setups.
  • The monitor will remap its IPC buffer without CHERI permissions to load/store valid CHERI capabilities after it boots.
  • It creates CHERI capabilities for code and each protection domain to set up its CHERI captable.
  • It disables any tagged CHERI capability propagation across protection domains (and address spaces) by creating IPC buffers without CHERI page permissions.
  • It invalidates any almighty CHERI capabilities after performing the load process.
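
To make the derive-then-invalidate flow above concrete, here is a minimal software model in plain C. It is only a sketch under stated assumptions: `model_cap`, `model_derive`, `model_invalidate` and the `PERM_*` bits are hypothetical names for illustration, not CHERI intrinsics or seL4/Microkit APIs, and real capabilities use compressed bounds with alignment constraints that this model ignores.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a CHERI capability (illustration only). */
typedef struct {
    uint64_t base;    /* lowest address covered */
    uint64_t length;  /* number of bytes covered */
    uint32_t perms;   /* permission bit mask */
    bool tag;         /* validity tag */
} model_cap;

/* Hypothetical permission bits; PERM_ASR models Access System Registers. */
enum {
    PERM_LOAD  = 1u << 0,
    PERM_STORE = 1u << 1,
    PERM_EXEC  = 1u << 2,
    PERM_ASR   = 1u << 3
};

/*
 * Derive a child from a parent capability. Derivation is monotonic:
 * bounds and permissions may only shrink. Any attempt to widen either,
 * or to derive from an untagged parent, yields an untagged result.
 */
model_cap model_derive(model_cap parent, uint64_t base, uint64_t length,
                       uint32_t perms)
{
    model_cap child = { base, length, perms, false };
    bool in_bounds = parent.tag
        && base >= parent.base
        && length <= parent.length
        && base - parent.base <= parent.length - length;
    bool perms_subset = (perms & ~parent.perms) == 0;

    child.tag = in_bounds && perms_subset;
    return child;
}

/* Clear the tag of one capability; copies held elsewhere are unaffected. */
void model_invalidate(model_cap *cap)
{
    cap->tag = false;
}
```

In this model, a monitor holding a user root capability without `PERM_ASR` can derive a narrower capability for each protection domain and then call `model_invalidate` on the root. Nothing further can be derived from the root afterwards, but capabilities that were already derived keep their tags, so they have to be tracked and cleared separately.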

Please let me know what you think.

@Indanz
Contributor

Indanz commented Apr 7, 2025

(I'm answering your questions with a more dynamic system in mind than Microkit.)

and the monitor invalidates its “almighty” CHERI capabilities.

How does it clean up derived CHERI pointers? After startup, the stack and maybe the heap will be littered with stale, but valid CHERI pointers.

An almighty capability is a valid CHERI capability that has full permissions and covers the full address range (e.g., from 0 to 2^64 - 1 on 64-bit systems). CHERI hardware starts running (on reset) with two almighty capabilities: DDC and PCC. Both are used to derive further capabilities with fewer permissions and narrower address ranges. Typically there are one or two for each layer of the system, such as firmware, hypervisor, OS, linkers/loaders, and userspace. Note that no almighty CHERI capability can bypass an existing MMU protection.

How is the initialisation problem solved on BSD/Linux? Does the dynamic loader have unbound DDC/PCC? If so, how does it assure that all copies are gone after init? For that matter, how is the heap implemented? Does the heap manager have a reserved virtual address range which it can use and one pointer for the whole range?

Who gets almighty CHERI capabilities?

loader, kernel and the root task only. Only the kernel could keep them. Almighty capabilities are used to create/derive other capabilities for different address spaces, but only during bootstrapping time.

How are tasks that handle another task's cross-address-space pointers, e.g. debuggers, implemented on such platforms? Does ptrace pass valid CHERI pointers or does it construct them? This BSD ptrace page says they are constructed, why can't we do the same for seL4? That would solve the whole security problem of having CHERI pointers in the wrong address space.

What is the added value of handling valid CHERI pointers in a task with a different address space, instead of having a syscall that creates a valid CHERI pointer in a task's register? It only seems to have security downsides, at best it's just more convenient.

* The root task (monitor), is going to be given full CHERI capability permissions/range that cover the entire user address space. It’s trusted to properly create a security policy and derive further CHERI capabilities for each protection domain.

The concern here is that it can be attacked to exploit (stale) CHERI pointers meant for a different address space (mostly in a scenario where you do more dynamic stuff after init like restarting tasks or reloading programs). That might sound paranoid, but I think it's the right level of wariness people using CHERI on seL4 would have.

* It creates CHERI capabilities for each “mr” mapping.

* The monitor/root-task itself has permissions to write valid CHERI capabilities to its IPC buffer to perform protection domain setups.

* The monitor will remap its IPC buffer without CHERI permissions to load/store valid CHERI capabilities after it boots.

* It creates CHERI capabilities for code and each protection domain to set up its CHERI captable.

* It disables any tagged CHERI capability propagation across protection domains (and address spaces) by creating IPC buffers without CHERI Page permissions

* It invalidates any almighty CHERI capabilities after performing the load process.

Please let me know what you think.

I'm getting more and more convinced that handling CHERI pointers for another address space with valid CHERI pointers is a broken design.

I think everything is much simpler and safer if you add a syscall that can construct valid CHERI caps in a task's register from non-CHERI integer arguments. It would require a TCB cap and perhaps also a VSpace cap. That would make it possible to do everything you need, without all the complications. seL4's changes would actually be minimal, both code-wise and semantically. For completeness you also want a syscall that can read the CHERI pointers in a safe, deconstructed way.

These syscalls can then safely be used to manipulate the new CHERI system registers and create both non-CHERI and CHERI-enabled user space tasks, without the task doing that being able to gain "almighty" DDC/PCC for itself. Cross-address space CHERI pointer passing by accident would be impossible, it could only be done explicitly by tasks with the right permissions via the syscalls, or by enabling the CHERI PTE bits on shared memory.

The problem of using PTE bits for IPC buffers is that it doesn't give explicit control, it's an all or nothing solution. If tasks A and B need to pass CHERI pointers, and task A and C, then very quickly B and C need to be able to pass CHERI pointers too. Worse, the stale pointers for B or C stay dormant in A's IPC buffer and might be used by attackers to gain access to A's address space. If B or C is the attacker, then they probably have control over that pointer value.

My other concern of using CHERI PTE bits on IPC buffers as policy mechanism is that you will just enable them all everywhere to simplify user space porting, creating a huge security hole.

As for passing CHERI pointers via IPC calls between threads in the same address space, I think that would be okay if we limit it to registers only (so no IPC buffer passing), preferably via explicit syscall wrappers, to make it an explicit operation by both the receiver and the sender. If user space wants to pass more than 4 CHERI pointers at once, it can do that via memory itself.

@heshamelmatary

Hi Indan,

Thanks for replying.

(I'm answering your questions with a more dynamic system in mind than Microkit.)

Sure, happy to discuss your concerns. I just wanted to mainly stick with a simple current static system that exists and is deployed/used, for the purpose of this RFC and its implementation/evaluation, and just for a start. All your other concerns are definitely valid, but they may take some effort and time to think about, evaluate and reach an agreement on.

and the monitor invalidates its “almighty” CHERI capabilities.

How does it clean up derived CHERI pointers? After startup, the stack and maybe the heap will be littered with stale, but valid CHERI pointers.

I imagine that's the same with the current seL4 root tasks (without CHERI). It will still currently hold all capabilities that the kernel gave it. How does it currently address this issue? I don't know what it does in Microkit at the moment. But I imagine it could suspend/delete itself and become a zombie after it has finished bootstrapping the system? This could also work for CHERI, if there's no way to reach/use those capabilities after the root task is suspended/deleted.

Hence, if the root task isn't part of the TCB, I could think of some solutions here:

  1. Like with non-CHERI root tasks, if it just suspends/deletes itself, stale CHERI + seL4 capabilities won't be reachable.
  2. It could, itself, just zero/invalidate all tags on its stack/heap (and global data) at the end of booting, being trusted at this stage.
  3. It could create another "clean-up helper" task that, once the root task is suspended/done, zeroes all of the root task's memory and/or revokes its untyped objects (using seL4's revocation mechanism).
  4. If the root task is privileged hybrid/legacy (i.e., it's not CHERI's memory-safe and doesn't need/want to be, but just needs to create purecap unprivileged CHERI tasks), it could just manually create/derive CHERI capabilities with some helper functions for its children purecap tasks, and probably prevent saving any capabilities to global memory. In this scenario, it's fine to leave stale CHERI pointers in the root task even, as long as those are never transferred to other tasks after bootstrapping.

An almighty capability is a valid CHERI capability that has full permissions and covers the full address range (e.g., from 0 to 2^64 - 1 on 64-bit systems). CHERI hardware starts running (on reset) with two almighty capabilities: DDC and PCC. Both are used to derive further capabilities with fewer permissions and narrower address ranges. Typically there are one or two for each layer of the system, such as firmware, hypervisor, OS, linkers/loaders, and userspace. Note that no almighty CHERI capability can bypass an existing MMU protection.

How is the initialisation problem solved on BSD/Linux? Does the dynamic loader have unbound DDC/PCC? If so, how does it assure that all copies are gone after init? For that matter, how is the heap implemented? Does the heap manager have a reserved virtual address range which it can use and one pointer for the whole range?

CheriBSD currently keeps almighty/super capabilities in the kernel to construct further capabilities for the user if needed. It doesn't currently revoke or invalidate them and their copies after init. For example, when mmap() is called to perform allocation/mapping, or when ptrace is asked (and is authorised to, see below) to write a register capability to a target process, the kernel uses those root capabilities. While it's currently assumed that the kernel is trusted with those capabilities, there is active research on improving that, even compartmentalising the kernel itself and applying revocation techniques (i.e., revoking those almighty capabilities after init) if needed for temporal safety.

Who gets almighty CHERI capabilities?

loader, kernel and the root task only. Only the kernel could keep them. Almighty capabilities are used to create/derive other capabilities for different address spaces, but only during bootstrapping time.

How are tasks that handle another task's cross-address-space pointers, e.g. debuggers, implemented on such platforms? Does ptrace pass valid CHERI pointers or does it construct them? This BSD ptrace page says they are constructed, why can't we do the same for seL4? That would solve the whole security problem of having CHERI pointers in the wrong address space.

Debuggers/tracers don't pass valid pointers, yes. It could (ask the kernel to) construct them using some system call. We can do the same in seL4. My colleagues working on CheriBSD told me there are currently 3 modes for the kernel to construct/write a capability to a target process using sysctl:

  1. you can set any capability that can be derived from the current register file in the target process
  2. you can set any capability that can be derived from the current memory mappings in the target process
  3. you can set any capability (hence, needs a root capability)

What is the added value of handling valid CHERI pointers in a task with a different address space, instead of having a syscall that creates a valid CHERI pointer in a task's register? It only seems to have security downsides, at best it's just more convenient.

* The root task (monitor), is going to be given full CHERI capability permissions/range that cover the entire user address space. It’s trusted to properly create a security policy and derive further CHERI capabilities for each protection domain.

The concern here is that it can be attacked to exploit (stale) CHERI pointers meant for a different address space (mostly in a scenario where you do more dynamic stuff after init like restarting tasks or reloading programs). That might sound paranoid, but I think it's the right level of wariness people using CHERI on seL4 would have.

That's a valid concern. I agree. Legit question though, how is the root task currently viewed in the currently implemented seL4 systems? Is it part of the TCB? Is it designed and implemented to be part of the attack vector and could be attacked after it boots?

* It creates CHERI capabilities for each “mr” mapping.

* The monitor/root-task itself has permissions to write valid CHERI capabilities to its IPC buffer to perform protection domain setups.

* The monitor will remap its IPC buffer without CHERI permissions to load/store valid CHERI capabilities after it boots.

* It creates CHERI capabilities for code and each protection domain to set up its CHERI captable.

* It disables any tagged CHERI capability propagation across protection domains (and address spaces) by creating IPC buffers without CHERI Page permissions

* It invalidates any almighty CHERI capabilities after performing the load process.

Please let me know what you think.

I'm getting more and more convinced that handling CHERI pointers for another address space with valid CHERI pointers is a broken design.

I totally agree with you: for a complete, secure design with dynamic seL4 systems, this should (and could) be prevented. Hence I discussed solutions like sealing etc. before. I still think, from a practical/implementation PoV, that a static system like Microkit is also a secure implementation if it has a hybrid/legacy root task that itself doesn't care about or need CHERI's memory safety, but just creates CHERI caps for protection domains (which never propagate or receive valid CHERI caps after boot and never hold inter-AS CHERI caps).

I think everything is much simpler and safer if you add a syscall that can construct valid CHERI caps in a task's register from non-CHERI integer arguments. It would require a TCB cap and perhaps also a VSpace cap. That would make it possible to do everything you need, without all the complications. seL4's changes would actually be minimal, both code-wise and semantically. For completeness you also want a syscall that can read the CHERI pointers in a safe, deconstructed way.

That's a good suggestion, and I've been keeping it in the backlog for a while, but now I've started experimenting with its implementation and I'd like your feedback. I imagine that would only be one system call to construct+write valid CHERI pointers that doesn't use the IPC buffer at all, like:

seL4_CheriWriteRegister(seL4_CPtr tcb, seL4_CPtr VSpace, seL4_Word reg_idx, seL4_Register cheri_reg, seL4_Word construct_mode);

  • tcb: a capability to the destination TCB.
  • VSpace: a capability to the VSpace associated with tcb. If it is zero or invalid, cheri_reg is written to the CPU register untagged; a caller that only wants an untagged write can pass zero deliberately. Otherwise, if VSpace is valid, the kernel tries to construct a valid tagged CHERI register from cheri_reg.
  • reg_idx: index of the destination register.
  • cheri_reg: a CHERI-width register value (tagged or not); the kernel always untags it and re-constructs it from another source capability, if one is available.
  • construct_mode: a flag advising the kernel how to construct the destination tagged register, provided the passed VSpace is valid and has permission to do so. Following the CheriBSD modes above, this could be:
  1. 0: No tags: don't try to construct a tagged pointer, write cheri_reg untagged.
  2. 1: Most restrictive: only try to construct it from the current target thread's CPU registers (iterate over the 32 GPRs for RISC-V for instance, starting from the old register's value). Otherwise, write it untagged.
  3. 2: Virtual mapping: try to construct a capability only if the target's VSpace already has a mapping that covers this capability. Otherwise, write it untagged.
  4. 3: Arbitrary: always create a valid CHERI capability.

All current "ReadRegs" syscalls just return untagged capabilities.

I also suggest that each VSpace seL4 capability could get new permission bits analogous to the construct_mode values above. If those permissions are 0, for instance, then regardless of construct_mode, no valid pointers can be constructed. If they are 3, then constructing a valid pointer is always allowed if asked for.
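
To make the interaction between construct_mode and the proposed VSpace permission bits easier to discuss, here is a rough model of the decision the kernel would make. This is a sketch under assumptions, not the implementation: all names (`construct_mode_t`, `target_ctx`, `should_write_tagged`) are hypothetical, capabilities and mappings are reduced to address ranges with permissions ignored, and I assume the VSpace permission is an ordered level so that a request above the permitted level is simply written untagged. The behaviour for intermediate permission values is not settled above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding of the construct_mode values from the list above. */
typedef enum {
    CONSTRUCT_NONE = 0,         /* write untagged */
    CONSTRUCT_FROM_REGS = 1,    /* derive only from the target's registers */
    CONSTRUCT_FROM_MAPPING = 2, /* derive only if the VSpace maps the range */
    CONSTRUCT_ARBITRARY = 3     /* always construct */
} construct_mode_t;

/* Simplified stand-in for the bounds of a capability or a mapping. */
typedef struct {
    uint64_t base;
    uint64_t length;
} addr_range;

/* Model of what the kernel knows about the target thread and its VSpace. */
typedef struct {
    bool vspace_valid;            /* was a valid VSpace cap passed? */
    construct_mode_t vspace_perm; /* assumed: highest mode this cap permits */
    const addr_range *reg_caps;   /* bounds of tagged caps in target registers */
    size_t n_reg_caps;
    const addr_range *mappings;   /* ranges currently mapped in the VSpace */
    size_t n_mappings;
} target_ctx;

/* True if some range in the array fully contains the wanted range. */
static bool covered_by_any(const addr_range *ranges, size_t n, addr_range wanted)
{
    for (size_t i = 0; i < n; i++) {
        if (wanted.base >= ranges[i].base &&
            wanted.length <= ranges[i].length &&
            wanted.base - ranges[i].base <= ranges[i].length - wanted.length) {
            return true;
        }
    }
    return false;
}

/* Decide whether the written register ends up tagged; false means untagged. */
bool should_write_tagged(const target_ctx *t, construct_mode_t requested,
                         addr_range wanted)
{
    if (!t->vspace_valid || requested == CONSTRUCT_NONE) {
        return false;
    }
    if (requested > t->vspace_perm) {
        return false; /* assumption: VSpace permission caps the mode */
    }
    switch (requested) {
    case CONSTRUCT_FROM_REGS:
        return covered_by_any(t->reg_caps, t->n_reg_caps, wanted);
    case CONSTRUCT_FROM_MAPPING:
        return covered_by_any(t->mappings, t->n_mappings, wanted);
    case CONSTRUCT_ARBITRARY:
        return true;
    default:
        return false;
    }
}
```

An alternative would be to clamp the requested mode down to the permitted one instead of falling back to an untagged write; either choice fits the two end points (0 and 3) described above.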

What do you think?

These syscalls can then safely be used to manipulate the new CHERI system registers and create both non-CHERI and CHERI-enabled user space tasks, without the task doing that being able to gain "almighty" DDC/PCC for itself. Cross-address space CHERI pointer passing by accident would be impossible, it could only be done explicitly by tasks with the right permissions via the syscalls, or by enabling the CHERI PTE bits on shared memory.

The problem of using PTE bits for IPC buffers is that it doesn't give explicit control, it's an all or nothing solution. If tasks A and B need to pass CHERI pointers, and task A and C, then very quickly B and C need to be able to pass CHERI pointers too. Worse, the stale pointers for B or C stay dormant in A's IPC buffer and might be used by attackers to gain access to A's address space. If B or C is the attacker, then they probably have control over that pointer value.

My other concern of using CHERI PTE bits on IPC buffers as policy mechanism is that you will just enable them all everywhere to simplify user space porting, creating a huge security hole.

As for passing CHERI pointers via IPC calls between threads in the same address space, I think that would be okay if we limit it to registers only (so no IPC buffer passing), preferably via explicit syscall wrappers, to make it an explicit operation by both the receiver and the sender. If user space wants to pass more than 4 CHERI pointers at once, it can do that via memory itself.

@Indanz
Contributor

Indanz commented Apr 16, 2025

I imagine that's the same with the current seL4 root tasks (without CHERI). It will still currently hold all capabilities that the kernel gave it. How does it currently address this issue?

The main way of dealing with this in seL4 is by modularisation and delegation so that tasks don't have more capabilities than strictly necessary. In seL4, if you delete a capability, all corresponding objects get destroyed, so you need to keep references somewhere. But those references are in CSpace, not littered on the stack and heap. If the CSpace slot becomes invalid, all stale CPath values on stack and heap are unusable.

Another difference is that to misuse seL4 capabilities you need to trick a trusted task to execute specific syscalls on specific capabilities, which is much harder to exploit than arbitrary memory locations. If it handles a bunch of TCBs, its own TCB cap will not be in that list.

With CHERI you can use any CHERI pointer meant for another task to attack the trusted task itself, because the domain it applies to is not stored in the CHERI pointer itself. For seL4 capabilities that would be like being able to use any TCB cap to manipulate the root task itself.
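
The difference can be shown with a toy model in C. It is only an illustration: `cspace`, `cptr`, `cheri_ptr_model` and the functions below are made-up names, not seL4 or CHERI definitions. A CPtr is an index that the kernel resolves through one specific CSpace, so invalidating the slot disarms every copy of that index at once. A CHERI pointer carries its authority in the value itself, with no address-space identifier, so each copy stays usable until that particular copy loses its tag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CSPACE_SLOTS 8

/* Kernel-held slot: user code never holds this, only an index to it. */
typedef struct {
    bool valid;
    int object_id;
} cslot;

typedef struct {
    cslot slots[CSPACE_SLOTS];
} cspace;

/* What user code keeps on its stack and heap: a plain integer. */
typedef size_t cptr;

/* Resolve a CPtr through a given CSpace; fails once the slot is invalid. */
bool cspace_lookup(const cspace *cs, cptr p, int *object_id)
{
    if (p >= CSPACE_SLOTS || !cs->slots[p].valid) {
        return false;
    }
    *object_id = cs->slots[p].object_id;
    return true;
}

/* Invalidate one slot; every stale copy of the index stops working. */
void cspace_delete(cspace *cs, cptr p)
{
    if (p < CSPACE_SLOTS) {
        cs->slots[p].valid = false;
    }
}

/* Model of a CHERI pointer: bounds and tag, but no address-space field. */
typedef struct {
    uint64_t base;
    uint64_t length;
    bool tag;
} cheri_ptr_model;

/* Access check: it cannot tell which address space the pointer was made for. */
bool cheri_ptr_allows(cheri_ptr_model c, uint64_t addr)
{
    return c.tag && addr >= c.base && addr - c.base < c.length;
}
```

In this model, clearing the tag on one `cheri_ptr_model` value leaves any earlier copy of it fully usable, whereas one `cspace_delete` call is enough for the CPtr case.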

I don't know what it does in Microkit at the moment.

Probably it doesn't do much after init. Even restarting faulting tasks is currently not done as far as I know.

But I imagine it could suspend/delete itself and become a zombie after it has finished bootstrapping the system? This could also work for CHERI, if there's no way to reach/use those capabilities after the root task is suspended/deleted.

That would work, yes.

Hence, if the root task isn't part of the TCB, I could think of some solutions here:

Talking about trusted computing base is the wrong way of looking at it: The question isn't whether it's trusted, the question is whether it's attackable, not only now, but also in the future. People generally put too much trust into their own software.

1. Like with non-CHERI root tasks, if it just suspends/deletes itself, stale CHERI + seL4 capabilities won't be reachable.

Fine, but perhaps impractical.

2. It could, itself, just zero/invalidate all tags on its stack/heap (and global data) at the end of booting, being trusted at this stage.

Theoretically sound, practically very hard to get right if the software gets slightly more complicated, as it will need valid CHERI tags to function.

3. It could create another "clean-up helper" task that, once the root task is suspended/done, zeroes all of the root task's memory and/or revokes its untyped objects (using seL4's revocation mechanism).

Doesn't add anything to 1) if the memory isn't shared. If the memory is shared, then the problem spreads to all the tasks sharing the memory.

4. If the root task is privileged hybrid/legacy (i.e., it's not CHERI's memory-safe and doesn't need/want to be, but just needs to create purecap unprivileged CHERI tasks), it could just manually create/derive CHERI capabilities with some helper functions for its children purecap tasks, and probably prevent saving any capabilities to global memory. In this scenario, it's fine to leave stale CHERI pointers in the root task even, as long as those are never transferred to other tasks after bootstrapping.

Agreed. But ask yourself: Why are you using a CHERI system to begin with, if your most privileged task is not protected by it? That is fine to do if you delegate the real work to less privileged tasks, but then those have the same stale CHERI pointer problem.

How is the initialisation problem solved on BSD/Linux? Does the dynamic loader have unbound DDC/PCC? If so, how does it assure that all copies are gone after init? For that matter, how is the heap implemented? Does the heap manager have a reserved virtual address range which it can use and one pointer for the whole range?

CheriBSD currently keeps almighty/super capabilities in the kernel to construct further capabilities for the user if needed. It doesn't currently revoke or invalidate them and their copies after init. For example, when mmap() is called to perform allocation/mapping, or when ptrace is asked (and is authorised to, see below) to write a register capability to a target process, the kernel uses those root capabilities. While it's currently assumed that the kernel is trusted with those capabilities, there is active research on improving that, even compartmentalising the kernel itself and applying revocation techniques (i.e., revoking those almighty capabilities after init) if needed for temporal safety.

So user space can't create arbitrary valid CHERI pointers for different address spaces directly. Then why would we do that in seL4?

I guess CheriBSD only supports mmap()-based heap allocators and not sbrk()-based ones.

Debuggers/tracers don't pass valid pointers, yes. It could (ask the kernel to) construct them using some system call. We can do the same in seL4. My colleagues working on CheriBSD told me there are currently 3 modes for the kernel to construct/write a capability to a target process using sysctl:

1. you can set any capability that can be derived from the current register file in the target process

The target process could do that itself though, why would another task need to do that? What is the use case?

2. you can set any capability that can be derived from the current memory mappings in the target process

Makes sense for BSD/Linux. In seL4 user space has more direct control over memory mappings, so the added value is less there. The check could be replaced with requiring a vspace cap instead, that avoids the need for a relatively complicated page table walk.

3. you can set any capability (hence, needs a root capability)

In seL4 this would require a vspace cap, as that's the closest thing to full control over a task's address space.

Legit question though, how is the root task currently viewed in the currently implemented seL4 systems? Is it part of the TCB? Is it designed and implemented to be part of the attack vector and could be attacked after it boots?

That depends on the system. For very simple systems that don't care about restarting faulting tasks or logging any debug info when it happens, there is no way to interact with the root task, and hence it's secure. It can be as buggy and insecure software as you like, but it can't be used to attack the system as it's unreachable. I don't know whether you would call it part of the trusted compute base or not.

If you do restart tasks, or if you have some communication channel, then things become more precarious. Usually the root task won't do that itself directly, but it would create other threads or processes and delegate to them. At that point the problem moves from the root task to those other tasks. In that sense there is nothing special about the root task, it's just the task that has access to most capabilities.

That's why I keep using a fault handler task as example: Doesn't matter whether it's the root task or something else, if it wants to restart or debug it needs to set or get CHERI pointers. It has interaction with potentially untrusted tasks, so it can be attacked. Same for VM managers.

But with seL4 all the attack vectors are self-induced, you get what you create and there shouldn't be any unexpected interactions (well, ignoring HW based attacks like rowhammer). The main way of securing it is by keeping it as simple as functionally possible.

That's a good suggestion, and I've been keeping it in the backlog for a while, but now I've started experimenting with its implementation and I'd like your feedback. I imagine that would only be one system call to construct+write valid CHERI pointers that doesn't use the IPC buffer at all, like:

seL4_CheriWriteRegister(seL4_Cptr tcb, seL4_Cptr VSpace, seL4_Word reg_idx, seL4_Register cheri_reg, seL4_Word construct_mode);

* **tcb**: is a capability to the destination TCB.

* **VSpace**: is a capability to the VSpace associated with _tcb_. If it's zero or invalid, just write untagged `cheri_reg` in the CPU register; the caller can also use this when it just wants to write an untagged version of the register for whatever reason. Otherwise, if VSpace is valid, try to construct a valid tagged CHERI register from `cheri_reg`.

Good point about not needing vspace for writing invalid cheri pointers, which could be useful for other reasons.

* **reg_idx**: Index to the destination register.

* **cheri_reg**: cheri-width register (could be tagged or not), but the kernel will always untag and re-construct from another source capability if there.

I would probably split this up in two word_t width parameters: One for the register value and one for the CHERI capability value. That avoids any confusion whether this is a valid CHERI pointer or not and would reduce the seL4 kernel changes.

* **construct_mode**: a flag to the kernel to advise it how the destination tagged register is constructed, given the passed VSpace is valid and has permission to do so. Following CheriBSD/sysctl above, this could be:


1. 0: No tags: don't try to construct a tagged pointer, write `cheri_reg` untagged.

We can do this by not passing a vspace.

2. 1: Most restrictive, only try to construct it from the current target thread's CPU registers (iterate over the 32 GPRs for RISC-V for instance, starting from the old register's value). Otherwise, write it untagged.

If we add this, I'd prefer an explicit source register argument instead of trying to guess in the kernel.

3. 2: Virtual mapping, try to construct a capability for it only if the target's VSpace already has a mapping that covers this capability. Otherwise, write it untagged.

I think having a vspace cap is enough, no need to also check whether there is a compatible virtual mapping. User space should know what that mapping is already and could check it in user space. I'd rather not walk the whole page table and check all entries for compatibility. (Does Cheri BSD simplify this by limiting the range to one page or something?)

It also makes it possible to create the necessary cheri pointers before the mappings are done. We can't invalidate existing cheri pointers if a mapping disappears either.

4. 3: Arbitrary: create a valid CHERI capability always.

Could be replaced with tcb + vspace.
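For reference, the four `construct_mode` values discussed above could be modelled roughly as follows. This is only a sketch of the decision logic: the enum names, the context struct and `cheri_should_tag()` are hypothetical and not part of any existing seL4 or CheriBSD API, and the three booleans stand in for checks the kernel would have to perform itself.

```c
#include <stdbool.h>

/* Hypothetical names; the proposal only numbers the modes 0..3. */
typedef enum {
    CHERI_CONSTRUCT_NONE      = 0, /* write the value untagged */
    CHERI_CONSTRUCT_FROM_REGS = 1, /* derive from a capability held in the target's registers */
    CHERI_CONSTRUCT_FROM_MAP  = 2, /* derive only if the target VSpace maps the range */
    CHERI_CONSTRUCT_ARBITRARY = 3  /* always produce a tagged capability */
} cheri_construct_mode_t;

/* Results of checks the kernel would compute; modelled as plain booleans. */
typedef struct {
    bool vspace_valid;     /* caller passed a valid VSpace cap */
    bool reg_source_found; /* a tagged target register covers the request */
    bool mapping_covers;   /* the target VSpace maps the whole requested range */
} cheri_construct_ctx_t;

/* Returns true if the destination register should end up tagged. */
static bool cheri_should_tag(cheri_construct_mode_t mode,
                             const cheri_construct_ctx_t *ctx)
{
    /* Without a valid VSpace cap the value is always written untagged. */
    if (!ctx->vspace_valid) {
        return false;
    }
    switch (mode) {
    case CHERI_CONSTRUCT_FROM_REGS:
        return ctx->reg_source_found;
    case CHERI_CONSTRUCT_FROM_MAP:
        return ctx->mapping_covers;
    case CHERI_CONSTRUCT_ARBITRARY:
        return true;
    case CHERI_CONSTRUCT_NONE:
    default:
        return false;
    }
}
```

Written this way, mode 0 is indistinguishable from not passing a VSpace cap at all, which is the redundancy pointed out in the replies.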

All current "ReadRegs" syscall just return untagged capabilities.

I would split this up in two word_t values too, to keep it consistent.

I also suggest for each VSpace seL4 capability, new permissions bits could be added analogous to the above construct_mode modes. If those permissions are 0, for instance, then regardless of construct_mode, no valid pointers can be constructed. If it's 3, then always allow constructing a valid pointer if asked to.

This is making things unnecessarily complicated with very little gain. Keep in mind you need both a tcb and a vspace cap already. If we need more fine grained permissions, it's on tcb caps, not vspace caps.

What do you think?

It's finally moving in the right direction.

@heshamelmatary

I imagine that's the same with the current seL4 root tasks (without CHERI). It will still currently hold all capabilities that the kernel gave it. How does it currently address this issue?

The main way of dealing with this in seL4 is by modularisation and delegation so that tasks don't have more capabilities than strictly necessary. In seL4, if you delete a capability, all corresponding objects get destroyed, so you need to keep references somewhere. But those references are in CSpace, not littered on the stack and heap. If the CSpace slot becomes invalid, all stale CPath values on stack and heap are unusable.

Another difference is that to misuse seL4 capabilities you need to trick a trusted task to execute specific syscalls on specific capabilities, which is much harder to exploit than arbitrary memory locations. If it handles a bunch of TCBs, its own TCB cap will not be in that list.

With CHERI you can use any cheri pointer meant for another task to attack the trusted task itself, because the domain it applies to is not stored in the cheri pointer itself. For seL4 capabilities that would be like being able to use any TCB cap to manipulate the root task itself.

I don't know what it does in Microkit at the moment.

Probably it doesn't do much after init. Even restarting faulting tasks is currently not done as far as I know.

But I imagine it could suspend/delete itself and become a zombie after it has finished bootstrapping the system? This could also work for CHERI, if there's no way to reach/use those capabilities after the root task is suspended/deleted.

That would work, yes.

Hence, if the root task isn't part of the TCB, I could think of some solutions here:

Talking about trusted computing base is the wrong way of looking at it: The question isn't whether it's trusted, the question is whether it's attackable, not only now, but also in the future. People generally put too much trust into their own software.

1. Like with non-CHERI root tasks, if it just suspends/deletes itself, stale CHERI + seL4 capabilities won't be reachable.

Fine, but perhaps impractical.

2. It could, itself, just zero/invalidate all tags on its stack/heap (and global data) at the end of booting, being trusted at this stage.

Theoretically sound, practically very hard to get right if the software gets slightly more complicated, as it will need valid cheri tags to function.

3. It could create another "clean-up helper" task that, once the root task is suspended/done, zeros all of its memory and/or revokes (using seL4's revocation mechanism) untypes.

Doesn't add anything to 1) if the memory isn't shared. If the memory is shared, then the problem spreads to all the tasks sharing the memory.

4. If the root task is privileged hybrid/legacy (i.e., it's not CHERI's memory-safe and doesn't need/want to be, but just needs to create purecap unprivileged CHERI tasks), it could just manually create/derive CHERI capabilities with some helper functions for its children purecap tasks, and probably prevent saving any capabilities to global memory. In this scenario, it's fine to leave stale CHERI pointers in the root task even, as long as those are never transferred to other tasks after bootstrapping.

Agreed. But ask yourself: Why are you using a CHERI system to begin with, if your most privileged task is not protected by it? That is fine to do if you delegate the real work to less privileged tasks, but then those have the same stale CHERI pointer problem.

You'll still be able to enhance your system's security by creating less privileged memory-safe Microkit protection domains ("children of the root task"). If the root task gets attacked after bootstrapping, best it could do is violate its own memory-safety (if it had/required it in the first place, being purecap), but that won't give it any extra powers (beyond what seL4 caps give it) over the purecap protection domains after boot, given we completely prevent CHERI caps propagation between protection domains in a system like Microkit.

How is the initialisation problem solved on BSD/Linux? Does the dynamic loader have unbound DDC/PCC? If so, how does it assure that all copies are gone after init? For that matter, how is the heap implemented? Does the heap manager have a reserved virtual address range which it can use and one pointer for the whole range?

CheriBSD currently keeps almighty/super capabilities in the kernel to construct further capabilities for the user if needed. It doesn't currently revoke or invalidate them and their copies after init. For example, when mmap() is called to perform allocation/mapping, or when ptrace is asked (and is authorised to, see below) to write a register capability to a target process, they use those root capabilities. While it's currently assumed that the kernel is trusted with those capabilities, there's active research to improve on that, to compartmentalise even the kernel itself and to apply some revocation techniques (i.e., to revoke those almighty capabilities after init) if needed for temporal safety.

So user space can't create arbitrary valid CHERI pointers for different address spaces directly. Then why would we do that in seL4?

I am not suggesting to do so if we introduce the new system call.

I guess CheriBSD only supports mmap()-based heap allocators and not sbrk()-based ones.

Debuggers/tracers don't pass valid pointers, yes. It could (ask the kernel to) construct them using some system call. We can do the same in seL4. My colleagues working on CheriBSD told me there are currently 3 modes for the kernel to construct/write a capability to a target process using sysctl:

1. you can set any capability that can be derived from the current register file in the target process

The target process could do that itself though, why would another task need to do that? What is the use case?

Debuggers: to set a target process' $pcc for example
Fault/exception handlers: to reset or increment $pcc for example after a user-level syscall or a fault.

2. you can set any capability that can be derived from the current memory mappings in the target process

Makes sense for BSD/Linux. In seL4 user space has more direct control over memory mappings, so the added value is less there. The check could be replaced with requiring a vspace cap instead, that avoids the need for a relatively complicated page table walk.

I think it has use cases (especially less restrictive ones) in seL4 as well.

  • For a debugger/fault handler, you might want to still be able to give it more privilege to create a CHERI capability from valid mappings, especially if it failed to find one in the current CPU's run-time snapshot of the target process GPRs.
  • An seL4 task could still have a VSpace capability but not necessarily page capabilities nor virtual mappings to construct/backup any arbitrary CHERI capability, and at the same time, the user may not want to create any arbitrary CHERI capabilities not backed up with currently mapped pages and/or seL4 page capabilities. It's just giving an option to the user and can be configurable per VSpace/TCB caps.

My colleagues also said CheriBSD doesn't do a page table walk (PTW) for it, but rather walks the vm_map_entry structures/reservations.

3. you can set any capability (hence, needs a root capability)

In seL4 this would require a vspace cap, as that's the closest thing to full control over a task's address space.

Yeah but imagine a purecap root task or another task having both TCB and VSpace capabilities to themselves. You still might not want them to call seL4_CheriWriteRegister which, in this case, could give them a super capability, which could break all CHERI memory-safety. So you want a more fine-grained permission in this case to prevent that. i.e., you could still have full control over a task's VSpace, but you may still want to make sure it's unprivileged and memory-safe; you could have full access to your own TCB/VSpace. But you still need a valid CHERI capability to load and/or jump (CFI) to another function for instance. Does that make sense?

Legit question though, how is the root task currently viewed in the currently implemented seL4 systems? Is it part of the TCB? Is it designed and implemented to be part of the attack vector and could be attacked after it boots?

That depends on the system. For very simple systems that don't care about restarting faulting tasks or logging any debug info when it happens, there is no way to interact with the root task, and hence it's secure. It can be as buggy and insecure software as you like, but it can't be used to attack the system as it's unreachable. I don't know whether you would call it part of the trusted compute base or not.

If you do restart tasks, or if you have some communication channel, then things become more precarious. Usually the root task won't do that itself directly, but it would create other threads or processes and delegate to them. At that point the problem moves from the root task to those other tasks. In that sense there is nothing special about the root task, it's just the task that has access to most capabilities.

That's why I keep using a fault handler task as example: Doesn't matter whether it's the root task or something else, if it wants to restart or debug it needs to set or get CHERI pointers. It has interaction with potentially untrusted tasks, so it can be attacked. Same for VM managers.

But with seL4 all the attack vectors are self-induced, you get what you create and there shouldn't be any unexpected interactions (well, ignoring HW based attacks like rowhammer). The main way of securing it is by keeping it as simple as functionally possible.

That's a good suggestion, and I've been keeping it in the backlog for a while, but now I've started experimenting with its implementation and I'd like your feedback. I imagine that would only be one system call to construct+write valid CHERI pointers that doesn't use the IPC buffer at all, like:
seL4_CheriWriteRegister(seL4_Cptr tcb, seL4_Cptr VSpace, seL4_Word reg_idx, seL4_Register cheri_reg, seL4_Word construct_mode);

* **tcb**: is a capability to the destination TCB.

* **VSpace**: is a capability to the VSpace associated with _tcb_. If it's zero or invalid, just write untagged `cheri_reg` in the CPU register; the caller can also use this when it just wants to write an untagged version of the register for whatever reason. Otherwise, if VSpace is valid, try to construct a valid tagged CHERI register from `cheri_reg`.

Good point about not needing vspace for writing invalid cheri pointers, which could be useful for other reasons.

Yeah, I thought it'd be useful for backward compatibility as well (e.g., non-CHERI tasks that might want to use this syscall).

* **reg_idx**: Index to the destination register.

* **cheri_reg**: cheri-width register (could be tagged or not), but the kernel will always untag and re-construct from another source capability if there.

I would probably split this up in two word_t width parameters: One for the register value and one for the CHERI capability value. That avoids any confusion whether this is a valid CHERI pointer or not and would reduce the seL4 kernel changes.

It doesn't really matter if the user passes it tagged or not. It'll always get untagged by the kernel. This is just 1) to enforce the CHERI ABI, with hardware-width registers and the format actually being that of seL4_Register, and 2) for performance, so as not to unnecessarily split it into two 64-bit args.
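To make the two positions concrete, here is a minimal sketch of what "splitting" a CHERI-width register into two word-sized values amounts to. It assumes a 128-bit capability whose integer value sits in the low 8 bytes of the in-memory image; the type and function names are hypothetical, and the metadata word is treated as opaque because its encoding is hardware-defined.

```c
#include <stdint.h>
#include <string.h>

/* Untagged in-memory image of a 128-bit capability register (assumed width). */
typedef struct {
    uint8_t bytes[16];
} cheri_reg_image_t;

/* The two word-sized values suggested as separate syscall arguments. */
typedef struct {
    uint64_t value;    /* integer (address) part of the register */
    uint64_t cap_meta; /* bounds/permissions bits, opaque here */
} cheri_reg_words_t;

/* Split the image; assumes the value is stored in the low 8 bytes. */
static cheri_reg_words_t cheri_reg_split(const cheri_reg_image_t *img)
{
    cheri_reg_words_t w;
    memcpy(&w.value, img->bytes, sizeof w.value);
    memcpy(&w.cap_meta, img->bytes + sizeof w.value, sizeof w.cap_meta);
    return w;
}

/* Inverse of cheri_reg_split(): rebuild the untagged image. */
static cheri_reg_image_t cheri_reg_join(cheri_reg_words_t w)
{
    cheri_reg_image_t img;
    memcpy(img.bytes, &w.value, sizeof w.value);
    memcpy(img.bytes + sizeof w.value, &w.cap_meta, sizeof w.cap_meta);
    return img;
}
```

Either form carries the same 128 bits and neither carries a tag, so the choice is about ABI shape and how much the kernel interface changes rather than about what information crosses the syscall boundary.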

* **construct_mode**: a flag to the kernel to advise it how the destination tagged register is constructed, given the passed VSpace is valid and has permission to do so. Following CheriBSD/sysctl above, this could be:


1. 0: No tags: don't try to construct a tagged pointer, write `cheri_reg` untagged.

We can do this by not passing a vspace.

2. 1: Most restrictive, only try to construct it from the current target thread's CPU registers (iterate over the 32 GPRs for RISC-V for instance, starting from the old register's value). Otherwise, write it untagged.

If we add this, I'd prefer an explicit source register argument instead of trying to guess in the kernel.

Sure. Just to confirm: a source register index in the target process, right? If so, do we still want to give the user the option to iterate over GPRs if the source register isn't valid or big enough? Or if the user doesn't want to choose a source arg, or doesn't know which one to choose? We could probably add a flag to construction_mode to iterate over GPRs or not. Similarly, what is the priority of where/how to get a source CHERI capability to construct the new one from, in case all is allowed? E.g., first try this source register; if that fails, try the remaining GPRs; if nothing is there, do the page mapping thing.

3. 2: Virtual mapping, try to construct a capability for it only if the target's VSpace already has a mapping that covers this capability. Otherwise, write it untagged.

I think having a vspace cap is enough, no need to also check whether there is a compatible virtual mapping. User space should know what that mapping is already and could check it in user space. I'd rather not walk the whole page table and check all entries for compatibility. (Does Cheri BSD simplify this by limiting the range to one page or something?)

It's just an extra option/guarantee to only construct a capability if there's a page mapping backing it. It could be useful, as I mentioned above, if the current target process' GPR snapshot doesn't happen to have a capability to derive from, but it still has a mapping for it. It could reduce a debugger's privilege, so that it doesn't need to request a root capability at all (the confinement problem?) that would allow it to construct arbitrary capabilities in the target process. I don't have a strong bias to add/support that at the moment, so maybe just 1 and 2.

But I could imagine (and I think it's actually necessary if we completely ditch inter-AS super capabilities in user space) a more sophisticated fine-grained option to allow the user to pass a set of seL4 page capabilities to this syscall and construct a single CHERI capability out of them. This could solve an issue for a user-level mmap() server as well, where you need to return a bigger-than-an-seL4-page-size CHERI capability off an array of contiguous pages. I actually came across this issue in sel4test when it tries to allocate/map sel4utils_map_pages() for length size and I just needed it to construct/return a single CHERI cap for it (as it just returns a pointer to the allocated/mapped region).

It also makes it possible to create the necessary cheri pointers before the mappings are done. We can't invalidate existing cheri pointers if a mapping disappears either.

4. 3: Arbitrary: create a valid CHERI capability always.

Could be replaced with tcb + vspace.

Yeah but let me know about the situation above where a purecap task might have a tcb + vspace itself but you still don't want it to create arbitrary super CHERI capabilities?

All current "ReadRegs" syscall just return untagged capabilities.

I would split this up in two word_t values too, to keep it consistent.

I'd like to stick with enforcing the CHERI ABI by returning an untagged CHERI-width register, and probably return an additional byte for the tag and extra info if needed. This is what CheriBSD does in ptrace.

I also suggest for each VSpace seL4 capability, new permissions bits could be added analogous to the above construct_mode modes. If those permissions are 0, for instance, then regardless of construct_mode, no valid pointers can be constructed. If it's 3, then always allow constructing a valid pointer if asked to.

This is making things unnecessarily complicated with very little gain. Keep in mind you need both a tcb and a vspace cap already. If we need more fine grained permissions, it's on tcb caps, not vspace caps.

I provided some reasons why those permissions might be necessary above, please let me know what you think. I also thought these new permissions could just be in the tcb cap as well, but then I thought it makes more sense to bind them to a VSpace, as a CHERI capability is really bound to an AS/VSpace rather than to a TCB.

What do you think?

It's finally moving in the right direction.

@heshamelmatary

I anticipate if we decide to implement this new system call, we will also need a similar one (or integrate it in the above syscall) for writing tagged capabilities in the target process' AS memory, besides writing to registers. This is, for example, to enable user-level ELF loaders and process creators to initialise CHERI's capability table for each process/ELF/protection domain before it starts executing.

We will also likely need to stick those permissions to VSpace and probably also pass page capabilities to those system calls in scenarios like:

  1. Initialise CHERI's capability table per ELF before a process executes for the first time, where there are no current valid CHERI capabilities in any GPRs (and a TCB could even be created later). This requires writing valid inter-AS CHERI capabilities for the children. This operation doesn't necessarily need an seL4 TCB capability, but just a VSpace and page capabilities.
  2. Create/write a CHERI capability for the process' entry point and/or stack to a PC register.

Giving a "forging" permission to a VSpace to arbitrarily manufacture/write any CHERI capability (i.e., option #3 above, even if they have a VSpace cap) could be considered less secure. This is compared to restricting "deriving" new CHERI capabilities to having existing GPRs to derive valid CHERI caps from (which doesn't apply in the previous two scenarios), or from VSpace/page capabilities.

@Indanz
Contributor

Indanz commented Apr 23, 2025

The target process could do that itself though, why would another task need to do that? What is the use case?

Debuggers: to set a target process' $pcc for example Fault/exception handlers: to reset or increment $pcc for example after a user-level syscall or a fault.

Fair enough. Any examples where you would want to use a different source register than the destination one? If not, we could make this the default behaviour if no vspace cap is given.

I think it has use cases (especially less restrictive ones) in seL4 as well.

I'm not convinced it's worth it.

My colleagues also said CheriBSD doesn't do a page table walk (PTW) for it, but rather walks the vm_map_entry structures/reservations.

Yes, that's exactly the part that user space is supposed to handle in seL4. So if you want something similar, you have to implement it in user space. The seL4 kernel has no additional metadata for mappings, other than what's stored in page caps.

For the kernel to do a range check, it needs to do a depth first walk of the page table and check all permissions along the way.
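As an illustration of doing that bookkeeping in user space instead, a task that creates the mappings could keep its own sorted list of them and check coverage there. This is a minimal sketch under the assumption that the list is sorted by base address and non-overlapping; `mapping_t` and `range_is_mapped()` are hypothetical names, not an existing seL4 library interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One user-space record per mapping the task has created (hypothetical). */
typedef struct {
    uint64_t base;     /* first mapped virtual address */
    uint64_t size;     /* length in bytes */
    bool     writable; /* whether the mapping allows writes */
} mapping_t;

/*
 * Returns true if [base, base + len) is fully covered by the mappings.
 * maps[] must be sorted by base and must not overlap.
 */
static bool range_is_mapped(const mapping_t *maps, size_t n,
                            uint64_t base, uint64_t len, bool need_write)
{
    uint64_t cur = base;
    uint64_t end = base + len;

    /* Reject empty ranges and ranges that wrap around. */
    if (len == 0 || end < base) {
        return false;
    }
    for (size_t i = 0; i < n && cur < end; i++) {
        uint64_t m_end = maps[i].base + maps[i].size;
        if (m_end <= cur) {
            continue;      /* mapping lies entirely before the range */
        }
        if (maps[i].base > cur) {
            return false;  /* gap before the next mapping */
        }
        if (need_write && !maps[i].writable) {
            return false;  /* insufficient permissions */
        }
        cur = m_end;       /* covered up to the end of this mapping */
    }
    return cur >= end;
}
```

This is a linear scan over metadata the task already owns, whereas the in-kernel equivalent would need the page table walk described above.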

3. you can set any capability (hence, needs a root capability)

In seL4 this would require a vspace cap, as that's the closest thing to full control over a task's address space.
Yeah but imagine a purecap root task or another task having both TCB and VSpace capabilities to themselves. You still might not want them to call seL4_CheriWriteRegister which, in this case, could give them a super capability, which could break all CHERI memory-safety. So you want a more fine-grained permission in this case to prevent that. i.e., you could still have full control over a task's VSpace, but you may still want to make sure it's unprivileged and memory-safe; you could have full access to your own TCB/VSpace. But you still need a valid CHERI capability to load and/or jump (CFI) to another function for instance. Does that make sense?

Not really. Everything you say also applies to other vspace operations, I don't see why CHERI would be an exception here.

Good point about not needing vspace for writing invalid cheri pointers, which could be useful for other reasons.

Yeah, I thought it'd be useful for backward compatibility as well (e.g., non-CHERI tasks that might want to use this syscall).

Ideally the new syscalls would be generic individual register read/write syscalls, with enough flexibility to implement what CHERI needs. That is, the non-CHERI generic syscalls should be added too. The only difference would be additional arguments for CHERI. That's another reason to keep everything the same with non-tagged pointer arguments.

It doesn't really matter if the user passes it tagged or not. It'll always get untagged by the kernel. This is just to 1) enforce CHERI ABI with hardware-width registers and format actually being of seL4_Register

This is exactly what I'm trying to avoid.

  2. for performance, not to unnecessarily split it into two 64-bit args.

None of this is performance critical and one or two cycles extra won't make any difference anyway.

If we add this, I'd prefer an explicit source register argument instead of trying to guess in the kernel.

Sure. Just to confirm a source register index in the target process, right?

Of course.

If so, do we still want to give the user the option to iterate over GPRs if the source register isn't valid or big enough?

No? Do you mean you want to combine multiple overlapping/adjacent CHERI pointers into one that covers both ranges? Does CheriBSD support that? That seems very obscure functionality.

Or if the user doesn't want to or know what source arg to choose?

The user can retrieve all register values and iterate over them itself if it wants to, it doesn't need the kernel to do that.
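A sketch of what that user-space iteration might look like, assuming the register read interface reports each register's bounds and tag bit (the tag would have to be returned out of band, since the values themselves come back untagged). `cap_bounds_t` and `find_source_reg()` are hypothetical, and permissions are ignored for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded view of one target register as seen by the caller (hypothetical). */
typedef struct {
    uint64_t base;   /* lower bound of the capability */
    uint64_t length; /* size of the bounded region in bytes */
    bool     tag;    /* whether the register held a valid capability */
} cap_bounds_t;

/*
 * Returns the index of the first tagged register whose bounds cover
 * [base, base + len), or -1 if none does.
 */
static int find_source_reg(const cap_bounds_t *regs, size_t n,
                           uint64_t base, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (!regs[i].tag) {
            continue;
        }
        /* Overflow-safe form of: base + len <= regs[i].base + regs[i].length */
        if (base >= regs[i].base &&
            len <= regs[i].length &&
            base - regs[i].base <= regs[i].length - len) {
            return (int)i;
        }
    }
    return -1;
}
```

The returned index could then be passed as the explicit source register argument, keeping the search policy out of the kernel.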

We could probably add a flag to construction_mode to iterate over GPRs or not. Similarly, what the priority of where/how to get a source CHERI capability to construct the new one from, in case all is allowed. e.g., first try this source register, if not, try remaining GPRs, but if nothing there, do the page mapping thing.

This is overcomplicating things unnecessarily, which is what I am trying to avoid.

But I could imagine (and I think it's actually necessary if we completely ditch inter-AS super capabilities in user space) a more sophisticated fine-grained option to allow the user to pass a set of seL4 page capabilities to this syscall and construct a single CHERI capability out of them. This could solve an issue for a user-level mmap() server as well, where you need to return a bigger-than-an-seL4-page-size CHERI capability off an array of contiguous pages. I actually came across this issue in sel4test when it tries to allocate/map sel4utils_map_pages() for length size and I just needed it to construct/return a single CHERI cap for it (as it just returns a pointer to the allocated/mapped region).

Again, overcomplication with not much gain. For user-level mmap someone has to map the memory, and for that it already has the vspace cap. Your solution avoids the vspace cap, but that isn't the problem: What you want to avoid in a pure mmap implementation is the extra tcb cap you need for CHERI, but that's not possible when writing registers.

I'd like to stick with enforcing the CHERI ABI by returning an untagged CHERI-width register, and probably return an additional byte for the tag and extra info if needed. This is what CheriBSD does in ptrace.

Then I'll probably complain about it when I review the code, except if you manage to keep it very simple.

@Indanz
Contributor

Indanz commented Apr 23, 2025

I anticipate if we decide to implement this new system call, we will also need a similar one (or integrate it in the above syscall) for writing tagged capabilities in the target process' AS memory, besides writing to registers.

My initial reaction is: Absolutely not.

(If we do this, it's by adding generic cross-AS memory read/write syscalls, again with an extension for CHERI.)

This is, for example, to enable user-level ELF loaders and process creators to initialise CHERI's capability table for each process/ELF/protection domain before it starts executing.

How is this problem solved by CheriBSD and why can't we use the same solution?

2. Create/write a CHERI capability for the process' entry point and/or stack to a PC register.

This can easily be solved by passing those via registers during task startup, no need for cross-AS memory writes.

You can pass quite a few registers at launch and then run special code that does all the memory initialisation you need before normal startup. This can be special code that gets mapped at load and unmapped when launch is done, making it fully invisible to normal user space code, with any ABI you want there.

Alternatively, you have a launcher task with almighty CHERI permission, which can write the memory itself directly. As all it does is loading ELF files, it has no direct interaction with untrusted tasks. Even if it does get compromised, there is nothing sensitive in its own address space, nor anything to exploit, as all it does is loading an ELF file and you can already control its behaviour via the ELF file you feed it. This is probably closest to how CheriBSD does this.

@heshamelmatary

The target process could do that itself though, why would another task need to do that? What is the use case?

Debuggers: to set a target process' $pcc for example Fault/exception handlers: to reset or increment $pcc for example after a user-level syscall or a fault.

Fair enough. Any examples where you would want to use a different source register than the destination one? If not, we could make this the default behaviour if no vspace cap is given.

No examples come to mind. But I am thinking what if the caller just wants to invalidate the tag of the target register while keeping its value.

I think it has use cases (especially less restrictive ones) in seL4 as well.

I'm not convinced it's worth it.

My colleagues also said CheriBSD doesn't do a page table walk (PTW) for it, but rather walks the vm_map_entry structures/reservations.

Yes, that's exactly the part that user space is supposed to handle in seL4. So if you want something similar, you have to implement it in user space. The seL4 kernel has no additional metadata for mappings, other than what's stored in page caps.

For the kernel to do a range check, it needs to do a depth first walk of the page table and check all permissions along the way.

3. you can set any capability (hence, needs a root capability)

In seL4 this would require a vspace cap, as that's the closest thing to full control over a task's address space.

Yeah but imagine a purecap root task or another task having both TCB and VSpace capabilities to themselves. You still might not want them to call seL4_CheriWriteRegister which, in this case, could give them a super capability, which could break all CHERI memory-safety. So you want a more fine-grained permission in this case to prevent that. i.e., you could still have full control over a task's VSpace, but you may still want to make sure it's unprivileged and memory-safe; you could have full access to your own TCB/VSpace. But you still need a valid CHERI capability to load and/or jump (CFI) to another function for instance. Does that make sense?

Not really. Everything you say also applies to other vspace operations, I don't see why CHERI would be an exception here.

Good point about not needing vspace for writing invalid cheri pointers, which could be useful for other reasons.

Yeah, I thought it'd be useful for backward compatibility as well (e.g., non-CHERI tasks that might want to use this syscall).

Ideally the new syscalls would be generic individual register read/write syscalls, with enough flexibility to implement what CHERI needs. That is, the non-CHERI generic syscalls should be added too. The only difference would be additional arguments for CHERI. That's another reason to keep everything the same with non-tagged pointer arguments.

It doesn't really matter if the user passes it tagged or not. It'll always get untagged by the kernel. This is just to 1) enforce CHERI ABI with hardware-width registers and format actually being of seL4_Register

This is exactly what I'm trying to avoid.

  2. for performance, to avoid unnecessarily splitting it into two 64-bit args.

None of this is performance critical and one or two cycles extra won't make any difference anyway.

If we add this, I'd prefer an explicit source register argument instead of trying to guess in the kernel.

Sure. Just to confirm a source register index in the target process, right?

Of course.

If so, do we still want to give the user the option to iterate over GPRs if the source register isn't valid or big enough?

No? Do you mean you want to combine multiple overlapping/adjacent CHERI pointers into one that covers both ranges? Does CheriBSD support that? That seems very obscure functionality.

No that's not what I meant. It's just to find a big enough CHERI GPR to construct the new capability from.

Or if the user doesn't want to or know what source arg to choose?

The user can retrieve all register values and iterate over them itself if it wants to, it doesn't need the kernel to do that.

We could probably add a flag to construction_mode to iterate over GPRs or not. Similarly, we would need to decide the priority of where/how to get a source CHERI capability to construct the new one from, in case everything is allowed. E.g., first try this source register; if that fails, try the remaining GPRs; and if nothing is there, do the page mapping thing.

This is overcomplicating things unnecessarily, which is what I am trying to avoid.

But I could imagine (and I think it's actually necessary if we completely ditch inter-AS super capabilities in user) a more sophisticated fine-grained option to allow the user to pass a set of seL4 page capabilities to this syscall and construct a single CHERI capability out of them. This could solve an issue for a user-level mmap() server as well, where you need to return a bigger-than-an-seL4-page-size CHERI capability off an array of contiguous pages. I actually came across this issue in sel4test when it tries to allocate/map sel4utils_map_pages() for length size and I just needed it to construct/return a single CHERI cap for it (as it just returns a pointer to the allocated/mapped region).

Again, overcomplication with not much gain. For user-level mmap someone has to map the memory, and for that it already has the vspace cap. Your solution avoids the vspace cap, but that isn't the problem: What you want to avoid in a pure mmap implementation is the extra tcb cap you need for CHERI, but that's not possible when writing registers.

Not trying to avoid vspace. It will still require TCB, VSpace, and page caps; you can still have a VSpace cap but without page caps to map to this vspace. My argument is, having a VSpace cap might not be enough (of a privilege) to construct valid arbitrary CHERI caps from to cover a big range of contiguous virtual memory. In other words, can an seL4 thread (e.g., a debugger) that has a VSpace (and only that) for different thread/AS read/write any of its memory without page mappings/caps? Similarly for constructing CHERI caps only with a VSpace.

In any case, we need a way to construct valid CHERI capabilities for bigger-than-a-page mapping, for both a debugger and a target thread (e.g., when creating a new protection domain from scratch, without having any existing valid CHERI GPRs).

I'd like to stick with enforcing the CHERI ABI by returning an untagged CHERI-width register, and probably return an additional byte for the tag and extra info if needed. This is what CheriBSD does in ptrace.

Then I'll probably complain about it when I review the code, except if you manage to keep it very simple.

@heshamelmatary

I anticipate if we decide to implement this new system call, we will also need a similar one (or integrate it in the above syscall) for writing tagged capabilities in the target process' AS memory, besides writing to registers.

My initial reaction is: Absolutely not.

(If we do this, it's by adding generic cross-AS memory read/write syscalls, again with an extension for CHERI.)

Yeah that's exactly what I meant. I don't mean for a thread to hold inter-AS capabilities, but just exactly the same as constructing valid register caps for a target thread, we construct valid CHERI caps for a target thread and save it to its memory. I don't care if it's the same syscall or a separate one.

This is, for example, to enable user-level ELF loaders and process creators to initialise CHERI's capability table for each process/ELF/protection domain before it starts executing.

How is this problem solved by CheriBSD and why can't we use the same solution?

ELF loading is done by the kernel on fork/exec, unlike here. The kernel has root caps that it constructs CHERI caps from (for each ELF segment), then passes them to the user in its stack as expected by a UNIX process' auxv. The user's startup code constructs the captable from these CHERI caps passed in auxv[] and friends. We could do the same for seL4 (though we need to define a new ABI for native CHERI-seL4 threads including a stack layout etc), but whoever is loading the ELF and creating/starting the new thread will need to write inter-AS CHERI caps to the new thread's stack. Passing them in registers (without writing to the stack) is feasible, but isn't as portable/generic as auxv[] in cases where we might have more ELF segments than available GPRs or more fine-grained CHERI caps that aren't just used for constructing the captable.

2. Create/write a CHERI capability for the process' entry point and/or stack to a PC register.

This can easily be solved by passing those via registers during task startup, no need for cross-AS memory writes.

You can pass quite a few registers at launch and then run special code that does all the memory initialisation you need before normal startup. This can be special code that gets mapped before at load and unmapped when launch is done, making it fully invisible to normal user space code, having any ABI you want there.

Alternatively, you have a launcher task with almighty CHERI permission, which can write the memory itself directly. As all it does is loading ELF files, it has no direct interaction with untrusted tasks. Even if it does get compromised, there is nothing sensitive in its own address space, nor anything to exploit, as all it does is loading an ELF file and you can already control its behaviour via the ELF file you feed it. This is probably closest to how CheriBSD does this.

@Indanz
Contributor

Indanz commented May 6, 2025

No examples come to mind. But I am thinking what if the caller just wants to invalidate the tag of the target register while keeping its value.

That seems obscure functionality. But you can already do this by writing the register twice, both times with no vspace given: Once to put it out of range and hence invalidate it, second time to restore the original value.

Or the valid tag can be an extra parameter to the new syscall.
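The write-twice trick can be sketched with a toy register model. This is only an illustration under stated assumptions: `CapReg`, `write_address` and `clear_tag_keep_value` are hypothetical names, not part of any proposed seL4 API, and the model assumes that a write without a vspace cap changes only the address, keeps the tag only while the address stays inside the bounds, and can never set the tag again. Real CHERI hardware uses a representable range that is wider than the bounds; the sketch collapses the two for simplicity.

```python
from dataclasses import dataclass


@dataclass
class CapReg:
    """Toy capability register: an address plus bounds and a validity tag."""
    addr: int
    base: int
    top: int
    tag: bool


def write_address(reg, new_addr):
    """Model a register write made without a vspace cap (assumed semantics).

    The write only changes the address. The tag survives only if it was
    already set and the new address stays inside [base, top); this path
    never sets a cleared tag again.
    """
    in_bounds = reg.base <= new_addr < reg.top
    return CapReg(new_addr, reg.base, reg.top, reg.tag and in_bounds)


def clear_tag_keep_value(reg):
    """Invalidate the tag while ending up with the original address."""
    original = reg.addr
    out_of_range = reg.top + 1                 # any address outside the toy bounds
    reg = write_address(reg, out_of_range)     # first write clears the tag
    return write_address(reg, original)        # second write restores the value
```

Starting from a tagged register at address `0x1000` with bounds `[0x1000, 0x2000)`, `clear_tag_keep_value` returns a register with the same address and the tag cleared, while a single in-bounds write leaves the tag intact.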

Not trying to avoid vspace. It will still require TCB, VSpace, and page caps; you can still have a VSpace cap but without page caps to map to this vspace. My argument is, having a VSpace cap might not be enough (of a privilege) to construct valid arbitrary CHERI caps from to cover a big range of contiguous virtual memory.
In other words, can an seL4 thread (e.g., a debugger) that has a VSpace (and only that) for different thread/AS read/write any of its memory without page mappings/caps? Similarly for constructing CHERI caps only with a VSpace.

Of course not. VSpace gives control over someone else's virtual memory space, it doesn't automatically give you access to it yourself. It's for managing another task's memory, having that right doesn't automatically grant you access to the same memory. This was one of my main arguments against managing cross-AS Cheri pointers the way you originally proposed. And this is also why adding cross-task memory read/write syscalls is not straightforward, as it has the same problem.

But having a tcb and vspace cap only gives you the right to manage another task's Cheri register tags, it doesn't give you access to valid Cheri pointers. It is the tcb cap that gives you read and write access to the other task's registers, the vspace cap is an additional requirement we are going to add to manage Cheri tags. But you're still managing registers, not memory.
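The access rule described here can be written down as a tiny decision function. This is a sketch of the rule as stated in this comment, not of any existing kernel code: `write_register` and its boolean parameters are hypothetical stand-ins for the real cap checks.

```python
def write_register(has_tcb, has_vspace, value, want_tag):
    """Return the (value, tag) pair that ends up in the target register.

    Assumed rule: the tcb cap is what grants register access at all, and the
    vspace cap is the extra requirement for producing a tagged (valid) value.
    """
    if not has_tcb:
        raise PermissionError("a tcb cap is required to write registers")
    tag = bool(want_tag and has_vspace)   # without vspace the value is written untagged
    return value, tag
```

With only a tcb cap the caller can still set the register value, but the result is always untagged; neither branch gives the caller access to the target's memory.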

Perhaps, as an exception, we could allow tasks to create Cheri pointers for themselves if they have a vspace cap but no tcb cap. But not sure if that is a good idea.

In any case, we need a way to construct valid CHERI capabilities for bigger-than-a-page mapping, for both a debugger and a target thread (e.g., when creating a new protection domain from scratch, without having any existing valid CHERI GPRs).

Of course, but we have the new syscall for that and if you have a tcb and vspace cap you can create any arbitrary ranges, it's not limited to page sizes. What's missing is a good reason to add fine-grained permissions to vspaces for Cheri, or to limit tag creation to the page table mapping.

Debuggers are a non-issue, as they would need tcb caps already. Anything creating new processes already has tcb and vspace caps. So I don't see a strong argument in favour of complicating things without much gain. (I mean on the kernel side, I'm fine with complicating user space startup if it keeps the kernel simpler.)

ELF loading is done by the kernel on fork/exec, unlike here.

What about dynamically linked applications? I know Linux starts the linker up in that case instead of loading all libraries itself.

The kernel has root caps that it constructs CHERI caps from (for each ELF segment), then passes them to the user in its stack as expected by a UNIX process' auxv. The user's startup code constructs the captable from these CHERI caps passed in auxv[] and friends. We could do the same for seL4 (though we need to define a new ABI for native CHERI-seL4 threads including a stack layout etc), but whoever is loading the ELF and creating/starting the new thread will need to write inter-AS CHERI caps to the new thread's stack. Passing them in registers (without writing to the stack) is feasible, but isn't as portable/generic as auxv[] in cases where we might have more ELF segments than available GPRs or more fine-grained CHERI caps that aren't just used for constructing the captable.

Nothing about seL4 is portable, whatever you do you'll need seL4 specific launch code. And like I said, if it doesn't fit in registers, you can always have a launcher that has super/root pointers that creates what's necessary in its own address space, just like monolithic kernels do. Or add a launch protocol where the init code can request more Cheri pointers when it's done with the current set and needs more to complete init. There are many (admittedly annoying) ways of solving this in user space.
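As a rough illustration of the "request more when done" launch protocol, the sketch below models a launcher handing out pointers in register-sized batches until the init code has everything. The names `batches` and `run_init`, and the idea of one batch per round, are assumptions made for this example only; a real protocol would go through IPC or register writes rather than a Python generator.

```python
def batches(caps, regs_available):
    """Yield successive batches of at most `regs_available` capabilities."""
    if regs_available <= 0:
        raise ValueError("need at least one register per round")
    for start in range(0, len(caps), regs_available):
        yield caps[start:start + regs_available]


def run_init(caps, regs_available):
    """Model init code that installs one batch, then asks for the next."""
    installed = []
    rounds = 0
    for batch in batches(caps, regs_available):
        installed.extend(batch)   # init stores this batch before requesting more
        rounds += 1
    return installed, rounds
```

With ten pointers and four usable registers per round, the init code needs three rounds to receive everything, in the original order.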
