Lecture 5:
Kernel Extensions
CSC 469 / CSC 2208
Spring 2024
Discussion Questions (from last time)
• Why use a microkernel if your Linux applications run 5-7% slower than on a
native Linux kernel?
• What other benefits or advantages might arise from the small L4 code
size?
CSC469 2 Lecture 5
How far can we take minimality?
• Microkernels: minimal set of abstractions and mechanisms
• Exokernel: MIT Research project
• Claim: OS abstractions are bad because they:
• deny application-specific optimizations;
• discourage innovation;
• impose “mandatory costs”.
• Solution: Separate concept of protection from abstraction and management
• Follows the end-to-end principle: expose the fewest hardware abstractions possible
• Exokernel is basically a secure resource multiplexor
• Applications link directly with a library that provides OS functions (libOS)
• Drawbacks?
Kernel comparison
• Monolithic
+ performance
- difficult to debug and maintain
• Microkernel
+ more reliable and secure
- performance overhead
• Exokernels
+ minimal and simple
- more work for application developers
Going farther…
• Exokernel drops OS abstractions, multiplexes hardware
• Much like an older strategy… Virtual Machines
• Place thin layer of software directly above hardware
• virtual machine monitor (VMM, aka hypervisor)
• Exports raw hardware interface
• OS/application above sees “virtual” machine identical to underlying physical
machine
• VMM multiplexes virtual machines
• We will explore this concept next time
• How can we add or modify OS functionality without complete redesign?
OS Extensions
• Adding new functionality to the OS “on the fly”
• Why?
• Fixing mistakes
• Supporting new features or hardware
• Efficiency / Custom implementations
• How?
• Allow some OS function to run outside the kernel (μkernel)
• Give everyone their own virtual machine (VMs)
• Allow users to modify the OS (e.g., modules)
Loadable Kernel Modules
• Giving everyone a virtual machine doesn’t entirely solve the extension
problem
• You can run what you want on your VM, but do you really want to write a custom
OS?
• Often just want to modify/replace small part
• Solution: Allow parts of the kernel to be dynamically loaded / unloaded
• Requires dynamic relocation and linking
• Common strategy in monolithic kernels for device drivers (FreeBSD,
Windows, Linux)
Linux Loadable Kernel Modules
• Module writer must define (at least) two functions
• init_module – code executed when module loads
• cleanup_module – code executed when module unloads
• Module functions can refer to any exported kernel symbols
• Module is compiled into relocatable .ko file (since 2.6)
• Requires kernel source tree for kernel that module will be loaded into
[Figure: building mymodule.ko]
• mymodule.c: #include <linux/module.h>; init_module() { … }; cleanup_module() { … }
• Makefile: obj-m := mymodule.o
• Build against the kernel source headers and build environment: make -C $KDIR M=$PWD
• Resulting mymodule.ko sections: .text, .init.text, .modinfo, __versions
• insmod command loads module into running kernel
• 2.4 – insmod (at user level) resolves references to kernel symbols
• 2.6 – invokes syscall, kernel does the linking
• rmmod command removes module from kernel
• lsmod command lists currently-installed modules
• modprobe is a wrapper command that checks module dependencies and
loads additional required modules
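[Figure: module load and unload paths across the user/kernel boundary]
• Load: insmod mymodule.ko (user) -> sys_init_module (kernel): copy_module_from_user, check versions, check_modinfo, call module_init
• Unload: rmmod mymodule.ko (user) -> sys_delete_module (kernel): check module dependencies, check reference count, call module_cleanup
• mymodule.ko sections shown: .text, .init.text, .modinfo, __versions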
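The linking step (resolving a module's references to exported kernel symbols) can be sketched in user space. This is a toy illustration, not the kernel's loader: struct ksym, resolve_symbol and link_module are hypothetical names, and the symbol table is an ordinary array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical exported-symbol table entry: a name and its address. */
struct ksym {
    const char *name;
    void *addr;
};

/* Look up one name in the exported table; NULL if it is not exported. */
void *resolve_symbol(const struct ksym *table, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].addr;
    return NULL;
}

/*
 * Resolve every undefined reference a module carries.
 * Returns 0 on success, -1 if any symbol is missing (the load must fail).
 */
int link_module(const struct ksym *table, size_t n,
                const char *const *undef, void **out, size_t n_undef)
{
    assert(table != NULL && undef != NULL && out != NULL);
    for (size_t i = 0; i < n_undef; i++) {
        out[i] = resolve_symbol(table, n, undef[i]);
        if (out[i] == NULL)
            return -1;
    }
    return 0;
}
```

A module that refers to a symbol the kernel does not export cannot be linked, which is why the load fails rather than leaving a dangling reference.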
Tracking Modules
• Kernel has a linked list of module objects
• struct contained in the module memory itself
[Figure: linked list of struct module objects; each node holds state, list, name, …, ref, …, modules_which_use_me, …]
Relevant struct module fields:
enum module_state state;
struct list_head list;
char name[MODULE_NAME_LEN];
…
/* What modules depend on me? */
struct list_head source_list;
/* What modules do I depend on? */
struct list_head target_list;
atomic_t refcount;
…
rmmod
• Unlinks module from kernel
• Needs to ensure no one is using module first!
• Reference count incremented whenever module is used
• source_list identifies other modules that depend on this one
• Invokes module-provided exit / cleanup function
• Frees memory
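The unload checks above can be sketched as a small user-space model. Everything here is hypothetical (struct toy_module, toy_get, toy_put, toy_rmmod), and the error codes are chosen for illustration rather than copied from the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct module. */
struct toy_module {
    int refcount;          /* users currently holding the module */
    int n_dependents;      /* length of source_list: modules that use this one */
    bool loaded;
    void (*exit_fn)(void); /* module-provided cleanup function, may be NULL */
};

/* Take and release a reference while the module is in use. */
void toy_get(struct toy_module *m) { m->refcount++; }
void toy_put(struct toy_module *m) { assert(m->refcount > 0); m->refcount--; }

/* Refuse to unload while anyone still uses or depends on the module. */
int toy_rmmod(struct toy_module *m)
{
    if (!m->loaded)
        return -ENOENT;
    if (m->refcount > 0 || m->n_dependents > 0)
        return -EBUSY;
    if (m->exit_fn != NULL)
        m->exit_fn();      /* run cleanup before the memory would be freed */
    m->loaded = false;
    return 0;
}
```

The ordering matters: the usage checks come first, then the cleanup function, and only then would the module's memory be released.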
Problems with module approach
• Requires stable interfaces
• Linux uses version numbers to check if module is compiled for correct version of
kernel, but it is easy to get this wrong
• Unsafe
• Module code can do anything because it runs privileged
• E.g. VMware Workstation driver
• “hijacks” machine by changing interrupt descriptor table (IDT) base register and then
jumps to code in the VM application!
Alternate kernel-level schemes
• Trusted compiler (or certification authority) + digital signatures
• Allows verification of source of code added to kernel
• You still have to decide if you trust that source
• Code can still do anything
• Proof-carrying code
• Code Consumer (OS) supplies a specification for what extensions are
allowed to do
• Code Producer (the extension) must supply a proof that it is safe to
execute according to specification
• OS validates proof
• Proof should be easy to check, but may be hard to generate (e.g. maze
example)
Checking a proof vs generating one
• G. Necula - Safe Kernel Extensions Without Run-Time Checking, OSDI’96
• A maze is “safe” if there’s a path through it.
• Easy to check a path, but hard to generate.
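To make the asymmetry concrete, here is a minimal sketch of the checking side only, assuming a small grid maze where 0 is an open cell and 1 is a wall. The names (struct cell, check_path) and the grid size are made up for illustration. The checker does one pass over the claimed path, whereas producing a path requires searching the maze.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define ROWS 4
#define COLS 4

struct cell { int r, c; };

/*
 * Check a claimed path: it must start at `start`, end at `goal`, stay in
 * bounds, touch only open cells (0), and move exactly one step up, down,
 * left or right at a time. Runs in time linear in the path length.
 */
bool check_path(const int maze[ROWS][COLS], const struct cell *path, int len,
                struct cell start, struct cell goal)
{
    assert(path != NULL);
    if (len < 1)
        return false;
    if (path[0].r != start.r || path[0].c != start.c)
        return false;
    if (path[len - 1].r != goal.r || path[len - 1].c != goal.c)
        return false;

    for (int i = 0; i < len; i++) {
        int r = path[i].r, c = path[i].c;
        if (r < 0 || r >= ROWS || c < 0 || c >= COLS)
            return false;                 /* left the maze */
        if (maze[r][c] != 0)
            return false;                 /* walked into a wall */
        if (i > 0 && abs(r - path[i - 1].r) + abs(c - path[i - 1].c) != 1)
            return false;                 /* not a single adjacent step */
    }
    return true;
}
```

In the proof-carrying code analogy, the path plays the role of the proof supplied by the code producer, and check_path plays the role of the OS-side validator.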
Alternates (2)
• Sandboxing (software fault isolation)
• Limit memory references to per-module segments
• Check for certain unsafe instructions
• Examples:
• SPIN (U. of Washington)
• Modula-3 + trusted compiler
• Safety properties provided by language
• Problems with dynamic behavior (e.g. “while(1)”)
• VINO (Harvard)
• Sandboxed C/C++ code called “grafts”
• Timeouts to guard against misbehaved grafts
• Resource limits + transactional “undo”
• Byte-Granularity Isolation (Microsoft) - BGI
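One common way to "limit memory references to per-module segments" is address masking, as in classic software fault isolation: before each store, the address's upper bits are overwritten with the module's segment identifier. The sketch below assumes a hypothetical 64 KiB, 64 KiB-aligned segment and 32-bit addresses; it does not describe how SPIN, VINO or BGI are implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical segment geometry: 64 KiB segments, aligned to their size. */
#define SEG_BITS 16u
#define SEG_MASK ((UINT32_C(1) << SEG_BITS) - 1)

/*
 * Force `addr` into the segment starting at `seg_base` by keeping only its
 * offset bits and replacing the upper bits. An in-segment address comes out
 * unchanged; any other address is redirected inside the segment.
 */
uint32_t sandbox_addr(uint32_t seg_base, uint32_t addr)
{
    assert((seg_base & SEG_MASK) == 0); /* base must be segment-aligned */
    return seg_base | (addr & SEG_MASK);
}
```

Masking does not detect a bad reference; it only guarantees that the damage stays within the module's own segment, at the cost of a couple of extra instructions per access.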
eBPF
• “extended Berkeley Packet Filter”
• Language-level VM within Linux kernel
• Register-based VM
• Custom 64-bit RISC instruction set
• Bytecode verifier
• Restrictions are placed on eBPF programs for safety
• Limited number of instructions
• Controlled memory referencing
• Originally, no loops allowed
• Bounded loops were introduced in Linux 5.3
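The original "no loops" rule is simple to enforce, which a toy verifier can show. The instruction set below (TOY_ALU, TOY_JMP, TOY_EXIT), the limit MAX_INSNS and the function toy_verify are all hypothetical and far simpler than the real eBPF verifier; the sketch only checks program length, forward-only in-range jumps, and a final exit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_INSNS 4096 /* hypothetical instruction limit for this sketch */

enum toy_op { TOY_ALU, TOY_JMP, TOY_EXIT };

struct toy_insn {
    enum toy_op op;
    int off; /* for TOY_JMP: target = pc + 1 + off */
};

/*
 * Accept a program only if it is short enough, every jump goes forward to
 * an instruction inside the program, and the last instruction is an exit.
 * With no backward jumps, each instruction runs at most once, so every
 * accepted program terminates.
 */
bool toy_verify(const struct toy_insn *prog, size_t n)
{
    assert(prog != NULL);
    if (n == 0 || n > MAX_INSNS)
        return false;
    for (size_t pc = 0; pc < n; pc++) {
        if (prog[pc].op != TOY_JMP)
            continue;
        if (prog[pc].off < 0)
            return false; /* backward branch: could form a loop */
        if (pc + 1 + (size_t)prog[pc].off >= n)
            return false; /* jump leaves the program */
    }
    return prog[n - 1].op == TOY_EXIT; /* cannot fall off the end */
}
```

Supporting bounded loops is harder than this: the verifier must reason about loop bounds instead of rejecting every backward branch.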
History: BPF
• “The BSD Packet Filter: A New Architecture for User-level Packet
Capture,” by Steven McCanne and Van Jacobson, in Proceedings of the
1993 Winter USENIX Conference.
• Register-based language-level virtual machine to run user programs for packet
capture & filtering inside the BSD Unix kernel.
• 2 registers
• 22 instructions
• No backward branches (no loops)
• Safety / restrictions not mentioned in paper
History: eBPF
• BPF instruction set was too limited
• Linux introduced new “internal” BPF circa 2013
• User programs written in “classic” BPF were translated to internal BPF
• New virtual machine had ten (10) 64-bit registers (enough to pass function
arguments in regs), new BPF_CALL instruction to call kernel functions, ~100
instructions, and other features
• “internal” BPF was made available to users as “extended” BPF soon after
(patch mid-2014)
• Verifier checks user programs at load time
• Termination (no loops), no uninitialized reads, no out-of-bounds memory
access, etc.
• Added support for data “maps” (key-value structures) shared between kernel and
user-space.
Classic usage: optimize packet filtering
$ tcpdump host 127.0.0.1 and port 22 -d
• -d means print compiled bytecode and stop
(Brendan Gregg example, O’Reilly Velocity talk, 2017)
Running eBPF Programs
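[Figure: running eBPF programs]
• User level: (1) compile the BPF program to BPF bytecode; (2) load the bytecode into the kernel; (3) attach it according to the event config; (4) read output as per-event data or statistics
• Kernel: the eBPF bytecode verifier checks the loaded bytecode, which then runs on the eBPF virtual machine; maps hold data shared with user level
• Event sources: tracepoints (static tracing), kprobes and uprobes (dynamic tracing), perf_events (sampling, PMCs)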
Running eBPF Programs
• Must be “attached” to code points in kernel
• Event triggers execution of eBPF code
• Used for:
• Classic network filtering and monitoring
• Restricting system calls (seccomp)
• Debugging and performance analysis
How does it work?
• Userspace has one overloaded system call, bpf()
int bpf(int cmd, union bpf_attr *attr, unsigned int size);
• Meaning of attr depends on the command.
• For loading an eBPF program cmd = BPF_PROG_LOAD
• For load, attr includes a program type, and the list of instructions in the program
• Type determines what eBPF program is allowed to access in kernel
• In-kernel verifier checks safety of eBPF program
• Terminates, bounded loops, no unreachable code
• No out-of-bounds accesses, no uninitialized reads
• Access to kernel functions restricted by program type
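A minimal sketch of the load step, assuming a Linux system with the kernel UAPI header linux/bpf.h installed. The program is the smallest one that should pass the verifier ("r0 = 0; exit"); the helper names fill_load_attr and load_prog and the array ret0_prog are made up for this example. glibc provides no bpf() wrapper, so the call goes through syscall(). Loading may fail with EPERM without sufficient privileges.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Two-instruction eBPF program: set return register r0 to 0, then exit. */
static const struct bpf_insn ret0_prog[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
    { .code = BPF_JMP | BPF_EXIT },
};

/* Fill the BPF_PROG_LOAD view of the attr union; unused fields must be 0. */
void fill_load_attr(union bpf_attr *attr)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->prog_type = BPF_PROG_TYPE_SOCKET_FILTER; /* limits kernel access */
    attr->insns = (uint64_t)(uintptr_t)ret0_prog;  /* pointers travel as u64 */
    attr->insn_cnt = sizeof(ret0_prog) / sizeof(ret0_prog[0]);
    attr->license = (uint64_t)(uintptr_t)"GPL";
}

/* Returns a program file descriptor, or -1 with errno set. */
int load_prog(void)
{
    union bpf_attr attr;
    fill_load_attr(&attr);
    return (int)syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

The verifier runs inside this one system call: if the program is rejected, load_prog() returns -1 and no program is installed.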
How it works (2)
• How is eBPF program “attached” to the kernel, so that it gets invoked at
the desired time?
• Depends on the kind of event
• For sockets, setsockopt()
• For perf events, ioctl()
• Other bpf() commands attach programs, and create and access maps
• Need to specify map type, max # of elements, key size and value size (in bytes)