This repository contains instructions and source code for reproducing the micro-benchmarks in the HotOS'21 paper *BPF for Storage: An Exokernel-Inspired Approach*. [paper] [talk]
Operating System: Ubuntu 20.04 with modified Linux kernel 5.8.0
Disk: Intel Optane SSD P5800X
- kernel/syscall_hook.diff: Linux kernel patch with the dispatch hook in the syscall layer
- kernel/nvme_driver_hook.diff: Linux kernel patch with the dispatch hook in the NVMe driver interrupt handler
- bpf/load_bpf.sh: Script to load the BPF program into the kernel
- bpf/bpf_loader.c: BPF program loader
- bpf/bpf_program.c: BPF program running memcpy
- bpf/Makefile: Makefile for the BPF program
- bench/read_baseline.cpp: Benchmark program for baseline read()
- bench/read_bpf.cpp: Benchmark program for read() with BPF
- bench/uring_baseline.cpp: Benchmark program for baseline io_uring
- bench/uring_bpf.cpp: Benchmark program for io_uring with BPF
- bench/CMakeLists.txt: CMakeLists for the benchmark programs
There are two different kernel patches (syscall_hook.diff and nvme_driver_hook.diff) that contain dispatch hooks in the syscall layer and the NVMe driver, respectively. To run experiments with different dispatch hooks, we need to compile and install different kernels.
First, make sure that we have all the dependencies required to build a Linux kernel. You can run the following script to install those dependencies:
```bash
# enable deb-src
sudo cp /etc/apt/sources.list /etc/apt/sources.list~
sudo sed -Ei 's/^# deb-src /deb-src /' /etc/apt/sources.list
sudo apt-get update
# install build dependencies
sudo apt-get build-dep linux linux-image-$(uname -r) -y
sudo apt-get install libncurses-dev flex bison openssl libssl-dev dkms libelf-dev libudev-dev libpci-dev libiberty-dev autoconf fakeroot -y
```

Then, clone the Linux repository and check out the v5.8 tag:
```bash
git clone https://github.com/torvalds/linux.git
cd linux
git checkout tags/v5.8
```

Apply the kernel patch you need and compile the modified kernel:
```bash
git apply syscall_hook.diff # apply nvme_driver_hook.diff instead if you want to run experiments with the dispatch hook in the NVMe driver
make localmodconfig
make deb-pkg
```

After the kernel is successfully compiled, install all the .deb files generated in the parent folder of linux:
```bash
cd ..
sudo dpkg -i *.deb
```

Finally, reboot the machine and make sure that you boot into the right kernel. You can check the running kernel with uname -r and select another installed kernel for the next boot with grub-reboot before rebooting, as shown below.
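For example, on Ubuntu you can select the freshly installed 5.8.0 kernel for the next boot only with grub-reboot. The menu entry title below is illustrative; list the entries on your own machine first, and note that grub-reboot requires GRUB_DEFAULT=saved in /etc/default/grub:

```bash
# List the GRUB menu entry titles to find the exact name of the patched kernel.
sudo grep -E "(menuentry|submenu) '" /boot/grub/grub.cfg | cut -d"'" -f2

# Select that entry for the next boot only (the title below is illustrative).
sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.8.0"
sudo reboot

# After the reboot, confirm the running kernel.
uname -r
```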
In the micro-benchmarks mentioned in the paper, we use a simple BPF program running memcpy to simulate B-Tree page parsing.
First, install the dependencies for building and loading BPF programs:
```bash
sudo apt update
sudo apt install gcc-multilib clang llvm libelf-dev libdwarf-dev -y
wget http://archive.ubuntu.com/ubuntu/pool/universe/libb/libbpf/libbpf0_0.1.0-1_amd64.deb
wget http://archive.ubuntu.com/ubuntu/pool/universe/libb/libbpf/libbpf-dev_0.1.0-1_amd64.deb
sudo dpkg -i libbpf0_0.1.0-1_amd64.deb
sudo dpkg -i libbpf-dev_0.1.0-1_amd64.deb
```

Then, run the script provided in this repository to compile and load the BPF program before running the benchmarks:
```bash
cd bpf
sudo ./load_bpf.sh
```
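If bpftool is installed, you can optionally check that the program was loaded; the name it shows depends on what bpf/bpf_loader.c registers, so treat this only as a sanity check:

```bash
# The memcpy program loaded by load_bpf.sh should appear in this list.
sudo bpftool prog list
```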
First, compile the benchmark programs:

```bash
# install CMake
sudo apt install cmake -y
# compile benchmark programs
cd bench
mkdir build
cd build
cmake ..
make
```

Before running the benchmarks, you may want to disable hyper-threading and CPU frequency scaling to avoid unstable results. To disable hyper-threading, you can run:
sudo bash -c "echo off > /sys/devices/system/cpu/smt/control" # need to be run again after each rebootTo disable CPU frequency scaling on Intel CPUs, you can:
- Add `intel_pstate=passive intel_pstate=no_hwp` to your kernel parameters and then reboot
  - After the reboot, `cat /sys/devices/system/cpu/intel_pstate/status` should show `passive` instead of `active`
- For each online CPU core, set the `scaling_governor` to `performance`, and set both `scaling_max_freq` and `scaling_min_freq` to the max frequency (a per-core script covering this item and the next one is sketched after this list)
  - `scaling_governor`, `scaling_max_freq`, and `scaling_min_freq` for each CPU core are available in `/sys/devices/system/cpu/cpu$CPUID/cpufreq/`, where `$CPUID` is the core number
  - You can find the max frequency of a CPU core in `cpuinfo_max_freq`
- Disable all C-states except for the C0 state for each online CPU core
  - C-state knobs for each CPU core are available in `/sys/devices/system/cpu/cpu$CPUID/cpuidle/`, where `$CPUID` is the core number
- Run the following script to disable global CPU frequency scaling and turbo boost:

  ```bash
  cd /sys/devices/system/cpu/intel_pstate
  sudo bash -c "echo 1 > no_turbo"
  sudo bash -c "echo 100 > max_perf_pct"
  sudo bash -c "echo 100 > min_perf_pct"
  ```
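Below is a minimal sketch of the per-core settings described in the list above, assuming the standard cpufreq and cpuidle sysfs layout; adapt it to the cores and idle states present on your machine:

```bash
#!/bin/bash
# Sketch: pin every online core to its maximum frequency and disable all
# C-states except C0 (state0). Assumes the usual cpufreq/cpuidle sysfs layout.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    # Skip offline cores (cpu0 has no "online" file and is always online).
    if [ -f "$cpu/online" ] && [ "$(cat "$cpu/online")" -eq 0 ]; then
        continue
    fi
    max_freq=$(cat "$cpu/cpufreq/cpuinfo_max_freq")
    echo performance | sudo tee "$cpu/cpufreq/scaling_governor" > /dev/null
    echo "$max_freq"  | sudo tee "$cpu/cpufreq/scaling_max_freq" > /dev/null
    echo "$max_freq"  | sudo tee "$cpu/cpufreq/scaling_min_freq" > /dev/null
    # Disable every C-state except state0.
    for state in "$cpu"/cpuidle/state[1-9]*; do
        echo 1 | sudo tee "$state/disable" > /dev/null
    done
done
```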
To run the B-Tree lookup simulation with the read() syscall, run:

```bash
# B-Tree lookup simulation with normal read() syscall
sudo ./read_baseline <number of threads> <b-tree depth> <number of iterations> <devices, e.g. /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1>
# B-Tree lookup simulation with read() syscall and in-kernel dispatching
sudo ./read_bpf <number of threads> <b-tree depth> <number of iterations> <devices, e.g. /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1>
```

After the benchmark finishes, it prints the latency of each simulated B-Tree lookup in nanoseconds.
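For example, a single-threaded run simulating depth-6 lookups against one device (the values here are illustrative, not the configurations used in the paper):

```bash
sudo ./read_baseline 1 6 100000 /dev/nvme0n1
```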
To monitor the IOPS, you can run sar -d -p 1 3600. Note that for ./read_bpf with the dispatch hook in the NVMe driver, the actual IOPS is the IOPS reported by sar times the B-Tree depth, since sar only captures IOPS in the Linux block layer, while the I/O request resubmission happens in the NVMe driver in this case.
To run the B-Tree lookup simulation with io_uring, run:
```bash
# B-Tree lookup simulation with normal io_uring
sudo ./uring_baseline <batch size> <b-tree depth> <number of iterations> <devices, e.g. /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1>
# B-Tree lookup simulation with io_uring and in-kernel dispatching
sudo ./uring_bpf <batch size> <b-tree depth> <number of iterations> <devices, e.g. /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1>
```

After the benchmark finishes, it prints the latency of each simulated B-Tree lookup in nanoseconds.
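For example, with an illustrative batch size of 8 (again, not necessarily the paper's configuration):

```bash
sudo ./uring_baseline 8 6 100000 /dev/nvme0n1
```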
To monitor the IOPS, you can run sar -d -p 1 3600. Note that for ./uring_bpf with the dispatch hook in the NVMe driver, the actual IOPS is the IOPS reported by sar times the B-Tree depth, since sar only captures IOPS in the Linux block layer, while the I/O request resubmission happens in the NVMe driver in this case.
For any questions or comments, please reach out to [email protected].