kmi
kmi is a kernel I’ve been working on for a little while now. Its main feature is thread migration, through which inter-process communication is implemented.
You can find its sources at https://github.com/Kimplul/kmi.
If you don’t know what any of that means, don’t worry, I go through all the prerequisite knowledge below. If you do have experience with kernels, jump to Thread migration.
Prerequisite theory
Note: I recommend reading through the Minix book1, it still holds up and goes through all of the following information way better than I ever could. Consider this a lazy way to mark my sources.
Threads and processes
Let’s start with some basics. I’m assuming we’re dealing with a ‘typical’ desktop computer, so keep in mind that a lot of the following might not hold for other kinds of computers, such as embedded systems, even though they might use a lot of the same terminology.
A processor executes machine instructions, one at a time. We can logically imagine that these instructions form a sequence, a thread of execution. These threads typically run within a protected context, referred to as a process.
(Figure: Cburnett, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2233446)
Processes contain information such as which regions of memory threads within it are allowed to access, and the underlying hardware can check that threads operate within those limits. Most importantly, a thread in one process shouldn’t be able to arbitrarily modify memory in another process. If a process wants to access some new region of memory, it has to ask a kernel for permission. Kernels are just some bit of code that runs in a separate kernelspace, whereas processes run in userspace. These are two separate ‘modes’ that processors implement, and generally speaking code running in kernelspace has fewer restrictions than code running in userspace. In particular, kernels are responsible for creating, destroying and maintaining processes and threads.
Monolithic kernels and microkernels
Processes alone aren’t particularly useful, and to get actual work done, they generally need some services that an operating system provides. Files are a great example. All major operating systems currently in use have implemented the majority of their services within kernelspace, which processes can use by doing a system call. A system call is a special kind of function call that jumps from userspace into kernelspace, to some predetermined address in the kernel. These kinds of kernels are typically called monolithic, and have in practice turned out to work more than well enough. They’re generally fast, since all services are just a system call away, and services can cooperate within the kernel via simple, regular function calls. However, since the kernel has fewer restrictions than processes, a bug can be more destructive. The system might crash outright, or it might accidentally change something in an unrelated process’s state.
Bugs in processes, on the other hand, at worst cause the process to be terminated.2 The idea with microkernels is to move as many services as possible into separate processes in userspace, and provide some kind of fast inter-process communication (IPC) mechanism.
(Figure: Golftheman, http://en.wikipedia.org/wiki/Image:OS-structure.svg, Public Domain, https://commons.wikimedia.org/w/index.php?curid=4397379)
Message passing
The current status quo for IPC is message passing. There are a bazillion different variations, with properties tuned to some specific use case, so the example given here is a bit of a case of lies-to-children, unfortunately. In any case, message passing boils down to ‘I have some bit of data I want to give to some other process’. This is generally done via a system call, often called something along the lines of send(), which causes the kernel to copy the data into an internal buffer, a ‘mailbox’ if you will. The recipient then has to come and empty the mailbox, usually via a system call named something like receive(), which causes the kernel to copy the contents of the mailbox into the receiving process.
In synchronous message passing, the sender waits patiently in the kernel for a response, whereas asynchronous message passing allows the sender to do other work while waiting for a response. Both have their uses, though synchronous specifically has some properties that can make it a bit faster.3
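To make the mailbox idea a bit more concrete, here’s a minimal sketch of what the kernel side of send()/receive() could look like. This is a toy illustration of the general concept, not kmi’s API or any real kernel’s implementation; all names and sizes here are made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAILBOX_SIZE 256
#define MAX_PROCS 64

struct mailbox {
	uint8_t buf[MAILBOX_SIZE];
	size_t len;
	int full;
};

/* One mailbox per process, indexed by pid (toy kernel state). */
static struct mailbox mailboxes[MAX_PROCS];

/* send(): the kernel copies the caller's data into the recipient's mailbox. */
long sys_send(int to_pid, const void *data, size_t len)
{
	struct mailbox *mb = &mailboxes[to_pid];

	if (mb->full || len > MAILBOX_SIZE)
		return -1; /* recipient hasn't emptied its mailbox yet */

	memcpy(mb->buf, data, len);
	mb->len = len;
	mb->full = 1;
	/* A synchronous implementation would block the sender here until
	 * the recipient has replied. */
	return 0;
}

/* receive(): the kernel copies the mailbox contents into the recipient. */
long sys_receive(int my_pid, void *data, size_t maxlen)
{
	struct mailbox *mb = &mailboxes[my_pid];
	size_t len;

	if (!mb->full)
		return -1; /* nothing to receive (or block until there is) */

	len = mb->len < maxlen ? mb->len : maxlen;
	memcpy(data, mb->buf, len);
	mb->full = 0;
	return (long)len;
}
```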
Thread migration
Thread migration is redirecting the sending thread to the receiving process to handle the message it itself sent. This is on a surface level a minor change, but has some interesting implications.
- Parallelism. If two threads want to access some service at the same time, they can both just do their respective migrations at the same time. If they had to send messages instead, there would need to be a third thread that picks up and processes the messages, one at a time.4 It’s also worth noting that having more threads takes some amount of resources and puts more stress on other parts of the operating system, so we’re kind of killing two birds with one stone here.
- Generalizing function calls. A thread migration can be thought of as a function call, but to another process. Processors are good at function calls, and in fact there have been research projects that would allow doing thread migrations completely in userspace, which would mean that accessing services would be closer in speed to a function call than a system call, potentially tens to hundreds of times faster.5
- Userspace scheduling. Typical microkernels still need a scheduler in kernelspace, at least partially since a message forms an implicit dependency between two processes. To select the best candidate to be run next, schedulers need this kind of information, but it can be difficult to get out of the kernel, at least in any kind of performant manner. Thread migrations don’t have this issue, since the thread is responsible for all of its own processing (in the general case, at least), including reporting to the scheduler that it’s waiting on a mutex or semaphore or whatever. This means we can move the scheduler to userspace, potentially decreasing microkernel footprint even further and allowing all kinds of cool things like recursive scheduling.
Of course, there are drawbacks as well.
- Capturing threads. What if a thread does a migration into some process that never lets the thread return to the original process? This could potentially be used as a denial of service, or happen just by accident.6
- Asynchronous code. With (asynchronous) message passing, it’s easier to just send out messages that don’t need a response, like ‘please start working on this while I do something else’ and only later check if it actually succeeded. With thread migration, the thread must always return to its caller, meaning we’re wasting a bit of time going there and back.
Probably some others as well, but these are the major pros/cons that I’ve thought of.
What’s so special about kmi?
Even among other thread migration kernels (of which there don’t seem to be many, mainly COMPOSITE being actively developed), kmi has a few unique features and does things a little differently. Whether they lead to a better or worse system overall is still a bit up in the air, but at least I find them to be interesting.
Receiving threads
At its simplest, when a thread wants to do a migration, it uses the ipc_req() system call, which takes five arguments: pid, d0, d1, d2 and d3. pid is just which process we want to migrate to, and d0 through d3 are arguments passed in registers to the process, effectively our ‘message’. If there’s a need to send more data at a time, there’s shared memory available as well.
Assuming pid exists and can be migrated to, the thread is routed to its _start(). In kmi, _start() takes six parameters: pid, tid, d0, d1, d2 and d3. pid is the effective process the migration came from, and tid is the thread that did the migration. d0 through d3 are the arguments passed to ipc_req(). When the thread wants to return, it uses ipc_resp(), which just takes four arguments, d0 through d3, which are passed to the previous process. The return value of the system call in the original process takes the form of a structure, with elements s, pid and d0 through d3. s is just the status, which the kernel can set to a non-zero value if something went wrong with the actual migration, and pid is the process from where we just returned. d0 through d3 are just the parameters passed to ipc_resp().
If you’re familiar with _start() as the procedure that is first called when a process is created, that also applies here. The first thing a thread does when created is a ‘migration’ from kernelspace to userspace, with pid just set to 0 to indicate that it came from the kernel. 0 is a reserved pid for the kernel, and 1 is reserved for the init process.
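To make the flow concrete, here’s a rough sketch of what the two sides could look like in C, following the description above. The struct and prototypes are my shorthand for the interface described in this section; the exact type names, headers and helpers (FS_PID, init_service(), handle_request()) are assumptions for illustration, not kmi’s actual declarations.

```c
/* Return value of an ipc_*() call, as described above: a status, the pid
 * we returned from, and the four response registers. */
struct ipc_ret {
	long s;   /* non-zero if the migration itself failed */
	long pid; /* process we actually returned from */
	long d0, d1, d2, d3;
};

/* Assumed prototypes for the system call wrappers. */
struct ipc_ret ipc_req(long pid, long d0, long d1, long d2, long d3);
void ipc_resp(long d0, long d1, long d2, long d3);

/* Hypothetical pid and helpers, for illustration only. */
#define FS_PID 2
void init_service(void);
long handle_request(long pid, long d0, long d1, long d2, long d3);

/* Client side: migrate into the file service with a 'message' in d0..d3
 * and read the response back out of the returned struct. */
long client_example(void)
{
	struct ipc_ret r = ipc_req(FS_PID, /* opcode */ 1, 0, 0, 0);
	if (r.s != 0)
		return -1; /* the migration itself failed */
	return r.d0;       /* whatever the service put into d0 */
}

/* Server side: every migration (and process startup) enters through
 * _start(). pid is the (effective) caller, tid the migrating thread. */
void _start(long pid, long tid, long d0, long d1, long d2, long d3)
{
	(void)tid;

	if (pid == 0) {
		/* pid 0 is reserved for the kernel, so this is process
		 * startup: do one-time initialization. What the thread does
		 * after that is up to the process. */
		init_service();
		return;
	}

	/* Some thread migrated into us: handle d0..d3 as the message,
	 * then hand the thread back to its caller. */
	ipc_resp(handle_request(pid, d0, d1, d2, d3), 0, 0, 0);
}
```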
Avoiding doing everything yourself
Note how in the previous section I used the term ‘effective process’. Imagine a situation where we’re writing a service of some kind, and we need the help of another service to finish our job. How do we indicate to the other service which process we would like help with? We could just pass the pid of the process we got as one of our four parameters, d0 - d3, but now we’re wasting one of our precious parameter slots, and there’s a risk of passing the wrong pid by accident. Instead, kmi provides ipc_fwd(), which is effectively the same as ipc_req(), except the pid that the next process gets is the same as the one we got. This is the idea behind the ‘effective pid’, or eid, which is something the kernel can automatically keep track of for us.
For comparison, here’s a regular ipc_req() doing a service-to-service migration:

And here’s an ipc_fwd() doing the same thing:

The graphs are very similar, with effectively just the eid for process 3 being different. To 3, it looks like 1 did a migration directly into it, but 2 can still come in and check that 3 responded in some sensible manner before it returns to 1 and ‘pretends’ it did everything itself. 1 is completely oblivious to 3, and 3 is completely oblivious to 2.
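As a sketch of what the middle service (process 2) could look like in code, reusing the conventions from the earlier sketch; HELPER_PID and the prototypes are made up for illustration:

```c
/* Same return-value struct as sketched earlier. */
struct ipc_ret {
	long s, pid, d0, d1, d2, d3;
};

struct ipc_ret ipc_fwd(long pid, long d0, long d1, long d2, long d3);
void ipc_resp(long d0, long d1, long d2, long d3);

#define HELPER_PID 3 /* hypothetical pid of the service we need help from */

/* Process 2: a service that needs process 3's help to finish its job. */
void _start(long eid, long tid, long d0, long d1, long d2, long d3)
{
	(void)tid;
	(void)eid;

	/* Migrate into process 3. Because this is ipc_fwd() and not
	 * ipc_req(), process 3 sees our caller as the effective caller,
	 * and all four data registers stay free for the message. */
	struct ipc_ret r = ipc_fwd(HELPER_PID, d0, d1, d2, d3);

	/* We get control back here, so we can sanity-check what process 3
	 * did before 'pretending' we handled everything ourselves. */
	if (r.s != 0) {
		ipc_resp(-1, 0, 0, 0);
		return;
	}

	ipc_resp(r.d0, r.d1, r.d2, r.d3);
}
```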
Avoiding doing unnecessary returns
If you’re familiar with tail calls, the following will likely be fairly familiar. If we’re a service and we know that we don’t need to do anything after an ipc_req(), we can use ipc_tail(). The arguments are the same, the only difference is that when ipc_resp() is called in the next process, the current process is skipped and execution is returned directly to the previous process.

This is of course recursive, so no matter how many processes did an ipc_tail(), they will all be skipped.7 Similarly, if we would’ve done an ipc_fwd(), we can opt for ipc_kick() (kick the can down the road, if you will).
Remember how the return value of an ipc_*() is a struct with a field pid? This becomes relevant right now, as 1 sees that 3 is the one who responded, even though 1 did a migration to 2. Imagine we have a UNIX-like operating system running on top of kmi, we’re process 1, and we want to open a device file, say a UART or something. We can do a migration to 2, which in this case could be a virtual file system or other kind of router. 2 finds out who is responsible for that specific device, say process 3, and does an ipc_kick() to it. Now 3 sees that 1 wants to open it, does some internal bookkeeping, and returns with ipc_resp(). The thread now returns directly to 1, which can see that it just returned from 3, and can direct all further read/write/etc. migrations to 3.
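Here’s a rough sketch of what the router (process 2) could look like under this scheme. The lookup_driver() helper and the treatment of ipc_kick() as non-returning are my assumptions for illustration; since the eventual response skips us, our activation never regains control here.

```c
/* Assumed prototypes; ipc_kick() is treated as non-returning because the
 * response from the next process skips the current one. */
_Noreturn void ipc_kick(long pid, long d0, long d1, long d2, long d3);
long lookup_driver(long device); /* hypothetical: device id -> driver pid */

/* Process 2: a VFS-like router. A thread from process 1 migrates in,
 * asking to open a device file identified (say) by d1. */
void _start(long eid, long tid, long d0, long d1, long d2, long d3)
{
	(void)eid;
	(void)tid;

	long driver = lookup_driver(d1);

	/* Hand the request straight over to the driver. It sees process 1
	 * as the effective caller, and when it calls ipc_resp(), the
	 * thread returns directly to process 1, skipping us entirely. */
	ipc_kick(driver, d0, d1, d2, d3);
}
```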
Activations
Activations are very similar to stack frames, with the exception that previous activations are off-limits to newer ones. By default, the kernel sets the stack pointer register to point to the current activation, and the activation stack can be used as a kind of funky normal stack. However, the activation stack is thread-specific, and threads can’t share pointers to elements on the activation stack among each other. This is generally considered bad practice anyway, but if it’s a feature that is required, the libc can always allocate stacks on the heap to make them accessible to all. The activation stack is not required to be used as the sole stack in a process.
The kernel keeps track of how much space on the activation stack each process uses, and marks those regions inaccessible when a migration occurs. Each process is guaranteed some specific amount of activation stack for each migration, and if there isn’t enough left the migration is cancelled.
Mixed in between process activations are kernel activations: memory reserved for saving registers and for keeping track of which process and effective process we’re in, and where the previous kernel activation is. These regions are never accessible to processes.8
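As a very rough sketch of what a kernel activation could contain, going purely off the description above; field names and the register container are guesses, not kmi’s actual layout:

```c
/* Placeholder register file; the real layout is architecture-specific. */
struct registers {
	long x[32];
};

/* Hypothetical kernel activation record, never visible to userspace. */
struct kernel_activation {
	struct registers regs;          /* saved user registers */
	long pid;                       /* process we migrated into */
	long eid;                       /* effective process */
	struct kernel_activation *prev; /* previous kernel activation */
};
```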
Notifications
To provide at least some way to ‘ping’ a process that some operation has finished, I provide notifications. Notifications reuse a lot of the thread migration code by essentially forcing a thread migration to take place when a notification arrives. There are two categories of notifications, critical and non-critical, where non-critical ones are only allowed to force a migration if the receiving thread is in its owning ‘root’ process, otherwise the notification is queued. Non-critical notifications are things like processes trying to signal each other. Critical notifications are hardware-generated interrupts that need to be handled immediately, and the handler is expected to be very short. Interrupts are disabled during critical notification handling.
‘Benchmark’
As a very simple IPC benchmark, I’ve counted how many ipc_req()/ipc_resp() pairs can be executed in one second and compared it to the reported performance of sending and receiving a message with seL4.9 Below are the rough results:
These results are taken from running on real hardware, so pipelining and cache flushing and so on is taken into account, but unfortunately I don’t have quite the same hardware as the seL4 benchmark, so this is at best an approximation. Still, even so, it seems like we’re broadly speaking in the same ballpark as seL4, which I’m fairly pleased with. seL4 is used in some real-world projects,10 and I believe it might still be one of the fastest message passing implementations in existence, which is why I chose to compare against it. This benchmark would imply that thread migration can be implemented in a manner fast enough to be useful in at least some real-world applications.
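For the curious, the benchmark is conceptually just a loop like the following sketch; the timing helper, clock rate and echo-server pid are placeholders rather than the actual benchmark code.

```c
#include <stdint.h>

/* Same return-value struct as sketched earlier. */
struct ipc_ret {
	long s, pid, d0, d1, d2, d3;
};
struct ipc_ret ipc_req(long pid, long d0, long d1, long d2, long d3);

uint64_t read_cycle_counter(void); /* hypothetical, e.g. rdcycle on RISC-V */

#define ECHO_PID 2                   /* hypothetical server that just ipc_resp()s */
#define CYCLES_PER_SEC 1000000000ULL /* placeholder clock rate */

uint64_t bench_round_trips(void)
{
	uint64_t count = 0;
	uint64_t start = read_cycle_counter();

	while (read_cycle_counter() - start < CYCLES_PER_SEC) {
		/* One full ipc_req()/ipc_resp() pair per iteration. */
		ipc_req(ECHO_PID, 0, 0, 0, 0);
		count++;
	}

	return count; /* round trips per second */
}
```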
Status
At the time of writing, 2024/11/10, the kernel can load and execute programs loaded into memory, create and destroy threads and processes, allocate both private and shared memory, as well as do the operations mentioned in this post (ipc_*(), etc.). Hardware interrupt handling is still a bit unfinished, and I’m currently using a Big Kernel Lock, so the ‘true’ potential parallelism of thread migration isn’t really fully utilized. Of course, a kernel alone isn’t too useful, and I’m slowly working on building a set of services that would enable a simple shell. This would include SD card/MMIO, ext2, and UART drivers, as well as a (very limited) UNIX-like virtual filesystem and userspace scheduling.
This work is still very much unfinished and not published anywhere, but that
would be my mid-term goal for this project. It will hopefully give insight into
how practical my ideas for the kernel are in a ‘real’ system.
Conclusion
kmi is a thread migration implementation with some unique properties, and it seems to be in the same performance ballpark as seL4, which might point towards kmi being useful in the real world with enough further development.
- Tanenbaum, Woodhull. “Operating Systems: Design and Implementation (3rd Edition)” (2005) ↩︎
- Bugs can of course be more insidious than this, and in particular when you start connecting processes together you might see a bug spread from process to process, potentially even in system-destroying ways. ↩︎
- Liedtke et al. “The performance of μ-Kernel-Based Systems” (1997), https://dl.acm.org/doi/10.1145/268998.266660 ↩︎
- Of course, the message passing implementation can implement pipelining, or have multiple threads processing incoming messages, but those solutions increase complexity in the implementation, whereas with thread migration we don’t have to do any of that. ↩︎
- Vilanova et al. “Direct inter-process communication (dIPC)” (2017), https://dl.acm.org/doi/10.1145/3064176.3064197 ↩︎
- If we detect that a process is capturing threads, we can of course always kill it, and apply some security rules, like never migrating to an unknown process. These are just heuristics though, and I’m not aware of any way to ensure capturing is completely impossible. ↩︎
- More specifically, their activation record will be overwritten by the ipc_tail(). More on activations in their own section. ↩︎
- I don’t know how well this design works with things like Spectre, but in my little imaginary corner of computer science I assume the hardware works as it is supposed to and userspace can’t arbitrarily read memory that’s supposed to be kernel-only. ↩︎