kmi
kmi is a kernel I’ve been working on for a little while now. Its main feature is thread migration, through which inter-process communication is implemented.
You can find its sources at https://github.com/Kimplul/kmi.
If you don’t know what any of that means, don’t worry, I go through all the prerequisite knowledge below. If you do have experience with kernels, jump to Thread migration.
Prerequisite theory
Note: I recommend reading through the Minix book1, it still holds up and goes through all of the following information way better than I ever could. Consider this a lazy way to mark my sources.
Threads and processes
Let’s start with some basics. I’m assuming we’re dealing with a ‘typical’ desktop computer, so keep in mind that a lot of the following might not hold for other kinds of computers, such as embedded systems, even though they might use a lot of the same terminology.
A processor executes machine instructions, one at a time. We can logically imagine that these instructions form a sequence, a thread of execution. These threads typically run within a protected context, referred to as a process.
(Figure: Cburnett, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2233446)
Processes contain information such as which regions of memory threads within it are allowed to access, and the underlying hardware can check that threads operate within those limits. Most importantly, a thread in one process shouldn’t be able to arbitrarily modify memory in another process. If a process wants to access some new region of memory, it has to ask a kernel for permission. Kernels are just some bit of code that runs in a separate kernelspace, whereas processes run in userspace. These are two separate ‘modes’ that processors implement, and generally speaking code running in kernelspace has fewer restrictions than code running in userspace. In particular, kernels are responsible for creating, destroying and maintaining processes and threads.
Monolithic kernels and microkernels
Processes alone aren’t particularly useful, and to get actual work done, they generally need some services that an operating system provides. Files are a great example. All major operating systems currently in use have implemented the majority of their services within kernelspace, which processes can use by doing a system call. A system call is a special kind of function call that jumps from userspace into kernelspace, to some predetermined address in the kernel. These kinds of kernels are typically called monolithic, and have in practice turned out to work more than well enough. They’re generally fast, since all services are just a system call away, and services can cooperate within the kernel via simple, regular function calls. However, since the kernel has fewer restrictions than processes, a bug can be more destructive. The system might crash outright, or it might accidentally change something in an unrelated process’s state.
Bugs in processes, on the other hand, at worst cause the process to be terminated.2 The idea with microkernels is to move as many services as possible into separate processes in userspace, and provide some kind of fast inter-process communication (IPC) mechanism.
(Figure: Golftheman, http://en.wikipedia.org/wiki/Image:OS-structure.svg, Public Domain, https://commons.wikimedia.org/w/index.php?curid=4397379)
Message passing
The current status quo for IPC is message passing. There are a bazillion different variations, with properties tuned to some specific use case, so the example given here is a bit of a case of lies-to-children, unfortunately. In any case, message passing boils down to ‘I have some bit of data I want to give to some other process’. This is generally done via a system call, often called something along the lines of send(), which causes the kernel to copy the data into an internal buffer, a ‘mailbox’ if you will. The recipient then has to come and empty the mailbox, usually via a system call named something like receive(), which causes the kernel to copy the contents of the mailbox into the receiving process.
In synchronous message passing, the sender waits patiently in the kernel for a response, whereas asynchronous message passing allows the sender to do other work while waiting for a response. Both have their uses, though synchronous specifically has some properties that can make it a bit faster.3
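To make the mailbox idea a bit more concrete, here’s a minimal sketch of what the kernel side of send()/receive() could look like. This is a toy illustration of the general concept, not kmi’s API or any real kernel’s implementation; all names and sizes here are made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAILBOX_SIZE 256
#define MAX_PROCS 64

struct mailbox {
	uint8_t buf[MAILBOX_SIZE];
	size_t len;
	int full;
};

/* One mailbox per process, indexed by pid (toy kernel state). */
static struct mailbox mailboxes[MAX_PROCS];

/* send(): the kernel copies the caller's data into the recipient's mailbox. */
long sys_send(int to_pid, const void *data, size_t len)
{
	struct mailbox *mb = &mailboxes[to_pid];

	if (mb->full || len > MAILBOX_SIZE)
		return -1; /* recipient hasn't emptied its mailbox yet */

	memcpy(mb->buf, data, len);
	mb->len = len;
	mb->full = 1;
	/* A synchronous implementation would block the sender here until
	 * the recipient has replied. */
	return 0;
}

/* receive(): the kernel copies the mailbox contents into the recipient. */
long sys_receive(int my_pid, void *data, size_t maxlen)
{
	struct mailbox *mb = &mailboxes[my_pid];
	size_t len;

	if (!mb->full)
		return -1; /* nothing to receive (or block until there is) */

	len = mb->len < maxlen ? mb->len : maxlen;
	memcpy(data, mb->buf, len);
	mb->full = 0;
	return (long)len;
}
```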
Thread migration
Thread migration is redirecting the sending thread to the receiving process to handle the message it itself sent. This is on a surface level a minor change, but has some interesting implications.
- Parallelism. If two threads want to access some service at the same time, they can both just do their respective migrations at the same time. If they had to send messages instead, there would need to be a third thread that picks up and processes the messages, one at a time.4 It’s also worth noting that having more threads takes some amount of resources and puts more stress on other parts of the operating system, so we’re kind of killing two birds with one stone here.
- Generalizing function calls. A thread migration can be thought of as a function call, but to another process. Processors are good at function calls, and in fact there have been research projects that would allow doing thread migrations completely in userspace, which would mean that accessing services would be closer in speed to a function call than a system call, potentially tens to hundreds of times faster.5
- Userspace scheduling. Typical microkernels still need a scheduler in kernelspace, at least partially since a message forms an implicit dependency between two processes. To select the best candidate to be run next, schedulers need this kind of information, but it can be difficult to get out of the kernel, at least in any kind of performant manner. Thread migrations don’t have this issue, since the thread is responsible for all of its own processing (in the general case, at least), including reporting to the scheduler that it’s waiting on a mutex or semaphore or whatever. This means we can move the scheduler to userspace, potentially decreasing microkernel footprint even further and allowing all kinds of cool things like recursive scheduling.
Of course, there are drawbacks as well.
- Capturing threads. What if a thread does a migration into some process that never lets the thread return to the original process? This could potentially be used as a denial of service, or happen just by accident.6
- Asynchronous code. With (asynchronous) message passing, it’s easier to just send out messages that don’t need a response, like ‘please start working on this while I do something else’ and only later check if it actually succeeded. With thread migration, the thread must always return to its caller, meaning we’re wasting a bit of time going there and back.
Probably some others as well, but these are the major pros/cons that I’ve thought of.
What’s so special about kmi?
Even among other thread migration kernels (of which there don’t seem to be many, mainly COMPOSITE being actively developed), kmi has a few unique features and does things a little differently. Whether they lead to a better or worse system overall is still a bit up in the air, but at least I find them to be interesting.
Receiving threads
At its simplest, when a thread wants to do a migration, it uses the ipc_req() system call, which takes five arguments: pid, d0, d1, d2 and d3. pid is just which process we want to migrate to, and d0 through d3 are arguments passed in registers to the process, effectively our ‘message’. If there’s a need to send more data at a time, there’s shared memory available as well.
Assuming pid exists and can be migrated to, the thread is routed to its _start(). In kmi, _start() takes six parameters: pid, tid, d0, d1, d2 and d3. pid is the effective process the migration came from, and tid is the thread that did the migration. d0 through d3 are the arguments passed to ipc_req(). When the thread wants to return, it uses ipc_resp(), which just takes four arguments, d0 through d3, which are passed to the previous process. The return value of the system call in the original process takes the form of a structure, with elements s, pid and d0 through d3. s is just the status, which the kernel can set to a non-zero value if something went wrong with the actual migration, and pid is the process from where we just returned. d0 through d3 are just the parameters passed to ipc_resp().
If you’re familiar with _start() as the procedure that is first called when a process is created, that also applies here. The first thing a thread does when created is a ‘migration’ from kernelspace to userspace, with pid just set to 0 to indicate that it came from the kernel. 0 is a reserved pid for the kernel, and 1 is reserved for the init process.
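To make the flow concrete, here’s a rough sketch of what the two sides could look like in C, following the description above. The struct and prototypes are my shorthand for the interface described in this section; the exact type names, headers and helpers (FS_PID, init_service(), handle_request()) are assumptions for illustration, not kmi’s actual declarations.

```c
/* Return value of an ipc_*() call, as described above: a status, the pid
 * we returned from, and the four response registers. */
struct ipc_ret {
	long s;   /* non-zero if the migration itself failed */
	long pid; /* process we actually returned from */
	long d0, d1, d2, d3;
};

/* Assumed prototypes for the system call wrappers. */
struct ipc_ret ipc_req(long pid, long d0, long d1, long d2, long d3);
void ipc_resp(long d0, long d1, long d2, long d3);

/* Hypothetical pid and helpers, for illustration only. */
#define FS_PID 2
void init_service(void);
long handle_request(long pid, long d0, long d1, long d2, long d3);

/* Client side: migrate into the file service with a 'message' in d0..d3
 * and read the response back out of the returned struct. */
long client_example(void)
{
	struct ipc_ret r = ipc_req(FS_PID, /* opcode */ 1, 0, 0, 0);
	if (r.s != 0)
		return -1; /* the migration itself failed */
	return r.d0;       /* whatever the service put into d0 */
}

/* Server side: every migration (and process startup) enters through
 * _start(). pid is the (effective) caller, tid the migrating thread. */
void _start(long pid, long tid, long d0, long d1, long d2, long d3)
{
	(void)tid;

	if (pid == 0) {
		/* pid 0 is reserved for the kernel, so this is process
		 * startup: do one-time initialization. What the thread does
		 * after that is up to the process. */
		init_service();
		return;
	}

	/* Some thread migrated into us: handle d0..d3 as the message,
	 * then hand the thread back to its caller. */
	ipc_resp(handle_request(pid, d0, d1, d2, d3), 0, 0, 0);
}
```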
Avoiding doing everything yourself
Note how in the previous section I used the term ‘effective process’. Imagine a situation where we’re writing a service of some kind, and we need the help of another service to finish our job. How do we indicate to the other service which process we would like help with? We could just pass the pid of the process we got as one of our four parameters, d0 - d3, but now we’re wasting one of our precious parameter slots, and there’s a risk of passing the wrong pid by accident. Instead, kmi provides ipc_fwd(), which is effectively the same as ipc_req(), except the pid that the next process gets is the same as the one we got. This is the idea behind the ‘effective pid’, or eid, which is something the kernel can automatically keep track of for us.
For comparison, here’s a regular ipc_req() doing a service-to-service migration:

And here’s an ipc_fwd() doing the same thing:

The graphs are very similar, with effectively just the eid for process 3 being different. To 3, it looks like 1 did a migration directly into it, but 2 can still come in and check that 3 responded in some sensible manner before it returns to 1 and ‘pretends’ it did everything itself. 1 is completely oblivious to 3, and 3 is completely oblivious to 2.
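As a sketch of what the middle service (process 2) could look like in code, reusing the conventions from the earlier sketch; HELPER_PID and the prototypes are made up for illustration:

```c
/* Same return-value struct as sketched earlier. */
struct ipc_ret {
	long s, pid, d0, d1, d2, d3;
};

struct ipc_ret ipc_fwd(long pid, long d0, long d1, long d2, long d3);
void ipc_resp(long d0, long d1, long d2, long d3);

#define HELPER_PID 3 /* hypothetical pid of the service we need help from */

/* Process 2: a service that needs process 3's help to finish its job. */
void _start(long eid, long tid, long d0, long d1, long d2, long d3)
{
	(void)tid;
	(void)eid;

	/* Migrate into process 3. Because this is ipc_fwd() and not
	 * ipc_req(), process 3 sees our caller as the effective caller,
	 * and all four data registers stay free for the message. */
	struct ipc_ret r = ipc_fwd(HELPER_PID, d0, d1, d2, d3);

	/* We get control back here, so we can sanity-check what process 3
	 * did before 'pretending' we handled everything ourselves. */
	if (r.s != 0) {
		ipc_resp(-1, 0, 0, 0);
		return;
	}

	ipc_resp(r.d0, r.d1, r.d2, r.d3);
}
```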
Avoiding doing unnecessary returns
If you’re familiar with tail calls, the following will likely be fairly familiar. If we’re a service and we know that we don’t need to do anything after an ipc_req(), we can use ipc_tail(). The arguments are the same, the only difference is that when ipc_resp() is called in the next process, the current process is skipped and execution is returned directly to the previous process.

This is of course recursive, so no matter how many processes did an ipc_tail(), they will all be skipped.7 Similarly, if we would’ve done an ipc_fwd(), we can opt for ipc_kick() (kick the can down the road, if you will).
Remember how the return value of an ipc_*() is a struct with a field pid? This becomes relevant right now, as 1 sees that 3 is the one who responded, even though 1 did a migration to 2. Imagine we have a UNIX-like operating system running on top of kmi, we’re process 1, and we want to open a device file, say a UART or something. We can do a migration to 2, which in this case could be a virtual file system or other kind of router. 2 finds out who is responsible for that specific device, say process 3, and does an ipc_kick() to it. Now 3 sees that 1 wants to open it, does some internal bookkeeping, and returns with ipc_resp(). The thread now returns directly to 1, which can see that it just returned from 3, and can direct all further read/write/etc. migrations to 3.
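Here’s a rough sketch of what the router (process 2) could look like under this scheme. The lookup_driver() helper and the treatment of ipc_kick() as non-returning are my assumptions for illustration; since the eventual response skips us, our activation never regains control here.

```c
/* Assumed prototypes; ipc_kick() is treated as non-returning because the
 * response from the next process skips the current one. */
_Noreturn void ipc_kick(long pid, long d0, long d1, long d2, long d3);
long lookup_driver(long device); /* hypothetical: device id -> driver pid */

/* Process 2: a VFS-like router. A thread from process 1 migrates in,
 * asking to open a device file identified (say) by d1. */
void _start(long eid, long tid, long d0, long d1, long d2, long d3)
{
	(void)eid;
	(void)tid;

	long driver = lookup_driver(d1);

	/* Hand the request straight over to the driver. It sees process 1
	 * as the effective caller, and when it calls ipc_resp(), the
	 * thread returns directly to process 1, skipping us entirely. */
	ipc_kick(driver, d0, d1, d2, d3);
}
```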
Activations
Activations are very similar to stack frames, with the exception that previous activations are off-limits to newer ones. By default, the kernel sets the stack pointer register to point to the current activation, and the activation stack can be used as a kind of funky normal stack. However, the activation stack is thread-specific, and threads can’t share pointers to elements on the activation stack among each other. This is generally considered bad practice anyway, but if it’s a feature that is required, the libc can always allocate stacks on the heap to make them accessible to all. The activation stack is not required to be used as the sole stack in a process.
The kernel keeps track of how much space on the activation stack each process uses, and marks those regions inaccessible when a migration occurs. Each process is guaranteed some specific amount of activation stack for each migration, and if there isn’t enough left the migration is cancelled.
Mixed in between process activations are kernel activations: memory reserved for saving registers and for keeping track of which process and effective process we’re in, and where the previous kernel activation is. These regions are never accessible to processes.8
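As a very rough sketch of what a kernel activation could contain, going purely off the description above; field names and the register container are guesses, not kmi’s actual layout:

```c
/* Placeholder register file; the real layout is architecture-specific. */
struct registers {
	long x[32];
};

/* Hypothetical kernel activation record, never visible to userspace. */
struct kernel_activation {
	struct registers regs;          /* saved user registers */
	long pid;                       /* process we migrated into */
	long eid;                       /* effective process */
	struct kernel_activation *prev; /* previous kernel activation */
};
```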
Notifications
To provide at least some way to ‘ping’ a process that some operation has finished, I provide notifications. Notifications reuse a lot of the thread migration code by essentially forcing a thread migration to take place when a notification arrives. There are two categories of notifications, critical and non-critical, where non-critical ones are only allowed to force a migration if the receiving thread is in its owning ‘root’ process, otherwise the notification is queued. Non-critical notifications are things like processes trying to signal each other. Critical notifications are hardware-generated interrupts that need to be handled immediately, and the handler is expected to be very short. Interrupts are disabled during critical notification handling.
‘Benchmark’
As a very simple IPC benchmark, I’ve counted how many ipc_req()/ipc_resp() pairs can be executed in one second and compared it to the reported performance of sending and receiving a message with seL4.9 Below are the rough results:
These results are taken from running on real hardware, so pipelining and cache flushing and so on is taken into account, but unfortunately I don’t have quite the same hardware as the seL4 benchmark, so this is at best an approximation. Still, even so, it seems like we’re broadly speaking in the same ballpark as seL4, which I’m fairly pleased with. seL4 is used in some real-world projects,10 and I believe it might still be one of the fastest message passing implementations in existence, which is why I chose to compare against it. This benchmark would imply that thread migration can be implemented in a manner fast enough to be useful in at least some real-world applications.
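For the curious, the benchmark is conceptually just a loop like the following sketch; the timing helper, clock rate and echo-server pid are placeholders rather than the actual benchmark code.

```c
#include <stdint.h>

/* Same return-value struct as sketched earlier. */
struct ipc_ret {
	long s, pid, d0, d1, d2, d3;
};
struct ipc_ret ipc_req(long pid, long d0, long d1, long d2, long d3);

uint64_t read_cycle_counter(void); /* hypothetical, e.g. rdcycle on RISC-V */

#define ECHO_PID 2                   /* hypothetical server that just ipc_resp()s */
#define CYCLES_PER_SEC 1000000000ULL /* placeholder clock rate */

uint64_t bench_round_trips(void)
{
	uint64_t count = 0;
	uint64_t start = read_cycle_counter();

	while (read_cycle_counter() - start < CYCLES_PER_SEC) {
		/* One full ipc_req()/ipc_resp() pair per iteration. */
		ipc_req(ECHO_PID, 0, 0, 0, 0);
		count++;
	}

	return count; /* round trips per second */
}
```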
Status
At the time of writing, 2024/11/10, the kernel can load and execute programs loaded into memory, create and destroy threads and processes, allocate both private and shared memory, as well as do the operations mentioned in this post (ipc_*(), etc.). Hardware interrupt handling is still a bit unfinished, and I’m currently using a Big Kernel Lock, so the ‘true’ potential parallelism of thread migration isn’t really fully utilized. Of course, a kernel alone isn’t too useful, and I’m slowly working on building a set of services that would enable a simple shell. This would include SD card/MMIO, ext2, and UART drivers, as well as a (very limited) UNIX-like virtual filesystem and userspace scheduling.
This work is still very much unfinished and not published anywhere, but that
would be my mid-term goal for this project. It will hopefully give insight into
how practical my ideas for the kernel are in a ‘real’ system.
Conclusion
kmi is a thread migration implementation with some unique properties, and it seems to be in the same performance ballpark as seL4, which might point towards kmi being useful in the real world with enough further development.
- Tanenbaum, Woodhull. “Operating Systems: Design and Implementation (3rd Edition)” (2005) ↩︎
- Bugs can of course be more insidious than this, and in particular when you start connecting processes together you might see a bug spread from process to process, potentially even in system-destroying ways. ↩︎
- Liedtke et al. “The performance of μ-Kernel-Based Systems” (1997), https://dl.acm.org/doi/10.1145/268998.266660 ↩︎
- Of course, the message passing implementation can implement pipelining, or have multiple threads processing incoming messages, but those solutions increase complexity in the implementation, whereas with thread migration we don’t have to do any of that. ↩︎
- Vilanova et al. “Direct inter-process communication (dIPC)” (2017), https://dl.acm.org/doi/10.1145/3064176.3064197 ↩︎
- If we detect that a process is capturing threads, we can of course always kill it, and apply some security rules, like never migrating to an unknown process. These are just heuristics though, and I’m not aware of any way to ensure capturing is completely impossible. ↩︎
- More specifically, their activation record will be overwritten by the ipc_tail(). More on activations in their own section. ↩︎
- I don’t know how well this design works with things like Spectre, but in my little imaginary corner of computer science I assume the hardware works as it is supposed to and userspace can’t arbitrarily read memory that’s supposed to be kernel-only. ↩︎