
Ejit

I’ve been writing a hybrid bytecode/just-in-time compiler for some time now, and the project is starting to be in good enough shape that I thought I might write a few words about it. Less of a technical post, more of a ‘look at this cool thing I made’.

What?

A bit of theory to start with:

A just-in-time (JIT) compiler is something that generates machine code at runtime, as opposed to ahead-of-time (AOT), like most compilers you’re probably familiar with (gcc, clang, msvc, etc.). JITs are typically used in scripting languages, and to a lesser extent in some scientific calculations where we only know some parameters at runtime and want to optimize based on them.

JITs are interesting because they have to balance how fast they want to generate code against how fast the generated code can be. Generally, the more optimizations you do during code generation, the faster the output is, but at the cost of compilation time. There’s a wide range of different JIT compilers out there that target different regions of this Pareto frontier, ranging from libgccjit and LLVM’s ORC at the very-slow-compiler but incredibly-fast-code extreme to lightening at the other. Often, projects use several tiers of compilers that span the range, with more heavily used code being iteratively recompiled with increasingly heavy compilers.

There’s also a ‘secondary’ attribute, that being the complexity of the API. More complex JIT engines generally follow many of the same optimization strategies as heavy-duty AOT compilers, and the API can reflect that, requiring users to figure out modules, basic blocks and possibly other idiosyncrasies of the implementation. These are generally not too bad once you learn them, but it can be a bit of a pain to get started.

I’m trying to target a niche that I don’t think is particularly well served by any existing JIT compilers at the moment. The idea is to have an API that’s as simple to use as possible, very low compile times and as much portability as possible. The target use case is being the primary (and likely only) execution backend for interpreted (toy) languages, essentially directly generating ejit functions from abstract syntax trees (ASTs).

How?

ejit (pronounced like ‘idiot’ with a deep southern twang) is what I’ve come up with after a few failed iterations. Essentially, I took the massively cool (though slightly poorly named) lightening JIT, added a bytecode interpreter on top of it and implemented virtual registers as well as a linear register allocator with some heuristics.

If you’re not familiar with lightening, it was forked from the similarly named GNU lightning, and provides absolutely no frills. It’s practically speaking a portable assembler, and the user is responsible for selecting which registers to use and managing the stack, among other things. The user even has to allocate a region of memory to write instructions into! GNU lightning has some simple peephole optimizations, but they were ripped out in lightening to increase code generation speed as much as possible.

Why?

Generally, even a poorly integrated JIT scheme will beat a good bytecode interpreter, and the earlier a project decides to go down the JIT route, the better the JIT integration generally is. While I adore GNU lightning and especially lightening for their dedication to code generation speed above all else, they’re not always the easiest to use, and require a bit of elbow grease to get integrated into a project ‘properly’ (some architecture-specific preprocessor magic, etc.). Most projects start out with a bytecode interpreter and then gradually start compiling the bytecode down to machine code, but with ejit I’d like to encourage the opposite. ejit is heavily focused on the JIT aspect, and (hopefully) by targeting it first instead of trying to compile existing bytecode we can get more performant, lightweight execution engines for experimental languages.

Ok, show me the code

Sure, here’s a simple example, the classic recursive fibonacci (slightly modified from https://metanimi.dy.fi/cgit/ejit/tree/examples/fib.c ):

struct ejit_func *compile()
{
#define NR EJIT_GPR(0)
#define RESR EJIT_GPR(1)

	struct ejit_operand args[1] = {
		EJIT_OPERAND_GPR(NR.r, EJIT_INT32)
	};
	struct ejit_func *f = ejit_create_func(EJIT_INT32, 1, args);
	struct ejit_reloc recurse = ejit_bgti(f, NR, 2); /* branch taken when n > 2 */
	ejit_reti(f, 1); /* base case: fib(n) = 1 for n <= 2 */

	struct ejit_label label = ejit_label(f);
	ejit_patch(f, recurse, label); /* tell branch to jump here */

	/* fib(n - 1) */
	ejit_subi(f, NR, NR, 1);
	struct ejit_operand arg[1] = {
		EJIT_OPERAND_GPR(NR.r, EJIT_INT32)
	};
	ejit_calli_i(f, f, 1, arg);
	ejit_retval(f, RESR); /* get result */

	/* fib(n - 2) */
	ejit_subi(f, NR, NR, 1);
	ejit_calli_i(f, f, 1, arg); /* we can reuse arg since NR now has the new value */
	ejit_retval(f, NR);

	ejit_addr(f, RESR, RESR, NR); /* add results: fib(n - 1) + fib(n - 2) */
	ejit_retr(f, RESR);

	ejit_compile_func(f);
	return f;
}

Let’s go through some features:

Actually running the function is done with ejit_run_func_i(FUNC, NARGS, ARGS). Note the type suffixes _i, _l, _f and _d, the same convention as in the ejit_calli_i call above.
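For instance, invoking the compiled function might look roughly like the following. This is a hedged sketch: struct ejit_arg and the EJIT_ARG packing macro are my assumptions about the argument API, see examples/fib.c for the real incantation.

struct ejit_func *f = compile();
/* pack the argument; EJIT_ARG and struct ejit_arg are assumed names */
struct ejit_arg args[1] = { EJIT_ARG(35, int32_t) };
int result = ejit_run_func_i(f, 1, args);
printf("fib(35) = %d\n", result);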

Users & Trivial benchmark

I wrote posthaste for a university course using ejit. posthaste is a statically typed interpreted teaching language, and my implementation contains a fairly straightforward (if somewhat repetitive) AST -> ejit lowering pass in src/lower.c. Statically typed interpreted languages are kind of rare, so this might not be a particularly representative case for the feasibility of ejit as a general purpose execution engine, and I’ve been thinking about writing a dynamically typed version of the language but haven’t gotten around to it yet.
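For a flavor of what such a lowering pass looks like, here’s a heavily simplified sketch; the AST types and the new_reg() helper are made up, and I’m guessing at a couple of ejit names (ejit_movi, struct ejit_gpr), so treat this as illustrative rather than the real src/lower.c:

struct ast { int kind; long value; struct ast *left, *right; };
enum { AST_CONST, AST_ADD };

static struct ejit_gpr lower(struct ejit_func *f, struct ast *n)
{
	switch (n->kind) {
	case AST_CONST: {
		/* materialize the constant into a fresh virtual register */
		struct ejit_gpr r = new_reg();
		ejit_movi(f, r, n->value);
		return r;
	}
	case AST_ADD: {
		/* lower both operands, then add in place */
		struct ejit_gpr a = lower(f, n->left);
		struct ejit_gpr b = lower(f, n->right);
		ejit_addr(f, a, a, b);
		return a;
	}
	default:
		abort(); /* remaining node kinds elided */
	}
}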

See examples for some very simple benchmarks. The runtime performance is pretty good even when just running in pure bytecode mode (i.e. disabling the JIT). For example, comparing examples/fib.ph to ‘equivalent’ functions in some other languages results in the following:

Implementation     Runtime (s)
posthaste (jit)    1.336
luajit 2.1.17      3.106
posthaste (nojit)  9.321
lua 5.4.7          16.307
python 3.13.2      23.216

I’m not claiming these results are super meaningful; posthaste is a lot simpler, and as such it’s not necessarily surprising that it comes out on top. There’s just a lot less cruft that might get in the way of code generation, like garbage collection or dynamic types. I do have a few small peephole optimizations implemented in posthaste, though they’re pretty much all just about using constant values in instructions instead of allocating a virtual register to store a temporary constant value.
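As a concrete example, the kind of peephole I mean turns a constant materialization plus a register-register operation into a single immediate-form instruction. This assumes ejit_movi and ejit_addi exist as siblings of the ejit_subi/ejit_addr used earlier:

/* naive lowering of (a + 3): the constant burns a virtual register */
ejit_movi(f, TMP, 3);
ejit_addr(f, DST, A, TMP);

/* peepholed version: one instruction, no temporary register */
ejit_addi(f, DST, A, 3);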

Speed tricks

There are some tricks under the hood to go fast. First of all, the most important piece is probably the register allocator. If registers are chosen poorly, we’ll be spending a lot of time just moving values to and from the stack, which generally has a significant impact even with modern caches. As such, the register allocator increments each virtual register’s priority when it is accessed. Registers within loops should generally be given priority, since the loop might run the same code multiple times, so there’s a heuristic that can be given to the register allocator: ejit_set/incr/decr_prio(FUNC, INT). This sets how much to increment a register’s priority by when it is accessed, and should generally be adjusted higher when entering a loop and lowered when exiting one. This is fairly easy to do on an AST level, but anything lower than that is probably a bit more tricky.
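On the AST level this boils down to bumping the weight around loop bodies, roughly like the sketch below; the walker and the weight of 10 are illustrative, only the ejit_incr/decr_prio calls come from the actual API:

static void lower_while(struct ejit_func *f, struct ast *loop)
{
	/* registers accessed inside the loop gain priority faster */
	ejit_incr_prio(f, 10);
	lower(f, loop->cond);
	lower(f, loop->body);
	/* back to the surrounding weight once we leave the loop */
	ejit_decr_prio(f, 10);
}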

Since ejit must deal with calling functions either from bytecode or from machine code, I decided to keep the escape interface as simple as possible, with a typical array of type + value arguments passed on the stack. However, this is very slow, so I came up with a workaround. Internally, JITted functions use the native calling convention, and each function gets a ‘trampoline’ that shifts arguments from the stack into registers. When calling another ejit_func, we can check if both functions are JITted and skip the trampoline entirely, getting almost-native function call speeds.
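Conceptually the dispatch looks something like the following; this is purely illustrative, the field names are not ejit’s actual internals:

if (caller_is_jit && callee->native) {
	/* both sides are machine code: call directly, arguments already
	 * live in the native calling convention's registers */
	res = callee->native(a0, a1);
} else {
	/* mixed case: go through the trampoline, which shifts the
	 * type + value argument array from the stack into registers */
	res = callee->trampoline(args, nargs);
}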

Somewhat similarly on the interpreter side, we use a single dynamic vector for registers that is shared between all function calls in a given thread, such that internally a bytecode function passes the current register state to another bytecode function and keeps track of where its own registers are. This avoids having to dynamically allocate new registers for each function call, and is decently fast. There’s still a bit of an impedance mismatch when calling external/native functions with escape; I’ve been thinking of maybe passing the register state to these external functions as well, but that’s not as convenient for JITted functions, unfortunately. Jumping in and out of the interpreter might be something that has to be done with more dynamically typed languages, like checking if a function has to be recompiled for a new set of types or something. Bit of a TODO at the moment.
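To make the shared register vector idea concrete, a rough sketch (again with made-up names, not ejit’s actual internals):

struct interp_state {
	long *regs;   /* one growable buffer shared by the whole thread */
	size_t base;  /* index of the current function's first register */
};

static long call_bytecode(struct interp_state *s, struct bytecode_func *caller,
                          struct bytecode_func *callee)
{
	size_t saved = s->base;
	/* callee's registers start right after the caller's; no per-call
	 * allocation unless the shared buffer has to grow */
	s->base += caller->nregs;
	grow_regs(s, s->base + callee->nregs);
	long res = interp_run(s, callee);
	s->base = saved;
	return res;
}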

By default, ejit generates ‘safe’ machine code, i.e. the executable pages are marked read-only. This unfortunately requires an extra system call, so if you’re absolutely certain that your program never gets called with untrusted input or otherwise can never be compromised, and speed is absolutely paramount, you can set im_scawed = false when calling ejit_select_compile_func and get a decent compilation speed improvement. The pages will remain writable, which can be a security hazard, be warned.

For comparison, here are the compilation times and runtimes for ejit/examples/fib.c, compiled 1 000 000 times and run with n = 42 (./examples/fib 1000000 42 0/1/2, respectively):

Configuration  Compilation time (s)  Runtime (s)
nojit          0.688                 5.990
unsafe jit     3.818                 0.593
safe jit       8.089                 0.594

Just for reference, libgccjit takes about 26418 seconds to compile the equivalent function a million times.

64 vs 32 bits

lightening supports JIT on 32bit architectures, and I wanted to support that as much as possible. The idea I came up with is to present a 64bit API but keep track of which instructions and types are used, disabling JIT compilation on 32bit targets when 64bit operations appear and just using the interpreter (which can easily handle 64bit values) as a fallback. Emulating 64bit operations in the JIT would’ve been too cumbersome and likely not worth it. This is also why there are separate ejit_calli_i and ejit_calli_l instructions: ejit_calli_l forces the return type to be 64bits, whereas ejit_calli_i uses however many bits long happens to be. One can be JIT compiled on 32bit arches, the other cannot.
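The tracking itself can be as simple as flagging the function the moment a 64bit-only operation is recorded; a rough sketch of the idea, with made-up names rather than ejit’s actual internals:

static void emit_insn(struct ejit_func *f, struct insn i)
{
	/* one 64bit-only instruction is enough to make the whole
	 * function interpreter-only on 32bit targets */
	if (insn_needs_64bit(i))
		f->jit_ok = false;
	insn_vec_append(&f->insns, i);
}

The interpreter is still decently fast; again using ejit/examples/fib.c, here’s a comparison between 32bit x86 and 64bit x86_64 running on the same CPU: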

Configuration  Runtime (s)
32bit nojit    8.249
32bit jit      0.727
64bit nojit    5.768
64bit jit      0.593

Emulating 64bit instructions in the interpreter is a bit more costly than having them natively, so the 32bit bytecode has a slightly worse performance ratio compared to the 64bit interpreter (~11x vs ~10x slower than the JIT), but it’s not too bad. Trying to get 32bit code to JIT can be a bit tricky, so you should carefully consider whether you want to write ejit code in a 32bit friendly manner or just say fuck it.

Limitations and future improvements

Stack allocations are not supported at the moment. They should be possible to implement, I just haven’t really needed them so far.

It would be fairly straightforward to keep track of which virtual registers are live when, and this information might be useful for packing virtual registers more tightly into physical registers, but I’m worried it would need some kind of static single assignment (SSA) analysis, which might raise the complexity a bit too much for the expected gains.

SIMD instructions might be interesting to play around with, but there’s currently no support for them in lightening, so this would probably be a fairly significant undertaking.

Similar projects

MIR also tries to be a lightweight execution engine, and has a built-in interpreter. MIR has a number of optimization passes and does CFG/SSA analysis, which might produce faster machine code at the cost of code generation speed. I played around a bit and found it to be ~3-4x slower at generating code, but didn’t see any significantly faster runtime performance for the produced machine code. I wasn’t able to get the interpreter working. My testing was very limited though, so please don’t take this as a dismissal of the project outright.

Cranelift also sports a more extensive catalogue of optimization passes and does analysis, seemingly more so than MIR, and even has support for SIMD, which is really cool. It seems to be at least a bit slower at generating code than MIR, and my limited testing didn’t find cases where its optimizations resulted in any significant speed boosts. I also ran into some issues when trying to generate multiple functions, with Cranelift panicking after generating about 64K functions. It’s written in Rust, which is neat, but I’m not too familiar with the language, and it’s possible that the issue was me doing something wrong.

SLJIT is on a similar level as lightening, i.e. a kind of portable assembler where the user has to allocate registers manually. It does seem to handle executable memory allocation for the user, though. I have not tried generating code with it yet, but it seems to support SIMD and custom instructions, which is really cool.

Addendum 2025-03-15

There’s an obvious alternative to passing around an interpreter ‘thread’: allocating all registers on the stack with alloca or variable length arrays. This has the downside of not really being able to detect when we run out of memory, and would probably require some maximum recursion depth limit to be implemented, although that kind of already applies to the JIT as well. On the upside, jumping into the interpreter is then about as fast as it can be, which I imagine is useful for implementing more dynamic compilation schemes.
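Roughly, the idea is the following; names are illustrative, the real code lives in the alloca branch:

static long run_func(struct bytecode_func *f, struct ejit_arg *args, size_t nargs)
{
	/* registers live in this call's stack frame: no shared vector,
	 * but also no way to notice we're about to blow the stack */
	long *regs = alloca(f->nregs * sizeof(*regs));
	load_args(regs, args, nargs);
	return interp(f, regs);
}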

A quick benchmark for the alloca implementation:

Configuration  Runtime (s)
32bit alloca   9.133
32bit nojit    8.249
64bit alloca   4.970
64bit nojit    5.768

Apparently messing with the stack frame at runtime is not great on 32bit x86, but still, not too bad. For now, the alloca implementation lives in its own alloca branch; I’m still a bit scared of the safety implications, although it’s worth noting that I don’t currently handle running out of memory particularly gracefully anyway. Currently some critical allocations are just guarded by an assertion, but since the assertion would like to print a string before aborting, some libc internals might explode. So, y’know, make sure you don’t run out of memory.

#ejit