Ejit
I’ve been writing a hybrid bytecode/just-in-time compiler for some time now, and the project is starting to be in good enough shape that I thought I might write a few words about it. Less a technical post, more ‘look at this cool thing I made’.
What?
A bit of theory to start with:
A just-in-time (JIT) compiler is something that generates machine code at runtime, as opposed to ahead-of-time (AOT), like most compilers you’re probably familiar with (gcc, clang, msvc, etc). JITs are typically used in scripting languages, and to a lesser extent in some scientific calculations where we only know some parameters at runtime and want to optimize based on them.
JITs are interesting because they have to balance how fast they want to generate code against how fast the generated code can be. Generally, the more optimizations you do during code generation, the faster the output is, but at the cost of compilation time. There’s a wide range of different JIT compilers out there that target different regions of this Pareto frontier, ranging from libgccjit and LLVM’s ORC at the very-slow-compiler but incredibly-fast-code extreme to lightening at the other extreme. Often, projects use several tiers of compilers that span the range, with more heavily used code being iteratively recompiled with increasingly heavy compilers.
There’s also a ‘secondary’ attribute: the complexity of the API. More complex JIT engines generally follow many of the same optimization strategies as heavy-duty AOT compilers, and the API can reflect that, requiring users to figure out modules, basic blocks and possibly other idiosyncrasies of the implementation. These are generally not too bad once you learn them, but it can be a bit of a pain to get started.
I’m trying to target a niche that I don’t think is particularly well served by any existing JIT compilers at the moment. The idea is to have an API that’s as simple to use as possible, have very low compile times and be as portable as possible. The target use case is being the primary (and likely only) execution backend for interpreted (toy) languages, essentially generating ejit functions directly from abstract syntax trees (ASTs).
How?
ejit (pronounced like ‘idiot’ with a deep southern twang) is what I’ve come up with after a few failed iterations. Essentially, I took the massively cool (though slightly poorly named) lightening JIT, added a bytecode interpreter on top of it and implemented virtual registers as well as a linear register allocator with some heuristics.
If you’re not familiar with lightening, it was forked from the similarly named GNU lightning, and provides absolutely no frills. It’s practically speaking a portable assembler, and the user is responsible for selecting which registers to use and managing the stack, among other things. The user even has to allocate a region of memory to write instructions into! GNU lightning has some simple peephole optimizations, but they were ripped out in lightening to increase code generation speed as much as possible.
Why?
Generally, even a poorly integrated JIT scheme will beat a good bytecode interpreter, and the earlier a project decides to go down the JIT route, the better the JIT integration generally is. While I adore GNU lightning and especially lightening for their dedication to code generation speed above all else, they’re not always the easiest to use, and require a bit of elbow grease to get integrated into a project ‘properly’ (some architecture-specific preprocessor magic, etc). Most projects typically start out with a bytecode interpreter and then start gradually compiling the bytecode down to machine code, but with ejit I’d like to encourage the opposite. ejit is heavily focused on the JIT aspect, and (hopefully) by targeting it first instead of trying to compile existing bytecode we can get more performant, lightweight execution engines for experimental languages.
Ok, show me the code
Sure, here’s a simple example, the classic recursive Fibonacci (slightly modified from https://metanimi.dy.fi/cgit/ejit/tree/examples/fib.c):
struct ejit_func *compile()
{
#define NR EJIT_GPR(0)
#define RESR EJIT_GPR(1)
	struct ejit_operand args[1] = {
		EJIT_OPERAND_GPR(NR.r, EJIT_INT32)
	};
	struct ejit_func *f = ejit_create_func(EJIT_INT32, 1, args);

	struct ejit_reloc recurse = ejit_bgti(f, NR, 2); /* skip base case when n > 2 */
	ejit_reti(f, 1); /* base case: n <= 2 returns 1 */

	struct ejit_label label = ejit_label(f);
	ejit_patch(f, recurse, label); /* tell branch to jump here */

	/* fib(n - 1) */
	ejit_subi(f, NR, NR, 1);
	struct ejit_operand arg[1] = {
		EJIT_OPERAND_GPR(NR.r, EJIT_INT32)
	};
	ejit_calli_i(f, f, 1, arg); /* recursive call to ourselves */
	ejit_retval(f, RESR); /* get result */

	/* fib(n - 2) */
	ejit_subi(f, NR, NR, 1);
	ejit_calli_i(f, f, 1, arg); /* we can reuse arg since NR now has the new value */
	ejit_retval(f, NR);

	ejit_addr(f, RESR, RESR, NR); /* add results: fib(n - 1) + fib(n - 2) */
	ejit_retr(f, RESR);

	ejit_compile_func(f);
	return f;
}
Let’s go through some features:
- ejit_create_func(RETURN_TYPE, NARGS, ARGS) is how a new function is created. ARGS refers to the type signature, i.e. what types of parameters a function takes. Each function is its own context, so there’s no need to keep track of anything beyond this returned pointer. Functions are destroyed with ejit_destroy_func(FUNC), after which the function should never be called from anywhere. It is up to the user to ensure this.
- struct ejit_reloc is for relocations. Relocations are essentially placeholders that branching instructions use to know where to jump. When we want to fill this placeholder, we just call ejit_patch(FUNC, RELOC, LABEL). Labels are equivalent to assembly labels if you’ve used them before, and can be created with ejit_label(FUNC).
- EJIT_GPR(0) is a variable, or slot, or virtual register, take your pick. Instructions generally operate on registers (or possibly on constant values), and there’s no real limit on how many a function can have. ejit will try to place the virtual registers with the highest priority (calculated based on how often the register is used and an ejit_set_prio(FUNC, INT) heuristic) into physical registers, and the rest just go on the native stack. There are also floating point variables/slots/virtual registers, EJIT_FPR(...). EJIT_GPR(N) and EJIT_FPR(N) are not related and can both be used at the same time. All virtual registers stay the same before and after a call, i.e. they’re treated as being caller-save.
- ejit_calli_i(FUNC, TARGET_FUNC, NARGS, ARGS) is how a call to another ejit_func is made. ejit typechecks ARGS against TARGET_FUNC. Note that the return value is fetched separately, with ejit_retval(FUNC, GPR). This allows us to possibly save an instruction when calling void functions (what an optimization, lol), but does kind of mean that you could call a function with some return type and accidentally read the return value as another type, so be careful. Note the trailing _i suffix; there are three others: _l, _f and _d. _i is a generic ’long’ return type, and _l is specifically int64_t. More on this in the 64 vs 32 bits section below. _f and _d refer to float and double, respectively. ejit_retr(FUNC, GPR) is how to return the value in a slot. Only a single value can be returned per function. Calling native functions is done via ejit_escape_i/l/f/d(FUNC, PTR, NARGS, ARGS) (see the sketch after this list).
- ejit_compile_func(FUNC) is all that’s needed to actually prepare the function for execution. If the function can be JIT compiled, it will be; otherwise bytecode will be generated and the function will be interpreted. There’s also a more detailed ejit_select_compile_func with more parameters, but ejit_compile_func will generally pick the safest options that are sure to work, and it’s probably best to just use it.
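To make the escape mechanism a bit more concrete, here’s a hedged sketch of calling out to a native C function. The callback prototype and the struct ejit_arg field name are my assumptions (the post only promises a type + value argument array), so check the headers for the real shape:

#include <stdio.h>

/* assumed callback shape and .i field name; the arguments arrive as a
 * type + value array */
static long print_int(size_t argc, const struct ejit_arg args[])
{
	(void)argc;
	printf("%d\n", args[0].i);
	return 0;
}

/* ...and during function construction, with some slot N (an EJIT_GPR)
 * holding the value to print: */
struct ejit_operand arg[1] = { EJIT_OPERAND_GPR(N.r, EJIT_INT32) };
ejit_escape_i(f, print_int, 1, arg);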
Actually running the function is done with ejit_run_func_i(FUNC, NARGS, ARGS). Note again the suffixes _i, _l, _f and _d.
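Putting it together, here’s a minimal sketch of driving the compile() helper from above. The EJIT_ARG macro for packaging immediate arguments is an assumption on my part; see examples/fib.c for the real incantation:

#include <stdio.h>

int main(void)
{
	struct ejit_func *f = compile();
	/* EJIT_ARG(value, type) is assumed; the real argument-passing
	 * macro may differ */
	struct ejit_arg args[1] = { EJIT_ARG(35, int32_t) };
	long res = ejit_run_func_i(f, 1, args);
	printf("fib(35) = %ld\n", res);
	ejit_destroy_func(f);
	return 0;
}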
Users & Trivial benchmark
I wrote posthaste for a university course using ejit. posthaste is a statically typed interpreted teaching language, and my implementation contains a fairly straightforward (if somewhat repetitive) AST -> ejit lowering pass in src/lower.c. Statically typed interpreted languages are kind of rare, so this might not be a particularly representative case for the feasibility of ejit as a general purpose execution engine; I’ve been thinking about writing a dynamically typed version of the language but haven’t gotten around to it yet.
See examples for some very simple benchmarks. The runtime performance is pretty good even when just running in pure bytecode mode (i.e. disabling the JIT). For example, comparing examples/fib.ph to ‘equivalent’ functions in some other languages results in the following:
Implementation | Runtime (s)
---|---
posthaste (jit) | 1.336
luajit 2.1.17 | 3.106
posthaste (nojit) | 9.321
lua 5.4.7 | 16.307
python 3.13.2 | 23.216
I’m not claiming these results are super meaningful; posthaste is a lot simpler, and as such it’s not necessarily surprising that it comes out on top. There’s just a lot less cruft that might get in the way of code generation, like garbage collection or dynamic types.
I do have a few small peephole optimizations implemented in posthaste, though they’re pretty much all just about using constant values in instructions instead of allocating a virtual register to store a temporary constant value.
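For a flavor of what these look like: when one operand is a compile-time constant, the lowering pass can emit the immediate form of an instruction instead of first materializing the constant into a slot. A hedged sketch of the idea (ejit_addi and ejit_movi are assumed to exist alongside the ejit_subi/ejit_addr style instructions shown earlier):

/* sketch of the peephole idea, not posthaste’s actual code */
if (rhs_is_constant) {
	/* one instruction, no temporary slot */
	ejit_addi(f, dst, lhs, rhs_value);
} else {
	ejit_movi(f, tmp, rhs_value); /* assumed load-immediate */
	ejit_addr(f, dst, lhs, tmp);
}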
Speed tricks
There are some tricks under the hood to go fast. First of all, the most important piece is probably the register allocator. If registers are chosen poorly, we’ll be spending a lot of time just moving values to and from the stack, which generally has a significant impact even with modern caches. As such, the register allocator increments each virtual register’s priority when it is accessed. Generally, registers within loops should be given priority, since the loop might run the same code multiple times, so there’s a heuristic that can be given to the register allocator: ejit_set/incr/decr_prio(FUNC, INT). This sets how much to increment a register’s priority by when it is accessed, and should generally be adjusted higher when entering a loop and lowered when exiting one. This is fairly easy to do on an AST level, but anything lower than that is probably a bit more tricky.
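As an illustration of using the heuristic when lowering a loop (a sketch, assuming ejit_incr_prio/ejit_decr_prio mirror ejit_set_prio’s FUNC, INT parameters; the multiplier and the CNT slot are made up):

#define CNT EJIT_GPR(2) /* hypothetical loop counter slot */

ejit_incr_prio(f, 10); /* weight registers touched in the loop body more */
struct ejit_label head = ejit_label(f);
/* ...emit loop body here; registers it uses gain priority faster... */
ejit_subi(f, CNT, CNT, 1);
struct ejit_reloc back = ejit_bgti(f, CNT, 0);
ejit_patch(f, back, head); /* loop while CNT > 0 */
ejit_decr_prio(f, 10); /* restore normal weighting after the loop */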
Since ejit must deal with calling functions either from bytecode or from machine code, I decided to keep the escape interface as simple as possible, with a typical array of type + value arguments passed on the stack. However, this is very slow, so I came up with a workaround. Internally, JITted functions use the native calling convention, and each function gets a ’trampoline’ that shifts arguments from the stack into registers. When calling another ejit_func, we can check if they’re both JITted and skip the trampoline entirely, getting almost-native function call speeds.
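The dispatch boils down to something like this (an illustrative sketch of the idea only, with made-up names, not ejit’s actual internals):

/* illustrative only: skip the argument-shifting trampoline when both
 * sides of the call are JIT compiled */
if (caller_is_jitted && callee->jit_entry) {
	/* both sides already speak the native calling convention */
	ret = callee->jit_entry(a0, a1);
} else {
	/* shift the type + value array into the native convention */
	ret = callee->trampoline(args, nargs);
}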
Somewhat similarly on the interpreter side, we use a single dynamic vector for registers that is shared between all function calls in a given thread, such that internally a bytecode function passes the current register state to another bytecode function and keeps track of where its own registers are. This avoids having to dynamically allocate new registers for each function call, and is decently fast. There’s still a bit of an impedance mismatch when calling external/native functions with escape; I’ve been thinking of maybe passing the register state to these external functions as well, but that’s not as convenient for JITted functions, unfortunately. Jumping in and out of the interpreter might be something that has to be done with more dynamically typed languages, like checking if a function has to be recompiled for a new set of types or something. Bit of a TODO at the moment.
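Roughly, the register sharing looks like this (again, an illustration of the idea with made-up names rather than the real internals):

/* illustrative only: one register vector per interpreter thread,
 * each call frame owning a window [base, base + nregs) into it */
static void interp_run(struct vec *regs, struct bytecode_func *f, size_t base)
{
	vec_reserve(regs, base + f->nregs); /* grow once, reuse across calls */
	/* ...execute f against regs[base + r]; a call hands the callee
	 * base + f->nregs as its own base, so no per-call allocation
	 * is needed... */
}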
By default, ejit generates ‘safe’ machine code, i.e. the pages are marked read-only. This unfortunately requires an extra system call, so if you’re absolutely certain that your program never gets called with untrusted input or otherwise can never be compromised, and speed is absolutely paramount, you can set im_scawed = false when calling ejit_select_compile_func and get a decent compilation speed improvement. Pages will remain writable, which can be a security hazard, be warned.
For comparison, here are the compilation times and runtimes for ejit/examples/fib.c, compiled 1 000 000 times and run with n = 42 (./examples/fib 1000000 42 0/1/2, respectively):
Configuration | Compilation time (s) | Runtime (s)
---|---|---
nojit | 0.688 | 5.990
unsafe jit | 3.818 | 0.593
safe jit | 8.089 | 0.594
Just for reference, libgccjit takes about 26418 seconds to compile the equivalent function a million times.
64 vs 32 bits
lightening supports JIT on 32bit architectures, and I wanted to support that as much as possible. The idea I came up with is to present a 64bit API, but keep track of which instructions and types are used, disable JIT compilation on 32bit when 64bit operations are used, and just use the interpreter (which can easily handle 64bit values) as a fallback. Emulating 64bit operations in the JIT would’ve been too cumbersome and likely not worth it. This is also why there are separate ejit_calli_i and ejit_calli_l instructions: ejit_calli_l forces the return type to be 64 bits, whereas ejit_calli_i uses however many bits long happens to be. One can be JIT compiled on 32bit arches, the other cannot.
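Concretely (a hedged sketch; the call shapes just follow the signatures above):

/* can still be JIT compiled on a 32bit target: the return value is
 * whatever size a native long happens to be */
ejit_calli_i(f, g, 1, arg);

/* forces a 64bit return value: on a 32bit target this demotes the
 * whole function to the bytecode interpreter */
ejit_calli_l(f, g, 1, arg);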
The interpreter is still decently fast, again using ejit/examples/fib.c as a comparison between 32bit x86 and 64bit x86_64 running on the same CPU:
Configuration | Runtime (s)
---|---
32bit nojit | 8.249
32bit jit | 0.727
64bit nojit | 5.768
64bit jit | 0.593
Emulating 64bit instructions in the interpreter is a bit more costly than having them natively, so the 32bit bytecode has a slightly worse performance ratio compared to the 64bit interpreter (~11x vs ~10x slower than the JIT), but not too bad. Trying to get 32bit code to JIT can be a bit tricky, so you should carefully consider whether you want to write ejit code in a 32bit-friendly manner or just say fuck it.
Limitations and future improvements
Stack allocations are not supported at the moment. Should be possible to implement, just haven’t really needed them so far.
It would be fairly straightforward to keep track of which virtual registers are alive when, and this information might be useful for moving virtual registers more tightly into physical registers, but I’m worried it would need some kind of static single assignment (SSA) analysis, which might raise the complexity a bit too much for the expected gains.
SIMD instructions might be interesting to play around with, but there’s currently no support for them in lightening, so this would probably be a fairly significant undertaking.
Similar projects
MIR also tries to be a lightweight execution engine, and has a built-in interpreter. MIR has a number of optimization passes and does CFG/SSA analysis, which might produce faster machine code at the cost of code generation speed. I played around a bit and found it to be ~3-4x slower at generating code, but didn’t see any significantly faster runtime performance for the produced machine code. I wasn’t able to get the interpreter working. My testing was very limited though, so please don’t take this as a dismissal of the project outright.
Cranelift also sports a more extensive catalogue of optimization passes and does analysis, seemingly more so than MIR, and even has support for SIMD, which is really cool. It seems to be at least a bit slower at generating code than MIR, and my limited testing didn’t seem to find cases where its optimizations resulted in any significant speed boosts. I also ran into some issues when trying to generate multiple functions, with Cranelift panicking after about 64K functions generated. It’s written in Rust, which is neat, but I’m not too familiar with the language and it’s possible that the issue was from me doing something wrong.
SLJIT is on a similar level as lightening, i.e. a kind of portable assembler where the user has to allocate registers manually. It does seem to handle executable memory allocations for the user, though. I have not tried generating code with it yet, but it seems to support SIMD and custom instructions, which is really cool.
Addendum 2025-03-15
There’s an obvious alternative to passing around an interpreter ’thread’: allocating all registers on the stack with alloca or variable length arrays. This has the downside of not really being able to detect when we run out of memory, and would probably require some maximum recursion depth limit to be implemented, although that kind of already applies to the JIT as well. On the upside, jumping into the interpreter is now about as fast as it can be, which I imagine is useful for implementing more dynamic compilation schemes.
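In sketch form (illustrative only, with made-up names; the real code lives in the alloca branch mentioned below):

/* illustrative only: per-call register storage as a variable length
 * array instead of a shared per-thread vector */
static long interp_run(struct bytecode_func *f)
{
	long regs[f->nregs]; /* VLA: no way to detect stack exhaustion here */
	/* ...execute f’s bytecode against regs... */
	return regs[0]; /* wherever the return slot happens to live */
}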
A quick benchmark for the alloca implementation:
Configuration | Runtime (s)
---|---
32bit alloca | 9.133
32bit nojit | 8.249
64bit alloca | 4.970
64bit nojit | 5.768
Apparently messing with the stack frame at runtime is not great on 32bit x86, but still, not too bad. For now, the alloca implementation lives in its own alloca branch; I’m still a bit scared of the safety implications. Although it’s worth noting that I don’t currently handle running out of memory particularly gracefully anyway: currently some critical allocations are just guarded by an assertion, but since the assertion would like to print a string before aborting, some libc internals might explode. So, y’know, make sure you don’t run out of memory.