Prospero challenge with ejit
This is my implementation of the
prospero challenge using
ejit
.
As a quick introduction, here's an excerpt from the ejit
README:
ejit
is a generic JIT/bytecode interpreter with a strong focus on compilation speed and portability. It's built on top of the excellentlightening
pure JIT library, but adds a bytecode layer for systems thatlightening
doesn't support. The bytecode layer also grants some QoL improvements, such as an (effectively) unlimited number of virtual registers and a register allocator for mapping theses virtual registers down to physical registers and a slightly easier to use API (at least IMO).
ejit
is primarily inteded to be used as the main execution engine for
experimental/toy scripting languages, so the prospero challenge is arguably
a bit outside the intended use case of ejit
, it not really being designed with
relatively heavy computation in mind. Still, this was a good opportunity to see
just how well/poorly the library can handle such tasks.
Running
git submodule update --init --recursive
make
The produced binary has a few command line parameters:
usage: prospero [-j threads] [-r resolution] [-c]
-j how many threads to run
-r length of one side
-c enable JIT
By default, the binary runs a single worker thread, produces an image of size
1024x1024 and does not enable the JIT part of ejit
.
Results
Here are some runtime results for when the JIT compiler is enabled:
Params | Runtime for JIT (s) |
---|---|
-j1 -c |
8.943 |
-j2 -c |
4.357 |
-j6 -c |
1.678 |
-j12 -c |
1.617 |
I have an AMD Ryzen 5 5600X with 6 cores and 12 threads. I was kind of surprised
that -j12
wasn't any faster than -j6
, but if I had to guess as to why,
presumably there are only so many physical floating point units (FPUs) and when
more than 6 threads start running, they have to start sharing FPUs and threads
start stalling each other. Maybe, I haven't looked into the pipeline of my
processor to know for sure.
I was also curious to see how the bytecode interpreter fared, since the main
focus of ejit
is for the JIT to be efficient, and the bytecode is mainly as a
backup for more obscure hardware. Here's are some runtime results for when the
JIT compiler is disabled:
Params | Runtime for bytecode (s) |
---|---|
-j1 |
35.311 |
-j2 |
18.514 |
-j6 |
6.155 |
-j12 |
3.760 |
In this case, I'm guessing that there's enough overhead during interpretation that adding threads beyond the number of physical cores still improves performance. Interpretation is something mainly done with integer ALUs, which I guess my CPU has more of than FPUs?
Still, very interesting results. For reference, I ran Aviva
Ruben's Cranelift implementation
which finished in 0.241
seconds when using all threads (I believe
into_par_iter()
uses threads, anyway...). Very impressive! Cranelift is
particularly cool in that it exposes SIMD instructions, which ejit
does not
have.
To roughly compare the compilation speed of ejit
with JIT enabled vs.
Cranelift, setting the resolution of the generated image to 1 gives the
following table:
Framework | Total runtime (s) |
---|---|
Cranelift | 0.088 |
ejit |
0.004 |
This runtime is not quite the same as the compilation time, but the compilation
should be the dominating factor in this case. There's a fairly obvious tradeoff
here, Cranelift can produce much faster code (especially if you can take advantage
of SIMD operations), whereas ejit
can produce machine code much faster. This
is what I was expecting, but it's nice to see some (semi)hard numbers.
Limitations of ejit
for this particular challenge
As previously stated, not having SIMD instructions hurts performance
significantly. I've been thinking about how SIMD could be implemented, but
nothing concrete yet. There's a lot more variability in SIMD implementations than
base ISAs, and it seems to be rather difficult to be generic over any large-ish
portion of them. sljit
takes the approach
that the user can place custom instructions into the instruction stream and
implement SIMD for specific platforms, which is cool, but is kind of at
odds with the 'portability' aspect of ejit
that I would like to maintain as
far as possible.
ejit
also has at least so far had less of a focus on floating point
instructions, and max
/min
were implemented with branches because ejit
simply doesn't have any built-in equivalent instructions. sqrt
I had implement
as a function call to libm
, which seems kind of silly if we're going for top
performance. Square roots seem to be fairly common in prospero.vm
as well,
probably compounding the problem. Then again, these operations are rather
mathematical and I'm not sure how useful they are for 'typical' scripting
languages, which, again, is more what ejit
is aimed towards. I'm not in a huge
rush to start implementing these operations as built-ins quite yet, but
definitely something to think about.
I also noticed that in prospero.vm
a lot of variables are only used once. The
current register allocator in ejit
assigns a variable to a register 'forever',
meaning that the vast majority of all variables in prospero.vm
get placed on
the stack. This is unfortunate, but would require something like Static Single
Assignment (SSA)
to properly tackle, which I suspect would increase compilation times
dramatically. It might be possible to get by with some kind of simple lifetime
tracking scheme which would save on compilation time, but I'm not convinced the
register allocations would be significantly better than the current scheme.
Something to keep in mind for the future, though.