# Prospero challenge with `ejit`

This is my implementation of the [prospero](https://www.mattkeeter.com/projects/prospero/) challenge using [`ejit`](https://metanimi.dy.fi/cgit/ejit/). As a quick introduction, here's an excerpt from the `ejit` README:

> `ejit` is a generic JIT/bytecode interpreter with a
> strong focus on compilation speed and portability. It's built on top of the
> excellent [`lightening`](https://gitlab.com/wingo/lightening) pure JIT library,
> but adds a bytecode layer for systems that `lightening` doesn't support. The
> bytecode layer also grants some QoL improvements, such as an (effectively)
> unlimited number of virtual registers, a register allocator for mapping
> these virtual registers down to physical registers, and a slightly easier
> to use API (at least IMO).

`ejit` is primarily intended to be used as the main execution engine for experimental/toy scripting languages, so the **prospero** challenge, with its relatively heavy computation, is arguably a bit outside the intended use case. Still, this was a good opportunity to see just how well (or poorly) the library handles such tasks.

## Running

```
git submodule update --init --recursive
make
```

The produced binary has a few command line parameters:

> usage: prospero [-j threads] [-r resolution] [-c]
>   -j  how many threads to run
>   -r  length of one side
>   -c  enable JIT

By default, the binary runs a single worker thread, produces a 1024x1024 image, and does not enable the JIT part of `ejit`.
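To give a feel for how the per-pixel work parallelizes, here's a minimal sketch of splitting the image into row bands across worker threads. This is an illustration rather than the actual code: `eval_pixel()` is a hypothetical stand-in for invoking the compiled `ejit` function, and the contiguous-band split is an assumption about the work division.

```c
#include <pthread.h>

#define RES 1024

/* Stand-in for running the compiled ejit code at (x, y); the name is
 * made up for this sketch. */
extern float eval_pixel(float x, float y);

struct job { int y0, y1; unsigned char *img; };

static void *worker(void *arg)
{
	struct job *j = arg;
	for (int y = j->y0; y < j->y1; y++) {
		for (int x = 0; x < RES; x++) {
			/* map pixel coordinates into [-1, 1] */
			float fx = 2.0f * x / (RES - 1) - 1.0f;
			float fy = 1.0f - 2.0f * y / (RES - 1);
			/* negative means 'inside' in the challenge's convention */
			j->img[y * RES + x] = eval_pixel(fx, fy) < 0.0f ? 255 : 0;
		}
	}
	return NULL;
}

/* Launch nthreads workers over disjoint row bands, then join; the
 * workers share no mutable state, so no locking is needed. */
static void render(unsigned char *img, int nthreads)
{
	pthread_t tid[nthreads];
	struct job jobs[nthreads];
	for (int i = 0; i < nthreads; i++) {
		jobs[i] = (struct job){ RES * i / nthreads,
		                        RES * (i + 1) / nthreads, img };
		pthread_create(&tid[i], NULL, worker, &jobs[i]);
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
}
```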
## Results

Here are some runtime results with the JIT compiler enabled:

| Params  | Runtime for JIT (s) |
|---------|---------------------|
| -j1 -c  | 8.943               |
| -j2 -c  | 4.357               |
| -j6 -c  | 1.678               |
| -j12 -c | 1.617               |

I have an AMD Ryzen 5 5600X with 6 cores and 12 threads. I was somewhat surprised that `-j12` wasn't any faster than `-j6`, but if I had to guess as to why: there are only so many physical floating point units (FPUs), and once more than six threads are running, they have to start sharing FPUs and stalling each other. Maybe; I haven't looked into my processor's pipeline closely enough to know for sure.

I was also curious to see how the bytecode interpreter fared, since the main focus of `ejit` is an efficient JIT, with the bytecode mainly serving as a fallback for more obscure hardware. Here are some runtime results with the JIT compiler disabled:

| Params | Runtime for bytecode (s) |
|--------|--------------------------|
| -j1    | 35.311                   |
| -j2    | 18.514                   |
| -j6    | 6.155                    |
| -j12   | 3.760                    |

In this case, I'm guessing that there's enough overhead during interpretation that adding threads beyond the number of physical cores still improves performance. Interpretation is mainly done with integer ALUs, which I'd guess my CPU has more of than FPUs. Still, very interesting results.

For reference, I ran [Aviva Ruben](https://whtwnd.com/aviva.gay/3ll5dbskg3v26)'s Cranelift implementation, which finished in `0.241` seconds when using all threads (I believe `into_par_iter()` uses threads, anyway...). Very impressive! Cranelift is particularly cool in that it exposes SIMD instructions, which `ejit` does not have.

To roughly compare the compilation speed of `ejit` (with JIT enabled) against Cranelift, setting the resolution of the generated image to 1 gives the following table:

| Framework | Total runtime (s) |
|-----------|-------------------|
| Cranelift | 0.088             |
| `ejit`    | 0.004             |

Total runtime is not quite the same as compilation time, but compilation should be the dominating factor at this resolution. There's a fairly obvious tradeoff here: Cranelift can produce much faster code (especially if you can take advantage of SIMD operations), whereas `ejit` can produce machine code much faster. This is what I was expecting, but it's nice to see some (semi-)hard numbers.

## Limitations of `ejit` for this particular challenge

As previously stated, not having SIMD instructions hurts performance significantly. I've been thinking about how SIMD could be implemented, but I have nothing concrete yet. There's a lot more variability in SIMD implementations than in base ISAs, and it seems rather difficult to be generic over any large-ish portion of them. [`sljit`](https://github.com/zherczeg/sljit) takes the approach that the user can place custom instructions into the instruction stream and implement SIMD for specific platforms, which is cool, but it's kind of at odds with the 'portability' aspect of `ejit` that I would like to maintain as far as possible.

`ejit` has also, at least so far, had less of a focus on floating point instructions: `max`/`min` were implemented with branches because `ejit` simply doesn't have equivalent built-in instructions, and `sqrt` I had to implement as a function call into `libm`, which seems kind of silly if we're going for top performance (both lowerings are sketched at the end of this section). Square roots seem to be fairly common in `prospero.vm` as well, probably compounding the problem. Then again, these operations are rather mathematical and I'm not sure how useful they are for 'typical' scripting languages, which, again, is more what `ejit` is aimed at. I'm not in a huge rush to start implementing these operations as built-ins quite yet, but it's definitely something to think about.

I also noticed that in `prospero.vm` a lot of variables are only used once. The current register allocator in `ejit` assigns a variable to a register 'forever', meaning that the vast majority of all variables in `prospero.vm` get placed on the stack. This is unfortunate, but would require something like [Static Single Assignment (SSA)](https://en.wikipedia.org/wiki/Static_single-assignment_form) to tackle properly, which I suspect would increase compilation times dramatically. It might be possible to get by with some kind of simple lifetime tracking scheme that would save on compilation time (a sketch of the idea closes out this section), but I'm not convinced the resulting register allocations would be significantly better than with the current scheme. Something to keep in mind for the future, though.
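To make the `max`/`min` and `sqrt` lowerings above concrete, here's a rough C sketch of what the generated code effectively computes. The `vm_*` names are made up for illustration and are not the `ejit` API.

```c
#include <math.h>

/* A branchy lowering of min/max: conceptually a compare followed by a
 * conditional jump or move, as opposed to a single hardware min/max
 * instruction (e.g. x86 minss/maxss or a SIMD lane-wise min). */
static float vm_min(float a, float b) { return a < b ? a : b; }
static float vm_max(float a, float b) { return a < b ? b : a; }

/* sqrt as an out-of-line call into libm: correct, but pays full call
 * overhead for what could be a single sqrtss instruction. */
static float vm_sqrt(float a) { return sqrtf(a); }
```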
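And here's a minimal sketch of the kind of simple lifetime tracking I have in mind: one backward pass over a linear instruction stream records each virtual register's last use, so a forward allocation pass could recycle physical registers immediately. All names are hypothetical; none of this is implemented in `ejit`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical linear instruction: dst = op(src0, src1), where -1
 * marks an unused operand slot. */
struct insn { int32_t dst, src0, src1; };

/* last_use[v] ends up as the index of the last instruction reading
 * virtual register v, or -1 if v is never read. A single O(n)
 * backward pass, much cheaper than building SSA. */
static void compute_last_use(const struct insn *code, size_t n,
                             ptrdiff_t *last_use, size_t nregs)
{
	for (size_t v = 0; v < nregs; v++)
		last_use[v] = -1;
	/* walk backwards; the first read we see of a register is its
	 * last use in program order */
	for (size_t i = n; i-- > 0;) {
		if (code[i].src0 >= 0 && last_use[code[i].src0] < 0)
			last_use[code[i].src0] = (ptrdiff_t)i;
		if (code[i].src1 >= 0 && last_use[code[i].src1] < 0)
			last_use[code[i].src1] = (ptrdiff_t)i;
	}
}
```

A forward allocation pass could then release a physical register as soon as the instruction index passes an operand's `last_use`, rather than pinning every variable to one location forever, while staying linear in the number of instructions.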