# Prospero challenge with `ejit`

This is my implementation of the [prospero](https://www.mattkeeter.com/projects/prospero/) challenge using [`ejit`](https://metanimi.dy.fi/cgit/ejit/). Note that as of 2025-04-05, I've implemented some optimizations in `ejit` that weren't part of the original submission. Results from the original submission can be found in `README.old.md`.

As a quick introduction, here's an excerpt from the `ejit` README:

> `ejit` is a generic JIT/bytecode interpreter with a strong focus on compilation
> speed and portability. It's built on top of the excellent
> [`lightening`](https://gitlab.com/wingo/lightening) pure JIT library, but adds a
> bytecode layer for systems that `lightening` doesn't support. The bytecode
> layer also grants some QoL improvements, such as an (effectively) unlimited
> number of virtual registers, a register allocator for mapping these virtual
> registers down to physical registers, and a slightly easier to use API (at
> least IMO).

`ejit` is primarily intended to be used as the main execution engine for experimental/toy scripting languages, so the **prospero** challenge is arguably a bit outside its intended use case, as it was not really designed with relatively heavy computation in mind. Still, this was a good opportunity to see just how well (or poorly) the library handles such tasks.

## Running

```
git submodule update --init --recursive
make
```

The produced binary has a few command line parameters:

```
usage: prospero [-j threads] [-r resolution] [-c]
	-j	how many threads to run
	-r	length of one side
	-c	enable JIT
```

By default, the binary runs a single worker thread, produces an image of size 1024x1024, and does not enable the JIT part of `ejit`.
## Results

Here are some runtime results with the JIT compiler enabled:

| Params    | Runtime for JIT (s) |
|-----------|---------------------|
| `-j1 -c`  | 3.149               |
| `-j2 -c`  | 1.578               |
| `-j6 -c`  | 0.538               |
| `-j12 -c` | 0.413               |

I was also curious to see how the bytecode interpreter fared, since the main focus of `ejit` is for the JIT to be efficient, with the bytecode mainly serving as a backup for more obscure hardware. Here are some runtime results with the JIT compiler disabled:

| Params | Runtime for bytecode (s) |
|--------|--------------------------|
| `-j1`  | 28.148                   |
| `-j2`  | 14.120                   |
| `-j6`  | 4.971                    |
| `-j12` | 3.429                    |

For reference, I ran [Aviva Ruben](https://whtwnd.com/aviva.gay/3ll5dbskg3v26)'s Cranelift implementation, which finished in `0.241` seconds when using all threads (I believe `into_par_iter()` uses threads, anyway...). Very impressive! Cranelift is particularly cool in that it exposes SIMD instructions, which `ejit` does not have.

To roughly compare the compilation speed of `ejit` (with JIT enabled) against Cranelift, setting the resolution of the generated image to 1 gives the following table:

| Framework | Total runtime (s) |
|-----------|-------------------|
| Cranelift | 0.088             |
| `ejit`    | 0.004             |

Total runtime is not quite the same as compilation time, but compilation should be the dominating factor at this resolution. There's a fairly obvious tradeoff here: Cranelift can produce much faster code (especially if you can take advantage of SIMD operations), whereas `ejit` can produce machine code much faster. This is what I was expecting, but it's nice to see some (semi)hard numbers.

To cheat a bit, I also compiled a version of the Cranelift implementation that didn't use SIMD (commit 55090c86e2d1a25b86a8e1f8d0e15a10b9c4f5b7) or parallelism (`into_par_iter()` seemed to have a bit more overhead when dividing up work than my implementation); that version finished in 4.079 seconds.
Compared to my `-j1 -c` case of 3.149 seconds, I'm fairly pleased with the results. I'm hoping that this is an apples-to-apples comparison, and that the optimizations I made to `ejit` and `lightening` since first submitting this implementation are useful for other tasks as well, rather than just me trying to one-up Cranelift.

## Optimizations done since initial submission

Probably most significantly, I implemented the `sqrt`, `fmax`, and `fmin` operations. Previously they were implemented either as branches or, in the case of `sqrt`, an actual function call; now they're emitted inline as machine code.

I also improved the register allocator somewhat. Previously, each variable was mapped to a fixed location, either a register or a stack slot, and that location was 'locked' for the lifetime of the procedure being generated. I implemented a basic lifetime check, so variables whose lifetimes don't overlap can now share the same slot. I also sort the slots according to a calculated spill cost, so a slot with a higher total spill cost is more likely to be placed in a register.

The code given for this challenge is in a kind-of-SSA form, which the register allocator is fairly well suited for, but I'm hoping that it also works well for more stack-like access patterns, like in [posthaste](https://metanimi.dy.fi/git/posthaste) (because stack-like is easier to implement in a compiler than SSA-like). More testing would be useful, though.

## Limitations of `ejit` for this particular challenge

Not having SIMD instructions hurts performance significantly. I've been thinking about how SIMD could be implemented, but nothing concrete yet. There's a lot more variability in SIMD implementations than in base ISAs, and it seems rather difficult to be generic over any large-ish portion of them.
[`sljit`](https://github.com/zherczeg/sljit) takes the approach that the user can place custom instructions into the instruction stream and implement SIMD for specific platforms, which is cool, but is kind of at odds with the 'portability' aspect of `ejit` that I would like to maintain as far as possible.