diff options
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 130 |
1 files changed, 130 insertions, 0 deletions
diff --git a/README.md b/README.md new file mode 100644 index 0000000..081ccaf --- /dev/null +++ b/README.md @@ -0,0 +1,130 @@ +# Prospero challenge with `ejit` + +This is my implementation of the +[prospero](https://www.mattkeeter.com/projects/prospero/) challenge using +[`ejit`](https://metanimi.dy.fi/cgit/ejit/). + +As a quick introduction, here's an excerpt from the `ejit` README: + +> `ejit` is a generic JIT/bytecode interpreter with a +> strong focus on compilation speed and portability. It's built on top of the +> excellent [`lightening`](https://gitlab.com/wingo/lightening) pure JIT library, +> but adds a bytecode layer for systems that `lightening` doesn't support. The +> bytecode layer also grants some QoL improvements, such as an (effectively) +> unlimited number of virtual registers and a register allocator for mapping +> theses virtual registers down to physical registers and a slightly easier to use +> API (at least IMO). + +`ejit` is primarily inteded to be used as the main execution engine for +experimental/toy scripting languages, so the **prospero** challenge is arguably +a bit outside the intended use case of relatively heavy computation. Still, +this was a good opportunity to see just how well/poorly the library can handle +such tasks. + +## Running + +``` +git submodule update --init --recursive +make +``` + +The produced binary has a few command line parameters: + +> usage: prospero [-j threads] [-r resolution] [-c] +> -j how many threads to run +> -r length of one side +> -c enable JIT + +By default, the binary runs a single worker thread, produces an image of size +1024x1024 and does not enable the JIT part of `ejit`. + +## Results + +Here are some runtime results for when the JIT compiler is enabled: + +| Params | Runtime for JIT (s) | +|---------|---------------------| +| -j1 -c | 8.943 | +| -j2 -c | 4.357 | +| -j6 -c | 1.678 | +| -j12 -c | 1.617 | + +I have an AMD Ryzen 5 5600X with 6 cores and 12 threads. I was kind of surprised +that `-j12` wasn't any faster than `-j6`, but if I had to guess as to why, +presumably there are only so many physical floating point units (FPUs) and when +more than 6 threads start running, they have to start sharing FPUs and threads +start stalling each other. Maybe, I haven't looked into the pipeline of my +processor to know for sure. + +I was also curious to see how the bytecode interpreter fared, since the main +focus of `ejit` is for the JIT to be efficient, and the bytecode is mainly as a +backup for more obscure hardware. Here's are some runtime results for when the +JIT compiler is disabled: + +| Params | Runtime for bytecode (s) | +|--------|--------------------------| +| -j1 | 35.311 | +| -j2 | 18.514 | +| -j6 | 6.155 | +| -j12 | 3.760 | + +In this case, I'm guessing that there's enough overhead during interpretation +that adding threads beyond the number of physical cores still improves +performance. Interpretation is something mainly done with integer ALUs, which I +guess my CPU has more of than FPUs? + +Still, very interesting results. For reference, I ran [Aviva +Ruben](https://whtwnd.com/aviva.gay/3ll5dbskg3v26)'s Cranelift implementation +which finished in `0.241` seconds when using all threads (I believe +`into_par_iter()` uses threads, anyway...). Very impressive! Cranelift is +particularly cool in that it exposes SIMD instructions, which `ejit` does not +have. + +To roughly compare the compilation speed of `ejit` with JIT enabled vs. +Cranelift, setting the resolution of the generated image to 1 gives the +following table: + +| Framework | Total runtime (s) | +|-----------|-------------------| +| Cranelift | 0.088 | +| `ejit` | 0.004 | + +This runtime is not quite the same as the compilation time, but the compilation +should be the dominating factor in this case. There's a fairly obvious tradeoff +here, Cranelift can produce much faster code (especially if you can take advantage +of SIMD operations), whereas `ejit` can produce machine code much faster. This +is what I was expecting, but it's nice to see some (semi)hard numbers. + +## Limitations of `ejit` for this particular challenge + +As previously stated, not having SIMD instructions hurts performance +significantly. I've been thinking about how SIMD could be implemented, but +nothing concrete yet. There's a lot more variability in SIMD implementations than +base ISAs, and it seems to be rather difficult to be generic over any large-ish +portion of them. [`sljit`](https://github.com/zherczeg/sljit) takes the approach +that the user can place custom instructions into the instruction stream and +implement SIMD for specific platforms, which is cool, but is kind of at +odds with the 'portability' aspect of `ejit` that I would like to maintain as +far as possible. + +`ejit` also has at least so far had less of a focus on floating point +instructions, and `max`/`min` were implemented with branches because `ejit` +simply doesn't have any built-in equivalent instructions. `sqrt` I had implement +as a function call to `libm`, which seems kind of silly if we're going for top +performance. Square roots seem to be fairly common in `prospero.vm` as well, +probably compounding the problem. Then again, these operations are rather +mathematical and I'm not sure how useful they are for 'typical' scripting +languages, which, again, is more what `ejit` is aimed towards. I'm not in a huge +rush to start implementing these operations as built-ins quite yet, but +definitely something to think about. + +I also noticed that in `prospero.vm` a lot of variables are only used once. The +current register allocator in `ejit` assigns a variable to a register 'forever', +meaning that the vast majority of all variables in `prospero.vm` get placed on +the stack. This is unfortunate, but would require something like [Static Single +Assignment (SSA)](https://en.wikipedia.org/wiki/Static_single-assignment_form) +to properly tackle, which I suspect would increase compilation times +dramatically. It might be possible to get by with some kind of simple lifetime +tracking scheme which would save on compilation time, but I'm not convinced the +register allocations would be significantly better than the current scheme. +Something to keep in mind for the future, though. |