aboutsummaryrefslogtreecommitdiff
path: root/README.md
diff options
context:
space:
mode:
authorKimplul <kimi.h.kuparinen@gmail.com>2025-03-26 18:10:41 +0200
committerKimplul <kimi.h.kuparinen@gmail.com>2025-03-26 18:10:41 +0200
commit6d88dce5ebbc6fdb91054e5ffc3f8843d20b2b97 (patch)
tree1132b7dde6612df38eb83f1c9f9b18eb3af5a8e8 /README.md
downloadejit-prospero-6d88dce5ebbc6fdb91054e5ffc3f8843d20b2b97.tar.gz
ejit-prospero-6d88dce5ebbc6fdb91054e5ffc3f8843d20b2b97.zip
initial commit
Diffstat (limited to 'README.md')
-rw-r--r--README.md130
1 files changed, 130 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..081ccaf
--- /dev/null
+++ b/README.md
@@ -0,0 +1,130 @@
+# Prospero challenge with `ejit`
+
+This is my implementation of the
+[prospero](https://www.mattkeeter.com/projects/prospero/) challenge using
+[`ejit`](https://metanimi.dy.fi/cgit/ejit/).
+
+As a quick introduction, here's an excerpt from the `ejit` README:
+
+> `ejit` is a generic JIT/bytecode interpreter with a
+> strong focus on compilation speed and portability. It's built on top of the
+> excellent [`lightening`](https://gitlab.com/wingo/lightening) pure JIT library,
+> but adds a bytecode layer for systems that `lightening` doesn't support. The
+> bytecode layer also grants some QoL improvements, such as an (effectively)
+> unlimited number of virtual registers and a register allocator for mapping
+> theses virtual registers down to physical registers and a slightly easier to use
+> API (at least IMO).
+
+`ejit` is primarily inteded to be used as the main execution engine for
+experimental/toy scripting languages, so the **prospero** challenge is arguably
+a bit outside the intended use case of relatively heavy computation. Still,
+this was a good opportunity to see just how well/poorly the library can handle
+such tasks.
+
+## Running
+
+```
+git submodule update --init --recursive
+make
+```
+
+The produced binary has a few command line parameters:
+
+> usage: prospero [-j threads] [-r resolution] [-c]
+> -j how many threads to run
+> -r length of one side
+> -c enable JIT
+
+By default, the binary runs a single worker thread, produces an image of size
+1024x1024 and does not enable the JIT part of `ejit`.
+
+## Results
+
+Here are some runtime results for when the JIT compiler is enabled:
+
+| Params | Runtime for JIT (s) |
+|---------|---------------------|
+| -j1 -c | 8.943 |
+| -j2 -c | 4.357 |
+| -j6 -c | 1.678 |
+| -j12 -c | 1.617 |
+
+I have an AMD Ryzen 5 5600X with 6 cores and 12 threads. I was kind of surprised
+that `-j12` wasn't any faster than `-j6`, but if I had to guess as to why,
+presumably there are only so many physical floating point units (FPUs) and when
+more than 6 threads start running, they have to start sharing FPUs and threads
+start stalling each other. Maybe, I haven't looked into the pipeline of my
+processor to know for sure.
+
+I was also curious to see how the bytecode interpreter fared, since the main
+focus of `ejit` is for the JIT to be efficient, and the bytecode is mainly as a
+backup for more obscure hardware. Here's are some runtime results for when the
+JIT compiler is disabled:
+
+| Params | Runtime for bytecode (s) |
+|--------|--------------------------|
+| -j1 | 35.311 |
+| -j2 | 18.514 |
+| -j6 | 6.155 |
+| -j12 | 3.760 |
+
+In this case, I'm guessing that there's enough overhead during interpretation
+that adding threads beyond the number of physical cores still improves
+performance. Interpretation is something mainly done with integer ALUs, which I
+guess my CPU has more of than FPUs?
+
+Still, very interesting results. For reference, I ran [Aviva
+Ruben](https://whtwnd.com/aviva.gay/3ll5dbskg3v26)'s Cranelift implementation
+which finished in `0.241` seconds when using all threads (I believe
+`into_par_iter()` uses threads, anyway...). Very impressive! Cranelift is
+particularly cool in that it exposes SIMD instructions, which `ejit` does not
+have.
+
+To roughly compare the compilation speed of `ejit` with JIT enabled vs.
+Cranelift, setting the resolution of the generated image to 1 gives the
+following table:
+
+| Framework | Total runtime (s) |
+|-----------|-------------------|
+| Cranelift | 0.088 |
+| `ejit` | 0.004 |
+
+This runtime is not quite the same as the compilation time, but the compilation
+should be the dominating factor in this case. There's a fairly obvious tradeoff
+here, Cranelift can produce much faster code (especially if you can take advantage
+of SIMD operations), whereas `ejit` can produce machine code much faster. This
+is what I was expecting, but it's nice to see some (semi)hard numbers.
+
+## Limitations of `ejit` for this particular challenge
+
+As previously stated, not having SIMD instructions hurts performance
+significantly. I've been thinking about how SIMD could be implemented, but
+nothing concrete yet. There's a lot more variability in SIMD implementations than
+base ISAs, and it seems to be rather difficult to be generic over any large-ish
+portion of them. [`sljit`](https://github.com/zherczeg/sljit) takes the approach
+that the user can place custom instructions into the instruction stream and
+implement SIMD for specific platforms, which is cool, but is kind of at
+odds with the 'portability' aspect of `ejit` that I would like to maintain as
+far as possible.
+
+`ejit` also has at least so far had less of a focus on floating point
+instructions, and `max`/`min` were implemented with branches because `ejit`
+simply doesn't have any built-in equivalent instructions. `sqrt` I had implement
+as a function call to `libm`, which seems kind of silly if we're going for top
+performance. Square roots seem to be fairly common in `prospero.vm` as well,
+probably compounding the problem. Then again, these operations are rather
+mathematical and I'm not sure how useful they are for 'typical' scripting
+languages, which, again, is more what `ejit` is aimed towards. I'm not in a huge
+rush to start implementing these operations as built-ins quite yet, but
+definitely something to think about.
+
+I also noticed that in `prospero.vm` a lot of variables are only used once. The
+current register allocator in `ejit` assigns a variable to a register 'forever',
+meaning that the vast majority of all variables in `prospero.vm` get placed on
+the stack. This is unfortunate, but would require something like [Static Single
+Assignment (SSA)](https://en.wikipedia.org/wiki/Static_single-assignment_form)
+to properly tackle, which I suspect would increase compilation times
+dramatically. It might be possible to get by with some kind of simple lifetime
+tracking scheme which would save on compilation time, but I'm not convinced the
+register allocations would be significantly better than the current scheme.
+Something to keep in mind for the future, though.