| author | Kimplul <kimi.h.kuparinen@gmail.com> | 2025-04-05 18:59:00 +0300 |
|---|---|---|
| committer | Kimplul <kimi.h.kuparinen@gmail.com> | 2025-04-05 18:59:00 +0300 |
| commit | 2a4568e97a4bc71a4af428ef30e835a6cf48fe0c (patch) | |
| tree | 094335709aead5c58692e7458a054a1a2df9d9ca /README.md | |
| parent | 63786e449651a748c860e9df0b160deaefa07ec5 (diff) | |
| download | ejit-prospero-master.tar.gz, ejit-prospero-master.zip | |
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 94 |
1 file changed, 46 insertions(+), 48 deletions(-)
@@ -4,6 +4,10 @@ This is my implementation of the
 [prospero](https://www.mattkeeter.com/projects/prospero/) challenge using
 [`ejit`](https://metanimi.dy.fi/cgit/ejit/).
 
+Note that as of 2025-04-05, I've implemented some optimizations in `ejit` that
+weren't part of the original submission. Results from the original submission
+can be found in `README.old.md`.
+
 As a quick introduction, here's an excerpt from the `ejit` README:
 
 > `ejit` is a generic JIT/bytecode interpreter with a
@@ -46,17 +50,10 @@ Here are some runtime results for when the JIT compiler is enabled:
 
 | Params | Runtime for JIT (s) |
 |---------|---------------------|
-|`-j1 -c`| 8.943 |
-|`-j2 -c`| 4.357 |
-|`-j6 -c`| 1.678 |
-|`-j12 -c`| 1.617 |
-
-I have an AMD Ryzen 5 5600X with 6 cores and 12 threads. I was kind of surprised
-that `-j12` wasn't any faster than `-j6`, but if I had to guess as to why,
-presumably there are only so many physical floating point units (FPUs) and when
-more than 6 threads start running, they have to start sharing FPUs and threads
-start stalling each other. Maybe, I haven't looked into the pipeline of my
-processor to know for sure.
+|`-j1 -c`| 3.149 |
+|`-j2 -c`| 1.578 |
+|`-j6 -c`| 0.538 |
+|`-j12 -c`| 0.413 |
 
 I was also curious to see how the bytecode interpreter fared, since the main
 focus of `ejit` is for the JIT to be efficient, and the bytecode is mainly as a
@@ -65,20 +62,15 @@ JIT compiler is disabled:
 
 | Params | Runtime for bytecode (s) |
 |--------|--------------------------|
-|`-j1` | 35.311 |
-|`-j2` | 18.514 |
-|`-j6` | 6.155 |
-|`-j12` | 3.760 |
-
-In this case, I'm guessing that there's enough overhead during interpretation
-that adding threads beyond the number of physical cores still improves
-performance. Interpretation is something mainly done with integer ALUs, which I
-guess my CPU has more of than FPUs?
-
-Still, very interesting results.
-For reference, I ran [Aviva
-Ruben](https://whtwnd.com/aviva.gay/3ll5dbskg3v26)'s Cranelift implementation
-which finished in `0.241` seconds when using all threads (I believe
-`into_par_iter()` uses threads, anyway...). Very impressive! Cranelift is
+|`-j1` | 28.148 |
+|`-j2` | 14.120 |
+|`-j6` | 4.971 |
+|`-j12` | 3.429 |
+
+For reference, I ran [Aviva Ruben](https://whtwnd.com/aviva.gay/3ll5dbskg3v26)'s
+Cranelift implementation which finished in `0.241` seconds when using all threads
+(I believe `into_par_iter()` uses threads, anyway...). Very impressive! Cranelift is
 particularly cool in that it exposes SIMD instructions, which `ejit` does not
 have.
 
@@ -97,9 +89,37 @@ here, Cranelift can produce much faster code (especially if you can take advanta
 of SIMD operations), whereas `ejit` can produce machine code much faster. This
 is what I was expecting, but it's nice to see some (semi)hard numbers.
 
+To cheat a bit, I also tried compiling a version of the Cranelift implementation
+that didn't use SIMD (commit 55090c86e2d1a25b86a8e1f8d0e15a10b9c4f5b7) and
+didn't use parallelism (`into_par_iter()` seemed to have a bit more overhead
+when dividing up work than my implementation), and the Cranelift implementation
+finished in 4.079 seconds. Compared to my `-j1 -c` case of 3.149 seconds, I'm
+fairly pleased with the results. I'm hoping that I'm doing an apples-to-apples
+comparison here, and that the optimizations I made to `ejit` and `lightening`
+since first submitting this implementation are useful for other tasks as well,
+and aren't just me trying to one-up Cranelift.
+
+## Optimizations done since initial submission
+
+Probably most significantly, I implemented the `sqrt`, `fmax` and `fmin`
+operations. Previously they were either branches or, in the case of `sqrt`,
+an actual function call; now they're done inline in machine code.
+
+I also improved the register allocator somewhat.
+Previously, each variable was mapped to a fixed location, either a register or
+a stack slot, and 'locked' that location for the lifetime of the procedure
+being generated. I implemented a basic lifetime check, and now variables whose
+lifetimes don't overlap temporally can be placed in the same slot. I also sort
+the slots according to a calculated spill cost, so a slot with a higher total
+spill cost is more likely to be placed into a register. The code given for this
+challenge is in a kind-of-SSA form, which the register allocator is fairly well
+suited for, but I'm hoping that it also works well for more stack-like access
+patterns, like in [posthaste](https://metanimi.dy.fi/git/posthaste) (because
+stack-like is easier to implement in a compiler than SSA-like). More testing
+would be useful, though.
+
 ## Limitations of `ejit` for this particular challenge
 
-As previously stated, not having SIMD instructions hurts performance
+Not having SIMD instructions hurts performance
 significantly. I've been thinking about how SIMD could be implemented, but
 nothing concrete yet. There's a lot more variability in SIMD implementations
 than base ISAs, and it seems to be rather difficult to be generic over any large-ish
@@ -108,25 +128,3 @@ that the user can place custom instructions into the instruction stream and
 implement SIMD for specific platforms, which is cool, but is kind of at odds
 with the 'portability' aspect of `ejit` that I would like to maintain as far as
 possible.
-
-`ejit` has also, at least so far, had less of a focus on floating point
-instructions, and `max`/`min` were implemented with branches because `ejit`
-simply doesn't have any built-in equivalent instructions. `sqrt` I had to
-implement as a function call to `libm`, which seems kind of silly if we're
-going for top performance. Square roots seem to be fairly common in
-`prospero.vm` as well, probably compounding the problem.
-Then again, these operations are rather
-mathematical and I'm not sure how useful they are for 'typical' scripting
-languages, which, again, is more what `ejit` is aimed towards. I'm not in a huge
-rush to start implementing these operations as built-ins quite yet, but
-definitely something to think about.
-
-I also noticed that in `prospero.vm` a lot of variables are only used once. The
-current register allocator in `ejit` assigns a variable to a register 'forever',
-meaning that the vast majority of all variables in `prospero.vm` get placed on
-the stack. This is unfortunate, but would require something like [Static Single
-Assignment (SSA)](https://en.wikipedia.org/wiki/Static_single-assignment_form)
-to properly tackle, which I suspect would increase compilation times
-dramatically. It might be possible to get by with some kind of simple lifetime
-tracking scheme which would save on compilation time, but I'm not convinced the
-register allocations would be significantly better than the current scheme.
-Something to keep in mind for the future, though.