aboutsummaryrefslogtreecommitdiff
path: root/README.md
blob: 081ccafb2f19c5abbd8c249bc47ae440847e72a0 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
# Prospero challenge with `ejit`

This is my implementation of the
[prospero](https://www.mattkeeter.com/projects/prospero/) challenge using
[`ejit`](https://metanimi.dy.fi/cgit/ejit/).

As a quick introduction, here's an excerpt from the `ejit` README:

> `ejit` is a generic JIT/bytecode interpreter with a
> strong focus on compilation speed and portability. It's built on top of the
> excellent [`lightening`](https://gitlab.com/wingo/lightening) pure JIT library,
> but adds a bytecode layer for systems that `lightening` doesn't support. The
> bytecode layer also grants some QoL improvements, such as an (effectively)
> unlimited number of virtual registers and a register allocator for mapping
> theses virtual registers down to physical registers and a slightly easier to use
> API (at least IMO).

`ejit` is primarily inteded to be used as the main execution engine for
experimental/toy scripting languages, so the **prospero** challenge is arguably
a bit outside the intended use case of relatively heavy computation. Still,
this was a good opportunity to see just how well/poorly the library can handle
such tasks.

## Running

```
git submodule update --init --recursive
make
```

The produced binary has a few command line parameters:

> usage: prospero [-j threads] [-r resolution] [-c]
>  -j how many threads to run
>  -r length of one side
>  -c enable JIT

By default, the binary runs a single worker thread, produces an image of size
1024x1024 and does not enable the JIT part of `ejit`.

## Results

Here are some runtime results for when the JIT compiler is enabled:

| Params  | Runtime for JIT (s) |
|---------|---------------------|
| -j1  -c | 8.943               |
| -j2  -c | 4.357               |
| -j6  -c | 1.678               |
| -j12 -c | 1.617               |

I have an AMD Ryzen 5 5600X with 6 cores and 12 threads. I was kind of surprised
that `-j12` wasn't any faster than `-j6`, but if I had to guess as to why,
presumably there are only so many physical floating point units (FPUs) and when
more than 6 threads start running, they have to start sharing FPUs and threads
start stalling each other. Maybe, I haven't looked into the pipeline of my
processor to know for sure.

I was also curious to see how the bytecode interpreter fared, since the main
focus of `ejit` is for the JIT to be efficient, and the bytecode is mainly as a
backup for more obscure hardware. Here's are some runtime results for when the
JIT compiler is disabled:

| Params | Runtime for bytecode (s) |
|--------|--------------------------|
| -j1    | 35.311                   |
| -j2    | 18.514                   |
| -j6    | 6.155                    |
| -j12   | 3.760                    |

In this case, I'm guessing that there's enough overhead during interpretation
that adding threads beyond the number of physical cores still improves
performance. Interpretation is something mainly done with integer ALUs, which I
guess my CPU has more of than FPUs?

Still, very interesting results. For reference, I ran [Aviva
Ruben](https://whtwnd.com/aviva.gay/3ll5dbskg3v26)'s Cranelift implementation
which finished in `0.241` seconds when using all threads (I believe
`into_par_iter()` uses threads, anyway...). Very impressive! Cranelift is
particularly cool in that it exposes SIMD instructions, which `ejit` does not
have.

To roughly compare the compilation speed of `ejit` with JIT enabled vs.
Cranelift, setting the resolution of the generated image to 1 gives the
following table:

| Framework | Total runtime (s) |
|-----------|-------------------|
| Cranelift | 0.088             |
| `ejit`    | 0.004             |

This runtime is not quite the same as the compilation time, but the compilation
should be the dominating factor in this case. There's a fairly obvious tradeoff
here, Cranelift can produce much faster code (especially if you can take advantage
of SIMD operations), whereas `ejit` can produce machine code much faster. This
is what I was expecting, but it's nice to see some (semi)hard numbers.

## Limitations of `ejit` for this particular challenge

As previously stated, not having SIMD instructions hurts performance
significantly. I've been thinking about how SIMD could be implemented, but
nothing concrete yet. There's a lot more variability in SIMD implementations than
base ISAs, and it seems to be rather difficult to be generic over any large-ish
portion of them. [`sljit`](https://github.com/zherczeg/sljit) takes the approach
that the user can place custom instructions into the instruction stream and
implement SIMD for specific platforms, which is cool, but is kind of at
odds with the 'portability' aspect of `ejit` that I would like to maintain as
far as possible.

`ejit` also has at least so far had less of a focus on floating point
instructions, and `max`/`min` were implemented with branches because `ejit`
simply doesn't have any built-in equivalent instructions. `sqrt` I had implement
as a function call to `libm`, which seems kind of silly if we're going for top
performance. Square roots seem to be fairly common in `prospero.vm` as well,
probably compounding the problem. Then again, these operations are rather
mathematical and I'm not sure how useful they are for 'typical' scripting
languages, which, again, is more what `ejit` is aimed towards. I'm not in a huge
rush to start implementing these operations as built-ins quite yet, but
definitely something to think about.

I also noticed that in `prospero.vm` a lot of variables are only used once. The
current register allocator in `ejit` assigns a variable to a register 'forever',
meaning that the vast majority of all variables in `prospero.vm` get placed on
the stack. This is unfortunate, but would require something like [Static Single
Assignment (SSA)](https://en.wikipedia.org/wiki/Static_single-assignment_form)
to properly tackle, which I suspect would increase compilation times
dramatically. It might be possible to get by with some kind of simple lifetime
tracking scheme which would save on compilation time, but I'm not convinced the
register allocations would be significantly better than the current scheme.
Something to keep in mind for the future, though.