diff options
author | Kimplul <kimi.h.kuparinen@gmail.com> | 2024-04-27 00:09:48 +0300 |
---|---|---|
committer | Kimplul <kimi.h.kuparinen@gmail.com> | 2024-04-27 00:09:48 +0300 |
commit | 515ad657ec6ef685c9c91479540518a0091f1516 (patch) | |
tree | 68b5ae0e8d83bcc7bd1bd6b84648780729e20f9b | |
parent | 966f97ad7c92f559ae54d0214db90c370e3b48eb (diff) | |
download | posthaste-515ad657ec6ef685c9c91479540518a0091f1516.tar.gz posthaste-515ad657ec6ef685c9c91479540518a0091f1516.zip |
documentation
-rw-r--r-- | README.md | 284 | ||||
-rw-r--r-- | src/check.c | 20 | ||||
-rw-r--r-- | src/lower.c | 3 |
3 files changed, 300 insertions, 7 deletions
diff --git a/README.md b/README.md new file mode 100644 index 0000000..86d3a66 --- /dev/null +++ b/README.md @@ -0,0 +1,284 @@ +# PostHaste in C with a JIT + +This is an implementation of the PostHaste language in C with a two-stage +compilation pipeline, AST -> bytecode -> machine code. + +## Building + +At its simplest: + +``` +make +``` + +This will produce a binary called `posthaste`, which takes on input parameter, +that being the file to execute. You can alternatively give the `Makefile` some +arguments: + ++ `RELEASE=<0/1>` toggles optimization flags for the `posthaste` binary between + `-O0` and `-O2 -flto=auto`. Optimization flags can be tweaked in + `scripts/makefile`. + By default, `RELEASE=0`. + ++ `DEBUG=<0/1>` disables/enables debugging output, such as dumping a textual + representation of the AST and bytecode. + By default, `DEBUG=1`. + ++ `ASSERT=<0/1>` disables/enables assertions. Can be used to eek out that little + extra bit of performance. + By default, `ASSERT=1`. + ++ `JIT=<0/1>` disables/enables JIT compilation. + By default, `JIT=1`. + +For example, `make RELEASE=1 DEBUG=0 ASSERT=0 JIT=1` should produce a true speed +demon. + +## Usage + +``` +./posthaste <input file> +``` + +## Implementation details + +I've tried to sprinkle comments here and there where it seemed to make sense, +but more significantly I've tried to make the source code as readable as +possible. I guess we'll see how well I succeeded. + +A lot of the information in this document can also be found in various +comments throughout the codebase, and as always, the code itself doesn't lie. + +### Scoping + +Scopes consist of a vector of visible symbol definitions and a pointer to a +possible parent scope. Each AST node knows which scope it belongs to (at least +once semantic checking is done), so reaching a visible symbol from any other +symbol involves looking through the visible symbols in the same scope. If a +symbol matching the query string is found, great, return the AST node that +defined that symbol. Otherwise, repeat the process for the parent scope. + +I just used a simple vector of string -> AST node pairs, ideally I would +probably implement some kind of hashmap, but for small-ish programs I don't +think there's a noticeable difference in symbol lookup performance. + +Each scope just has a single vector of visible symbols, i.e. it wouldn't be possible +for a variable and a procedure or function to share the same name, but that's +not an issue as the lexer forcibly separates variable names from procedure and +function names. + +This gives me a fairly nice way to handle local variables and formals, as we can +create a new scope for a procedure/function and set its parent as the file +scope, and now we can insert symbols into this child scope and 'shadow' global +symbols just by returning early from the symbol lookup. Only name clashes +in the scope we're trying to add a symbol to causes a termination. +This also works recursively for `until` and `unless` blocks. + +The data structures that handle scoping are implemented in +`include/posthaste/scope.h` and `src/posthaste/scope.c`. + +### Semantic checking + +Semantic checking is done in two phases. The first phase just inserts all +function/procedure definitions into the top level scope so they can call +each other, regardless of the order in which they're defined. +The second phase pretty much does everything else. + +Type checking is fairly easy in PostHaste, as there are really only a handful of +primitive types, so we can use a simple enum to represent them all. Each AST node +gets a member that holds the type of the node and the semantic checker places +the correct type into this member during a depth-first traversal of each AST +node. + +The chosen scope structure is also helpul here. As we're traversing statement +lists, we add definitions to the scope we're currently in as we reach them. This +way, we can query if some specific symbol has already been defined, either in +the current scope or some parent scope. If no symbol matches the query, we can +report an error. + +I do static type checking, so all AST nodes in the program get some type +assigned to them. Some nodes don't strictly have concrete types, like `TYPE_INT` +or `TYPE_DATE`, so I have an internal `TYPE_VOID`. For example, a procedure call +without a return type gets internally converted to a `TYPE_VOID`. Since types +are just enum values, checking if two types are compatible is just doing an +equality check. `TYPE_VOID` is a bit tricky, as each node has to be aware of its +existence, and for example variable definitions must explicitly check for it +instead of blindly deducing the variable's type based on the expression. + +Some notable checks: + ++ Procedure arguments vs formal parameters. This is fairly easy, as call arguments are + stored in a list, as are formal parameters, so after doing type checking for + both, we can first check if the lists are the same length and that each AST + node at the same index in both lists has the same type. Since formal + parameters must have concrete types, the issue mentioned with `TYPE_VOID` is + sidestepped. + ++ Return type checking, specifically the `auto` type. Essentially, when the checker + enters a procedure or function definition AST node, it first and foremost sets + the return type. If the return type name is `auto`, the type is set to `TYPE_AUTO`. + The parser also inserts an implicit `AST_RETURN` node as the body of the + function, so that any time we run into a return node, we can check if the type + of the return expression matches the parent function/procedure. If the type is + `TYPE_AUTO`, the return node sets the return type to be the expression type. + ++ Boolean types. Comparisons return a `TYPE_BOOL`, which can be used by until + and unless and printed with a corresponding "true" or "false" string. + `TYPE_INT` is considered truthy, and until/unless accepts either an int or a + bool. + ++ Non-void procedure call as statement. Effectively, wherever there's a + statement list, I have a `analyze_statement_list()` function, which also + checks that none of the statements are a non-void procedure call. This was a + late addition, as I only now realized that such a requirement existed, and + feels a little bit hacky but it seems to work. + +Semantic checking is implemented in `include/posthaste/check.c` and +`src/check.c`. + +### Lowering + +I refer to bytecode generation from AST as lowering. I went with a scheme +similar to what Lua uses, i.e. the bytecode assumes it has access to two arrays, +one for global values and one for local ones. The global value array, as the +name suggests, contains the global values in the program. The local value array +is referred to as a stack, as both local variables and temporary values exists +in it. As a slight optimization, function calls just move an internal stack +pointer register up by some amount, so the same stack is shared between all +functions. + +It's worth noting that at this point, there's not really any relevant +differences between procedures and functions, so from here on I'm using them +interchangeably unless specifically referring to the PostHaste +function/procedure differentiation. + +The bytecode is of a fairly common three-value type, that is each instruction +consists of an opcode, one output location and two input locations. Not all +locations must be used in each instruction. Each location effectively encodes +whether it points to the global array or the stack, and an index into the array. +I also added a separate `CONST` instruction that pushes a constant value to an +output location, so for example `2 + 3` would look something like this: + +``` +CONST l1, 2 // write 2 to stack location 1 +CONST l2, 3 // write 3 to stack location 2 +ADD g0, l1, l2 // load stack locations 1 and 2 and add them together, + // write the result to the global location 0 +``` + +The lowering effectively just has to select where to place values into the stack +and/or global array, which is surprisingly easy to do. Since we don't allow +variable definitions as expressions, each variable definition just pushes a +virtual stack pointer up by one slot, effectively reserving that slot for +itself. Expressions place their value into the current virtual stack pointer +location, and in the case of binary operations like `ADD`, `SUB`, etc, they +increment the stack pointer temporarily while lowering the left/righthand side, +and then set the stack pointer back to where it was. + +Each stack location is 64 bits wide. For ease of implementation, `TYPE_DATE` and +`TYPE_INT` are both encoded as 64 bit integers, as are pointers to strings and +bools. The 64 bit values are just interpreted differently based on the +instructions, which effectively means we must have a working static type +checker. + +Argument passing is done by emitting `ARG` instructions, which push the location +they're given to a 'virtual' argument stack. Getting the return value +correspondingly has a `RETVAL` instruction, which moves the returned value from +a 'virtual' return register. I say 'virtual' because the JIT compiler gets rid +of these constructs, but they are very real in the bytecode interpreter. In +general, I tried to opt for making the instruction set easier to compiler later, +rather than make the interpretation fast. + +Lowering is implemented in `include/posthaste/lower.h` and `src/lower.c`. + +### Interpreting + +The bytecode interpreter is fairly straightforward. It has a number of internal +registers, `pc`, `sp`, and the previously mentioned virtual argument stack. Each +procedure call in PostHaste executes the interpreter recursively, copying +arguments from the stack to where the formals are defined to be. Formal parameters +must be the first consequtive stack locations in the order they are defined in +PostHaste, which makes argument handling fairly easy to implement. + +The interpreter uses computed gotos to try and reduce execution overhead. By +using labels and gotos, we can avoid using a while loop with a switch statement +within it, potentially saving a couple jumps between bytecode instructions. +The bytecode instruction set itself is not designed to be particularly fast, +as previously stated. Another example of this is that locations have to be +decoded at runtime. For example, when an instruction requests a value from some +location, the location has to be parsed to check whether we're dealing with +a location in the stack or in the global array, incurring an extra conditional +for each location. + +Recursion is handled just as any other function call. + +The interpreter is implemented in `include/posthaste/interpret.h` and +`src/interpret.c`. + +### Compiling + +The compiler is fairly straightforward. I chose to use Lightening, a library +I've made some contributions to (ported it to MIPS and PowerPC, among other +things), but the maintainer has been awol for a little while so I'm using my +fork of it. + +Lightening is more or less a very lightweight portable assembler, that gives the +user a number of virtual instructions and registers that are more or less 1:1 +mapped to machine instructions. This means that a user would have to implement +code generation and register allocation by hand, but this project is +simple enough that we can get all of that for basically nothing. + +Each bytecode instruction is compiled 'as if' it was being +interpreted, so loading data from either the stack or global array, doing some +operation to that data, and writing something back. This means each instruction +is kind of a self-contained thing, and there are no dependencies between +registers. This means that as long as we can implement each instruction with the +registers we have, we're golden. And we practically only need three registers, +as the three-value instructions can be implemented using two hardware registers, and +keeping the global array pointer in a third register. The stack is just, well, +the stack, so we use the stack pointer to access the stack. The stack pointer is +automatically handled by Lightening, making our lives easier. + +I mentioned in my email that I JIT-compiled all functions regardless of if they +were called or not, but that's not true anymore. I added a simple scheme that +only run the JIT compiler on functions that get called, turned out to be easy +enough as that information is 'implicitly' encoded in the bytecode. Just wait +until you run into a `CALL` instruction, check if the corresponding `fn` has +already been compiled. If not, compile, otherwise, continue compilation of your +current function. + +The JIT compilation is implemented in `include/posthaste/compile.h` and +`src/compile.c`. + +Machine code execution is trivial, just call the functions as if they were any +other function and watch the magic happen. There's unfortunately not really any +good ways to view the generated machine code, as that would require some +external disassembler library, which I didn't want to add a dependency to. +Recursion is handled just as any other function call. Machine code execution is +implemented in `include/posthaste/execute.h` and `src/execute.c`. + +# Thoughts on assignment + +Obviously I went a bit overkill so not sure how representative my opinions are +but I really enjoyed working on this. The language itself is simple enough that +no individual phase felt like a slog, but is enough like a real language that it +really was quite satisfying seeing things work. + +Some semantic things were a little bit weird to implement, for example the +requirement that calls to non-void functions/procedures shouldn't be allowed as +statements. That one I had to explicitly check for in code, it would've been +easier to just ignore it as it didn't really 'naturally' fit into the structure +of the checker. The programs themselves would still work. + +The semantic checker was probably the most difficult bit to be honest, I'm still +not entirely sure I'm not overlooking some corner case somewhere but the +bytecode lowering and JIT compilation sort of just fell into place. I have +played around with some bytecode-adjacent things, and I think I made the right +choice with regard to the design complexity. + +The `get_a0()` etc. thing seemed to work quite +nicely, I might try using it in other projects. It's not anything I came up +with, I believe GCC does something similar, but for example `ast_visit()` turned +out to be way easier to implement with just a fixed `struct ast`. If I had used +a separate struct for each AST node type or mangled them all into a `union`, I'm +fairly sure I would've quintupled the size of this project. diff --git a/src/check.c b/src/check.c index 85e5459..0858c67 100644 --- a/src/check.c +++ b/src/check.c @@ -216,6 +216,12 @@ static int analyze_formal_def(struct state *state, struct scope *scope, if (analyze_type(scope, n)) return -1; + if (!concrete_type(n->t)) { + semantic_error(scope, n, "illegal type for formal parameter: %s", + type_str(n->t)); + return -1; + } + if (scope_add_formal(scope, n)) return -1; @@ -653,7 +659,7 @@ static int analyze_proc_call(struct state *state, struct scope *scope, foreach_node(a, args) { if (a->t != formal->t) { - semantic_error(scope, c, "expected %s, got %s", + semantic_error(scope, a, "expected %s, got %s", type_str(formal->t), type_str(a->t)); return -1; } @@ -723,9 +729,9 @@ static int analyze_until(struct state *state, struct scope *scope, if (analyze(state, until_scope, cond)) return -1; - if (cond->t != TYPE_BOOL) { - semantic_error(scope, cond, "expected %s, got %s", - type_str(TYPE_BOOL), type_str(cond->t)); + if (cond->t != TYPE_BOOL && cond->t != TYPE_INT) { + semantic_error(scope, cond, "expected truthy type, got %s", + type_str(cond->t)); return -1; } @@ -749,9 +755,9 @@ static int analyze_unless(struct state *state, struct scope *scope, if (analyze(state, scope, cond)) return -1; - if (cond->t != TYPE_BOOL) { - semantic_error(scope, cond, "expected %s, got %s", - type_str(TYPE_BOOL), type_str(cond->t)); + if (cond->t != TYPE_BOOL && cond->t != TYPE_INT) { + semantic_error(scope, cond, "expected truthy type, got %s", + type_str(cond->t)); return -1; } diff --git a/src/lower.c b/src/lower.c index 1949bd5..af9f9d4 100644 --- a/src/lower.c +++ b/src/lower.c @@ -605,6 +605,9 @@ static void lower_global_var(struct fn *f, struct ast *n) { static void add_proc(struct ast *n) { size_t idx = vec_len(&fns); + /* global locs are effectively just indexes, so this is a nifty way to + * encode the procedure index into the AST for later use by + * lower_proc_call and so on */ n->l = build_global_loc(idx); struct fn f = {.name = proc_id(n), .idx = idx, |