# PostHaste in C with a JIT

This is an implementation of the PostHaste language in C with a two-stage compilation pipeline: AST -> bytecode -> machine code.
## Building

At its simplest:

```
make
```
This will produce a binary called `posthaste`, which takes one input parameter, that being the file to execute. You can alternatively give the Makefile some arguments:

- `RELEASE=<0/1>` toggles optimization flags for the `posthaste` binary between `-O0` and `-O2 -flto=auto`. Optimization flags can be tweaked in `scripts/makefile`. By default, `RELEASE=0`.
- `DEBUG=<0/1>` disables/enables debugging output, such as dumping a textual representation of the AST and bytecode. By default, `DEBUG=1`.
- `ASSERT=<0/1>` disables/enables assertions. Can be used to eke out that little extra bit of performance. By default, `ASSERT=1`.
- `JIT=<0/1>` disables/enables JIT compilation. By default, `JIT=1`.
For example, `make RELEASE=1 DEBUG=0 ASSERT=0 JIT=1` should produce a true speed demon.
## Usage

```
./posthaste <input file>
```
## Implementation details
I've tried to sprinkle comments here and there where it seemed to make sense, but more significantly I've tried to make the source code as readable as possible. I guess we'll see how well I succeeded.
A lot of the information in this document can also be found in various comments throughout the codebase, and as always, the code itself doesn't lie.
### Scoping
Scopes consist of a vector of visible symbol definitions and a pointer to a possible parent scope. Each AST node knows which scope it belongs to (at least once semantic checking is done), so reaching a visible symbol from any other symbol involves looking through the visible symbols in the same scope. If a symbol matching the query string is found, great, return the AST node that defined that symbol. Otherwise, repeat the process for the parent scope.
I just used a simple vector of string -> AST node pairs. Ideally I would probably implement some kind of hashmap, but for small-ish programs I don't think there's a noticeable difference in symbol lookup performance.
Each scope just has a single vector of visible symbols, i.e. it wouldn't be possible for a variable and a procedure or function to share the same name, but that's not an issue as the lexer forcibly separates variable names from procedure and function names.
This gives me a fairly nice way to handle local variables and formals: we can create a new scope for a procedure/function and set its parent as the file scope, then insert symbols into this child scope and 'shadow' global symbols just by returning early from the symbol lookup. Only name clashes in the scope we're trying to add a symbol to cause a termination.
This also works recursively for `until` and `unless` blocks.
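The lookup described above can be sketched as follows. This is a minimal sketch: the names and the fixed-size array are illustrative, while the real structures live in `include/posthaste/scope.h` and use a growable vector.

```c
#include <stddef.h>
#include <string.h>

struct symbol {
	const char *name;
	void *node; /* AST node that defined the symbol */
};

struct scope {
	struct scope *parent; /* NULL for the file scope */
	struct symbol syms[8];
	size_t count;
};

/* Walk from the innermost scope outward. Returning on the first match is
 * exactly what makes a local definition 'shadow' a global one. */
static void *scope_lookup(const struct scope *s, const char *name)
{
	for (; s; s = s->parent)
		for (size_t i = 0; i < s->count; ++i)
			if (strcmp(s->syms[i].name, name) == 0)
				return s->syms[i].node;
	return NULL;
}
```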
The data structures that handle scoping are implemented in `include/posthaste/scope.h` and `src/posthaste/scope.c`.
### Semantic checking
Semantic checking is done in two phases. The first phase just inserts all function/procedure definitions into the top level scope so they can call each other, regardless of the order in which they're defined. The second phase pretty much does everything else.
Type checking is fairly easy in PostHaste, as there are really only a handful of primitive types, so we can use a simple enum to represent them all. Each AST node gets a member that holds the type of the node and the semantic checker places the correct type into this member during a depth-first traversal of each AST node.
The chosen scope structure is also helpful here. As we're traversing statement lists, we add definitions to the scope we're currently in as we reach them. This way, we can query whether some specific symbol has already been defined, either in the current scope or some parent scope. If no symbol matches the query, we can report an error.
I do static type checking, so all AST nodes in the program get some type assigned to them. Some nodes don't strictly have a concrete type (such as `TYPE_INT` or `TYPE_DATE`), so I have an internal `TYPE_VOID`. For example, a procedure call without a return type gets internally converted to a `TYPE_VOID`. Since types are just enum values, checking if two types are compatible is just an equality check. `TYPE_VOID` is a bit tricky, as each node has to be aware of its existence; for example, variable definitions must explicitly check for it instead of blindly deducing the variable's type from the expression.
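The idea can be sketched in a few lines. The enum members and helper names here are made up for illustration; the real enum lives in the checker.

```c
enum type {
	TYPE_VOID,
	TYPE_AUTO,
	TYPE_INT,
	TYPE_DATE,
	TYPE_BOOL,
};

/* Types are plain enum values, so compatibility is a single comparison. */
static int types_match(enum type a, enum type b)
{
	return a == b;
}

/* Variable definitions must explicitly reject TYPE_VOID instead of
 * blindly adopting the initializer expression's type. */
static int can_define_var(enum type t)
{
	return t != TYPE_VOID;
}
```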
Some notable checks:
- Procedure arguments vs. formal parameters. This is fairly easy, as call arguments are stored in a list, as are formal parameters, so after doing type checking for both, we can first check that the lists are the same length and that each AST node at the same index in both lists has the same type. Since formal parameters must have concrete types, the issue mentioned with `TYPE_VOID` is sidestepped.
- Return type checking, specifically the `auto` type. Essentially, when the checker enters a procedure or function definition AST node, it first and foremost sets the return type. If the return type name is `auto`, the type is set to `TYPE_AUTO`. The parser also inserts an implicit `AST_RETURN` node as the body of the function, so that any time we run into a return node, we can check if the type of the return expression matches the parent function/procedure. If the type is `TYPE_AUTO`, the return node sets the return type to be the expression type.
- Boolean types. Comparisons return a `TYPE_BOOL`, which can be used by `until` and `unless` and printed with a corresponding "true" or "false" string. `TYPE_INT` is considered truthy, and `until`/`unless` accept either an int or a bool.
- Non-void procedure calls as statements. Effectively, wherever there's a statement list, I have an `analyze_statement_list()` function, which also checks that none of the statements is a non-void procedure call. This was a late addition, as I only now realized that such a requirement existed, and it feels a little bit hacky, but it seems to work.
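The first check in the list above, arguments vs. formals, boils down to a length comparison followed by element-wise type equality. A hypothetical sketch, with illustrative enum values and only the types (not full AST nodes) compared:

```c
#include <stddef.h>

enum type { TYPE_INT, TYPE_DATE, TYPE_BOOL };

/* Reject a call if the argument list and formal list differ in length,
 * or if any position holds mismatched types. */
static int check_call(const enum type *args, size_t nargs,
                      const enum type *formals, size_t nformals)
{
	if (nargs != nformals)
		return 0;
	for (size_t i = 0; i < nargs; ++i)
		if (args[i] != formals[i])
			return 0;
	return 1;
}
```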
Semantic checking is implemented in `include/posthaste/check.h` and `src/check.c`.
### Lowering
I refer to bytecode generation from the AST as lowering. I went with a scheme similar to what Lua uses, i.e. the bytecode assumes it has access to two arrays, one for global values and one for local ones. The global value array, as the name suggests, contains the global values in the program. The local value array is referred to as a stack, as both local variables and temporary values exist in it. As a slight optimization, function calls just move an internal stack pointer register up by some amount, so the same stack is shared between all functions.
It's worth noting that at this point, there aren't really any relevant differences between procedures and functions, so from here on I'm using the terms interchangeably unless specifically referring to the PostHaste function/procedure differentiation.
The bytecode is of a fairly common three-address form; that is, each instruction consists of an opcode, one output location and two input locations. Not all locations must be used in each instruction. Each location effectively encodes whether it points to the global array or the stack, plus an index into that array. I also added a separate `CONST` instruction that writes a constant value to an output location, so for example `2 + 3` would look something like this:

```
CONST l1, 2    // write 2 to stack location 1
CONST l2, 3    // write 3 to stack location 2
ADD g0, l1, l2 // load stack locations 1 and 2 and add them together,
               // write the result to global location 0
```
The lowering effectively just has to select where to place values in the stack and/or global array, which is surprisingly easy to do. Since we don't allow variable definitions as expressions, each variable definition just pushes a virtual stack pointer up by one slot, effectively reserving that slot for itself. Expressions place their value into the current virtual stack pointer location, and in the case of binary operations like `ADD`, `SUB`, etc., they increment the stack pointer temporarily while lowering the left/right-hand side, and then set the stack pointer back to where it was.
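The virtual-stack-pointer dance can be sketched roughly like this. The function names are made up for illustration, and the "emitted" instructions are just printed; the real lowering of course builds actual bytecode.

```c
#include <stdio.h>

static int vsp; /* virtual stack pointer, in slots */

/* Constants land in whatever slot the vsp currently points at. */
static int lower_const(long v)
{
	printf("CONST l%d, %ld\n", vsp, v);
	return vsp;
}

/* Binary ops bump the vsp while lowering their right-hand side so the
 * left-hand result stays live, then restore it afterwards. */
static int lower_add(long lhs, long rhs)
{
	int dst = lower_const(lhs); /* left side in the current slot */
	vsp++;                      /* protect it during the right side */
	int src = lower_const(rhs);
	vsp--;                      /* temporary above dst is dead again */
	printf("ADD l%d, l%d, l%d\n", dst, dst, src);
	return dst;
}
```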
Each stack location is 64 bits wide. For ease of implementation, `TYPE_DATE` and `TYPE_INT` are both encoded as 64-bit integers, as are pointers to strings and bools. The 64-bit values are just interpreted differently based on the instructions, which effectively means we must have a working static type checker.
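A minimal sketch of that slot encoding, with made-up helper names: every value is a raw 64-bit integer, and the instruction operating on it decides how to read it.

```c
#include <stdint.h>

typedef uint64_t slot;

/* Integers (and dates, bools) are stored directly in the 64-bit slot. */
static slot slot_from_int(int64_t v) { return (slot)v; }
static int64_t slot_as_int(slot s)   { return (int64_t)s; }

/* String pointers are stuffed into the same 64 bits. */
static slot slot_from_str(const char *p) { return (slot)(uintptr_t)p; }
static const char *slot_as_str(slot s)   { return (const char *)(uintptr_t)s; }
```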
Argument passing is done by emitting `ARG` instructions, which push the location they're given onto a 'virtual' argument stack. Getting the return value correspondingly has a `RETVAL` instruction, which moves the returned value from a 'virtual' return register. I say 'virtual' because the JIT compiler gets rid of these constructs, but they are very real in the bytecode interpreter. In general, I opted to make the instruction set easier to compile later, rather than to make interpretation fast.
Lowering is implemented in `include/posthaste/lower.h` and `src/lower.c`.
### Interpreting
The bytecode interpreter is fairly straightforward. It has a number of internal registers, `pc`, `sp`, and the previously mentioned virtual argument stack. Each procedure call in PostHaste executes the interpreter recursively, copying arguments from the stack to where the formals are defined to be. Formal parameters must occupy the first consecutive stack locations in the order they are defined in PostHaste, which makes argument handling fairly easy to implement.
The interpreter uses computed gotos to try and reduce execution overhead. By using labels and gotos, we can avoid using a while loop with a switch statement within it, potentially saving a couple jumps between bytecode instructions. The bytecode instruction set itself is not designed to be particularly fast, as previously stated. Another example of this is that locations have to be decoded at runtime. For example, when an instruction requests a value from some location, the location has to be parsed to check whether we're dealing with a location in the stack or in the global array, incurring an extra conditional for each location.
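A stripped-down computed-goto dispatch loop in that spirit might look like this. The opcodes and instruction layout are illustrative, not the real instruction set, and label addresses (`&&label`, `goto *p`) are a GCC/Clang extension.

```c
enum op { OP_CONST, OP_ADD, OP_HALT };

struct insn {
	enum op op;
	int out, in1, in2; /* OP_CONST reuses in1 as the constant */
};

static long run(const struct insn *pc)
{
	static const void *dispatch[] = {
		[OP_CONST] = &&do_const,
		[OP_ADD]   = &&do_add,
		[OP_HALT]  = &&do_halt,
	};
	long stack[16] = { 0 };

	/* Each handler jumps straight to the next instruction's handler,
	 * avoiding a while loop wrapped around a switch. */
	goto *dispatch[pc->op];

do_const:
	stack[pc->out] = pc->in1;
	pc++;
	goto *dispatch[pc->op];
do_add:
	stack[pc->out] = stack[pc->in1] + stack[pc->in2];
	pc++;
	goto *dispatch[pc->op];
do_halt:
	return stack[pc->out];
}
```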
Recursion is handled just as any other function call.
The interpreter is implemented in `include/posthaste/interpret.h` and `src/interpret.c`.
### Compiling
The compiler is fairly straightforward. I chose to use Lightening, a library I've made some contributions to (I ported it to MIPS and PowerPC, among other things). The maintainer has been AWOL for a little while, so I'm using my fork of it.
Lightening is more or less a very lightweight portable assembler that gives the user a number of virtual instructions and registers which map more or less 1:1 to machine instructions. This means that a user would normally have to implement code generation and register allocation by hand, but this project is simple enough that we can get all of that for basically nothing.
Each bytecode instruction is compiled 'as if' it were being interpreted: load data from either the stack or the global array, do some operation on that data, and write something back. Each instruction is thus a self-contained unit with no register dependencies between instructions, so as long as we can implement each instruction with the registers we have, we're golden. In practice we only need three registers: the three-address instructions can be implemented with two hardware registers, and the global array pointer lives in a third. The stack is just, well, the stack, so we use the stack pointer to access it. The stack pointer is automatically handled by Lightening, making our lives easier.
I mentioned in my email that I JIT-compiled all functions regardless of whether they were called or not, but that's not true anymore. I added a simple scheme that only runs the JIT compiler on functions that get called, which turned out to be easy enough as that information is 'implicitly' encoded in the bytecode. Just wait until you run into a `CALL` instruction and check if the corresponding `fn` has already been compiled. If not, compile it; otherwise, continue compilation of your current function.
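The lazy scheme reduces to a memoized compile-on-first-`CALL` check, which could be sketched as below. The struct fields and function names are hypothetical; the real emission goes through Lightening.

```c
#include <stddef.h>

struct fn {
	int compiled;
	void *code; /* machine code, once emitted */
	/* bytecode, length, etc. */
};

/* Stand-in for the real Lightening-based code emission; in the real
 * compiler this recurses when it encounters CALLs to uncompiled fns. */
static int dummy_code;
static void *emit_machine_code(struct fn *f)
{
	(void)f;
	return &dummy_code;
}

/* Compile on the first CALL, reuse the result on every later one. */
static void *compile_on_demand(struct fn *f)
{
	if (!f->compiled) {
		f->code = emit_machine_code(f);
		f->compiled = 1;
	}
	return f->code;
}
```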
The JIT compilation is implemented in `include/posthaste/compile.h` and `src/compile.c`.
Machine code execution is trivial: just call the functions as if they were any other function and watch the magic happen. There's unfortunately no really good way to view the generated machine code, as that would require some external disassembler library, which I didn't want to add as a dependency.
Recursion is handled just as any other function call. Machine code execution is implemented in `include/posthaste/execute.h` and `src/execute.c`.
## Thoughts on assignment
Obviously I went a bit overkill, so I'm not sure how representative my opinions are, but I really enjoyed working on this. The language itself is simple enough that no individual phase felt like a slog, but it is enough like a real language that it really was quite satisfying to see things work.
Some semantic things were a little bit weird to implement, for example the requirement that calls to non-void functions/procedures shouldn't be allowed as statements. That one I had to explicitly check for in code; it would've been easier to just ignore it, as it didn't really 'naturally' fit into the structure of the checker, and the programs themselves would still work.
The semantic checker was probably the most difficult bit, to be honest; I'm still not entirely sure I'm not overlooking some corner case somewhere. The bytecode lowering and JIT compilation sort of just fell into place. I have played around with some bytecode-adjacent things before, and I think I made the right choice with regard to the design complexity.
The `get_a0()` etc. thing seemed to work quite nicely; I might try using it in other projects. It's not anything I came up with, I believe GCC does something similar, but for example `ast_visit()` turned out to be way easier to implement with just a fixed `struct ast`. If I had used a separate struct for each AST node type or mangled them all into a `union`, I'm fairly sure I would've quintupled the size of this project.