metastuff

Ek

ek

This is partially a continuation of my earlier tri post. TL;DR: I’ve been working on a programming language that compiles down to trinary assembly. I’ve named it ek, Swedish for oak. Currently hosted at https://github.com/Kimplul/ek

It’s still not quite usable, but I recently got it to a state where I wanted to write down a few words about the project and ‘announce’ it, if you will. This is not intended as documentation, rather just a quick overview of the basics of this language.

PL nerds, avert thine eyes, for there is nothing of interest to you here. My language is boring :)

Goals

First and foremost, what’s the point of the language? Couldn’t I have just used some other language or compiler toolchain and just ported it to tri or something?

I don’t think so. I did look at some of the less intimidating compiler frameworks like qbe (very cool) but even though the instruction set I currently have is semantically close to RISC-V, actually trying to add trinary values into the mix seemed to be a massive pain to deal with. Not necessarily saying it’s impossible, but I also just thought it might be fun to implement everything from scratch. A learning experience.

As such, for this project I’m more interested in compiler internals over language semantics, and the language itself has semantics very close to C. I did splurge on a slightly more complex type system, as a treat. More on it a bit later. I have other language projects that deal with slightly more exotic language semantics, see fwd (or, for a laugh, lyn).

Pretty much my only explicit requirement was that the language be more convenient than writing raw assembly. I’m sure you can do great things with just assembly, and it might be a fun challenge to see how far I could personally get, but I’m also rather lazy and want to have options to just get stuff done when I feel like it. Implicitly, the language would also have to be fairly simple so I can actually implement it in some semi-reasonable amount of time (and I’m glacially slow at developing stuff), so I kind of went for the lowest common denominator for the most part.

The language itself

Basic syntax

Let’s start off with an example snippet and dissect it a bit:

/* there are comments /* thery can nest */*/

// as are these

/*
 * functions can only show up at the top level, => is equivalent to
 *
 *	auto something() -> some_type
 *
 * in C++
 */
fib(i27 n => i27)
{
	if n < 2 {
		return 1;
	} else {
		return fib(n - 1) + fib(n - 2);
	}
}

main()
{
	/* variable definition, `const` or `mut` signify type deduction. Call
	 * fib() and place the return value into r */
	mut r = fib(42); /* i27 r = ...; would also work */

	/* ! denotes a macro expansion, like in Rust */
	println!(r);
}

I expect this to look fairly familiar if you have any C background. Types are where they should be, on the left. Function signatures are a bit different, with a weird => inside the brackets, and if (as well as for and while) require curly braces, but parentheses around the condition are now optional. The parser accepts switch statements as well, but they’re not fully implemented yet. Whitespace is completely insignificant.

I do have macros, but they are very limited and work on the AST level, so you can’t generate arbitrary code with them. They are also hygienic, i.e. create a new scope from which newly defined variables can’t escape. Note that you can still use a macro as an expression, since blocks ‘return’ the value of the last statement defined within.

Using ! to denote macro expansion is maybe a bit of a controversial feature, I know some people seem to hate it, but it does make writing the compiler easier and allows me to use some lexer tricks to bend the language to my will a bit better. Essentially, println! isn’t lexed as ID + !, but rather a token type APPLY whose value is ID. If you don’t know what I’m on about, don’t worry, it’s not important, just something I thought was kind of neat in a hacky way when I came up with it.

As a quality of life feature, I also allow if to be used an expression, but only when prefixed with the do keyword:

mut a = do if some_cond {1} else {2};

This is my version of a ternary operator, a bit more verbose but I think it’s more readable, especially when chained. I’ve also been playing around with the idea of allowing switch be an expression, but I’m not too sure about that since it might be a bit more tricky to prove that at least one branch gets taken for all input. I guess requiring a default branch could be an option, but that feels a bit verbose. I have no intentions of making loops expressions, it might make some simple sum loops easier to write but has some kind of annoying corner cases that I don’t particularly want to deal with, like how break should be handled, should the output value be copied once or on every iteration, what happens if the body is never entered, etc. Solvable problems, but I’m not that interested in solving them.

I also have defer and goto. I really like defer and think it can help with resource management in unsafe languages, which is what ek is. My defer is bound to block scope, so doesn’t require dynamic allocations like Go’s. goto is probably a bit more controversial, but my understanding is that some code generation algorithms can use goto to produce more optimal code more efficiently than more structured approaches. goto is still discouraged in most situations, but it was kind of fun to implement how defer and goto might interact with eachother. They’re both probably unnecessary for the stuff I’ll be writing in the near future, but I’ve got to have some toys, right?

If you read my tri post, you will be aware that the platform has programmable logic operators. Perhaps surprisingly, I haven’t actually implemented them yet, mostly because I really don’t know what the syntax should look like. They’re really very basic unary/binary operators, but there are some options. Should the operators be lowercase or uppercase? Both? nop, NOP, i01? All mixed together? Can binary operators be compressed down somehow (nop nop nop is kind of long)? Should they be infix or function-like? How are they differentiated from regular IDs or INTEGERs?

Presumably the user should just define macros that have a name that explains what the operation being perfomed is for or does, so this decision probably isn’t that significant in the long run, but something I still haven’t decided on. If you have suggestions, feel free to email me.

Variadics and macros

The language doesn’t have variadic functions. I’ve found that variadics are effectively a syntactical thing, and should preferably be implemented with macros, like in Rust. As such, I imagine that stuff like println! would be implemented something like the following:

define println(...xs)
{
	const for x : xs {
		/* assuming fmt() returns a string representation of x */
		const s = x.fmt();
		print!(s);
		s.destroy();
	}
}

const for essentially just expands out the body, iterating over the AST arguments given by xs. The parser also accepts const if, expanding the selected branch, mainly intended for feature checks and so on. That’s about as much metaprogramming I’m interested in. Macro expansion is implemented, but const for and const if aren’t as of yet. I don’t expect their implementation to cause any significant issues, I just haven’t had the motivation to implement them since they’re more ’nice-to-have’ than anything strictly required to get assembly produced.

Additionally, functions can be marked extern for direct assembly calls. Technically these could also be used for some kind of FFI, but at the moment ek is the only language for my platform so eh.

extern some_func(); /* ek -> asm */
extern other_func() /* asm -> ek */
{
	/* ... */
}

Visibility and scoping

Top-level items (functions, types) are by default private to the file in which they are defined, and can appear in any order. There is one exception, which might be removed in the future. More about it in the types section. The pub keyword can be used to export items:

/* private_thing won't be visible if this file is imported somewhere else */
import "private_thing.ek"

/* public_thing, however, will automatically be exported */
pub import "public_thing.ek"

helper()
{
	println!("I help!");
}

pub worker()
{
	helper();
}

I actually intend to support global variables since they are sometimes useful, but the expressions that globals can be initialized to require some constant evaluation that isn’t actually implemented yet.

Shadowing is allowed in subscopes, but not in the same scope, so

some_func()
{
	const a = 10;
	/* const a = 20; not okay */
	{
		const a = 20; /* okay */
	}
}

Imports are actually ‘unique’, so

/* common.ek */
pub some_func(){}

/* a.ek */
pub import "common.ek"

/* b.ek */
pub import "common.ek"

/* main.ek */
import "a.ek"
import "b.ek"

/* do something with some_func() */

is perfectly valid, and a.ek, b.ek and main.ek all ‘see’ a single definition of some_func(). When main.ek imports a.ek, some_func() is first inserted into a set of visible functions. When b.ek is imported, the compiler can see that common.ek was previously imported, and can merge the definitions.

At least at the moment there are no search paths, and all imported files are relative to the current file. I’m not sure if this is something I want to change in the future, more on this a bit later.

Types

The type system is probably the most interesting part of this (intentionally) largely uninteresting language. I’m personally of the opinion that generics should be kept to a minimum, and preferably only be used for ‘containers’. When generics are used, it should be clear what ‘features’ some type is expected to have. As such, I’ve modeled the type system loosely (very loosely) on Rust’s, though with some major differences.

First, let’s get the basics out of the way:

There are two ‘primitive’ types:i27 and i9. Integers of width 27 and 9 trits. The underlying platform doesn’t support unsigned numbers, so these are all we get. The compiler also accepts bool for 0/1, with -1 or i also being accepted as true, though that’s not the canonical value. In practice, this gets mapped to i9, and mainly just exists for humans. No special optimizations are done at the moment, unlike major compilers.

Pointers use the familiar star syntax, but notably all types are read left to right. *i27 is a pointer to a i27. *(i27 => i27) is a function pointer, note how the signature doesn’t cause any weird spiral definitions. Pointers are dereferenced from the right, so

deref(*i27 p => i27)
{
	return p*;
}

Arrays also use the familiar square brackets, so [64]i27.

Simple enums (enum as in C, not Rust) are supported. In contrast to C, the enums are scoped, so

enum flags {A, B};

func(flags f)
{
	if f == A::flags {
		return 10;
	} else {
		return 20;
	}
}

Note that :: goes in ‘reverse’, mostly because it made the parser generator happy, but I suppose it also means that the ‘important’ bit comes first and is maybe easier to read.

Aliases are also supported:

typedef flags i27;

Now, let’s get to structures! Structures are, as previously stated, kind of Rust-inspired. Here’s a concrete example:

/* traits are defined with `define`, kind of like macros. They specify what
 * functions should be available for a given type. */
define some_trait {
	/* some_trait refers to whatever type implements this trait.
	 * parameter names can be omitted in function prototypes
	 */
	say(*some_trait, *i9);

	/* default implementations can also be given */
	say_hello(*some_trait self)
	{
		self*.say("Hello!");
	}
}

/* structures are defined with typedef, followed by a body */
typedef actual {
	/* this signifies that we want to implement some_trait.
	 * internally, this is just implemented as expanding some_trait and
	 * replacing the some_trait type with actual, like a macro.
	 * hence the exclamation mark.
	 */
	some_trait!;

	/* implement the functions that trait requires */
	say(*actual, *i9 msg)
	{
		/* self unused, so we can omit it as well */
		println!("actual says:", msg);
	}
}

/* this is a struct that requires it be given a type that implements some_trait */
typedef generic[some_trait T] {
	/* types can of course have variables */
	T var;

	/* exclamation mark is approximately "construct this type with these
	 * parameters", internally it's just more macro expansion! */
	do_something(*generic![T] self)
	{
		self*.say_hello();
	}
}

main()
{
	mut p = generic![actual]{
		.var = actual!{}
	};

	p.do_something();
	/* same as
	 * do_something::generic![actual](&p);
	 */

	/* should print "actual says: hello!" if we had a working print */
}

That’s a rather long example, but hopefully not too difficult to follow. A struct can of course implement multiple traits, trait1!; trait2!; ... Only structures can take types as parameters, so no generic functions. I thought about making traits generic, I concluded that such higher-order types are a bit too complex for my intended use case.

Now, let’s say you are given a type and a trait you should implement for it. In Rust, it turns out that there’s an orphan trait rule that forbids this if the type and trait are from different modules. There are good reasons for this, one example that often comes up is implementing hashing in two different places for the same type causing items to not be found in some third place (IIUC). I kind of just went ’eh, fuck it’ and allow traits to be arbitrarily implemented by using a definition continuation. I realize this is somewhat risky and potentially creates a hole in the type system where some very annoying bugs could hide, but I’m hoping that those kinds of bugs become serious only in huge projects. I don’t expect anyone to use this language for anything even half-serious, and the flexibility allowed by this approach seems like the more sensible option for my use case.

Continuations use the continue keyword:

continue actual {
	other_trait!;

	/* whatever other_trait requires */
}

continue, at least for now, must come after a typedef (kind of silly to continue something that hasn’t yet been defined, heh). No new variables can be defined within the continuation, but new functionality is allowed. Specifically new, any existing functions are not allowed to be reimplemeted. Each continuation is kind of an independent block, in that it must implement the trait, and cannot delay that work for some later continuation. Previous continuations and functions implemented within them can be used, however.

I also allow mixing public and private continuations, and I have some basic subchain/superchain detection so that the longer chain stays active if two versions of the same chain get imported. Chains are a bit more technical, and I plan on writing a second post just about the compiler internals where I’ll go into more detail. But it’s getting late and I have some cats to feed.