Ek
ek
This is partially a continuation of my earlier tri post. TL;DR: I’ve
been working on a programming language that compiles down to trinary assembly.
I’ve named it ek
, Swedish for oak. Currently hosted at
https://github.com/Kimplul/ek
It’s still not quite usable, but I recently got it to a state where I wanted to write down a few words about the project and ‘announce’ it, if you will. This is not intended as documentation, rather just a quick overview of the basics of this language.
PL nerds, avert thine eyes, for there is nothing of interest to you here. My language is boring :)
Goals
First and foremost, what’s the point of the language? Couldn’t I have just used
some other language or compiler toolchain and just ported it to tri
or
something?
I don’t think so. I did look at some of the less intimidating compiler frameworks like qbe (very cool) but even though the instruction set I currently have is semantically close to RISC-V, actually trying to add trinary values into the mix seemed to be a massive pain to deal with. Not necessarily saying it’s impossible, but I also just thought it might be fun to implement everything from scratch. A learning experience.
As such, for this project I’m more interested in compiler internals over language semantics, and the language itself has semantics very close to C. I did splurge on a slightly more complex type system, as a treat. More on it a bit later. I have other language projects that deal with slightly more exotic language semantics, see fwd (or, for a laugh, lyn).
Pretty much my only explicit requirement was that the language be more convenient than writing raw assembly. I’m sure you can do great things with just assembly, and it might be a fun challenge to see how far I could personally get, but I’m also rather lazy and want to have options to just get stuff done when I feel like it. Implicitly, the language would also have to be fairly simple so I can actually implement it in some semi-reasonable amount of time (and I’m glacially slow at developing stuff), so I kind of went for the lowest common denominator for the most part.
The language itself
Basic syntax
Let’s start off with an example snippet and dissect it a bit:
/* there are comments /* thery can nest */*/
// as are these
/*
* functions can only show up at the top level, => is equivalent to
*
* auto something() -> some_type
*
* in C++
*/
fib(i27 n => i27)
{
if n < 2 {
return 1;
} else {
return fib(n - 1) + fib(n - 2);
}
}
main()
{
/* variable definition, `const` or `mut` signify type deduction. Call
* fib() and place the return value into r */
mut r = fib(42); /* i27 r = ...; would also work */
/* ! denotes a macro expansion, like in Rust */
println!(r);
}
I expect this to look fairly familiar if you have any C background. Types are
where they should be, on the left. Function signatures are a bit different, with
a weird =>
inside the brackets, and if
(as well as for
and while
)
require curly braces, but parentheses around the condition are now optional.
The parser accepts switch
statements as well, but they’re not fully implemented yet.
Whitespace is completely insignificant.
I do have macros, but they are very limited and work on the AST level, so you can’t generate arbitrary code with them. They are also hygienic, i.e. create a new scope from which newly defined variables can’t escape. Note that you can still use a macro as an expression, since blocks ‘return’ the value of the last statement defined within.
Using !
to denote macro expansion is maybe a bit of a controversial feature, I
know some people seem to hate it, but it does make writing the compiler easier
and allows me to use some lexer tricks to bend the language to my will a bit
better. Essentially, println!
isn’t lexed as ID
+ !
, but rather a token
type APPLY
whose value is ID
. If you don’t know what I’m on about, don’t
worry, it’s not important, just something I thought was kind of neat in a hacky
way when I came up with it.
As a quality of life feature, I also allow if
to be used an expression, but
only when prefixed with the do
keyword:
mut a = do if some_cond {1} else {2};
This is my version of a ternary operator, a bit more verbose but I think it’s
more readable, especially when chained. I’ve also been playing around with the
idea of allowing switch
be an expression, but I’m not too sure about that
since it might be a bit more tricky to prove that at least one branch gets taken
for all input. I guess requiring a default
branch could be an option, but that
feels a bit verbose. I have no intentions of making loops expressions, it might
make some simple sum loops easier to write but has some kind of annoying corner
cases that I don’t particularly want to deal with, like how break
should be
handled, should the output value be copied once or on every iteration, what
happens if the body is never entered, etc. Solvable problems, but I’m not that
interested in solving them.
I also have defer
and goto
. I really like defer
and think it can help with
resource management in unsafe languages, which is what ek
is. My defer
is
bound to block scope, so doesn’t require dynamic allocations like Go’s. goto
is probably a bit more controversial, but my understanding is that some code
generation algorithms can use goto
to produce more optimal code more
efficiently than more structured approaches. goto
is still discouraged in most
situations, but it was kind of fun to implement how defer
and goto
might interact
with eachother. They’re both probably unnecessary for the stuff I’ll be
writing in the near future, but I’ve got to have some toys, right?
If you read my tri
post, you will be aware that the platform has programmable
logic operators. Perhaps surprisingly, I haven’t actually implemented them yet,
mostly because I really don’t know what the syntax should look like. They’re
really very basic unary/binary operators, but there are some options. Should the
operators be lowercase or uppercase? Both? nop
, NOP
, i01
? All mixed
together? Can binary operators be compressed down somehow (nop nop nop
is kind
of long)? Should they be infix or function-like? How are they differentiated
from regular ID
s or INTEGER
s?
Presumably the user should just define macros that have a name that explains what the operation being perfomed is for or does, so this decision probably isn’t that significant in the long run, but something I still haven’t decided on. If you have suggestions, feel free to email me.
Variadics and macros
The language doesn’t have variadic functions. I’ve found that variadics are
effectively a syntactical thing, and should preferably be implemented with
macros, like in Rust. As such, I imagine that stuff like println!
would be
implemented something like the following:
define println(...xs)
{
const for x : xs {
/* assuming fmt() returns a string representation of x */
const s = x.fmt();
print!(s);
s.destroy();
}
}
const for
essentially just expands out the body, iterating over the AST
arguments given by xs
. The parser also accepts const if
, expanding the
selected branch, mainly intended for feature checks and so on. That’s about as
much metaprogramming I’m interested in. Macro expansion is implemented, but
const for
and const if
aren’t as of yet. I don’t expect their implementation
to cause any significant issues, I just haven’t had the motivation to implement
them since they’re more ’nice-to-have’ than anything strictly required to get
assembly produced.
Additionally, functions can be marked extern
for direct assembly calls.
Technically these could also be used for some kind of FFI, but at the moment
ek
is the only language for my platform so eh.
extern some_func(); /* ek -> asm */
extern other_func() /* asm -> ek */
{
/* ... */
}
Visibility and scoping
Top-level items (functions, types) are by default private to the file in which
they are defined, and can appear in any order. There is one exception, which
might be removed in the future. More about it in the types section.
The pub
keyword can be used to export items:
/* private_thing won't be visible if this file is imported somewhere else */
import "private_thing.ek"
/* public_thing, however, will automatically be exported */
pub import "public_thing.ek"
helper()
{
println!("I help!");
}
pub worker()
{
helper();
}
I actually intend to support global variables since they are sometimes useful, but the expressions that globals can be initialized to require some constant evaluation that isn’t actually implemented yet.
Shadowing is allowed in subscopes, but not in the same scope, so
some_func()
{
const a = 10;
/* const a = 20; not okay */
{
const a = 20; /* okay */
}
}
Imports are actually ‘unique’, so
/* common.ek */
pub some_func(){}
/* a.ek */
pub import "common.ek"
/* b.ek */
pub import "common.ek"
/* main.ek */
import "a.ek"
import "b.ek"
/* do something with some_func() */
is perfectly valid, and a.ek
, b.ek
and main.ek
all ‘see’ a single
definition of some_func()
. When main.ek
imports a.ek
, some_func()
is first inserted
into a set of visible functions. When b.ek
is imported, the compiler can see
that common.ek
was previously imported, and can merge the definitions.
At least at the moment there are no search paths, and all imported files are relative to the current file. I’m not sure if this is something I want to change in the future, more on this a bit later.
Types
The type system is probably the most interesting part of this (intentionally) largely uninteresting language. I’m personally of the opinion that generics should be kept to a minimum, and preferably only be used for ‘containers’. When generics are used, it should be clear what ‘features’ some type is expected to have. As such, I’ve modeled the type system loosely (very loosely) on Rust’s, though with some major differences.
First, let’s get the basics out of the way:
There are two ‘primitive’ types:i27
and i9
. Integers of width 27 and 9 trits. The
underlying platform doesn’t support unsigned numbers, so these are all we get.
The compiler also accepts bool
for 0/1
, with -1
or i
also being accepted as
true
, though that’s not the canonical value. In practice, this gets mapped to
i9
, and mainly just exists for humans. No special optimizations are done at
the moment, unlike major compilers.
Pointers use the familiar star syntax, but notably all types are read left to
right. *i27
is a pointer to a i27
. *(i27 => i27)
is a function pointer,
note how the signature doesn’t cause any weird spiral definitions. Pointers are
dereferenced from the right, so
deref(*i27 p => i27)
{
return p*;
}
Arrays also use the familiar square brackets, so [64]i27
.
Simple enums (enum as in C, not Rust) are supported. In contrast to C, the enums are scoped, so
enum flags {A, B};
func(flags f)
{
if f == A::flags {
return 10;
} else {
return 20;
}
}
Note that ::
goes in ‘reverse’, mostly because it made the parser generator
happy, but I suppose it also means that the ‘important’ bit comes first and is
maybe easier to read.
Aliases are also supported:
typedef flags i27;
Now, let’s get to structures! Structures are, as previously stated, kind of Rust-inspired. Here’s a concrete example:
/* traits are defined with `define`, kind of like macros. They specify what
* functions should be available for a given type. */
define some_trait {
/* some_trait refers to whatever type implements this trait.
* parameter names can be omitted in function prototypes
*/
say(*some_trait, *i9);
/* default implementations can also be given */
say_hello(*some_trait self)
{
self*.say("Hello!");
}
}
/* structures are defined with typedef, followed by a body */
typedef actual {
/* this signifies that we want to implement some_trait.
* internally, this is just implemented as expanding some_trait and
* replacing the some_trait type with actual, like a macro.
* hence the exclamation mark.
*/
some_trait!;
/* implement the functions that trait requires */
say(*actual, *i9 msg)
{
/* self unused, so we can omit it as well */
println!("actual says:", msg);
}
}
/* this is a struct that requires it be given a type that implements some_trait */
typedef generic[some_trait T] {
/* types can of course have variables */
T var;
/* exclamation mark is approximately "construct this type with these
* parameters", internally it's just more macro expansion! */
do_something(*generic![T] self)
{
self*.say_hello();
}
}
main()
{
mut p = generic![actual]{
.var = actual!{}
};
p.do_something();
/* same as
* do_something::generic;
*/
/* should print "actual says: hello!" if we had a working print */
}
That’s a rather long example, but hopefully not too difficult to follow. A
struct can of course implement multiple traits, trait1!; trait2!; ...
Only structures can take types as parameters, so no
generic functions. I thought about making traits generic, I concluded that such higher-order
types are a bit too complex for my intended use case.
Now, let’s say you are given a type and a trait you should implement for it. In Rust, it turns out that there’s an orphan trait rule that forbids this if the type and trait are from different modules. There are good reasons for this, one example that often comes up is implementing hashing in two different places for the same type causing items to not be found in some third place (IIUC). I kind of just went ’eh, fuck it’ and allow traits to be arbitrarily implemented by using a definition continuation. I realize this is somewhat risky and potentially creates a hole in the type system where some very annoying bugs could hide, but I’m hoping that those kinds of bugs become serious only in huge projects. I don’t expect anyone to use this language for anything even half-serious, and the flexibility allowed by this approach seems like the more sensible option for my use case.
Continuations use the continue
keyword:
continue actual {
other_trait!;
/* whatever other_trait requires */
}
continue
, at least for now, must come after a typedef
(kind of silly to continue
something
that hasn’t yet been defined, heh). No new variables can be defined within the
continuation, but new functionality is allowed. Specifically new, any existing
functions are not allowed to be reimplemeted. Each continuation is kind of an
independent block, in that it must implement the trait, and cannot delay that
work for some later continuation. Previous continuations and functions
implemented within them can be used, however.
I also allow mixing public and private continuations, and I have some basic subchain/superchain detection so that the longer chain stays active if two versions of the same chain get imported. Chains are a bit more technical, and I plan on writing a second post just about the compiler internals where I’ll go into more detail. But it’s getting late and I have some cats to feed.