Making Fibonacci tail-recursive (kinda)

25 Mar, 2025

Context

I’ve been playing around with an idea for a programming language, that I’ve already dubbed fwd, where the main gimmick is not allowing functions/procedures to return. All computation happens by passing results to callbacks or closures, and I have some syntactic tricks to make code more easily readable. At its core, closures can be defined in a trailing way, so for example

do_something() => int result;
use_result(result);

would be equivalent to the following C++:

do_something([&](int result){
	use_result(result);
});

There are actual reasons for why I’m interested in this (besides just being a bit unusual and therefore cool), mainly revolving around resource management, but I won’t go into details in this post. I might write a proper introduction at some point, but you can get a feel for what I’m trying to do by reading the README over in https://metanimi.dy.fi/git/fwd/about.

Anycase, let’s just take it as a given that I want to develop a language that almost solely revolves around closures and callbacks. This almost immediately runs into issues that you probably wouldn’t need to think about in regular languages. Most notably for this blogpost, replacing return values with function calls means we use up a little bit of stack for each computed value. Generally, when a function has computed something, it returns the computed value (if a value was requested) and frees up the resources it used for the computation. Among these resources is stack space, i.e. every time you call a function, the compiler carves out a bit of memory to store variables etc. into. Since we’re never ‘returning’ anything, this memory never gets released, and we very quickly run out of memory.

A typical Linux installation has a default stack size of 8 MiB. How much a function call consumes stack space depends on quite a few factors, but even if we assume it only takes one byte, that leaves us with an absolute maximum of about 8 million function calls that our program could do in sequence. This number is laughably small for any non-trivial programs¹.

All hope is not lost, though. Tail calls are a special kind of call that reuses the stack frame of the function doing the call, giving us ‘free’ calls in terms of stack space. Different languages have different rules for how tail calls are done and in which situations they can be done, but in our case we would like them to be done as often as possible to save on as many function calls as possible.

Recursive algorithms in particular are interesting in this context. Here’s a recursive implementation of the Fibonacci series in C:

int fib(int n)
{
	if (n <= 2)
		return 1;

	return fib(n - 1) + fib(n - 2);
}

Even for fairly small n, the call count explodes. n = 42 does 535 828 591 calls, about half a billion. Significantly more than our hard limit of 8 million.

Since there are no return values in fwd, fib would have to be expressed something like the following:

fib(int n, (int) res)
{
	if n <= 2 {
		res(1);
	} else {
		fib(n - 1) => int f1;
		fib(n - 2) => int f2;
		res(f1 + f2);
	}
}

My current fwd compiler is capable of translating the above fairly directly to C++ that uses regular lambdas and we get the correct result for a given n. However, only for very small n, anything beyond about 20 segfaults due to running out of stack space. This is because each invocation of fib, as previously stated, builds up its own stack frame and doesn’t free it. Boo.

Can we do something about this?

Holey stacks

The idea is that we separate data and call stacks. The call stack uses the native stack, but the data stack has special frames that can mark themselves ’not needed’ anymore, and can be destroyed without having to traverse the call stack. Using the fibonacci sequence as an example, I’ve added some comments showing interesting properties of each line:

fib(int n, (int) res)
{
    if n <= 2 {
        res(1); /* exit out of fib completely */
    } else {
        fib(n - 1) => int f1; /* tailcall */
        fib(n - 2) => int f2; /* tailcall */
        res(f1 + f2); /* exit out of fib completely */
    }
}

In more conventional, C++-like syntax², this is equivalent to

/* res would be a lambda */
fwd_err_t fib(int n, auto res)
{
    if (n <= 2) {
        /* tail call, since either res of fib always gets called */
        return res(1);
    } else {
        /* tail call, since either res or fib always gets called */
        return fib(n - 1, [&](int f1){
            /* tail call, last thing in closure */
            return fib(n - 2, [&](int f2){
                /* tail call, last thing in closure */
                return res(f1 + f2);
            });
        });
    }
}

Note how there are two points in the code where we never return to this syntactic scope again, namely when calling res(). This can be thought of as saying “I’ve completed my calculation and used all the resources that I needed for it”. We would like to free the function context (all closures in a function share the same context, effectively a stack frame), but the outermost fib can be way below the current stack top, and cannot be destroyed if the function contexts are using a regular stack. The memory usage balloons until we run out of memory (at least for large enough n, think something like fib(42))

Anycase, with a holey stack, we can mark the current context as “to be destroyed”, or that the resources within have all been used. The top of the stack, when freed, can then jump all the way down to where the last non-free frame is, without traversing the call stack. It’s maybe a bit difficult to wrap your head around, and my explanation is probably not the best, but consider the following example of what C code generated from the fib.fwd example could look like:

#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>

/* here's the holey stack implementation */
struct holey_stack {
	size_t sp;
	size_t size;
	int8_t *buf;
};

/* separator, each stack frame is surrounded by markers. `prev` points to the
 * previous marker, `next` to the next marker if one exists. */
struct holey_marker {
	size_t prev;
	size_t next; /* bit 0 is 0 if frame used, 1 if free */
};

static inline struct holey_stack create_holey_stack(size_t initial_size)
{
	/* arbitrary, but at least enough that we can populate the initial marker */
	assert(initial_size > sizeof(max_align_t));
	struct holey_stack stack = {
		.sp = 0,
		.size = initial_size,
		.buf = malloc(initial_size)
	};

	struct holey_marker *marker = (struct holey_marker *)stack.buf;
	marker->prev = 0;
	marker->next = 0;
	return stack;
}

/* ensures fastest possible alignment, at the cost of a bit extra stack usage */
static inline size_t max_align(size_t n)
{
	return n + (sizeof(max_align_t) - (n % sizeof(max_align_t)));
}

static inline size_t holey_alloc(struct holey_stack *stack, size_t bytes)
{
	size_t actual_bytes = max_align(bytes);
	size_t marker_bytes = max_align(sizeof(struct holey_marker));
	size_t next_sp = stack->sp + actual_bytes + marker_bytes;

	struct holey_marker *cur_marker = (struct holey_marker *)(stack->buf + stack->sp);
	cur_marker->next = next_sp;

	/* skip marker at start of allocation */
	size_t next_size = next_sp + marker_bytes;
	while (next_size >= stack->size)
		stack->buf = realloc(stack->buf, stack->size *= 2);

	struct holey_marker *next_marker = (struct holey_marker *)(stack->buf + next_sp);
	next_marker->prev = stack->sp;
	next_marker->next = 0;

	stack->sp = next_sp;
	return next_size;
}

static inline void holey_free(struct holey_stack *stack, size_t loc)
{
	size_t marker_bytes = max_align(sizeof(struct holey_marker));
	struct holey_marker *marker = (struct holey_marker *)(stack->buf + loc - marker_bytes);
	assert((marker->prev & 1) == 0);

	/* mark free */
	marker->next |= 1;

	/* jump sp down to first used frame again */
	marker = (struct holey_marker *)(stack->buf + stack->sp);
	while (marker->next & 1)
		marker = (struct holey_marker *)(stack->buf + marker->prev);

	stack->sp = (int8_t *)marker - stack->buf;
}

static inline void destroy_holey_stack(struct holey_stack *stack)
{
	free(stack->buf);
}

/* ----- START OF "GENERATED" CODE ----- */

typedef const char *err_t;
typedef err_t (*int_cb_t)(struct holey_stack *, int n, size_t ctx);

/* common context for fib and its closures */
struct fib_ctx {
	int n;
	int f1;
	void *res;
	size_t res_ctx;
};

static err_t main_closure0(struct holey_stack *stack, int res, size_t ctx)
{
	/* just print result */
	printf("%d\n", res);

	/* exit from main */
	holey_free(stack, ctx);

	/* NULL indicates no error, otherwise we could return some message */
	return NULL;
}

static err_t holey_fib(struct holey_stack *stack, int n, void *res, size_t res_ctx);

static err_t holey_fib_closure1(struct holey_stack *stack, int f2, size_t ctx_loc)
{
	struct fib_ctx *ctx = (struct fib_ctx *)(stack->buf + ctx_loc);
	/* technically speaking UB since the frame might get ripped out from under
	 * us but works for this simple example */
	holey_free(stack, ctx_loc);
	return ((int_cb_t)ctx->res)(stack, ctx->f1 + f2, ctx->res_ctx);
}

static err_t holey_fib_closure0(struct holey_stack *stack, int f1, size_t ctx_loc)
{
	struct fib_ctx *ctx = (struct fib_ctx *)(stack->buf + ctx_loc);
	ctx->f1 = f1;
	return holey_fib(stack, ctx->n - 2, holey_fib_closure1, ctx_loc);
}

static err_t holey_fib(struct holey_stack *stack, int n, void *res, size_t res_ctx)
{
    /* allocate new context when entering fib at the top level */
	size_t ctx_loc = holey_alloc(stack, sizeof(struct fib_ctx));
	struct fib_ctx *ctx = (struct fib_ctx *)(stack->buf + ctx_loc);

	if (n <= 2) {
        /* exiting top level for good, mark context free */
		holey_free(stack, ctx_loc);
		return ((int_cb_t)res)(stack, 1, res_ctx);
	} else {
        ctx->n = n;
        ctx->res = res;
        ctx->res_ctx = res_ctx;
        return holey_fib(stack, n - 1, holey_fib_closure0, ctx_loc);
    }
}

int main()
{
	struct holey_stack stack = create_holey_stack(4096);
	size_t ctx = holey_alloc(&stack, 1); /* 'sentinel' of sorts */

	/* fib(42) */
	holey_fib(&stack, 42, main_closure0, ctx);

	destroy_holey_stack(&stack);
}

Compiled with gcc -O2 (on amd64, at least), all holey_fib, holey_fib_closure0 and holey_fib_closure1 calls are converted into tail calls, so the call stack effectively doesn’t grow at all, O(1) memory usage. holey_stack usage matches the memory usage of a ‘regular’ fib implementation, approx. O(2n). This can be a bit more tricky to see, but we can maybe think of it like this:

When fib produces a value, it’s at the current top of the call stack. The produced value is either a constant 1 or it depends on fib(n - 1) or fib(n - 2). In either case, fib can mark its context to be freed.
Since fib(n - 1) produced a value, it must’ve called res() and marked its context to be freed. The same goes for fib(n - 2) and so on all the way down to the base case of n <= 2.
Therefore, each of fib(n - 1), fib(n - 2), ..., etc. in the call ‘stack’ (remember, they’re all tail calls, so no actual stack usage) has marked its context to be destroyed. Since we’re at the top of the call stack, and we want to destroy our context, all contexts in the call stack get destroyed, without having to climb back up the call stack!

Error handling of course complicates things a bit. If there’s a function in the call stack that wasn’t able to be tail called, due to an error block for example, we can only destroy contexts up to whatever function didn’t do a tail call. So the above solution is not a silver bullet by any means, and you can still easily write code that runs out of some kind of stack space (call/data, whatever), but at least this indicates that it’s possible to implement some highly recursive algorithms in fwd with approximately the same memory usage as conventional languages.

Performance is of course not as good as with a ’native’ fib. On my machine the above example runs in ~1.62s (-O2). A ’typical’ fib (with inlining disabled, GCC is scarily smart and it would be a bit apples to oranges otherwise) runs in about ~0.35s (-O2), so there definitely is a significant speed decrease. I’m still cautiously optimistic that this is fast enough (for some arbitrary definition of ’enough’), and that this can still be optimized a bit further. For instance, at the moment I recalculate the stack frame location in every closure, but in theory we could just set a maximum stack size (like Java) and save a bit on some arithmetic instructions. Messing about with the holey_stack is also not strictly necessary for leaf functions, among others (probably) and with some luck, fib is something of a worst case. Maybe, who knows.

Addendum 2025-03-27

By making the stack size fixed, using a slightly more clever linked list of used frames and passing the stack around ‘by value’, I managed to approximately halve the execution time, down to ~0.88s from ~1.62s. I don’t particularly like the idea of a completely fixed stack, so I also played around with allocating arenas dynamically for stacks, which brought the performance down to ~1.22s. Somewhat faster than the original approach, and pointers into the stack are now stable, which is nice.

Here’s the code for dynamic arenas, to get ~0.88s performance just comment out the bits I’ve marked with snip:

struct holey_stack {
	int8_t *sp;
};

/* prev points to the previous *used* marker *next* is equivalent */
struct holey_marker {
	struct holey_marker *prev;
	struct holey_marker *next;
};

/* 2M stack size, fairly common huge page size so kernel could theoretically
 * give us hugepages. 4M or 8M would also be possible. */
#define STACK_2M 2097152
static inline void *holey_stack_base(void *sp)
{
	/* since we allocate aligned regions, we can always find the start of
	 * allocation by just masking out the lowest bits */
	return (void *)((uintptr_t) sp & ~(STACK_2M - 1));
}

static inline struct holey_stack create_holey_stack()
{
	/* theoretically we could madvise(MADV_HUGEPAGE) but doesn't seem to
	 * make a significant difference */
	return (struct holey_stack){
		.sp = aligned_alloc(STACK_2M, STACK_2M),
	};
}

static inline void destroy_holey_stack(struct holey_stack *stack)
{
	free(holey_stack_base(stack->sp));
}

/* ensures fastest possible alignment, at the cost of a bit extra stack usage */
static inline size_t max_align(size_t n)
{
	return n + (sizeof(max_align_t) - (n % sizeof(max_align_t)));
}

static inline void *holey_alloc(struct holey_stack *stack, size_t bytes)
{
	size_t actual_bytes = max_align(bytes + sizeof(struct holey_marker));
	int8_t *next_sp = stack->sp + actual_bytes;

	/* --- snip if you don't want dynamic stacks -- */

	/* if we can't fit into the same stack arena, just allocate a new one */
	if (holey_stack_base(next_sp) != holey_stack_base(stack->sp))
		next_sp = aligned_alloc(STACK_2M, STACK_2M) + sizeof(max_align_t);

	/* -- snip --- */

	struct holey_marker *cur_marker = (struct holey_marker *)stack->sp - 1;
	struct holey_marker *next_marker = (struct holey_marker *)next_sp - 1;
	next_marker->prev = cur_marker;
	cur_marker->next = next_marker;

	void *frame = stack->sp;
	stack->sp = next_sp;
	return frame;
}

static inline void holey_free(struct holey_stack *stack, void *loc)
{
	struct holey_marker *marker = (struct holey_marker *)loc - 1;
	struct holey_marker *top = (struct holey_marker *)stack->sp - 1;

	if (marker->next != top) {
		struct holey_marker *next = marker->next;
		struct holey_marker *prev = marker->prev;
		next->prev = prev;
		prev->next = next;

		/* --- snip if you don't want dynamic stacks -- */

		/* if both our pointers are to other arenas, we must be the only
		 * frame left and the arena can be freed */
		void *base = holey_stack_base(marker);
		bool last_in_arena = (base != holey_stack_base(next))
			           & (base != holey_stack_base(prev));

		if (last_in_arena)
			destroy_holey_stack(base);

		/* --- snip --- */
		return;
	}

	 void *next_sp = top->prev + 1;

	 /* --- snip if you don't want dynamic arenas --- */

	 bool stack_changed = holey_stack_base(next_sp) != holey_stack_base(stack->sp);
	 if (stack_changed)
		 destroy_holey_stack(stack);

	 /* --- snip --- */

	 stack->sp = next_sp;
}

/* fib code effectively identical to below, omitted here for brevity */

An obvious point of comparison that I realize I forgot to mention is just using malloc()/free() for the function contexts. This has the advantage that now as soon as a function context is not needed, it will stop using memory, instead of possibly being blocked by some other frame higher up the ‘stack’, but means the runtime balloons to ~3.5s. There are ways to improve this, like making all contexts the same size and maintaining a cache of some sort. Functions with contexts greater than the max size could chain multiple contexts together, or the size of a context could be selected such that the largest context in the program fits into a single allocation. I wrote a simple test case and got ~1.26s, so comparable with my arena idea, except with (presumably) somewhat better memory usage.

The main bottleneck seems to be the fact that we have to maintain a pointer to the top of the free list, so the stack has to be passed around by reference and allocation/freeing requires an extra pointer derefence. Using the same trick as before, passing the stack by value and returning the top of the free list us to keep everything in a register bringing the runtime down to ~0.71s. Not too shabby, but sort of makes error handling in fwd a bit more difficult. I’ve been planning on going with a design where functions can return error strings, and having to return both the error and the stack top is a bit tricky. Some ABIs do allow two return values to be transferred in registers, amd64 SYSV being one of them, but is not really a given for all platforms, as far as I’m aware.

Anycase, here’s the ~0.71s code, note how it’s the shortest of the bunch :)

#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>

/* 128 byte frame size by default */
#define FRAME_SIZE 128

struct holey_stack {
	struct holey_stack *prev;
};

struct holey_ret {
	const char *err;
	struct holey_stack stack;
};

static inline struct holey_stack create_holey_stack()
{
	return (struct holey_stack){.prev = NULL};
}

static inline void *holey_alloc(struct holey_stack *stack)
{
	if (stack->prev) {
		struct holey_stack *prev = stack->prev;
		stack->prev = prev->prev;
		return prev + 1;
	}

	struct holey_stack *new = malloc(FRAME_SIZE + sizeof(struct holey_stack));
	new->prev = NULL;
	return new + 1;
}

static inline void holey_free(struct holey_stack *stack, void *loc)
{
	struct holey_stack *freed = (struct holey_stack *)loc - 1;
	freed->prev = stack->prev;
	stack->prev = freed;
}

static inline void destroy_holey_stack(struct holey_stack *stack)
{
	struct holey_stack *prev = stack->prev;
	while (prev) {
		struct holey_stack *cur = prev;
		prev = cur->prev;
		free(cur);
	}
}

/* ----- START OF "GENERATED" CODE ----- */

typedef struct holey_ret (*int_cb_t)(struct holey_stack, int n, void *ctx);

/* common context for fib and its closures */
struct fib_ctx {
	int n;
	int f1;
	void *res;
	void *res_ctx;
};

#define RETURN(s, e) (struct holey_ret){.stack = s, .err = e}

static struct holey_ret main_closure0(struct holey_stack stack, int res, void *ctx)
{
	/* just print result */
	printf("%d\n", res);

	/* exit from main */
	holey_free(&stack, ctx);

	/* NULL indicates no error, otherwise we could return some message */
	return RETURN(stack, NULL);
}

static struct holey_ret holey_fib(struct holey_stack stack, int n, void *res, void *res_ctx);

static struct holey_ret holey_fib_closure1(struct holey_stack stack, int f2, void *ctx_loc)
{
	struct fib_ctx *ctx = ctx_loc;
	/* technically speaking UB since the frame might get ripped out from under
	 * us but works for this simple example */
	holey_free(&stack, ctx_loc);
	return ((int_cb_t)ctx->res)(stack, ctx->f1 + f2, ctx->res_ctx);
}

static struct holey_ret holey_fib_closure0(struct holey_stack stack, int f1, void *ctx_loc)
{
	struct fib_ctx *ctx = ctx_loc;
	ctx->f1 = f1;
	return holey_fib(stack, ctx->n - 2, holey_fib_closure1, ctx_loc);
}

static struct holey_ret holey_fib(struct holey_stack stack, int n, void *res, void *res_ctx)
{
	/* allocate new context when entering fib at the top level */
	void *ctx_loc = holey_alloc(&stack);
	struct fib_ctx *ctx = ctx_loc;

	if (n <= 2) {
		/* exiting top level for good, mark context free */
		holey_free(&stack, ctx_loc);
		return ((int_cb_t)res)(stack, 1, res_ctx);
	} else {
		ctx->n = n;
		ctx->res = res;
		ctx->res_ctx = res_ctx;
		return holey_fib(stack, n - 1, holey_fib_closure0, ctx_loc);
    }
}

int main()
{
	struct holey_stack stack = create_holey_stack();
	/* main context */
	void *ctx = holey_alloc(&stack);

	/* fib(42) */
	struct holey_ret r = holey_fib(stack, 42, main_closure0, ctx);

	destroy_holey_stack(&r.stack);
}

Addendum 2025-03-28

I mulled over yesterday’s results and wrote one final version that doesn’t do error handling and passes arguments on the stack, to closer match what might ‘actually’ get generated by my hypotethical fwd -> c compiler. Passing arguments on the stack is important because I’m using [[clang::musttail]], which in theory guarantees that a call is turned into a tail-call, even without optimizations. This attribute requires that the function signature (along with a few other things) is identical between the caller and callee, which was not the case with previous approaches. musttail is sort of the lowest common denominator of tail-calls, and depending on the compiler optimizations and architecture, tail-calls can happen in more relaxed contexts as well, but we would really prefer a strong guarantee.

Anyway, clang supports the attribute since version 13, and gcc is apparently getting support for it in version 15, and is ignored in earlier versions. I went ahead and compiled the development version of gcc to test the attribute out, and I got the following results:

Compiler and flags	Runtime (s)
`clang-19 -O0`	5.317
`clang-19 -O1`	1.402
`clang-19 -O2`	1.402
`gcc-dev -O1`	1.456
`gcc-dev -O2`	0.733

Looks like we more or less match the performance champion. Not quite sure how, since I would expect argument passing via memory would we significantly slower, but I guess gcc-dev was able to find some clever trick somewhere to speed things up. I had a look at the generated binary and noticed it was significantly longer, but I’m not really clever enough to figure out what the code is actually doing, unfortunately. gcc-dev -O0 rejected the code outright with an error of

address of caller arguments taken

Not quite sure what this means, I’m guessing it has something to do with function arguments possibly being on the stack and the tail-call checker is a bit too eager to reject the code. Works on clang at least, and adding any optimization flags clears this hurdle so I’ll assume it’s a bug?

Below is the final version of the code. This code is a bit more difficult to read, being closer to something generated but the core difference is just that we allocate a new frame for ‘fresh’ function calls and pass arguments via this frame into the function. Closures can reuse the frame for their own arguments. Interestingly, the 0.73 runtime was achieved by changing the struct holey_stack argument passed to functions to just be a typedef void *. Not quite sure why.

This approach does kind of have the drawback that a new function call must by necessity allocate a new frame, with some of the previous approaches the function could itself decide if that was necessary, but this seems to not have much significance.

#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <stddef.h>
#include <stdio.h>

/* 128 byte frame size by default */
#define FRAME_SIZE 128

typedef void *stack_t;

static inline stack_t create_holey_stack()
{
	return NULL;
}

static inline void *holey_alloc(stack_t *top)
{
	if (*top) {
		stack_t *prev = *top;
		*top = *prev;
		return prev + 1;
	}

	stack_t *new = malloc(FRAME_SIZE + sizeof(stack_t));
	*new = NULL;
	return new + 1;
}

static inline void holey_free(stack_t *stack, void *loc)
{
	stack_t *freed = (stack_t *)loc - 1;
	*freed = *stack;
	*stack = freed;
}

static inline void destroy_holey_stack(stack_t *stack)
{
	stack_t *prev = *stack;
	while (prev) {
		stack_t *cur = prev;
		prev = *cur;
		free(cur);
	}
}

/* ----- START OF "GENERATED" CODE ----- */
#define CONTAINER_OF(ptr, type, member) \
	(type *)((char *)(ptr) - offsetof(type, member))

typedef stack_t (*fwd_call_t)(stack_t stack, void *ctx);

/* common context for fib and its closures */
struct int_args {
	int a0;
	fwd_call_t a1;
	void *a1_ctx;
};

struct int_arg {
	int a0;
};

struct void_arg {
};

struct fib_ctx {
	struct int_args args_fib;
	struct int_arg args_fib_closure0;
	struct int_arg args_fib_closure1;

	int f1;
};

struct main_ctx {
	struct void_arg args;
	struct int_arg args_main_closure0;
};

static stack_t main_closure0(stack_t stack, void *args)
{
	struct main_ctx *ctx = CONTAINER_OF(args, struct main_ctx, args);
	/* just print result */
	printf("%d\n", ctx->args_main_closure0.a0);

	/* exit from main */
	holey_free(&stack, ctx);
	return stack;
}

static stack_t holey_fib(stack_t stack, void *args);

static stack_t holey_fib_closure1(stack_t stack, void *args)
{
	struct fib_ctx *ctx = CONTAINER_OF(args, struct fib_ctx, args_fib_closure1);

	fwd_call_t res = ctx->args_fib.a1;
	struct int_arg *res_ctx = ctx->args_fib.a1_ctx;
	res_ctx->a0 = ctx->f1 + ctx->args_fib_closure1.a0;
	holey_free(&stack, ctx);
	[[clang::musttail]] return res(stack, res_ctx);
}

static stack_t holey_fib_closure0(stack_t stack, void *args)
{
	struct fib_ctx *ctx = CONTAINER_OF(args, struct fib_ctx, args_fib_closure0);
	ctx->f1 = ctx->args_fib_closure0.a0;

	struct int_args *fib_ctx = holey_alloc(&stack);
	fib_ctx->a0 = ctx->args_fib.a0 - 2;
	fib_ctx->a1 = holey_fib_closure1;
	fib_ctx->a1_ctx = &ctx->args_fib_closure1;
	[[clang::musttail]] return holey_fib(stack, fib_ctx);
}

static stack_t holey_fib(stack_t stack, void *args)
{
	struct fib_ctx *ctx = CONTAINER_OF(args, struct fib_ctx, args_fib);
	if (ctx->args_fib.a0 <= 2) {
		/* exiting top level for good, mark context free */
		fwd_call_t res = ctx->args_fib.a1;
		struct int_arg *res_ctx = ctx->args_fib.a1_ctx;
		res_ctx->a0 = 1;
		holey_free(&stack, ctx);
		[[clang::musttail]] return res(stack, res_ctx);
	} else {
		struct int_args *fib_ctx = holey_alloc(&stack);
		fib_ctx->a0 = ctx->args_fib.a0 - 1;
		fib_ctx->a1 = holey_fib_closure0;
		fib_ctx->a1_ctx = &ctx->args_fib_closure0;
		[[clang::musttail]] return holey_fib(stack, fib_ctx);
    }
}

static stack_t holey_main(stack_t stack, void *args)
{
	struct main_ctx *ctx = CONTAINER_OF(args, struct main_ctx, args);
	/* fib(42) */
	struct int_args *fib_ctx = holey_alloc(&stack);
	fib_ctx->a0 = 42;
	fib_ctx->a1 = main_closure0;
	fib_ctx->a1_ctx = &ctx->args_main_closure0;
	[[clang::musttail]] return holey_fib(stack, fib_ctx);
}

int main()
{
	stack_t stack = create_holey_stack();
	void *main_ctx = holey_alloc(&stack);
	stack = holey_main(stack, main_ctx);
	destroy_holey_stack(&stack);
}

Granted, with tools like ulimit we can increase the stack size, but it’s rather silly and doesn’t really solve our problem in the general case. ↩︎
C++ lambdas are effectively just anonymous objects and must be destroyed after use, and as such can’t be the target of a tail call. The comments in the code should be read more like “this is where we would like to do a tail call” versus “this is where C++ can do a tail call.” Currently fwd outputs C++, just due to convenience, but due to the above limitation I’ll probably move over to a C generator at some point to get more control. ↩︎

#fwd