metastuff

Making things 3x faster by disabling SSE

This is a quickie, continuing from my previous post. I went ahead and implemented the approach I found most promising there, and fwd can now compile some very simple recursive programs such that the binary doesn’t run out of memory. However, when comparing the generated C to my hand-written equivalent, I was surprised to see a ~3x performance slowdown.

Here’s what I’m trying to compile:

fib(i64 n, (i64) res)
{
	if n <= 2 {
		res(1);
	} else {
		fib(n - 1) => i64 f1;
		fib(n - 2) => i64 f2;
		res(f1 + f2);
	}
}

Compilation is currently a bit cumbersome; from the root directory of the fwd repository, I’m running:

./fwd examples/fib.fwd > /tmp/fib.c
gcc -Iinclude -Ilib -Lmod -Wl,-rpath=mod -O2 -g /tmp/fib.c -lfwdio -o /tmp/fib

Here’s the problematic generated C code for the top-level:

static fwd_stack_t fib_s0(fwd_stack_t stack, fwd_args_t args)
{
  static_assert(FWD_FRAME_SIZE >= sizeof(struct fib_s0_ctx), "context exceeds frame size");
  struct fib_s0_ctx *ctx = fwd_stack_alloc(&stack);
  *((struct fib_s0_params *)&ctx->fib_s0_start) = *((struct fib_s0_params *)args);
  ctx->global_args = args;
  if (ctx->n_s1 <= 2)
  {
    fwd_call_t fib_s0_call0 = ctx->res_s1.call;
    struct fib_s0_call0 *fib_s0_call0_ctx = ctx->res_s1.args;
    fib_s0_call0_ctx->a0 = 1;
    fwd_stack_free(&stack, ctx);
    FWD_MUSTTAIL return fib_s0_call0(stack, fib_s0_call0_ctx);
  }
  else
  {
    struct fib_s0_call1 *fib_s0_call1 = ctx->global_args;
    fib_s0_call1->a0 = ctx->n_s1 - 1;
    fib_s0_call1->a1 = (fwd_closure_t){fib_s0_closure2, &ctx->fib_s0_closure2_start};
    FWD_MUSTTAIL return fib_s0(stack, fib_s0_call1);
  }
}

Granted, not the easiest to read, so I’ll just highlight the most interesting line:

*((struct fib_s0_params *)&ctx->fib_s0_start) = *((struct fib_s0_params *)args);

This compiles down to these two instructions:

movdqu (%rbx),%xmm0      # 16-byte unaligned load of the parameters
movups %xmm0,0x8(%rdi)   # 16-byte unaligned store into the new frame

Cool, two vector instructions to copy parameters from the parameter stack to the function call frame. Looks good. Except it isn’t, because the program runs in ~4.72s and profiling with perf reports over 50% of the runtime being spent on these two instructions.

Recompile the same code with -mno-sse, and we get (roughly) these instructions:

mov    %rax,0x8(%rdi)
mov    0x8(%rbx),%rdx
mov    %rdx,0x10(%rdi)
mov    0x10(%rbx),%rsi
mov    %rbx,(%rdi)
mov    %rsi,0x18(%rdi)

This runs in ~1.48s. Is the vector operation unaligned and the CPU handling it poorly? Is it hitting some really unfortunate cache issue? Are vector operations themselves just that slow? No clue — I haven’t looked into this in any more depth, I just wanted to put this out there.

I can sort of work around this by tweaking the C code generator, but I’m not sure whether this should be considered a bug in GCC, my CPU doing something funky, or a tricky case of user error. Collecting more data by playing around with other algorithms would probably be beneficial. If you happen to know, or have a theory as to what’s going on, please do let me know.

#fwd