metastuff

Generic hierarchies in C

I realized that I use two different methods of creating generic trees in C, one based on accessor macros/tagged unions and one based on function pointers, which made me want to compare the two approaches. This blog post goes through my findings in the hope that they may be useful to someone else as well. I’ll mainly focus on performance, but I’ll also briefly touch on usability.

What do I mean by generic hierarchies?

Any situation where you have multiple ‘kinds’ of nodes/components/whatever that each should do something unique, and that have some kind of relationship to other nodes/components/whatever (in other words, form some kind of graph). Examples include abstract syntax trees (ASTs) in programming languages and component trees in system simulators; in my case specifically, ek versus gran.

If you’re unsure what the above means, let’s consider the example of calculating a + b * c the way a computer might. We can think of this as the following steps:

  1. Get the value in c
  2. Get the value in b
  3. Multiply the value we got in step 1 with the value we got in step 2
  4. Get the value in a
  5. Add the value we got in step 4 to the value we got in step 3
  6. Done!
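Written out as straight-line C, the steps above are just the following (a, b and c are assumed to be plain int variables here):

```c
#include <assert.h>

/* the six steps above, one temporary per step */
int eval(int a, int b, int c)
{
  int step1 = c;              /* 1. get the value in c */
  int step2 = b;              /* 2. get the value in b */
  int step3 = step1 * step2;  /* 3. multiply steps 1 and 2 */
  int step4 = a;              /* 4. get the value in a */
  int step5 = step4 + step3;  /* 5. add step 4 to step 3 */
  return step5;               /* 6. done! */
}
```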

This can be represented as a graph (and specifically, a tree), that looks something like the following:

    add
   /   \
  a    mul
      /   \
     b     c

The question now is, how should the above be represented in C? I’m sure there are plenty of methods, but I’ll show the two I’ve used and will be comparing.

Luckily, the two methods provide an identical API, so the above graph can be implemented as follows, and the rest of this example just boils down to how the following should be implemented:

struct node *program =
  gen_add(
    gen_var(a),
    gen_mul(
        gen_var(b),
        gen_var(c)));

run(program);

Method 1: Accessor macros into a tagged ‘union’

Roughly, the idea is that all nodes use one fat structure with all the fields that might be needed, plus an enum that tells us what kind of node is being accessed. Essentially a tagged union with some assertions. Note ‘union’ in quotes, because I typically don’t use C unions and just let all the fields exist side by side in the struct. This makes the type definition a bit easier to read and generally doesn’t waste any significant amount of space.

Code probably explains this best:

struct node {
  /* all the different kinds a node can be */
  enum kind {
    NODE_ADD,
    NODE_MUL,
    NODE_VAR
  } kind;

  /* generic names in struct */
  struct node *n0;
  struct node *n1;

  /* assuming struct var is defined somewhere else, it's not important for this
   * example */
  struct var *v;
};

/* generic accessors that check we're accessing the kind of node that we think
 * we are. Note that this uses GCC extensions, and a ref/deref trick to make
 * the elements read/write. */
#define get_n0(k, x)  (*({assert((x)->kind == (k)); &(x)->n0;}))
#define get_n1(k, x)  (*({assert((x)->kind == (k)); &(x)->n1;}))
#define get_var(k, x) (*({assert((x)->kind == (k)); &(x)->v;}))

/* generic node constructor */
static inline struct node *gen_node(enum kind kind,
                                    struct node *n0,
                                    struct node *n1,
                                    struct var *v)
{
  struct node *n = calloc(1, sizeof(struct node));
  if (!n)
    return NULL;

  n->kind = kind;
  n->n0 = n0;
  n->n1 = n1;
  n->v = v;
  return n;
}

/* specific constructors and accessors, each node type defines
 * its own macros and associates a specific name with a generic name
 * (left -> n0, right -> n1, etc.) */
#define gen_add(left, right) gen_node(NODE_ADD, (left), (right), NULL)
#define add_left(x)  get_n0(NODE_ADD, (x))
#define add_right(x) get_n1(NODE_ADD, (x))

/* similarly for mul/var */


/* implementing the functionality of a node, in this case running the
 * calculation that this tree represents */

/* predeclare */
static int run(struct node *n);

static int run_add(struct node *a)
{
  /* get the value of left child node and value of right child node and add them
   * together */
  int l = run(add_left(a));
  int r = run(add_right(a));
  return l + r;
}

/* similarly for mul/var */

/* implementation of main run function */
static int run(struct node *n)
{
  switch (n->kind) {
    case NODE_ADD: return run_add(n);
    case NODE_MUL: return run_mul(n);
    case NODE_VAR: return run_var(n);
    default: abort(); /* or some other error handling */
  }

  return 0;
}
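Putting the pieces together, here’s a compact, self-contained sketch of method 1, assuming struct var simply wraps an int and a GCC-compatible compiler (for the statement expressions):

```c
#include <assert.h>
#include <stdlib.h>

enum kind { NODE_ADD, NODE_MUL, NODE_VAR };

/* for this sketch, a var just wraps an int */
struct var { int value; };

struct node {
  enum kind kind;
  struct node *n0, *n1;
  struct var *v;
};

/* checked accessors, same GCC ref/deref trick as above */
#define get_n0(k, x)  (*({assert((x)->kind == (k)); &(x)->n0;}))
#define get_n1(k, x)  (*({assert((x)->kind == (k)); &(x)->n1;}))
#define get_var(k, x) (*({assert((x)->kind == (k)); &(x)->v;}))

static struct node *gen_node(enum kind kind, struct node *n0,
                             struct node *n1, struct var *v)
{
  struct node *n = calloc(1, sizeof *n);
  if (!n)
    return NULL;
  n->kind = kind;
  n->n0 = n0;
  n->n1 = n1;
  n->v = v;
  return n;
}

#define gen_add(l, r) gen_node(NODE_ADD, (l), (r), NULL)
#define gen_mul(l, r) gen_node(NODE_MUL, (l), (r), NULL)
#define gen_var(x)    gen_node(NODE_VAR, NULL, NULL, (x))

/* run_add/run_mul/run_var folded into the switch for brevity */
static int run(struct node *n)
{
  switch (n->kind) {
  case NODE_ADD: return run(get_n0(NODE_ADD, n)) + run(get_n1(NODE_ADD, n));
  case NODE_MUL: return run(get_n0(NODE_MUL, n)) * run(get_n1(NODE_MUL, n));
  case NODE_VAR: return get_var(NODE_VAR, n)->value;
  default: abort();
  }
}
```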

The neat thing about this method is that functionality is somewhat orthogonal to the data representation. Adding new functionality is very straightforward; for example, if we want to destroy the tree, we can just write:

static void destroy_tree(struct node *n)
{
  if (!n)
    return;

  /* recursively destroy tree, since we don't have to care what the exact type
   * of this node is */
  destroy_tree(n->n0);
  destroy_tree(n->n1);

  /* I'll assume that var is just a reference and freed somewhere else */
  free(n);
}

The drawbacks are that the size of each node is the ‘total’ of all the individual node kinds, so every node pays a memory penalty and carries around extra baggage. Additionally, the accessor macros can be a bit verbose and may cause a performance hit, since every property access has to check an assertion. If the user is absolutely certain they know what they’re doing, they can use n0/n1/etc. (or named macros without assert) directly, although that opens up the possibility of confusing one node type for another and reading garbage data. The implementation above also uses the GCC Statements and Declarations in Expressions extension, so it is not really portable C. You can define inline functions that do roughly the same thing, but it’s a bit more verbose.
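For reference, a portable sketch of that inline-function alternative (with the var field dropped for brevity). Without statement expressions there’s no single read/write lvalue, so each field needs a checked getter/setter pair:

```c
#include <assert.h>

enum kind { NODE_ADD, NODE_MUL, NODE_VAR };

struct node {
  enum kind kind;
  struct node *n0;
  struct node *n1;
};

/* read access, checked against the expected kind */
static inline struct node *get_n0(enum kind k, struct node *x)
{
  assert(x->kind == k);
  return x->n0;
}

/* write access, checked the same way */
static inline void set_n0(enum kind k, struct node *x, struct node *v)
{
  assert(x->kind == k);
  x->n0 = v;
}
```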

From a documentation/code readability standpoint, having the functionality of a node be split into different parts of a codebase can also make it a bit more difficult to get a full picture of what exactly a node does.

Method 2: Function pointers

This is essentially virtual functions from C++ (or interfaces from Java), but we get to implement them manually. In short, we define a ‘base’ structure that just contains some function pointers. All nodes must then include an instance of this interface, and return a pointer to it during construction.

Example code:

/* predeclare */
struct node;
typedef int (*run_node_t)(struct node *);

/* functions a node can support */
struct node {
  run_node_t run;
};

static inline int run_node(struct node *n)
{
  /* use the appropriate function pointer and pass the node to 'itself', since
   * really only it knows how to read its own structure. */
  assert(n->run);
  return n->run(n);
}

/* each node type is its own C type */
struct add_node {
  /* functionality of this node */
  struct node node;

  /* data of this node, invisible to other nodes */
  struct node *l;
  struct node *r;
};

/* implementation of the run interface */
static int run_add(struct add_node *a)
{
  int l = run_node(a->l);
  int r = run_node(a->r);
  return l + r;
}

static struct node *gen_add(struct node *l, struct node *r)
{
  struct add_node *a = calloc(1, sizeof(struct add_node));
  if (!a)
    return NULL;

  /* since struct node is first in struct add_node, we can directly type swap
   * the pointers (at least in practice) */
  a->node.run = (run_node_t)run_add;
  a->l = l;
  a->r = r;
  return &a->node;
}

/* etc. for mul/var */

This method binds functionality and data representation together more strongly, which means that adding functionality to the tree is a lot more cumbersome, since each node kind has to have its own definition of the functionality. For example, destroying the tree means we have to define destroy_add and destroy_mul and destroy_var, and bind them to their respective nodes during node creation. You might get away with defining a generic visit function for each node and then reusing it for different operations, but unfortunately it’s not a full solution.
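A sketch of what that looks like, assuming the interface struct grows a second function pointer (only the add node shown; here destroy_add takes struct node * directly, which also sidesteps the function-pointer cast):

```c
#include <stdlib.h>

struct node;
typedef void (*destroy_node_t)(struct node *);

/* every operation a node supports grows the interface struct */
struct node {
  destroy_node_t destroy;
};

static void destroy_node(struct node *n)
{
  if (!n)
    return;
  n->destroy(n);
}

struct add_node {
  struct node node;
  struct node *l;
  struct node *r;
};

/* each node kind needs its own destroy implementation... */
static void destroy_add(struct node *n)
{
  struct add_node *a = (struct add_node *)n;
  destroy_node(a->l);
  destroy_node(a->r);
  free(a);
}

/* ...which has to be registered at construction time */
static struct node *gen_add(struct node *l, struct node *r)
{
  struct add_node *a = calloc(1, sizeof *a);
  if (!a)
    return NULL;
  a->node.destroy = destroy_add;
  a->l = l;
  a->r = r;
  return &a->node;
}
```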

On the upside, no extra assertion checks at runtime, and now the compiler can help type check our code more effectively. Accessing the node struct directly also helps keep the code a bit less verbose. On the flipside, function pointers can potentially be difficult for hardware to predict, leading to more pipeline stalls than the equivalent switch + direct function call in method 1.

Note that technically speaking the above is UB, since we call run_add through a function pointer of an incompatible type. A more standards-compliant version would use some kind of container_of macro to convert a struct node * into a struct add_node *, but in practice the cast isn’t really needed. If you know of a system where the above wouldn’t work, please do send me an email!
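For reference, a sketch of that container_of variant, using a simplified var node that stores an int directly. The implementation function now matches the interface typedef exactly, so no function-pointer cast is needed:

```c
#include <stddef.h>
#include <stdlib.h>

/* recover the containing struct from a pointer to one of its members */
#define container_of(ptr, type, member) \
  ((type *)((char *)(ptr) - offsetof(type, member)))

struct node;
typedef int (*run_node_t)(struct node *);

struct node {
  run_node_t run;
};

static int run_node(struct node *n)
{
  return n->run(n);
}

/* simplified leaf node that stores its value directly */
struct var_node {
  struct node node;
  int value;
};

/* matches run_node_t exactly; converts to the concrete type with
 * container_of instead of relying on a pointer cast */
static int run_var(struct node *n)
{
  struct var_node *v = container_of(n, struct var_node, node);
  return v->value;
}

static struct node *gen_var(int value)
{
  struct var_node *v = calloc(1, sizeof *v);
  if (!v)
    return NULL;
  v->node.run = run_var;
  v->value = value;
  return &v->node;
}
```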

Test setup

I wrote a small benchmark that constructs an AST for a matrix multiplication, matching the following bit of C:

static void matrix_mult(intptr_t *A, intptr_t *B, intptr_t *C)
{
	for (size_t i = 0; i < MATRIX_SIZE; ++i)
	for (size_t j = 0; j < MATRIX_SIZE; ++j) {
		intptr_t r = 0;
		for (size_t k = 0; k < MATRIX_SIZE; ++k)
			r += A[i * MATRIX_SIZE + k] * B[k * MATRIX_SIZE + j];

		C[i * MATRIX_SIZE + j] = r;
	}
}

where A, B, and C are 2D matrices of size MATRIX_SIZE * MATRIX_SIZE. I chose intptr_t as the data type so that I can directly return pointers from nodes to store values into the output matrix C.
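To illustrate that choice, a tiny sketch (cell_addr and store are hypothetical helpers, not from the benchmark): the same intptr_t return type can carry either a computed value or the address of an output slot to write through.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a 'cell' node can return the address of a slot in C as an intptr_t */
static intptr_t cell_addr(intptr_t *C, size_t i)
{
  return (intptr_t)&C[i];
}

/* and a 'store' node can write a computed value through that address */
static void store(intptr_t slot, intptr_t value)
{
  *(intptr_t *)slot = value;
}
```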

The full testbench code can be found over in my playground. See main.c for the full node tree, struct.c for the equivalent of method 1 and fptr.c for the equivalent of method 2. MATRIX_SIZE is defined to be 100.

I benchmarked struct and fptr with the hyperfine tool, once with assertions enabled and once with assertions disabled.

As secondary points of comparison, I also noted how many lines of code it took to implement each method with cloc and how much memory the tree structure used up with valgrind, setting MATRIX_SIZE to 0.

Results

Runtime

With assertions enabled:

$ hyperfine --warmup 10 ./fptr ./struct
Benchmark 1: ./fptr
  Time (mean ± σ):      41.9 ms ±   1.4 ms    [User: 41.3 ms, System: 0.7 ms]
  Range (min … max):    39.5 ms …  44.8 ms    69 runs

Benchmark 2: ./struct
  Time (mean ± σ):      60.5 ms ±   1.9 ms    [User: 60.3 ms, System: 0.2 ms]
  Range (min … max):    56.1 ms …  63.4 ms    47 runs

Summary
  ./fptr ran
    1.44 ± 0.07 times faster than ./struct

With -DNDEBUG, i.e. assertions disabled:

$ hyperfine --warmup 10 ./fptr ./struct
Benchmark 1: ./fptr
  Time (mean ± σ):      39.6 ms ±   1.1 ms    [User: 39.1 ms, System: 0.5 ms]
  Range (min … max):    37.8 ms …  42.7 ms    74 runs

Benchmark 2: ./struct
  Time (mean ± σ):      71.4 ms ±   1.9 ms    [User: 70.8 ms, System: 0.6 ms]
  Range (min … max):    66.4 ms …  76.3 ms    41 runs

Summary
  ./fptr ran
    1.80 ± 0.07 times faster than ./struct

Line count

$ cloc --by-file struct.c fptr.c
       2 text files.
       2 unique files.
       0 files ignored.

github.com/AlDanial/cloc v 2.04  T=0.01 s (342.1 files/s, 144371.6 lines/s)
-------------------------------------------------------------------------------
File                             blank        comment           code
-------------------------------------------------------------------------------
fptr.c                              92              4            391
struct.c                            72             20            265
-------------------------------------------------------------------------------
SUM:                               164             24            656
-------------------------------------------------------------------------------

Sorry about the slightly off-kilter columns; hugo/html/something doesn’t seem to render the whitespace in monospace, and a quick search online didn’t turn up a fix.

Memory use

$ valgrind ./fptr
==33760== Memcheck, a memory error detector
==33760== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==33760== Using Valgrind-3.24.0 and LibVEX; rerun with -h for copyright info
==33760== Command: ./fptr
==33760==
0
==33760==
==33760== HEAP SUMMARY:
==33760==     in use at exit: 0 bytes in 0 blocks
==33760==   total heap usage: 88 allocs, 88 frees, 3,488 bytes allocated
==33760==
==33760== All heap blocks were freed -- no leaks are possible
==33760==
==33760== For lists of detected and suppressed errors, rerun with: -s
==33760== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
$ valgrind ./struct
==33796== Memcheck, a memory error detector
==33796== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==33796== Using Valgrind-3.24.0 and LibVEX; rerun with -h for copyright info
==33796== Command: ./struct
==33796==
0
==33796==
==33796== HEAP SUMMARY:
==33796==     in use at exit: 0 bytes in 0 blocks
==33796==   total heap usage: 88 allocs, 88 frees, 7,424 bytes allocated
==33796==
==33796== All heap blocks were freed -- no leaks are possible
==33796==
==33796== For lists of detected and suppressed errors, rerun with: -s
==33796== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Summary

I must say, I’m rather surprised that disabling assertions caused a performance regression in struct. Were the assertions guiding the processor’s branch predictor somehow? Or keeping some important things in cache? Don’t know, but regardless it seems that there’s a significant difference in runtime between the two approaches.

Method 2 runs faster and uses less memory, but is more rigid. I would recommend it for when performance is more important than ergonomics, and the number of different kinds of operations any given node is expected to participate in is relatively low. In other cases, method 1 seems to be a bit easier to work with, but unfortunately leaves some performance on the table.

#playground #C