Struct Iteration through (Ab)use of the C Preprocessor

Posted 2015-10-13

Struct Iteration through (Ab)use of the C Preprocessor

A program I've been working involves reading and writing different kinds of C structs to/from disk, and I've been trying to find a good way to reduce the code complexity and maintence work involved in this task. In the course of these experiments, I created a somewhat interesting partial solution I thought would be worth sharing.

At this stage, the structs only consist of primitive elements (int, float, char, etc). Handling nested structs, unions, and pointers would require additional work (that may be the subject of a future post.)

My first approach involved writing separate serializing and deserealizing functions for each struct type. The serialize (pack) function would convert the struct into a buffer of bytes. This buffer could then be written to disk. Later on, those bytes could be read back from the disk, and the deserealize (unpack) routine could convert the raw bytes back into a useful struct. This approach works, but is brittle to change. The functions will need to be updated whenever the struct definition is changed (say by adding or deleting a member, or changing a member's name).

An example would be something like this: (Error checking has been omitted for clarity.)

void pack_struct_a(struct a *sa, unsigned char *buff) {
    memcpy(buff, sa->first, sizeof(sa->first));
    memcpy(buff+sizeof(sa->first), sa->second, sizeof(sa->second));
    ...
}

These functions need to be defined for every struct, involve repetitive code, and are a pain to update when the structure definition changes. Wouldn't it be great if we had a single function that was able to handle any kind of struct with any number of members?

One of the most obvious ways to accomplish this is by writing the memory representation of the struct directly to disk.

write(file, &data_struct, sizeof(data_struct));

The problem with this approach is that compilers add padding between the struct's fields to align the data which increases performance. However, the amount of padding added is implementation defined, so the in-memory representation of structs could differ between different architectures, or different compilers. So a struct written from one build of the program might not be readable to a different build of the same program.

So instead, we should only write out the non-padding bytes, which should be compatible between compilers (assuming proper endian compensation). This is equivalent to writing out the data from all the members of the struct. However, doing this in a programmatic fashion would require iterating over all the members of the struct, an ability that is not built in to the C language. We could expand the iterations for each different kind of struct by hand, but then we'd be right back to the same functions we were originally trying to get rid of.

However, there is a (somewhat awkward) way that iteration over a struct's elements can be accomplished in C. As the title suggests, it involves the C preprocessor.

In order to do this, I've used some tricks that may not be particularly well known, so I will take some time to explain them first before presenting the code. If you are already familiar with these techniques, you can skip to the code.

X Macros

X Macros are a technique that uses the C preprocessor to cut down on the amount of repetitive typing. The X Macro has two parts: a list and a function. You define a list oh invokations of a macro (X for example).

#define LST \
    X(a) \
    X(b) \
    X(c) \
    X(d)

Then, whenever you want to do something involving a,b,c,d you can #define X(n) to your operation, and the list will call your macro on every value in the list. When the operation is done, make sure you #undef X so the compiler won't complain about redefining a macro the next time you do an X operation on the list.

For example this can be used for things like declarations:

#define X(n) int n;
LST
#undef X

Will produce:

int a; int b; int c; int d;

Or even counting the elements in the list:

#define X(n) 1 + 
int len = LST 0;
#undef X

Will produce:

int len = 1 + 1 + 1 + 1 + 0;

Expanding Macro Arguments

Ordinarily, C macro arguments are not evaluated when used in stringification (#), or concatenation (##).

So having something like:

#define CAT dog
#define FOOD hot_##CAT

Would produce hot_CAT, not the hot_dog we wanted. However, this can be bypassed, as the C preprocessor expands a macros arguments before it passes them to a macro. So, to get the result we want, we must create 2 functions: One to expand the macro term, and another to peform the stringification or concatenation.

#define
#define MACRO(A) hot_##A
#define MACRO(A) CAT_NOEXPAND(A)
MACRO(CAT)

Will produce hot_dog like we wanted.

Compound Literals

Compound literals look like a cast and an initializer. We use them here to store the address of our array literals into pointers. Unlike C++, initializers in C are not temporary objects, so it is safe to keep a pointer pointing to them.

Designated Initializers

Designated initializers are a C99 feature that lets you initialize structs by referring to each element by name.

For example:

struct p {int x; int y }

struct p pt  = {.x = 0, .y = 0};

offsetof()

offsetof(type, member) is a standard macro that will determine the offset of a member of a struct type in bytes from the 0th byte of that type.

Code

#ifndef STRUCT_FMT_DEF
    /* One time only definitons */
    #define STRUCT_FMT_DEF
    struct struct_fmt {
        char const *struct_name;
        size_t num_members;
        size_t struct_size;
        size_t packed_size;
        size_t *offsets;
        size_t *sizes;
        char const **names;
    };
#endif

/* Error Checking */

#ifndef STRUCT_NAME
    #error "Did not define STRUCT_NAME before including fmtgen.h"
#endif

#ifndef STRUCT_FIELDS
    #error "Did not define STRUCT_FIELDS before including fmtgen.h"
#endif

#define STR_NOEXPAND(A) #A
#define STR(A) STR_NOEXPAND(A)

#define CAT_NOEXPAND(A, B) A ## B
#define CAT(A, B) CAT_NOEXPAND(A, B)

struct STRUCT_NAME {
    #define X(L, R) L R;
    STRUCT_FIELDS
    #undef X
};

struct struct_fmt CAT(STRUCT_NAME, _fmt) = {

    .struct_name = STR(STRUCT_NAME),

    .num_members = (
        #define X(L, R) 1 +
        STRUCT_FIELDS
        #undef X
    0),

    .struct_size = sizeof(struct STRUCT_NAME),

    .packed_size = (
        #define X(L, R) sizeof(L) +
        STRUCT_FIELDS
        #undef X
    0),

    .offsets = (size_t[]){
        #define X(L, R) offsetof(struct STRUCT_NAME, R),
        STRUCT_FIELDS
        #undef X
    },

    .sizes = (size_t []){
        #define X(L, R) sizeof(L),
        STRUCT_FIELDS
        #undef X
    },

    .names = (char const *[]){
        #define X(L, R) #L ":" #R,
        STRUCT_FIELDS
        #undef X
    },
};

#undef STRUCT_FIELDS
#undef STRUCT_NAME
#undef STR_NOEXPAND
#undef STR
#undef CAT_NOEXPAND
#undef CAT

Notice that only definition of the struct_def struct is the only thing in #include guards. That is because we actually do want the remainder of the header file to be included multiple time. Each it is #included it will declare a new struct_def.

In order to use this, the "caller" must set up the parameters appropriately. This is done by #defining STRUCT_NAME to be the name of the struct, and STRUCT_FIELDS to be an X Macro list of the fields of that struct. Currently, as you can see the macro only accepts a type and a name. This could be exanded in the future. The struct is then generated by "calling" the header by #including it. This will cause the C preprocessor to execute it and will generate the structures we need.

Note that the STRUCT_NAME is a macro, so we must use the trick listed above to stringify and concatenate the STRUCT_NAME to make the name for the struct_fmt and to store the name of the struct as a string.

The code is genrated using the X Macros technique. The X Macros list is the fields of the struct and their types, and the X(L,R) implementations will generate the actual struct for you, and generate the struct_def by counting the number of elements, summing the sizes of each element, storing the size of each element, storing the offset of each element, and storing the names of each element. The information in the struct_fmt will be used in the serializing function. The other information is good for debugging.

Note how the compound literals allow us to assign the generated data directly to the pointers in the struct_fmt struct without having to create an unneeded array, or having to specify the pointer as an array with a size instead.

At the end, we #undef all the macros used during this file. Because the file will be included more than once, we need to be careful about what #including it will leave in the global namespace.

An example of the use of the code is as follows:

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define STRUCT_NAME t
#define STRUCT_FIELDS \
    X(int, a) \
    X(float, b) \
    X(char, c)
#include "fmtgen.h"

#define STRUCT_NAME s
#define STRUCT_FIELDS \
    X(long, l) \
    X(double, g) \
    X(unsigned long long, f) \
    X(int, x)
#include "fmtgen.h"

void
print_buffer(unsigned char *buffer, size_t size)
{
    for (size_t j = 0; j < size; j++) {
        printf(" %02x", buffer[j]);
    }
}

size_t
struct_pack(struct struct_fmt *fmt, void *structure, unsigned char *buffer)
{ 
    size_t pos = 0;

    for (size_t i = 0; i < fmt->num_members; i++) {
        memcpy(buffer+pos, ((unsigned char*)structure)+fmt->offsets[i], fmt->sizes[i]);
        pos += fmt->sizes[i];
    }

    return pos;
}

size_t
struct_unpack(struct struct_fmt *fmt, unsigned char *buffer, void *structure)
{
    size_t pos = 0;

    for (size_t i = 0; i < fmt->num_members; i++) {
        memcpy(((unsigned char*)structure)+fmt->offsets[i], buffer+pos, fmt->sizes[i]);
        pos += fmt->sizes[i];
    }

    return pos;
}

void
struct_print(struct struct_fmt *fmt, void *structure)
{
    printf("%s:\n", fmt->struct_name);
    for (size_t i = 0; i < fmt->num_members; i++) {
        printf("\t%s: %zu %zu =", fmt->names[i], fmt->offsets[i], fmt->sizes[i]);
        print_buffer(((unsigned char*)structure)+fmt->offsets[i], fmt->sizes[i]);
        printf("\n");
    }
}


int
main(int argc, char **argv)
{
    struct t tst = {.a = 5, .b = 0.0f, .c = 'a'};
    struct s sst = {.l = 1024, .g = 0.0, .f = 2048};

    unsigned char tbuff[t_fmt.packed_size];
    struct_pack(&t_fmt, &tst, tbuff);
    printf("t packed:\n\t");
    print_buffer(tbuff, sizeof(tbuff));
    printf("\n");

    unsigned char sbuff[s_fmt.packed_size];
    struct_pack(&s_fmt, &sst, sbuff);
    printf("s packed:\n\t");
    print_buffer(sbuff, sizeof(tbuff));
    printf("\n");

    struct_print(&t_fmt, &tst);
    struct_print(&s_fmt, &sst);

    struct t tst2;
    struct s sst2;

    struct_unpack(&t_fmt, tbuff, &tst2); 
    struct_unpack(&s_fmt, sbuff, &sst2); 

    struct_print(&t_fmt, &tst2);
    struct_print(&s_fmt, &sst2);

    return 0;
}

In this example usage, we show how easy it becomes to serialize any struct with a struct_fmt. We can change our structs, but the pack/unpack functions remain the same. Any changes to the struct will be dealt with automatically by the code generation in the header file.

With the offsets and sizes of each element from the struct_def, we can easily skip all the padding bytes, and only iterate over the actual member data of the struct.

Conclusion

In this post, I specified that we are only working with structs consisting of primitive values at the moment. It would be interesting to see if this approach could be expanded to deal with members that are structs, unions, or pointers.

It would also be interesting to see if there were any other uses for this technique. In the demo code, I showed a function that could print out information about the struct. Perhaps this could be extended further to allow more runtime introspection into the structure of the struct itself.

This has been an interesting exercise, but I wouldn't recommend using this in any actual production code. It is certainly confusing, (arcane even?), and will introduce maintenance hassles of its own, as well as a very high potential for a lot of bugs.