Friday, October 7, 2011

C++ Source-to-Source Translation

I've been on annual leave this week, so I've taken the opportunity to do some work on cmonster. I've added preliminary support for source-to-source translation by introducing a wrapper for Clang's "Rewriter" API. My fingers have been moving furiously, so it's all a bit rough, but it does work.

The API flow is:

  1. Parse translation unit, returning an Abstract Syntax Tree (AST).
  2. Walk AST to find bits of interest.
  3. Insert/replace/erase text in the original source, using the location stored in each declaration/statement/token.
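
To make step 3 concrete, here is a minimal sketch of the idea in plain Python. It is not cmonster's or Clang's API: the TextRewriter class and its method names are made up for illustration, and the offsets are found with str.index rather than taken from an AST. It only shows how edits keyed by source location can be queued and then applied to the original text.

```python
class TextRewriter:
    """Hypothetical stand-in for a rewriter: queues edits by character offset."""

    def __init__(self, source):
        self.source = source
        self.edits = []  # list of (start, end, replacement)

    def insert(self, offset, text):
        self.edits.append((offset, offset, text))

    def replace(self, start, end, text):
        self.edits.append((start, end, text))

    def erase(self, start, end):
        self.edits.append((start, end, ""))

    def result(self):
        # Apply edits from the end of the text backwards, so that the
        # offsets of earlier edits are still valid when we reach them.
        # Overlapping edits are not handled in this sketch.
        text = self.source
        for start, end, replacement in sorted(
                self.edits, key=lambda e: e[0], reverse=True):
            text = text[:start] + replacement + text[end:]
        return text


source = "int f() { return 1; }"
rw = TextRewriter(source)
# In a real tool these offsets would come from the AST, not from str.index.
rw.insert(source.index("{") + 1, ' puts("enter");')
rw.insert(source.index("return"), 'puts("exit"); ')
print(rw.result())
# prints: int f() { puts("enter"); puts("exit"); return 1; }
```

Applying the edits back to front is the design choice worth noting: every location refers to the original source, so nothing needs adjusting as text is inserted.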

Motivating Example

Logging is a necessary evil in complex software, especially when said software is running on a customer's system, inaccessible to you. To make problem determination easier, we want a decent amount of information: file names, line numbers, function names, time of day, thread ID, ... but all of this comes at a cost. I'm not talking just cost in terms of CPU usage, though that is a major concern. I'm talking cost in terms of source code quality and maintainability.

We'll start off with a trivial C program:

int main(int argc, const char *argv[])
{
    if (argc % 2 == 0)
    {
        return 1;
    }
    else
    {
        return 0;
    }
}

Let's say our needs are fairly humble: we just want to log the entry and exit of this function. Logging entry is easy: add a blob of code at the top of the function. We can get the function name and line number using __func__ (C99, C++11) and __LINE__. What about __func__ in C89? C++98? There are various alternatives, but some compilers have nothing. And that makes writing a cross-platform logging library a big old PITA. The information is there in the source code - if only we could get at it! In the end, we're more likely to forego DRY, and just reproduce the function name as a string literal.

Getting the function name and line number isn't a huge problem, but how about adding function exit logging? Now we're going to have to insert a little bit of code before our return statements. So we'll have something like:

#include <stdio.h>

int main(int argc, const char *argv[])
{
    const char *function = "main";
    printf("Entering %s:%s:%d\n", function,
           __FILE__, __LINE__);
    if (argc % 2 == 0)
    {
        printf("Leaving %s:%s:%d\n", function,
               __FILE__, __LINE__);
        return 1;
    }
    else
    {
        printf("Leaving %s:%s:%d\n", function,
               __FILE__, __LINE__);
        return 0;
    }
}

Ugh. And that's just the start. It gets much nastier when we need to turn logging on/off at runtime, filter by function name, etc. We could make it much nicer with a variadic macro. Something like LOG(format...), which calls a varargs function with the 'function' variable, __FILE__, __LINE__ and the format and arguments you specify. Unfortunately variadic macros are not supported by some older compilers. The first version of Visual Studio to support them was Microsoft Visual Studio 2005. So there goes that idea...

Hmmm, what to do, what to do? Wouldn't it be nice if we could just tag a function as requiring entry/exit logging, and have our compiler toolchain do the work? Entry/exit logging is the sort of thing you want to be consistent, so it should suffice to define one set of rules that covers all functions. Let's take a little peek at what we could do with cmonster.

First, we'll parse the source to get an AST. We'll locate all functions defined in the main file, and insert an "Entry" logging statement at the beginning of the body, and an "Exit" logging statement before each return statement in the body. At the end we dump the rewritten source to stdout, and we have a program, with logging, ready to be compiled.


Tada! Running this, we're given:

#include <stdio.h>
int main(int argc, const char *argv[])
{
    printf("Entering main at line 2\n");
    if (argc % 2 == 0)
    {
        printf("Returning from main at line 6\n");
        return 1;
    }
    else
    {
        printf("Returning from main at line 10\n");
        return 0;
    }
}

Future Work

What we can't do yet is insert, replace, erase or modify declarations or statements directly in the AST, and have that reflected as a text insertion/replacement/erasure. For example, maybe I want to rename a function? Why can't I just go "function_declaration.name = 'new_name'"? At the moment we'd need to replace the text identified by a source range... a bit clunky and manual. So I may add a more direct API at a later stage. It should be doable, but may be a lot of work.
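
As a rough illustration of what such a direct API might look like, here is a sketch in plain Python. Everything in it is hypothetical: FunctionDecl, apply_edits and the edits dictionary are invented for this example and are not part of cmonster. The point is only that assigning to a node attribute can be turned into a text replacement over the range the name occupies.

```python
class FunctionDecl:
    """Hypothetical AST node that knows where its name sits in the source."""

    def __init__(self, name, name_start, edits):
        self._name = name
        # Range of the name in the ORIGINAL source; it never changes.
        self._range = (name_start, name_start + len(name))
        self._edits = edits  # shared dict: (start, end) -> replacement

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, new_name):
        # Record the rename as a replacement over the original name range.
        # Renaming twice simply overwrites the earlier replacement.
        self._edits[self._range] = new_name
        self._name = new_name


def apply_edits(source, edits):
    # Work backwards so earlier offsets remain valid.
    for (start, end), replacement in sorted(edits.items(), reverse=True):
        source = source[:start] + replacement + source[end:]
    return source


source = "int old_name(void) { return 0; }"
edits = {}
# A real tool would take this offset from the declaration's source location.
decl = FunctionDecl("old_name", source.index("old_name"), edits)
decl.name = "new_name"
print(apply_edits(source, edits))
# prints: int new_name(void) { return 0; }
```

This only touches the declaration itself. A real rename would also have to find and rewrite every reference to the function, which is part of why it may be a lot of work.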

Also, the Visitor class defined in the above example could be called minimal at best. If there were any statements in the code that weren't handled by our visitor, the translation program would barf. I'll eventually build a complete Visitor class into cmonster to be reused. This should make writing translation tools a breeze; in our example, we would just override "visit_ReturnStatement" in the visitor.
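
The "complete visitor with overridable hooks" idea can be seen in Python's own standard ast module, used here purely as an analogy. This is not cmonster and it walks Python source, not C. ast.NodeVisitor supplies a generic fallback for every node type, so a subclass only needs to define the one method it cares about, in this case visit_Return.

```python
import ast


class ReturnFinder(ast.NodeVisitor):
    """Collects the line number of every return statement."""

    def __init__(self):
        self.return_lines = []

    def visit_Return(self, node):
        self.return_lines.append(node.lineno)
        # Keep walking, in case the returned expression has nested nodes.
        self.generic_visit(node)


SOURCE = """\
def parity(argc):
    if argc % 2 == 0:
        return 1
    else:
        return 0
"""

finder = ReturnFinder()
# Node types without a visit_* method fall through to generic_visit,
# so the if/else and the comparison are traversed without any extra code.
finder.visit(ast.parse(SOURCE))
print(finder.return_lines)
# prints: [3, 5]
```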

Now, I think it's about time I learnt Go.