Sunday, October 23, 2011

Writing a Compiler, or Learning a Language the Hard Way

For a while now I've been lamenting the lack of progress in systems programming languages. I work primarily in C++ (professionally, anyway), a language which leaves a lot to be desired. The end result, assuming you know what you're doing, is likely to be fast code. But getting there can be a tedious and precarious journey. Don't get me wrong, I'm not a C++ hater. I just think we can do better.

So for the past few I've been storing away ideas for writing my own language. (Groan, another one.) The vast majority of languages these days seem to be JVM based, or some other VM such as the CLR. This is fine for many applications, but for a lot of systems programming you typically want to be much closer to the hardware. What I want is something with the ease and rapidity of development of, say, Python, with the power of a lower-level language such as C/C++.



The Easy: Ideas

So what are those ideas I'd stored away then? Well to be honest, many of them boil down to syntactic sugar (syntactic caramel?), so here are the more functional ones:

  • Implicit Interfaces. The idea is that classes should be able to implement an interface without explicitly saying so, just by implementing the methods defined in the interface. Poor-man's duck-typing if you will.
  • Link Time Optimisation (LTO) should be the norm. Say you write a function in a library which does everything and the kitchen sink. If the calling program only uses a function from your library in a certain way, then the function should be optimised for that usage.
  • Pure Functions. This kind of fits under LTO. I want simpler meta-programming: it should be possible to mark functions as being "pure", which means (in my terminology) that the function either has no side-effects, or affects only temporary variables. Calls to these functions could then be evaluated at compile time, e.g. to calculate complex mathematical expressions. I guess this just comes under "fancy optimiser"?
  • Pattern Matching. I haven't really fleshed this one out, but I think it would be handy to be able to match functions not just based on signature, but based on the values of the arguments.
  • No preprocessor, no headers. When code is distributed, the modules should be self describing, such as with Java's compiled .class files. This would eliminate problems such as compiling against an old version of an interface, and linking against a new, incompatible implementation.
  • For the love of God, no garbage collection. What can I say? I like my RAII.
  • I'm not really sure what title to give this one, but I wanted to eliminate the problems you get with linking objects together that have been compiled with different options in C++ (e.g. with/out RTTI, exceptions, or perhaps against an entirely different STL/standard library.)
There were some other requirements I had in mind, such as having a single reference implementation à la CPython, and including a parser with the runtime to simplify tool development.


The Hard: Writing a Compiler

Anyone can come up with ideas. Even coming up with practical ideas isn't too hard, but implementing them is something else altogether. Especially so for someone like me who has had no formal education in compilers, or language design/implementation. Hell, my university course didn't even cover formal grammars or parsing. I've picked some things up by myself, but it's a very academic field.

Back when I was in uni, I toyed around with C--, which at the time looked like quite a promising "portable assembly language". This language is (or at least was) used as the intermediate language for the Glasgow Haskell Compiler. My self-taught knowledge of language design/implementation and code generation were not really up to scratch, so I didn't get much further than writing toy languages.

Fast forward a few years, and I've got the itch to implement a compiler again. I've recently been playing around with LLVM. It boasts a relatively easy to use API, which takes out some of the drudgery of writing the "bitcode" (an intermediate code that can be translated to machine code). I've got a bit more knowledge and programming skill behind my belt now, so let's get to work on that language, right!?

I read something recently which is summarised as: learn intimately that which you wish to replace. Now I don't have any illusions of replacing any existing language, but I think the message is still relevant. I've never implemented a language OR a compiler before, and now I'm going to both at once? How about I make this a bit easier on myself, and write a compiler for an existing language. If I still want to implement my own language later, I can use the experience I gained.

I've been looking for an excuse for a while now to learn the Go Programming Language. It has some of the same design goals as my hypothetical language, so it seemed like a good fit.


The Fun: Learning Go and LLVM

I've been learning Go in my precious spare time for the last couple of weeks. There's a nifty web-app which provides an interactive tutorial in Go, called A Tour of Go. It's a bit of fun, and I recommend it to anyone wanting to learn the language.

So anyway, I'm now playing around with writing a Go compiler using LLVM. Why?
  • To learn LLVM, and more generally how to write a compiler and generate code.
  • To learn Go.
  • Potentially to fill a gap in JIT compilation of Go programs.
  • Why not?
Writing the compiler will be made much easier by the fact that the Go runtime includes a Go parser, and someone has already implemented Go bindings for LLVM. I haven't made a great deal of progress yet, but it seems achievable. When I've got something vaguely useful, I'll chuck it over on GitHub.

If you know anything about Go, you'll probably have noticed that at least one of the ideas that I presented above is present in Go: implicit interfaces. I really can't say whether this concept is new or not - it's probably not - but I did at least come up with it independently. Just sayin'!

I'll write a follow-up post when I get some time, describing a bit more about the Go compiler, the challenges I've come across, and my thoughts on how to solve them. 

Friday, October 7, 2011

C++ Source-to-Source Translation

I've been on annual leave this week, so I've taken the opportunity to do some work on cmonster. I've added preliminary support for source-to-source translation by introducing a wrapper for Clang's "Rewriter" API. My fingers have been moving furiously so it's all a bit rough, but it does work.

The API flow is:

  1. Parse translation unit, returning an Abstract Syntax Tree (AST).
  2. Walk AST to find bits of interest.
  3. Insert/replace/erase text in the original source, using the location stored in each declaration/statement/token.

Motivating Example

Logging is a necessary evil in complex software, especially when said software is running on a customer's system, inaccessible to you. To make problem determination easier, we want a decent amount of information: file names, line numbers, function names, time of day, thread ID, ... but all of this comes at a cost. I'm not talking just cost in terms of CPU usage, though that is a major concern. I'm talking cost in terms of source code quality and maintainability.

We'll start off with a trivial C program:

int main(int argc, const char *argv[])
{
    if (argc % 2 == 0)
    {
        return 1;
    }
    else
    {
        return 0;
    }
}

Let's say our needs are fairly humble: we just want to log the entry and exit of this function. Logging entry is easy: add a blob of code at the top of the function. We can get the function name and line number using __func__ (C99, C++11) and __LINE__. What about __func__ in C89? C++98? There's various alternatives, but some compilers have nothing. And that makes writing a cross-platform logging library a big old PITA. The information is there in the source code - if only we could get at it! In the end, we're more likely to forego DRY, and just reproduce the function name as a string literal.

Getting the function name and line number isn't a huge problem, but how about adding function exit logging? Now we're going to have to insert a little bit of code before our return statements. So we'll have something like:

int main(int argc, const char *argv[])
{
    const char *function = "main";
    printf("Entering %s:%s:%d\n", function,
           __FILE__, __LINE__);
    if (argc % 2 == 0)
    {
        printf("Leaving %s:%s:%d\n", function,
               __FILE__, __LINE__);
        return 1;
    }
    else
    {
        printf("Leaving %s:%s:%d\n", function,
               __FILE__, __LINE__);
        return 0;
    }
    return 0;
}

Ugh. And that's just the start. It gets much nastier when we need to turn logging on/off at runtime, filter by function name, etc. We could make it much nicer with a variadic macro. Something like LOG(format...), which calls a varargs function with the 'function' variable, __FILE__, __LINE__ and the format and arguments you specify. Unfortunately variadic macros are not supported by some older compilers. The first version of Visual Studio to support them was Microsoft Visual Studio 2005. So there goes that idea...

Hmmm, what to do, what to do? Wouldn't it be nice if we could just tag a function as requiring entry/exit logging, and have our compiler toolchain to the work? Entry/exit logging is the sort of thing you want to be consistent, so it should suffice to define one set of rules that covers all functions. Let's take a little peek at what we could do with cmonster.

First, we'll parse the source to get an AST. We'll locate all functions defined in the main file, and insert an "Entry" logging statement at the beginning of the body, and an "Exit" logging statement before each return statement in the body. At the end we dump the rewritten source to stdout, and we have a program, with logging, ready to be compiled.


Tada! Running this, we're given:

#include <stdio.h>
int main(int argc, const char *argv[])
{
    printf("Entering main at line 2\n");
    if (argc % 2 == 0)
    {
        printf("Returning from main at line 6\n");
        return 1;
    }
    else
    {
        printf("Returning from main at line 10\n");
        return 0;
    }
}

Future Work

What we can't do yet is insert, replace, erase or modify declarations or statements directly in the AST, and have that reflected as a text insertion/replacement/erasure. For example, maybe I want to rename a function? Why can't I just go "function_declaration.name = 'new_name'". At the moment we'd need to replace the text identified by a source range... a bit clunky and manual. So I may add a more direct API in at a later stage. It should be doable, but may be a lot of work.

Also, the Visitor class defined in the above example could be called minimal at best. If there were any statements in the code that weren't handled by our visitor, the translation program would barf. I'll eventually build a complete Visitor class into cmonster to be reused. This should make writing translation tools a breeze; in our example, we would just override "visit_ReturnStatement" in the visitor.

Now, I think it's about time I learnt Go.