Sunday, October 23, 2011

Writing a Compiler, or Learning a Language the Hard Way

For a while now I've been lamenting the lack of progress in systems programming languages. I work primarily in C++ (professionally, anyway), a language which leaves a lot to be desired. The end result, assuming you know what you're doing, is likely to be fast code. But getting there can be a tedious and precarious journey. Don't get me wrong, I'm not a C++ hater. I just think we can do better.

So for the past few I've been storing away ideas for writing my own language. (Groan, another one.) The vast majority of languages these days seem to be JVM based, or some other VM such as the CLR. This is fine for many applications, but for a lot of systems programming you typically want to be much closer to the hardware. What I want is something with the ease and rapidity of development of, say, Python, with the power of a lower-level language such as C/C++.

The Easy: Ideas

So what are those ideas I'd stored away then? Well to be honest, many of them boil down to syntactic sugar (syntactic caramel?), so here are the more functional ones:

  • Implicit Interfaces. The idea is that classes should be able to implement an interface without explicitly saying so, just by implementing the methods defined in the interface. Poor-man's duck-typing if you will.
  • Link Time Optimisation (LTO) should be the norm. Say you write a function in a library which does everything and the kitchen sink. If the calling program only uses a function from your library in a certain way, then the function should be optimised for that usage.
  • Pure Functions. This kind of fits under LTO. I want simpler meta-programming: it should be possible to mark functions as being "pure", which means (in my terminology) that the function either has no side-effects, or affects only temporary variables. Calls to these functions could then be evaluated at compile time, e.g. to calculate complex mathematical expressions. I guess this just comes under "fancy optimiser"?
  • Pattern Matching. I haven't really fleshed this one out, but I think it would be handy to be able to match functions not just based on signature, but based on the values of the arguments.
  • No preprocessor, no headers. When code is distributed, the modules should be self describing, such as with Java's compiled .class files. This would eliminate problems such as compiling against an old version of an interface, and linking against a new, incompatible implementation.
  • For the love of God, no garbage collection. What can I say? I like my RAII.
  • I'm not really sure what title to give this one, but I wanted to eliminate the problems you get with linking objects together that have been compiled with different options in C++ (e.g. with/out RTTI, exceptions, or perhaps against an entirely different STL/standard library.)
There were some other requirements I had in mind, such as having a single reference implementation à la CPython, and including a parser with the runtime to simplify tool development.

The Hard: Writing a Compiler

Anyone can come up with ideas. Even coming up with practical ideas isn't too hard, but implementing them is something else altogether. Especially so for someone like me who has had no formal education in compilers, or language design/implementation. Hell, my university course didn't even cover formal grammars or parsing. I've picked some things up by myself, but it's a very academic field.

Back when I was in uni, I toyed around with C--, which at the time looked like quite a promising "portable assembly language". This language is (or at least was) used as the intermediate language for the Glasgow Haskell Compiler. My self-taught knowledge of language design/implementation and code generation were not really up to scratch, so I didn't get much further than writing toy languages.

Fast forward a few years, and I've got the itch to implement a compiler again. I've recently been playing around with LLVM. It boasts a relatively easy to use API, which takes out some of the drudgery of writing the "bitcode" (an intermediate code that can be translated to machine code). I've got a bit more knowledge and programming skill behind my belt now, so let's get to work on that language, right!?

I read something recently which is summarised as: learn intimately that which you wish to replace. Now I don't have any illusions of replacing any existing language, but I think the message is still relevant. I've never implemented a language OR a compiler before, and now I'm going to both at once? How about I make this a bit easier on myself, and write a compiler for an existing language. If I still want to implement my own language later, I can use the experience I gained.

I've been looking for an excuse for a while now to learn the Go Programming Language. It has some of the same design goals as my hypothetical language, so it seemed like a good fit.

The Fun: Learning Go and LLVM

I've been learning Go in my precious spare time for the last couple of weeks. There's a nifty web-app which provides an interactive tutorial in Go, called A Tour of Go. It's a bit of fun, and I recommend it to anyone wanting to learn the language.

So anyway, I'm now playing around with writing a Go compiler using LLVM. Why?
  • To learn LLVM, and more generally how to write a compiler and generate code.
  • To learn Go.
  • Potentially to fill a gap in JIT compilation of Go programs.
  • Why not?
Writing the compiler will be made much easier by the fact that the Go runtime includes a Go parser, and someone has already implemented Go bindings for LLVM. I haven't made a great deal of progress yet, but it seems achievable. When I've got something vaguely useful, I'll chuck it over on GitHub.

If you know anything about Go, you'll probably have noticed that at least one of the ideas that I presented above is present in Go: implicit interfaces. I really can't say whether this concept is new or not - it's probably not - but I did at least come up with it independently. Just sayin'!

I'll write a follow-up post when I get some time, describing a bit more about the Go compiler, the challenges I've come across, and my thoughts on how to solve them.