Monday, November 7, 2011

Writing a Compiler, or Learning a Language the Hard Way (Part Deux)

A couple of weeks ago I wrote that I would be attempting to write a Go compiler using the LLVM compiler infrastructure. The blog post got a bit of attention after being linked on Lamer News (and a bit more attention than intended!), so I guess there are at least a couple of people out there interested in hearing whether anything comes of this effort.

I've been a mixture of sick and busy (maybe a little bit slack?), but this weekend I finally found some time to put into my shiny new Go compiler. Yesterday I buffed the corners of what I had written so far, and pushed it to GitHub. Behold, llgo! llgo does not currently compile much of the language, and it only compiles completely standalone programs (imports aren't handled). I'll be adding features as I get time, but this is largely an educational endeavour for me, so I'm not really in a rush.

Let's look briefly at how it's put together. llgo is written in Go, and uses the standard Go parser and ast packages. For code generation, it uses gollvm, the Go bindings for LLVM. So it's just a matter of calling the parser to produce an AST, then walking the AST and generating LLVM instructions. The latter part is what I'm newest to: I'm just starting to get to grips with LLVM and SSA (Static Single Assignment) form, and their nuances. It's mostly pretty straightforward, though.
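
To make that a little more concrete, here's a minimal sketch of the parse-and-walk pipeline using only the standard packages. It isn't llgo's actual code, and the gollvm side is reduced to a comment, since the bindings' API isn't shown here:

    package main

    import (
        "fmt"
        "go/ast"
        "go/parser"
        "go/token"
    )

    func main() {
        // Parse a trivial program into an AST with the standard go/parser package.
        src := "package main\n\nfunc add(a, b int) int { return a + b }\n"
        fset := token.NewFileSet()
        file, err := parser.ParseFile(fset, "input.go", src, 0)
        if err != nil {
            panic(err)
        }

        // Walk the AST. A real compiler would emit LLVM instructions (via
        // gollvm's builder) at each node; here we just report what we find.
        ast.Inspect(file, func(n ast.Node) bool {
            if fn, ok := n.(*ast.FuncDecl); ok {
                fmt.Println("would generate code for function:", fn.Name.Name)
            }
            return true
        })
    }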

Forward, ho!

There are plenty of things to implement, but there's a bunch of big-ticket items that I'm particularly interested in solving. They are, in no particular order:
  • Imports.
  • Interfaces.
  • Goroutines and channels.
  • Deferred function calls.
  • Closures.
  • cgo.
  • Garbage collection.

Imports

If you were to attempt to compile a program with llgo - "Hello, World!", say - then you'd immediately run into one giant gaping hole: the lack of support for imports. You wouldn't be able to import fmt and call fmt.Println. (Actually, I have implemented the println builtin, but that's beside the point.) The package system is pretty fundamental, so I'll have to address this soon.

The picture I have in my head is that each package will compile to (ideally) machine-independent LLVM bitcode libraries, which will live somewhere in the $GOROOT/pkg hierarchy. Just as the gc toolchain examines package archives to determine what they export, llgo will load and examine the modules defined in the bitcode.

Somewhat related to imports is the runtime. I dare say that most of the imports people will ever do will be of standard libraries, which at some time or another use the Go runtime (e.g. reflection, string comparisons, system calls). So I'll have to start thinking seriously about which parts of the runtime I'll be able to reuse, and which parts I'll rewrite.

Interfaces

In my previous blog post I talked about the idea of pseudo duck-typing in a statically compiled language ("implicit interfaces"), so this feature has a special place in my heart. I have some ideas of how to implement them, but I'll have to implement user-defined types first.
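
For anyone who hasn't met implicit interfaces before, this is the kind of thing llgo will eventually have to compile: a type satisfies an interface simply by having the right method set, with no "implements" declaration anywhere.

    package main

    import "fmt"

    // Stringer is satisfied implicitly: Celsius never declares that it
    // implements the interface, it just happens to have a String method.
    type Stringer interface {
        String() string
    }

    type Celsius float64

    func (c Celsius) String() string {
        return fmt.Sprintf("%.1f degrees C", float64(c))
    }

    func describe(s Stringer) {
        fmt.Println(s.String())
    }

    func main() {
        describe(Celsius(21.5)) // works without any explicit "implements"
    }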

Goroutines and Channels

I'm not going for efficiency at this stage, just functionality. So I intend, initially, to implement goroutines 1:1 with respect to OS threads. The gc compiler and runtime multiplex goroutines M:N onto threads with a user-space scheduler; I believe gccgo still does 1:1. I also don't intend, initially at least, to implement split stacks. Arguably these things are functionality too, since the language promises that creating many goroutines is inexpensive and efficient. I have to prioritise, though, so I'll tackle efficiency and scalability later.

I've implemented channel-like data structures in C++ before, so I don't expect that to be too difficult. I'll just start out with a simple mutex/condition-based data structure, with or without an underlying FIFO array/queue depending on whether or not the channel is buffered.
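
To show what I mean, here's a rough sketch of that kind of structure for the buffered case, written as ordinary Go purely for illustration. The names and layout are mine, not llgo's eventual runtime representation, and the unbuffered rendezvous case would need a different handshake.

    package main

    import "sync"

    // chanT is a conceptual buffered channel built from a mutex, two
    // condition variables and a FIFO slice.
    type chanT struct {
        mu       sync.Mutex
        notEmpty *sync.Cond
        notFull  *sync.Cond
        buf      []int
        cap      int
    }

    func newChanT(capacity int) *chanT {
        c := &chanT{cap: capacity}
        c.notEmpty = sync.NewCond(&c.mu)
        c.notFull = sync.NewCond(&c.mu)
        return c
    }

    // send blocks while the buffer is full, then appends a value.
    func (c *chanT) send(v int) {
        c.mu.Lock()
        for len(c.buf) == c.cap {
            c.notFull.Wait()
        }
        c.buf = append(c.buf, v)
        c.notEmpty.Signal()
        c.mu.Unlock()
    }

    // recv blocks while the buffer is empty, then pops the oldest value.
    func (c *chanT) recv() int {
        c.mu.Lock()
        for len(c.buf) == 0 {
            c.notEmpty.Wait()
        }
        v := c.buf[0]
        c.buf = c.buf[1:]
        c.notFull.Signal()
        c.mu.Unlock()
        return v
    }

    func main() {
        c := newChanT(1)
        go c.send(42)
        _ = c.recv()
    }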

Deferred Function Calls

As a general rule, I'm trying to do this... not clean-room exactly, but not just by copying what's done in gc or gccgo either. After all, I'm trying to teach myself something here, and doing things myself is a great way for me to learn. Sometimes things go awry, but then I know what not to do next time. It also provides good background when reading about how others have solved the same problem.

Russ Cox wrote an interesting article on how deferred function calls are implemented in Go. Actually the article was nominally about how breaking abstractions can be a good thing, and I tend to agree. LLVM adds another level of abstraction, which means some of these functions don't end up being quite as efficient as when they're implemented directly in assembler or machine code.
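
For reference, these are the semantics that have to be implemented: deferred calls run when the surrounding function returns, in LIFO order.

    package main

    import "fmt"

    func main() {
        // Deferred calls run when main returns, last in, first out - this
        // is the behaviour the compiler and runtime have to arrange.
        defer fmt.Println("deferred first, printed last")
        defer fmt.Println("deferred second, printed first")
        fmt.Println("function body")
    }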

Closures

If you're not familiar with the term, a closure is essentially a function that has captured some variables from its defining environment. In Go you can define function literals, which are anonymous functions. If a function literal refers to a variable defined in an enclosing scope, the function is a closure.
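
A small example, for the sake of concreteness:

    package main

    import "fmt"

    // counter returns a function literal that captures n from its defining
    // environment; every call to the returned closure sees the same n.
    func counter() func() int {
        n := 0
        return func() int {
            n++
            return n
        }
    }

    func main() {
        next := counter()
        fmt.Println(next(), next(), next()) // 1 2 3
    }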

I had been wondering about how I would go about implementing closures in llgo. I was thinking, broadly, that I would need to store the variables in memory alongside the function code. How would I do this in LLVM? I could treat closures differently, representing them as a structure containing the captured variables, and a pointer to the function, which would need to have additional arguments defined to accept the captured variables. Then the caller of the closure would have to treat it differently to a simple function. This seems a bit ugly, so I wondered, how does gc do it?
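
Roughly, this is the lowering I had in mind, written as ordinary Go purely for illustration; it isn't how gc does it, and it may not be how llgo ends up doing it either.

    package main

    import "fmt"

    // closureEnv holds the captured variables; the closure itself becomes a
    // pair of environment pointer and a function that takes the environment
    // as an extra argument. Purely illustrative - not gc's or llgo's layout.
    type closureEnv struct {
        n int
    }

    type closureVal struct {
        env *closureEnv
        fn  func(*closureEnv) int
    }

    func makeCounter() closureVal {
        return closureVal{
            env: &closureEnv{},
            fn: func(env *closureEnv) int {
                env.n++
                return env.n
            },
        }
    }

    func main() {
        c := makeCounter()
        // The caller has to know it's calling a closure and pass the
        // environment along explicitly - the part that feels a bit ugly.
        fmt.Println(c.fn(c.env), c.fn(c.env)) // 1 2
    }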

The gc implementation is very elegant: it allocates memory on the heap, with PROT_EXEC enabled, and stores dynamically generated machine code in it. At the end of the code, the captured variables are stored. The machine code loads the variables from memory onto the stack for use in the closure. Elegant, but how can we do that in LLVM?

LLVM abstracts away the details of the underlying target system, which means you deal with an intermediate representation of instructions rather than machine-specific instructions. We could dynamically generate code using LLVM, but that would mean requiring LLVM in every program, which seems a bit heavyweight. Or we could just reuse the code from gc, since it's pretty well contained, but that means adding in platform-specifics where there were none before. I think that's a better solution, but I might have to see what other LLVM-based compilers have done. I guess the Glasgow Haskell Compiler might be a good first place to look.

cgo

This one has the potential to be quite interesting. There's already a mature LLVM-based C compiler: clang. So llgo could potentially leverage it to implement a cgo equivalent. Both clang and llgo will emit bitcode; llgo (or the cgo equivalent) will analyse the bitcode emitted from clang, and determine the available functions and types.

Garbage Collection

I know conceptually how mark & sweep algorithms work, but I have never implemented one, nor even analysed an implementation. LLVM provides some support for garbage collection, which will presumably make things easier.
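
For the curious, here's the concept in miniature: a toy mark & sweep pass over an object graph. It has nothing to do with how llgo or LLVM's GC support will actually work.

    package main

    import "fmt"

    // object is a node in a toy object graph.
    type object struct {
        name   string
        refs   []*object
        marked bool
    }

    // mark recursively flags every object reachable from the roots.
    func mark(roots []*object) {
        for _, o := range roots {
            if o == nil || o.marked {
                continue
            }
            o.marked = true
            mark(o.refs)
        }
    }

    // sweep keeps the marked objects and reports everything else as freed.
    func sweep(heap []*object) []*object {
        live := heap[:0]
        for _, o := range heap {
            if o.marked {
                o.marked = false // reset for the next cycle
                live = append(live, o)
            } else {
                fmt.Println("freeing", o.name)
            }
        }
        return live
    }

    func main() {
        a := &object{name: "a"}
        b := &object{name: "b", refs: []*object{a}}
        garbage := &object{name: "garbage"}
        heap := []*object{a, b, garbage}

        mark([]*object{b}) // b is the only root; it keeps a alive
        heap = sweep(heap) // prints "freeing garbage"
        fmt.Println(len(heap), "objects survive")
    }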

Over and Out

Rather than continuing to talk about doing stuff, I'm going to go and do stuff. If you're interested in following my progress, I suggest that you watch llgo on GitHub.

Without further ado, adieu.