Writing a Compiler, or Learning a Language the Hard Way (Part Deux)

A couple of weeks ago I wrote that I would be attempting to write a Go compiler using the LLVM compiler infrastructure. The blog post got a bit of attention after being linked on Lamer News (and a bit more attention than intended!), so I guess there's at least a couple of people out there interested in hearing whether anything comes of this effort.

I've been a mixture of sick and busy (maybe a little bit slack?), but this weekend I finally found some time to put into my shiny new Go compiler. Yesterday I buffed the corners of what I had written so far, and pushed it to github. Behold, llgo! llgo does not currently compile much of the language, and it only compiles completely standalone programs (imports aren't handled). I'll be adding in features as I get time, but this is largely an educational endeavour for me, so I'm not really in a rush.

Let's look briefly at how it's put together. llgo is written in Go, and uses the standard Go parser/ast packages. For code generation, the gollvm (Go bindings for LLVM) package is used. So it's just a matter of calling the parser to generate an AST, then walking the AST and generating LLVM instructions. The latter part is what I'm most new to, and I'm just starting to get to grips with LLVM and SSA (Single Static Assignment), and their nuances. It's mostly pretty straightforward though.

Forward, ho!

There's plenty of things to implement, but there's a bunch of big ticket items that I'm particularly interested in solving. They are, in no particular order:
  • Imports.
  • Interfaces.
  • Goroutines and channels.
  • Deferred function calls.
  • Closures.
  • cgo
  • Garbage Collection

Imports

If you were to attempt to compile a program with llgo - "Hello, World!", say - then I'm sure you'd find one giant gaping hole in the lack of support for imports. So you wouldn't be able to import fmt and do fmt.Println. Actually I have implemented the println builtin, but that's beside the point. The module system is pretty fundamental, so I'll have to address this soon.

The picture I have in my head is that each package will compile to (ideally) machine-independent LLVM bitcode libraries, which will go somewhere in the $GOROOT/pkg hierarchy. Just as Go examines archives to determine what they export, so llgo will load and examine the modules defined by the bitcode.

Somewhat related to imports is the runtime. I dare say that most of the imports people will ever do will be importing standard libraries, which will at some time or another use the Go runtime (e.g. reflection, and string comparisons, system calls). So I'll have to start thinking seriously about which parts of the runtime I'll be able be able to reuse, and which parts I'll rewrite.

Interfaces


In my previous blog post I talked about the idea of pseudo duck-typing in a statically compiled language ("implicit interfaces"), so this feature has a special place in my heart. I have some ideas of how to implement them, but I'll have to implement user-defined types first.

Goroutines and Channels

I'm not going for efficiency at this stage, I'm just going for functionality. So I intend, initially, to implement goroutines 1-1 with respect to threads. The gc compiler/runtime does M-N with a preemptive application-level scheduler; I think gccgo still does 1-1. I also do not intend, initially at least, to implement split stacks. These things can well be considered functionality, especially since the language intends to make creating many goroutines inexpensive and efficient. I have to prioritise, though, so I'll tackle efficiency and scalability later.

I've implemented channel-like data structures in C++ before, so I don't expect that to be too difficult. I'll just start out with a simple mutex/condition-based data structure, with or without an underlying FIFO array/queue depending on whether or not the channel is buffered.

Deferred Function Calls

As a general rule, I'm trying to do this... not clean room, but not just by copying what's done in gc/gccgo either. After all, I'm trying to teach myself something here, and I find doing things is a great way of learning for me. Sometimes things go awry, but then I know what not do next time. It also serves as a good background when reading about how others have solved the problem.

Russ Cox wrote an interesting article on how deferred function calls are implemented in Go. Actually the article was nominally about how breaking abstractions can be a good thing, and I tend to agree. LLVM adds another level of abstraction, which means some of these functions don't end up being quite as efficient as when they're implemented directly in assembler or machine code.

Closures

If you're not familiar with this term, it's essentially a function that has captured some variables in its defining environment. In Go you can define function literals, which are anonymous functions. If a function literal refers to a variable defined in the outer scope, then the function will be defined as a closure.

I had been wondering about how I would go about implementing closures in llgo. I was thinking, broadly, that I would need to store the variables in memory alongside the function code. How would I do this in LLVM? I could treat closures differently, representing them as a structure containing the captured variables, and a pointer to the function, which would need to have additional arguments defined to accept the captured variables. Then the caller of the closure would have to treat it differently to a simple function. This seems a bit ugly, so I wondered, how does gc do it?

The gc implementation is very elegant: it allocates memory on the heap, with PROT_EXEC enabled, and stores dynamically generated machine code in it. At the end of the code, the captured variables are stored. The machine code loads the variables from memory onto the stack for use in the closure. Elegant, but how can we do that in LLVM?

LLVM abstracts away the details of the underlying target system, which means you deal with an intermediate representation of instructions rather than machine-specific instructions. We could dynamically generate code using LLVM, but that would mean requiring LLVM in every program, which seems a bit heavyweight. Or we could just reuse the code from gc, since it's pretty well contained, but that means adding in platform-specifics where there were none before. I think that's a better solution, but I might have to see what other LLVM-based compilers have done. I guess the Glasgow Haskell Compiler might be a good first place to look.

cgo

This one has the potential to be quite interesting. There's already a mature LLVM-based C compiler: clang. So llgo could potentially leverage it to implement a cgo equivalent. Both clang and llgo will emit bitcode; llgo (or the cgo equivalent) will analyse the bitcode emitted from clang, and determine the available functions and types.

Garbage Collection


I know conceptually how mark & sweep algorithms work, but I have never implemented one, nor even analysed an implementation. LLVM provides some support for garbage collection, which will presumably make things easier.

Over and Out

Rather than continuing to talk about doing stuff, I'm going to go and do stuff. If you're interested in following my progress, I suggest that you watch llgo on github.

Without further ado, adieu.

Posted November 7, 2011.

Writing a Compiler, or Learning a Language the Hard Way

For a while now I’ve been lamenting the lack of progress in systems programming languages. I work primarily in C++ (professionally, anyway), a language which leaves a lot to be desired. The end result, assuming you know what you’re doing, is likely to be fast code. But getting there can be a tedious and precarious journey. Don’t get me wrong, I’m not a C++ hater. I just think we can do better.

So for the past few I’ve been storing away ideas for writing my own language. (Groan, another one.) The vast majority of languages these days seem to be JVM based, or some other VM such as the CLR. This is fine for many applications, but for a lot of systems programming you typically want to be much closer to the hardware. What I want is something with the ease and rapidity of development of, say, Python, with the power of a lower-level language such as C/C++.



The Easy: Ideas

So what are those ideas I’d stored away then? Well to be honest, many of them boil down to syntactic sugar (syntactic caramel?), so here are the more functional ones:

  • Implicit Interfaces. The idea is that classes should be able to implement an interface without explicitly saying so, just by implementing the methods defined in the interface. Poor-man’s duck-typing if you will.
  • Link Time Optimisation (LTO) should be the norm. Say you write a function in a library which does everything and the kitchen sink. If the calling program only uses a function from your library in a certain way, then the function should be optimised for that usage.
  • Pure Functions. This kind of fits under LTO. I want simpler meta-programming: it should be possible to mark functions as being “pure”, which means (in my terminology) that the function either has no side-effects, or affects only temporary variables. Calls to these functions could then be evaluated at compile time, e.g. to calculate complex mathematical expressions. I guess this just comes under “fancy optimiser”?
  • Pattern Matching. I haven’t really fleshed this one out, but I think it would be handy to be able to match functions not just based on signature, but based on the values of the arguments.
  • No preprocessor, no headers. When code is distributed, the modules should be self describing, such as with Java’s compiled .class files. This would eliminate problems such as compiling against an old version of an interface, and linking against a new, incompatible implementation.
  • For the love of God, no garbage collection. What can I say? I like my RAII.
  • I’m not really sure what title to give this one, but I wanted to eliminate the problems you get with linking objects together that have been compiled with different options in C++ (e.g. with/out RTTI, exceptions, or perhaps against an entirely different STL/standard library.)
There were some other requirements I had in mind, such as having a single reference implementation à la CPython, and including a parser with the runtime to simplify tool development.


The Hard: Writing a Compiler

Anyone can come up with ideas. Even coming up with practical ideas isn’t too hard, but implementing them is something else altogether. Especially so for someone like me who has had no formal education in compilers, or language design/implementation. Hell, my university course didn’t even cover formal grammars or parsing. I’ve picked some things up by myself, but it’s a very academic field.

Back when I was in uni, I toyed around with C–, which at the time looked like quite a promising “portable assembly language”. This language is (or at least was) used as the intermediate language for the Glasgow Haskell Compiler. My self-taught knowledge of language design/implementation and code generation were not really up to scratch, so I didn’t get much further than writing toy languages.

Fast forward a few years, and I’ve got the itch to implement a compiler again. I’ve recently been playing around with LLVM. It boasts a relatively easy to use API, which takes out some of the drudgery of writing the “bitcode” (an intermediate code that can be translated to machine code). I’ve got a bit more knowledge and programming skill behind my belt now, so let’s get to work on that language, right!?

I read something recently which is summarised as: learn intimately that which you wish to replace. Now I don’t have any illusions of replacing any existing language, but I think the message is still relevant. I’ve never implemented a language OR a compiler before, and now I’m going to both at once? How about I make this a bit easier on myself, and write a compiler for an existing language. If I still want to implement my own language later, I can use the experience I gained.

I’ve been looking for an excuse for a while now to learn the Go Programming Language. It has some of the same design goals as my hypothetical language, so it seemed like a good fit.


The Fun: Learning Go and LLVM

I’ve been learning Go in my precious spare time for the last couple of weeks. There’s a nifty web-app which provides an interactive tutorial in Go, called A Tour of Go. It’s a bit of fun, and I recommend it to anyone wanting to learn the language.

So anyway, I’m now playing around with writing a Go compiler using LLVM. Why?
  • To learn LLVM, and more generally how to write a compiler and generate code.
  • To learn Go.
  • Potentially to fill a gap in JIT compilation of Go programs.
  • Why not?
Writing the compiler will be made much easier by the fact that the Go runtime includes a Go parser, and someone has already implemented Go bindings for LLVM. I haven’t made a great deal of progress yet, but it seems achievable. When I’ve got something vaguely useful, I’ll chuck it over on GitHub.

If you know anything about Go, you’ll probably have noticed that at least one of the ideas that I presented above is present in Go: implicit interfaces. I really can’t say whether this concept is new or not - it’s probably not - but I did at least come up with it independently. Just sayin’!

I’ll write a follow-up post when I get some time, describing a bit more about the Go compiler, the challenges I’ve come across, and my thoughts on how to solve them. 

Posted October 23, 2011.

C++ Source-to-Source Translation

I’ve been on annual leave this week, so I’ve taken the opportunity to do some work on cmonster. I’ve added preliminary support for source-to-source translation by introducing a wrapper for Clang’s “Rewriter” API. My fingers have been moving furiously so it’s all a bit rough, but it does work.

The API flow is:

  1. Parse translation unit, returning an Abstract Syntax Tree (AST).
  2. Walk AST to find bits of interest.
  3. Insert/replace/erase text in the original source, using the location stored in each declaration/statement/token.

Motivating Example

Logging is a necessary evil in complex software, especially when said software is running on a customer’s system, inaccessible to you. To make problem determination easier, we want a decent amount of information: file names, line numbers, function names, time of day, thread ID, … but all of this comes at a cost. I’m not talking just cost in terms of CPU usage, though that is a major concern. I’m talking cost in terms of source code quality and maintainability.

We’ll start off with a trivial C program:

int main(int argc, const char *argv[])
{
if (argc % 2 == 0)
{
return 1;
}
else
{
return 0;
}
}

Let’s say our needs are fairly humble: we just want to log the entry and exit of this function. Logging entry is easy: add a blob of code at the top of the function. We can get the function name and line number using func (C99, C++11) and LINE. What about func in C89? C++98? There’s various alternatives, but some compilers have nothing. And that makes writing a cross-platform logging library a big old PITA. The information is there in the source code - if only we could get at it! In the end, we’re more likely to forego DRY, and just reproduce the function name as a string literal.

Getting the function name and line number isn’t a huge problem, but how about adding function exit logging? Now we’re going to have to insert a little bit of code before our return statements. So we’ll have something like:

int main(int argc, const char *argv[])
{
const char *function = “main”;
printf(“Entering %s:%s:%d\n”, function,
FILE, LINE);
if (argc % 2 == 0)
{
printf(“Leaving %s:%s:%d\n”, function,
FILE, LINE);
return 1;
}
else
{
printf(“Leaving %s:%s:%d\n”, function,
FILE, LINE);
return 0;
}
return 0;
}

Ugh. And that’s just the start. It gets much nastier when we need to turn logging on/off at runtime, filter by function name, etc. We could make it much nicer with a variadic macro. Something like LOG(format…), which calls a varargs function with the ‘function’ variable, FILE, LINE and the format and arguments you specify. Unfortunately variadic macros are not supported by some older compilers. The first version of Visual Studio to support them was Microsoft Visual Studio 2005. So there goes that idea…

Hmmm, what to do, what to do? Wouldn’t it be nice if we could just tag a function as requiring entry/exit logging, and have our compiler toolchain to the work? Entry/exit logging is the sort of thing you want to be consistent, so it should suffice to define one set of rules that covers all functions. Let’s take a little peek at what we could do with cmonster.

First, we’ll parse the source to get an AST. We’ll locate all functions defined in the main file, and insert an “Entry” logging statement at the beginning of the body, and an “Exit” logging statement before each return statement in the body. At the end we dump the rewritten source to stdout, and we have a program, with logging, ready to be compiled.


Tada! Running this, we’re given:

#include <stdio.h>
int main(int argc, const char *argv[])
{
printf(“Entering main at line 2\n”);
if (argc % 2 == 0)
{
printf(“Returning from main at line 6\n”);
return 1;
}
else
{
printf(“Returning from main at line 10\n”);
return 0;
}
}

Future Work

What we can’t do yet is insert, replace, erase or modify declarations or statements directly in the AST, and have that reflected as a text insertion/replacement/erasure. For example, maybe I want to rename a function? Why can’t I just go “function_declaration.name = ‘new_name’“. At the moment we’d need to replace the text identified by a source range… a bit clunky and manual. So I may add a more direct API in at a later stage. It should be doable, but may be a lot of work.

Also, the Visitor class defined in the above example could be called minimal at best. If there were any statements in the code that weren’t handled by our visitor, the translation program would barf. I’ll eventually build a complete Visitor class into cmonster to be reused. This should make writing translation tools a breeze; in our example, we would just override “visit_ReturnStatement” in the visitor.

Now, I think it’s about time I learnt Go.

Posted October 7, 2011.

cmonster update


I’ve been quietly beavering away at cmonster, and thought I should share an update.

Note: the changes I describe below are not yet part of a cmonster release. I’ll release something once I’ve stabilised the API and tested it more thoroughly. In the mean time you can pull the source from github.

In the last month or so I have been adding C/C++ parsing capabilities to cmonster, by exposing the underlying Clang parser API. I’ve been wanting a simple, scriptable framework for analysing and manipulating C++ source for a few years now. The reason for wanting such a thing is so that I, and others, can more rapidly develop tools for writing better C++, and eliminate some of the drudgery. I’ve only just made a start, but cmonster now provides an API for parsing a C/C++ source file, returning a Python object to inspect the result.

So what’s cmonster looking like now? We now have a preprocessor and a parser interface, the former now being incorporated into the latter. The parser interface will parse a single source file, and return its Abstract Syntax Tree (AST). As is typical with parsers, there are many classes involved to describe each declaration, statement, type, etc. So I’ve added Cython into the mix to speed up the process of defining Python equivalents for each of these classes.

Unfortunately Cython does not yet support PEP 384 (Py_LIMITED_API), so at the moment cmonster is back to requiring the full Python API, and thus must be rebuilt for each new release. I’ve had a tinker with Cython to get its output to compile with Py_LIMITED_API, and hope to provide a patch in the near future.

What’s next? Once I get the AST classes mapped out, I intend to introduce a source-to-source translation layer. I’m not entirely sure how this’ll work yet, but I think ideally you’d just modify the AST and call some function to rewrite the main source file. Elements of the AST outside of the main source file would be immutable. That’s the hope, but it may end up being something a little more crude, using Clang’s “Rewriter” interface directly to replace source ranges with some arbitrary string. I expect this will be a ways off yet, though.

Posted September 25, 2011.

Hello, Mr. Hooker.

I’ve been procrastinating on cmonster. I have some nasty architectural decisions to make, and I keep putting it off. In the mean time I’ve been working on a new little tool called “Mr. Hooker” (or just mrhooker).

Introducing Mr. Hooker

The idea behind mrhooker is very simple: I wanted to be able to write LD_PRELOAD hooks in Python. If you’re not familiar with LD_PRELOAD, it’s a mechanism employed by various UNIX and UNIX-like operating systems for “preloading” some specified code in a shared library. You can use this to provide your own version of native functions, including those in standard libraries such as libc.

Anyway, I occasionally find the need for an LD_PRELOAD library to change the behaviour of a program that I can’t easily recompile. Often these libraries will be throw-away, so it might end up taking just as long to write the LD_PRELOAD library. So I wrote mrhooker to simplify this.

It turns out there’s very little to do, since Cython (and friends) do most of the hard work. Cython is a programming language that extends Python to simplify building Python extensions. It also has an interface for building these extensions on-the-fly. So mrhooker doesn’t need to do much - it takes a .pyx (Pyrex/Cython source) and compiles it to a shared library using Cython. Mrhooker takes this, and some common code, and loads it into a child process using LD_PRELOAD.

Example - Hooking BSD Sockets


Let’s look at an example of how to use mrhooker. Hooks are defined as external functions in a Cython script. Say we want to hook the BSD sockets “send” function. First we’d find the signature of send (man 2 send), which is:

ssize_t send(int sockfd, const void *buf, size_t len, int flags);

Given this, we can produce a wrapper in Cython, like so:

cdef extern ssize_t send(int sockfd, char *buf, size_t len, int flags) with gil:


There’s a couple of important things to note here. First, the parameter type for “buf” drops const, since Cython doesn’t know about const-ness. Second, and crucially, the function must be defined “with gil”. This ensures that the function acquires the Python Global Interpreter Lock before calling any Python functions. Okay, with that out of the way, let’s go on…

We’ll want to do something vaguely useful with this wrapper. Let’s make it print out the argument values, and then continue on with calling the original “send” function. To do that we’ll use dlsym/RTLD_NEXT to find the next function called “send”.

cdef extern ssize_t send(int sockfd, char buf, size_t len, int flags) with gil:
print “====> send(%r, %r, %r, %r)” % (sockfd, buf[:len], len, flags)
real_send = dlsym(RTLD_NEXT, “send”)
if real_send:
with nogil:
res = (<ssize_t(
)(int, void, size_t, int) nogil>real_send)(
sockfd, buf, len, flags)
return res
else:
return -1

We’ll also need to declare dlsym and RTLD_NEXT. Let’s do that.

# Import stuff from <dlfcn.h>
cdef extern from “dlfcn.h”:
void
dlsym(void, char)
void* RTLD_NEXT

Now you just run:

mrhooker <script.pyx> <command>


And there we go. This is trivial - it would also be fairly trivial to write a C program to do this. But if we wanted to do anything more complex, or if we were frequently changing the wrapper function, I’d much rather write it in Python - or Cython, as it were.

Enjoy!


Edit: I just noticed that it’s broken if you don’t have a certain config file. I always had one while testing… until I got to work.
You’ll get an error “ConfigParser.NoSectionError: No section: ‘default’“. I’ll fix the code at home, but in the mean time you can do this:

$ mkdir ~/.mrhooker
$ echo [default] > ~/.mrhooker/mrhooker.config

P.S. if you add “build_dir = <path>” in that section, or a per-module section, mrhooker/Cython will store the shared library that it builds. Then if you don’t change the source it’ll be used without rebuilding.

Posted September 5, 2011.

Google App Engine Agent

A couple of months ago I wrote about my foray into the world of Google App Engine. More recently, I’d gotten the itch again, and had some ideas of how to fix the problems I found when attempting to get Pushy to work in Google App Engine.

The root of most of the problems is that Google App Engine is stateless in nature. Server instances can be spun up or spun down without notice, and so we can’t store complex state, which Pushy really requires. So a couple of weeks ago I set to investigating implementing a server-initiated RPC mechanism that is asynchronous and (mostly) stateless.

How would it work? Well, earlier this year I read that ProtoRPC was released, which brought RPC services to Google App Engine. In our case, Google App Engine is the client, and is calling the agent - but we can at least reuse the API to minimise dependencies and hopefully simplify the mechanism. Okay, so we have a ProtoRPC service running on a remote machine, consumed by our Google App Engine application. How do they talk?

One thing I wanted to avoid was the need for polling, as that’s both slow and expensive. Slow in that there will necessarily be delays between polls, and expensive in that unnecessary polls will burn CPU cycles in Google App Engine, which aren’t free. Long-polling isn’t possible, either, since HTTP requests are limited to 30 seconds of processing time. If you read my last post, you probably already know what I’m going to say: we’ll use XMPP.

What’s XMPP? That’s the Extensible Messaging and Presence Protocol, which is the protocol underlying Jabber. It is also the primary protocol that Google Talk is built on. It’s an XML-based, client-server protocol, so peers do not talk directly to each other. It’s also asynchronous. So let’s look at the picture so far…

  • The client (agent) and server (GAE application) talk to each other via XMPP.
  • The agent serves a ProtoRPC service, and the GAE application will consume it.
Because our RPC mechanism will be server-initiated, we’ll need something else: agent availability discovery. Google App Engine provides XMPP handlers for agent availability (and unavailability) notification. When an agent starts up it will register its presence with the application. When agent is discovered, the application will request the agent’s service descriptor. The agent will respond, and the application will store it away in Memcache.

We (ab)use Memcache for sharing of data between instances of the application. When you make enough requests to the application, Google App Engine may dynamically spin up a new instance to handle requests. By storing the service descriptor in Memcache, it can be accessed by any instance. I said abuse because Memcache is not guaranteed to keep the data you put in it - it may be expelled when memory is constrained. Really we should use Datastore, but I was too lazy to deal with cleaning it up. “Left as an exercise for the reader.” One thing I did make a point of using was to use the new Python Memcache CAS API, which allows for safe concurrent updates to Memcache.

Orrite. So now we have an agent and application which talk to each other via XMPP, using ProtoRPC. The application discovers the agent, and, upon request, the agent describes its service to the application. How can we use it? Well the answer is really “however you like”, but I have created a toy web UI for invoking the remote service methods.


Wot ‘ave we ‘ere then? The drop-down selection has all of the available agent JIDs (XMPP IDs). The textbox has some Python code, which will be executed by the Google App Engine application. Yes, security alert! This is just a demonstration of how we can use the RPC mechanism - not a best practice. When you hit “Go!”, the code will be run by the application. But before doing so, the application will set a local variable “agent”, which is an instance of the ProtoRPC service stub bound to the agent selected in the drop-down.

ProtoRPC is intended to be synchronous (from the looks of the comments in the code, anyway), but there is an asynchronous API for clients. But given that application requests can only take up to 30 seconds to service a request, our application can’t actively wait for a response. What to do? Instead, we need to complete the request asynchronously when the client responds, and convey some context to the response handler so it knows what to do with it.

In the demo, I’ve done something fairly straight forward with regards to response handling. When the UI is rendered, we create an asynchronous channel using the Channel API. We use this to send the response back to the user. So when the code is executed, the service stub is invoked, and the channel ID is passed as context to the client. When the client responds, it includes the context. Once again, security alert. We could fix security concerns by encrypting the context to ensure the client doesn’t tamper with it. Let’s just assume the client is friendly though, okay? Just this once!

So we finally have an application flow that goes something like this:
  1. Agent registers service.
  2. Server detects agent’s availability, and requests client’s service descriptor.
  3. Client sends service descriptor, server receives and stores it in Memcache.
and then…
  1. User hits web UI, which server renders with a new channel.
  2. User selects an agent and clicks “Go!”.
  3. Server instantiates a service stub, and invokes it with the channel ID as context. The invocation sends an XMPP message to the agent.
  4. Agent receives XMPP message, decodes and executes the request. The response is sent back to the server as an XMPP message, including the context set by the server.
  5. The server receives the response, and extracts the response and channel ID (context). The response is formatted and sent to the channel.
  6. The web UI’s channel Javascript callback is invoked and the response is rendered.

Fin

I’ve put my code up on GitHub, here: http://github.com/axw/gaea. Feel free to fork and/or have a play. I hope this can be of use to someone. If nothing else, I’ve learnt a few new tricks!

Posted September 1, 2011. Tags: google app engine, python, rpc, xmpp.

It's Alive!


I've been quietly hacking away on cmonster (née csnake). I like this name even more: I think it describes my creation better. If you thought preprocessors were horrible before, well...


What is cmonster?

cmonster is a C preprocessor with a few novel features on top of the standard fare:

  • Allows users to define function macros in Python, inline.
  • Allows users to define a callback for locating #include files, when the file can not be found in the specified include directories.
  • Provides a Python API for iterating over tokens in the output stream.
cmonster is built on top of Clang, a modern C language family compiler, which contains a reusable, programmable preprocessor. At present, cmonster requires Clang 3.0 APIs, which has not yet been released. I have been working off Clang's subversion trunk.

I have just uploaded a binary distribution of the first alpha version (0.1) of cmonster to pypi. I have only built/tested it on Linux 32-bit, Python 3.2, and I don't expect it will work on anything else yet. If you want to play around with it, you can install cmonster using "easy_install cmonster" or by grabbing it off pypi and installing it manually.


Demonstration

Seeing is believing - how does this thing work? We'll ignore everying except for inline Python macros for now, because that's the most stable part of the API.


It is possible to define macros inline in cmonster, using the builtin "py_def" macro. For example:

py_def(factorial(n))
import math
return str(math.factorial(int(str(n))))
py_end

When cmonster sees this, it will grab everything up to "py_end", and define a Python function. It will also create a preprocessor macro with the function's name (as given in the py_def heading), and this macro will be directed to call the Python function. The Python function will be passed the argument tokens that the macro was invoked with. It can return either a sequence of tokens, or a string that will subsequently be tokenised.


Addressing Shortcomings

In my previous post about csnake I mentioned a few things that needed to be done. I have addressed some of these things in cmonster:

A way of detecting (or at least configuring) pre-defined macros and include paths for a target compiler/preprocessor. A standalone C preprocessor isn't worth much. It needs to act like or delegate to a real preprocessor, such as GCC.
 I have added support for emulating GCC. This is done by consulting GCC for its predefined macros (using "gcc -dM"), and using the new "include locator" callback feature. By setting an include locator callback on a cmonster preprocessor, you provide cmonster with a second-chance attempt at locating an include file when the file can not be found in the specified include directories. This method can be used to determine GCC's system include directories: whenever we can't find an include file, we consult GCC and add parse the output to determine the location of the file on disk. I intend to add support for more (common) compilers/preprocessors in the future. Namely, MSVC.

I had another crazy idea for (ab)using include locators: handling specially formatted #include directives to feed off-disk files into the preprocessor. Buh? I mean, say, the ability to pull down headers from a hosted repository such as github (e.g. #include <github.yourproject/header.h>), and feeding them into the preprocessor as in-memory files. Or generating headers on the fly (e.g. #include <myapi.proto>, to automatically generate and include Protobuf headers).

A #pragma to define Python macros in source, or perhaps if I'm feeling adventurous, something like #pydefine.
Support for inline Python macros has been implemented: see the "Demonstration" section above. It's unlikely I'll attempt to create a #pydefine, as it would be more work than it's worth.


A simple command line interface with the look and feel of a standard C preprocessor
The distribution now contains a "cmonster" script, which invokes the preprocessor on a file specified on the command-line. This will need a lot of work: presently you can't add user include directories or (un)define macros. Neither of these things are difficult to add, they just haven't been top priorities.


Future Work

Still remaining to do (and sticking out like sore thumbs) are unit-tests and documentation. Now that I've got something vaguely usable, I will be working on those next.

Once I've tested, documented and stabilised the API, I'll look at (in no definite order):
  • Improvement the command line interface. Add the standard "-I", "-D" and "-U" parameters.
  • Portability. Probably Windows first, since it's common and I have access to it.
  • Emulation of additional preprocessors.
  • Passing in-memory files ("file like" Python objects) to #include.

Posted August 4, 2011.

Controlling Remote Agents from Google App Engine

A month or so ago I was brainstorming ideas related to Google App Engine (GAE), as I had been wanting a reason to play with it for a while. One idea that stuck was connecting a remote Python process to GAE via Pushy, so we could either control GAE or GAE could control the remote process. I’m still working on the C/Python Preprocessor thingy, but I took a break from that this weekend to look into the possibility of a GAE Pushy transport.

So yesterday morning I signed up for an account, and started tinkering. I had been reading the docs already, and I figured there were a few possible approaches:

  • The obvious: use HTTP. This has one major drawback in that it is inherently synchronous and wholly driven by the client. Moreover, GAE only allows request handlers around 30s to complete, so no kind of fancy long-polling is possible here.
  • Channel API. This sounds like the right kind of thing to use, but it’s aimed at interacting with Javascript in a webpage.
  • XMPP. Huh? Isn’t that for instant messaging? Exactly. The client (remote Python process) and server (GAE) are peers in XMPP, and either one can initiate sending messages to the other. Let’s look into this a bit more…
So I did a quick search for Python XMPP libraries, and a few came up. I settled on xmpppy, but to be honest I didn’t find any of them particularly compelling. The APIs are a bit clunky. Anyway, the approach I took was to have an XMPP handler in my GAE application create a persistent Pushy connection object associated with a pair of read/write files that wrapped the XMPP API. When an XMPP message came in, the application would extract the Pushy request from base-64 encoded body of the message, and return the result in a similar manner.

And it worked, but only just. I had to make a few hacks to Pushy to get all of this work. There were some oddities I had to work around, such as the “eval” built-in in GAE’s Python not taking any keyword arguments. Unfortunately, I don’t think this particular transport is very useful in the flaky state it’s in at the moment. Also, it’s not terribly valuable to build a transport to control a GAE application, since other APIs exist for that purpose (ProtoRPC, remote_api). More useful would be to have the GAE application control the remote process without the need for polling. I’ll be looking into this further.

Posted July 17, 2011.

Le sigh

I’ve been coming across more problems with Boost Wave. The current ones blocking me are:

  • A lack of an efficient way to conditionally disable a macro. The “context policy” provides hooks for handling macro expansions, and its return value is meant to control whether the expansion takes place. It doesn’t work. I’ll write up a bug report when I get some time.
  • Wave isn’t very forgiving about integer under/overflow. For example, the GNU C library’s header “/usr/include/bits/wchar.h” has the following tidbit to determine the sign of wide characters, which Boost Wave barfs on:
#elif L’\0’ - 1 > 0

I think the latter “problem” might actually be reasonable - I believe the standards say that handling of overflow is undefined, and preprocessor/compiler-specific. That doesn’t help me much though. I could fix this by writing code to parse the expressions, which seems silly, or by passing off to the target preprocessor (e.g. GCC), which seems like overkill.

I’m going to have a look at how hard it would be to use LLVM/Clang’s preprocessor instead. If that’s a better bet, I may go that way. Otherwise, it might be time to get approval to send patches to the Wave project.

Posted June 21, 2011.

Inline Python macros in C

I had a bit of time today to do some more work on that which is soon to be called something other than csnake. I’ve added a couple of features:

  • You can now define custom pragmas, providing a Python handler function. Unfortunately Boost Wave, which csnake uses for preprocessing, only provides callbacks for pragmas that start with “#pragma wave”.
  • Built-in pragmas and macros to support defining Python macros inline in C/C++ code.
  • A main program in the csnake package. So you can now do “python3.2 -m csnake ”, which will print out the preprocessed source.
So for example, you can do something like as follows, entirely in one C++ file:

// factorial macro.
py_def(factorial(n))
import math
f = math.factorial(int(str(n)))
return [Token(T_INTLIT, f)]
py_end

int main()
{
std::cout << factorial(3) << std::endl;
return 0;
}

This works as follows: py_def and py_end are macros, which in turn use the _Pragma operator with built-in pragmas. Those pragmas are handled by csnake, and signal to collect the tokens in between. When the py_end macro is reached, the tokens are concatenated and a Python function macro is born.

I’m intending to do additonal “Python blocks”, including at least a py_for block, which will replicate the tokens within the block for each iteration of a loop.

There’s one big problem with the py_def support at the moment, which is that the tokens go through the normal macro replacement procedure. I think I’ll have a fix for that soon.

Posted June 18, 2011.