Writing a BF Compiler for .net (Part 1: Explanation of the language and interpreter in C#)

Today I'm starting a small series about how to write a Compiler to .net IL. I wanted to understand IL better and wanted to learn how .net really works. Even though I'm a Web Developer (which usually means working on a much higher level of abstraction), at some point there is this impossible issue that has no cause and no resolution and where Reflector and WinDbg have to come in.

My main focus is understanding IL, not implementing a language. So I picked an extremely simple but otherwise complete language - Brainf**k (abbreviated BF from here on). Yes, it is a very esoteric language, but as said, focus is on the IL Part, not the actual language.

BF only has 8 instructions, represented by a single character each:

> Moves the memory pointer forward
< Moves the memory pointer backwards
+ Increments the value at the current memory pointer
- Decrements the value at the current memory pointer
. Outputs the value at the current memory pointer
, Reads a single byte from some input, storing it at the memory pointer
[ Go to the next instruction if the byte at the memory pointer is not 0, otherwise move it past the matching ] instruction
] Go to the instruction after the matching [ if the byte at the memory pointer is not 0, else move it past the ]

So basically with BF you have three objects: Some Memory, a Pointer to a byte in the Memory and an instruction pointer. The memory can be implemented as a simple byte array and the memory pointer as an index into it.

Let's say we have 4 bytes of memory. At the beginning, everything is initialized to 0 (P = Memory Pointer):
Memory: 00 00 00 00 P:0

Now we execute the following code:
++>+++>++++>+<-

The result is this:
Memory: 02 03 03 01 P:2

The [ and ] instructions are more complicated, but essentially they form a while loop - while(memory[pointer] != 0) { // instructions between the brackets }. One example is to increase the value of a register without having many + signs. Let's say we want to increase memory[1] to 25. For that, we use memory[0] as a counter:
+++++[>+++++<-]

After running this, the memory will look like this:
Memory: 00 19 00 00 P:0
(Remember that this is hex, so 19h is 25 decimal)

What did we do? We increased memory[0] to 5. Then we entered a while-block. First, we move the memory pointer to memory[1] and increase it 5 times, to 5. Then we move back to memory[0] and decrease it to 4. The ] then checks if memory[0] is 0, which it isn't, so we return to the instruction after the [, which means we move again to memory[1], increase it 5 more times, move back to memory[0], decrease it to 3 and repeat until memory[0] is 0.

As said, not terribly complicated once you understood how the brackets and the memory pointer work. Let's look at Hello World in BF:
++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.
(Note that this needs 5 bytes of memory instead of 4)

This looks scary, but is also quite simple if you are familiar with ASCII Encoding. The . command outputs the value at the memory pointer to the console and the console then needs to decide what to do with it. Again, everything in a PC is essentially just a stream of bytes, and a byte can mean different things to different applications. The Console tries to display the byte by looking up which character maps to a byte. In ASCII, the letter H is mapped to byte 48h or 72 decimal. So if you want to display the letter H, you need to increase the value 72 times.

The Hello World example uses the while loop to save a lot of +. Also instead of using 13 Bytes (The Letters H, e, l, l, o, [space], W, o, r, l, d, !, [line break]), the example only uses 5 bytes by reusing existing memory locations. Look up the values for each letter in the ASCII table and it should become clear how this works. The Wikipedia Article contains a lot more explanations.

Now, before I started building a real compiler, I first wanted to see how this would work and could be implemented, so I started with an interpreter in C# that reads text from one TextBox and outputs results into another one (no support for the , instruction here though). The code for that is somewhat naive, but it helps understanding how this could work:

byte[] memory = new byte[Int16.MaxValue];
Int16 memoryPointer = 0;

private void Execute(char[] code)
{
    int instructionPointer = 0;
    while (instructionPointer < code.Length)
    {
        var currentCommand = code[instructionPointer];
        instructionPointer++;
        switch (currentCommand)
        {
            case '>':
                memoryPointer++;
                break;
            case '<':
                memoryPointer--;
                break;
            case '+':
                memory[memoryPointer]++;
                break;
            case '-':
                memory[memoryPointer]--;
                break;
            case '.':
                // Weird casting because AppendText wants a string, but we have to
                // convert the byte to a char first.
                textBox2.AppendText(((char)memory[memoryPointer]).ToString());
                break;
            case ',':
                // read 1 byte of input - not implemented
                break;
            case '[':
                int currentIndex = instructionPointer;
                int bracketCounter = 1;
                while (bracketCounter > 0 &&
                       instructionPointer < code.Length &&
                       code[instructionPointer] > 0)
                {
                    if (code[instructionPointer] == '[') bracketCounter++;
                    if (code[instructionPointer] == ']') bracketCounter--;
                    instructionPointer++;
                }
                if (bracketCounter == 0)
                {
                    // Change previous ] to \0 so that the recursive call
                    // returns at the end of the block
                    code[instructionPointer - 1] = '\0';
                    while(memory[memoryPointer] > 0) {
                       Execute(code.Skip(currentIndex).ToArray());
                    }
                    code[instructionPointer - 1] = ']';
                    break;
                }
                break;
            case ']':
                // The ] bracket is normally handled as part of the [ case above
                throw new InvalidOperationException("Unbalanced Brackets");
                break;
            case '\0':
                return;
        }
    }
}

First, we declare our memory. As this is .net and I didn't want to use unsafe code/pointers, I'm declaring an Array of Bytes. The total size is 32767 bytes. Then I declare a memoryPointer which is just an index into the memory array (it's not a pointer in the traditional sense, I just called it that).

Then we have the Execute method, which takes an array of bytes - the BF code. We iterate through this array and perform the appropriate action for each instruction. This should be really straight-forward, except for the [, ] and \0 cases. If we encounter a [, we go all the way to the end in order to find until the instruction pointer hits the end of the code or a 0-byte. The bracketCounter is there for us to make sure our brackets are balanced - you can nest brackets, but they have to be balanced. [++[++]++ is illegal code as the first [ does not have a matching ]. As soon as our bracketCounter hits 0 again, we stop this loop. The important side effect of this: The instructionPointer is now behind the ].

We replace the ] with a 0-byte and recursively call the Execute function with the fraction of the code after the [ (that's the job of the currentIndex variable: Remember where the [ was). The recursive call will return as soon as it hits the 0-byte. The while-loop then repeats this recursive call until the memory at the pointer is 0 and re-inserts the ].

If you execute this in a WinForms application, your textBox2 should display Hello World!.

Phew, I hope you could follow me all the way down here and got an understanding of how BF works. Tomorrow we will translate the BF code into C# code and look at the generated IL, and then I will post articles explaining what each of these instructions do, how IL OpCodes work, what the fact that .net is Stack based actually means, and then we will have a real BF -> .net Compiler.

Expect a fun little series 🙂

Comments (2)

[...] the last posting, we looked at BF as a language and how to write an interpreter in C#. The point of doing that was [...]

[...] final two commands we're looking at are [ and ]. Their description in the first article was a bit cryptic, [ was described as Go to the next instruction if the byte at the memory pointer [...]