Writing a BF Compiler for .net (Part 5: [ and ] – while loops in IL)
The final two commands we're looking at are [ and ]. Their description in the first article was a bit cryptic, [ was described as
Go to the next instruction if the byte at the memory pointer is not 0, otherwise move it past the matching ] instruction
while ] was described as
Go to the instruction after the matching [ if the byte at the memory pointer is not 0, else move it past the ]
In C# code, this is a lot simpler:
// BF Code for this: [-] while (memory[pointer] > 0) { // Instructions between [ and ] // The following instruction is only to have a body memory[pointer]--; }
It's a while-loop. It's important to note that we have to use a pre-test loop, that is a loop that checks the condition before executing the loop (as opposed to a do-while loop which executes the code block at least once and checks afterwards).
So how does a while loop look in .net IL?
// See note below regarding .s suffix on br.s and bgt.s IL_0000: br.s IL_001f // This is the memory[pointer]-- instruction IL_0002: ldsfld uint8[] BFHelloWorldCSharp.Program::memory IL_0007: ldsfld int16 BFHelloWorldCSharp.Program::pointer IL_000c: ldelema [mscorlib]System.Byte IL_0011: dup IL_0012: ldobj [mscorlib]System.Byte IL_0017: ldc.i4.1 IL_0018: sub IL_0019: conv.u1 IL_001a: stobj [mscorlib]System.Byte // This is the while loop IL_001f: ldsfld uint8[] BFHelloWorldCSharp.Program::memory IL_0024: ldsfld int16 BFHelloWorldCSharp.Program::pointer IL_0029: ldelem.u1 IL_002a: ldc.i4.0 IL_002b: bgt.s IL_0002
GOTO considered harmful?
Okay, this looks complicated, but it is easy. To explain it, we have to open Pandora's Box and look at the dirtiest secret there is in development: At Machine Level, GOTOs are essential.
Ha, take that Dijkstra!
Regardless how much you abstract it away, control structures like while have to be translated as "GOTO's", or more precisely as jumps to addresses to continue execute code from. In .net, this is not called GOTO though, it's called Branch.
Our code has three parts: A single GOTO/Branch instruction at the beginning, the body of the loop (in our case the single memory[pointer]-- instruction) and then the while check.
So we start with br.s, which is described as
Unconditionally transfers control to a target instruction (short form).
In other words, this is a GOTO and it goes to IL_001f. The code starting from here does the while-check: Load memory and pointer onto the stack. Then load the value of memory[pointer] onto the stack as Unsigned 8-Bit Int. Afterwards, push the number 0 to the stack.
Our evaluation stack now contains the value of memory[pointer] and the number 0. Then we have the new bgt.s command:
Transfers control to a target instruction (short form) if the first value is greater than the second value.
In other words and Pseudocode: if(memory[pointer] > 0) goto IL_0002;
The code starting from IL_0002 is our memory[pointer]-- instruction which will be executed and then we'll do the while-check again.
In Debug mode, the bgt instruction is not used. Instead, the check is done much more complicated. Feel free to look it up using ILDASM, but Debug Mode uses this C# Pseudocode to capture the result of the comparison into a local variable:
bool DoJump = memory[pointer] > 0; if(DoJump) goto IL_0002;
This is useful for Debugging (who would've thought it, given that it's a debug build?), but rather heavy compared to Release mode (8 instructions and a local variable compared to 5 instructions without).
Looking at that, you can easily imagine what the difference between a while and a do while loop is: The do while loop does not have the br.s instruction at the beginning. It therefore executes the method body at least once before it enters the while-check.
Before I end this post, I want to talk about short form commands.
What is "Short Form"?
If you look at the IL Commands, some say "Short Form". What does this mean? Well, normally all addresses are 32 Bit, that is 4 Bytes. If you want an unconditional jump, you would use the br command with the target address. However, this means you'll have 5 bytes in the target file - 1 for the Br Instruction and 4 for the target. As this instruction is so common, it would be a massive overhead to always have to write 5 bytes to the file.
Short Form commands only take 1 byte for the target address. The target here is described as
1-byte signed offset from the beginning of the instruction following the current instruction
So instead of giving an absolute address, we give a relative address to jump to instead. This only works if the target is less than ~125 bytes away (signed offset!) of course, so it's a lot less flexible and your compiler needs to know the distance between the target and the jump instructions. However, the savings are huge as short form only requires 2 bytes, less than half of the full instructions.
This concludes the command overview. Part 6 will finally show how we will write our compiler.