Writing a BF Compiler for .net (Part 6: The actual compiler)

In the last 5 parts we looked at each of the 8 instructions, how it works in BF, how it would work in C# and how it looks in IL. Now, what is a compiler actually?

Quick Compiler Theory
Basically our end result has to be a series of bytes. For example, the IL Instruction ldc.i4.0 is compiled as 0x16. Some instructions take parameters. For example, the br instruction is compiled as 0x38 0x?? 0x?? 0x?? 0x??. That's 5 bytes: 0x38 for the actual br and then 4 bytes for the jump target, stored as a signed offset relative to the instruction that follows the br.
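
The opcode bytes quoted above can be checked from C#, because System.Reflection.Emit.OpCodes exposes the encoding of every instruction. This is only a small sketch; the class name OpCodeBytes and its Describe method are made up for illustration.

```csharp
using System;
using System.Reflection.Emit;

public static class OpCodeBytes
{
    // Formats an opcode as "name = 0xNN, operand: <kind of operand>".
    public static string Describe(OpCode op)
    {
        return string.Format("{0} = 0x{1:X2}, operand: {2}", op.Name, op.Value, op.OperandType);
    }
}
```

Describe(OpCodes.Ldc_I4_0) reports 0x16 with no operand (InlineNone), while Describe(OpCodes.Br) reports 0x38 with an InlineBrTarget operand, which is the 4-byte jump target mentioned above.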

However, there is more. .net assemblies are usually .exe or .dll files and pure IL may only help you on mono (not sure if that's true and if you can feed mono with a pure IL file). If you're running on Windows, you need some additional "stuff" to make a proper .exe - see this Article.

Now, how do we write a compiler? In what language do we write compilers? If you looked into Compiler development you may have heard the term "bootstrapping" which means "Write a compiler for language X in language X through an intermediary compiler written in language Y". So if you want to write the very first C# compiler you can't write it in C# as there are no C# compilers to compile your C# code with, so you use another language. You could use C, C++, Delphi, Assembler or any other language.

Needless to say, this whole process is very complicated and involves a lot of plumbing (many lines of code for simple things because you have to manage everything). Luckily, Microsoft gave us a fantastic built-in functionality to compile .net Assemblies: The System.Reflection.Emit Namespace.

Reflection.Emit and how a Console Application works
What does Reflection.Emit allow us to do? It allows us to emit IL instructions into an assembly - we just feed it the IL instructions and it takes care of everything else. As a side effect, this allows us to write our compiler in any .net language! In our case, it means we will write our Compiler in C#.

But, what will we actually need to emit? Can we really just take the BF code and emit IL from it? No, not really. Each .net Assembly needs an Entry Point: A method. Create a new C# Console Application and notice that you start with a method "Main":

static void Main(string[] args)
{
}

Try renaming the method to something else and compile your Application. You will get this error message:
Program 'test.exe' does not contain a static 'Main' method suitable for an entry point

Each C# Application needs a method called Main. This is the entry point of the Application, this is the very first method that .net calls. Does it have to be called "Main"?

Not really. In .net, any method can be an entry point. It's just that the C# guys at Microsoft decided that for C# it shall be called Main, possibly because that's how it's usually called in C/C++.

Additionally, you may notice that the Main method is part of a class called Program. Do methods have to be part of a Type? In C# yes, but in C++/CLI you can have global methods. So does .net as a whole require a Type to hold methods? Yes, it does. Let me quote from Partition I, 9.8: "The CTS does not have the notion of global statics: all statics are associated with a particular class.". There is an annotation to that: "[...] the CLI supports languages that supports global statics by creating a special class names <module>".

So to conclude, we need a class which contains a method. So, what will we actually compile now?

The BF Compiler we are going to write will create a Console Application. Specifically, we will create this class (shown as a C# class for illustration purposes):

public static class Program
{
    private static byte[] memory = new byte[short.MaxValue]; // 32767 bytes
    private static short pointer = 0;

    public static void Execute() {
        // Our BF Code here
    }
}

As you see, we have a byte-array that serves as our memory. We offer BF a generous 32 kilobyte of memory. Additionally, we have a pointer into the memory so that we know where we are in the memory. Our containing class is called Program and our entry point is called Execute (just to show that we are so cool, we don't need a method called Main). This method will then contain our compiled BF Code.

Creating the bare minimum compiler skeleton
Let's start simple. Create a new C# Console Application and use this code for your main method:

static void Main(string[] args)
{
    string sourceFileName = args[0];
    string outputName = args[1];

    var sourceCode = File.ReadAllText(sourceFileName);
    BFCompiler.Compile(outputName, outputName + ".exe", sourceCode);
    Console.WriteLine("{0} compiled to {1}.exe",sourceFileName,outputName);
}

Then, create a new class BFCompiler:

public static class BFCompiler
{

    public static void Compile(string assemblyName, string outputFileName, string sourceCode)
    {
    }
}

Create a new text file and call it "hello.bf":

++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.

In Visual Studio, set the "Copy to Output Directory" property of hello.bf to "Copy Always" and compile. You can now invoke your compiler using
bfnetc.exe hello.bf hello
and it should print "hello.bf compiled to hello.exe" on the command line.

Okay, that was our skeleton: Read the .bf source code, call the compiler with the desired assembly name, output filename and source code.

Creating an Assembly
Now we need to actually create the assembly. Change BFCompiler.Compile to contain this code:

public static void Compile(string assemblyName, string outputFileName, string sourceCode)
{
    var asm = AppDomain.CurrentDomain.DefineDynamicAssembly(new AssemblyName(assemblyName),
                                                AssemblyBuilderAccess.Save);

    var mod = asm.DefineDynamicModule(assemblyName, outputFileName);

    var mainClassTypeName = assemblyName + ".Program"; // e.g., hello.Program
    var type = mod.DefineType(mainClassTypeName, TypeAttributes.Class | TypeAttributes.Sealed | TypeAttributes.Abstract | TypeAttributes.Public);

    var pointerField = type.DefineField("pointer", typeof (Int16), FieldAttributes.Static | FieldAttributes.Private);
    var memoryField = type.DefineField("memory", typeof(Byte[]), FieldAttributes.Static | FieldAttributes.Private);

    var constructor = type.DefineConstructor(MethodAttributes.Static, CallingConventions.Standard, null);
    var cctorIlGen = constructor.GetILGenerator();
    GenerateConstructorBody(cctorIlGen,pointerField,memoryField);

    var mainMethod = type.DefineMethod("Execute", MethodAttributes.Public | MethodAttributes.Static);
    var ilGen = mainMethod.GetILGenerator();

    ilGen.Emit(OpCodes.Ret);
    type.CreateType();
    asm.SetEntryPoint(mainMethod);
    asm.Save(outputFileName);
}

Whoa, that looks complicated! (And also it doesn't compile as GenerateConstructorBody doesn't exist). Let us walk through the code step by step:

var asm = AppDomain.CurrentDomain.DefineDynamicAssembly(new AssemblyName(assemblyName),
                                            AssemblyBuilderAccess.Save);

var mod = asm.DefineDynamicModule(assemblyName, outputFileName);

var mainClassTypeName = assemblyName + ".Program"; // e.g., hello.Program
var type = mod.DefineType(mainClassTypeName, TypeAttributes.Class | TypeAttributes.Sealed | TypeAttributes.Abstract | TypeAttributes.Public);

We start by creating a new Assembly that has a certain name (in our example that would be "hello"). This Assembly exists in the current Application Domain (This could be a security risk - please read up about application domains in .net if you want to know more). The second parameter specifies what can be done with this assembly. This is another security feature, as we specified that "The dynamic assembly can be saved, but not executed." The "but not executed" part does not mean that we generate an assembly that can't be run (that would be stupid, right?). It merely means that the Assembly (which only exists in memory for now) cannot be executed in the context of our running compiler. We can save it to disk though.

We then specify a module. Assemblies are collections of modules, and each assembly needs to contain at least one module. Usually, it contains exactly one module as Visual Studio doesn't support us in packaging multiple modules into one assembly. If you come from a C/C++ background, think of modules as static compilations. If you don't come from a C/C++ background, read up on ilmerge and netmodules.

So our module is called "hello" and has a filename of "hello.exe".

We then create a new class (or Type) called "Program". This is a public sealed abstract class. Uhm... "sealed abstract"? Well, .net does not support static classes. But C# does? How? By cheating: A static class in C# is a class that is both abstract and sealed (something that's legal in .net but illegal in C#). By declaring it abstract it is ensured that nobody can instantiate it and by making it sealed it is ensured that nobody can subclass it. (For some trivia around C# static classes, see this article).
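
The "abstract and sealed" trick is easy to observe through reflection. The following is a minimal sketch; StaticClassProbe and its nested Sample class are hypothetical names that exist only for this illustration.

```csharp
using System;

public static class StaticClassProbe
{
    // An ordinary C# static class, used only as something to inspect.
    private static class Sample
    {
    }

    // True when the type carries both the abstract and the sealed flag.
    public static bool IsAbstractAndSealed(Type type)
    {
        return type.IsAbstract && type.IsSealed;
    }

    // Exposes the check for the private Sample class.
    public static bool SampleIsAbstractAndSealed()
    {
        return IsAbstractAndSealed(typeof(Sample));
    }
}
```

SampleIsAbstractAndSealed() returns true, and so does IsAbstractAndSealed(typeof(Console)), since Console is a static class as well. A regular sealed class such as string returns false because it is not abstract.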

The other thing: What is the mainClassTypeName doing? It generates a namespace for us. Our typename in this example isn't Program - it's "hello.Program". During compilation, AssemblyBuilder will create a namespace "hello" and a Type "Program" automatically.

So now we have the "outer shell" of our compiled Application: An assembly called "hello" containing a module "hello.exe" containing a public "static" (abstract sealed) class "Program" in a namespace "hello".

Next, we define two fields in the Program Class:

var pointerField = type.DefineField("pointer", typeof (Int16), FieldAttributes.Static | FieldAttributes.Private);
var memoryField = type.DefineField("memory", typeof(Byte[]), FieldAttributes.Static | FieldAttributes.Private);

These fields are private and static and of type short (=Int16) and Byte-Array respectively. But wait, don't we specify the size of them?

No, we don't. See, the C# way of writing

private static byte[] memory = new byte[short.MaxValue];

is just a convenience feature. If you compile this assembly and look at the IL, you will notice that the assignment is done in a static constructor. This is why we define a constructor:

var constructor = type.DefineConstructor(MethodAttributes.Static, CallingConventions.Standard, null);

The first parameter declares that we want to create a static constructor (.cctor). The second parameter selects the calling convention: Standard is the ordinary one (as opposed to, for example, VarArgs for a method that takes variable arguments). The third parameter defines any arguments the method takes - static constructors are always parameterless, hence we pass null.
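
To see that a C# field initializer really ends up in a static constructor, you can ask reflection for the type initializer. A small sketch, with made-up class names (InitializerProbe, WithInitializer, WithoutInitializer):

```csharp
using System;

public static class InitializerProbe
{
    // A field initializer like the one shown above; the C# compiler moves it into a .cctor.
    private static class WithInitializer
    {
        public static byte[] Memory = new byte[short.MaxValue];
    }

    // No static state at all, so there is nothing to initialize.
    private static class WithoutInitializer
    {
    }

    // Type.TypeInitializer returns the static constructor, or null if the type has none.
    public static bool HasStaticConstructor(Type type)
    {
        return type.TypeInitializer != null;
    }

    public static bool InitializedClassHasOne()
    {
        return HasStaticConstructor(typeof(WithInitializer));
    }

    public static bool EmptyClassHasOne()
    {
        return HasStaticConstructor(typeof(WithoutInitializer));
    }
}
```

InitializedClassHasOne() returns true even though the class contains no explicitly written constructor, while EmptyClassHasOne() returns false.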

We have our constructor method now, next we need to put some code into it:

var cctorIlGen = constructor.GetILGenerator();
GenerateConstructorBody(cctorIlGen,pointerField,memoryField);

GetILGenerator is the magic that makes compilers happen. This thing allows you to write code into the method. Add a new method to the BFCompiler class:

private static void GenerateConstructorBody(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField)
{
    // construct the memory as byte[short.MaxValue]
    ilGen.Emit(OpCodes.Ldc_I4,0x7fff);
    ilGen.Emit(OpCodes.Newarr,typeof(Byte));
    ilGen.Emit(OpCodes.Stsfld,memoryField);

    // Construct the pointer as short = 0
    ilGen.Emit(OpCodes.Ldc_I4_0);
    ilGen.Emit(OpCodes.Stsfld, pointerField);

    ilGen.Emit(OpCodes.Ret);
}

Whoa, another highly complex method. Well, actually it's rather simple, just many lines of code. Remember, you are now directly generating IL, so no convenience features. You have to do all the plumbing yourself. What are we doing? First, we call ldc.i4 0x7fff which pushes the 32-Bit value 0x7fff (=32767 or short.MaxValue) to the stack. Then we call newarr typeof(byte) which creates a new Byte Array. Important: It's typeof(Byte), not typeof(Byte[])! newarr pops the length from the stack, so this now created a new Byte[32767] for us.

stsfld puts it into the memoryField.

So the static constructor now looks like this:

static Program(){
    memory = new Byte[0x7fff];
}

The next two instructions of the constructor initialize the pointer field: Push 0 to the stack, replace pointer with the value on the stack.

You may wonder about the last instruction:

ilGen.Emit(OpCodes.Ret);

This generates a return; statement. Why a return statement on a void method? Is this required? Yes, it is: IL has no implicit return, and control is not allowed to simply fall off the end of a method body. It is also a precondition for verifiable assemblies. Veri-what? Verifiable Code is a security feature in .net. It essentially allows tools (like PEVerify) to determine that no security boundaries are violated (that is: no unsafe pointer manipulation and a well defined "flow" through the application). In a nutshell: By emitting the ret statement at the end of a method, any static analysis tool knows for sure that at this point, control is returned to the calling method. So always generate ret at the end of each method, even for void methods.

So that's our constructor. We now have a class Program with two fields (pointer and memory) that are properly initialized. Great! Now we create the main method:

var mainMethod = type.DefineMethod("Execute", MethodAttributes.Public | MethodAttributes.Static);
var ilGen = mainMethod.GetILGenerator();
ilGen.Emit(OpCodes.Ret);

This is almost the same as above, except we call DefineMethod. The method is called "Execute" and is public static. We then get the ILGenerator for the method and emit a single instruction: Ret. Leaving the discussion about ret above aside for a moment: does Reflection.Emit itself insist on the ret? No, it only insists that the method contains at least one instruction. Why?

Because each method must have a method body. If you generate a method without any instructions in the body, you will get an exception:
System.InvalidOperationException: Method 'Execute' does not have a method body.

So we have to emit something, and we just return. A nop (no-operation) would also have silenced the exception, but a method that lets control run off its end is not valid IL, so we emit ret.

The next three instructions then actually create the Type and Assembly:

type.CreateType();
asm.SetEntryPoint(mainMethod);
asm.Save(outputFileName);

CreateType essentially "compiles" the type - until this moment, the Type only exists as some description, but we can't use it from other code.

SetEntryPoint then points to the Execute method. If you don't set an entry point then the Assembly is still generated, but you can't run it ("hello.exe is not a valid Win32 application").

Save then finally writes the Assembly to disk.

Run your compiler again and you should get a hello.exe which does nothing, but at least it runs! Look at it in ildasm or Reflector and you should see this code:

.class public abstract auto ansi sealed Program
   extends [mscorlib]System.Object
{
   .method privatescope specialname rtspecialname static void .cctor() cil managed
   {
       .maxstack 1
       L_0000: ldc.i4 0x7fff
       L_0005: newarr uint8
       L_000a: stsfld uint8[] Program::memory
       L_000f: ldc.i4.0
       L_0010: stsfld int16 Program::pointer
       L_0015: ret
   }

   .method public static void Execute() cil managed
   {
       .entrypoint
       .maxstack 0
       L_0000: ret
   }

   .field private static uint8[] memory
   .field private static int16 pointer
}

Whoa, nice! Notice that Program derives from System.Object - we don't have to do that manually, Reflection.Emit does that for us automatically. Also, we don't have to worry about all this other stuff that is required to create a Windows Executable - AssemblyBuilder does that as well.

Compiling the BF Code
We now have a working .net assembly that sets up some fields, but it doesn't do much. So let's do something with the BF Source Code! Add the following code to the BFCompiler.Compile method:

var ilGen = mainMethod.GetILGenerator();
// New Code
foreach(char c in sourceCode)
{
    switch (c)
    {
        case '>':
            GenerateMovePointerForwardInstruction(ilGen, pointerField, memoryField);
            break;
        case '<':
            GenerateMovePointerBackwardsInstruction(ilGen, pointerField, memoryField);
            break;
        case '+':
            GenerateIncrementInstruction(ilGen, pointerField, memoryField);
            break;
        case '-':
            GenerateDecrementInstruction(ilGen, pointerField, memoryField);
            break;
        case '.':
            GenerateWriteInstruction(ilGen, pointerField, memoryField);
            break;
        case ',':
            GenerateReadInstruction(ilGen, pointerField, memoryField);
            break;
        case '[':
            GenerateOpenBracketInstruction(ilGen, pointerField, memoryField);
            break;
        case ']':
            GenerateCloseBracketInstruction(ilGen, pointerField, memoryField);
            break;
    }
}
// New Code
ilGen.Emit(OpCodes.Ret);
type.CreateType();

We are walking through the sourceCode, character by character. For each of the 8 known BF instructions we call another function that generates the proper IL. Let's quickly look at each of them:

< and > instructions

private static void GenerateMovePointerForwardInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField)
{
    ilGen.Emit(OpCodes.Ldsfld, pointerField);
    ilGen.Emit(OpCodes.Ldc_I4_1);
    ilGen.Emit(OpCodes.Add);
    ilGen.Emit(OpCodes.Conv_I2);
    ilGen.Emit(OpCodes.Stsfld, pointerField);
}

private static void GenerateMovePointerBackwardsInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField)
{
    ilGen.Emit(OpCodes.Ldsfld, pointerField);
    ilGen.Emit(OpCodes.Ldc_I4_1);
    ilGen.Emit(OpCodes.Sub);
    ilGen.Emit(OpCodes.Conv_I2);
    ilGen.Emit(OpCodes.Stsfld, pointerField);
}

We've been through them in Part 3: Load the pointer value, add/subtract 1 from it, convert the result to an Int16, store it in the pointer field again.

+ and - instructions

private static void GenerateIncrementInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField)
{
    ilGen.Emit(OpCodes.Ldsfld, memoryField);
    ilGen.Emit(OpCodes.Ldsfld, pointerField);
    ilGen.Emit(OpCodes.Ldelema, typeof(Byte));
    ilGen.Emit(OpCodes.Dup);
    ilGen.Emit(OpCodes.Ldobj, typeof(Byte));
    ilGen.Emit(OpCodes.Ldc_I4_1);
    ilGen.Emit(OpCodes.Add);
    ilGen.Emit(OpCodes.Conv_U1);
    ilGen.Emit(OpCodes.Stobj, typeof(Byte));
}

private static void GenerateDecrementInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField)
{
    ilGen.Emit(OpCodes.Ldsfld, memoryField);
    ilGen.Emit(OpCodes.Ldsfld, pointerField);
    ilGen.Emit(OpCodes.Ldelema, typeof(Byte));
    ilGen.Emit(OpCodes.Dup);
    ilGen.Emit(OpCodes.Ldobj, typeof(Byte));
    ilGen.Emit(OpCodes.Ldc_I4_1);
    ilGen.Emit(OpCodes.Sub);
    ilGen.Emit(OpCodes.Conv_U1);
    ilGen.Emit(OpCodes.Stobj, typeof(Byte));
}

These were covered in part 2. Load the memory and the pointer, get the memory-byte at the pointer address, add/subtract 1 from it, convert it to a byte (UInt8), replace it in the byte array.
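
In C# terms, the conv.i2 and conv.u1 instructions correspond to unchecked casts that keep only the low bits of the result. The sketch below shows what that means at the boundaries; WrapAround and its method names are hypothetical and are not part of the compiler.

```csharp
public static class WrapAround
{
    // add + conv.i2: keep the low 16 bits and interpret them as a signed value.
    public static short NextPointer(short pointer)
    {
        return unchecked((short)(pointer + 1));
    }

    // sub + conv.i2
    public static short PreviousPointer(short pointer)
    {
        return unchecked((short)(pointer - 1));
    }

    // add + conv.u1: keep the low 8 bits, so 255 + 1 becomes 0.
    public static byte Increment(byte cell)
    {
        return unchecked((byte)(cell + 1));
    }

    // sub + conv.u1: 0 - 1 becomes 255.
    public static byte Decrement(byte cell)
    {
        return unchecked((byte)(cell - 1));
    }
}
```

The memory cells wrap around cleanly between 0 and 255. The pointer wraps too, but into negative values: NextPointer(short.MaxValue) is short.MinValue, and PreviousPointer(0) is -1. Neither is a valid index into the memory array, so a BF program that walks off either end should be expected to fail with an IndexOutOfRangeException on its next memory access rather than wrap around.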

. and , instructions

private static void GenerateWriteInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField)
{
    ilGen.Emit(OpCodes.Ldsfld, memoryField);
    ilGen.Emit(OpCodes.Ldsfld, pointerField);
    ilGen.Emit(OpCodes.Ldelem_U1);
    ilGen.Emit(OpCodes.Call,
               typeof(Console).GetMethod("Write", BindingFlags.Public | BindingFlags.Static, null,
                                          new Type[] { typeof(Char) }, null));
}

private static void GenerateReadInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField)
{
    // Push the array and the index first; stelem expects them below the value.
    ilGen.Emit(OpCodes.Ldsfld, memoryField);
    ilGen.Emit(OpCodes.Ldsfld, pointerField);
    // Console.Read pushes its int result. No ldelem here: we store, we don't load.
    ilGen.Emit(OpCodes.Call, typeof(Console).GetMethod("Read", BindingFlags.Public | BindingFlags.Static));
    ilGen.Emit(OpCodes.Conv_U1);
    ilGen.Emit(OpCodes.Stelem_I1);
}

Part 4 covered these two, but now it gets a bit interesting. We need to call two methods, Console.Read and Console.Write.

Emitting OpCodes.Call requires a MethodInfo that identifies the method to call.

For Console.Read, we get the Console Type through typeof. Then we get a Method called "Read" which is defined as Public and Static. We pass this to the Call opcode. Console.Read puts its return value on the stack, so we convert it to a Byte and store it in memory[pointer]. For the store to work, memory and pointer have to be on the stack before the call.

With Console.Write it is a little bit more complicated because there are a lot of overloads of this method. Remember that overload resolution is a convenience function of a compiler - as we are generating raw IL we need to precisely define which function we want to have. So we start again from the Console type, getting a Method called "Write". The fourth parameter is an array of types: The Argument signature. As we want the Console.Write(char) overload, we create a new Array containing the Char type. How do we pass parameters to Console.Write? By putting memory[pointer] onto the stack before.
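
The need to spell out the overload can be demonstrated in isolation. This sketch (OverloadLookup is a made-up helper class) uses the shorter GetMethod(name, types) overload instead of the five-parameter one used in the compiler; both should select the same method.

```csharp
using System;
using System.Reflection;

public static class OverloadLookup
{
    // Looking up "Write" by name alone fails because Console has many Write overloads.
    public static bool NameAloneIsAmbiguous()
    {
        try
        {
            typeof(Console).GetMethod("Write");
            return false;
        }
        catch (AmbiguousMatchException)
        {
            return true;
        }
    }

    // Passing the parameter types selects exactly one overload, here Write(char).
    public static MethodInfo FindWriteChar()
    {
        return typeof(Console).GetMethod("Write", new[] { typeof(char) });
    }
}
```

NameAloneIsAmbiguous() returns true, and the MethodInfo returned by FindWriteChar() has a single parameter of type char.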

[ and ] instructions
Part 5 covered the while loop. As you know, we need to create br instructions to jump into some parts of the method. Do we have to dynamically calculate the locations? No, we don't.

Add a new field to the BFCompiler class:

private static readonly Stack<System.Reflection.Emit.Label> _bracketStack = new Stack<System.Reflection.Emit.Label>();

Then add this instruction to the Compile method:

public static void Compile(string assemblyName, string outputFileName, string sourceCode)
{
    _bracketStack.Clear(); // Ensure that the BracketStack is clear (if Compile is called multiple times)

Then add these two methods to implement the [ and ] commands:

private static void GenerateOpenBracketInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField)
{
    var firstLabel = ilGen.DefineLabel();
    var secondLabel = ilGen.DefineLabel();
    ilGen.Emit(OpCodes.Br, secondLabel);
    ilGen.MarkLabel(firstLabel);
    _bracketStack.Push(firstLabel);
    _bracketStack.Push(secondLabel);
}

private static void GenerateCloseBracketInstruction(ILGenerator ilGen, FieldInfo pointerField, FieldInfo memoryField)
{

    var secondLabel = _bracketStack.Pop();
    var firstLabel = _bracketStack.Pop();
    ilGen.MarkLabel(secondLabel);
    ilGen.Emit(OpCodes.Ldsfld, memoryField);
    ilGen.Emit(OpCodes.Ldsfld, pointerField);
    ilGen.Emit(OpCodes.Ldelem_U1);
    ilGen.Emit(OpCodes.Ldc_I4_0);
    ilGen.Emit(OpCodes.Bgt, firstLabel);
}

This looks terribly complicated, but it's relatively simple. We are essentially implementing a GOTO. But we are not jumping to an address, we are jumping to a Label. The [ instruction defines two labels, firstLabel and secondLabel. These are essentially just uninitialized variables. We then emit a jump to the secondLabel - this is not defined yet, but it doesn't have to be.

MarkLabel then defines the Label firstLabel at the current location in the method. But wait, something is missing: We need to define secondLabel and we need to jump back to the first Label. We do this on the ] instruction. But we somehow need to pass the two labels from the [ to the ] function, and we need to support nested [ ] loops. This is why we defined a Stack<Label> in our Compiler: Whenever [ is called, we push the two labels onto it. Whenever ] is called, we pop the last two labels out of it.

So with the two labels available to the ] instruction, we mark the location of the secondLabel, then generate the comparison of memory[pointer] with 0. What we do was outlined in Part 5: If memory[pointer] is greater than 0, we jump to firstLabel.
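
If you want to play with the label mechanics without generating a whole assembly, the same br/MarkLabel/bgt pattern can be tried on a DynamicMethod, which can be invoked directly. This is a sketch under assumptions: it is not part of the compiler, it uses DynamicMethod instead of AssemblyBuilder, and it loops over a local variable rather than memory[pointer]. LabelLoopDemo and BuildCountdown are hypothetical names.

```csharp
using System;
using System.Reflection.Emit;

public static class LabelLoopDemo
{
    // Builds the IL equivalent of:
    //   int n = arg; int count = 0; while (n > 0) { count++; n--; } return count;
    public static Func<int, int> BuildCountdown()
    {
        var method = new DynamicMethod("Countdown", typeof(int), new[] { typeof(int) });
        ILGenerator il = method.GetILGenerator();

        LocalBuilder n = il.DeclareLocal(typeof(int));
        LocalBuilder count = il.DeclareLocal(typeof(int));
        Label bodyLabel = il.DefineLabel();  // plays the role of firstLabel
        Label checkLabel = il.DefineLabel(); // plays the role of secondLabel

        // n = arg; count = 0;
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Stloc, n);
        il.Emit(OpCodes.Ldc_I4_0);
        il.Emit(OpCodes.Stloc, count);

        // The '[' part: jump forward to a label that has not been marked yet.
        il.Emit(OpCodes.Br, checkLabel);
        il.MarkLabel(bodyLabel);

        // Loop body: count++; n--;
        il.Emit(OpCodes.Ldloc, count);
        il.Emit(OpCodes.Ldc_I4_1);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Stloc, count);
        il.Emit(OpCodes.Ldloc, n);
        il.Emit(OpCodes.Ldc_I4_1);
        il.Emit(OpCodes.Sub);
        il.Emit(OpCodes.Stloc, n);

        // The ']' part: mark the check and jump back while n > 0.
        il.MarkLabel(checkLabel);
        il.Emit(OpCodes.Ldloc, n);
        il.Emit(OpCodes.Ldc_I4_0);
        il.Emit(OpCodes.Bgt, bodyLabel);

        il.Emit(OpCodes.Ldloc, count);
        il.Emit(OpCodes.Ret);

        return (Func<int, int>)method.CreateDelegate(typeof(Func<int, int>));
    }
}
```

Calling BuildCountdown()(5) runs the body five times, and an argument of 0 skips the body entirely, just like a BF loop that is entered with memory[pointer] at 0.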

Compile your Application again and run the compiler on hello.bf again. If everything went fine, you should now be able to run the generated hello.exe and see a satisfying "Hello World!" on the console.

Congratulations - you now have a BF compiler that generates a .net Console Application!

You can find the source code as a Visual Studio 2010 solution on http://github.com/mstum/bfnetc

This could conclude this series, but I'm going to write a Part 7 containing a few more considerations: EXE vs. DLL, Release vs. Debug builds and some more.

The one dynamic language I think Microsoft needs to embrace in .net…

So a few days ago, Jimmy Schementi announced the death of IronRuby. Oh, sorry, "IronRuby isn't dead, it's just back in the hands of the community" which is essentially the same.

Now, while there may be a chance for IronRuby to survive, I personally think it's not something that Microsoft should do, so I think it makes sense for them to kill it off internally. Same for IronPython. I think the DLR is a great addition to .net though, and I think there is a far better language already available.

JavaScript.

Many of us perceive JavaScript as a Browser-only language that is hampered by different implementations in different Browsers (textContent vs. innerText anyone?), something which is being addressed through the use of frameworks like jQuery.

But take it one step further: How many languages do you need to write a Web Application? At least two: JavaScript and whatever backend language you use, for example C#, Java, Ruby or PHP. How do you write Form Validation? You write separate Code for the Client Side and for the Server Side so that users get instant feedback without being able to compromise the system by turning JavaScript off. This is stupid and a severe violation of DRY. Also there are subtle differences between the Code, for example because JavaScript's RegEx Engine works slightly differently from the .net one.

.net is at its core a language-independent technology that doesn't only support but encourages using multiple languages. Sure, there are only 3-4 really supported languages which are rarely mixed (Some people mix C++/CLI with C# or VB.net for some COM stuff, or F# with C# for financial/statistic stuff) and some languages that have some limited support (Boo is my favorite in this category).

So why not do what's logical and embrace JavaScript as a Server-Side technology? Node.js arrived recently and showed it's possible. Microsoft already somewhat supports JavaScript in Visual Studio (although the experience is far less than stellar) and they even had their own bastard child in form of JScript.net.

There is a big discussion whether or not C# is a good language to write Web Applications in. People point at Cucumber and talk about how great Ruby works with the Web while C# feels like a chore because of its static typing, verbose Syntax and need for IoC Containers to do Unit Testing properly. Other people point at the insecurity of dynamic languages, about the lack of compiler errors, about the confusion created by not having to declare variables which leads to subtle bugs like `$total = 0; foreach(item in items) $totel += item.price`.

I say: Use both. Use C# for your backend code, for your Business Classes. Get Compiler Errors when you screw up. Make sure everything needs to be explicitly cast to whatever it has to be and that all variables have to be defined officially before use. For the frontend, use JavaScript. Create your Views and View Models like you would create them in the browser, through DOM Manipulation. Write and maintain your validation code exactly once, in one language, browser and server side. Feel free to do whatever crazy manipulation you need to do without having to declare and cast tons of variables. If you screw up, it's "only" your Views, not your data structure because that's C# code.

Utopia? Maybe. JavaScript is not without its faults and it may not be as elegant as Ruby in places. People would have to learn two languages, and people may ask "Why should we switch to an insecure language if C# served us well for years?" - well, the latter people are usually the ones who don't think they need Unit Testing and believe that WebForms is a perfectly good technology, so let them continue to use it and let the rest of us move forward to face the challenges of 2010 and beyond.

Customers want better apps. They want AJAX, they want snappy, cool looking UI. The Browser market is not a "Make sure it runs in IE and doesn't suck in Netscape 4.78" market anymore. We need to create cool apps, and we need to run them on Internet Explorer, Firefox, Safari on OS X, Safari on iPhone and Android. Our Web Apps need to do more in less time.

And Microsoft is in a great position because they have the foundation already built. Make Visual Studio a kick-ass JavaScript Development tool and bring it to the server side. Give us great debugging because even with FireBug and the IE Development Tools, it still sucks. Look what Node.js did. Look what you are doing with Internet Explorer 9's JS Engine. Look what you have with Active Scripting and look what you did with JScript.net.

It's incredible to see that Microsoft has all this technology already lying around and that no one had the idea to just combine them all together for ASP.net. Sure, Internet Explorer got a lot of (well deserved) crap for issues in its JavaScript implementation, but IE9's previews look really promising.

Because let's face it: Any .net Language not supported by Microsoft in Visual Studio is doomed.

64-Bit Bitfield Cheat Sheet

Just as a cheat sheet for me, a 64-Bit Bitfield in Dec and Hex.

Bit Int Hex
1 1 0x1
2 2 0x2
3 4 0x4
4 8 0x8
5 16 0x10
6 32 0x20
7 64 0x40
8 128 0x80
9 256 0x100
10 512 0x200
11 1024 0x400
12 2048 0x800
13 4096 0x1000
14 8192 0x2000
15 16384 0x4000
16 32768 0x8000
17 65536 0x10000
18 131072 0x20000
19 262144 0x40000
20 524288 0x80000
21 1048576 0x100000
22 2097152 0x200000
23 4194304 0x400000
24 8388608 0x800000
25 16777216 0x1000000
26 33554432 0x2000000
27 67108864 0x4000000
28 134217728 0x8000000
29 268435456 0x10000000
30 536870912 0x20000000
31 1073741824 0x40000000
32 2147483648 0x80000000
33 4294967296 0x100000000
34 8589934592 0x200000000
35 17179869184 0x400000000
36 34359738368 0x800000000
37 68719476736 0x1000000000
38 137438953472 0x2000000000
39 274877906944 0x4000000000
40 549755813888 0x8000000000
41 1099511627776 0x10000000000
42 2199023255552 0x20000000000
43 4398046511104 0x40000000000
44 8796093022208 0x80000000000
45 17592186044416 0x100000000000
46 35184372088832 0x200000000000
47 70368744177664 0x400000000000
48 140737488355328 0x800000000000
49 281474976710656 0x1000000000000
50 562949953421312 0x2000000000000
51 1125899906842624 0x4000000000000
52 2251799813685248 0x8000000000000
53 4503599627370496 0x10000000000000
54 9007199254740992 0x20000000000000
55 18014398509481984 0x40000000000000
56 36028797018963968 0x80000000000000
57 72057594037927936 0x100000000000000
58 144115188075855872 0x200000000000000
59 288230376151711744 0x400000000000000
60 576460752303423488 0x800000000000000
61 1152921504606846976 0x1000000000000000
62 2305843009213693952 0x2000000000000000
63 4611686018427387904 0x4000000000000000
64 9223372036854775808 0x8000000000000000
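
The table follows a simple rule: the value of bit n (counting from 1, as above) is 1 shifted left by n - 1. A small sketch that reproduces any row; BitTable and its methods are hypothetical names.

```csharp
using System;

public static class BitTable
{
    // Value of the n-th bit, counting from 1 like the table does.
    public static ulong ValueOfBit(int bit)
    {
        if (bit < 1 || bit > 64) throw new ArgumentOutOfRangeException("bit");
        return 1UL << (bit - 1);
    }

    // Formats one table row as "bit decimal hex".
    public static string Row(int bit)
    {
        ulong value = ValueOfBit(bit);
        return string.Format("{0} {1} 0x{1:X}", bit, value);
    }
}
```

Row(5) gives "5 16 0x10" and Row(64) gives "64 9223372036854775808 0x8000000000000000". Note the use of ulong: the value of bit 64 is one larger than long.MaxValue, so it does not fit into a signed 64-bit integer as a positive number.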

Checking if a bit is set:

// AND: Only return 1 if both bits are 1
// 0011 & 0100 = 0000
// 0111 & 0100 = 0100
isSet = (value & 0x4) == 0x4;
isSet = (value & 0x4) != 0; // != 0 rather than > 0: the top bit of a signed value makes the result negative

Setting a bit:

// OR: If either operand is 1, return 1.
// 0011 | 0100 = 0111
newvalue = oldvalue | 0x4; // |=

Unsetting a bit:

// NOT: Invert the Bits
// ~0100 = 1011
// AND: Return 1 if both bits are 1
// 0011 & 1011 = 0011
// 0111 & 1011 = 0011
newvalue = oldvalue & (~0x4); // &= ~0x4

Toggling a bit:

// XOR: If both bits are equal, return 0
// 0111 ^ 0100 = 0011
// 0011 ^ 0100 = 0111
newvalue = oldvalue ^ 0x4; // ^=
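
For reuse, the four operations can be wrapped into small helpers. This is just one possible sketch, assuming a ulong bitfield so that all 64 bits are usable; BitOps and its method names are made up.

```csharp
public static class BitOps
{
    // AND with the mask leaves a non-zero value only if the bit is set.
    public static bool IsSet(ulong value, ulong mask)
    {
        return (value & mask) != 0;
    }

    // OR switches the masked bit on and leaves all others alone.
    public static ulong Set(ulong value, ulong mask)
    {
        return value | mask;
    }

    // AND with the inverted mask switches the masked bit off.
    public static ulong Clear(ulong value, ulong mask)
    {
        return value & ~mask;
    }

    // XOR flips the masked bit.
    public static ulong Toggle(ulong value, ulong mask)
    {
        return value ^ mask;
    }
}
```

With the values from the comments above: BitOps.Set(0x3, 0x4) is 0x7, BitOps.Clear(0x7, 0x4) is 0x3, and BitOps.Toggle(0x3, 0x4) is 0x7.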

Is the SharePoint Object Model too weak for excellent Applications?

I've been doing SharePoint for about 3 years now, starting with SharePoint 2007 and moving into 2010 in November when the Beta was released. While I can't say that I'm in love with the development experience, I do think it's a very capable product for the users. During the years, I've learned many of the quirks and tricks of SharePoint and despite its many little issues, I liked SharePoint 2007 development.

SharePoint 2010 added a ton of new Features, including a separation into Service Applications (replacing the SSP) and many new Social Features like tagging and commenting. Also, the Development tools radically improved. However, I think that the Object Model didn't scale well over the years. I think it's downright broken in SharePoint 2010 and that the next version of SharePoint needs a completely new Object Model with properly separated APIs/Modules.

Why? Here are some of the big and small issues I've encountered while I wanted to do some really simple things.

SPMetal doesn't generate all field Types
Create a new List in SharePoint, then enable the Managed Metadata and Enterprise Keywords option on the List or add a new Field to the List which is a Managed Metadata column. Run SPMetal against the list. Look how the Proxy doesn't have the Keyword field.

It's bad enough that the generated proxy is unnecessarily fragile, but not supporting all of the *built-in* field types makes it unusable for all but the most simple queries. Granted, Managed Keywords is a separate Feature of Standard/Enterprise, but it's still an official Microsoft out of the box feature.

Querying Managed Keywords through CAML is only possible by Name, not by ID
Let's say you want to query the List and get all items that have two different keywords. As Keywords can have the same name (if they are in different paths of the Term Set), I thought it would be best to query with the Guid of the Term.

Well, turns out you can't. You can query by their WssId though (which is an int that seems to be assigned uniquely on each Site Collection) by adding the LookupId="TRUE" attribute. However, try to chain two queries together with an <And> or by adding the value twice. What happens? You get all items that have any Keyword. It's an Or, not And.

The only way is to Query by the Title and hope you never have duplicate titles or that you can enable the Full Path option on the Field. Now, for the standard Keywords this isn't that much of a problem as they don't have a hierarchy and therefore no duplicates, but you aren't always that lucky.

This is a deeper issue though, it's a problem with the LookupMulti field from which the Multi-Taxonomy Field inherits.

No way to query the User Profile store effectively
This is something that is easy in pretty much every CMS on the market: Give me a distinct list of departments in the company. If the user profile store has a field that holds the department, it's literally a SELECT DISTINCT(Department) from USERS ORDER BY Department.

In SharePoint, there is no way to do that. You could query the User Information list on the main site collection, but that may not contain all users. You can query Active Directory directly, but what's the point of the User Profile store in SharePoint then?

If you want to get a list of all departments, you get a UserProfileManager and loop through all profiles, then fill a List<string> or HashSet. This is slow and resource intensive.

No way to get some tag statistics efficiently
Another really simple scenario: I have a list that contains a lot of items. Users can use the Tags & Notes feature of SharePoint to tag items. We want to get statistics: Give me the top 10 tags that start with 'su' and how often they were used on this list.

The first problem is obviously that SharePoint stores social tags against an exact URL. And when I say exact URL, I mean:

  • Alternate Access Mappings are not supported. If one person uses http://internalportal and another person uses https://portal.internal.example.com, they won't see each other's tags and notes
  • If you have a Ribbon open and your URL has the "InitialTabID=...." QueryString, then your Tags and Notes will not be visible to other people who don't have it

So that's bad enough. But what about getting the statistics? Using SQL, such a feature is developed in 5 minutes since all you need is this query:

SELECT TOP 10 InputTermLabel, COUNT(InputTermLabel) AS Count
FROM dbo.SocialTags
WHERE InputTermLabel LIKE 'su%' AND UrlID IN (
SELECT UrlID
FROM dbo.Urls
WHERE Url like 'http://myportal/Lists/MyList/DispForm.aspx%')
GROUP BY InputTermLabel
ORDER BY Count DESC

Using the Object Model, it is impossible to do this effectively. The closest I got is this:

var result = new Dictionary<string, int>();
var baseUrl = "http://myportal/Lists/MyList/DispForm.aspx";
var stm = new SocialTagManager(SPServiceContext.GetContext(site));
var terms = stm.GetAllTerms(new Uri(baseUrl),0);
foreach (var term in terms)
{
    var name = term.Term.Name;
    if (!name.StartsWith("su", StringComparison.InvariantCultureIgnoreCase)) continue;
    var tc = stm.GetUrls(term.Term);
    int usageCount = tc.Count(url => url.AbsoluteUri.StartsWith(baseUrl));
    result[name] = usageCount;
}
result = result.OrderByDescending(kvp => kvp.Value)
             .Take(10).ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

As you may see, this is incredibly inefficient. The SQL Server sends me ALL the terms (at least I can specify a base URL!) and only on the client I can filter the ones starting with 'su'. Then I have to ask for every single Term to get the URLs. If you have a couple hundred tags, this is an incredible resource hog (tons of SQL Queries, a lot of network traffic, high memory usage). That would be the moment to wrap it into a Timer Job, cache it and not have those statistics in Real time.

The last one made me break one of the golden rules with SharePoint
The golden rule is: Never ever directly talk to the SQL Database, always go through the SharePoint Object Model.

Today, I broke that rule, and I don't feel dirty or guilty. That last point was the final straw. The Object Model simply doesn't cut it anymore, it's weaker than what I can do with some low end PHP CMS Systems, weaker than stuff like XOOPS or PHP-Nuke. The whole social functionality is poorly integrated.

After trying to get Tag-Statistics for 6 hours, I spent 30 Minutes to write a small wrapper: Iterate through all the SPDatabases in the farm, find the SocialDB that belongs to the current Web Application, use Reflection to get the SqlSession property and use Reflection again to call ExecuteReader on it. As a result, I get a nice SqlDataReader back.

No fighting with the Object Model, just plain "I want this, give it to me without killing all my resources".

Obviously, I now need to be careful with Service Packs and Hotfixes. I've set up a little tool that can compare SQL Schemas. That way, whenever a Hotfix/Service Pack gets released I run it to see if any of the Databases/Tables I use changed and adapt my code accordingly. Even if it takes me 15 Minutes to make the changes, Microsoft would have to release 24 Schema-Changing Updates to break even with the 6 hours I wasted today on this.

Granted, I use it strictly for SELECTs and only as a last resort and I wouldn't release code containing it, but at some point I have to get pragmatic about it. I can't spend a month developing something that should only take a week at maximum just because SharePoint doesn't like me.

Sorry for ranting so much, but SharePoint 2010 simply feels like a 10 year old application in places. There is so much stuff that was bad when it was still SharePoint Portal Server 2001 aka. Tahoe (ONET.xml...) and that's worse now that we have nicer technologies.

Why are pretty much all of the collections non-generic? Why is SPListItemCollection still a non-generic Collection of Object and not a List<SPListItem>?
Why are there no standalone SharePoint MSBuild tasks, thus requiring Visual Studio 2010 to be installed on a build server?
Why does creating Content Types through XML Files require me to specify fields twice? Why does creating a List for that Content Type through XML require me to specify the fields a third time?
Why does a product that can cost hundreds of thousands of dollars in licensing still require me to depend on Reflector and looking at the Stored Procedures just to understand simple things?
Why is there no easy way to check for the existence of certain items in some collections without iterating through it completely? Usually, there is only an indexer that throws an Exception if an item isn't found.
Why are so many useful functions and/or classes internal or sometimes even internal sealed? Many collections have internal "GetItem" functions that return null if the item wasn't found, which is great but as said, they are internal. The external functions usually wrap the GetItem call and throw an Exception if null - ARGH!
Why is there no MS Connect site to report issues and ask someone from MS to resolve them in a patch? Oh wait, there is a SharePoint Community Connect Site. It even has some bugs and the occasional Microsoft poster - except that all the real issues are ignored there.
Why are the simple things so hard and the hard things impossible to do?

The weird thing is, I'm not even that unhappy developing for it. As said, I still believe that SharePoint is a good product for the users to use and does so many things right there.
But I also believe that the Object Model needs a huge refresh.

I assume Microsoft wants to make upgrading as painless as possible, seeing how SharePoint is targeted at Enterprises. So very little changes, dragging the old stuff along.
I just hope that the next SharePoint version will be based on the .net 4 CLR and that the breaking changes there (unlike 3.0 and 3.5, .net 4.0 introduced a new CLR) would allow some people at Microsoft to show some bravery by revamping the whole OM and turning SharePoint from an ancient but modern looking product into a modern product.

</rant>

A few more thoughts about SWiki

It's been some time since I wrote a post about me re-thinking SWiki. In the meantime, I have experimented a bit with several approaches, and the recent announcements of IIS Express and SQL CE 4 sparked some new interest in this project.

As I said earlier, my problem was that I can't display Images that don't have a URL in the hosted Internet Explorer, but that I wanted to keep HTML-compatible pages. The first approach is to have a local web server that delivers the pages. There is however a second approach, which involves actually having the images in the file system. I could either store them in the database and "extract" them to a temporary directory when SWiki starts, or I could keep them externally and only "register" them in the database.

I think I like that approach as well, because it solves any corporate networking concerns (IT usually isn't too happy with people running their own rogue web servers within a corporate network...).

I'm busy with some other projects and I have to find a new name for SWiki (someone else had the name before :) ), but I do now have a good idea on what I want to do with it and how to achieve that.

Careful with SPContext.Current…

...as it will be NULL within a Timer Job or Workflow. I have some shared Data Access classes that use SPContext.Current.Web all over the place and now that I want to use them from within a Timer Job, I have to refactor them to take an SPWeb as a Parameter...

Why doesn’t Windows offer a working help system anymore?

If you are developing Windows Desktop applications, you may want to offer context-sensitive help, triggered either by pressing F1 or by clicking on the question mark icon in the title bar and on an element. Back in the old days (starting in 1990 and de-facto ending in 2006), there was WinHelp.

Now, WinHelp wasn't exactly beautiful and in recent years (after 1996 that is), the "Maximize Database Size" dialog was downright stupid, but WinHelp had all the features a Help system needs: Articles are organized in Chapters and can contain Images, Links and basic formatting. And it allowed your application to open a specific page, providing contextual help.

But most importantly, WinHelp just WORKS. Really, you press F1 and maybe you have to "Maximize Database Size" once, but then it opens. I never ever had a problem with WinHelp.

But Microsoft decided it wasn't modern anymore. That we needed something new. Granted, WinHelp clearly showed its age, and creation of Help Files was somewhat complicated. So they introduced Compiled HTML Help, or CHM. It is a modern Help system, allowing you much more freedom with your layout and styling. It's a really good format, with one tiny little problem: CHM doesn't actually work:

Turns out that CHM is displayed through the MSHTML Control (which is essentially an embedded Internet Explorer) and thus it has some security limitations. The most important one is that CHM files on non-trusted (e.g., network) locations simply don't work.

Now, you may say that this can be resolved. The file can be unblocked, or the path can be set to trusted. An Application Installer could do that. I reply: Doesn't matter. It's a Help system. It has to work without configuration. Press F1, get help. If I'm in a situation where I need help using my application and my help system tells me that it wants some treats first, it's a failure. Besides, not every Application has an installer because not every application needs one. A large number of applications are just DLLs (like the one the above screenshot is from) or ZIPped application files.

So CHM is a complete and utter failure, and Microsoft at least acknowledged that by killing off Microsoft Help 2 and starting a new approach with MAML. However, MAML is not a Help System, it's a language that can be used as source to be converted into an output format like HTML, RTF or whatever. In other words, Microsoft has created DocBook again without actually solving the problem of displaying help.

The real successor to CHM seems to be the HelpPane introduced in Windows Vista and included in Windows 2008 and 7 as well. Those help files have the extension h1s and a nice little icon, so Windows knows what they are. There is our new Help system, right? Well, try to double click one of those h1s files...

Hmmm... So Microsoft didn't just register a file type handler for h1s files. Well, can't be that hard to do, can it?

AP Help - Guided Help - Technical FAQ

Can I launch Guided Help through other means besides the Help Pane?
Yes, but you must create and publish the Guided Help topic through Help. Once you have a Guided Help topic compiled into an H1S file and installed (at this stage only possible for Microsoft and OEM's), you can launch it directly through a command line if you wish.
The syntax is:

%systemroot%\system32\acw.exe -Extensions GuidedHelp.dll -taskID mshelp://windows/?id=id-of-your-help-topic -ExecutionMode DoIt | ShowMe

For a fast impression copy following text to your run dialog:

%SystemRoot%\System32\ACW.exe -Extensions GuidedHelp.dll -taskID mshelp://windows/?id=3726934c-1315-4c29-bd4d-e42c10225e5a -ExecutionMode ShowMe

Excuse me, but ARE YOU FRIGGIN' KIDDING ME? Oh, yes you are, let me just quote Microsoft:

Microsoft is committed to providing Help and Support technology in the Windows® platform and will continue to investigate new solutions for software developers.

Sorry, but if "committed" means "Killing off perfectly working solutions and replacing them with a plethora of broken solutions every two years" then you are absolutely right, because that's what you are doing. WinHelp survived 16 Years and if you still shipped it with Vista and 7 then it would still be alive. So you as an application developer, what can you do? WinHelp isn't part of Vista and Windows 7 anymore and you're not allowed to distribute it with your application. CHM/H1S doesn't work. What are your alternatives?

Some applications use PDF. PDFs offer rich layout and a Table of Contents, however there is no standard reader. Sure, there is Adobe Reader, but you can't easily control it (e.g., open a PDF on a given page) - if the user has a version that is too old or too new for your application, you may run into issues. And if the user doesn't have Adobe Reader (or any other PDF reader) installed, you have to explain why someone would download an additional program just because you're not competent enough to include help. So PDF is not an option.

What about HTML Files? Everyone has a browser, even the short lived Windows 7 E Editions included MSHTML allowing you to at least display HTML within an application. The major downside of HTML is that you can't control which browser displays it, so you have to stay conservative and make sure old Internet Explorer or Firefox browsers display it (say goodbye to transparent PNGs...). JavaScript may be tricky (also due to widely spread Extensions like NoScript). And instead of one help file, you have a whole folder. Adding contextual help to your application is somewhat possible, but overall you simply lose the ability to control and test how the help looks and works.

This is possibly the moment where you expect me to say "But after researching all these non-working options, here is the one that works!". Sorry, can't do that. I don't know a single Help system that works on Windows Vista/7/2008. I asked on StackOverflow a long time ago and the consensus was the same.

It's really sad that a task that seems so simple and straightforward is too hard for Microsoft. Seriously, all that you need to do is to take a simple container format, some basic formatting options, the ability to link and embed images and an API to call Help from your application. If you want, include video support with a standard codec (keeping in mind Windows N/KN Editions).

Simple, easy, straightforward, hassle-free or in other words: Exactly how a Help System should work. Exactly how WinHelp worked since 1990 before it was brutally murdered. Rest in Peace WinHelp, we miss you dearly.

Dealing with Multiple Time Zones in SharePoint 2010

Organizations that deploy SharePoint farms often have employees in different countries, or at least in different Time Zones. While people in the US (which spans 4 time zones) are pretty comfortable with translating between time zones all the time, the same cannot be said for everyone. Trying to translate between Pacific Time and Middle European Time is just painful, especially since daylight saving time starts and ends on different dates.

With SharePoint 2010 you get the tools to convert the time according to the user's time zone. There are two types of Regional Settings: Each Site (SPWeb) has RegionalSettings that specify the Time Zone (and Locale, Calendar etc.) for that site. This is useful if you have sites that are predominantly used by people in one time zone. The second type of Regional Settings are the ones the user (SPUser) can set (My Settings - My Regional Settings). Those are the same settings as the ones on SPWeb, but each user can specify their own setting.

When storing Dates in code, you have two options:

  • Store the time in local time of the Web and use DatesInUtc = true on a SPQuery to get it back as Utc
  • Store the time in Utc and do not use DatesInUtc on SPQuery

What does that mean? As said, each SPWeb has its own Regional Settings. Let's assume you have a date of 2010-06-14 15:00:00.

If the TimeZone of the SPWeb is Pacific Time (GMT-8, or GMT-7 while daylight saving time is in effect, as it is in June) and you query the List using SPQuery, you get back this date. If you however set DatesInUtc = true on the SPQuery, you get back 2010-06-14 22:00:00. SharePoint doesn't know if 15:00:00 was already UTC, so using DatesInUtc may translate a date twice.

The caveat here is that when storing dates, you would normalize them either to UTC or to the Local Time of the Web. What would you do if some employee from Texas (which runs on Central Time, GMT-6, or GMT-5 during daylight saving time) enters 2010-06-14 15:00:00? You would need to store it either as Pacific Time (so the time becomes 13:00:00) or as UTC (20:00:00).
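
If you want to double-check this kind of arithmetic outside of SharePoint, the plain .NET TimeZoneInfo class can do it. The following is a minimal sketch, not SharePoint code: the class and method names are made up, and it assumes the Windows time zone IDs "Central Standard Time" and "Pacific Standard Time" are available on the machine (on other platforms you may need IDs like "America/Chicago" instead). Keep in mind that on 2010-06-14 daylight saving time is in effect in both zones, so the actual offsets are one hour smaller than the standard GMT-6 and GMT-8.

```csharp
using System;

// Hypothetical helper for illustration; uses System.TimeZoneInfo, not the SharePoint API.
public static class TimeZoneDemo
{
    // Interprets the given wall-clock time as Central Time and converts it to UTC.
    public static DateTime CentralToUtc(DateTime centralLocal)
    {
        // Assumption: this Windows time zone ID exists on the machine.
        var central = TimeZoneInfo.FindSystemTimeZoneById("Central Standard Time");
        // ConvertTimeToUtc expects Kind == Unspecified when a source zone is passed.
        var unspecified = DateTime.SpecifyKind(centralLocal, DateTimeKind.Unspecified);
        return TimeZoneInfo.ConvertTimeToUtc(unspecified, central);
    }

    // Converts a UTC time to Pacific Time (the time zone of the Web in the example).
    public static DateTime UtcToPacific(DateTime utc)
    {
        // Assumption: this Windows time zone ID exists on the machine.
        var pacific = TimeZoneInfo.FindSystemTimeZoneById("Pacific Standard Time");
        return TimeZoneInfo.ConvertTimeFromUtc(DateTime.SpecifyKind(utc, DateTimeKind.Utc), pacific);
    }
}
```

Calling CentralToUtc(new DateTime(2010, 6, 14, 15, 0, 0)) and feeding the result into UtcToPacific reproduces the Texas example with the daylight saving offsets applied.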

Needless to say, I prefer to store all dates as UTC if the list isn't visible to the user directly. Then when querying the list through Code, I can just convert the time to whatever the user's timezone is:

var user = SPContext.Current.Web.CurrentUser;
// Always perform a Null-Check on SPUser.RegionalSettings
if (user.RegionalSettings != null)
{
    return user.RegionalSettings.TimeZone.UTCToLocalTime(listDateUtc);
}
else
{
    // User didn't set a time zone, so use the one from the Web
    return SPContext.Current.Web.RegionalSettings.TimeZone.UTCToLocalTime(listDateUtc);
}

Overall, the option for people to set their own timezones independently from the SPWeb is a fantastic and long needed addition. On the other hand, it does make dealing with times a bit more complex.

If the list is visible to the user, you may need to normalize the times differently (for example, use user.RegionalSettings.TimeZone.LocalTimeToUTC to convert a user time to UTC and then SPWeb.RegionalSettings.TimeZone.UTCToLocalTime to convert the time to the Web-Time).

If you do build custom pages that make use of the Microsoft.SharePoint.WebControls.DateTimeControl then you can just use UseTimeZoneAdjustment="true" on it to have it automatically convert to UTC and back (SelectedDate will be UTC when accessed through code, but the User's/Web's time when rendered).

A Visual Studio Macro to insert a new Guid

I've been trying to create some SharePoint Content Types and List Definitions recently, and everyone who has done that before knows what you need for that: Guids, and quite a few of them. One for each Field, Feature, Solution... So instead of using GuidGen, I wanted something that inserts a new Guid at the cursor position in the Editor when I press a certain keyboard shortcut.

Luckily, this is rather easy with the Macro Editor. Just create a new Macro/Module and enter this code:

Sub InsertGuid()
    Dim newId As String = Guid.NewGuid().ToString("B")
    Dim doc As Document = DTE.ActiveDocument
    Dim textDoc As TextDocument = CType(doc.Object("TextDocument"), TextDocument)
    textDoc.StartPoint.CreateEditPoint()
    textDoc.Selection.Insert(newId)
End Sub

You can then go to Tools / Options / Environment / Keyboard and look for the Macro you just created (Macros.MyMacros.SomeModule.InsertGuid) and assign a Keyboard shortcut to it.
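
In case the "B" in ToString("B") looks cryptic: it is one of the standard Guid format specifiers and produces the braces-and-hyphens form that SharePoint XML files typically use. A tiny C# sketch of the same call (the class and method names are made up for illustration):

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class GuidFormats
{
    // "B" = hyphenated digits wrapped in braces: {xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}
    public static string NewBracedGuid() => Guid.NewGuid().ToString("B");

    // "D" = hyphenated digits without braces (this is also the default format).
    public static string NewPlainGuid() => Guid.NewGuid().ToString("D");
}
```

If a particular attribute wants the Guid without braces, swapping "B" for "D" in the macro is the only change needed.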

Writing a BF Compiler for .net (Part 5: [ and ] – while loops in IL)

The final two commands we're looking at are [ and ]. Their description in the first article was a bit cryptic, [ was described as

Go to the next instruction if the byte at the memory pointer is not 0, otherwise move it past the matching ] instruction

while ] was described as

Go to the instruction after the matching [ if the byte at the memory pointer is not 0, else move it past the ]

In C# code, this is a lot simpler:

// BF Code for this: [-]
while (memory[pointer] > 0)
{
    // Instructions between [ and ]
    // The following instruction is only to have a body
    memory[pointer]--;
}

It's a while-loop. It's important to note that we have to use a pre-test loop, that is a loop that checks the condition before executing the loop (as opposed to a do-while loop which executes the code block at least once and checks afterwards).
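
To see why the pre-test matters, compare both loop types on a cell that is already 0. The sketch below is only an illustration (class and method names are made up): it counts how often the body runs. With BF's [-] on an empty cell the body must not run at all, while a do-while would decrement the byte once, wrap it around to 255 and then keep going until it reaches 0 again.

```csharp
// Hypothetical demo class, for illustration only.
public static class LoopDemo
{
    // Pre-test loop (what [ and ] need): the condition is checked before the body runs.
    public static int CountWhile(byte cell)
    {
        int executions = 0;
        while (cell > 0)
        {
            unchecked { cell--; }
            executions++;
        }
        return executions;
    }

    // Post-test loop: the body runs once before the condition is checked for the first time.
    public static int CountDoWhile(byte cell)
    {
        int executions = 0;
        do
        {
            unchecked { cell--; } // 0 wraps around to 255 here
            executions++;
        } while (cell > 0);
        return executions;
    }
}
```

For any non-zero start value both methods return the same count; the difference only shows up when the cell starts at 0.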

So how does a while loop look in .net IL?

// See note below regarding .s suffix on br.s and bgt.s
IL_0000:  br.s       IL_001f
// This is the memory[pointer]-- instruction
IL_0002:  ldsfld     uint8[] BFHelloWorldCSharp.Program::memory
IL_0007:  ldsfld     int16 BFHelloWorldCSharp.Program::pointer
IL_000c:  ldelema    [mscorlib]System.Byte
IL_0011:  dup
IL_0012:  ldobj      [mscorlib]System.Byte
IL_0017:  ldc.i4.1
IL_0018:  sub
IL_0019:  conv.u1
IL_001a:  stobj      [mscorlib]System.Byte
// This is the while loop
IL_001f:  ldsfld     uint8[] BFHelloWorldCSharp.Program::memory
IL_0024:  ldsfld     int16 BFHelloWorldCSharp.Program::pointer
IL_0029:  ldelem.u1
IL_002a:  ldc.i4.0
IL_002b:  bgt.s      IL_0002

GOTO considered harmful?
Okay, this looks complicated, but it is easy. To explain it, we have to open Pandora's Box and look at the dirtiest secret there is in development: At Machine Level, GOTOs are essential.
Ha, take that Dijkstra!

Regardless of how much you abstract it away, control structures like while have to be translated as "GOTO's", or more precisely as jumps to addresses to continue executing code from. In .net, this is not called GOTO though, it's called Branch.

Our code has three parts: A single GOTO/Branch instruction at the beginning, the body of the loop (in our case the single memory[pointer]-- instruction) and then the while check.

So we start with br.s, which is described as

Unconditionally transfers control to a target instruction (short form).

In other words, this is a GOTO and it goes to IL_001f. The code starting from here does the while-check: Load memory and pointer onto the stack. Then load the value of memory[pointer] onto the stack as Unsigned 8-Bit Int. Afterwards, push the number 0 to the stack.

Our evaluation stack now contains the value of memory[pointer] and the number 0. Then we have the new bgt.s command:

Transfers control to a target instruction (short form) if the first value is greater than the second value.

In other words and Pseudocode: if(memory[pointer] > 0) goto IL_0002;

The code starting from IL_0002 is our memory[pointer]-- instruction which will be executed and then we'll do the while-check again.
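
C# still has goto, so the three-part structure of the IL can be written down almost literally. This is just a sketch to make the jumps visible, not something you would write in real code; the class name, the method name and the bodyExecutions counter are made up, and the comments map each line to the IL listing above.

```csharp
// Hypothetical demo class mirroring the IL layout, for illustration only.
public static class GotoLoop
{
    // Runs the BF loop [-] on memory[pointer] and returns how often the body ran.
    public static int RunClearLoop(byte[] memory, short pointer)
    {
        int bodyExecutions = 0;

        goto Check;                          // IL_0000: br.s IL_001f

    Body:                                    // IL_0002: body of the loop
        memory[pointer]--;
        bodyExecutions++;

    Check:                                   // IL_001f: the while-check
        if (memory[pointer] > 0) goto Body;  // IL_002b: bgt.s IL_0002

        return bodyExecutions;
    }
}
```

With a cell value of 5 the body runs five times and the cell ends up at 0; with a cell value of 0 the first goto skips the body entirely.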

In Debug mode, the bgt instruction is not used. Instead, the check is done in a much more complicated way. Feel free to look it up using ILDASM, but Debug Mode uses this C# Pseudocode to capture the result of the comparison into a local variable:

bool DoJump = memory[pointer] > 0;
if(DoJump) goto IL_0002;

This is useful for Debugging (who would've thought it, given that it's a debug build?), but rather heavy compared to Release mode (8 instructions and a local variable compared to 5 instructions without).

Looking at that, you can easily imagine what the difference between a while and a do while loop is: The do while loop does not have the br.s instruction at the beginning. It therefore executes the loop body at least once before it enters the while-check.

Before I end this post, I want to talk about short form commands.

What is "Short Form"?
If you look at the IL Commands, some say "Short Form". What does this mean? Well, normally branch targets are 32 Bit, that is 4 Bytes. If you want an unconditional jump, you would use the br command with the target. However, this means you'll have 5 bytes in the target file - 1 for the br instruction and 4 for the target. As this instruction is so common, it would be a massive overhead to always have to write 5 bytes to the file.

Short Form commands only take 1 byte for the target address. The target here is described as

1-byte signed offset from the beginning of the instruction following the current instruction

So instead of a 4-byte target, we only give a 1-byte one. (Strictly speaking, the long form is relative as well: br takes a 4-byte signed offset measured from the same point.) This only works if the target is within -128 to +127 bytes of the instruction following the jump (signed offset!) of course, so it's a lot less flexible and your compiler needs to know the distance between the target and the jump instructions. However, the savings are huge as short form only requires 2 bytes, less than half of the full instruction.
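
To make the offset calculation concrete, here is a small sketch of how a compiler could decide whether a short form branch fits. The class and method names are made up for illustration; the only facts used are the ones from above: a short branch is 2 bytes long and the offset is counted from the beginning of the following instruction.

```csharp
// Hypothetical helper, for illustration only.
public static class ShortBranch
{
    // A short form branch is 1 byte opcode + 1 byte signed offset.
    public const int ShortBranchSize = 2;

    // Calculates the 1-byte offset for a short branch located at branchAddress
    // that should jump to targetAddress. Returns false if the target is too far
    // away, in which case the long form (4-byte offset) has to be used.
    public static bool TryGetShortOffset(int branchAddress, int targetAddress, out sbyte offset)
    {
        // The offset is relative to the instruction following the branch.
        int relative = targetAddress - (branchAddress + ShortBranchSize);
        if (relative < sbyte.MinValue || relative > sbyte.MaxValue)
        {
            offset = 0;
            return false;
        }
        offset = (sbyte)relative;
        return true;
    }
}
```

Applied to the listing above: the br.s at IL_0000 jumping to IL_001f gets the offset 0x1f - 0x02 = 29, and the bgt.s at IL_002b jumping back to IL_0002 gets 0x02 - 0x2d = -43.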

This concludes the command overview. Part 6 will finally show how we will write our compiler.