Relax NG Verification in .net (and a bit of Schematron)

I've been working with Docbook V5.0 a bit and have started writing some processing tools to support my workflow. One complication is that the official Docbook schema is written in Relax NG and Schematron, neither of which .net validates out of the box.

Relax NG

In .net, you can create a validating XmlReader by passing an XmlReaderSettings instance into XmlReader.Create, but the built-in ValidationType is limited to W3C XML Schema (.xsd) or document type definitions (.dtd). Docbook has schema files for both, but neither is the official standard because of a slight lack of features in those schema languages.
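For comparison, here is a minimal sketch of that built-in route, using only System.Xml. The class and method names (XsdValidationSketch, Validate) are made up for illustration, and whether schemas imported by the .xsd are resolved automatically depends on the runtime's resolver settings, so treat it as a starting point rather than a finished validator.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Schema;

static class XsdValidationSketch
{
    // Hypothetical helper: reads the whole instance document and
    // returns every message the built-in XSD validator reported.
    public static List<string> Validate(string instancePath,
                                        string targetNamespace,
                                        string schemaPath)
    {
        var messages = new List<string>();

        var settings = new XmlReaderSettings
        {
            ValidationType = ValidationType.Schema
        };
        settings.Schemas.Add(targetNamespace, schemaPath);

        // Without a handler, the first validation error throws instead.
        settings.ValidationEventHandler += (sender, e) =>
            messages.Add(e.Severity + ": " + e.Message);

        using (XmlReader reader = XmlReader.Create(instancePath, settings))
        {
            // Validation happens as a side effect of reading.
            while (reader.Read()) { }
        }

        return messages;
    }
}
```

A call for the Docbook case would look something like XsdValidationSketch.Validate("DocbookTest.xml", "http://docbook.org/ns/docbook", "docbook.xsd"), assuming both files sit next to the executable.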

Thankfully, the Mono team has made a Relax NG library and created a NuGet Package that is useable in Microsoft's .net. The Package ID is RelaxNG:

PM> Install-Package RelaxNG

I've created a simple Docbook XML File for testing purposes and I'm using the docbookxi.rng schema file (since I'm using XIncludes).

// using System.Xml;
// using System.Xml.Linq;
// using Commons.Xml.Relaxng;
using (XmlReader instance = new XmlTextReader("DocbookTest.xml"))
using (XmlReader grammar = new XmlTextReader("docbookxi.rng"))
using (var reader = new RelaxngValidatingReader(instance, grammar))
{
    XDocument doc = XDocument.Load(reader);
    Console.WriteLine("Document is using Docbook Version " +
                       doc.Root.Attribute("version").Value);
}

There are two ways of handling Validation Errors (in the test case, I've duplicated the <title>First Part of Book 1</title> node, which is illegal in Docbook since title can only occur once in that scenario).

If no handler is set up, this throws a Commons.Xml.Relaxng.RelaxngException with an error like "Invalid start tag closing found. LocalName = title, NS = http://docbook.org/ns/docbook. file:///H:/RelaxNgValidator/bin/Debug/DocbookTest.xml line 35, column 14".

The better way is to hook up to the InvalidNodeFound Event which has a signature of bool InvalidNodeFound(XmlReader source, string message):

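// Attach the handler before the document is loaded through the reader.
reader.InvalidNodeFound += (source, message) =>
{
    Console.WriteLine("Error: " + message);
    return true; // skip over the invalid node and keep reading
};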

source is the RelaxngValidatingReader as an XmlReader and allows you to look at the current state to do further analysis/error recovery. message is a human-readable message like "Invalid start tag found. LocalName = title, NS = http://docbook.org/ns/docbook.". The return value decides whether or not processing continues. If true, it will skip over the error - in the end, I'm going to have a proper XDocument, but of course all guarantees for validity are off. If false, this will throw the same RelaxngException as if there were no event handler wired up.

Generally, I prefer to make use of a lambda closure to log all errors during Validation and set a bool on failure that prevents further processing afterwards.
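A rough sketch of that pattern, built only from the calls already shown in this post. The wrapper class, the method name LoadValidated and the decision to return null on failure are my own illustrative choices, not part of the RelaxNG package.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;
using Commons.Xml.Relaxng;

static class RelaxNgValidationSketch
{
    // Hypothetical helper: returns the loaded document, or null if the
    // validator reported at least one invalid node. Every message is
    // appended to the caller's list.
    public static XDocument LoadValidated(string instancePath,
                                          string grammarPath,
                                          List<string> errors)
    {
        bool isValid = true; // captured by the lambda below

        using (XmlReader instance = new XmlTextReader(instancePath))
        using (XmlReader grammar = new XmlTextReader(grammarPath))
        using (var reader = new RelaxngValidatingReader(instance, grammar))
        {
            reader.InvalidNodeFound += (source, message) =>
            {
                errors.Add(message);
                isValid = false;
                return true; // keep going so all errors get collected
            };

            XDocument doc = XDocument.Load(reader);

            // Only hand the document on if nothing was reported.
            return isValid ? doc : null;
        }
    }
}
```

Returning true from the handler means one run reports every problem instead of stopping at the first, and the flag keeps a document that only loaded thanks to error skipping from reaching later processing steps.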

Schematron

Now, Relax NG is only one of the two parts of Docbook Validation, although arguably the bigger one. Schematron is employed for further validation, for example that book must have a version attribute if (and only if) it's the root element, or that otherterm on a glosssee must point to a valid glossentry. The Docbook Schematron file is in the sch directory and for this test, I've removed the <glossentry xml:id="sgml"> node from the DocbookTest.xml file. This still passes Relax NG, but is no longer a valid Docbook document.

There isn't much in terms of Schematron support in .net, but I've found a rather ancient library called Schematron.NET of which I downloaded Version 0.6 from 2004-11-02. This is messy, because I have to use the Docbook W3C XML Schema file which has embedded Schematron rules - basically docbook.xsd, xml.xsd and xlink.xsd from the /xsd directory. Thanks to this article on MSDN for pointing me to the library and to the fact that Schematron rules can be embedded into .xsd using the appinfo element.

I also need to make sure to use the XmlTextReader and not any other XmlReader - Liskov be damned!

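// using System.Xml;
// using System.Xml.Schema;
// using NMatrix.Schematron;
using (XmlReader instance = new XmlTextReader("DocbookTest.xml"))
{
    // The Schematron rules are embedded in docbook.xsd, so all three
    // schema files from the /xsd directory have to be loaded.
    var schemas = new XmlSchemaCollection();
    schemas.Add("http://www.w3.org/XML/1998/namespace", "xml.xsd");
    schemas.Add("http://www.w3.org/1999/xlink", "xlink.xsd");
    schemas.Add("http://docbook.org/ns/docbook", "docbook.xsd");

    var schematron = new Validator();
    schematron.AddSchemas(schemas);
    schematron.Validate(instance);
}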

This throws a NMatrix.Schematron.ValidationException with the message

Results from XML Schema validation:
  Error: Reference to undeclared ID is 'sgml'.
  At: (Line: 85, Column: 35)

There doesn't seem to be an event handler; the code is very 2004-ish, with properties being set after processing. Overall, the whole approach is very messy: I'm even validating the whole document against XSD again after it has already been passed through Relax NG.
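Without an event to hook into, the simplest way I can see to get the same log-and-flag behaviour as on the Relax NG side is to catch the exception. This is only a sketch: the wrapper class and the TryValidate name are hypothetical, and it relies solely on the Validator calls and the ValidationException type mentioned above.

```csharp
using System;
using System.Xml;
using System.Xml.Schema;
using NMatrix.Schematron;

static class SchematronValidationSketch
{
    // Hypothetical helper: returns true if validation passed; otherwise
    // returns false and puts the exception text into 'report'.
    public static bool TryValidate(string instancePath,
                                   XmlSchemaCollection schemas,
                                   out string report)
    {
        // The library insists on an XmlTextReader (see above).
        using (XmlReader instance = new XmlTextReader(instancePath))
        {
            var schematron = new Validator();
            schematron.AddSchemas(schemas);

            try
            {
                schematron.Validate(instance);
                report = null;
                return true;
            }
            catch (ValidationException ex)
            {
                // The message holds the full report, e.g. the
                // "Reference to undeclared ID" text shown earlier.
                report = ex.Message;
                return false;
            }
        }
    }
}
```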

The library is also expecting the old Schematron 1.5 namespace of http://www.ascc.net/xml/schematron - which is fine for Docbook 5.0 but will be a problem once Docbook 5.1 comes out since it uses the ISO Schematron namespace of http://purl.oclc.org/dsdl/schematron.
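To find out ahead of time which flavour a given schema file uses, a small LINQ to XML check is enough. The sketch below only looks at element namespaces; the class name, the Detect method and its return strings are invented for this example.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class SchematronNamespaceSketch
{
    static readonly XNamespace Schematron15 =
        "http://www.ascc.net/xml/schematron";
    static readonly XNamespace IsoSchematron =
        "http://purl.oclc.org/dsdl/schematron";

    // Hypothetical helper: reports which Schematron namespace the
    // elements of a schema document (standalone or embedded in an
    // .xsd) belong to: "1.5", "ISO", "mixed" or "none".
    public static string Detect(XDocument schema)
    {
        bool hasOld = schema.Descendants()
                            .Any(e => e.Name.Namespace == Schematron15);
        bool hasIso = schema.Descendants()
                            .Any(e => e.Name.Namespace == IsoSchematron);

        if (hasOld && hasIso) return "mixed";
        if (hasOld) return "1.5";
        if (hasIso) return "ISO";
        return "none";
    }
}
```

Calling SchematronNamespaceSketch.Detect(XDocument.Load("docbook.xsd")) before handing the file to Schematron.NET would at least show up front if the rules are in a namespace the library doesn't expect.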

For 5.0 it does give proper Schematron validation, which is good enough for now, but overall this isn't really a great way to do Schematron validation. I'm not sure if there's a better solution, but I'd love to avoid starting my own .net 4.5 Schematron Validator Project 🙂