Thursday, May 27, 2010

NVDL — a Breath of Fresh Air for Compound Document Validation (XTech2007)

Abstract
A compound document is a document that consists of elements and attributes from different mark-up vocabularies (namespaces). This approach to document engineering is very effective because it allows existing markup technologies to be reused in novel ways. However, validating such documents is a big challenge: it is very hard to create complex combined schemas in RELAX NG or W3C XML Schema for compound documents. This article explains the advantages of using NVDL as a validation language instead and introduces our own implementation, called JNVDL. The second part of the article deals with the problem of recognizing different versions of a single XML vocabulary. Finally, an extension of the NVDL language that can solve this problem is introduced.

Introduction

This article deals with compound document validation. The section called “Compound documents” describes the advantages and current usage of compound documents. The next part (the section called “Validation of compound documents”) shows why classical schema languages are insufficient for validating compound documents. The NVDL language is proposed as a solution to this problem in the section called “NVDL is the right solution”. The second part of this article introduces our implementation of NVDL, called JNVDL (the section called “JNVDL”), and the problems that were solved while implementing the NVDL standard. The last part of this article (the section called “Impact on a Web architecture”) deals with the problem of versioning document types and proposes some NVDL extensions that can be used to solve the problem of several different document types residing in one namespace.
The main contributions of the work described in this paper are an open-source implementation of an NVDL validator and the proposed NVDL extensions.

Compound documents

Compound document is a modern name for an XML document that consists of elements and attributes from different mark-up vocabularies. In other words, by combining two or more different XML languages in a single document we create a compound document. This was made technically possible by XML Namespaces. In general, the Namespaces in XML recommendation[NS] solves the problems of collision and recognition of elements and attributes from different mark-up vocabularies within one XML document. This is achieved through qualified names, which make elements or attributes from different vocabularies distinguishable even if they use the same local names.
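The role of qualified names can be illustrated with a minimal Java sketch. It is not part of the original article: the sample document and the class name QualifiedNames are assumptions made for illustration. XHTML and SVG both define an element with the local name title, and a namespace-aware parser can still tell them apart.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class QualifiedNames {
    // Hypothetical sample: both vocabularies use the local name "title".
    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml'>"
      + "<head><title>Page</title></head>"
      + "<body><svg xmlns='http://www.w3.org/2000/svg'><title>Drawing</title></svg></body>"
      + "</html>";

    // Returns the text of the first "title" element in the given namespace, or null.
    static String titleIn(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            Element title =
                (Element) doc.getElementsByTagNameNS(namespaceUri, "title").item(0);
            return title == null ? null : title.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleIn(SAMPLE, "http://www.w3.org/1999/xhtml"));
        System.out.println(titleIn(SAMPLE, "http://www.w3.org/2000/svg"));
    }
}
```

The lookup key is the pair of namespace URI and local name; the prefix used in the document plays no role.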
At first, some people considered namespaces to be a hostile element polluting XML with additional complexity without any reasonable need for it. At that time, the world of mark-up vocabularies was dominated by overgrown monolithic languages, but these soon faced serious extensibility issues. The set of problems a language addresses grows and evolves over time. To keep up with changing requirements, the monolithic approach constantly pollutes the vocabulary with new, narrowly specific mark-up. The result is extensive languages that are difficult to learn and maintain, and that are intended to solve all sorts of problems without solving any of them satisfactorily.
Recently, it has become more and more obvious that some extensibility problems can be solved more elegantly by composing several single-purpose languages rather than by further extending a monolithic language. If a widely adopted and understood vocabulary already solves part of our problem, it makes good sense to reuse it rather than introduce something new. With this approach we gain immense flexibility, because for every specific problem we can adopt a specific combination of vocabularies.
Isolated single-purpose languages are easier to maintain and, more importantly, they can be easily reused in distinct applications. The domain of such languages is narrow and their aim is well defined, which helps to keep the language free of ill-considered extensions.

The Web environment

HTML is a good example of a shift from a standalone language to the use of compound documents. At first, HTML was a simple format for publishing mostly scientific articles and for linking them together. Soon the Web became incredibly popular and widely adopted. It no longer served just as a standard for publishing electronic documents; it also became a standard way to create distributed user interfaces for various information systems.
The tendency to use or misuse HTML to solve all sorts of distinct problems slowly polluted the language with various ill-considered proprietary extensions. Browser and other software vendors were quick to introduce new user-oriented features to gain a competitive advantage, without a long-term vision of the language's development in mind. As a result, the Web was slowly losing one of its main advantages: cross-platform interoperability.
The situation improved significantly after the language was standardized (under the W3C). The language was purified of most presentational aspects and HTML was made fully XML based (XHTML). But the major improvement was the modularization of XHTML (see [XHTMLMOD]). The language became more flexible and adaptable to various environments and tasks. Modularization allows us to use the subset of HTML features that best reflects our particular needs. Logically related aspects of the language are decomposed into modules, for example the structure module with the main HTML structure, the table module for expressing tabular data, the text module for organizing text into headings and paragraphs, the forms module for defining user interfaces, the meta-information module for attaching meta-data to documents, the image module for images, the link module for hyperlinking documents, and so on.
For addressing different problems, we can use different XHTML modules. For example, it doesn't make sense to use the forms module when publishing a simple electronic document. Moreover, we may look at modules as default vocabulary fragments with limited expressiveness. They can be left out completely (if not needed), but they can also be replaced by an entirely different specialized and feature-rich language. This can be clearly seen as the current trend in Web standards. Instead of the forms module we can use XForms, RDF can replace the meta-information module, the link module can be replaced by XLink, instead of images we can directly embed SVG vector graphics, and so on.
To conclude, compound documents make the Web more flexible and powerful, and suitable for all sorts of very different applications. Every specific problem can be addressed by a custom combination of highly specialized languages. If we follow this line of thinking, at some point we realize that there is no need for HTML anymore. The only thing needed is a parent language in which to embed all the vocabularies that best suit our intentions. The XHTML structure module is a good candidate, but a completely different language can serve this purpose as well. As an example, imagine a Web application where SVG takes the role of the parent language and embedded XForms is used to define the user interface. As SVG is a powerful presentational language and XForms is the latest declarative and feature-rich user interface standard, such a combination of languages may be considered the best solution to a particular problem.

Examples of application

There are many different applications of compound documents in many different areas. Once adopted, the composition of different vocabularies seems a natural and convenient approach to many problems. Among the first applications of compound documents were templating languages. The W3C XSL Transformations recommendation was published just a few months after Namespaces in XML. XSLT is a popular templating language that can express rules for transforming a source XML tree into a result tree. The rules are expressed using the XSLT vocabulary and they contain embedded fragments of the result tree vocabularies. This means XSLT stylesheets are basically compound documents.
In addition, XML is widely used as a data exchange format. In non-trivial cases, a protocol such as SOAP is used to exchange structured XML messages. SOAP messages are again compound documents, as they consist of an envelope (SOAP vocabulary) which encapsulates the actual message payload (arbitrary vocabulary).
Compound documents are also well suited to office document formats. In this case, variously structured data such as structured text, tables, graphs, diagrams and spreadsheets need to be incorporated into one document. This is the right task for XML, as it offers the right combining mechanisms and there are many specialized and mature XML vocabularies ready to be reused within office documents. The Open Document Format (ODF) is an example of XML being used successfully as the native document format of various office suites.
As mentioned previously, the Web offers huge opportunities for compound documents, and these have not really been exploited yet. One of the problems is the lack of client support for the new standards. The situation is far better among mobile device clients than on the desktop. Despite these issues, there are already some interesting compound document solutions which may be used, with some compromises, in most of today's mainstream browsers. A good example of a Web compound document concept is the xH language, presented at the WWW2006 conference[XH]. It is basically a synthesis of XHTML and several other well-known standard XML languages, mainly XForms, SVG and MathML. Interaction between the different language blocks is achieved through JavaScript.
All the languages recommended by xH are widely used and well known. The added value and major intention of xH is to encourage people to use those languages in combination to build a new generation of rich and flexible Web applications with an enhanced user experience.

Validation of compound documents

XML namespaces technically allow multiple vocabularies to be present inside one XML document, but many other issues need to be addressed before we can adopt a compound document solution. Having descriptions of the syntax and semantics of the individual vocabularies is insufficient for a client application to handle compound documents correctly. In addition, we also need the syntax and semantics of the compound language to be defined.[1]
Different vocabulary fragments can be combined in many different ways, and even if the isolated fragments are meaningful (with respect to their language semantics) and syntactically correct, the combination of such fragments can be difficult to interpret or can even be semantically empty.
As an example, imagine we have an XHTML document which consists of a head section, used to place document related meta-data, and a body section, intended to be rendered. Placing an RDF meta-data fragment into the body section would cause interpretation difficulties, as meta-data are not intended to be rendered. The correct place to put the fragment is of course the head section.
This implies that there are two major issues concerning compound documents. First, we need to create semantics for different compound languages to make them correctly interpretable by different applications. Second, we need to constrain the way different languages are combined by allowing only meaningful combinations. This requires some kind of “meta-schema” to express such constraints.
Today, automated validation is absolutely essential for standalone XML languages to ensure their syntactic correctness and thus interoperability. It is even more important for compound documents, as they bring additional interpretation complexity. To make compound documents applicable in a heterogeneous environment (such as the Web), it is essential to provide powerful compound document validation tools and techniques. This requires schema languages able to cope with multiple namespaces and validation engines able to check document instances against such schemas. In the following sections we discuss two distinct approaches to compound document validation.

XML Schema or RELAX NG is insufficient

One approach to the compound document validation problem is to use the namespace support included in today's mainstream schema languages, e.g. RELAX NG or XML Schema. These namespace-aware schema languages use qualified names to define elements and attributes, so compound document schemas can be created simply by specifying the appropriate namespace for elements from different vocabularies.

Example 1. Allowing RDF inside the XHTML head section (RELAX NG)
<element name="head" 
         ns="http://www.w3.org/1999/xhtml">          
  <element name="title" 
           ns="http://www.w3.org/1999/xhtml">
    <text/>
  </element>
  <interleave>
    ... other XHTML head elements ...
    <optional>
      <element name="RDF" 
               ns="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        ...
      </element>
    </optional>  
  </interleave>
</element>

The previous example is quite straightforward, but it is not very flexible. Reusing the definition for a different combination of vocabularies would be painful, as it would require copying and pasting. The definition can be significantly improved by modularizing the schema. Having a module for each vocabulary would allow us to create schemas for various compound languages simply by including the right subset of modules.
Although the namespace-aware schema language approach is very simple, it has several drawbacks. Imagine we would like to allow a foreign vocabulary in some contexts of XHTML. This is a simple task for someone who is familiar with the implementation details of the XHTML schema (who knows the structure of the schema's definitions and modules). But it is not as straightforward for someone who lacks that insight.
Moreover, in most cases we cannot simply reuse the existing schemas for the vocabularies we would like to combine. Standalone language schemas are often not well prepared to be combined with other schemas. Usually they lack the level of modularity and abstraction needed for seamless integration. In addition, they are frequently written in different schema languages, or even in languages which are not namespace-aware at all (for example DTD). This implies that schemas for different vocabularies first need to be converted to the same namespace-aware schema language and slightly modified before they can be used as modules of the compound definition.
Such an approach not only requires deep knowledge of the particular schemas and a long implementation time, but also leads to maintenance issues. As the different languages evolve over time, we need to keep our modified or converted schemas constantly up to date. For complex compound languages this may be a serious problem.
To demonstrate the issues, let's consider the following example. We would like to create a compound document schema for XHTML with embedded SVG and MathML in all block-level and inline elements and RDF in the head section. When we decide to use a namespace-aware schema language (for example XML Schema) to achieve that, we run into problems. First, we need to convert the official XHTML DTDs into XML Schema. This needs to be done in a specific way, as we need abstract classes prepared for the head section and for block and inline elements to make them easily extensible through additional modules. Next, we can reuse the XML Schema for MathML, but that schema cannot be used as it is. We first need to modify it to make it a module of our parent XHTML schema. The module then needs to be tailored to allow MathML only in the context of block and inline elements. A similar task needs to be done for SVG and RDF.
Basically, every time a new vocabulary needs to be incorporated into the compound definition, its official schema first needs to be converted and modified. Moreover, the different vocabulary modules need to be constantly synchronized with new versions of the languages. This approach is error-prone because the definitions are duplicated. Another problem arises when, for example, SVG should replace XHTML as the parent language. In that case, the vocabulary modules cannot simply be reused. On the contrary, they may need to be duplicated again and slightly reworked.
To conclude, the namespace-aware schema language concept is applicable in simple cases, but it is not a solution for complex scenarios. Reusability of existing schemas is an important requirement which today's namespace-aware schema languages do not satisfy at all. A different approach to compound document validation is needed to allow schema reuse. Such an approach has to be independent of the implementation details of the particular schemas and of the schema languages used.

NVDL is the right solution

NVDL, which stands for Namespace-based Validation Dispatching Language, is Part 4 of the ISO/IEC 19757 DSDL (Document Schema Definition Languages) international standard. NVDL is a simple “meta-schema” language which allows us to control the processing and validation of compound documents. Figure 1, “NVDL validation process at a glance” shows a particular validation dispatching process decomposed into several phases. An NVDL schema and a compound document instance that could take part in such a process are shown in Example 2, “NVDL validation process”.
The essence of NVDL is dividing XML document instances into sections, each of which contains elements or attributes from a single namespace. A section tree is first constructed for every instance (see Example 3, “Decomposing sections”). Sections are then combined or manipulated in various ways to create so-called validation candidates.
Manipulation of sections is achieved through rules and their corresponding actions defined in an NVDL script. Actions are executed on a particular section whenever the section matches a certain rule, usually when the section's namespace matches the rule's namespace wildcard.
Several actions are defined in NVDL, e.g. attach for attaching sections back to their parent, unwrap to handle wrapped sections, and validate to send a particular validation fragment to a particular validator.
After executing the actions, we usually obtain single-namespace validation candidates, which are further filtered for redundancy into validation fragments. These fragments are finally sent independently for validation against different subschemas[2] (see Example 4, “Dispatching validation fragments to validators”).
The ability to create single-namespace fragments means we do not need to care about namespaces in our subschemas at all. Single-namespace schemas are easier to write and, importantly, also easy to reuse for different compound languages, where the same vocabulary may be used in a different context or in combination with different vocabularies. Moreover, NVDL is schema language transparent, so subschemas may be written in any preferred schema language, e.g. RELAX NG, XML Schema, Schematron or DTD. For detailed information about the NVDL validation dispatching process, refer to [NVDL].
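The first step described above, dividing an instance into sections, can be approximated with a small Java sketch. This is a deliberately simplified illustration, not the algorithm from the NVDL specification and not JNVDL code: it considers elements only (attribute sections are ignored) and merely reports where an element section would begin, namely at the root and wherever an element's namespace differs from its parent's. The class name SectionSplitter and the sample document are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SectionSplitter {
    // Hypothetical sample: XHTML with embedded XForms, which in turn embeds an XHTML <br/>.
    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml' xmlns:xf='http://www.w3.org/2002/xforms'>"
      + "<head><xf:model/></head>"
      + "<body><xf:group><xf:input/><br/><xf:submit/></xf:group></body>"
      + "</html>";

    // Lists the elements that start a new section, written as {namespace}localName.
    static List<String> sectionRoots(String xml) {
        List<String> roots = new ArrayList<>();
        Deque<String> namespaces = new ArrayDeque<>(); // namespaces of the open elements
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // A section begins at the root and at every change of namespace.
                if (namespaces.isEmpty() || !namespaces.peek().equals(uri)) {
                    roots.add("{" + uri + "}" + localName);
                }
                namespaces.push(uri);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                namespaces.pop();
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return roots;
    }

    public static void main(String[] args) {
        sectionRoots(SAMPLE).forEach(System.out::println);
    }
}
```

For the sample, the sketch reports four section roots (html, model, group and br). A real NVDL processor then applies the attach, unwrap and validate actions to such sections, which this sketch does not attempt.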

Figure 1. NVDL validation process at a glance



Example 2. NVDL validation process
The following example shows an NVDL schema and a compound document instance which relate to Figure 1, “NVDL validation process at a glance”. In this case, NS1 represents the XHTML namespace and NS2 stands for the XForms namespace. The following instance is an XHTML document with a simple form for retrieving stock quote information.
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xf="http://www.w3.org/2002/xforms">
<head>
<xf:model>
  <xf:instance><stockquote><symbol/></stockquote></xf:instance>
  <xf:submission xml:id="form" method="post"  action="getStockQuote.do"/>
</xf:model>
</head>

<body>
<xf:group ref="stockquote">
<xf:input ref="symbol"><xf:label>Symbol</xf:label></xf:input>
<br />
<xf:submit submission="form"><xf:label>Get Quote</xf:label></xf:submit>
</xf:group>
</body></html>
To achieve behavior consistent with Figure 1, “NVDL validation process at a glance”, the following schema is applied to the previous compound document instance. Using this schema, the NVDL dispatcher first sends the root XHTML fragment for validation after filtering out any descendant XForms fragments and attaching any descendant XHTML fragments. XForms sections are handled in a similar way by filtering out any descendant XHTML. For any XHTML document instance with embedded XForms, the following NVDL schema causes one pure XHTML fragment to be sent for validation against xhtml.xsd and one or more pure XForms fragments to be validated using xforms.rng.
<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0">
  <namespace ns="http://www.w3.org/1999/xhtml">
    <validate schema="xhtml.xsd">
      <mode>
        <namespace ns="http://www.w3.org/2002/xforms">
          <validate schema="xforms.rng">
            <mode>
              <namespace ns="http://www.w3.org/2002/xforms"><attach/></namespace>
              <namespace ns="http://www.w3.org/1999/xhtml"><unwrap/></namespace>
            </mode>
          </validate>
          <unwrap>
            <mode>
              <namespace ns="http://www.w3.org/2002/xforms"><unwrap/></namespace>
              <namespace ns="http://www.w3.org/1999/xhtml"><attach/></namespace>
            </mode>
          </unwrap>
        </namespace>
      </mode>
    </validate>
  </namespace>
</rules>


Example 3. Decomposing sections
The instance shown in Example 2, “NVDL validation process” is decomposed into the following section tree after applying the NVDL schema from the same example.
ES1 <html><head>ref to ES2</head>
<body>ref to ES4</body>
</html>

ES2 <xf:model>...</xf:model>

ES3 <br />

ES4 <xf:group ref="stockquote"><xf:input ref="symbol">...</xf:input>
ref to ES3
<xf:submit submission="form">...</xf:submit></xf:group>


Example 4. Dispatching validation fragments to validators
After executing the attach and unwrap actions on the section tree shown in Example 3, “Decomposing sections”, the following resulting fragments are created and sent independently for validation.
<html><head></head>
<body><br /></body>
</html> -> xhtml.xsd

<xf:model>...</xf:model> -> xforms.rng

<xf:group ref="stockquote"><xf:input ref="symbol">...</xf:input>
<xf:submit submission="form">...</xf:submit></xf:group> -> xforms.rng

Namespace-aware schema languages force us to convert all our schemas to the same schema language. In contrast, when using NVDL we are free to choose any languages we prefer, and we may also combine different schema languages during a single validation process. NVDL does not force XML language designers to use mainstream schema languages. On the contrary, they are encouraged to choose the schema language which best suits the needs of their vocabulary. NVDL is also completely isolated from the implementation details of the particular subschemas. Vocabulary designers can fully focus on the schema implementation without even thinking about how their vocabulary might be combined with a different one. NVDL schema designers do not need any knowledge of the implementation details of the particular subschemas. Moreover, they don't even need to understand the different subschema languages when designing their NVDL scripts.
In the previous section we saw how difficult it is to create a compound definition using a namespace-aware schema language. Let's use NVDL to create the same compound schema: XHTML with embedded SVG, MathML and RDF. In this case, there is no need for several experts to work on it for days. One person can create such an NVDL script in a matter of minutes (see Example 5, “NVDL schema for XHTML with embedded SVG, MathML and RDF”). The reason is that existing schemas can be fully reused without making any changes to them.

Example 5. NVDL schema for XHTML with embedded SVG, MathML and RDF
<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0" startMode="root">
  <mode name="root">       
    <namespace ns="http://www.w3.org/1999/xhtml">
      <validate schema="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
        <context path="head" useMode="head"/>
        <context path="div|li|p...all block level elements" useMode="block_inline"/>
        <context path="a|em|span|...all inline elements" useMode="block_inline"/>
      </validate>
    </namespace>
  </mode>  
  <mode name="block_inline">
    <namespace ns="http://www.w3.org/2000/svg">
      <validate schema="http://www.w3.org/TR/2002/WD-SVG11-20020108/SVG.xsd"/>
    </namespace>
    <namespace ns="http://www.w3.org/1998/Math/MathML">
      <validate schema="http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd"/>
    </namespace>
  </mode>
  <mode name="head">
    <namespace ns="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <validate schema="http://www.w3.org/2000/07/rdf.xsd">
        <mode><anyNamespace><attach/></anyNamespace></mode>  
      </validate>
    </namespace>
  </mode>
</rules>

In the NVDL script in Example 5, “NVDL schema for XHTML with embedded SVG, MathML and RDF”, subschemas are referenced directly at their original locations using URLs. The script tells the NVDL engine that the only acceptable parent language is XHTML and that other vocabularies are forbidden in that context. Plain XHTML is extracted from the validated document and sent for validation against the official DTDs. RDF sections may only occur in the context of the head element. Any foreign vocabulary contained inside the RDF fragment is attached to it before it is sent for validation. SVG and MathML fragments are allowed only in block and inline elements. Any other vocabulary in any other context of the document is rejected.
This simple example demonstrates the power of NVDL. Modifying the NVDL script to allow any other vocabulary in some context is a simple and straightforward task. In addition, the script contains only the required information about the compound language. Anything related to the grammar of the particular vocabularies is encapsulated in the subschemas where it really belongs. This makes NVDL schemas not only easy to design, but also easy to read and understand.
Note that Example 5, “NVDL schema for XHTML with embedded SVG, MathML and RDF” demonstrates the use of the NVDL context construct, which allows specific handling to be applied to sections at a given path within their parent section. Several paths separated by | may be used within one context condition. The paths used in the example are relative, but absolute paths may be used as well. For example, an HTML head context can also be addressed using the /html/head path.

JNVDL

JNVDL is a Java-based implementation of the [NVDL] specification developed by our team. JNVDL uses the JAXP validation API, which was introduced in Java 5. This is a standard way to invoke validation processes, and it makes JNVDL easy to use for end users as well as easy to reuse in other applications.

Example 6. Invoking NVDL validation using the JAXP validation API
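import java.io.File;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

Schema schema =
  SchemaFactory.newInstance(
    "http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0")
      .newSchema(new File("schema.nvdl"));

// Validator.validate() expects a Source, not a file name.
schema.newValidator().validate(new StreamSource(new File("instance.xml")));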

The validation API is also used by JNVDL to invoke further validation processes for particular subschemas. This makes JNVDL independent of the validator implementations for different schema languages. Enabling JNVDL to validate against a new schema language is just a matter of adding the appropriate validator's jar library to the Java classpath. No changes to the JNVDL code are required.
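How one fragment might be handed to a subschema validator through the same API can be sketched as follows. This is an illustration under stated assumptions, not JNVDL's actual code: the class SubschemaValidation, its method and the sample schema are hypothetical. The only JAXP behavior relied upon is that SchemaFactory.newInstance selects an implementation by schema language URI and throws IllegalArgumentException when none is available on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SubschemaValidation {
    // W3C XML Schema is used here because the JDK ships a validator for it.
    static final String XSD_LANGUAGE = XMLConstants.W3C_XML_SCHEMA_NS_URI;

    // Hypothetical single-namespace subschema declaring one element.
    static final String SAMPLE_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='symbol' type='xs:string'/>"
      + "</xs:schema>";

    // Validates one fragment against one subschema written in the given schema language.
    // Returns false if the fragment (or the schema itself) is rejected.
    static boolean isValid(String schemaLanguage, String schemaText, String fragment) {
        try {
            // The factory is looked up by schema language URI; nothing else is hard-coded.
            SchemaFactory factory = SchemaFactory.newInstance(schemaLanguage);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(schemaText)));
            // Without a custom ErrorHandler, validation errors are thrown as SAXException.
            schema.newValidator().validate(new StreamSource(new StringReader(fragment)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(XSD_LANGUAGE, SAMPLE_XSD, "<symbol>IBM</symbol>"));
        System.out.println(isValid(XSD_LANGUAGE, SAMPLE_XSD, "<price>1</price>"));
    }
}
```

Because the schema language is just a parameter, the same method would work for any language whose validator provides a JAXP SchemaFactory.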

Implementation issues

One of the issues the JNVDL implementation faced is related to the fact that validators tend to report error locations using line and column numbers. When the validated instance is parsed and turned into validation fragments, the original position of elements and attributes is inevitably lost. There are two reasons for this. First, today's parsers lack round-tripping support, so some whitespace that is considered insignificant is neither reported nor preserved. Second, different validation fragments are taken from different places within the original document. As they are sent to validators separately, the original position is lost.
Such behavior may confuse users as the error line numbers reported by the particular validators aren't related to the original document but to the particular fragment context. To interpret the information correctly, users would need to deduce the line numbers from the original position of the validation fragments created by JNVDL.
To overcome these difficulties, JNVDL provides a proprietary round-tripping extension. This extension preserves whitespace from the original document, and before validation fragments are sent to the particular validators, they are modified so that elements and attributes occur on the same lines as in the original document. Further, if an XML fragment is extracted from the middle of the document, JNVDL adds the appropriate number of empty lines before it to keep the fragment at the same location.
The fact that some whitespace within XML documents is insignificant makes the use of line numbers to locate elements and attributes problematic or even error-prone. Validation API designers should consider using a different mechanism to locate errors. XPath, for example, is a good candidate, as it is whitespace independent and precise.
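The whitespace independence of an XPath-style location can be illustrated with a short Java sketch. It is an assumption-laden illustration rather than a proposal from the article: the class XPathLocator is hypothetical, and it builds a purely positional path of the form /*[n]/*[m], which sidesteps namespace prefixes altogether.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathLocator {
    // Builds a positional XPath such as /*[1]/*[2] that identifies the given element.
    static String pathOf(Element element) {
        StringBuilder path = new StringBuilder();
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            int index = 1; // XPath positions are 1-based
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s instanceof Element) {
                    index++; // only element siblings count; text and comments are skipped
                }
            }
            path.insert(0, "/*[" + index + "]");
        }
        return path.toString();
    }

    // Convenience helper: parses the document and locates the first element with that tag name.
    static String locate(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return pathOf((Element) doc.getElementsByTagName(tagName).item(0));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(locate("<html><head/><body><br/></body></html>", "br"));
        System.out.println(locate("<html>\n\n <head/>\n <body>\n\n  <br/>\n </body>\n</html>", "br"));
    }
}
```

Both calls in main yield the same path even though the br element sits on different lines in the two inputs, which is exactly the property that line and column numbers lack.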

Impact on a Web architecture

It is evident that compound documents are used more and more often. At this time NVDL is the best available technology for validating compound documents. Still, NVDL has some limitations which make it hard to use for some document types that are widely used on the Web. On the other hand, XHTML, the most popular container format for compound documents, was not designed with this usage scenario in mind and breaks many principles of Web architecture.

Namespace is not a document type

It is quite a common misconception that for each namespace there is a single schema defined somewhere. This assumption might hold for some simpler specialized XML-based languages, but for many languages used on the Web, the namespace works just as a basic semantic identification.
Very often, there are multiple variants of a vocabulary in one particular namespace. These vocabularies may be subsets of the “base” language; this is the case for XHTML 1.0 Transitional and its derivatives such as XHTML 1.0 Strict, XHTML Basic or XHTML Print. The second case is a newer version of a vocabulary which does not change the meaning of the original elements, so there is no need to change the namespace. XSLT 1.0 and XSLT 2.0 share the same namespace, but XSLT 2.0 defines dozens of new elements and attributes and even changes the content model of some elements. A similar situation holds for XHTML: XHTML 1.1 defines several new elements for Ruby annotations.

Versioning namespaces

Several different approaches to recognizing document types in a single namespace are in common use. One of the easiest is to use a dedicated attribute to hold version information, as XSLT does.

Example 7. Version information inside XSLT 2.0 stylesheet
<xsl:stylesheet 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">
  ...
</xsl:stylesheet>

This is an almost ideal way of conveying version information. The attribute value can be easily accessed in almost all processing tools. More importantly, you can embed XSLT in another XML vocabulary and still identify the version of XSLT from the version attribute.
The only problem is that XSLT allows the version attribute only on the top element of a stylesheet. So you cannot, for example, extract a single template from a stylesheet and attach version information to it.
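To illustrate how easily the version attribute can be read even when a stylesheet is embedded in another vocabulary, here is a minimal Python sketch. The wrapper element and the function name find_xslt_versions are invented for this illustration and do not come from any XSLT tool.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
# Both element names are allowed as the top element of a stylesheet.
TOP_ELEMENTS = (f"{{{XSL_NS}}}stylesheet", f"{{{XSL_NS}}}transform")


def find_xslt_versions(xml_text):
    """Return the version attribute of every embedded XSLT top element."""
    root = ET.fromstring(xml_text)
    return [elem.get("version") for elem in root.iter() if elem.tag in TOP_ELEMENTS]


# Hypothetical host document that wraps an XSLT 2.0 stylesheet.
sample = """
<wrapper xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:stylesheet version="2.0">
    <xsl:template match="/"/>
  </xsl:stylesheet>
</wrapper>
"""

print(find_xslt_versions(sample))
```

A template extracted on its own carries no such attribute, so the same function returns an empty list for it, which is exactly the limitation described above.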
XHTML uses a legacy way of specifying version information, which depends on the presence of a document type declaration (!DOCTYPE) at the start of the document.

Example 8. Version information in XHTML document
<!DOCTYPE html 
  PUBLIC "-//W3C//DTD XHTML-Print 1.0//EN"
  "http://www.w3.org/MarkUp/DTD/xhtml-print10.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  ...
</html>

Strictly speaking, a document type declaration is not a version indication; it is just a reference to a DTD which can be used for validation and for defining entities. But the public or system identifier can serve as a version identifier, albeit a long and verbose one.
Unfortunately, a document type declaration can occur only at the beginning of an XML document. It cannot be embedded in the middle of a document, which disqualifies it from use in the Web of compound documents. For example, you cannot embed an XHTML page in a SOAP message and identify the version of XHTML used.
Moreover, the current specifications of several XHTML flavors (for example XHTML Basic and XHTML Print) make the public identifier optional and allow a private system identifier as long as it points to a copy of the original DTD. This means that in order to reliably detect the version of XHTML used, you have to download the DTD, normalize its line-end characters and then compare it with the original DTDs provided by W3C as part of the respective specifications. Such a process is clearly overkill. Furthermore, requests to download a private copy of a DTD can be misused as an attack against the Web agent: the DTD could be very long, or it could use a large number of entity declarations to congest the XML parser.
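The cheap variant of this detection, looking only at the public identifier, can be sketched in a few lines of Python. The lookup table and the function name xhtml_flavor_from_doctype are assumptions made for this illustration; the table lists only identifiers that appear in this article. The standard minidom parser used here does not download the external DTD.

```python
from xml.dom import minidom

# Assumed lookup table, limited to public identifiers used in this article.
KNOWN_PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XHTML 1.0 Strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "XHTML 1.0 Transitional",
    "-//W3C//DTD XHTML-Print 1.0//EN": "XHTML Print",
}


def xhtml_flavor_from_doctype(xml_text):
    """Guess the XHTML flavor from the DOCTYPE public identifier, or return None."""
    doctype = minidom.parseString(xml_text).doctype
    if doctype is None or doctype.publicId is None:
        return None  # no DOCTYPE at all, or only a (possibly private) system identifier
    return KNOWN_PUBLIC_IDS.get(doctype.publicId)


with_public_id = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML-Print 1.0//EN" '
    '"http://www.w3.org/MarkUp/DTD/xhtml-print10.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml"/>'
)
# Hypothetical private copy of the DTD, referenced without a public identifier.
with_private_system_id = (
    '<!DOCTYPE html SYSTEM "my-copy-of-xhtml-print10.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml"/>'
)

print(xhtml_flavor_from_doctype(with_public_id))
print(xhtml_flavor_from_doctype(with_private_system_id))
```

The second document is where the approach breaks down: with only a private system identifier, nothing short of fetching and comparing the DTD identifies the flavor.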
There is also a little-known feature of XHTML that could be used for specifying version information instead of the document type declaration: the profile attribute on the head element. The profile identifies a particular profile (version, subset) of the language used and has the form of a URI.

Example 9. More robust way of labeling document as XHTML Print
<html xmlns="http://www.w3.org/1999/xhtml">
  <head 
    profile="http://www.w3.org/Markup/Profile/Print">
  ...
  </head>
  ...
</html>

Again, the profile attribute is not a perfect solution: it can be specified only on the head element and thus cannot be used to specify the flavor of XHTML for just a small fragment of XHTML code.
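A profile check of this kind might look like the following Python sketch. It treats the attribute value as a whitespace-separated list of URIs; the function name has_profile and the sample documents are made up for this illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
PRINT_PROFILE = "http://www.w3.org/Markup/Profile/Print"


def has_profile(xml_text, profile_uri):
    """Return True if html/head/@profile lists profile_uri as one of its tokens."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{XHTML_NS}}}html":
        return False  # a bare fragment has no head element to carry the profile
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return False
    # The attribute may hold several URIs separated by white space.
    return profile_uri in head.get("profile", "").split()


full_document = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head profile="http://www.w3.org/Markup/Profile/Print"><title>t</title></head>'
    "<body/></html>"
)
fragment = '<p xmlns="http://www.w3.org/1999/xhtml">Just a paragraph.</p>'

print(has_profile(full_document, PRINT_PROFILE))
print(has_profile(fragment, PRINT_PROFILE))
```

The fragment case returns False regardless of which flavor the paragraph was written for, which is the weakness noted above.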
The previous examples lead to a very sad conclusion: current XML vocabularies and their specifications were not designed to fully exploit the possibilities of compound documents. We think that W3C should extend the current Web architecture [WEBARCH] and update older specifications to support a robust and flexible way of attaching version information to arbitrary fragments of XML vocabularies. Allowing a version or similar attribute on all elements which can be used as root elements of XML fragments seems to be a simple and sufficient solution. At the same time, a document type declaration should be made an optional part of a conforming document.

Extending NVDL to support versioning

In the previous text we explained why we think that a namespace by itself is not sufficient for document type identification. As NVDL validation dispatching is based solely on namespaces, we have a problem here: the current version of NVDL is not able to recognize different flavors of XHTML or XSLT and route validation to the appropriate version of the schema. To overcome this limitation, we designed a few extensions to the NVDL language. Their utility is being evaluated using JNVDL (our implementation of the NVDL standard).
Our extensions allow better control over the dispatching process. A validation candidate (usually an XML fragment from a single namespace) is validated only if the corresponding rule in the NVDL script matches both the namespace and an additional condition. This condition can be expressed in the XPath language. Example 10, “Validation dispatching based on XPath expression” shows an NVDL script which can differentiate between XSLT 1.0 and XSLT 2.0 and use the correct schema for validation.

Example 10. Validation dispatching based on XPath expression
<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0"
       xmlns:jnvdl="http://jnvdl.sf.net">
  <namespace ns="http://www.w3.org/1999/XSL/Transform" 
             jnvdl:useWhen="@version = '1.0'">
    <validate schema="xslt1.xsd"/>
  </namespace>
  <namespace ns="http://www.w3.org/1999/XSL/Transform" 
             jnvdl:useWhen="@version = '2.0'">
    <validate schema="xslt2.rng"/>
  </namespace>
</rules>

A rule in an NVDL script is used for dispatching only when the expression in its useWhen attribute is true. The context for XPath evaluation corresponds to the element section to be dispatched. For attribute sections, a dummy element with the attributes attached is used as the context.
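The dispatching logic of Example 10 can be modelled roughly as follows. This Python sketch is not JNVDL code: the conditions are plain Python callables standing in for the XPath expressions in useWhen, and taking the first rule that matches is an assumption of the sketch, not something the extension is documented here to do.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# (namespace, condition, schema); each condition stands in for a useWhen XPath
# expression evaluated with the section's top element as the context.
RULES = [
    (XSL_NS, lambda elem: elem.get("version") == "1.0", "xslt1.xsd"),
    (XSL_NS, lambda elem: elem.get("version") == "2.0", "xslt2.rng"),
]


def namespace_of(elem):
    """Extract the namespace URI from an ElementTree tag of the form {uri}local."""
    return elem.tag[1:].split("}", 1)[0] if elem.tag.startswith("{") else ""


def select_schema(section_root, rules=RULES):
    """Return the schema of the first rule whose namespace and condition both match."""
    ns = namespace_of(section_root)
    for rule_ns, condition, schema in rules:
        if rule_ns == ns and condition(section_root):
            return schema
    return None  # no rule applies to this section


stylesheet = ET.fromstring(
    '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"/>'
)
print(select_schema(stylesheet))
```

Without the condition, both rules would match the same namespace and the dispatcher would have no basis for choosing between the two schemas.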
Although we stated earlier that a document type declaration is not a robust way of specifying version information, it is worth supporting for legacy documents. Past practice was to differentiate document types by their system and public identifiers. JNVDL therefore also supports the additional parameters useWhenPublicId, useWhenSystemId, useWhenPublicIdRegex and useWhenSystemIdRegex, which control whether a particular rule is used for dispatching based on the content of the system or public identifier. If there is more than one “useWhen…” parameter on a single rule, it is sufficient if just one of them matches. Example 11, “Extended NVDL script which handles various flavors of XHTML” shows how to use these JNVDL extensions to handle various kinds of XHTML in a single NVDL script.

Example 11. Extended NVDL script which handles various flavors of XHTML
<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0"
       xmlns:jnvdl="http://jnvdl.sf.net"
       xmlns:html="http://www.w3.org/1999/xhtml">
  <namespace ns="http://www.w3.org/1999/xhtml" 
             jnvdl:useWhenPublicId="-//W3C//DTD XHTML 1.0 Strict//EN"
             jnvdl:useWhenSystemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <validate schema="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
  </namespace>
  <namespace ns="http://www.w3.org/1999/xhtml" 
             jnvdl:useWhenPublicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
             jnvdl:useWhenSystemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <validate schema="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
  </namespace>
  <namespace ns="http://www.w3.org/1999/xhtml" 
             jnvdl:useWhenPublicId="-//W3C//DTD XHTML-Print 1.0//EN"
             jnvdl:useWhenSystemId="http://www.w3.org/MarkUp/DTD/xhtml-print10.dtd"
             jnvdl:useWhen="contains(concat(' ', normalize-space(html:html/html:head/@profile), ' '), 
                                      ' http://www.w3.org/Markup/Profile/Print ')">
    <validate schema="xhtml-print-1.rng"/>
  </namespace>
  <!-- Unrecognized flavors of XHTML are validated as XHTML 1.0 Transitional -->
  <namespace ns="http://www.w3.org/1999/xhtml">
    <validate schema="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
  </namespace>
</rules>
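The “any one parameter is enough” behaviour of Example 11 can be sketched in Python as shown below. Everything here is illustrative rather than taken from JNVDL: the Candidate and Rule classes are hypothetical, the use_when callable stands in for an XPath expression, regular expressions are applied with re.search, and rules are tried in document order so that the unconditional rule acts as a fallback. The last two points are assumptions of the sketch.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    """What the matcher knows about an XHTML validation candidate."""
    public_id: Optional[str] = None
    system_id: Optional[str] = None
    profile: str = ""


@dataclass
class Rule:
    schema: str
    public_id: Optional[str] = None        # models useWhenPublicId
    system_id: Optional[str] = None        # models useWhenSystemId
    public_id_regex: Optional[str] = None  # models useWhenPublicIdRegex
    system_id_regex: Optional[str] = None  # models useWhenSystemIdRegex
    use_when: Optional[Callable[[Candidate], bool]] = None  # models useWhen


def rule_matches(rule, cand):
    """A rule matches if any of its conditions holds, or if it has none at all."""
    checks = []
    if rule.public_id is not None:
        checks.append(cand.public_id == rule.public_id)
    if rule.system_id is not None:
        checks.append(cand.system_id == rule.system_id)
    if rule.public_id_regex is not None:
        checks.append(bool(cand.public_id and re.search(rule.public_id_regex, cand.public_id)))
    if rule.system_id_regex is not None:
        checks.append(bool(cand.system_id and re.search(rule.system_id_regex, cand.system_id)))
    if rule.use_when is not None:
        checks.append(rule.use_when(cand))
    return any(checks) if checks else True


PRINT_PROFILE = "http://www.w3.org/Markup/Profile/Print"
STRICT_DTD = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
TRANSITIONAL_DTD = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"

RULES = [
    Rule(STRICT_DTD, public_id="-//W3C//DTD XHTML 1.0 Strict//EN", system_id=STRICT_DTD),
    Rule("xhtml-print-1.rng",
         public_id="-//W3C//DTD XHTML-Print 1.0//EN",
         system_id="http://www.w3.org/MarkUp/DTD/xhtml-print10.dtd",
         use_when=lambda c: PRINT_PROFILE in c.profile.split()),
    Rule(TRANSITIONAL_DTD),  # fallback for unrecognized flavors
]


def select_schema(cand, rules=RULES):
    """Return the schema of the first matching rule (assumed ordering)."""
    for rule in rules:
        if rule_matches(rule, cand):
            return rule.schema
    return None


print(select_schema(Candidate(profile=PRINT_PROFILE)))
```

In this model a document labelled only through the profile attribute is routed to the XHTML Print schema even though it has no document type declaration, because one matching condition suffices.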

We think that without extensions similar to those we propose, the usefulness of NVDL is somewhat limited for the compound documents actually in use on the Web. On the other hand, only XPath-based matching (useWhen) could potentially be added to a future revision of NVDL. Matching on system and public identifiers needs information which is not available in the NVDL data model. JNVDL therefore uses an augmented data model to support those extensions.

Related work

NVDL can be used not only for validation, but also as a generic mechanism for associating schemas with XML documents. This feature is necessary in many scenarios, including XML editors and loosely coupled interfaces that accept incoming XML messages.
NVDL is an international standard and builds on many preceding technologies. Some XML editors, for example nXML mode for Emacs and oXygen, use a similar approach for associating schemas with the document being edited. But as far as we know, none of those systems supports full XPath for managing associations. In this regard our NVDL implementation offers unique functionality, which we believe will be incorporated directly into the NVDL standard in the future.
Currently there are two other NVDL implementations, oNVDL and enovdl. Advantages of our implementation include seamless integration with Java XML APIs and extensions for working with multiple different schemas for the same namespace.
Our implementation holds all validation candidates in memory in a tree representation. This allows us to evaluate full XPath before dispatching takes place. oNVDL uses a streaming approach, which is more memory efficient and can be used for processing very large datasets, but it is impossible to support full XPath in a streaming mode.

Future work

One of the future plans for JNVDL is to incorporate it into the Relaxed Web document validation project, see [RLX]. Relaxed consists of comprehensive HTML 4.01/XHTML 1.0 schemas and a validation framework. The schemas were created using RELAX NG with embedded Schematron rules. This combination of languages is very powerful: it can express more constraints than the official W3C DTD-based schemas. Relaxed also validates some of the WCAG 1.0 constraints and supports validation of compound documents. Users may choose to allow foreign namespaces inside their documents or to ban them completely. They may also validate against a set of predefined compound schemas, e.g. XHTML+SVG, XHTML+MathML etc.
In terms of compound documents, the Relaxed project is limited to RELAX NG namespace support. As described previously, this makes creating compound document schemas difficult and time consuming. To overcome those difficulties, the new version of Relaxed, which is currently being developed, will use JNVDL as its core validation engine. This step will make NVDL validation publicly accessible through a Web-based interface.
Moreover, JNVDL will make it easy to create and maintain new compound document schemas. Part of the future work is to enrich the current Relaxed schema repository with NVDL scripts for different combinations of standard and widely used vocabularies, especially those intended for use in the Web environment.

Conclusions

This article has shown that validation of compound documents is best handled using the NVDL validation language, which has many advantages over other similar approaches (e.g. namespace support in RELAX NG or XML Schema). We described the problem of having different versions of a document type in a single namespace and showed how to address it by extending NVDL in our implementation called JNVDL. We will use the knowledge acquired from developing JNVDL and using NVDL when building the new version of the Relaxed validator.

Bibliography


[NVDL] Document Schema Definition Languages (DSDL) — Part 4: Namespace-based Validation Dispatching Language — NVDL. ISO/IEC 19757-4. 2006. Available at: http://standards.iso.org/ittf/PubliclyAvailableStandards/c038615_ISO_IEC_19757-4_2006(E).zip

[XH] Birkbeck, M: xH: The new language you already know. Presented at WWW 2006 conference, Edinburgh.

[XML] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., Yergeau, F.: Extensible Markup Language (XML) 1.0 (Fourth Edition). W3C, 2006. Available at: http://www.w3.org/TR/2006/REC-xml-20060816

[NS] Bray, T., Hollander, D., Layman, A., Tobin, R.: Namespaces in XML 1.0 (Second Edition). W3C, 2006. Available at: http://www.w3.org/TR/REC-xml-names

[WEBARCH] Walsh, N., Jacobs, I.: Architecture of the World Wide Web, Volume One. W3C, 2004. Available at: http://www.w3.org/TR/webarch/

[RLX] Kosek, J., Nálevka, P.: Relaxed—on the Way Towards True Validation of Compound Documents. In: WWW 2006 Proceedings. WWW 2006. May 23–26, 2006. Edinburgh, Scotland. Available at: http://www2006.org/programme/files/pdf/4508.pdf

[HTML4] Raggett, D., Le Hors, A., Jacobs, I.: HTML 4.01 Specification. W3C, 1999. Available at: http://www.w3.org/TR/1999/REC-html401-19991224/

[XHTML1] XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition). W3C, 2002. Available at: http://www.w3.org/TR/2002/REC-xhtml1-20020801/

[XMLSCH-ST] Thompson, H.S., Beech, D., Maloney, M., Mendelsohn, N.: XML Schema Part 1: Structures Second Edition. W3C, 2004. Available at: http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/

[XMLSCH-DT] Biron, P., Malhotra, A.: XML Schema Part 2: Datatypes Second Edition. W3C, 2004. Available at: http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

[RNG] Clark, J., Murata, M.: RELAX NG Specification. OASIS Committee Specification, 2001. Available at: http://www.relaxng.org/spec-20011203.html

[XHTMLMOD] Altheim, M., McCarron, S., Boumphrey, F., Dooley, S., Schnitzenbaumer, S., Wugofski, T.: Modularization of XHTML™. W3C, 2001. Available at: http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/

[RDF] Beckett, D.: RDF/XML Syntax Specification (Revised). W3C, 2004. Available at: http://www.w3.org/TR/rdf-syntax-grammar

[SVG] Ferraiolo, J., Fujisawa, S., Jackson, J.: Scalable Vector Graphics (SVG) 1.1 Specification. W3C, 2003. Available at: http://www.w3.org/TR/SVG11

[MTHML] Carlisle, D., Ion, P., Miner, R., Poppelier, N.: Mathematical Markup Language (MathML) Version 2.0 (Second Edition). W3C, 2003. Available at: http://www.w3.org/TR/MathML2


[1] Compound language is a term used within this text to describe a language composed of two or more different XML vocabularies. Such a composition is considered to be a language itself, as it has its own syntax and semantics in addition to the syntax and semantics of the particular vocabularies.
[2] Subschema is defined in the NVDL specification as a schema referenced by the NVDL script.
