Yaidom and namespaces by dvreeze

Introduction

This article is the second one in a series of articles introducing the yaidom library. The topic is XML namespaces in yaidom. Like in the first article, the code examples contain basic XML queries in yaidom, but this time special attention is paid to namespaces. In particular, the concepts of qualified names, expanded names, namespace declarations and in-scope namespaces are treated in the context of yaidom. It becomes clear that yaidom is an API that values precision, clarity, and, to a large extent, minimality.

This article is based on Understanding XML Namespaces, which offers an excellent explanation of XML namespaces. Much of the content of that article can be expressed in yaidom. This second yaidom article indeed shows that yaidom is not just an XML query API, but that its understanding of namespaces can be explained via yaidom code as well! In that capacity yaidom feels a bit like a "theory", and this article indeed exposes that "theoretical" side.

It is assumed that the reader has first read the first article about yaidom, which introduced the basics of element querying in yaidom. Like in the preceding article on yaidom, it is assumed that the reader knows the basics of XML (including namespaces), Java XML processing, and Scala with its Collections API.

The remainder of this article is organized as follows:

Qualified and expanded names.
Namespace declarations and in-scope namespaces.
Attribute querying and resolution.
Equivalent XML documents.
Conclusion (of this second article about yaidom).

Qualified and expanded names

Like in the preceding article on yaidom, all code examples below can be tried out in the Scala REPL. Scala 2.11.X is required, and yaidom 1.3.2 (or later) must be on the classpath. The sample XML file (from the article by Evan Lenz) is feed1.txt. First assume the following code to have executed:

import java.net.URI
import javax.xml.parsers._
import scala.collection.immutable
import eu.cdevreeze.yaidom._
import eu.cdevreeze.yaidom.core._

import queryapi.HasENameApi._

// Using a yaidom DocumentParser that used SAX internally
val docParser = parse.DocumentParserUsingSax.newInstance

val feed1Doc: Document =
    docParser.parse(new URI("http://dvreeze.github.io/examples/feed1.txt"))

val feed1DocElem = feed1Doc.documentElement

In Understanding XML Namespaces, Evan Lenz distinguishes between qualified names and expanded names. As he puts it, qualified names (or QNames) are the syntactic constructs that represent expanded names. Qualified names have an optional colon, so foo and my:foo are QNames. Expanded names, which do not syntactically occur in XML, have an optional namespace URI. Using James Clark notation, foo and {http://xmlportfolio.com/xmlguild-examples}foo are expanded names (or ENames).

Let's now query for the qualified names and expanded names of all elements in the document parsed above. The qualified names of all descendant-or-self elements are found as follows:

val elemQNames = feed1DocElem.findAllElemsOrSelf.map(_.qname).toSet

The resulting QNames are as follows:

require(elemQNames ==
    Set(
        QName("feed"),
        QName("title"),
        QName("rights"),
        QName("xhtml", "div"),
        QName("xhtml", "strong"),
        QName("xhtml", "em")))

// Writing the QNames differently, corresponding to the string representations:

require(elemQNames ==
    Set(
        QName("feed"),
        QName("title"),
        QName("rights"),
        QName("xhtml:div"),
        QName("xhtml:strong"),
        QName("xhtml:em")))

The qname method is not part of the yaidom "uniform query API" (ElemApi and friends), but belongs to the "default" element implementation. Expanded names, on the other hand, are far more central in the query API. The expanded names (or resolved names) of all descendant-or-self elements are found as follows:

val elemENames = feed1DocElem.findAllElemsOrSelf.map(_.resolvedName).toSet

The resulting ENames are as follows:

val atomNs = "http://www.w3.org/2005/Atom"
val xhtmlNs = "http://www.w3.org/1999/xhtml"

require(elemENames ==
    Set(
        EName(atomNs, "feed"),
        EName(atomNs, "title"),
        EName(atomNs, "rights"),
        EName(xhtmlNs, "div"),
        EName(xhtmlNs, "strong"),
        EName(xhtmlNs, "em")))

// Writing the ENames differently, using James Clark notation:

require(elemENames ==
    Set(
        EName("{http://www.w3.org/2005/Atom}feed"),
        EName("{http://www.w3.org/2005/Atom}title"),
        EName("{http://www.w3.org/2005/Atom}rights"),
        EName("{http://www.w3.org/1999/xhtml}div"),
        EName("{http://www.w3.org/1999/xhtml}strong"),
        EName("{http://www.w3.org/1999/xhtml}em")))

Unlike the query for qualified names, the query for expanded names would have worked for most yaidom element implementations.

As Evan Lenz points out, it is important to be careful when someone uses the term "qualified name". Often the term "qualified name" is used for what we call EName, and often it means the same as what we call an EName, but keeping the optional prefix as well (see for example javax.xml.namespace.QName). Like Evan Lenz does in his article, yaidom clearly distinguishes between the two concepts of QNames and ENames, because without such a distinction it is very hard to talk about namespaces in precise terms. It is not very helpful that namespace-related terminology is not used consistently across different XML specifications. The Namespaces specification uses the terms qualified name and expanded name like Evan Lenz does (and like yaidom does), whereas the XML Schema Part 2, XPath 2.0 and XQuery 1.0 specifications blur this distinction. According to the XML Schema Part 2 specification, the xs:QName data type has a "lexical space" reminding of what we call QNames and a "value space" reminding of what we call ENames.

While yaidom claims to offer precision, clarity and (to a large extent) minimality, it should become clear now why yaidom does not claim to be "correct". After all, what is correctness if different XML specifications have the same term mean different things? Moreover, what is correctness if some definitions are problematic in that they mutually depend on each other? For example, namespace declarations are considered to be attributes by the Namespaces specification, while the resolution of (prefixed) attributes depends on in-scope namespaces and therefore on namespace declarations. To retain precision and clarity, yaidom (like several other XML libraries) therefore does not consider namespace declarations to be attributes.

We have not paid any attention to how qualified names are resolved as expanded names. That's the topic of the next section.

Namespace declarations and in-scope namespaces

In the example XML document above, only the document element contains some namespace declarations. The "atom" namespace (with namespace URI http://www.w3.org/2005/Atom) is the default namespace. The "xhtml" namespace (with namespace URI http://www.w3.org/1999/xhtml) is declared with prefix xhtml. Another namespace (with namespace URI http://xmlportfolio.com/xmlguild-examples) is declared with prefix my. These namespace declarations are written in yaidom as follows:

val feed1ElemDecls = Declarations.from(
    "" -> "http://www.w3.org/2005/Atom",
    "xhtml" -> "http://www.w3.org/1999/xhtml",
    "my" -> "http://xmlportfolio.com/xmlguild-examples")

Note that the default namespace was declared with the empty string as prefix. The in-scope namespaces of the document element are the same as the namespaces declared in the document element, because it has no parent element (and therefore no in-scope namespaces of the parent element). Indeed:

val feed1ElemScope = Scope.Empty.resolve(feed1ElemDecls)

val expectedFeed1ElemScope = Scope.from(
    "" -> "http://www.w3.org/2005/Atom",
    "xhtml" -> "http://www.w3.org/1999/xhtml",
    "my" -> "http://xmlportfolio.com/xmlguild-examples")

require(feed1ElemScope == expectedFeed1ElemScope)

There are no namespace declarations elsewhere in the XML document, so all descendant-or-self elements have the same scope:

require(feed1DocElem.findAllElemsOrSelf.forall(e => e.scope == feed1ElemScope))

Like the qname method earlier, the scope method is not part of the yaidom "uniform query API" (ElemApi and friends), but belongs to the "default" element implementation. It is now shown that for each element in the atom namespace the scope is used to resolve the QName as EName:

val allElems = feed1DocElem.findAllElemsOrSelf

// The default namespace is the atom namespace
val allAtomElems = allElems.filter(e => e.qname.prefixOption.isEmpty)

require(allAtomElems.forall(e =>
    e.scope.resolveQNameOption(e.qname) == Some(e.resolvedName)))

// Indeed, the ENames are in the atom namespace
require(allAtomElems.forall(e =>
    e.resolvedName.namespaceUriOption == Some(atomNs)))

Not surprisingly, for the elements in the xhtml namespace things are very similar:

val allXhtmlElems = allElems.filter(e => e.qname.prefixOption == Some("xhtml"))

require(allXhtmlElems.forall(e =>
    e.scope.resolveQNameOption(e.qname) == Some(e.resolvedName)))

// Indeed, the ENames are in the xhtml namespace
require(allXhtmlElems.forall(e =>
    e.resolvedName.namespaceUriOption == Some(xhtmlNs)))

Generalizing resolution of QNames as ENames, the following check succeeds for all descendant-or-self elements:

require(feed1DocElem.findAllElemsOrSelf.forall(e =>
    e.scope.resolveQNameOption(e.qname) == Some(e.resolvedName)))

The correspondence between element QNames and ENames can also be expressed as follows:

require(feed1DocElem.findAllElemsOrSelf.forall(e =>
    e.resolvedName.localPart == e.qname.localPart))

require(feed1DocElem.findAllElemsOrSelf.forall(e =>
    e.resolvedName.namespaceUriOption ==
        e.scope.prefixNamespaceMap.get(e.qname.prefixOption.getOrElse(""))))

Note how properties like these can be expressed in yaidom, using "default" (immutable!) elements.

Attribute querying and resolution

Up to now, we have only queried for elements, and not for attributes. Typically we query for attributes using method attributeOption or its alias \@, which returns the string value of the attribute, if any, wrapped in an Option. If we are sure that an attribute exists, we can instead use method attribute, which returns the string value of the attribute, but throws an exception if the attribute does not exists. Namespace declarations are not considered attributes in yaidom, as mentioned earlier. For example:

// Get the rights child element of the root element
val rights1Elem: Elem = feed1DocElem.getChildElem(withEName(atomNs, "rights"))

require(rights1Elem \@ EName("type") == Some("xhtml"))

val examplesNs = "http://xmlportfolio.com/xmlguild-examples"

require(rights1Elem \@ EName(examplesNs, "type") == Some("silly"))

As shown above, the rights element (in the default atom namespace) has 2 attributes with local name type, one unprefixed and one with prefix my. As Evan Lenz points out, the default namespace does not affect unprefixed attributes. Therefore the expanded names of these 2 attributes are EName("type") and EName(exampleNs, "type"), respectively.

To get all attributes of rights1Elem, as a mapping from QNames to the attribute string values, we write:

val rights1ElemAttrs = rights1Elem.attributes

require(rights1ElemAttrs.toMap.keySet ==
    Set(QName("type"), QName("my", "type")))

To get the "resolved" attributes, as mappings from ENames to the attribute string values, we write:

val rights1ElemResolvedAttrs = rights1Elem.resolvedAttributes

require(rights1ElemResolvedAttrs.toMap.keySet ==
    Set(EName("type"), EName(examplesNs, "type")))

The correspondence between the qualified and expanded names of attributes can be expressed more generally as follows:

require {
    feed1DocElem.findAllElemsOrSelf forall { elem =>
        val attrs = elem.attributes
        val resolvedAttrs = attrs map {
            case (attrQName, attrValue) =>
                val resolvedAttrName = elem.attributeScope.resolveQNameOption(attrQName).get
                (resolvedAttrName -> attrValue)
        }

        resolvedAttrs.toMap == elem.resolvedAttributes.toMap
    }
}

Above, the "attribute scope" is the scope excluding the default namespace:

require(feed1DocElem.findAllElemsOrSelf.forall(e =>
    e.attributeScope == e.scope.withoutDefaultNamespace))

Equivalent XML documents

Now consider the XML document feed2.txt. It is equivalent to the preceding XML document, in that it has the same elements with the same expanded names and "resolved attributes". Still, its namespace declarations occur not only in the document element, it uses a different prefix for one of the namespaces, and it overrides the default namespace. First assume the following code to have executed:

val feed2Doc: Document =
    docParser.parse(new URI("http://dvreeze.github.io/examples/feed2.txt"))

val feed2DocElem = feed2Doc.documentElement

Consider the element with local name div. What are the in-scope namespaces of that element? That scope is determined by the namespace declarations in the ancestor-or-self elements. So it is the "concatenation" of the namespace declarations in the document element, the rights element, and the div element itself. Written as yaidom code:

val div2Elem = feed2DocElem.findElem(withEName(xhtmlNs, "div")).get

val feed2ElemDecls = Declarations.from("" -> atomNs)
val rights2ElemDecls = Declarations.from("example" -> examplesNs)
val div2ElemDecls = Declarations.from("" -> xhtmlNs)

val div2ElemScope =
    Scope.Empty.resolve(feed2ElemDecls).resolve(rights2ElemDecls).resolve(div2ElemDecls)

require(div2ElemScope == Scope.from("" -> xhtmlNs, "example" -> examplesNs))

Given the scope of the div element, it is clear that the div element has the same resolved name as in the preceding XML document, namely EName(xhtmlNs, "div"), because the QName is QName("div") and the default namespace is the xhtml namespace. Looking at the entire XML documents, we can assert that feed1.txt and feed2.txt have the same elements, with the same expanded names and "resolved attributes". So, if we compare both XML documents without considering prefixes, they turn out to be equivalent:

val feed1ResolvedElem = resolved.Elem(feed1DocElem)
val feed2ResolvedElem = resolved.Elem(feed2DocElem)

require(feed1ResolvedElem.removeAllInterElementWhitespace ==
    feed2ResolvedElem.removeAllInterElementWhitespace)

Indeed, both XML documents have the same elements with the same resolved names and resolved attributes:

require(feed1DocElem.findAllElemsOrSelf.map(_.resolvedName) ==
    feed2DocElem.findAllElemsOrSelf.map(_.resolvedName))

require(feed1DocElem.findAllElemsOrSelf.map(_.resolvedAttributes) ==
    feed2DocElem.findAllElemsOrSelf.map(_.resolvedAttributes))

Finally, consider the XML document feed3.txt. It is also equivalent to the preceding XML documents, in that it has the same elements with the same expanded names and "resolved attributes". This time, the title, rights and div elements have (some of) the same namespace declarations as the feed element. Such duplicate namespace declarations obviously do not affect the scope of the element in question. Assume the following code to have executed:

val feed3Doc: Document =
    docParser.parse(new URI("http://dvreeze.github.io/examples/feed3.txt"))

val feed3DocElem = feed3Doc.documentElement

Let's again compute the scope of the div element by hand:

val div3Elem = feed3DocElem.findElem(withEName(xhtmlNs, "div")).get

val feed3ElemDecls =
    Declarations.from("" -> atomNs, "xhtml" -> xhtmlNs, "my" -> examplesNs)
val rights3ElemDecls = feed3ElemDecls
val div3ElemDecls = Declarations.from("xhtml" -> xhtmlNs, "my" -> examplesNs)

val div3ElemScope =
    Scope.Empty.resolve(feed3ElemDecls).resolve(rights3ElemDecls).resolve(div3ElemDecls)

require(div3ElemScope ==
    Scope.from("" -> atomNs, "xhtml" -> xhtmlNs, "my" -> examplesNs))

// The namespace declarations in the rights and div elements added no information
require(
    feed3DocElem.getChildElem(withEName(atomNs, "rights")).scope ==
        feed3DocElem.scope)
require(div3Elem.scope == feed3DocElem.scope)

Again, feed3.txt and feed1.txt are equivalent:

val feed3ResolvedElem = resolved.Elem(feed3DocElem)

require(feed1ResolvedElem.removeAllInterElementWhitespace ==
    feed3ResolvedElem.removeAllInterElementWhitespace)

Conclusion

In this article, we concentrated on yaidom's support for namespaces. Following the article Understanding XML Namespaces, yaidom clearly distinguishes among the concepts of qualified names, expanded names, namespace declarations and in-scope namespaces. It therefore turns out that we can talk about namespaces, using yaidom code, with almost mathematical precision.

The preceding yaidom article and this one laid the groundwork for more interesting articles on yaidom that will follow shortly.