XML processing with Scala and yaidom

Introduction

This article introduces the yaidom XML query library, using examples in the domain of XBRL (eXtensible Business Reporting Language).

It is assumed that the reader has some experience with XML processing in Java (e.g. JAXP) or another OO programming language (such as Scala or C#).

XSLT, XQuery and XPath are the standard XML transformation and query languages. This article introduces yaidom, used from Scala, as an alternative approach to in-memory XML querying and transformation. Still, yaidom can also be used together with standard languages such as XQuery, for example when using an XML database.

The remainder of this article is organized as follows:

Brief introduction to Scala and Scala Collections.
Brief introduction to yaidom.
Brief introduction to the XBRL examples.
Simple yaidom query examples.
Namespace examples.
Extending yaidom for custom XML dialects.
Conclusion.

After the introductions to Scala, Scala Collections and yaidom, a brief introduction to XBRL follows. XBRL is an XML-based business reporting standard. Business reports in XBRL format are called XBRL instances. XBRL instances must meet many requirements in order to be considered valid. After the brief XBRL introduction, the remainder of this article shows how many of these rules can be expressed using yaidom and Scala. It will be shown that using Scala and yaidom instead of standard XML query and transformation languages makes expressing these rules relatively easy.

There are many other articles about XML processing in Scala, mostly about Scala's own standard XML library. For example, XML Processing in Scala contains many well-chosen examples that show how to process XML in Scala. Moreover, it first introduces Scala, assuming some familiarity with XQuery on the part of the reader. To get an appreciation of XML processing using Scala in general, and of XML processing using Scala and yaidom in particular, it makes sense to read both articles, starting with XML Processing in Scala.

Brief introduction to Scala and Scala Collections

The Scala programming language is a popular alternative to the Java language on the Java virtual machine. It is object-oriented (more so than Java) and also functional, in that functions are first-class objects. It is statically typed, but it feels like a dynamically typed language, because of features like type inference.

Scala is a safe and expressive language, typically leading to good productivity and low bug counts in skilled, disciplined teams. Its rich Collections API, its strong support for immutable data structures, and its focus on expressions rather than statements enable programmers to work at a higher level of abstraction in Scala than in Java.

The Collections API of a programming language (which in the case of Scala and Java is a part of the standard library of the language, not of the core language) often says a lot about the expressive power of that language. Below follows some Scala code that manipulates collections, to illustrate Scala's expressiveness.

Consider a book store and some queries about books (using sample data from the Stanford University online course Introduction to Databases). The Scala code is as follows:

case class Author(firstName: String, lastName: String)
case class Book(isbn: String, title: String, authors: List[Author], price: Int)

val someBooks = List(
  Book(
    "ISBN-0-13-713526-2",
    "A First Course in Database Systems",
    List(Author("Jeffrey", "Ullman"), Author("Jennifer", "Widom")),
    85),
  Book(
    "ISBN-0-13-815504-6",
    "Database Systems: The Complete Book",
    List(
      Author("Hector", "Garcia-Molina"),
      Author("Jeffrey", "Ullman"),
      Author("Jennifer", "Widom")),
    100),
  Book(
    "ISBN-0-11-222222-3",
    "Hector and Jeff's Database Hints",
    List(Author("Jeffrey", "Ullman"), Author("Hector", "Garcia-Molina")),
    50),
  Book(
    "ISBN-9-88-777777-6",
    "Jennifer's Economical Database Hints",
    List(Author("Jennifer", "Widom")),
    25)
)

// Return all books that cost no more than 50 dollars (i.e., the last 2 books)

val cheapBooks = someBooks.filter(book => book.price <= 50)

// Return all books having Jeffrey Ullman as one of their authors (i.e., the first 3 books)

def hasAuthor(book: Book, authorLastName: String): Boolean = {
  book.authors.exists(author => author.lastName == authorLastName)
}

val booksByUllman =
  someBooks.filter(book => hasAuthor(book, "Ullman"))

// Return all book authors, without duplicates

val allAuthors = someBooks.flatMap(book => book.authors).distinct

// Return all titles of books having Jeffrey Ullman as one of their authors

val bookTitlesByUllman =
  someBooks.filter(book => hasAuthor(book, "Ullman")).map(book => book.title)

Note how the queries in prose naturally map to their counterparts in Scala code, using a small vocabulary of higher-order functions such as map, flatMap and filter. The code shows the "what" more than the "how". In that respect, the Scala code is more like XQuery than Java (especially Java before version 8). In a sense, the Scala core language, along with its Collections API, forms a universal query (and transformation) language. Of course, Scala is a lot more than that, but for the purposes of this article this is a fitting description.
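
As a further illustration of the same small vocabulary, the following self-contained sketch groups book titles by author last name. The object name BookQueriesSketch, the reduced sample data and the names authorTitlePairs and titlesByAuthor are made up for this illustration; only standard Scala Collections methods are used.

```scala
object BookQueriesSketch {
  case class Author(firstName: String, lastName: String)
  case class Book(isbn: String, title: String, authors: List[Author], price: Int)

  // Reduced sample data: two of the books from the listing above
  val someBooks: List[Book] = List(
    Book(
      "ISBN-0-11-222222-3",
      "Hector and Jeff's Database Hints",
      List(Author("Jeffrey", "Ullman"), Author("Hector", "Garcia-Molina")),
      50),
    Book(
      "ISBN-9-88-777777-6",
      "Jennifer's Economical Database Hints",
      List(Author("Jennifer", "Widom")),
      25))

  // Pair each author with each book title; a for-comprehension is sugar for flatMap and map
  val authorTitlePairs: List[(Author, String)] =
    for {
      book <- someBooks
      author <- book.authors
    } yield (author, book.title)

  // Group the titles by author last name
  val titlesByAuthor: Map[String, List[String]] =
    authorTitlePairs.groupBy(_._1.lastName).map {
      case (lastName, pairs) => (lastName, pairs.map(_._2))
    }
}
```

Here BookQueriesSketch.titlesByAuthor("Widom") yields a list holding only the title "Jennifer's Economical Database Hints".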

Brief introduction to yaidom

The yaidom library can be used for querying and transforming XML in Scala. It interoperates very well with the Scala Collections API.

It was mentioned above that Scala and its Collections API can be used as a universal query and transformation language. The yaidom library offers an XML element query API that turns elements into Scala collections of elements. So yaidom can be said to turn a universal query and transformation language into an XML querying and transformation language. In other words, Scala + its Collections API + yaidom can be used as an "XML querying/transformation stack". Below it will become clear that yaidom can plug in different "XML backends", thus making the "XML stack" very powerful.

Using the bookstore example above, some simple yaidom XML queries are shown below. The XML is as follows:

// The book store XML

<Bookstore>
	<Book ISBN="ISBN-0-13-713526-2" Price="85" Edition="3rd">
		<Title>A First Course in Database Systems</Title>
		<Authors>
			<Author>
				<First_Name>Jeffrey</First_Name>
				<Last_Name>Ullman</Last_Name>
			</Author>
			<Author>
				<First_Name>Jennifer</First_Name>
				<Last_Name>Widom</Last_Name>
			</Author>
		</Authors>
	</Book>
	<Book ISBN="ISBN-0-13-815504-6" Price="100">
		<Title>Database Systems: The Complete Book</Title>
		<Authors>
			<Author>
				<First_Name>Hector</First_Name>
				<Last_Name>Garcia-Molina</Last_Name>
			</Author>
			<Author>
				<First_Name>Jeffrey</First_Name>
				<Last_Name>Ullman</Last_Name>
			</Author>
			<Author>
				<First_Name>Jennifer</First_Name>
				<Last_Name>Widom</Last_Name>
			</Author>
		</Authors>
		<Remark>Buy this book bundled with "A First Course" - a great deal!
		</Remark>
	</Book>
	<Book ISBN="ISBN-0-11-222222-3" Price="50">
		<Title>Hector and Jeff's Database Hints</Title>
		<Authors>
			<Author>
				<First_Name>Jeffrey</First_Name>
				<Last_Name>Ullman</Last_Name>
			</Author>
			<Author>
				<First_Name>Hector</First_Name>
				<Last_Name>Garcia-Molina</Last_Name>
			</Author>
		</Authors>
		<Remark>An indispensable companion to your textbook</Remark>
	</Book>
	<Book ISBN="ISBN-9-88-777777-6" Price="25">
		<Title>Jennifer's Economical Database Hints</Title>
		<Authors>
			<Author>
				<First_Name>Jennifer</First_Name>
				<Last_Name>Widom</Last_Name>
			</Author>
		</Authors>
	</Book>
</Bookstore>

Below follow the yaidom XML queries corresponding to the (non-XML) queries above. Written rather verbosely, they are as follows:

// Assume a root element called bookstore.

val someBooks = bookstore.filterChildElems(book => book.localName == "Book")

// Return all books that cost no more than 50 dollars (i.e., the last 2 books)

val cheapBooks =
  someBooks.filter(book => book.attribute(EName("Price")).toInt <= 50)

// Return all books having Jeffrey Ullman as one of their authors (i.e., the first 3 books)

def hasAuthor(book: simple.Elem, authorLastName: String): Boolean = {
  require(book.localName == "Book")

  book.findElem(e =>
    e.localName == "Author" &&
      e.getChildElem(che => che.localName == "Last_Name").text == authorLastName).isDefined
}

val booksByUllman =
  someBooks.filter(book => hasAuthor(book, "Ullman"))

// Return all book author elements (with duplicates, this time)

val allAuthors =
  someBooks.flatMap(book => book.filterElems(e => e.localName == "Author"))

// Return all titles of books having Jeffrey Ullman as one of their authors

val bookTitlesByUllman =
  someBooks.filter(book => hasAuthor(book, "Ullman")).
    map(book => book.getChildElem(e => e.localName == "Title").text)

Above, the EName type stands for "expanded name". It corresponds to Java's javax.xml.namespace.QName, except that it does not retain the prefix, if any.
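
The comparison with the JDK class can be made concrete with a small sketch that uses only javax.xml.namespace.QName, not yaidom. It shows that two QName objects with different prefixes are equal, because equality is defined on namespace URI and local part, while each object still carries its prefix. The object name QNameSketch and the prefix "x" are chosen for this illustration only.

```scala
import javax.xml.namespace.QName

object QNameSketch {
  val ns = "http://www.xbrl.org/2003/instance"

  // Same namespace URI and local part, but different prefixes
  val name1 = new QName(ns, "context", "xbrli")
  val name2 = new QName(ns, "context", "x")

  // QName equality only looks at namespace URI and local part
  val sameName: Boolean = name1 == name2 // true

  // Yet each QName object still retains its prefix
  val prefixes: List[String] = List(name1.getPrefix, name2.getPrefix) // List("xbrli", "x")
}
```

An expanded name drops the prefix altogether, so there is no state left that plays no role in equality.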

For clarity, these XML queries were written more verbosely than needed. Even in a more concise form, some verbosity related to XML handling would remain. This is intentional: yaidom is a precise XML query API. For example, yaidom does not abstract away the distinction between elements and attributes, or between names with a namespace and those without any namespace. Despite this XML-specific syntax, the yaidom query examples are not much more verbose than the non-XML query examples presented earlier. Compared to ad-hoc XML querying in Java (using JAXP), ad-hoc XML querying in Scala using yaidom is much less verbose. Even when invoking XPath queries (returning node sets) from Java code, processing the resulting node sets adds "syntactic clutter" that Scala with yaidom avoids.

Why use yaidom and not Scala's own XML library? As will become apparent in this article, yaidom has very precise support for XML namespaces, more so than Scala XML. Using the yaidom API it is always clear if queries are namespace-aware. If it is intentional to query for elements that have specific local names, regardless of the namespace, then yaidom forces the user to be explicit about that. Precise namespace support in yaidom even goes as far as the ability to express a simple theory of XML namespaces (relating namespace declarations, in-scope namespaces, "qualified names" and "expanded names") in yaidom code itself, outside of any particular XML tree!

There are more reasons why yaidom may be preferable to Scala's own XML library. For example, yaidom has a precise uniform XML query API that is offered by multiple "XML backend" implementations. Not only does yaidom offer its own native DOM-like tree ("XML backend") implementations with different strengths and weaknesses, but it is also possible to wrap existing XML library tree implementations (DOM, JDOM, XOM, Saxon etc.) in yaidom, offering the same yaidom query API. For example, yaidom wrappers around Saxon-EE NodeInfo trees offer the best of Saxon and the Scala-yaidom combination: on the one hand the completeness and schema-type-awareness of Saxon-EE, and on the other hand a "Scala Collections API querying experience", using yaidom as the natural bridge between Saxon "nodes" and Scala collections processing. Unlike yaidom, the Scala XML library does not offer multiple tree implementations backing the same query API.

This extensibility of yaidom goes even further than specific "XML backends". It is even possible to extend yaidom for custom "XML dialects" (or "vocabularies"). This will be explained and shown later in this article. Arguably this could be the best reason to prefer yaidom to Scala's own XML library.

Why not just use standards such as XPath, XSLT or XQuery? XBRL processing is a good example where yaidom shines, as will become clear below. After all, XBRL is a lot more than "just" XML, so XBRL processing is a lot more than "just" XML processing. Performing most or all of this processing in Scala using yaidom offers the following advantages:

No Scala syntax is spent on processing sequences (or node sets) resulting from XPath/XQuery evaluation
In a programming language (such as Scala) it is quite natural and easy to store intermediate results in variables
As a rich (functional) OO programming language, Scala has a lot of expressive power, which makes it easy to build layered models on top of DOM-like element trees (as will be shown below)
Yaidom leverages the Scala Collections API, which enables the user to achieve a lot using only a small vocabulary
There is a large ecosystem around Scala (and Java), offering many high quality libraries
Yaidom offers (and enables) element implementations optimized for fast querying (although no benchmarks are provided in this paper)
For programmers on the JVM, Scala and yaidom have more familiar semantics than XPath and the XQuery and XPath Data Model (XDM):
- In XDM there is no difference between an item (node or atomic value) and a singleton sequence containing that item
- Sequences in XDM cannot be nested, so are always flattened
- In Scala (as in Java), "equality" is expected to be an equivalence relation (unlike the general comparison "equality" operator in XPath, which is not transitive)
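
The three semantic differences listed last can be sketched in plain Scala. The function generalEquals is a hypothetical model of the existential XPath general comparison (true if any pair of items is equal), written only for this illustration; it is not part of yaidom or of any XPath library.

```scala
object SequenceSemanticsSketch {
  // Scala collections can be nested; flattening is an explicit step
  val nested: List[List[Int]] = List(List(1, 2), List(3))
  val nestedSize: Int = nested.size            // 2
  val flattenedSize: Int = nested.flatten.size // 3

  // A value and a singleton collection holding that value are different things
  val valueEqualsSingleton: Boolean = (1: Any) == List(1) // false

  // Hypothetical model of the XPath general comparison "=":
  // true if any item on the left equals any item on the right
  def generalEquals(xs: Seq[Int], ys: Seq[Int]): Boolean =
    xs.exists(x => ys.contains(x))

  val a = Seq(1, 2)
  val b = Seq(2, 3)
  val c = Seq(3, 4)

  // a "equals" b and b "equals" c, yet a does not "equal" c
  val notTransitive: Boolean =
    generalEquals(a, b) && generalEquals(b, c) && !generalEquals(a, c) // true
}
```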

Of course, yaidom can be used with XQuery when using an XML(-enabled) database, where XQuery joins and filters database XML data into "raw result sets", which are further processed using yaidom queries. Still, it makes sense to keep the number of boundaries between XQuery and yaidom/Scala relatively low, for each such boundary has some (syntactic and semantic) costs. In summary, the more some XML processing task can benefit from the use of Scala, the more attractive the use of yaidom becomes.

Brief introduction to the XBRL examples

So far, this article has introduced Scala and yaidom, using only trivial examples. In the remainder of this article, yaidom examples in the domain of XBRL are used. First, this section gives a very brief introduction to XBRL.

XBRL (eXtensible Business Reporting Language) is a standard for business reporting. Many (but not all) XBRL reports are financial statements. XBRL reports ("XBRL instances") are XML documents, following a specified structure.

Suppose we want to report that for a given organization (CIK-number 1234567890) the average number of employees in 2003 was 220, and that the corresponding numbers for 2004 and 2005 were 240 and 250, respectively. More precisely, concept gaap:AverageNumberEmployees (described by the so-called US-GAAP XBRL taxonomy) has the value 220 in the given context (CIK-number 1234567890, year 2003). Then we can report the 3 facts above in XBRL format as follows:

<xbrl xmlns="http://www.xbrl.org/2003/instance" xmlns:gaap="http://xasb.org/gaap">

   <context id="D-2003">
      <entity>
         <identifier scheme="http://www.sec.gov/CIK">1234567890</identifier>
      </entity>
      <period>
         <startDate>2003-01-01</startDate>
         <endDate>2003-12-31</endDate>
      </period>
   </context>

   <context id="D-2004">
      <entity>
         <identifier scheme="http://www.sec.gov/CIK">1234567890</identifier>
      </entity>
      <period>
         <startDate>2004-01-01</startDate>
         <endDate>2004-12-31</endDate>
      </period>
   </context>

   <context id="D-2005">
      <entity>
         <identifier scheme="http://www.sec.gov/CIK">1234567890</identifier>
      </entity>
      <period>
         <startDate>2005-01-01</startDate>
         <endDate>2005-12-31</endDate>
      </period>
   </context>

   <unit id="U-Pure">
     <measure>pure</measure>
   </unit>

  <gaap:AverageNumberEmployees contextRef="D-2003" unitRef="U-Pure" decimals="INF">220</gaap:AverageNumberEmployees>
  <gaap:AverageNumberEmployees contextRef="D-2004" unitRef="U-Pure" decimals="INF">240</gaap:AverageNumberEmployees>
  <gaap:AverageNumberEmployees contextRef="D-2005" unitRef="U-Pure" decimals="INF">250</gaap:AverageNumberEmployees>

</xbrl>

This example comes from a non-trivial sample XBRL instance written by Charles Hoffman, also known as "the father of XBRL".

There are many requirements that have to be met in order for an XBRL instance to be valid. The XBRL Core specification (as well as other XBRL specifications) describes many of these requirements. There are also many common best practices that have been formalized as complementary rules. For example, the international FRIS standard places additional constraints on XBRL instances.

Most of the remainder of this article will show how many of those FRIS rules can be written naturally as Scala expressions using yaidom. Yaidom is in no way married to XBRL, but XBRL validations are good XML processing examples where Scala and yaidom really shine.

Simple yaidom query examples

The XBRL snippet above is part of this sample instance. In this section, some simple yaidom XML queries are performed on the XBRL instance.

Before showing these queries on this XBRL instance, it should be noted that knowing only 3 yaidom query API methods to some extent means knowing them all. These 3 methods are filterChildElems, filterElems and filterElemsOrSelf. They all filter elements, based on the passed element predicate function. The difference is that they filter child elements, descendant elements, and descendant-or-self elements, respectively. The word "descendant" is left out from the method names.
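
How the three methods relate to each other can be sketched with a toy element type. ToyElem below is made up for this illustration and is not a yaidom class; it only mimics the method names, under the assumption that results are returned in document order.

```scala
object ToyElemSketch {
  // Toy element type, not a yaidom class: just a name and child elements
  case class ToyElem(name: String, children: List[ToyElem] = Nil) {

    // Child elements obeying the predicate
    def filterChildElems(p: ToyElem => Boolean): List[ToyElem] =
      children.filter(p)

    // Descendant-or-self elements obeying the predicate, in document order
    def filterElemsOrSelf(p: ToyElem => Boolean): List[ToyElem] =
      List(this).filter(p) ++ children.flatMap(_.filterElemsOrSelf(p))

    // Descendant elements obeying the predicate, in document order
    def filterElems(p: ToyElem => Boolean): List[ToyElem] =
      children.flatMap(_.filterElemsOrSelf(p))
  }

  val book: ToyElem =
    ToyElem("Book", List(
      ToyElem("Title"),
      ToyElem("Authors", List(ToyElem("Author"), ToyElem("Author")))))

  val childNames: List[String] = book.filterChildElems(_ => true).map(_.name)
  // List("Title", "Authors")

  val authorCount: Int = book.filterElems(_.name == "Author").size // 2

  val allNames: List[String] = book.filterElemsOrSelf(_ => true).map(_.name)
  // List("Book", "Title", "Authors", "Author", "Author")
}
```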

It should also be noted that methods filterChildElems and filterElemsOrSelf have shorthands \ and \\, respectively. Method attributeOption has shorthand \@. Moreover, some element predicate functions have names, such as withLocalName and withEName.

Some yaidom queries on the sample XBRL instance are as follows:

// Let's first parse the XBRL instance document

val docParser = DocumentParserUsingSax.newInstance

val doc = docParser.parse(sampleXbrlInstanceFile)

// Check that all gaap:AverageNumberEmployees facts have unit U-Pure.

val xmlNs = "http://www.w3.org/XML/1998/namespace"
val xbrliNs = "http://www.xbrl.org/2003/instance"
val gaapNs = "http://xasb.org/gaap"

val avgNumEmployeesFacts =
  doc.documentElement.filterChildElems(withEName(gaapNs, "AverageNumberEmployees"))

println(avgNumEmployeesFacts.size) // prints 7

val onlyUPure =
  avgNumEmployeesFacts.forall(fact => fact.attributeOption(EName("unitRef")) == Some("U-Pure"))
println(onlyUPure) // prints true

// Check the unit itself, minding the default namespace

val uPureUnit =
  doc.documentElement.getChildElem(e =>
    e.resolvedName == EName(xbrliNs, "unit") && (e \@ EName("id")) == Some("U-Pure"))

println(uPureUnit.getChildElem(withEName(xbrliNs, "measure")).text) // prints "pure"

// Now we get the measure element text, as QName, resolving it to an EName (expanded name)
println(uPureUnit.getChildElem(withEName(xbrliNs, "measure")).textAsResolvedQName)
// prints EName(xbrliNs, "pure")

// Knowing the units are the same, the gaap:AverageNumberEmployees facts are uniquely identified by contexts.

val avgNumEmployeesFactsByContext: Map[String, simple.Elem] =
  avgNumEmployeesFacts.groupBy(_.attribute(EName("contextRef"))).mapValues(_.head)

println(avgNumEmployeesFactsByContext.keySet)
// prints Set("D-2003", "D-2004", "D-2005", "D-2007-BS1", "D-2007-BS2", "D-2006", "D-2007")

println(avgNumEmployeesFactsByContext("D-2003").text) // prints 220

The uniform query API of yaidom consists of several query API traits. They are like LEGO blocks, that can easily be combined. Yaidom (native and wrapper) element tree implementations all mix in some or most of these query API traits. The example queries above are not bound to any particular element implementation, but use a common query API trait, namely ScopedElemApi, which is itself a combination of query API traits. This trait offers methods like filterElemsOrSelf, filterChildElems (from trait ElemApi), as well as methods to get text content, qualified names, expanded names, attributes etc. In other words, it offers a query API abstraction that is valid for almost all element implementations.

The query API traits themselves are not visible in normal yaidom client code. They are relevant for creators of custom yaidom element implementations, for example wrappers around elements offered by existing XML libraries. Yaidom users that do not extend yaidom may still want to know which query API traits are offered by some XML tree implementation, of course.

Sometimes we want to use methods that are only offered by specific element implementations, and not by any query API traits. The default native yaidom element implementation is simple.Elem. It knows about elements and text content (as per the mixed-in ScopedElemApi trait), but it also knows about comments, processing instructions and CDATA sections (if passed by the XML parser). For example:

println(doc.comments.map(_.text.trim).mkString)
// prints "Created by Charles Hoffman, CPA, 2008-03-27"

val contexts = doc.documentElement.filterChildElems(withEName(xbrliNs, "context"))
println(contexts forall (e => !e.commentChildren.isEmpty)) // prints true: all contexts have comments

// Being lazy, and forgetting about the namespace here
val facts =
  doc.documentElement.filterChildElems(withLocalName("ManagementDiscussionAndAnalysisTextBlock"))
println(facts.flatMap(e => e.textChildren.filter(_.isCData)).size >= 1) // prints true

Namespace examples

Yaidom has very precise namespace support. Like the article Understanding Namespaces, yaidom distinguishes qualified names from expanded names, and namespace declarations from in-scope namespaces. Their yaidom counterparts are immutable classes QName, EName, Declarations and Scope. Having these 4 distinct concepts, their relationships can be expressed very precisely, even in yaidom code, and even outside of the context of any particular XML tree!
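
The relationship between namespace declarations and in-scope namespaces can be sketched with plain Scala Maps. The type aliases ToyScope and ToyDeclarations and the function resolve are toy stand-ins invented for this illustration, not the yaidom classes; the sketch assumes that the empty string denotes the default namespace and that an empty URI in a declaration is an undeclaration.

```scala
object ToyNamespacesSketch {
  // Toy stand-ins for the concepts; these are not the yaidom classes.
  // The empty string is used as the "prefix" of the default namespace,
  // and an empty namespace URI in a declaration means an undeclaration.
  type ToyScope = Map[String, String]
  type ToyDeclarations = Map[String, String]

  // In-scope namespaces of an element: the parent scope, updated with the element's own declarations
  def resolve(parentScope: ToyScope, declarations: ToyDeclarations): ToyScope = {
    val (undeclarations, realDeclarations) =
      declarations.partition { case (_, uri) => uri.isEmpty }
    (parentScope ++ realDeclarations) -- undeclarations.keySet
  }

  val rootScope: ToyScope =
    resolve(
      Map.empty,
      Map("" -> "http://www.xbrl.org/2003/instance", "gaap" -> "http://xasb.org/gaap"))

  // A child element that undeclares the default namespace and declares one more prefix
  val childScope: ToyScope =
    resolve(rootScope, Map("" -> "", "xlink" -> "http://www.w3.org/1999/xlink"))
}
```

In this sketch the child scope holds the prefixes gaap and xlink, and no default namespace.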

In the example XBRL instance above, all namespace declarations are in the root element, and therefore all descendant-or-self elements have the same in-scope namespaces. In code:

val rootScope = doc.documentElement.scope

val sameScopeEverywhere =
  doc.documentElement.findAllElemsOrSelf.forall(e => e.scope == rootScope)

println(sameScopeEverywhere) // prints true

Let's consider the first FRIS rule taken from the international FRIS standard, expressed in yaidom. Rule 2.1.5 states that some commonly used namespaces should use their "preferred" namespace prefixes in XBRL instances. The rule can be expressed in yaidom as follows:

val standardScope = Scope.from(
  "xbrli" -> "http://www.xbrl.org/2003/instance",
  "xlink" -> "http://www.w3.org/1999/xlink",
  "link" -> "http://www.xbrl.org/2003/linkbase",
  "xsi" -> "http://www.w3.org/2001/XMLSchema-instance",
  "iso4217" -> "http://www.xbrl.org/2003/iso4217")

val standardPrefixes = standardScope.keySet
val standardNamespaceUris = standardScope.inverse.keySet

// Naive implementation: expects only namespace declarations in root element

def usesExpectedNamespacePrefixes(xbrlInstance: simple.Elem): Boolean = {
  val rootScope = xbrlInstance.scope
  require(xbrlInstance.findAllElemsOrSelf.forall(e => e.scope == rootScope))

  val subscope = xbrlInstance.scope.withoutDefaultNamespace filter {
    case (pref, ns) =>
      standardPrefixes.contains(pref) || standardNamespaceUris.contains(ns)
  }
  subscope.subScopeOf(standardScope)
}

Above, there is no useful error reporting, but that is easy to add, because the implementation is entirely in the rich Scala programming language. In prose, method usesExpectedNamespacePrefixes checks that if any of the 5 namespace prefixes above are used, they map to the expected namespace URIs. The method also checks the other direction: if any of these namespace URIs is in scope, then the corresponding namespace prefix is the expected one, with the exception that the URI may be bound as the default namespace.
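
The logic of this check, in both directions, can also be sketched with plain Scala Maps instead of yaidom Scope objects. The names PreferredPrefixesSketch and usesExpectedPrefixes are made up for this illustration, only two of the five standard mappings are used, and the default namespace is assumed to be left out of the Map already.

```scala
object PreferredPrefixesSketch {
  // Two of the preferred prefix-to-namespace mappings listed above
  val standard: Map[String, String] = Map(
    "xbrli" -> "http://www.xbrl.org/2003/instance",
    "xlink" -> "http://www.w3.org/1999/xlink")

  val standardUris: Set[String] = standard.values.toSet

  // Plain-Map version of the check; the default namespace is assumed to be absent from `scope`
  def usesExpectedPrefixes(scope: Map[String, String]): Boolean = {
    // Keep only the mappings that mention a standard prefix or a standard namespace URI
    val relevant = scope.filter { case (prefix, uri) =>
      standard.contains(prefix) || standardUris.contains(uri)
    }
    // Each remaining mapping must occur, unchanged, among the standard mappings
    relevant.forall { case (prefix, uri) => standard.get(prefix).contains(uri) }
  }

  val ok: Boolean = usesExpectedPrefixes(
    Map("xlink" -> "http://www.w3.org/1999/xlink", "gaap" -> "http://xasb.org/gaap")) // true

  val wrongPrefix: Boolean = usesExpectedPrefixes(
    Map("xl" -> "http://www.w3.org/1999/xlink")) // false

  val wrongUri: Boolean = usesExpectedPrefixes(
    Map("xbrli" -> "http://www.example.com/other")) // false
}
```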

The usesExpectedNamespacePrefixes example illustrates yaidom's precise support for namespaces in the uniform query API, which is therefore offered by diverse element tree implementations. Yet the namespace support goes further than that. As the article Understanding Namespaces makes clear, namespaces are not only used in element and attribute names, but can also be used in text content and attribute values.

FRIS rule 2.1.7 must take namespaces in text content and attribute values into account, because it states that XBRL instances should not have any unused namespace declarations. Yet how do we detect the use of namespaces in text content or attribute values? We know this from the XML schema(s) describing XBRL instances. For example, the xbrli:measure element has type xs:QName. So the text content of an xbrli:measure should be interpreted as an "expanded name". The namespace of that expanded name is therefore one of the namespaces used in the XBRL instance.
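
What it means to interpret text content as an expanded name can be sketched without yaidom. ToyEName and resolveTextAsQName below are hypothetical names for this illustration; the sketch assumes, in line with the textAsResolvedQName output shown earlier, that an unprefixed name takes the default namespace, and it does not validate the lexical form of the name.

```scala
object TextQNameSketch {
  // Toy expanded name: optional namespace URI plus local part (not the yaidom EName class)
  case class ToyEName(namespaceUriOption: Option[String], localPart: String)

  // Resolve QName-valued text against the in-scope namespaces.
  // The empty string stands for the default namespace "prefix".
  // Unknown prefixes and names with more than one colon yield None.
  def resolveTextAsQName(text: String, scope: Map[String, String]): Option[ToyEName] =
    text.trim.split(':') match {
      case Array(localPart) => Some(ToyEName(scope.get(""), localPart))
      case Array(prefix, localPart) => scope.get(prefix).map(ns => ToyEName(Some(ns), localPart))
      case _ => None
    }

  val xbrliNs = "http://www.xbrl.org/2003/instance"
  val iso4217Ns = "http://www.xbrl.org/2003/iso4217"

  val scope: Map[String, String] = Map("" -> xbrliNs, "iso4217" -> iso4217Ns)

  val pure: Option[ToyEName] = resolveTextAsQName("pure", scope)
  // Some(ToyEName(Some(xbrliNs), "pure"))

  val usd: Option[ToyEName] = resolveTextAsQName("iso4217:USD", scope)
  // Some(ToyEName(Some(iso4217Ns), "USD"))

  val unresolved: Option[ToyEName] = resolveTextAsQName("unknown:USD", scope) // None
}
```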

Yaidom makes it possible to code a DocumentENameExtractor strategy, holding information about ENames and therefore namespaces occurring in text content or attribute values. So, looking at the XML schema(s), we can easily code such a strategy ourselves (yaidom itself has no XML Schema awareness). Then, using method NamespaceUtils.findAllNamespaces, all namespaces used in the XBRL instance can be found.

Method NamespaceUtils.findAllNamespaces does not work on the default "simple" elements, however, because "simple" elements do not know their ancestry. To that end, yaidom offers so-called "indexed" elements, which do know their ancestry. Like "simple" elements, "indexed" elements are immutable, because they are just wrappers around a root "simple" element along with an "index" into that element tree. The "indexed" and "simple" elements also share most of the query API, in particular the ScopedElemApi query API trait.
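
The idea of an immutable element that knows its ancestry can be sketched as a root element plus a path of child indexes. ToyElem and ToyIndexedElem are toy types for this illustration only; how yaidom represents the "index" internally is not shown here and may well differ.

```scala
object ToyIndexedSketch {
  // Toy element type, not a yaidom class
  case class ToyElem(name: String, children: Vector[ToyElem] = Vector.empty)

  // Toy "indexed" element: the root plus the child indexes leading from the root to this element
  case class ToyIndexedElem(root: ToyElem, path: Vector[Int]) {

    // The underlying element, found by following the path from the root
    def elem: ToyElem = path.foldLeft(root)((e, idx) => e.children(idx))

    // The parent, if any, is the same root with a path that is one step shorter
    def parentOption: Option[ToyIndexedElem] =
      if (path.isEmpty) None else Some(ToyIndexedElem(root, path.init))

    // Names of all ancestors, from the parent up to the root
    def ancestorNames: List[String] =
      parentOption.toList.flatMap(p => p.elem.name :: p.ancestorNames)
  }

  val root: ToyElem =
    ToyElem("xbrl", Vector(
      ToyElem("context"),
      ToyElem("unit", Vector(ToyElem("measure")))))

  // The measure element: second child of the root, then first child of unit
  val measure = ToyIndexedElem(root, Vector(1, 0))

  val measureName: String = measure.elem.name                // "measure"
  val measureAncestors: List[String] = measure.ancestorNames // List("unit", "xbrl")
}
```

Because both the root and the path are immutable, the wrapper is immutable too, and an ancestor query such as "is this measure inside a unit?" needs no parent pointers in the tree itself.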

Let's now implement FRIS rule 2.1.7, but only for the sample XBRL instance:

val xbrldiNs = "http://xbrl.org/2006/xbrldi"

val xbrliDocumentENameExtractor: DocumentENameExtractor = {
  // Not complete, but good enough for this example!
  // Note the backticks in the patterns below: without them, a lower-case name in a pattern
  // introduces a fresh variable that matches any namespace, instead of comparing with the val.

  new DocumentENameExtractor {

    def findElemTextENameExtractor(elem: indexed.Elem): Option[TextENameExtractor] = elem.resolvedName match {
      case EName(Some(`xbrliNs`), "measure") if elem.path.containsName(EName(xbrliNs, "unit")) =>
        Some(SimpleTextENameExtractor)
      case EName(Some(`xbrldiNs`), "explicitMember") =>
        Some(SimpleTextENameExtractor)
      case _ => None
    }

    def findAttributeValueENameExtractor(elem: indexed.Elem, attributeEName: EName): Option[TextENameExtractor] = elem.resolvedName match {
      case EName(Some(`xbrldiNs`), "explicitMember") if attributeEName == EName("dimension") =>
        Some(SimpleTextENameExtractor)
      case _ => None
    }
  }
}

val indexedDoc = indexed.Document(doc)

val namespaceUrisDeclared = indexedDoc.documentElement.scope.inverse.keySet

import NamespaceUtils._

// Check that the used namespaces are almost exactly those declared in the root element (approximately rule 2.1.7)

val companyNs = "http://www.example.com/company"

val usedNamespaces =
  findAllNamespaces(indexedDoc.documentElement, xbrliDocumentENameExtractor).diff(Set(xmlNs))

// The "company namespace" is an unused namespace in our sample XBRL instance
require(usedNamespaces == namespaceUrisDeclared.diff(Set(companyNs)))

Although yaidom itself has no XML Schema awareness, yaidom can still be useful in a context where schema-awareness is needed. For example, Saxon-EE NodeInfo objects can be wrapped as yaidom trees, thus getting the best of Scala Collections processing and Saxon-EE XML and XML Schema support.

Let's now remove the unused namespaces (the "company" namespace in this example), and compare the result with the original XBRL instance. Yet how do we compare two XML trees (as "simple" elements) for equality? In order to do so, note that namespace prefixes are irrelevant to equality comparisons, but namespace URIs do count. (Be careful with prefixes in text content and attribute values!) Yaidom offers an XML element implementation in which namespace prefixes do not occur. These elements are called "resolved" elements. They share much of the same query API with "simple" and "indexed" elements, but not all of it. After all, "resolved" elements do not know about namespace prefixes, so they do not know about qualified names. Therefore they do not mix in the ScopedElemApi trait, but they do mix in traits like ElemApi and HasTextApi, that is, all traits extended by ScopedElemApi that do not know about qualified names. Hence, "resolved" elements still have much of the yaidom query API in common with "simple" and "indexed" elements.
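
Why prefix-free elements make equality comparisons meaningful can be sketched with two toy types. ToyElem, ToyResolvedElem and resolve below are invented for this illustration and are much simpler than the yaidom "simple" and "resolved" elements: there are no attributes or text nodes, and the empty string is assumed to denote the default namespace.

```scala
object ToyResolvedSketch {
  // Toy element with a lexical (possibly prefixed) name and its in-scope namespaces; not a yaidom class
  case class ToyElem(qname: String, scope: Map[String, String], children: List[ToyElem] = Nil)

  // Toy prefix-free element: namespace URI (if any), local name and resolved children
  case class ToyResolvedElem(nsOption: Option[String], localName: String, children: List[ToyResolvedElem])

  // Replace each lexical name by its namespace URI and local name, dropping prefixes.
  // The empty string is the "prefix" of the default namespace; an unknown prefix throws an exception.
  def resolve(elem: ToyElem): ToyResolvedElem = {
    val (nsOption, localName) = elem.qname.split(':') match {
      case Array(prefix, local) => (Some(elem.scope(prefix)), local)
      case _ => (elem.scope.get(""), elem.qname)
    }
    ToyResolvedElem(nsOption, localName, elem.children.map(resolve))
  }

  // Same namespace URI, different prefixes; the second scope also holds an unused namespace
  val elem1 = ToyElem("gaap:AverageNumberEmployees", Map("gaap" -> "http://xasb.org/gaap"))
  val elem2 = ToyElem(
    "g:AverageNumberEmployees",
    Map("g" -> "http://xasb.org/gaap", "company" -> "http://www.example.com/company"))

  val equalAsToyElems: Boolean = elem1 == elem2                   // false
  val equalAsResolved: Boolean = resolve(elem1) == resolve(elem2) // true
}
```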

The following code strips unused namespaces, and shows that the result is the same, when comparing the trees as "resolved" elements.

val editedRootElem =
  stripUnusedNamespaces(indexedDoc.documentElement, xbrliDocumentENameExtractor)

val areEqual =
  resolved.Elem(indexedDoc.document.documentElement) == resolved.Elem(editedRootElem)

println(areEqual) // prints true

Extending yaidom for custom XML dialects

Above, all XBRL instance processing was coded as normal XML processing, mostly using yaidom "simple" and "indexed" elements. That is not very convenient. It would be nice if we could talk about contexts, units, facts etc., instead of just XML elements that happen to be contexts, units, facts, etc. In general, it would be nice if yaidom made it easy to support custom XML dialects. That is indeed the case. We already knew that yaidom is extensible, in that new element implementations offering the same yaidom query API can easily be added. Yet, what's more, yaidom also facilitates a "yaidom querying experience" for custom XML dialects, such as XBRL instances (or DocBook files, or Maven POM files, or any other XML dialect described by schemas).

To that end, yaidom offers the SubtypeAwareElemApi query API trait. Whereas the ElemApi trait offers querying for child/descendant/descendant-or-self elements, trait SubtypeAwareElemApi extends this to class hierarchies (for XML dialects), offering querying for child/descendant/descendant-or-self elements of specific sub-types of the root class of the class hierarchy.
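
The idea of querying for elements of a specific sub-type can be sketched with a toy class hierarchy and a ClassTag. The types ToyXbrliElem, ToyContext, ToyUnit and ToyInstance are made up for this illustration, and the signature of findAllChildElemsOfType below is an assumption of this sketch rather than the signature declared by SubtypeAwareElemApi.

```scala
import scala.reflect.ClassTag

object ToySubtypeQuerySketch {
  // Toy class hierarchy for an XML dialect; these are not the yaidom or XbrliElem classes
  sealed trait ToyXbrliElem {
    def children: List[ToyXbrliElem]

    // Child elements of the given sub-type; the ClassTag performs the runtime type test
    def findAllChildElemsOfType[A <: ToyXbrliElem](implicit tag: ClassTag[A]): List[A] =
      children.flatMap(e => tag.unapply(e))
  }

  case class ToyContext(id: String) extends ToyXbrliElem {
    val children: List[ToyXbrliElem] = Nil
  }

  case class ToyUnit(id: String) extends ToyXbrliElem {
    val children: List[ToyXbrliElem] = Nil
  }

  case class ToyInstance(children: List[ToyXbrliElem]) extends ToyXbrliElem

  val instance: ToyInstance =
    ToyInstance(List(ToyContext("D-2003"), ToyUnit("U-Pure"), ToyContext("D-2004")))

  // The results are statically typed, so no casts are needed in client code
  val contextIds: List[String] = instance.findAllChildElemsOfType[ToyContext].map(_.id)
  // List("D-2003", "D-2004")

  val unitIds: List[String] = instance.findAllChildElemsOfType[ToyUnit].map(_.id)
  // List("U-Pure")
}
```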

In this XBRL instance class hierarchy we can see this in action. Each part of an XBRL instance is of type XbrliElem or a sub-type. Common sub-types are those for contexts, units, item facts, tuple facts, and, of course, XBRL instances themselves. Super-type XbrliElem mixes in traits ScopedElemApi and SubtypeAwareElemApi. Trait ScopedElemApi offers the most common yaidom element query API, as we know, and trait SubtypeAwareElemApi makes it easy to query for elements of specific types, with little boilerplate. The latter is used internally in the code of the XbrliElem class hierarchy, but can also be used in client code, if need be.

For the remaining FRIS validations in this article, we will use the XbrliElem class hierarchy.

Consider FRIS rule 2.1.10. It states that there is a specific expected order of the child elements of the root element. One way to code that is as follows:

// Assume xbrlInstance variable of type XbrlInstance

val remainingChildElems =
  xbrlInstance.findAllChildElems dropWhile {
    case e: SchemaRef => true
    case e => false
  } dropWhile {
    case e: LinkbaseRef => true
    case e => false
  } dropWhile {
    case e: RoleRef => true
    case e => false
  } dropWhile {
    case e: ArcroleRef => true
    case e => false
  } dropWhile {
    case e: XbrliContext => true
    case e => false
  } dropWhile {
    case e: XbrliUnit => true
    case e => false
  } dropWhile {
    case e: Fact => true
    case e => false
  } dropWhile {
    case e: FootnoteLink => true
    case e => false
  }

require(remainingChildElems.isEmpty)
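
The same ordering requirement can be expressed in another way: assign each kind of child element a rank and check that the ranks never decrease. The sketch below uses plain strings as stand-ins for the element classes; the object name ChildOrderSketch and the function isInExpectedOrder are hypothetical and not part of the XbrliElem class hierarchy.

```scala
object ChildOrderSketch {
  // Expected order of the kinds of root child elements; plain strings stand in for the element classes
  val expectedOrder: List[String] =
    List("schemaRef", "linkbaseRef", "roleRef", "arcroleRef", "context", "unit", "fact", "footnoteLink")

  // The child kinds are in the expected order if all kinds are known and their ranks never decrease
  def isInExpectedOrder(childKinds: List[String]): Boolean = {
    val ranks = childKinds.map(kind => expectedOrder.indexOf(kind))
    !ranks.contains(-1) && ranks == ranks.sorted
  }

  val good: Boolean =
    isInExpectedOrder(List("schemaRef", "context", "context", "unit", "fact", "fact")) // true

  val bad: Boolean =
    isInExpectedOrder(List("schemaRef", "unit", "context", "fact")) // false
}
```

Like the chain of dropWhile calls, this accepts exactly those child sequences in which each kind of element forms one contiguous run, with the runs in the expected order.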

Now consider FRIS rule 2.4.2 stating that all contexts must be used. It is also checked that all context references indeed refer to existing contexts. Note in this case how friendly the XBRL instance model is compared to raw XML elements:

val contextIds = xbrlInstance.allContextsById.keySet

val usedContextIds = xbrlInstance.findAllItems.map(_.contextRef).toSet

require(usedContextIds.subsetOf(contextIds))

// Oops, some contexts are not used, namely I-2004, D-2007-LI-ALL and I-2003
println(contextIds.diff(usedContextIds))

The next rule is more complex. FRIS rule 2.4.1 states that S-equal contexts should not occur. S-equality ("structural equality") is defined in the Core XBRL specification. A good implementation of S-equality requires type information. Therefore Saxon-EE backed yaidom wrappers would be a good choice. A very naive approximation is given below:

def transformContextForSEqualityComparison(context: XbrliContext): resolved.Elem = {
  // Ignoring "normalization" of dates and QNames, as well as order of dimensions etc.
  val elem = context.indexedElem.elem.copy(attributes = Vector())
  resolved.Elem(elem).coalesceAndNormalizeAllText.removeAllInterElementWhitespace
}

Then rule 2.4.1 applied to our XBRL instance is as follows:

val contextsBySEqualityGroup =
  xbrlInstance.allContexts.groupBy(e => transformContextForSEqualityComparison(e))

require(contextsBySEqualityGroup.size == xbrlInstance.allContexts.size)
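
Instead of only failing, the same grouping can report which contexts are duplicates of each other. The sketch below uses a toy context type with made-up data; ToyContext, its fields and the third context "D-2003-copy" are assumptions for this illustration, and comparing entity and period strings is an even cruder approximation of S-equality than the one above.

```scala
object DuplicateGroupsSketch {
  // Toy context: an id plus the content that matters for the comparison (not an XbrliContext)
  case class ToyContext(id: String, entity: String, period: String)

  val contexts: List[ToyContext] = List(
    ToyContext("D-2003", "1234567890", "2003-01-01/2003-12-31"),
    ToyContext("D-2004", "1234567890", "2004-01-01/2004-12-31"),
    ToyContext("D-2003-copy", "1234567890", "2003-01-01/2003-12-31"))

  // Group by everything except the id, and keep only the groups with more than one member
  val duplicateIdGroups: List[List[String]] =
    contexts.groupBy(c => (c.entity, c.period)).values.toList
      .filter(_.size > 1)
      .map(_.map(_.id))
  // List(List("D-2003", "D-2003-copy"))
}
```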

As we can see, the more complex the rules, the more we profit from the fact that all code is Scala code, and that no effort is needed to bridge between Scala and XSLT, for example. The Scala language, its Collections API, and yaidom form a powerful combination.

Finally, consider FRIS rule 2.8.3, stating that concepts are either top-level or nested in tuples, but not both. Using the XBRL instance model, the code is simple:

val topLevelConceptNames = xbrlInstance.allTopLevelFactsByEName.keySet

val nestedConceptNames =
  xbrlInstance.allTopLevelTuples.flatMap(_.findAllFacts).map(_.resolvedName).toSet

require(topLevelConceptNames.intersect(nestedConceptNames).isEmpty)

Conclusion

In this article, the yaidom Scala XML query library was introduced. We used examples from XBRL. It turned out that Scala, its Collections API, and the yaidom library form a powerful, precise XML processing "stack". This "stack" is even more powerful when using mature custom yaidom "backends" such as Saxon-EE. It also turned out that yaidom makes it easy to support custom XML dialects (such as XBRL instances), offering more type-safety and leading to less boilerplate. The extensibility of yaidom (in more than 1 way) is one of its strengths, along with its precise namespace support and uniform precise element query API (offered by numerous XML "backends").

The FRIS rule examples show that a programming language like Scala is a natural fit for implementing those rules. Had we used XSLT or XQuery instead, how would we easily have found unused namespaces, for example? Moreover, how would we have supported custom XML dialects in the same way that yaidom facilitates such support?

The examples only used XBRL instances. These instances are described by XBRL taxonomies. Such taxonomies have to obey many rules as well. Taxonomies typically span many files, and their validation is usually much more complex than instance validation. The advantages of using a "Scala yaidom XML stack" would be even greater for taxonomies than for XBRL instances.

As a concluding remark, yaidom is used in production code developed at www.ebpi.nl. Its usage in several projects has certainly helped it mature. I want to thank my colleagues Jan-Paul van der Velden, Andrea Desole, Johan Walters and Nicholas Evans for their valuable feedback on earlier versions of yaidom.