Basic XML element querying in yaidom

Yaidom is a uniform XML query API, written in Scala. Moreover, yaidom provides several specific-purpose DOM-like tree implementations adhering to this XML query API.

In this article, the basics of element querying in yaidom are explained.


Introduction

This article is the first in a series of articles introducing the yaidom library.

It is assumed that the reader knows the basics of XML, in particular XML namespaces. Some experience with XML processing in Java (e.g. JAXP) is also helpful. Finally, the reader is expected to have some familiarity with the Scala programming language, and with the Scala Collections API in particular.

This article only treats basic element querying in yaidom. The remainder of this article is organized as follows:

  1. Introductory example.
  2. Why yet another Scala XML library?
  3. Introduction to yaidom element queries.
  4. Uniform element query API.
  5. Conclusion (of this first article about yaidom).

The queries use the ubiquitous bookstore as example XML. The specific sample XML is from the Coursera course Introduction to Databases, by Jennifer Widom (used with permission).

Introductory example

First we give an introductory example of querying in yaidom. Consider the following XML file: books.xml. Suppose we have parsed this document, and that the parsed document has been stored in variable doc. Then we can query for all book author last names as follows (without removing duplicates):

import HasENameApi._

val authorLastNames =
    for {
        bookElem <- doc.documentElement \ withLocalName("Book")
        authorElem <- bookElem \\ withLocalName("Author")
        lastNameElem <- authorElem \ withLocalName("Last_Name")
    } yield lastNameElem.text

Unlike querying with the standard Scala XML library, we cannot chain \ and \\ operators. It will become clear below why yaidom sacrifices a little bit of conciseness compared to XML libraries that offer a more XPath-like experience.

Still, we can write the query above in a less verbose manner, much like how the Scala compiler rewrites the for-comprehension. This is already more XPath-like, but not necessarily more readable:

val authorLastNames2 =
    doc.documentElement \ withLocalName("Book") flatMap
        (_ \\ withLocalName("Author")) flatMap
        (_ \ withLocalName("Last_Name")) map (_.text)
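
The correspondence between a for-comprehension and chained flatMap/map calls does not depend on yaidom. As a minimal sketch using only plain Scala collections (the case classes Book and Author below are hypothetical stand-ins, not yaidom types), a for-comprehension with two generators yields the same result as the explicit flatMap/map form:

```scala
object ForComprehensionSketch {
  // Hypothetical stand-ins for parsed XML data; these are not yaidom types
  final case class Author(lastName: String)
  final case class Book(authors: List[Author])

  val books: List[Book] = List(
    Book(List(Author("Ullman"), Author("Widom"))),
    Book(List(Author("Garcia-Molina"), Author("Ullman"), Author("Widom"))))

  // For-comprehension with two generators
  val viaFor: List[String] =
    for {
      book <- books
      author <- book.authors
    } yield author.lastName

  // Roughly what the compiler rewrites it to:
  // flatMap for the outer generator, map for the last one
  val viaFlatMap: List[String] =
    books.flatMap(book => book.authors.map(author => author.lastName))
}
```

Both values contain the last names in document order, duplicates included.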

Why yet another Scala XML library?

XML processing in Java has never been easy. Direct use of SAX, DOM or StAX is low-level and cumbersome. Use of XPath may be low-level in the handling of returned DOM nodes or node lists, and performance issues quickly arise. Several other (DOM-like) libraries may improve on JAXP DOM, but still do not make XML processing significantly easier. The use of O-X mapping (such as JAXB) just for XML manipulation seems the wrong tool for the job, since that would turn an XML manipulation problem into a bigger O-X mapping problem between 2 different perspectives ("Java object" and "XML") that must be kept in sync. (I'm not against O-X mapping, but I would like to stay away from O-X mapping if there is inherently no O-X mapping problem in the first place.)

Direct manipulation of XML should not be that hard. In Java (at least before version 8) not only XML processing, but data processing in general is hard (writing low-level loops instead of functionally transforming collections). Scala and its expressive Collections API have a far more appealing data processing story, as most Scala programmers would agree. Shouldn't XML processing in Scala then be easier than in Java, as a consequence?

There are 3 Scala XML libraries that currently often get mentioned. There is Scales XML, which is not a DOM-like API. Then there is Anti-XML, which seems abandoned, and which aimed to be an improvement on the third library, viz. the standard Scala XML library. Unfortunately, the standard Scala XML library has some (rather annoying) issues that I find hard to accept. For example:

I do realize that the standard XML library tried to offer an XPath-like experience. On the other hand, this implies blurring the distinction between individual nodes and node collections. Arguably, this also means less precision and clarity (e.g., what is the type of an expression?), in the name of conciseness and XPath-like query support.

A well-known (but somewhat old) critique of the Scala XML library (by the creator of Anti-XML) can be found here.

It seems hard to find a Scala XML library having first-class namespace support. The best explanation of XML namespaces that I have come across is Understanding XML Namespaces. It distinguishes between qualified names (such as my:foo) and expanded names (such as {http://my}foo, in James Clark notation). Admittedly, qualified names occur in XML, and expanded names do not. Still, both concepts are important, and many XML libraries blur this fundamental distinction. Ignoring this distinction makes it hard to talk precisely about namespaces. In my view that leads to subpar namespace support. The quest for first-class namespace support, along with precise (Scala-esque) element querying, were the most important reasons to come up with yaidom.
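
To make the distinction between qualified and expanded names concrete, here is a minimal sketch. The types QualifiedName and ExpandedName and the function resolve are hypothetical, written for this illustration only; they are not yaidom's own classes. The sketch assumes that in-scope namespaces are given as a map from prefix to namespace URI, with the empty string as key for the default namespace, and it only covers element names.

```scala
object NameSketch {
  // Hypothetical types for illustration; not yaidom's API
  final case class QualifiedName(prefix: Option[String], localPart: String)

  final case class ExpandedName(namespaceUri: Option[String], localPart: String) {
    // James Clark notation, such as {http://my}foo
    override def toString: String =
      namespaceUri.map(ns => s"{$ns}$localPart").getOrElse(localPart)
  }

  // Resolves an element name against the in-scope namespaces.
  // The empty-string key holds the default namespace (an assumption of this sketch).
  def resolve(qname: QualifiedName, scope: Map[String, String]): Option[ExpandedName] =
    qname.prefix match {
      case Some(prefix) =>
        // An undeclared prefix cannot be resolved
        scope.get(prefix).map(ns => ExpandedName(Some(ns), qname.localPart))
      case None =>
        Some(ExpandedName(scope.get(""), qname.localPart))
    }
}
```

With this sketch, my:foo under the binding my -> http://my and an unprefixed foo under the default namespace http://my resolve to the same expanded name, which is exactly why the two concepts should not be conflated.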

Yaidom has been influenced by all of the 3 Scala XML libraries mentioned above, be it in very different ways. It has its own underlying design choices, however. In particular:

This article does not go deeply into the namespace support offered by yaidom. The next article on yaidom (and namespaces) will make up for this shortcoming.

Introduction to yaidom element queries

All code examples below can be tried out in the Scala REPL (if the parent directory path of the example files is replaced). Scala 2.11.X is required, and yaidom 1.3.2 (or later) must be on the classpath.

Before showing some basic yaidom element queries, assume the following code to have executed:

import java.io.File
import javax.xml.parsers._
import scala.collection.immutable
import eu.cdevreeze.yaidom._

// Using a yaidom DocumentParser that uses DOM internally
val docParser = parse.DocumentParserUsingDom.newInstance

// Replace the following path!
val parentDir = new File("/path/to/parentdir")

val doc: simple.Document =
    docParser.parse(new File(parentDir, "books.xml"))

val docElem = doc.documentElement

Here we instantiated a yaidom parser (using DOM internally), and parsed the XML file. The document element is stored in variable docElem, as a "default" (immutable) yaidom element.

Let's get started with some basic yaidom element queries. The books.xml XML has some magazines and books. We can query for the magazines (as yaidom elements) as follows:

val magazineElems =
    docElem.filterChildElems(e => e.localName == "Magazine")

In this query, we queried only for (some) child elements of the document element, knowing that all magazines are child elements of the document element. We also used no namespace in the query, knowing that in this case no ambiguity can arise. The query result consists of the following 4 XML elements:

<Magazine Month="January" Year="2009" xmlns="http://bookstore">
    <Title>National Geographic</Title>
</Magazine>

<Magazine Month="Februari" Year="2009" xmlns="http://bookstore">
    <Title>National Geographic</Title>
</Magazine>

<Magazine Month="Februari" Year="2009" xmlns="http://bookstore">
    <Title>Newsweek</Title>
</Magazine>

<Magazine Month="March" Year="2009" xmlns="http://bookstore">
    <Title>Hector and Jeff's Database Hints</Title>
</Magazine>

Analogously, we can query for the 4 books (as yaidom elements) as follows:

val bookElems1 =
    docElem.filterChildElems(e => e.localName == "Book")

require(bookElems1.size == 4, "Expected 4 books")

Above, we queried for child elements with local name Book, using method filterChildElems. We could have queried for all descendant elements with local name Book instead, using method filterElems. Not surprisingly, the same elements would be returned:

val bookElems2 =
    docElem.filterElems(e => e.localName == "Book")

require(
    bookElems2 == bookElems1,
    "Expected the same books as in bookElems1")

We could also query for all descendant-or-self elements with local name Book, using method filterElemsOrSelf. Again, the result would be the same:

val bookElems3 =
    docElem.filterElemsOrSelf(e => e.localName == "Book")

require(
    bookElems3 == bookElems1,
    "Expected the same books as in bookElems1")

If we import the HasENameApi companion object members, then we can write the element predicate more briefly using (element predicate factory) method withLocalName, like this:

import queryapi.HasENameApi._

val bookElems4 =
    docElem filterElemsOrSelf withLocalName("Book")

Now we know how to query for child elements, descendant elements or descendant-or-self elements, given an element predicate. There are also shorthand operators: instead of filterChildElems we can write \, and instead of filterElemsOrSelf we can write \\. This would give us:

val bookElems5 = docElem \ withLocalName("Book")

and:

val bookElems6 = docElem \\ withLocalName("Book")

Most of yaidom's query API is easy to guess knowing the methods presented above. For example, method findAllElems returns all descendant elements (excluding self), and method findAllElemsOrSelf returns all descendant-or-self elements.
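
The difference between the child, descendant and descendant-or-self axes can be illustrated without yaidom on the classpath. The following is a toy model, not yaidom's implementation: ToyElem and its withLocalName helper are hypothetical, and only the method names mirror the ones presented above.

```scala
object ToyQuerySketch {
  // Toy element: a local name plus child elements; NOT yaidom's implementation
  final case class ToyElem(localName: String, children: Vector[ToyElem] = Vector.empty) {
    def findAllChildElems: Vector[ToyElem] = children

    // Descendant-or-self elements in document order (self first)
    def findAllElemsOrSelf: Vector[ToyElem] =
      this +: children.flatMap(_.findAllElemsOrSelf)

    // Descendant elements, excluding self
    def findAllElems: Vector[ToyElem] =
      children.flatMap(_.findAllElemsOrSelf)

    def filterChildElems(p: ToyElem => Boolean): Vector[ToyElem] = children.filter(p)
    def filterElems(p: ToyElem => Boolean): Vector[ToyElem] = findAllElems.filter(p)
    def filterElemsOrSelf(p: ToyElem => Boolean): Vector[ToyElem] = findAllElemsOrSelf.filter(p)

    // Shorthand operators for the child and descendant-or-self axes
    def \(p: ToyElem => Boolean): Vector[ToyElem] = filterChildElems(p)
    def \\(p: ToyElem => Boolean): Vector[ToyElem] = filterElemsOrSelf(p)
  }

  // Element predicate factory, analogous in spirit to the one used above
  def withLocalName(name: String): ToyElem => Boolean = _.localName == name

  val bookstore: ToyElem =
    ToyElem("Bookstore", Vector(
      ToyElem("Book", Vector(
        ToyElem("Title"),
        ToyElem("Authors", Vector(ToyElem("Author"), ToyElem("Author"))))),
      ToyElem("Magazine", Vector(ToyElem("Title")))))
}
```

In this toy tree, bookstore \ withLocalName("Title") is empty because titles are grandchildren of the root, whereas bookstore \\ withLocalName("Title") finds both titles. Likewise, filtering on the root's own name only succeeds with filterElemsOrSelf, since filterElems excludes the element itself.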

Uniform element query API

ElemApi trait

Above, we introduced basic element querying in yaidom, using the "standard" element implementation. Yaidom does not conform to the view that there is a one-size-fits-all element implementation, as we will see. Fortunately, this does not mean that there are as many yaidom element query APIs as there are element implementations. On the contrary, most element implementations mix in the ElemApi trait, which is the most important part of the query API.

The yaidom element query API plays very well with the Scala Collections API. Typical non-trivial queries are written as for-expressions, combining Scala collections with yaidom query API methods. In a sense, the yaidom element query API is a uniform element query API, and the Scala Collections API plays the role of universal (collection) query API. Put differently, yaidom query methods turn single elements into collections of elements, and the Collections API turns collections (of elements) into other collections.

Below we show yaidom as uniform query API. Again start with the same document used above. This time using namespaces throughout the query, the book authors (separating first and last name by a space) can be found uniformly across element implementations as follows:

val ns = "http://bookstore"

import queryapi._

def findAllBookAuthors[E <: ElemApi[E] with HasText](docElem: E): immutable.IndexedSeq[String] = {
    import HasENameApi._
    val result =
        for {
            bookElem <- docElem \ withEName(ns, "Book")
            authorElem <- bookElem \\ withEName(ns, "Author")
            firstNameElem <- authorElem \ withEName(ns, "First_Name")
            lastNameElem <- authorElem \ withEName(ns, "Last_Name")
        } yield s"${firstNameElem.text} ${lastNameElem.text}"
    result.distinct.sorted
}

Don't worry: typical yaidom client code does not abstract over multiple element implementations, and therefore does not look this intimidating in its use of generics and F-bounded polymorphism. Generics are used here to show that the exact same query API works for multiple element implementations. In the function signature above, E is the type of the element implementation itself. We also imported all members of the HasENameApi companion object, to have the withEName element predicate builder in scope.
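
For readers unfamiliar with F-bounded polymorphism, the idea can be sketched with a much smaller trait. Everything below is hypothetical and heavily simplified (ToyElemApi, TreeElem, DomElem and findAuthorLastNames are names made up for this illustration, not yaidom code). It shows one generic query function working on two unrelated element implementations: an immutable case class tree and a thin wrapper around a JAXP DOM element.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element
import org.xml.sax.InputSource

object UniformApiSketch {
  // Toy query trait using F-bounded polymorphism: E is the implementing element type itself.
  // A simplified, hypothetical stand-in for a trait like ElemApi.
  trait ToyElemApi[E <: ToyElemApi[E]] { self: E =>
    def localName: String
    def text: String
    def findAllChildElems: Vector[E]

    def filterChildElems(p: E => Boolean): Vector[E] = findAllChildElems.filter(p)
    def findAllElemsOrSelf: Vector[E] = self +: findAllChildElems.flatMap(_.findAllElemsOrSelf)
    def filterElemsOrSelf(p: E => Boolean): Vector[E] = findAllElemsOrSelf.filter(p)
  }

  // Implementation 1: immutable tree
  final case class TreeElem(localName: String, text: String, children: Vector[TreeElem])
      extends ToyElemApi[TreeElem] {
    def findAllChildElems: Vector[TreeElem] = children
  }

  // Implementation 2: wrapper around a (mutable) DOM element
  final class DomElem(val wrapped: Element) extends ToyElemApi[DomElem] {
    def localName: String = wrapped.getLocalName
    def text: String = wrapped.getTextContent
    def findAllChildElems: Vector[DomElem] = {
      val nodes = wrapped.getChildNodes
      (0 until nodes.getLength).map(i => nodes.item(i))
        .collect { case e: Element => new DomElem(e) }.toVector
    }
  }

  // One query, written once, usable for every implementation of the trait
  def findAuthorLastNames[E <: ToyElemApi[E]](root: E): Vector[String] =
    for {
      author <- root.filterElemsOrSelf(_.localName == "Author")
      lastName <- author.filterChildElems(_.localName == "Last_Name")
    } yield lastName.text

  val tree: TreeElem =
    TreeElem("Book", "", Vector(
      TreeElem("Author", "", Vector(TreeElem("Last_Name", "Widom", Vector.empty)))))

  def parseDom(xml: String): DomElem = {
    val dbf = DocumentBuilderFactory.newInstance()
    dbf.setNamespaceAware(true) // needed for getLocalName to return a non-null value
    val doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)))
    new DomElem(doc.getDocumentElement)
  }

  val dom: DomElem = parseDom("<Book><Author><Last_Name>Widom</Last_Name></Author></Book>")
}
```

Because query results have the concrete type E rather than the trait type, calls can be chained without casts, and findAuthorLastNames(tree) and findAuthorLastNames(dom) both type-check against the same function.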

"Default" yaidom elements

Given the document parsed above, we can query the book authors as follows:

val bookAuthors1 =
    findAllBookAuthors(doc.documentElement)

The query result consists of the following Strings:

Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom

Namespace prefixes are insignificant in XML. Consider file books2.xml, which is equivalent to books.xml, except that the default namespace has been replaced by prefix books:

val doc2: simple.Document =
    docParser.parse(new File(parentDir, "books2.xml"))

How can we easily assert that these documents are equivalent XML documents? To that end, yaidom offers so-called "resolved" elements, which contain only expanded (element and attribute) names, and no qualified names. Indeed:

val rootElem = doc.documentElement
val rootElem2 = doc2.documentElement

// Method removeAllInterElementWhitespace makes the equality comparison
// more robust, because it removes whitespace used for formatting

require(resolved.Elem(rootElem).removeAllInterElementWhitespace ==
    resolved.Elem(rootElem2).removeAllInterElementWhitespace)
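
The idea behind such a comparison can be sketched with JAXP alone. The following is not how yaidom implements "resolved" elements; ResolvedSketch and its ResolvedElem case class are hypothetical, ignore attributes, and only keep trimmed text directly inside each element. The two XML strings are made-up miniature documents that differ only in their use of a default namespace versus a prefix.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.{Element, Node}
import org.xml.sax.InputSource

object ResolvedSketch {
  // Hypothetical "resolved" view: expanded names only, no prefixes (attributes omitted)
  final case class ResolvedElem(
    namespaceUri: Option[String],
    localName: String,
    children: Vector[ResolvedElem],
    text: String)

  def parse(xml: String): Element = {
    val dbf = DocumentBuilderFactory.newInstance()
    dbf.setNamespaceAware(true) // required for namespace URIs and local names
    dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml))).getDocumentElement
  }

  def resolve(elem: Element): ResolvedElem = {
    val nodes = elem.getChildNodes
    val childNodes: Vector[Node] = (0 until nodes.getLength).map(i => nodes.item(i)).toVector
    val childElems = childNodes.collect { case e: Element => resolve(e) }
    // Trimming drops whitespace that is only used for formatting
    val text = childNodes.filter(_.getNodeType == Node.TEXT_NODE).map(_.getNodeValue).mkString.trim
    ResolvedElem(Option(elem.getNamespaceURI), elem.getLocalName, childElems, text)
  }

  val withDefaultNamespace: String =
    """<Bookstore xmlns="http://bookstore"><Book><Title>Some Title</Title></Book></Bookstore>"""

  val withPrefix: String =
    """<books:Bookstore xmlns:books="http://bookstore">
      |  <books:Book><books:Title>Some Title</books:Title></books:Book>
      |</books:Bookstore>""".stripMargin

  val withOtherNamespace: String =
    """<Bookstore xmlns="http://other"><Book><Title>Some Title</Title></Book></Bookstore>"""
}
```

Comparing resolve(parse(withDefaultNamespace)) with resolve(parse(withPrefix)) yields equality, because both describe the same expanded names, while the document in a different namespace compares as unequal.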

"Resolved" elements

To show that the yaidom query API is uniform, let's now perform the same query, but passing it a "resolved" element instead of a "default" element:

val resolvedDocElem =
    resolved.Elem(doc.documentElement)

val bookAuthors2 =
    findAllBookAuthors(resolvedDocElem)

require(
    bookAuthors2 == bookAuthors1,
    "Expected the same authors as bookAuthors1")

"DOM wrapper" elements

Sometimes we have a (mutable) JAXP DOM tree, and do not want to convert it to a yaidom immutable Elem tree. We can still use the yaidom query API to query those DOM elements. Function findAllBookAuthors can still be called:

// Using a JAXP (DOM) DocumentBuilderFactory
val dbf = DocumentBuilderFactory.newInstance
val db = dbf.newDocumentBuilder
val d = db.parse(new File(parentDir, "books.xml"))

val wrapperDoc = dom.DomDocument(d)

val bookAuthors3 =
    findAllBookAuthors(wrapperDoc.documentElement)

require(
    bookAuthors3 == bookAuthors1,
    "Expected the same authors as bookAuthors1")

"Indexed" elements

Sometimes we want to use immutable elements, and still have access to the ancestry of each element. For example, when querying for book authors, we could then alternatively implement this by querying for all authors, followed by checking in the ancestry if the author is for a book or a magazine. With so-called "indexed" elements, each element has indeed access to its ancestry. It is beyond the scope of this first yaidom article to go into the specifics, however.

This time using "indexed" elements, we again query for book authors:

val indexedDoc = indexed.Document(doc)

val bookAuthors4 =
    findAllBookAuthors(indexedDoc.documentElement)

require(
    bookAuthors4 == bookAuthors1,
    "Expected the same authors as bookAuthors1")
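
The alternative mentioned above (querying for all authors, then checking the ancestry) can be sketched with a toy model. This is only an illustration of the general idea of pairing a root element with a path; PlainElem and PathAwareElem are hypothetical types, and nothing here is claimed to match how yaidom's "indexed" elements are actually implemented. The sample tree is made up as well, with an author under a magazine added purely to make the ancestry check meaningful.

```scala
object AncestrySketch {
  // Plain immutable element without any knowledge of its parent
  final case class PlainElem(localName: String, children: Vector[PlainElem] = Vector.empty)

  // Hypothetical ancestry-aware view: the root plus the child indexes leading to this element
  final case class PathAwareElem(root: PlainElem, path: Vector[Int]) {
    // The underlying element, found by following the path from the root
    def elem: PlainElem = path.foldLeft(root)((e, i) => e.children(i))

    // Ancestors, from the parent up to the root
    def ancestors: Vector[PathAwareElem] =
      (0 until path.size).reverse.map(n => PathAwareElem(root, path.take(n))).toVector

    def childElems: Vector[PathAwareElem] =
      elem.children.indices.map(i => PathAwareElem(root, path :+ i)).toVector

    def findAllElemsOrSelf: Vector[PathAwareElem] =
      this +: childElems.flatMap(_.findAllElemsOrSelf)
  }

  val root: PlainElem =
    PlainElem("Bookstore", Vector(
      PlainElem("Book", Vector(
        PlainElem("Authors", Vector(PlainElem("Author"), PlainElem("Author"))))),
      PlainElem("Magazine", Vector(PlainElem("Author")))))

  // All authors, restricted to those that have a Book element among their ancestors
  val bookAuthors: Vector[PathAwareElem] =
    PathAwareElem(root, Vector.empty).findAllElemsOrSelf
      .filter(_.elem.localName == "Author")
      .filter(_.ancestors.exists(_.elem.localName == "Book"))
}
```

The underlying tree stays immutable and parent-free; the ancestry is derived from the path on demand, which is one way to reconcile immutability with upward navigation.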

Uniform query API wrap up

There are more element implementations offering the same query API, such as wrappers for Scala XML Elem instances. Yaidom is extensible in this regard, since more element implementations can easily be created by mixing in some yaidom traits like ElemLike (which conforms to the ElemApi contract), and by implementing the methods that are abstract in those traits.

Conclusion

In this article a case was made for yet another Scala XML library, called yaidom. Its precise support for XML namespaces and its precise element query API may be its strongest assets. The API plays very well with the Scala Collections API, and it trades a little bit of conciseness for clarity and precision.

The basics of querying in yaidom were explained in this article. It was shown that different element implementations share the same element query API (the ElemApi trait in particular). It was also hinted at why yaidom has different element implementations in the first place.

Succeeding articles on yaidom will treat namespaces, more advanced querying, functional updates, and configuration of yaidom.

As a concluding remark, yaidom is used in production code developed at www.ebpi.nl. Its usage in several projects has certainly helped it mature. I want to thank my colleagues Jan-Paul van der Velden, Andrea Desole, Johan Walters and Nicholas Evans for their valuable feedback on earlier versions of yaidom.