
Data Languages, Part Two: Content Data (Better than XML)

published 2021-12-31

Limitations of XML

XML is now used throughout the publishing industry and academia for content, yet it is not a good fit for the needs of content. Two shortcomings in particular make XML ill-suited to representing content data:

  1. XML requires a strict hierarchy, but content often does not observe one. Two common cases:

    • Overlapping (concurrent) hierarchies. Many, many texts have concurrent hierarchies. Simple examples include
      • Poetry with stanzas/lines and sentences.
      • Scripture with book/chapter/verse and book/paragraph/sentence.
    • Non-contiguous ranges. A good example is annotations (e.g., highlights) that should be treated as a single unit but that cover non-contiguous ranges of text. Such highlights also, incidentally, cross all kinds of hierarchy boundaries in any underlying markup.
  2. XML attributes are all untyped strings, but content elements can have typed attributes. Type information is not included in the XML content, so a schema is needed to know the type of the attribute. This makes it more difficult to convert a set of attributes to, say, a map/dict. It is useful to be able to include typed attributes with the content element in the markup. JSON shows us how this can be done effectively.

LMNL

Jeni Tennison and Wendell Piez described a text content markup language called LMNL (Layered Markup and Annotation Language) which addresses the shortcomings of XML.

https://www.balisage.net/Proceedings/vol8/html/Piez01/BalisageVol8-Piez01.html

(I haven't seen anything that says this, but the name reads "liminal," which is the space in between two or more different realities. They also call one of their element tags a "limen", which is the Latin word for "doorstep". That's pretty cool.)

LMNL has the following features that enable it to overcome XML's limitations:

  • The fundamental data structure is ranges, not hierarchies.
  • Ranges can overlap arbitrarily.
  • Ranges can be non-contiguous by using ids.

The authors provide the following simple example of a LMNL "excerpt" with overlapping "s" (sentence) and "l" (line) ranges:

[excerpt}
[s}[l}He manages to keep the upper hand{l]
[l}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l}We fence our flowers in and the hens range.{l]{s]
{excerpt]

Here is an example in which arbitrary "r" ranges can overlap by having an id attached to each:

[r=r1}A case [r=r2}of{r=r1] arbitrary overlap{r=r2]

And here is an example of LMNL "annotations", which are what they call attributes:

[excerpt [source}The Housekeeper{source] [author}Robert Frost{author]}
[s}[l [n}144{n]}He manages to keep the upper hand{l]
[l [n}145{n]}On his own farm.{s] [s}He's boss.{s] [s}But as to hens:{l]
[l [n}146{n]}We fence our flowers in and the hens range.{l]{s]
{excerpt]

Here the annotations are limens nested inside of range limens. I find it very confusing to have the same syntax for both, especially alongside the confusion of using [r} and {r] as range delimiters.

There is a folder of demos available from the linked page.

LMNL is a great experiment, because it demonstrates a different way to think about content: In terms of ranges rather than hierarchies. This is one of those things that should be obvious, but it is not. But once you see it, you cannot unsee it: Text is better represented by ranges and not by hierarchies.

Microsoft Word Document Ranges

I have worked extensively with the document model in Microsoft Word, using VBA to program Word documents. What is interesting about Word is that it uses ranges to represent the content in a document. (I don't remember if Word 6 used ranges, but it probably did; I do know that Word 97, which was the first version to convert from WordBasic to VBA, did use ranges, and I'm guessing that the use of ranges for the document model goes much further back.) Word's use of ranges is limited: ranges tend to nest rather than overlap. For example, a particular configuration of font formatting is stored internally as a range, which doesn't overlap with other configurations of font formatting and which nests within the containing paragraph range. But it's a step in the right direction.

Emdros

A friend of mine, Ulrik Sandborg-Petersen, created Emdros (https://emdros.org/) as a system for analyzing marked-up texts. It's a text query engine, not a markup language, but it embodies the same concepts: All markup is treated as a series of overlapping ranges. The text itself is treated as the core data structure, with each unit of text (called "monad", usually a single character) being referenced by index in an immutable array, while all the markup is referenced to that array. Emdros includes a way of understanding how to model a range-based textual system internally, as well as a query language (MQL) that demonstrates some of the issues with querying such a data structure. Emdros pays special attention, in particular, to the following relationships between ranges:

  • Range 1 is contained within Range 2
  • Range 1 is before Range 2 and does not overlap
  • Range 1 adjoins Range 2
  • Range 1 is after Range 2 and does not overlap
  • Range 1 starts before Range 2 and ends within Range 2 (left overlap)
  • Range 1 starts within Range 2 and ends after it (right overlap)
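
The relationships in this list can be sketched as a small classifier. This is only an illustration of the idea, not Emdros's API or MQL syntax: the Span type, the half-open [start, end) indexing, and the relation names are all assumptions made for the example. It also adds a "contains" case, which the list above leaves implicit, so that every pair of non-empty spans gets an answer.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A contiguous, non-empty run of text positions, half-open: [start, end)."""
    start: int
    end: int


def relation(a: Span, b: Span) -> str:
    """Classify how span a relates to span b (names are illustrative only)."""
    if a.end < b.start:
        return "before"          # a ends before b starts, with a gap
    if a.start > b.end:
        return "after"           # a starts after b ends, with a gap
    if a.end == b.start or a.start == b.end:
        return "adjoins"         # the spans touch but share no positions
    if b.start <= a.start and a.end <= b.end:
        return "contained"       # a lies entirely within b
    if a.start <= b.start and b.end <= a.end:
        return "contains"        # b lies entirely within a (added case)
    if a.start < b.start:
        return "left-overlap"    # a starts before b and ends within b
    return "right-overlap"       # a starts within b and ends after it


sentence = Span(5, 8)
print(relation(Span(0, 6), sentence))   # left-overlap
print(relation(Span(6, 10), sentence))  # right-overlap
```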

SGML

One could consider going back to SGML, which peaked in use in the mid-1990s, and was replaced by XML. SGML is looser, but it doesn't really allow or accommodate overlapping hierarchies. It also lacks XML's namespaces. The way is forward, not back.

CML: Content Markup Language

Instead of going back, we should learn from everything we know and go forward to a content markup language that builds on prior art and accommodates what is different about content vs. hierarchical data. Content Markup Language should include:

  • Document marked up with ranges.

  • Ranges have a type (tag name), preceded by optional namespace ID, and followed by optional ID. Tags can therefore be:

    • type
    • type#id
    • ns:type
    • ns:type#id
    • #id (an untyped range specified only by ID)
  • Ranges can be disjoint / non-contiguous: multiple ranges with the same ID.

  • Ranges that are not explicitly closed are closed at document end (no implicit closes within document).

  • Attributes as objects (using JSON syntax?).

  • Whitespace is not assumed to be insignificant.

  • Namespaces define a vocabulary and grammar.

  • Document namespaces are defined in the document header with the reserved "NS" attribute name. The NS can be defined by a string (if there is only one, the default), or a dict (if there are several, including any default). If the dict form is used, the default namespace is indicated by an empty key in the namespace dict.

    <doc NS:{:"DEFAULT-URI", m:"MathML-URI"}>
    ...
    </doc>
    
  • End tags include the namespace (if any), the type (if any), and the ID (if any, optional if there is a type and if there are no other ranges of this type overlapping). Example:

    <p#p01>This is a paragraph about nothing in particular.</p>
    
    <p#p02>This is a <a#a01>paragraph that includes <a#a02>overlapping annotations</a#a01>, so the end tags on those annotations</a#a02> have to include their IDs, but the paragraph doesn't have overlaps, so the p end tag doesn't need its ID.</p>
    

    (In this example, I used short, indexed ids based on the type tag. These are the easiest to type, but they can become more difficult to manage in longer documents. At some point, it no longer makes sense to have IDs that index in order, and randomly-generated unique IDs become more manageable. IDs can be anything that matches [\w\-\.]+ (letters, numbers, underscores, hyphens, periods). So URL-safe base64 strings with = stripped are good ids, as are UUIDs, as are any XML names (which have stricter requirements), as are ULIDs.)
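
The tag forms listed earlier (type, type#id, ns:type, ns:type#id, #id) and the ID pattern can be checked with one regular expression. This is a minimal sketch, not a CML parser: the names TagName and parse_tag_name are hypothetical, and it assumes that namespace keys and type names follow the same [\w\-\.]+ pattern as IDs, which the description above does not actually specify.

```python
import re
from typing import NamedTuple, Optional

# The ID pattern comes from the text above; reusing it for namespace keys
# and type names is an assumption made for this sketch.
NAME = r"[\w\-\.]+"

TAG_RE = re.compile(
    rf"(?:(?:(?P<ns>{NAME}):)?(?P<type>{NAME}))?(?:#(?P<id>{NAME}))?"
)


class TagName(NamedTuple):
    ns: Optional[str]
    type: Optional[str]
    id: Optional[str]


def parse_tag_name(text: str) -> TagName:
    """Split 'ns:type#id' (and its shorter forms) into its parts."""
    m = TAG_RE.fullmatch(text)
    # Every part is optional in the regex, so reject the empty tag explicitly.
    if m is None or (m["type"] is None and m["id"] is None):
        raise ValueError(f"not a valid tag name: {text!r}")
    return TagName(m["ns"], m["type"], m["id"])


print(parse_tag_name("p#p01"))
print(parse_tag_name("m:math"))
print(parse_tag_name("#a01"))
```

A namespace without a type (such as ns:#id) is rejected here because it is not among the listed forms.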

  • Start tags that have the same namespace, type, and ID as a previous range are considered disjoint continuations of those ranges.

  • CML Documents can include XML content - XML attributes are parsed as a dict of string-typed keys and values, and XML namespaces are supported.

  • CML is thus a superset of XML, but not a strict one: CML does not support some of the archaic features of XML (named entities, DTDs - anything that has to be defined to be interpreted). XML documents that include these features need to be rendered (named entities expanded). Numeric entities (&#x00a0; or &#9;) are still supported, along with the built-in entities (&amp;, &lt;, &gt; for XML grammar).

  • CML uses UTF-8 to encode all documents.

  • Validation: The grammar for a CML document type is defined by a schema (CSL: Content Schema Language), so that documents can be validated against that schema.

  • CML content schemas are expected to be hosted at their Namespace URL (unlike XML schemas). That way, any valid CML document that uses a namespace can be validated by retrieving the schema from the namespace URL.

  • Object model:

    • CML Document:

      • content: the text of the document.
      • ranges: the (hashmap? list?) of all the range objects in the document. (Hashmap makes sense: Each (namespace, type, ID) corresponds to a single Range object.)
    • Internally, CML content can be represented a couple of different ways:

      • The content is a string, ranges are objects that reference the content string by index. Strings in most languages are vectors / 1-dimensional arrays. This is an efficient way to handle content that is immutable.
      • The content is a (doubly linked) list in which each cell contains a character, and ranges are objects that reference cells in that list with pointers. This is an efficient way to handle content that is mutable / editable: the pointers from range objects into the content don't have to be updated every time the content text is edited.
    • Range objects

      • namespace (can be empty). Includes the key and the URL.
      • type (can be empty)
      • ID (can be empty)
      • refs: List of zero or more range references (start, end). Range references are either indexes (for immutable CML) or pointers (for mutable CML).
      • attributes: A hashmap.
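
A minimal sketch of the immutable variant of this object model, with the content as a string and ranges keyed by (namespace, type, ID). The class names, the mark and text_of helpers, and the half-open index pairs are assumptions for illustration. For brevity the namespace is stored as its key only (not key plus URL), and the sketch assumes every range carries an ID, since ID-less ranges of the same type would otherwise share one hashmap key.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

# (namespace, type, ID): the hashmap key suggested in the object model.
RangeKey = Tuple[Optional[str], Optional[str], Optional[str]]


@dataclass
class Range:
    namespace: Optional[str] = None
    type: Optional[str] = None
    id: Optional[str] = None
    # Half-open (start, end) indexes into the content string.
    refs: List[Tuple[int, int]] = field(default_factory=list)
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Document:
    content: str
    ranges: Dict[RangeKey, Range] = field(default_factory=dict)

    def mark(self, start, end, type=None, id=None, namespace=None, **attributes):
        """Add a span; reusing a (namespace, type, ID) continues that range."""
        key = (namespace, type, id)
        rng = self.ranges.get(key)
        if rng is None:
            rng = self.ranges[key] = Range(namespace, type, id)
        rng.refs.append((start, end))
        rng.attributes.update(attributes)
        return rng

    def text_of(self, rng: Range) -> List[str]:
        """Return the text covered by each span of the range."""
        return [self.content[s:e] for s, e in rng.refs]


doc = Document("A case of arbitrary overlap")
r1 = doc.mark(0, 9, type="r", id="r1")        # "A case of"
r2 = doc.mark(7, 27, type="r", id="r2")       # overlaps r1 on "of"
hl = doc.mark(0, 6, id="h1", color="yellow")  # untyped, disjoint highlight
doc.mark(20, 27, id="h1")                     # second span of the same range
print(doc.text_of(r1), doc.text_of(r2), doc.text_of(hl))
```
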
  • Queries: Unlike with XML, there is no meaning to queries within a hierarchy of nodes. Instead, we look for text or ranges that have particular attributes.

    • Ranges with a particular name, or set of names, or in a particular namespace

    • Ranges with a particular attribute, or the attribute matching particular conditions.

    • Ranges with a particular relationship to some other range - contained by, containing, overlapping, before, after.

    • Ranges containing particular text (text queries).

    • AND/OR/NOT for any of the above

    • One thing that is better in CML vs. XML is text queries: Whether using word stemming, regular expressions, or just literal string matching, we can query the text of the document or within a particular range without having to worry about whether the query crosses from one node to another (inside the scope of our search). With XML, text queries are very difficult, because there is no "text" for the document -- only the text of each node. With CML, text queries are very easy, and can be mixed with range queries.

    • (CML with immutable text makes text queries very easy, because the text is just a string, and every programming language has mature facilities to query strings.)

    • (CML with mutable text makes text queries much harder, because it requires writing the libraries to query the character list. Regular expressions, for example, become much more challenging!)
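
To illustrate mixing a text query with a range query on immutable content, here is a sketch in which an ordinary regular expression runs over the whole content string and each match is then related back to the ranges it touches. The function name find_in_ranges, the dict-of-spans layout, and the range labels are hypothetical; this is not a proposed CML query language.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) indexes into the content


def find_in_ranges(
    content: str, ranges: Dict[str, List[Span]], pattern: str
) -> List[Tuple[str, List[str]]]:
    """Return (matched text, labels of ranges overlapping the match)."""
    results = []
    for m in re.finditer(pattern, content):
        touching = sorted(
            label
            for label, spans in ranges.items()
            if any(s < m.end() and m.start() < e for s, e in spans)
        )
        results.append((m.group(), touching))
    return results


content = "He manages to keep the upper hand\nOn his own farm. He's boss."
line_end = content.index("\n")          # end of the first line
sentence_end = content.index(".") + 1   # end of the first sentence
ranges = {
    "l#144": [(0, line_end)],
    "l#145": [(line_end + 1, len(content))],
    "s#1": [(0, sentence_end)],
    "s#2": [(sentence_end + 1, len(content))],
}

# The match crosses the line boundary but stays inside the first sentence.
print(find_in_ranges(content, ranges, r"hand\s+On"))
```

Because the text is one string, the pattern does not need to know that "hand" and "On" sit in different line ranges.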

  • Transformations:

    • CML with immutable text does not change the text. CML with mutable text does, but it's straightforward (adding / removing / replacing text within a range).

    • It is always easy to add, remove, or edit ranges on the text. This can be done in a procedural fashion (query, then mutate ranges) or in a declarative fashion (add/mutate these ranges that have these characteristics - match these query parameters).

  • Filters:

    • We can filter ranges out of our results. This does not filter out the text (if the text is in another range that we have included in our results). Filtering is thus a way to remove ranges without changing the result text.
  • Deserializing: Parse and load into the object model.

  • Serializing: Range begin/end tags at the same point in the text can appear in any order. Their order at that point is meaningless.

  • Rendering to XML:

    • Requires hierarchization, which in turn requires a nesting order for all the range types in the document.
    • Once a range type nesting order is defined, hierarchization is very straightforward.
    • A range type nesting definition probably requires taking into account a variety of contexts -- relationships between the ranges -- and how to handle them. For example, paragraphs usually occur within sections, but an aside is a type of section, and it might occur inside a paragraph, and include paragraphs within it. (Some XML document types are quite deeply nested. This will prove a challenge for CML.)
    • Different contextual relationships for ranges might also imply different XML renderings.
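
Hierarchization under a fixed nesting order can be sketched as follows. This is one simple approach under stated assumptions, not a definitive algorithm: it uses a single global ordering of range types (outermost first) and ignores the contextual rules discussed above, it splits a lower-priority range into fragments wherever a higher-priority range starts or ends, and the to_xml name and flat tuple layout are hypothetical. Fragments of a split range repeat the same id attribute, so that attribute is a back-reference to the range rather than a unique XML ID.

```python
from typing import List, Sequence, Tuple
from xml.sax.saxutils import escape

# (type, id, start, end) with half-open indexes into the content string.
FlatRange = Tuple[str, str, int, int]


def to_xml(content: str, ranges: Sequence[FlatRange], nesting: Sequence[str]) -> str:
    """Render overlapping ranges as nested XML, outermost type first."""
    rank = {type_name: i for i, type_name in enumerate(nesting)}
    # Every range boundary is a cut point; no range changes between two cuts.
    cuts = sorted({0, len(content)} | {p for _, _, s, e in ranges for p in (s, e)})
    out: List[str] = []
    stack: List[FlatRange] = []  # elements currently open, outermost first
    for seg_start, seg_end in zip(cuts, cuts[1:]):
        active = sorted(
            (r for r in ranges if r[2] <= seg_start and seg_end <= r[3]),
            key=lambda r: (rank[r[0]], r[2], r[1]),
        )
        # Keep the shared outer elements open; close and reopen the rest.
        keep = 0
        while keep < min(len(stack), len(active)) and stack[keep] == active[keep]:
            keep += 1
        out.extend(f"</{r[0]}>" for r in reversed(stack[keep:]))
        out.extend(f'<{r[0]} id="{r[1]}">' for r in active[keep:])
        stack = active
        out.append(escape(content[seg_start:seg_end]))
    out.extend(f"</{r[0]}>" for r in reversed(stack))
    return "".join(out)


overlapping = [("s", "s1", 0, 4), ("l", "l1", 2, 6)]
print(to_xml("abcdef", overlapping, nesting=["l", "s"]))
print(to_xml("abcdef", overlapping, nesting=["s", "l"]))
```

The same ranges produce different XML depending on the nesting order, which is why that order has to be part of the rendering definition.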