Saturday, October 12, 2013

The Termcat Rewriting System and Its Different Stages

In my previous blog post I described some of the syntax of Termcat, a markup language that I have been working on while attending Hacker School. In this post I'd like to outline how I'm implementing it.

Termcat documents can be thought of as a sequence of Termcat expressions. Compiling Termcat documents into HTML involves applying a series of transformations to these expressions until every one of them is reduced to a piece of HTML code. The concatenation of this HTML code, wrapped in some leading and trailing code, constitutes the output document.

When I say 'applying transformations' I'm being very literal. Everything after reading the file and before mixing in the leading and trailing code is implemented as a rewriting system. There are six stages to this process. The purpose of this blog post is to explain them one by one.

Pre-tokenization (1/6)

The Termcat document is initially loaded as a sequence of Unicode characters. Next, every one of these characters is mapped onto a 'token'. Think of a token as a pair consisting of (i) a token type and (ii) a payload. The payload is the character. The token type is, at this point in the 6-step process, solely based on the character.

A superset of the following mapping is used:

  • \ -> :escape
  • Space -> :whitespace
  • Newline character -> :newline
  • ( -> [:ldelim :parenthesis]
  • ) -> [:rdelim :parenthesis]
  • . or : -> :maybe-fun
  • # -> :maybe-title
  • * or _ or - -> :maybe-magic
  • Everthing else -> :default
In the above list, everything following an arrow is Clojure code. In Clojure brackets delimit vectors and their elements are separated by spaces. Terms following colons are 'keywords'. You can think of keywords as strings for internal use (as opposed to input/output and string manipulation).

Notice that parentheses have a vector for a type. The same is true for [&npbsp;] brackets, {&npbsp;} braces, and <&npbsp;> chevrons, but we omit these delimiters here. Later, in the tokenization phase, we will also convert bullet items and indentation to delimiter tokens.

Also notice that some types are named "maybe-". That's a gentle reminder to ourselves that these tokens types are rough guesses. To wit, as in Markdown the string "# A Title" at the start of a line would be a section title, but "A#B" is never a title, even though the character # is initially always assigned the :maybe-title type.

Tokenization (2/6)

At the second stage we get more serious about tokens. To start, all successive occurrences of tokens of types :default, :whitespace, or :maybe-magic are merged. Thus the series [:default "N"] [:default "e"] [:default "a"] [:default "t"] is transformed into [:default "Neat"]. At this stage successive instances of two or more newlines are also transformed into :emptyline tokens. Everything following an escape token is given the :default token type and the escape tokens are removed. Indented blocks are wrapped in [:ldelim :indent] and [:rdelim :indent] tokens. Bullet items are wrapped in [:ldelim :bullet] and [:rdelim :bullet]. Finally, superfluous whitespace—e.g. whitespace before an [:rdelim :indent] token—is removed.

Abstract syntax tree generation (3/6)

In the first stage we converted a sequence of characters into a sequence of tokens. In the second stage we transformed this sequence of tokens into a more meaningful sequence of tokens. Notice, however, that this latter sequence implicitly describes a tree.

So far we have considered one kind of term—viz. tokens. We now add a new kind of term—blocks. Blocks consist of two delimiter tokens (one that matches the type [:ldelim _] and one that matches the type [:rdelim _]) and a sequence of terms. We refer to this sequence as the block's 'center'.

Abstract syntax tree (AST) generation is very intuitive. We simply replace all sequences of the (balanced) pattern [:rdelim _] terms* [:rdelim _] by a block term. For instance,

[:ldelim :indent] [:default "Hi"] [:whitespace " "] [:default "there"] [:rdelim :indent]
[:block [:ldelim :indent] [[:default "Hi"] [:whitespace " "] [:default "there"]] [:rdelim :indent]]
This is great because from here on we can say things such as 'an empty line, followed by an indented block, followed by another empty line' is a blockquote.

Desugarization (4/6)

Termcat wants to have its cake and eat it too. It wants to be nice and comfy to type in, like Markdown, but it also wants to be powerful and simple like TeX. Here's how.

There's a sugar-free version of Termcat that can do everything that regular Termcat can. However, it doesn't have any of Markdown's syntax. Instead, to add a section you would type ":section(This is a title)", to add emphasis you type ":emph(emphasis)" to create a bullet list you type ":bullet-list(Item 1)(Item 2)(Item 3)", and so on.

The desugarization step is responsible for translating Markdown syntax into sugar-free Termcat code. It results in sequences such as [:fun some-clojure-function] [:block [:ldelim :bullet] [...] [:rdelim :bullet]] [:block [:ldelim :bullet] [...] [:rdelim :bullet]]. Specifically, it results in :fun terms (similar to tokens, but with a Clojure function for payload) followed by zero or more :block terms, representing the arguments.

Lambda (5/6)

In the fifth stage computations are performed and HTML code is generated. Given a successive occurrence of a :fun term and one or more :block terms, the payload of the :fun term (a function) is invoked with the center of each block for arguments.

Roughly, the functions whose names start with a colon (:emph, :bullet-list, etc.) wrap their arguments in :html tokens. For example, ":emph(Test!)" is first rewritten to

[:fun colon-emph-function] [:block [:ldelim :parenthesis] [[:default "Test!"]] [:rdelim :parenthesis]]
and is then rewritten to
[:html "<emph>"] [:default "Test!"] [:html "</emph>"]

HTML output (6/6)

After all :fun terms have been evaluated we have a tree of terms of various types. In the sixth and final stage we iterate over all these terms and convert them to :html terms. For instance, the payload of :default tokens is HTML-escaped, :whitespace tokens are replaced by a space character, and :block terms are flattened.

So that's the gist of the Termcat rewriting system. I made some simplifications and omitted some details—the elephant in the room being mathematical expressions—but I hope to have shown that rewriting systems and markup languages can make for a good match. In the next installment I will explain how I'm implementing this system in Clojure.