This blog post assumes basic familiarity with Clojure and Termcat.
The General Case
Recall that rewriting rules transform trees of tokens into trees of tokens. Equivalently, with term a token or a sequence of tokens, they transform sequences of terms into sequences of terms.
In Clojure we represent rewriting rules by functions that take the following arguments:
- An object state
- A sequence result
- A term v
- An object new-state
- An update sequence new-result
- An integer padding-count
- An object init-state
- Every term t in terms that is a sequence is replaced by the result of (rewrite t rule). (Recursion ends when terms is a sequence of tokens.)
- Next, we call (rule init-state (repeat padding-count nil) (first terms))
- The function rule is called once for all other terms in terms (both tokens and sequences), where the state and result passed to each call is what the previous call returned
- Finally, rule is called padding-count more times with nil passed for the last element
- They do pattern matching on the term or token type (see my previous post) of v and of the last padding-count elements of result
- The new result that they return is the old result, minus possibly its padding-count last elements, plus a number of new terms
The other rule that violates the above pattern is the rule for evaluating functions. This is because this rule may need to pop any number of function arguments from the accumulated result. However, this is merely a matter of convenience. By using currying we could make this case conform to the aforementioned pattern.
A Macro For the Simple Pattern
Termcat uses a macro defrule to deal with the simple pattern. Using this macro a rule might look like this:
[state t1 t2]
[_ (:or nil
[:block :bullet]) :whitespace] [nil t1]
[_ :whitespace (:or nil
[:block :bullet])] [nil t2])
The first argument assigns names to the states and the tokens that the rule wants to read. Implicitly, thereby, it also specifies the number of tokens it wants to read and the desired padding-count (number of tokens, minus 1). Here, it reads one element t1 from result. The second element t2 corresponds to v.
The second argument, tt, specifies that we want to match on the term-type of t1 and t2. The remaining arguments do pattern matching (using core.match) on state, (tt t1), and (tt t2).
A rule defined with defrule is expected to return a vector of the new state and the tokens that should replace the tokens it read. Alternatively, nil can be returned to indicate that nothing should change. This is also what happens if none of the patterns match the input.
Finite State Machines
The rules we discussed are simple enough that the matching part of the rules can be transformed into regular expressions or finite state machines. The difficulty is handling the case where the rules make substitutions efficiently.
In essence, transforming a Termcat document into an HTML document is a matter of applying a bunch of rules sequentially. That is, we're faced with something like this:
(-> (read-termcat-document "hello.tc")
It would be interesting to investigate whether or not the different rewriting rules could be automatically—using macros—merged into a single finite state machine. Ideally we would end up with something like this:
(-> (read-termcat-document "hello.tc")
Parallelism and Fast Recompilation
All of the rewriting rules (including the non-simple ones) are deterministic. Their output depends only on the following factors: (i) the state, (ii) the tokens that they read from result, and (iii) v.
This is incredibly exciting! Here's why: Parallelism and Fast Recompilation.
Suppose we want to rewrite a dozen tokens using a rule rule1. The first token has the payload (or content) '1', the second token has the payload '2', and so on. What we could do is apply rule 1 in parallel to (i) the complete sequence of 12 tokens and (ii) to the sequence of token 7-12. These two processes are depicted below.
We can see that for token 7 the input to rule1 is different for both processes. This could be because in the first process rule1 rewrote token 5 or 6. It could also be because the state is different for both function calls. Here we should discard the output of the second process. When the first process arrives at the eight token, however, we see that its arguments for rule1 are identical to the arguments that the second process used. This is great. This means we can terminate the first process and append the result of the first process to the result of the second process (discarding those tokens that were emitted after the second process rewrote token 7 and that were not read in by rule1 upon rewriting token 8).
Similar optimization should be possible when the source code of a Termcat document is modified. Think fast (partial) recompilation.
- Coordinating this sort of parallelism is inherently tricky. For small inputs it would definitely not be worth it. And what's 'small' in this context?
- Moreover, it's not clear if the overhead of comparing the arguments for rule1 would, in practice, ever be worth the time saved by not having to do redundant rewriting.
- This optimization is not applicable to the rule for generating abstract syntax trees since this rule (i) tends to keep a lot of state around and (ii) may need to read an arbitrary number of elements from result.
Note that much of the last two sections is speculation. I might very well dig deeper into these things at some point because I think that would be a lot of fun. Nevertheless, this will probably have to wait until after Hacker School since I want to focus on more basic features first.
Meanwhile, I'm approaching a point where I want to start demoing Termcat to people and get feedback. In my next blog post I will give an overview of the features currently implemented.