Showing posts with label Programming. Show all posts
Showing posts with label Programming. Show all posts

Tuesday, January 27, 2015

Live queries in Comprehend

Someone recently asked me if it's possible to have Comprehend fire events when the database is updated and new matches are encountered. This is indeed possible with not too much effort. The trick is to combine forward matching and mutable indexed sets.

To start, suppose we are maintaining a set of pairs and we are interested in all triples [a b c] such that [a b] and [b c] are in our database. We define a function that prints such matches iff they are new since the function was last called:
(require '[comprehend :as c])
(require '[comprehend.mutable :as cm])

(defn report-new-matches [m-s]
  "Prints new paths of length 3 since report-new-matches was last
  called on m-s. Prints zero matches on the first invocation."
  (let [!results (volatile! nil)]
    (dosync
      (vreset! !results (c/comprehend :mark :report-new-matches
                                      @m-s
                                      [a b]
                                      [b c]
                                      [a b c]))
      (cm/mark m-s :report-new-matches))
    (doseq [x @!results]
      (println "Matched" x))))
This function runs a transaction that comprises two steps:
  1. Run a query on the mutable indexed set
  2. Update the mutable indexed set with a marker
The transaction ensures that the set is not mutated between step 1 and step 2. After the transaction completes, report-new-matches prints the results.

We now hook up report-new-matches as the 'io' argument of the mutable indexed set constructor:
(declare m-s)
(defn live-reporter
  ([] nil)
  ([s diff] (report-new-matches m-s)))
(def m-s (cm/mutable-indexed-set live-reporter))
(report-new-matches m-s) ; init
(I admit the circular dependency between the definitions of live-reporter and m-s is currently a little ugly. I might tweak the signature of 'io' functions to remedy this problem.)

That's it! Let's see what happens when we add elements:
(reduce cm/conj m-s [[1 2] [2 3] [1 3] [3 4]])
; Matched [1 3 4]
; Matched [1 2 3]
; Matched [2 3 4]

(cm/conj m-s [0 1])
; Matched [0 1 3]
; Matched [0 1 2]
At some point the comprehend.mutable API will be amended with a macro to make this even easier. In the meantime, as you can see, it's straightforward to role out your own solution.

Thursday, January 1, 2015

Comprehend is now a lot faster

In a previous blog post I announced Comprehend, a Clojure library for pattern matching on indexed sets.

In its original implementation, when an item was added to an indexed set, Comprehend would translate the item’s relevant properties to predicate logic and it would store the resulting relations in an indexed core.logic database. When Comprehend was subsequently given a query, it would translate the query to predicate logic, and then have core.logic unify the database and the (translated) query.

The above approach came with quite a bit of overhead, both in terms of performance and in terms of memory use. That is why I have now rewritten the entire Comprehend engine.

The new engine uses a custom multi-threaded unifier. Indexes are created lazily, as they are needed, and are stored in a cache.

Comprehend currently uses a soft cache by default. This means that the garbage collector is in charge of evicting indexes. This seems to work well in practice, but if that’s not to your liking—don’t worry—it’s trivial to swap in an alternative cache from core.cache. Moreover, Comprehend should have enough hooks to allow for a custom cache that intelligently evicts indexes as items are added or removed from the indexed set.

I tried the new engine on a few queries and found the overhead small enough. Even for relatively small sets, Comprehend returns results quickly and I found it quite smart about what indexes (not) to create.

Sunday, May 18, 2014

Comprehend: A Clojure Library for Pattern Matching on Sets

I've been doing a lot of Objective-C programming lately. Objective-C is not the worst language, but last week I couldn't help but become nostalgic of the Clojure programming I did at Hacker School. At some point I could no longer help but open Good Ol' Sublime Text and experiment with core.logic.

My procrastinating was productive because I now have the pleasure of presenting Comprehend to you.

Comprehend is a Clojure library containing a setlike data type that is optimized for the idiom { ∀ patterns ⊆ S : expr }. In Clojure this ends up looking like this:
(c/comprehend S pattern1 pattern2 ... expr) ; => '(result1 result2 result3 ...)
I explain the syntax in quite a bit of detail on the Github page (repeat link). In this blog post I want to go into some technical aspects of the implementation.

The main ideas

In Comprehend core.logic.pldb is used to manage the indexes of indexed sets.

When adding an element X to an indexed set, here is what we do:
  • Add something akin to [:root-element (hash X)] to the logic database
Next, with Y = X initially, we recursively perform the following steps:
  • If Y is sequential then for every element Z of Y at position i:
    • Add [:vector-length (count Y)] to the logic database
    • Add [:vector-element (hash Y) i (hash Z)]
    • If Z is a (non-opaque) collection then repeat this algorithm with Y := Z
  • If Y is a map then we do something similar for both the keys and the values of Y. We don't store the size of Y, however, because in the case of maps we also want to match on supermaps.
  • If Y is a different kind of collection (e.g. a  set) then we do the same but we don't have to deal with both keys and values.
That's it! core.logic.pldb puts indices on all the elements in our tuples and this helps us perform fast queries. Next to the index we also store a dictionary from all the hashes in our tuples to the original objects.

So about those queries. We follow a similar procedure to translate patterns into tuples. Here are the main differences:
  • At compile time we determine what the unbound symbols in the pattern are.
  • Wherever we would previously put (hash Y) or (hash Z) in a tuple, we now need to be more clever:
    • If Y (or Z) was one of the unbound symbols then we just return it. (Nitpick: This is at runtime and core.logic will have made the symbol resolve to a core.logic 'lvar'.)
    • If Y (or Z) is a collection that contains one of the unbound symbols then we replace it by an lvar. We reuse the same lvar for identical patterns.
    • In all other cases we store the hash value after all.
We then hand core.logic all of the following: (a) the core.logic.pldb index, (b) the list of unbound symbols in the pattern, and (c) the tuples-with-variables. Core.logic will do its thing and returns a sequence of sequences of hashes. The outermost sequence can be thought of as a list of matches. Every innermost sequence contains hashes of assignments for (b). We map these hashes onto assignments using the dictionary that we stored next to the index.

Recall that we determined (b) at compile time. This means that at compile time we can also generate the following template:
(for [[var1 var2 ...] (do-all-of-the-above)] expr)
Thus, insofar the only unbound symbols in expr also occurred in (one of) the patterns, expr will be grounded.

What is it good for?

I wrote a quick and dirty benchmark. The idea being merely to verify that my library is faster than naive approaches for some problems.

In the benchmark I create a graph of N vertices and N random edges. I then compare several approaches to find all paths that consist of four edges.

The first approach I tried, of course, uses Comprehend.

I also implemented an alternative core.logic query that uses a vector representation of the graph as its input source. This code uses core.logic's membero operator and also core.logic's built-in support for unification of vectors with logic variables. At around N=50 Comprehend conclusively overtakes this implementation.

I also implemented a naive algorithm that uses 'for', 'filter', and argument destructuring. At N=50 this implementation is about 50% slower than the previous core.logic query.

That is to say, there is some proof that there exist use cases for Comprehend. The exact scope of use cases is not clear to me yet, however. This warrants further investigation.

Practical uses aside, I had a ton of fun implementing this library and definitely also learned a thing or two along the way.

Alright, now it's back to Objective-C for me!

Wednesday, February 26, 2014

Termcat ClojureScript Port and Live Demo

Recently I spent some time porting Termcat from Clojure to ClojureScript. As a result, it is now possible to embed the Termcat compiler in web pages.

I also went one step further and created a live demo. The demo consists of a simple Termcat editor. As you enter Termcat code, it compiles your document to HTML/MathML on the fly.

Do try the demo if you're interested in alternatives for LaTeX or Markdown!

Monday, November 25, 2013

Thoughts on the Clojure threading macros

Most programmers who have dabbled in Clojure are probably familiar with the threading macros -> and ->>. For readers, who aren't. Here's what they do.

The threading macros allow you to write function composition in reversed order. For instance, using the threading macros (f (g (h x))) can be written as (-> x h g f) or (->> x h g f). Here, the first argument of the macros are evaluated as usual. The other arguments are assumed to be functions.

It's also possible to use -> and ->> for functions that expect multiple arguments. As such, the code (-> a (f b) g) is translated into (g (f a b)) and (->> a (f b) g) is rewritten as (g (f b a)). That is to say, every time the pattern (function arg1 arg2) is encountered, the result of the preceding expression is inserted as the first (->) or last argument (->>) of the function call.

The macro ->> is perhaps the more popular threading macro. It is very useful in combination with map, reduce, or filter—functions that take a function as their first argument and a collection as their last argument. To wit, you might write the following code:

(->> [-1 0 1 2]
     (filter pos?)
     (map inc)
     (map str))

This expression would first filter all positive elements from the vector [-1 0 1 2] (i.e. delete -1 and 0),  then increment the remaining elements, and finally convert them to strings. The result is the list ("2" "3").

The macro -> is useful for functions that perform operations on their first argument. For instance,

(-> { :a [1 2] }
    (get :a)
    (conj 3))

This evaluates to the vector [1 2 3].

Invoking object or protocol methods in Clojure abides by the following syntax: (method obj arg1 arg2 ...). If, in typical functional style, your methods return updates on 'obj' then the macro -> tends to be the more useful threading macro for composing method calls.

Combining -> and ->> can be a bit of a pain. Nesting ->> within an -> expression works well:

(-> {:a [0 1 2 3]}
    (get :a)
    (->> (filter pos?)
         (map inc))
    reverse)
; (4 3 2)

This works because -> inserts the result of (get :a {:a [0 1 2 3]}) as the first argument of the call to ->>. Unfortunately, however, there's no obvious way for nesting -> within ->>.

Using only -> and ->> there doesn't seem to be a good general way of dealing with long chains of composed function calls that expect the 'object' in different positions.

Another annoyance is that -> and ->> don't allow for control flow in an obvious way. That is to say, you can't easily use if-statements within an -> expression.

To be fair, you could use lambdas like this

(-> ...
    ((fn [x] (if (pred x)
                 (foo x)
                 (bar x)))))

but this is a less than stellar solution if only because it introduces a lot of extra parentheses.

(Granted, using the special lambda form #() one pair of parentheses can be eliminated, but then the trade-off is that #()-expressions cannot be nested.)

Where the Clojure threading macros appear to fall short, the Swiss arrow macros purport to be a solution. However, there are rather many of them and I'm not sure if you want other people to understand your code then I feel it's better not to use them.

Enter the macro as->. It's a built-in macro (as of Clojure 1.5) and it's pretty darn awesome. Here's the source code:

(defmacro as->
  "Binds name to expr, evaluates the first form in the lexical context
  of that binding, then binds name to that result, repeating for each
  successive form, returning the result of the last form."
  {:added "1.5"}
  [expr name & forms]
  `(let [~name ~expr
         ~@(interleave (repeat name) forms)]
     ~name))

That code is unusually short, don't you agree? Here's what it does. It transforms the code

(as-> [1 2 3 4] $
      (conj $ 5)
      (map dec $))

into

(let* [$ [1 2 3 4] 
       $ (conj $ 5) 
       $ (map dec $)] $)

and it returns (0 1 2 3 4). I think it's a real gem.

The macro as-> allows you to introduce control flow and it works very well with the -> macro. To wit, you can write things like

(-> ...
    (as-> x (if (pred x)
                (foo x)
                (bar x)))))
    ...)

It also works great in combination with the -> macro. For instance, you can write

(as-> [1 2 3] x
      (map inc x)
      (-> x
          vec
          (conj 5)))
; [2 3 4 5]

Isn't that just swell?

Monday, November 18, 2013

This is not a blog post

In my previous blog post I wrote that I was feeling pretty happy with my productivity. Unfortunately I didn't feel quite the same this week. It's not that I didn't do anything. I wrote plenty of code. However, most of my time was spent experimenting with new syntax and getting lots of fairly boring details right. This makes it feel like I didn't do anything 'real'.

Having written a PhD thesis, however, I know all too well that feeling productive and actually being productive are things that don't always correlate. After digging through my notes and my Github commits I managed to compile this (non-exhaustive) list of stuff I did since my last post:

  • I made sure 'weird' expressions such as p^y_a^z'_b compile to sensible MathML code. LaTeX will actually give you an error if you type that; Termcat interprets it as p^{yz}'_{ab}. Incidentally, the MathML for that is <msubsup><mi>p</mi><mrow><mo>{</mo><mi>a</mi><mi>b</mi><mo>}</mo></mrow><mrow><mo>{</mo><mi>y</mi><mi>z</mi><mo>}</mo><mo>′</mo></mrow></msubsup>. Talk about verbose!
  • I ripped out some features that were dear to me, such as the ability to use '<' and '>' to delimit tuples. This, I realized, was confusing in many contexts. (One feature I did keep is that the Termcat code <~ ... ~>  converts the delimiters to proper chevron glyphs.)
  • I made many parsing rules a lot stricter. This means there's less chance of accidentally triggering special Termcat features.
  • I improved the typography of Termcat documents.
  • I experimented with syntax for bound names and for lambdas.
  • I read up on finite state automata (which are used to match regular expressions) and finite state transducers (which can do substitutions). One outcome of this is the conjecture that if I change the semantics of my rewriting system somewhat then my rewriting rules may become composable. This means that instead of having to do a pass over all tokens for every rule, it may be possible to automatically combine all rules first and then run them all at once in a single pass. I already alluded to this in an earlier blog post and I might do some work on this this week.
So there. Maybe I wasn't so unproductive after all!

Thursday, November 7, 2013

Halfway

I'm halfway through Hacker School and Termcat is reaching a point where I'm thinking of demoing it internally. I will probably do so next week.

Here are some of the features I implemented the past few days:
  • Added lexically scoped variables.
  • Added support for selecting a calligraphic math font. The Termcat code +M is equivalent to the LaTeX code \mathcal{M}.
  • It's now possible to use HTML tags as well as numerical and named entities in Termcat without escaping them. They 'fall through' and are outputted as raw HTML.
  • Added special support for the unary minus operator: -x is syntactic sugar for -~ x. In contrast, for the binary operator the notation x ~-~ y is used (or x - y after setting up the necessary bindings).
  • Like in LaTeX, whitespace is now more wide if it follows a colon or full stop (but not an acronym).
  • It's now possible to write hé as h%'e and nǐ hǎo as n%3i h%3ao. (I should learn more about Pinyin and figure out if it would be possible to convert the standard notation ni3 hao3 automatically into nǐ hǎo!)
  • Arbitrarily wide spaces can be inserted using the notation %_1.2em. Manual kerning is possible using the notation %+2px or %-.5ex.
  • Added support for superscript (x^y), subscript (x_y)  and fractions (x|y) in math mode. Similar to italics and other decorators, the outmost brackets in an expression like (1 + x)|(2 - y) are removed, so as to allow for spaces in the numerator and denominator.
I'm quite happy with my productivity as of late. It seems the foundations I created are holding up fairly well, and this means I can add new features easily.

Next week I will be working on adding user-defined functions (lamdas) to Termcat. That should be fun!

Friday, October 18, 2013

Termcat: Details On the Current Implementation and the Potential for Optimization

In my previous blog post I gave a general overview of the rewriting system behind Termcat. Below I explain how this system is implemented in Clojure. I also mention some potential optimizations.

This blog post assumes basic familiarity with Clojure and Termcat.

The General Case

Recall that rewriting rules transform trees of tokens into trees of tokens. Equivalently, with term a token or a sequence of tokens, they transform sequences of terms into sequences of terms.

In Clojure we represent rewriting rules by functions that take the following arguments:
  • An object state
  • A sequence result
  • A term v
Rules return pairs of the following elements:
  • An object new-state
  • An update sequence new-result
Rules have the following meta-data (meaning, things you need to know to apply the rule):
  • An integer padding-count
  • An object init-state
When we run (rewrite terms rule) this happens:
  • Every term t in terms that is a sequence is replaced by the result of (rewrite t rule). (Recursion ends when terms is a sequence of tokens.)
  • Next, we call (rule init-state (repeat padding-count nil) (first terms))
  • The function rule is called once for all other terms in terms (both tokens and sequences), where the state and result passed to each call is what the previous call returned
  • Finally, rule is called padding-count more times with nil passed for the last element
Interestingly, all but two rules fit the following simple pattern:
  • They do pattern matching on the term or token type (see my previous post) of v and of the last padding-count elements of result
  • The new result that they return is the old result, minus possibly its padding-count last elements, plus a number of new terms
The rule that converts the token sequence into an abstract syntax tree is the most notable rule that does not adhere to this pattern. It needs to be able to pop any number of elements from the accumulated result.

The other rule that violates the above pattern is the rule for evaluating functions. This is because this rule may need to pop any number of function arguments from the accumulated result. However, this is merely a matter of convenience. By using currying we could make this case conform to the aforementioned pattern.

A Macro For the Simple Pattern

Termcat uses a macro defrule to deal with the simple pattern. Using this macro a rule might look like this:

(defrule remove-superfluous-whitespace
  [state t1 t2]
  tt
  [_ (:or nil
          :emptyline
          [:block :indent]
          [:block :bullet]) :whitespace] [nil t1]
  [_ :whitespace (:or nil
                      :emptyline
                      [:block :indent]
                      [:block :bullet])] [nil t2])

The first argument assigns names to the states and the tokens that the rule wants to read. Implicitly, thereby, it also specifies the number of tokens it wants to read and the desired padding-count (number of tokens, minus 1). Here, it reads one element t1 from result. The second element t2 corresponds to v.

The second argument, tt, specifies that we want to match on the term-type of t1 and t2. The remaining arguments do pattern matching (using core.match) on state, (tt t1), and (tt t2).

A rule defined with defrule is expected to return a vector of the new state and the tokens that should replace the tokens it read. Alternatively, nil can be returned to indicate that nothing should change. This is also what happens if none of the patterns match the input.

Finite State Machines

The rules we discussed are simple enough that the matching part of the rules can be transformed into regular expressions or finite state machines. The difficulty is handling the case where the rules make substitutions efficiently.

In essence, transforming a Termcat document into an HTML document is a matter of applying a bunch of rules sequentially. That is, we're faced with something like this:

(-> (read-termcat-document "hello.tc")
    (rewrite tokenizer-rule1)
    (rewrite tokenizer-rule2)
    ...
    (rewrite generate-abstract-syntax-tree)
    (rewrite post-ast-rule1)
    (rewrite post-ast-rule2)
    ...
    (write-html-document))

It would be interesting to investigate whether or not the different rewriting rules could be automatically—using macros—merged into a single finite state machine. Ideally we would end up with something like this:

(-> (read-termcat-document "hello.tc")
    (rewrite merged-tokenizer-rules)
    (rewrite generate-abstract-syntax-tree)
    (rewrite merged-post-ast-rules)
    (write-html-document))

Parallelism and Fast Recompilation

All of the rewriting rules (including the non-simple ones) are deterministic. Their output depends only on the following factors: (i) the state, (ii) the tokens that they read from result, and (iii) v.

This is incredibly exciting! Here's why: Parallelism and Fast Recompilation.

Suppose we want to rewrite a dozen tokens using a rule rule1. The first token has the payload (or content) '1', the second token has the payload '2', and so on. What we could do is apply rule 1 in parallel to (i) the complete sequence of 12 tokens and (ii) to the sequence of token 7-12. These two processes are depicted below.
The coloring indicates which calls to rule1 have identical input (meaning the tokens they read from result, their state, and v are identical).

We can see that for token 7 the input to rule1 is different for both processes. This could be because in the first process rule1 rewrote token 5 or 6. It could also be because the state is different for both function calls. Here we should discard the output of the second process. When the first process arrives at the eight token, however, we see that its arguments for rule1 are identical to the arguments that the second process used. This is great. This means we can terminate the first process and append the result of the first process to the result of the second process (discarding those tokens that were emitted after the second process rewrote token 7 and that were not read in by rule1 upon rewriting token 8).

Similar optimization should be possible when the source code of a Termcat document is modified. Think fast (partial) recompilation.

Three caveats:

  1. Coordinating this sort of parallelism is inherently tricky. For small inputs it would definitely not be worth it. And what's 'small' in this context?
  2. Moreover, it's not clear if the overhead of comparing the arguments for rule1 would, in practice, ever be worth the time saved by not having to do redundant rewriting.
  3. This optimization is not applicable to the rule for generating abstract syntax trees since this rule (i) tends to keep a lot of state around and (ii) may need to read an arbitrary number of elements from result.
Note that much of the last two sections is speculation. I might very well dig deeper into these things at some point because I think that would be a lot of fun. Nevertheless, this will probably have to wait until after Hacker School since I want to focus on more basic features first.

Meanwhile, I'm approaching a point where I want to start demoing Termcat to people and get feedback. In my next blog post I will give an overview of the features currently implemented.

Saturday, October 12, 2013

The Termcat Rewriting System and Its Different Stages

In my previous blog post I described some of the syntax of Termcat, a markup language that I have been working on while attending Hacker School. In this post I'd like to outline how I'm implementing it.

Termcat documents can be thought of as a sequence of Termcat expressions. Compiling Termcat documents into HTML involves applying a series of transformations to these expressions until every one of them is reduced to a piece of HTML code. The concatenation of this HTML code, wrapped in some leading and trailing code, constitutes the output document.

When I say 'applying transformations' I'm being very literal. Everything after reading the file and before mixing in the leading and trailing code is implemented as a rewriting system. There are six stages to this process. The purpose of this blog post is to explain them one by one.

Pre-tokenization (1/6)

The Termcat document is initially loaded as a sequence of Unicode characters. Next, every one of these characters is mapped onto a 'token'. Think of a token as a pair consisting of (i) a token type and (ii) a payload. The payload is the character. The token type is, at this point in the 6-step process, solely based on the character.

A superset of the following mapping is used:

  • \ -> :escape
  • Space -> :whitespace
  • Newline character -> :newline
  • ( -> [:ldelim :parenthesis]
  • ) -> [:rdelim :parenthesis]
  • . or : -> :maybe-fun
  • # -> :maybe-title
  • * or _ or - -> :maybe-magic
  • Everthing else -> :default
In the above list, everything following an arrow is Clojure code. In Clojure brackets delimit vectors and their elements are separated by spaces. Terms following colons are 'keywords'. You can think of keywords as strings for internal use (as opposed to input/output and string manipulation).

Notice that parentheses have a vector for a type. The same is true for [&npbsp;] brackets, {&npbsp;} braces, and <&npbsp;> chevrons, but we omit these delimiters here. Later, in the tokenization phase, we will also convert bullet items and indentation to delimiter tokens.

Also notice that some types are named "maybe-". That's a gentle reminder to ourselves that these tokens types are rough guesses. To wit, as in Markdown the string "# A Title" at the start of a line would be a section title, but "A#B" is never a title, even though the character # is initially always assigned the :maybe-title type.

Tokenization (2/6)

At the second stage we get more serious about tokens. To start, all successive occurrences of tokens of types :default, :whitespace, or :maybe-magic are merged. Thus the series [:default "N"] [:default "e"] [:default "a"] [:default "t"] is transformed into [:default "Neat"]. At this stage successive instances of two or more newlines are also transformed into :emptyline tokens. Everything following an escape token is given the :default token type and the escape tokens are removed. Indented blocks are wrapped in [:ldelim :indent] and [:rdelim :indent] tokens. Bullet items are wrapped in [:ldelim :bullet] and [:rdelim :bullet]. Finally, superfluous whitespace—e.g. whitespace before an [:rdelim :indent] token—is removed.

Abstract syntax tree generation (3/6)

In the first stage we converted a sequence of characters into a sequence of tokens. In the second stage we transformed this sequence of tokens into a more meaningful sequence of tokens. Notice, however, that this latter sequence implicitly describes a tree.

So far we have considered one kind of term—viz. tokens. We now add a new kind of term—blocks. Blocks consist of two delimiter tokens (one that matches the type [:ldelim _] and one that matches the type [:rdelim _]) and a sequence of terms. We refer to this sequence as the block's 'center'.

Abstract syntax tree (AST) generation is very intuitive. We simply replace all sequences of the (balanced) pattern [:rdelim _] terms* [:rdelim _] by a block term. For instance,

[:ldelim :indent] [:default "Hi"] [:whitespace " "] [:default "there"] [:rdelim :indent]
becomes
[:block [:ldelim :indent] [[:default "Hi"] [:whitespace " "] [:default "there"]] [:rdelim :indent]]
This is great because from here on we can say things such as 'an empty line, followed by an indented block, followed by another empty line' is a blockquote.

Desugarization (4/6)

Termcat wants to have its cake and eat it too. It wants to be nice and comfy to type in, like Markdown, but it also wants to be powerful and simple like TeX. Here's how.

There's a sugar-free version of Termcat that can do everything that regular Termcat can. However, it doesn't have any of Markdown's syntax. Instead, to add a section you would type ":section(This is a title)", to add emphasis you type ":emph(emphasis)" to create a bullet list you type ":bullet-list(Item 1)(Item 2)(Item 3)", and so on.

The desugarization step is responsible for translating Markdown syntax into sugar-free Termcat code. It results in sequences such as [:fun some-clojure-function] [:block [:ldelim :bullet] [...] [:rdelim :bullet]] [:block [:ldelim :bullet] [...] [:rdelim :bullet]]. Specifically, it results in :fun terms (similar to tokens, but with a Clojure function for payload) followed by zero or more :block terms, representing the arguments.

Lambda (5/6)

In the fifth stage computations are performed and HTML code is generated. Given a successive occurrence of a :fun term and one or more :block terms, the payload of the :fun term (a function) is invoked with the center of each block for arguments.

Roughly, the functions whose names start with a colon (:emph, :bullet-list, etc.) wrap their arguments in :html tokens. For example, ":emph(Test!)" is first rewritten to

[:fun colon-emph-function] [:block [:ldelim :parenthesis] [[:default "Test!"]] [:rdelim :parenthesis]]
and is then rewritten to
[:html "<emph>"] [:default "Test!"] [:html "</emph>"]

HTML output (6/6)

After all :fun terms have been evaluated we have a tree of terms of various types. In the sixth and final stage we iterate over all these terms and convert them to :html terms. For instance, the payload of :default tokens is HTML-escaped, :whitespace tokens are replaced by a space character, and :block terms are flattened.

So that's the gist of the Termcat rewriting system. I made some simplifications and omitted some details—the elephant in the room being mathematical expressions—but I hope to have shown that rewriting systems and markup languages can make for a good match. In the next installment I will explain how I'm implementing this system in Clojure.

Thursday, July 10, 2008

Getting Rid of Legacy Text Encoding

In the process of migrating my sites to a new server, I am also making my HTML code 7-bit clean. This should finally relieve me from ever having to manually change the character set in my terminal and the $LANG variable in my shell environment.

I tried to find a program that would do the conversion for me but, alas, I did not find anything that would help me get the job done quickly. Hence, I decided to write my own script.