Monday, November 4, 2013

How Termcat Parses Mathematical Expressions

In my first blog post on Termcat I explained that one of my primary goals was to create a markup language that has a more natural syntax for writing mathematical expressions than LaTeX.

I mentioned the following expression as an example:
In LaTeX the code for this expression looks like this:
$E = \{\langle a, n, n' \rangle \subseteq I \times N \times N \mid Pa \text{ and } n < n' \}$
I set myself the goal of allowing the same expression to be generated from the following code:
E = {<a, n, n'> :subseteq I :times N :times N | Pa \and n < n'}
I mostly have something like this (but better) working!

A heuristic approach

One of the driving ideas behind Termcat's syntax is that when it recognizes an infix operator, it should know that there is normally a mathematical expression to the left and to the right of that operator. It should also be able to make similar inferences from prefix and postfix operators.

By way of example, the 'raw' syntax for operators is as follows:
~=~ : infix operator =, automatic spacing
~~|~~ : infix operator |, forces normal spacing
~~! : postfix operator !, normal spacing to the left
#~~~ : prefix operator #, wide spacing to the right
Operators must be surrounded by whitespace on the side of the tildes.
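
To make the tilde convention concrete, here is a minimal Python sketch that classifies a single whitespace-delimited token. It is not Termcat's actual implementation. The function name classify and the mapping from tilde count to spacing (one tilde for automatic, two for normal, three for wide) are assumptions inferred from the four examples above.

```python
import re

# Assumed mapping from tilde count to spacing, inferred from the four examples.
SPACING = {1: "automatic", 2: "normal", 3: "wide"}

# Optional tildes, then a symbol without tildes or whitespace, then optional tildes.
OPERATOR = re.compile(r"(~*)([^~\s]+)(~*)")


def classify(token):
    """Return a description of an operator token, or None for a plain word."""
    match = OPERATOR.fullmatch(token)
    if match is None:
        return None
    left, symbol, right = match.groups()
    if not left and not right:
        return None  # no tildes at all, so this is not an operator
    if left and right:
        fixity = "infix"    # operands expected on both sides
    elif left:
        fixity = "postfix"  # operand expected on the left only
    else:
        fixity = "prefix"   # operand expected on the right only
    return {
        "symbol": symbol,
        "fixity": fixity,
        # Sides without tildes (or with an unknown count) get None.
        "left": SPACING.get(len(left)),
        "right": SPACING.get(len(right)),
    }


for token in ["~=~", "~~|~~", "~~!", "#~~~", "Pa"]:
    print(token, classify(token))
```

The side on which the tildes appear tells the parser where to expect an operand, which is the inference described in the previous section.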

Using this syntax the expression above can be encoded like this:
E ~=~ {~ ⟨~ a ~,~ n ~,~ n' ~⟩ ~⊆~ I ~×~ N ~×~ N ~~|~~ Pa and n ~<~ n' ~}
(There's magic involved in getting n' to display correctly, but let's ignore that. Also, MathML doesn't seem to define default spacing for '|' so it needs to be surrounded by double tildes.)

Termcat also has heuristics for parentheses, brackets, braces, and chevrons, and this allows us to omit some of the tildes:
E ~=~ {<a ~,~ n ~,~ n'> ~⊆~ I ~×~ N ~×~ N ~~|~~ Pa and n ~<~ n'}
The output is nearly identical to what LaTeX generates:
The Termcat code can be simplified further. First, however, I need to introduce another Termcat feature.

Intermezzo: lexical bindings

I'm currently working on adding user-defined 'bindings' or substitutions of standalone words to Termcat. The idea is that
!bind(test)(*test*)
test
should be rewritten into
*test*
Bindings are lexically scoped, where scope is delimited by parentheses, brackets, braces, chevrons, indentation, and bullet list syntax. Hence
(!bind(test)(*test*)
test)

test
is rewritten into
*test*

test
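
The scoping rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how Termcat is implemented: the document is assumed to be already parsed into a nested list, where an inner list stands for a scope (such as a pair of parentheses) and a hypothetical ("bind", name, replacement) tuple stands for a !bind declaration.

```python
def rewrite(nodes, bindings=None):
    """Apply lexically scoped bindings to a nested list of words."""
    # Copy the enclosing environment so that bindings made in this scope
    # do not leak out once the scope ends.
    env = dict(bindings or {})
    out = []
    for node in nodes:
        if isinstance(node, tuple) and node[0] == "bind":
            _, name, replacement = node
            env[name] = replacement           # visible for the rest of this scope
        elif isinstance(node, list):
            out.append(rewrite(node, env))    # inner scope inherits the bindings
        else:
            out.append(env.get(node, node))   # substitute standalone words
    return out


# The parenthesized example above: the binding applies inside the scope only.
document = [[("bind", "test", "*test*"), "test"], "test"]
print(rewrite(document))
```

Because each scope works on its own copy of the bindings, the second test outside the parentheses is left untouched.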

Towards a more natural syntax for mathematical expressions

Bindings can be used to remove the remaining tildes in the Termcat code. Consider the following declarations:
!bind
- =
- ~=~
!bind
- ,
- ~,~
!bind
- subseteq
- ~⊆~
!bind
- *
- ~×~
!bind
- |
- ~~|~~
!bind
- \<
- ~<~
For now the idea is that these bindings have to be set in every Termcat document. I may add a default bindings table at a later point though. In any case, after the above bindings have been defined it should be possible to write the following code:
E = {<a , n , n'> subseteq I * N * N | Pa and n < n'}
That looks a lot more readable than the LaTeX code if you ask me! In fact, I think it's nicer than the syntax I originally envisioned.
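
As a rough check that these bindings recover the tilde syntax shown earlier, here is a small Python sketch that substitutes standalone, whitespace-separated words. The BINDINGS table mirrors the declarations above, and the expand function is a hypothetical helper, not part of Termcat. I am assuming that the backslash in \< is only an escape, so the bound word is a plain <.

```python
# Bindings table mirroring the declarations above.
BINDINGS = {
    "=": "~=~",
    ",": "~,~",
    "subseteq": "~⊆~",
    "*": "~×~",
    "|": "~~|~~",
    "<": "~<~",  # declared as \< above; the backslash is assumed to be an escape
}


def expand(source, bindings=BINDINGS):
    """Replace each standalone word that has a binding; leave the rest alone."""
    return " ".join(bindings.get(word, word) for word in source.split())


print(expand("E = {<a , n , n'> subseteq I * N * N | Pa and n < n'}"))
```

Words such as {<a or n'} are not standalone bound words, so they pass through unchanged and are left to the bracket heuristics.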

One obvious further improvement might be to treat commas (and semicolons) as special by default. This would obviate the need to surround commas with whitespace. I will look into this too.