Monday, 23 August 2010

Indentation - Lex or Parse ?

A couple of questions on the squint forum got me thinking about this subject.

Muzzleflash:  "However, isn't the only thing determining basic indentation in sqf braces?"

Callihn:  "I tried the [CPP auto-indent], I don't think it should be deleting things?" 
Before I talk about indentation, let's start with an easier topic; syntax-highlighting, or markup.  The difference between lexical markup vs markup based on parsing is most easily demonstrated by an example using a couple of english sentences...

Turn left here
Take the next turn

Now suppose we wanted to highlight the verbs in these sentences using a lexical approach. In this case we simply look at each word in isolation and decide whether it is a verb or something else. Starting with the first sentence things look quite easy; we can simply look up the word 'turn' in a big list of known verbs and colour it red.  The second sentence gives us a problem though.  'turn' is actually a noun in this sentence so we end up colouring it incorrectly.  In contrast, a parse-based approach tries to understand the underlying structure of the sentence and detects that 'turn' is the subject of the verb 'take' and therefore must be a noun.

So much for basic lex/parse theory... what has this got to do with squint ?

Well, the whole point of squint is that it tries to construct a parse-tree of the file it is looking at rather than simply looking at individual words. So when you write
if (_a) then {_b=_c}; 

squint truly 'understands' this as
and uses that knowledge to highlight the operators it finds.  This is in contrast to most text-editors which simply take the lexical approach of looking up keywords in a dictionary.  For markup, there really aren't significant differences between one approach and another but using a parse-tree does mean that squint can do other interesting things like highlight the operands used by a particular operator.

For things like indentation though, the differences between the two approaches become more profound.  The lexical approach simply looks at each token in turn in other words, the indenter sees a sequence a bit like


That's where Muzzleflash's comment about 'counting the brackets' comes from because nothing else really matters. The nice thing about this approach is that it's quite easy to implement and quite 'robust'.  Ie the worst that will happen if you get it wrong is that the indentation will look a bit screwy.  The big disadvantage is that a lot of indentation rules are quite context-dependent.  For example, in the code

while {_a<10} do { for [{_i=0{,{_i<_a},{_i=_i+1}}] do {_z=_z+i;};_a=_a+1};

which I've deliberately written on one line to illustrate the problem, you probably want to linefeed on the braces after 'do' but not the other ones.  Building in these kinds of rules is a bit awkward with a lexical -indenter although it can be done and to some extent relies upon creating some basic parsing mechanism.

In contrast the parse-tree approach starts from a very different place.  Referring back to our original tree of
it's worth noting that in this representation the indenter knows nothing at all about brackets.  It simply infers that the contents of a CODE block should be wrapped in '{' and '}' when it comes time to present these to the viewer.  This makes a parse-tree approach potentially much more powerful. One can easily define special formatting rules for particular keywords or constructs because the whole context is available (at least in theory, in practice it's often a pain to do this).  Another point is that the indenter is guaranteed to output syntactically correct code!

However, this apparent strength makes the parse-tree approach quite 'brittle'. Since the original textual representation has been discarded by the time the indenter operates on the parse-tree, things can go very badly wrong if the parser does not accurately reflect the intention of the original input. To put it simply, in the worst case, the indenter can remove whole swathes of input that it couldn't parse!  Parse-tree based indenters really, really, prefer to operate on syntactically correct code !  It's interesting to note for example that the auto-indenter in Microsoft Visual Studio will refuse to operate until you have fixed all the syntax errors in your code, implying strongly that it is parse-tree based.

So coming back to the Callihns problem - "I don't think it should be deleting things"....  The explanation is really a variation on the problem above.  In this case I suspect the CPP parser was being used to indent SQF code.  Since CPP is a much simpler grammar than SQF, there are a number of 'missing' node-handlers in the indenter which mean that certain kinds of nodes in the tree can be completely omitted !  Eventually I'll fix this up to allow proper sqf indentation.

It's interesting though to consider the idea of a hybrid or dual-mode indenter.  After all, whilst writing code, it is almost always syntactically incorrect (ie, as you enter keywords, braces etc character-by-character there are very few occasions when the stream of input parses correctly).  So for this situation a lexical indenter would clearly be a better solution even if it can only perform a limited set of transforms on a local scope.  The parse-tree indenter could then be used to perform indentation across the whole file once syntactical errors have been removed.

No comments:

Post a Comment