tracking

Monday, 23 August 2010

Indentation - Lex or Parse ?

A couple of questions on the squint forum got me thinking about this subject.

Muzzleflash:  "However, isn't the only thing determining basic indentation in sqf braces?"

Callihn:  "I tried the [CPP auto-indent], I don't think it should be deleting things?" 
 
Before I talk about indentation, let's start with an easier topic; syntax-highlighting, or markup.  The difference between lexical markup vs markup based on parsing is most easily demonstrated by an example using a couple of english sentences...

Turn left here
Take the next turn

Now suppose we wanted to highlight the verbs in these sentences using a lexical approach. In this case we simply look at each word in isolation and decide whether it is a verb or something else. Starting with the first sentence things look quite easy; we can simply look up the word 'turn' in a big list of known verbs and colour it red.  The second sentence gives us a problem though.  'turn' is actually a noun in this sentence so we end up colouring it incorrectly.  In contrast, a parse-based approach tries to understand the underlying structure of the sentence and detects that 'turn' is the subject of the verb 'take' and therefore must be a noun.

So much for basic lex/parse theory... what has this got to do with squint ?

Well, the whole point of squint is that it tries to construct a parse-tree of the file it is looking at rather than simply looking at individual words. So when you write
if (_a) then {_b=_c}; 

squint truly 'understands' this as
OP(OP(if,_a),then,CODE(OP(_a,=,_b))) 
and uses that knowledge to highlight the operators it finds.  This is in contrast to most text-editors which simply take the lexical approach of looking up keywords in a dictionary.  For markup, there really aren't significant differences between one approach and another but using a parse-tree does mean that squint can do other interesting things like highlight the operands used by a particular operator.

For things like indentation though, the differences between the two approaches become more profound.  The lexical approach simply looks at each token in turn in other words, the indenter sees a sequence a bit like

if...(..._a...)...then...{..._b...=..._c...}...;

That's where Muzzleflash's comment about 'counting the brackets' comes from because nothing else really matters. The nice thing about this approach is that it's quite easy to implement and quite 'robust'.  Ie the worst that will happen if you get it wrong is that the indentation will look a bit screwy.  The big disadvantage is that a lot of indentation rules are quite context-dependent.  For example, in the code

while {_a<10} do { for [{_i=0{,{_i<_a},{_i=_i+1}}] do {_z=_z+i;};_a=_a+1};

which I've deliberately written on one line to illustrate the problem, you probably want to linefeed on the braces after 'do' but not the other ones.  Building in these kinds of rules is a bit awkward with a lexical -indenter although it can be done and to some extent relies upon creating some basic parsing mechanism.

In contrast the parse-tree approach starts from a very different place.  Referring back to our original tree of
OP(OP(if,_a),then,CODE(OP(_a,=,_b))) 
it's worth noting that in this representation the indenter knows nothing at all about brackets.  It simply infers that the contents of a CODE block should be wrapped in '{' and '}' when it comes time to present these to the viewer.  This makes a parse-tree approach potentially much more powerful. One can easily define special formatting rules for particular keywords or constructs because the whole context is available (at least in theory, in practice it's often a pain to do this).  Another point is that the indenter is guaranteed to output syntactically correct code!

However, this apparent strength makes the parse-tree approach quite 'brittle'. Since the original textual representation has been discarded by the time the indenter operates on the parse-tree, things can go very badly wrong if the parser does not accurately reflect the intention of the original input. To put it simply, in the worst case, the indenter can remove whole swathes of input that it couldn't parse!  Parse-tree based indenters really, really, prefer to operate on syntactically correct code !  It's interesting to note for example that the auto-indenter in Microsoft Visual Studio will refuse to operate until you have fixed all the syntax errors in your code, implying strongly that it is parse-tree based.

So coming back to the Callihns problem - "I don't think it should be deleting things"....  The explanation is really a variation on the problem above.  In this case I suspect the CPP parser was being used to indent SQF code.  Since CPP is a much simpler grammar than SQF, there are a number of 'missing' node-handlers in the indenter which mean that certain kinds of nodes in the tree can be completely omitted !  Eventually I'll fix this up to allow proper sqf indentation.

It's interesting though to consider the idea of a hybrid or dual-mode indenter.  After all, whilst writing code, it is almost always syntactically incorrect (ie, as you enter keywords, braces etc character-by-character there are very few occasions when the stream of input parses correctly).  So for this situation a lexical indenter would clearly be a better solution even if it can only perform a limited set of transforms on a local scope.  The parse-tree indenter could then be used to perform indentation across the whole file once syntactical errors have been removed.

Line-numbers in squint - how hard could it be?

It should have been simple right? Just add another richtextbox alongside the code-window, fill it with numbers and then find some way to synchronise the scroll position of the two windows. Probably about an hour's work.  WRONG - so this is what actually happened....

  • Add the new line-number text-box (after creating yet another splitter panel for it to live in)
  • Fill it with numbers (as a quick hack we just go from 1 to 10000) and make sure we're using the same font as the codeview so the line-spacing is the same.
  • Try to synchronise them.... I know, I'll use the tried-and-trusted method of get the character index of the top-left corner, working out the line-number and then scrolling to the caret.  Unfortunately the results are horrible - richtextbox has it's own semi-random algorithm for deciding how far into the visible region the caret will be so the line-numbers are just not lining up at all.
  • A bit of googling later, I find some neat code that allows me to scroll down some number of lines. It means I have to abandon my 'members-only' principles and start using SendMessage but what-the-hell if it works....
  • ...only it doesn't.  Fortunately further googling reveals another approach using yet more nasty sendmessaging but by this time I no longer feel dirty about using it.  (Did I mention the diversion involved in trying to figure out how to use IntPtrs -back in my day we had physical and virtual memory but as usual Microsoft has taken a simple concept and put their own spin on it - don't get me started on their crappy Mutex API....) Anyway, fortunately the new approach of reading the scrollpos of the codewindow and simply setting the line-number scrollpos to the same value works like a charm.  Now I can whizz up and down in the code window with the line-numbers beautifully synchronised.  This has all taken less than a couple of hours and I'm feeling pretty cocky about showing this off and moving on to the next feature...
  • EXCEPT....
  • As I'm starting to look at implementing 'undo' I notice some of the lines aren't quite lined up with the line-numbers.  Double-checking that both textboxes are using the same zoomfactor and fontsize shows that they _should_ be using identical line-heights so this is very odd.
  • First theory - maybe using characters with descenders like ; or } causes that line to be higher ? I briefly contemplate the horrible idea that I will have to frig every line-number to match the height of the particular codewindow line it is supposed to line up with. The good news though is that it doesn't seem to have anything to do with descenders after all, just adding one more character to the line makes it grow in height.
  • The bad news though is that I can't figure out why. On closer inspection things are even weirder. As soon as I add a single character to the end of a line the font visibly 'grows' slightly as if it's being bolded. It's this effect which means the line heights in the codewindow are ever-so-slightly different from those in the line-number window.  Thus begins a descent into frustration and near madness that lasts nearly four hours....
  • I become convinced after some time that the problem has something to do with rich-text.  Squint uses an 'offline' rendering mechanism because marking up large files can take an appreciable amount of time (some portion of a second).  Therefore I construct a new richtextbox in a background thread, copy the text into it, mark it up with all those colours then copy back the rendered rich-text into the code-window. Somehow that round trip process is causing the rendered text to 'swell' when it's copied into the code window.   (It's probably useful to mention at this point that I have already had terrible problems with richtextbox.ZoomFactor which tends to reset itself at the slightest provocation and which can also cause font resizing during this process so there is some time involved in making sure that is not the problem here.)   I start breakpointing the copyRendering function I've made and doing a byte-by-byte comparison of the richtext being copied in versus what as there before. Needless to say there are some significant differences so I look up the rtf spec and read that for a while before deciding it would be quicker to write some code to try to massage the differences away before doing the copy. Then if I can find out which difference is causing the problem I can go away and read up the spec to figure out what I might do to fix this properly.
  • ...Some time later most of the differences have been successively .Replaced out and I'm feeling confident that the answer must be close. And yes... finally the rendered text doesn't 'swell' anymore.  Excitedly I comment out the substitution to allow the difference back in again so I can check this really is the one and.... err, wait a minute it's still not swelling even though it did a minute ago. WTF?!?!?  I go back and forth a few times before reluctantly concluding that however heavily I massage the incoming richtext string, I can't make this problem go away for good and the last hour or so has been a complete waste of time.  This also raises the frightening possibility that there is some persistent state in the richtextbox control that I am somehow corrupting with this background-render and copy trick -not a good thought at all.  
  • In some desperation I start all over again, this time disabling the zoomfactor-fudging completely and lo-and-behold I don't get the swelling.  Strange, I think but by this point I'm past caring and the easiest solution seems to be to to find some way to disable mousewheel zoom and leave the zoomfactor at 1.  Only the RichTextBox doesn't let you do that (by now I would gladly track down the author of this component and discuss his 'design' decisions with the blunt end of my laptop) so I'm stuck. 
  • Until.. more googling says the answer to this problem may be to create a subclassed variant and filter out the mousewheel events.  I don't really like the idea of doing this but hey, I tend to be a bit stubborn and having come this far I figure the extra effort is worth it even if I never actually saw the point of line-numbers in the first place
  • So a bit later I have my subclassed variant, I'm filtering out the mousewheel events but as an act of generousity to those who do like to zoom in and out of windows I implement a 'fake' zoom which simply changes the size of the font in the two windows to make it look like your're zooming.
  • Blimey - it all seems to work now and once again I'm preparing for a release. It was a lot of headbanging and frustration but I'm feeling that sense of accomplishment that comes from pushing through what looked like an insoluble problem.
  • BUT then.. no, it can't be.... the misaligned lines are back! Right now I can't believe that all that work to rip out half the guts of the UI and replace it with a new custom class has made no difference whatsoever.  I briefly consider throwing the laptop out of the window and cancelling squint development but decide to take 10 minutes out to walk around and get my head together instead.
  • Which helps, because once I make that decision and start thinking about what could possibly be left that I haven't tried, the solution starts to come together. Clearly it has something  to do with zoomfactors and fonts - I don't  always see the problem, in fact the more I look at it the more I realise it only occurs at certain levels of zoom.  Hmm, both zoomfactors and font sizes are specified as floating-point numbers - maybe there is some kind of rounding error causing the two textboxes to use ever-so-slightly-different sizes?
  • I implement some code to clamp the zoomed font size to an integer size and now the problem has gone away .... getting warmer.
  • And then,  finally,  I figure out what must be happening.  The text in the 'offline' rendering richtextbox is being rendered at a font size of 23.75 (to pick a random example).  When I copy the rtf from that textbox to the codewindow, the subtlety of the non-integer font size is getting lost and it's being rounded up to the next integer, 24.  Perhaps rtf can't express non-integer fontsizes, perhaps it's a bug in richtextbox ? I don't care.  I'm happy that the problem is solved.
  • EXCEPT... no 'except' this time, it really is solved :)