Building a better Tag-Parser
If you read the code of my Extensible BBCode module for [[Drupal]], you'll notice that my tag parsing algorithm is kind of complex. In two steps, the tags are first "paired" (inserting a matching ID into the opening and closing tag) and then "rendered", all by an evaluated regular expression that calls another function. The process certainly ensures that all tags are balanced - and it even allows some tags to be rendered before others are, regardless of nesting.
But for all its voodoo magic and messing about, it is vulnerable to the same bug as most primitive BBCode parsers. Namely, it accepts tags that overlap instead of nesting, and the overlap is carried straight into the HTML output.
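To make the bug concrete, here is a minimal Python sketch of a deliberately naive renderer. It is not XBBCode's code: the tag table `SIMPLE_TAGS` and the function `naive_render` are made up for illustration, and the only assumption is that each tag is replaced on its own without any nesting check.

```python
import re

# Hypothetical tag table: each BBCode tag maps straight to one HTML element.
SIMPLE_TAGS = {"b": "strong", "i": "em"}


def naive_render(text: str) -> str:
    """Replace every opening and closing tag on its own, with no nesting check."""
    for bb, html_name in SIMPLE_TAGS.items():
        text = re.sub(rf"\[{bb}\]", f"<{html_name}>", text)
        text = re.sub(rf"\[/{bb}\]", f"</{html_name}>", text)
    return text


# The [i] tag is opened inside [b] but closed outside of it.
print(naive_render("[b]bold [i]both[/b] italic[/i]"))
# prints: <strong>bold <em>both</strong> italic</em>
```

The `em` element starts inside `strong` and ends outside of it, which is exactly the kind of overlap meant here.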
Now, the [[quirks mode]] of all [[layout engines]] is very good at dealing with simple cases like that. Inline markup will be rendered correctly even with overlapping elements.
But what if the overlap involves block elements? In [[phpBB]], I recently noticed that it is extremely easy to disrupt the page layout using just two simple tags.
The following overlapping input will be accepted:
The following output will (roughly) be generated:
Unlike the inline markup, this trips up several major layout engines ([[Gecko (layout engine)|]], [[WebKit]] and [[Presto (layout engine)|]] are the ones tested). The quirks mode behavior in all three drops the three unmatched div tags inside the blockquote element, and then applies the three closing div tags to parent div elements higher up in the tree. This naturally wreaks havoc on the page. (I'd almost consider this a mild denial-of-service vulnerability, because any malicious user can break the page layout.)
XBBCode is vulnerable to the same problem, because in spite of all its careful "pair matching", it never actually checks that the tag tree is nested correctly.
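The missing check is small. A minimal sketch in Python, assuming (for illustration only) that anything of the form `[name]` or `[/name]` counts as a tag; `TAG` and `is_properly_nested` are hypothetical names, not part of XBBCode:

```python
import re

# Assumption: a tag is a lowercase name in square brackets, optionally with "/".
TAG = re.compile(r"\[(/?)([a-z]+)\]")


def is_properly_nested(text: str) -> bool:
    """Return True only if every closing tag matches the most recent open tag."""
    stack = []
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False  # closing tag without a matching innermost open tag
    return not stack  # leftover open tags also count as a failure


print(is_properly_nested("[quote][b]x[/b][/quote]"))  # prints: True
print(is_properly_nested("[b][i]x[/b][/i]"))          # prints: False
```

Tags can be balanced (one closing tag for every opening tag) and still fail this test, which is the gap that pair matching alone leaves open.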
So now I'm working on a much simpler and in my opinion more stable approach. I can only suppose it hadn't occurred to me earlier because I was thinking that BBCode had to be rendered "in-place", replacing each tag in the string where it was found. It seems that a better approach is to first find all tags, and then gradually build the output from substrings of the original text.
In pseudocode (the actual PHP code is not as clean, alas):
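Spelled out as a small runnable Python sketch rather than PHP, the idea might look like the following. Everything named here is an assumption made for illustration: the renderer table `RENDERERS`, the tag pattern `TAG`, and the function `render_bbcode` are not XBBCode's API, and attributes on tags are left out entirely.

```python
import re

# Hypothetical renderers: each one receives its already rendered content.
RENDERERS = {
    "b": lambda content: f"<strong>{content}</strong>",
    "i": lambda content: f"<em>{content}</em>",
    "quote": lambda content: f"<blockquote>{content}</blockquote>",
}

# Assumption: a tag is a lowercase name in square brackets, optionally with "/".
TAG = re.compile(r"\[(/?)([a-z]+)\]")


def render_bbcode(text: str) -> str:
    # Each stack frame is [tag name, source of the opening tag, content so far].
    # The bottom frame is the virtual root that encloses the whole input.
    stack = [[None, "", ""]]
    pos = 0
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2)
        stack[-1][2] += text[pos:match.start()]  # plain text before this tag
        pos = match.end()
        if name not in RENDERERS:
            stack[-1][2] += match.group(0)  # unknown tag: keep as text
        elif not closing:
            stack.append([name, match.group(0), ""])
        elif any(frame[0] == name for frame in stack[1:]):
            # Discard intermediate unclosed tags: their opening tag is shown
            # as literal text, but whatever was rendered inside them is kept.
            while stack[-1][0] != name:
                _, source, content = stack.pop()
                stack[-1][2] += source + content
            _, _, content = stack.pop()
            stack[-1][2] += RENDERERS[name](content)
        else:
            stack[-1][2] += match.group(0)  # stray closing tag: keep as text
    stack[-1][2] += text[pos:]
    # Tags still open at the end of the input are discarded the same way.
    while len(stack) > 1:
        _, source, content = stack.pop()
        stack[-1][2] += source + content
    return stack[0][2]


print(render_bbcode("[b]bold [i]both[/b] italic[/i]"))
# prints: <strong>bold [i]both</strong> italic[/i]
```

The output is built only from substrings of the input and from the return values of renderers, so every element that is opened is also closed in the right order.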
You can see that even though the algorithm and the data it operates on are iterative, the text is rendered recursively from the inside out: whenever a tag is closed, it is rendered and the output appended to the parent element's content.
If the tag being closed lies further down the stack, then all intermediate unclosed tags are discarded (though any complete tags inside them are kept). In the end, the same is considered to happen to the virtual "root" tag that encloses the entire input: unclosed tags are discarded. (Note that "discarded" means "displayed in the output as unrendered BBCode", as is customary.)
The first and greatest downside to this method is that tags cannot be weighted anymore. Every tag renderer will receive its content with all nested BBCode tags already rendered. Deferring the rendering of a tag is just not feasible in this algorithm.
However, this can be trivially circumvented by using a feature that is a lot easier to implement. Tags can be given a "nocode" property, which will tell this algorithm to ignore any tags nested inside it. The tag renderer could then independently run the BBCode filter on its content to render the tags that were left unrendered earlier.
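A minimal sketch of how such a property could fit into a stack-based parser, again in Python and again with made-up names: `NOCODE`, `RENDERERS`, `TAG` and `render_bbcode` are assumptions for illustration, and the `code` tag that escapes its raw content is just one plausible use.

```python
import html
import re

# Assumption: a tag is a lowercase name in square brackets, optionally with "/".
TAG = re.compile(r"\[(/?)([a-z]+)\]")

# Hypothetical tags: "code" is marked nocode and receives its raw content.
RENDERERS = {
    "b": lambda content: f"<strong>{content}</strong>",
    "code": lambda content: f"<pre>{html.escape(content)}</pre>",
}
NOCODE = {"code"}


def render_bbcode(text: str) -> str:
    stack = [[None, "", ""]]  # [name, opening tag source, content]; root first
    pos = 0
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2)
        top = stack[-1]
        top[2] += text[pos:match.start()]
        pos = match.end()
        inside_nocode = top[0] in NOCODE
        if inside_nocode and not (closing and name == top[0]):
            top[2] += match.group(0)  # ignored: stays in the content as raw text
        elif name not in RENDERERS:
            top[2] += match.group(0)  # unknown tag: keep as text
        elif not closing:
            stack.append([name, match.group(0), ""])
        elif any(frame[0] == name for frame in stack[1:]):
            while stack[-1][0] != name:  # discard intermediate unclosed tags
                _, source, content = stack.pop()
                stack[-1][2] += source + content
            _, _, content = stack.pop()
            stack[-1][2] += RENDERERS[name](content)
        else:
            top[2] += match.group(0)  # stray closing tag: keep as text
    stack[-1][2] += text[pos:]
    while len(stack) > 1:  # unclosed tags are shown as unrendered BBCode
        _, source, content = stack.pop()
        stack[-1][2] += source + content
    return stack[0][2]


print(render_bbcode("[b]a[/b] [code][b]raw[/b][/code]"))
# prints: <strong>a</strong> <pre>[b]raw[/b]</pre>
```

A renderer that wants deferred rendering could call `render_bbcode` on its raw content itself. Note that this sketch also swallows every tag after a nocode tag that is never closed, so `[code][b]x[/b]` comes out unrendered.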
(I'm not yet at the point where a dangling nocode tag no longer stops the tags enclosed in it from being rendered. It's not trivial, alas.)