Building a better Tag-Parser

The Problem

If you read the code of my Extensible BBCode module for [[Drupal]], you'll notice that my tag parsing algorithm is kind of complex. In two steps, the tags are first "paired" (inserting a matching ID into the opening and closing tag) and then "rendered", all by an evaluated regular expression that calls another function. The process certainly ensures that all tags are balanced - and it even allows some tags to be rendered before others are, regardless of nesting.

But for all its voodoo magic and messing about, it is vulnerable to the same bug as most primitive BBCode parsers. Namely, allowing input that will result in this output:

<strong>This sentence is <em>not</strong> valid HTML code.</em>

Now, the [[quirks mode]] of all [[layout engines]] are very good at dealing with simple cases like that. Inline markup will be rendered correctly even with overlapping elements.

But what if it includes block elements? In [[phpBB]], I recently noticed that it is extremely easy to disrupt the page layout using just two simple tags; [spoiler] and [quote].

The following overlapping input will be accepted:

[quote] [spoiler] [/quote] [/spoiler]

The following output will (roughly) be generated:

<blockquote> <div><div><div> </blockquote> </div></div></div>

Unlike the inline markup, several major layout engines ([[Gecko (layout engine)|]], [[WebKit]] and [[Presto (layout engine)|]] are the ones tested) take issue with this. The quirk mode behavior in all three drops the three unmatched div tags inside the blockquote element, and then applies the three closing div tags to parent div elements higher up in the tree. This naturally wreaks havoc on the page. (I'd almost consider this a mild denial-of-service vulnerability because any malicious user can break the page layout.)

XBBCode is vulnerable to the same, because in spite of all its careful "pair matching", it never actually checks that the tag tree is nested correctly.

The Algorithm

So now I'm working on a much simpler and in my opinion more stable approach. I can only suppose it hadn't occurred to me earlier because I was thinking that BBCode had to be rendered "in-place", replacing each tag in the string where it was found. It seems that a better approach is to first find all tags, and then gradually build the output from substrings of the original text.

In Pseudocode (the actual PHP code is not as clean, alas):

1. Let `tags` be a list of the opening and closing tags in the input. Every element has the properties .name, .is-opening, .args, .index, .length .index is the string offset of the first character of the tag - "[". .element is the whole tag's string, from "[" to "]". Let `open_tags` be a stack of tags. Every element has the properties .name, .args, .content, .element and .offset Let `open_tags_by_name` be a dictionary of String -> Int. 2. For every tag `T` in `tags`: Is `T.is-opening` true? Then: Let `Parent` be the top element of `open_tags` (leave it on the stack). Append the substring between `Parent.offset` and `T.index` to `Parent.content`. Let `Parent.offset` be `T.index`. Let T2 be: .name : `T.name`, .args : `T.args`, .content : "", .index : `T.index` + `T.length` .element : `T.element` Push T2 onto `open_tags`. Increment `open_tags_by_name[T.name]` by 1. Else: If `open_tags_by_name[T.name] greater than 0? Then: Pop an element off `open_tags` and call it `Current`. While `T.name` is not equal to `Current.name`: Decrement `open_tags[Current.name]` by 1. Let `content` be `Current.element` concatenated with `Current.content` Let `offset` be `Current.offset` Pop an element off `open_tags` and call it `Current`. Append `content` to `Current.content` Let `Current.offset` be `offset` Let `Parent` be the top element of `open_tags` (leave it on the stack). Append the substring between `Current.offset` and `T.index` to `Current.content` Let `output` be the output of rendering `Current` Append `output` to `Parent.content` Decrement `open_tags[Current.name]` by 1. 3. Let `Current` be the top element of `open_tags` (leave it on the stack). Append the substring between `Current.offset` and the end of input to `Current.content` 4. While `open_tags` has more than 1 element: Pop an element off `open_tags` and call it `Current` Let `Parent` be the top element of `open_tags` (leave it on the stack). Append `Current.element` to `Parent.content` Append `Current.content` to `Parent.content` 5. Let `Root` be the only element of `open_tags`. Return `Root.content`

You can see that even though the algorithm and the data it operates on is iterative, the text is rendered recursively from the inside out: Whenever a tag is closed, it is rendered and the output appended to the parent element's content.

If the tag being closed lies further down the stack, then all intermediate unclosed tags are discarded (though any complete tags inside them are kept). In the end, the same is considered to happen to the virtual "root" tag, that encloses the entire input - unclosed tags are discarded. (Note that "discarded" means "displayed in the output as unrendered BBCode", as is customary.)

The Costs

The first and greatest downside to this method is that tags cannot be weighted anymore. Every tag renderer will receive its content with all nested BBCode tags already rendered. Deferring the rendering of a tag is just not feasible in this algorithm.

However, this can be trivially circumvented by using a feature that is a lot easier to implement. Tags can be given a "nocode" property, which will tell this algorithm to ignore any tags nested inside it. The tag renderer could then independently run the BBCode filter on its content to render the tags that were left unrendered earlier.

(I'm not yet at the point where a dangling nocode tag will not stop enclosed tags from being rendered. It's not trivial, alas.)

Keywords: 
News Category: 
© 2006-2012: All content, unless otherwise noted, is the property of Arancaytar. It may be copied and modified with attribution for non-commercial purposes. By publishing comments on this site, you grant Arancaytar a non-exclusive, perpetual license to reproduce and publish these comments along with any identifying information provided. (You may request your comments to be deleted or edited voluntarily.)