Building a better Tag-Parser, part 2

As I concluded in the first part, one problem remains with the strict "inner-tag-first" evaluation of code (which roughly equates to a Left-Right-Root traversal of the code tree). Namely, some tags, like [nocode], require that further code inside them is not parsed.

At first glance, this looks easy: just determine whether the currently open tag allows any code inside it, and if it does not, ignore any opening tags until it closes (closing tags must still be examined in case the nocode tag is dangling). There are several conditions under which this fails:

  1. The [nocode] tag is not closed. Even though it is ultimately discarded, all tags following it will remain unrendered. "Discarded" tags should not influence the rendering.
  2. The [nocode] tag is discarded because its parent tag is closed before the nocode tag itself is. This causes the same problem as condition 1, up to the closing parent tag.
  3. Even with valid code, if the nocode tag contains a valid pair of tags, and the opening one is ignored, then the closing one may be misinterpreted as closing a parent tag above nocode, resulting in condition 2.
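
To make the three conditions concrete, here is a minimal Python sketch of the naive approach. It is not the code from the first part: the tag set, the HTML templates and the name `naive_parse` are assumptions for illustration, and discarded tags are assumed to stay in the output as literal text.

```python
import re

# Hypothetical tag set; nocode simply emits its content verbatim.
RENDER = {"b": "<b>{}</b>", "i": "<i>{}</i>", "nocode": "{}"}
TOKEN = re.compile(r"(\[/?(?:b|i|nocode)\])")


def naive_parse(text):
    # Each frame is [tag_name, content_pieces]; frame 0 is the root.
    stack = [[None, []]]

    def discard():
        # A dangling tag is dropped; its opening tag stays as literal text.
        name, pieces = stack.pop()
        stack[-1][1].append("[%s]%s" % (name, "".join(pieces)))

    for token in TOKEN.split(text):
        is_tag = TOKEN.fullmatch(token)
        if is_tag and token[1] != "/" and stack[-1][0] != "nocode":
            stack.append([token[1:-1], []])  # opening tag outside nocode
        elif is_tag and token[1] == "/" and token[2:-1] in [f[0] for f in stack]:
            while stack[-1][0] != token[2:-1]:
                discard()
            name, pieces = stack.pop()
            stack[-1][1].append(RENDER[name].format("".join(pieces)))
        else:
            # Plain text, an opening tag ignored inside nocode,
            # or a closing tag without a matching opener.
            stack[-1][1].append(token)
    while len(stack) > 1:
        discard()
    return "".join(stack[0][1])


# Condition 1: the unclosed nocode tag leaves [b] unrendered.
print(naive_parse("[nocode][b]x[/b]"))                 # [nocode][b]x[/b]
# Condition 2: the same, up to the closing parent tag.
print(naive_parse("[i][nocode][b]x[/b][/i]"))          # <i>[nocode][b]x[/b]</i>
# Condition 3: the inner [/b] closes the outer [b].
print(naive_parse("[b][nocode][b]x[/b][/nocode][/b]")) # <b>[nocode][b]x</b>[/nocode][/b]
```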

From number 3 it is clear that even without rendering them, tags still have to be tracked inside the nocode block. But how to ensure that they will be retroactively rendered in case the nocode tag ends up discarded?

In this case, it is possible to recursively reapply the filter to the text block that hasn't yet been rendered. In pseudocode (some context added in parentheses; refer to the code of the first part), the additions and changes are:

  1. Let `nocode` be 0.
  2. (Inside the iteration over `tags`:)
       (If `T` is an opening tag:)
         If `T` is a nocode tag:
           Increment `nocode` by 1.
       (If `T` is a closing tag:)
         (While popping dangling tags off `open_tags`:)
           If `Current` is a nocode tag:
             Decrement `nocode` by 1.
             If `nocode` is 0:
               Run myself on the input `Current.content` and store it in `Current.content`
         (Instead of always rendering the tag when it is closed)
         If `nocode` is greater than 0:
           Append `Current.element` to `Parent.content`
           Append `Current.content` to `Parent.content`
           Append [/``] to `Parent.content`
         Else:
           Append the rendered form of `Current` to `Parent.content`
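
The steps above could be sketched in Python roughly as follows. This is not the code from the first part: the stack layout, the tag set in `RENDER`, the regular expression, and the choice to leave discarded opening tags in the output as literal text are all assumptions made to keep the example self-contained. It also assumes that tags still open at the end of the input count as dangling.

```python
import re

# Hypothetical tag set; nocode simply emits its content verbatim.
RENDER = {"b": "<b>{}</b>", "i": "<i>{}</i>", "nocode": "{}"}
NOCODE_TAGS = {"nocode"}
TOKEN = re.compile(r"(\[/?(?:%s)\])" % "|".join(RENDER))


def parse(text):
    # Each frame is [tag_name, content_pieces]; frame 0 is the root.
    stack = [[None, []]]
    nocode = 0  # number of currently open nocode tags

    def pop_dangling():
        """Discard the innermost open tag, keeping its opener as literal text."""
        nonlocal nocode
        name, pieces = stack.pop()
        content = "".join(pieces)
        if name in NOCODE_TAGS:
            nocode -= 1
            if nocode == 0:
                # The block was only provisionally protected: parse it after all.
                content = parse(content)
        stack[-1][1].append("[%s]%s" % (name, content))

    for token in TOKEN.split(text):
        if not TOKEN.fullmatch(token):
            stack[-1][1].append(token)  # plain text
        elif token[1] != "/":  # opening tag: always tracked, even inside nocode
            name = token[1:-1]
            stack.append([name, []])
            if name in NOCODE_TAGS:
                nocode += 1
        else:  # closing tag
            name = token[2:-1]
            if name not in [frame[0] for frame in stack]:
                stack[-1][1].append(token)  # no matching opener: literal text
                continue
            while stack[-1][0] != name:
                pop_dangling()
            _, pieces = stack.pop()
            content = "".join(pieces)
            if name in NOCODE_TAGS:
                nocode -= 1
            if nocode > 0:
                # Still inside a nocode block: keep the raw form for now.
                stack[-1][1].append("[%s]%s[/%s]" % (name, content, name))
            else:
                stack[-1][1].append(RENDER[name].format(content))

    while len(stack) > 1:  # tags left open at the end of input are dangling
        pop_dangling()
    return "".join(stack[0][1])
```

Under these assumptions, `parse("[nocode][b]x[/b]")` returns `[nocode]<b>x</b>` (condition 1), `parse("[i][nocode][b]x[/b][/i]")` returns `<i>[nocode]<b>x</b></i>` (condition 2), and `parse("[b][nocode][b]x[/b][/nocode][/b]")` returns `<b>[b]x[/b]</b>` (condition 3).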

I am not yet sure whether the recursion/backtracking can be maliciously exploited, in the sense that certain crafted input might take a lot of server resources to process. Even if that is the case though, it would be simple to put in some safeguards.
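
One conceivable safeguard, sketched in Python: each recursive pass runs on the content of a nocode tag, so the number of nocode openers in the input bounds how deeply the re-parsing can nest. Capping that number before parsing therefore caps the recursion depth. Whether this is the right bound for the real filter is untested; the limit, the function name and the literal `[nocode]` syntax below are hypothetical.

```python
import re

MAX_NOCODE_TAGS = 20  # hypothetical limit, to be tuned for the actual site


def within_budget(text, limit=MAX_NOCODE_TAGS):
    """Return True if the input opens at most `limit` nocode tags."""
    return len(re.findall(r"\[nocode\]", text)) <= limit
```

A caller could check `within_budget(text)` first and fall back to showing the input unrendered when the check fails.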

The upside of this change is that the "weight" of tags becomes completely obsolete. Instead of assigning every tag a position in a fixed order, each tag merely specifies whether it wants to be rendered before its content is, or after.
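
As a rough illustration of what such a per-tag declaration might look like, here is a small Python sketch. The `TagSpec` class, its field names and the three example tags are assumptions, not part of the actual filter. A nocode-style tag is modelled as one that is rendered before its content, so the content is never parsed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TagSpec:
    render: Callable[[str], str]  # turns the tag's content into output
    content_first: bool = True    # True: parse the content, then render the tag


# Hypothetical tag table: only nocode is rendered before its content.
TAGS = {
    "b": TagSpec(lambda s: "<b>%s</b>" % s),
    "i": TagSpec(lambda s: "<i>%s</i>" % s),
    "nocode": TagSpec(lambda s: s, content_first=False),
}


def is_nocode(name):
    """A tag behaves like nocode if it is rendered before its content."""
    return not TAGS[name].content_first
```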

© 2006-2012: All content, unless otherwise noted, is the property of Arancaytar. It may be copied and modified with attribution for non-commercial purposes. By publishing comments on this site, you grant Arancaytar a non-exclusive, perpetual license to reproduce and publish these comments along with any identifying information provided. (You may request your comments to be deleted or edited voluntarily.)