xbbcode
Building a better Tag-Parser, part 2
As I concluded in the first part, one problem remains with the strict "inner-tag-first" evaluation of code (which roughly equates to a Left-Right-Root traversal of the code tree). Namely, some tags, like [nocode], require that further code inside them is not parsed.
At first glance, this looks easy: Just determine if the currently open tag allows any code inside it, and ignore any opening tags otherwise until it closes (other closing tags must still be examined in case the nocode tag is dangling). There are multiple conditions where this fails:
1. The [nocode] tag is not closed. Even though it is ultimately discarded, all tags following it will remain unrendered. "Discarded" tags should not influence the rendering.
2. The tag is discarded because its parent tag is closed before it itself is. This causes the same problem as condition 1, up to the closing parent tag.
3. Even with valid code, if the nocode tag contains a valid pair of tags, and the opening one is ignored, then the closing one may be misinterpreted as closing a parent tag above nocode, resulting in condition 2.
From number 3 it is clear that even without rendering them, tags still have to be tracked inside the nocode block. But how to ensure that they will be retroactively rendered in case the nocode tag ends up discarded?
In this case, it is possible to recursively reapply the filter to the text block that hasn't yet been rendered. In pseudocode (some context added in parentheses; refer to the code of the first part), the additions and changes are:
I am not yet sure whether the recursion/backtracking can be maliciously exploited, in the sense that certain crafted input might take a lot of server resources to process. Even if that is the case though, it would be simple to put in some safeguards.
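As a rough illustration of the retroactive rendering idea, here is a minimal Python sketch (the module itself is PHP, and this is not its code). The tag table `TAGS`, the `Frame` class, the `render` function and the `max_depth` safeguard are all hypothetical names chosen for the example. Tags inside a nocode block are still tracked on the stack but left unrendered; if the nocode tag ends up discarded, the filter is reapplied to the source text it had suppressed.

```python
import re

TAG_RE = re.compile(r"\[(/?)([a-z]+)\]")

# Hypothetical tag table: name -> (renderer, nocode flag).
TAGS = {
    "b": (lambda content: f"<strong>{content}</strong>", False),
    "nocode": (lambda content: f"<code>{content}</code>", True),
}


class Frame:
    def __init__(self, name, raw_open, start, raw):
        self.name = name          # tag name, None for the virtual root
        self.raw_open = raw_open  # opening tag exactly as written
        self.start = start        # source offset where the content begins
        self.raw = raw            # True if the content must stay unrendered
        self.content = ""


def render(text, depth=0, max_depth=10):
    stack = [Frame(None, "", 0, False)]
    pos = 0

    def discard(end):
        # Pop an unclosed tag and show it as plain BBCode in its parent.
        frame = stack.pop()
        parent = stack[-1]
        content = frame.content
        is_nocode = TAGS[frame.name][1]
        if is_nocode and not parent.raw and depth < max_depth:
            # The nocode tag never took effect, so the tags it suppressed
            # are rendered retroactively by reapplying the filter.
            content = render(text[frame.start:end], depth + 1, max_depth)
        parent.content += frame.raw_open + content

    for match in TAG_RE.finditer(text):
        closing, name = match.groups()
        top = stack[-1]
        top.content += text[pos:match.start()]
        pos = match.end()
        if name not in TAGS:
            top.content += match.group(0)
        elif not closing:
            raw = top.raw or TAGS[name][1]
            stack.append(Frame(name, match.group(0), pos, raw))
        elif any(f.name == name for f in stack[1:]):
            # Discard every unclosed tag above the one being closed.
            while stack[-1].name != name:
                discard(match.start())
            frame = stack.pop()
            parent = stack[-1]
            if parent.raw:
                # Inside nocode: the pair is tracked but kept as text.
                parent.content += frame.raw_open + frame.content + match.group(0)
            else:
                parent.content += TAGS[name][0](frame.content)
        else:
            top.content += match.group(0)  # stray closing tag
    stack[-1].content += text[pos:]
    while len(stack) > 1:
        discard(len(text))
    return stack[0].content
```

In this sketch a dangling `[nocode]` no longer suppresses the tags after it, and the depth limit is one possible safeguard against crafted input; whether it is sufficient is an open question.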
The upside of this change is that the "weight" of tags becomes completely obsolete. Instead of having a specified order for every single tag, every single tag merely specifies whether it wants to be rendered before its content is, or after.
Building a better Tag-Parser
The Problem
If you read the code of my Extensible BBCode module for Drupal, you'll notice that my tag parsing algorithm is kind of complex. In two steps, the tags are first "paired" (inserting a matching ID into the opening and closing tag) and then "rendered", all by an evaluated regular expression that calls another function. The process certainly ensures that all tags are balanced - and it even allows some tags to be rendered before others are, regardless of nesting.
But for all its voodoo magic and messing about, it is vulnerable to the same bug as most primitive BBCode parsers. Namely, allowing input that will result in this output:
Now, the quirks modes of all layout engines are very good at dealing with simple cases like that. Inline markup will be rendered correctly even with overlapping elements.
But what if it includes block elements? In phpBB, I recently noticed that it is extremely easy to disrupt the page layout using just two simple tags: [spoiler] and [quote].
The following overlapping input will be accepted:
The following output will (roughly) be generated:
Unlike with the inline markup, several major layout engines (Gecko, WebKit and Presto are the ones tested) take issue with this. The quirks-mode behavior in all three drops the three unmatched div tags inside the blockquote element, and then applies the three closing div tags to parent div elements higher up in the tree. This naturally wreaks havoc on the page. (I'd almost consider this a mild denial-of-service vulnerability because any malicious user can break the page layout.)
XBBCode is vulnerable to the same, because in spite of all its careful "pair matching", it never actually checks that the tag tree is nested correctly.
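To make the missing check concrete, here is a small Python sketch of a nesting test. The function name `find_nesting_errors`, the tag regex and the sample input are assumptions for illustration only, not part of the module. A closing tag is only valid if it matches the most recently opened tag.

```python
import re

# Matches [tag], [tag=option] and [/tag]; the exact syntax is an assumption.
TAG_RE = re.compile(r"\[(/?)([a-z]+)(?:=[^\]]*)?\]")


def find_nesting_errors(text):
    """Return (offset, name) for every closing tag that does not match
    the innermost open tag. Unclosed tags are not reported here."""
    open_tags = []
    errors = []
    for match in TAG_RE.finditer(text):
        closing, name = match.groups()
        if not closing:
            open_tags.append(name)
        elif open_tags and open_tags[-1] == name:
            open_tags.pop()
        else:
            errors.append((match.start(), name))
    return errors


# An overlapping pair of the kind described above is flagged:
print(find_nesting_errors("[spoiler][quote]x[/spoiler][/quote]"))
```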
The Algorithm
So now I'm working on a much simpler and in my opinion more stable approach. I can only suppose it hadn't occurred to me earlier because I was thinking that BBCode had to be rendered "in-place", replacing each tag in the string where it was found. It seems that a better approach is to first find all tags, and then gradually build the output from substrings of the original text.
In Pseudocode (the actual PHP code is not as clean, alas):
You can see that even though the algorithm and the data it operates on are iterative, the text is rendered recursively from the inside out: Whenever a tag is closed, it is rendered and the output appended to the parent element's content.
If the tag being closed lies further down the stack, then all intermediate unclosed tags are discarded (though any complete tags inside them are kept). In the end, the same is considered to happen to the virtual "root" tag, that encloses the entire input - unclosed tags are discarded. (Note that "discarded" means "displayed in the output as unrendered BBCode", as is customary.)
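The behavior described in the last two paragraphs can be sketched in Python as follows. This is a simplified stand-in under stated assumptions, not the module's PHP code: the `RENDERERS` table, the tag regex and the function name `render_bbcode` are hypothetical, and HTML escaping is left out entirely.

```python
import re

# Hypothetical renderers: tag name -> function(content, option) -> HTML.
RENDERERS = {
    "b": lambda content, option: f"<strong>{content}</strong>",
    "i": lambda content, option: f"<em>{content}</em>",
    "quote": lambda content, option: f"<blockquote>{content}</blockquote>",
}

TAG_RE = re.compile(r"\[(/?)([a-z]+)(?:=([^\]]*))?\]")


def render_bbcode(text):
    # Each stack frame is [name, option, raw opening tag, content so far].
    # The bottom frame is the virtual root tag enclosing the whole input.
    stack = [[None, None, "", ""]]
    pos = 0
    for match in TAG_RE.finditer(text):
        closing, name, option = match.groups()
        # Text between tags belongs to the innermost open tag.
        stack[-1][3] += text[pos:match.start()]
        pos = match.end()
        if name not in RENDERERS:
            stack[-1][3] += match.group(0)  # unknown tag: keep as text
        elif not closing:
            stack.append([name, option, match.group(0), ""])
        elif any(frame[0] == name for frame in stack[1:]):
            # Discard unclosed tags above the matching one; they reappear
            # as unrendered BBCode, but their finished children are kept.
            while stack[-1][0] != name:
                _, _, raw, content = stack.pop()
                stack[-1][3] += raw + content
            _, opt, _, content = stack.pop()
            # Render the closed tag and append it to its parent's content.
            stack[-1][3] += RENDERERS[name](content, opt)
        else:
            stack[-1][3] += match.group(0)  # closing tag with no partner
    stack[-1][3] += text[pos:]
    # At the end, the root "closes" and any unclosed tags are discarded.
    while len(stack) > 1:
        _, _, raw, content = stack.pop()
        stack[-1][3] += raw + content
    return stack[0][3]
```

With this sketch, overlapping input such as `[b]bold [i]both[/b] italic[/i]` yields `<strong>bold [i]both</strong> italic[/i]`: the output is always properly nested, and the tags that could not be matched stay visible as text.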
The Costs
The first and greatest downside to this method is that tags cannot be weighted anymore. Every tag renderer will receive its content with all nested BBCode tags already rendered. Deferring the rendering of a tag is just not feasible in this algorithm.
However, this can be trivially circumvented by using a feature that is a lot easier to implement. Tags can be given a "nocode" property, which will tell this algorithm to ignore any tags nested inside it. The tag renderer could then independently run the BBCode filter on its content to render the tags that were left unrendered earlier.
(I'm not yet at the point where a dangling nocode tag will not stop enclosed tags from being rendered. It's not trivial, alas.)
XBBCode 7.x-0.9 released
I've spent a few more days improving and cleaning up the new version.
One thing I have managed to make loads better is the handler settings form: Instead of a table with simple checkboxes and weight selection elements, the form now uses tablesort and tabledrag. The extra space is used to display the tag description.
(Note: Due to a bug in core most of the cool stuff will not work without applying a core patch.)
http://ermarian.net/downloads/software/drupal/xbbcode/xbbcode-7.x-0.9-r4...
Porting XBBCode to Drupal 7
Well, as you can see, the NaNoWriMo didn't really work out. Astonishingly, it turns out that a day has only twenty-four hours.
However, I have managed to devote an afternoon this weekend to porting my XBBCode module. This is particularly interesting if you are from Spiderweb, because XBBCode for Drupal 7 is one of the steps on the roadmap to the new Pied Piper Project and the new Blades Forge.
I've managed to achieve a few milestones in a short time:
- Creating, editing and deleting custom tags works.
- The basic tag package has been ported.
- The actual filter works, including basic, custom and dynamic tags.
Other than cleanup, the only parts that remain of the actual XBBCode port are the other two sub-modules, list and table. The far bigger task ahead is the porting of the Highlighter module. Highlighter unfortunately contains some nasty file juggling and PEAR voodoo, which will be hard to modernize.
XBBCode 6.x-1.1.1 is out
This is just a notice of a bugfix update to XBBCode, my light-weight stack-based parser for customizable and extensible BBCode in Drupal.
I have been developing the module for two years now and using it on this blog for almost as long (see earlier posts), so I've had lots of time to work out most of the bugs.
It can be downloaded here:
http://ermarian.net/downloads/drupal-addons/xbbcode/xbbcode-6.x-1.1.1-r3...
It is also still available on SVN:
http://svn.ermarian.net/drupal/modules/xbbcode/trunk/xbbcode/
XBBCode available for 6.x
My pet module XBBCode is finally available for Drupal 6 - at least the core engine and the basic tags. It has undergone a lot of clean-up, including the user interface.
The module will only be packaged as a public release when all sub-modules are converted, but for now, the trunk is available here:
XBBCode ported to Drupal 6!
Drupal 6.x is looming. A few months ago, I wouldn't have chosen that particular word, but as it's getting closer and closer, I see past the smoothness and shiny new features and remember what a major Drupal version upgrade actually means for me as a site developer: Endless hours of coding to get my modules compatible with the new API.
DHTML menu, now, was easy. I pretty much spent half a day on getting it to work in D6 - after being stumped for several days regarding the new menu system, of course.
XBBCode, on the other hand, I set off to the side. This is not because its structure has to undergo some major refactoring - in fact the filter API hasn't changed at all (or at least not in a way that broke my module). Rather, it uses a few high-level menu items and configuration forms - and as always, the new Drupal version completely revamped the menu system and Form API. Forgive me for waxing cynical for a bit. Form API is a thing of beauty once you understand it - and hopefully that will be the case before it is rewritten again and gets even more beautiful.
Still, after some reading and bothering the other developers on IRC, I finally pushed XBBCode into a shape where it works in Drupal 6. The engine with all its settings forms and custom tags is functional - though not E_NOTICE-free, because of a strange behavior of the new menu system that I still have to figure out. The basic tags package required practically no updating.
I am expecting more trouble with the other sub-modules that implement their own settings forms, but none of them are as bad as the engine itself.
The new version of XBBCode 6.x-dev is not yet on SVN. I still have to split off the DRUPAL-5 branch before I commit the 6.x version into the repository. So this is little more than a hype topic. I expect to have the version done in another week, however.
XBBCode 5.x-0.2.1
A new version of XBBCode is out. This one is a bugfix that will finally get the module to work out of the box, so if you've tried it before and couldn't get it to work, give it another chance please.
http://ermarian.net/downloads/drupal-addons/xbbcode/xbbcode-5.x-0.2.1.ta...
Fixes:
- The module has been refactored completely; system names now reflect that its submodules are actually part of xbbcode (through names prefixed by "xbbcode_").
- The module now uses the .install file properly, which makes it install without requiring a database schema update.
------
As the code freeze for Drupal 6 has just hit, the next task ahead is porting this module to the new core version.
XBBCode now available through SVN
It's actually been some time since I began using my own SVN repository for the XBBCode module (described in more detail in earlier posts). It's far more convenient than the combination of Dreamweaver and vim I had been using before.
Also, I have made it publicly available, so if you have an SVN client (which you undoubtedly do if you have Linux), you can now get the current version by checking out the repository at this URL:
http://svn.ermarian.net/drupal/modules/xbbcode/
Oh, and I've been working on a lists extension that makes use of some of the cooler things that CSS 2.0 can do.
Currently, XBBCode supports both the basic tags that any BBCode implementation has (emphasis, font-styling, links and images) and also a more advanced syntax-highlighting functionality that requires a few extra PHP libraries to be downloaded.
AvernumScript Syntax
I wrote of my first scenario two days ago, and today of my success in combining the PEAR syntax highlighter with my XBBCode module to integrate it in Drupal.
So this shouldn't come as a surprise, really.
AvernumScript is one hell of a scripting language, and not really in the positive sense: Arbitrary spellings and abbreviations of functions are awful to remember, and the tight limitations (only integer and string types, only while and if-else flow control) aren't good to design for.
But it does have something going for it: The syntax is extremely simplistic. So simplistic, in fact, that it was a trivial exercise to write a first draft of a descriptive XML document for it, which Text Highlighter compiled into a PHP class.
The result is below. And no, nothing in this was colored manually. I merely copied a part of one of my scripts, placed it in [avernumscript]...[/avernumscript] tags, and watched the parser at work.
[avernumscript=ln]
begintownscript;
variables;
short message_lost,night_time;
body;
beginstate INIT_STATE;
    // This state called whenever this section is loaded into memory.
    set_name(9,"Father Kernighan"); // Name Kernighan and give him a pic
    set_char_dialogue_pic(9,1961,0);
    set_name(11,"Ritchie"); // Likewise for the Innkeep
    set_char_dialogue_pic(11,1948,0);
    set_name(7,"Lenore"); // And Lenore
    set_char_dialogue_pic(7,1959,0);
    set_name(8,"Zak"); // Likewise for the assassin
    set_char_dialogue_pic(8,1945,0);
    set_name(16,"Grace"); // Likewise for Grace Hopper
    set_char_dialogue_pic(16,1947,0);
    add_char_to_group(9,1); // the three tavern guests are one group
    add_char_to_group(8,1);
    add_char_to_group(7,1);
    add_char_to_group(10,2); // sendings
    add_char_to_group(19,2);
    add_char_to_group(20,2);
    add_char_to_group(21,2);
    if (get_flag(200,1)>0) // if they got robbed
    {
        set_creature_memory_cell(11,3,23); // change ritchie's dialogue.
    }
break;
[/avernumscript]
