Long-time readers of the excellent blog Baghdad Burning by Riverbend will have noticed ages ago that their feed readers stopped working.
Why? The last item displayed in the Atom feed is dated June 2005 - more than a year ago.
Unfortunately, this also means that readers are not alerted to new posts, which are usually weeks apart. In practice, you either read posts many days late or futilely check the page all the time.
Along comes PHP, and a way to reverse-engineer a (reasonably well-structured) HTML page back into a database of posts, and from there into an RSS feed - the kind you can put into Bloglines, Google Reader or your favorite aggregator. The colloquial term for this is "scraping", as far as I know.
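To make the idea concrete, here is a minimal sketch of the page-to-feed round trip. It is written in Python rather than the PHP used for the real feed, and everything about the markup is an assumption: the sample page, the class names and the helper names parse_posts and build_rss are hypothetical, since the actual structure of the blog's HTML is not shown here. The database step is skipped; the posts simply live in a list.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical markup for illustration only; the real page looks different.
SAMPLE_PAGE = """
<h2 class="date">Monday, June 06, 2005</h2>
<h3 class="title">First post</h3>
<div class="body">Hello <b>world</b>.</div>
<h2 class="date">Tuesday, June 07, 2005</h2>
<h3 class="title">Second post</h3>
<div class="body">More text.</div>
"""

# One regular expression per post: date heading, title heading, body block.
POST_RE = re.compile(
    r'<h2 class="date">(?P<date>.*?)</h2>\s*'
    r'<h3 class="title">(?P<title>.*?)</h3>\s*'
    r'<div class="body">(?P<body>.*?)</div>',
    re.DOTALL,
)


def parse_posts(page):
    """Return one dict (date, title, body) per post found in the page."""
    return [match.groupdict() for match in POST_RE.finditer(page)]


def build_rss(posts, feed_title, feed_link):
    """Assemble a bare-bones RSS 2.0 document from the parsed posts."""
    items = []
    for post in posts:
        items.append(
            "<item>"
            f"<title>{escape(post['title'])}</title>"
            # Escaping the body keeps the XML well-formed even with HTML inside.
            f"<description>{escape(post['body'])}</description>"
            "</item>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<rss version="2.0"><channel>'
        f"<title>{escape(feed_title)}</title>"
        f"<link>{escape(feed_link)}</link>"
        f"<description>{escape(feed_title)}</description>"
        + "".join(items)
        + "</channel></rss>"
    )


posts = parse_posts(SAMPLE_PAGE)
rss = build_rss(posts, "Example feed", "http://example.invalid/")
```

A regular expression only works this way because the page is reasonably well-structured; on messier markup a real HTML parser would likely be the safer choice.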
I've spent the last few weeks experimenting with regular expressions on the Pied Piper archives, so it wasn't hard to parse the site. It was actually far harder to find a way to escape or remove the special Unicode characters, HTML tags and other stuff that aggregators don't like. It validates now, however. I present:
The Baghdad Burning newsfeed, brought to you by feeds.ermarian.net!
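For the curious, the cleanup step mentioned above (getting rid of tags and characters that aggregators choke on) might look roughly like the following. This is a Python sketch under my own assumptions, not the PHP code behind the feed; the function name clean_for_feed and the exact set of rules are hypothetical.

```python
import html
import re
from xml.sax.saxutils import escape

# Anything that looks like an HTML tag.
TAG_RE = re.compile(r"<[^>]+>")
# Control characters that XML 1.0 forbids (tab, LF and CR are allowed).
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_for_feed(fragment):
    """Turn an HTML fragment into plain text that is safe inside an XML element."""
    text = TAG_RE.sub("", fragment)             # drop the tags
    text = html.unescape(text)                  # &amp; -> &, &quot; -> ", ...
    text = CONTROL_RE.sub("", text)             # remove forbidden control characters
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return escape(text)                         # re-escape &, < and > for XML
```

The order matters: entities are decoded before the final escape, so an ampersand that was already written as &amp; does not end up double-escaped.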
Comments
I put the fool in Aranfoolcaytar
Oh well.
--
Arancaytar
PHP as CGI and Unicode
I'll have to work on this for a bit.
Also, I sincerely hope that this work will actually become meaningful again, once she returns unharmed.
--
Arancaytar
Just in time!
Turns out that turning off the Unicode character set for MySQL results means PHP stops trying to fit a single high-bit character into two ASCII slots. What character set is used instead, I have no idea. Typographical quotes are numbered 147 and 148; the single-character ellipsis (whose inventor needs to be beaten around the head with three periods) is 133, and so on. But at least we're now dealing with single characters, which are much easier to search for.
By running the string through a filter that looks up these values in a map (an "associative array"), I can turn these high-bit characters into more mundane ones - like three periods, a plain double quote, or a plain apostrophe.
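As an illustration, the lookup could be sketched like this, in Python rather than PHP. The values 147, 148 and 133 are the ones observed above; they happen to match the positions of the curly double quotes and the ellipsis in Windows-1252, which suggests (but does not prove) that this is the character set in play. The entries for 145 and 146 are my assumption, taken from the neighbouring single-quote positions in Windows-1252, and the name flatten_high_bit is hypothetical.

```python
# Map from character code to a plain ASCII replacement.
REPLACEMENTS = {
    133: "...",  # single-character ellipsis
    145: "'",    # left single quote (assumed, not mentioned in the post)
    146: "'",    # right single quote / apostrophe (assumed)
    147: '"',    # left typographical double quote
    148: '"',    # right typographical double quote
}


def flatten_high_bit(text):
    """Replace the mapped high-bit characters and leave everything else alone."""
    # str.translate accepts a dict keyed by code point.
    return text.translate(REPLACEMENTS)


sample = chr(147) + "Wait" + chr(133) + chr(148)
flattened = flatten_high_bit(sample)
```

Characters that are not in the map pass through unchanged, so the table can grow one entry at a time as new offenders turn up.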
By making these changes once in the database, rather than each time the feed is requested, I also save runtime in the feed script.
I just managed to make the feed validate here for the first time in what must be a month or more. So I'm almost as happy about that as I am that Riverbend is okay.
--
Arancaytar