Baghdad Burning newsfeed

Arancaytar's picture
Long-time readers of the excellent blog Baghdad Burning by Riverbend will have noticed ages ago that their feed readers stopped working.

The reason: the last item displayed in the Atom feed is dated June 2005, more than a year ago.

Unfortunately, this also means that readers are not alerted to new posts, which are usually weeks apart. In practice, this means either reading posts many days late or futilely checking the page by hand all the time.

Along comes PHP, and with it a way to reverse-engineer a (reasonably well-structured) HTML page back into a database of posts, and from there into an RSS feed: the kind you can put into Bloglines, Google Reader or your favorite aggregator. The colloquial term for this is "scraping", as far as I know.
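
To make the idea concrete, here is a minimal sketch of that page-to-feed step. It is written in Python rather than the PHP actually used for the feed, and it is not the real script: the sample markup, the class names, the date format and the function names (`scrape_posts`, `build_rss`, `to_rfc822`) are all assumptions made up for illustration, and the database step is skipped entirely.

```python
import re
from datetime import datetime
from email.utils import format_datetime
from xml.sax.saxutils import escape

# Hypothetical markup: the real blog's HTML is not shown in this post.
SAMPLE_PAGE = """
<h2 class="date">Monday, June 05, 2006</h2>
<h3 class="title">First post</h3>
<div class="body">Some <b>text</b> here.</div>
<h2 class="date">Tuesday, June 06, 2006</h2>
<h3 class="title">Second post</h3>
<div class="body">More text &amp; things.</div>
"""

# One match per post; DOTALL lets a body span several lines.
POST_RE = re.compile(
    r'<h2 class="date">(?P<date>.*?)</h2>\s*'
    r'<h3 class="title">(?P<title>.*?)</h3>\s*'
    r'<div class="body">(?P<body>.*?)</div>',
    re.DOTALL,
)


def scrape_posts(html):
    """Return a list of dicts with 'date', 'title' and 'body' keys."""
    return [match.groupdict() for match in POST_RE.finditer(html)]


def to_rfc822(date_text):
    """Convert the assumed 'Monday, June 05, 2006' format to an RSS pubDate."""
    parsed = datetime.strptime(date_text, "%A, %B %d, %Y")
    return format_datetime(parsed)


def build_rss(posts, feed_title, feed_link):
    """Build a bare-bones RSS 2.0 document; HTML in bodies is XML-escaped."""
    items = []
    for post in posts:
        items.append(
            "<item>"
            f"<title>{escape(post['title'])}</title>"
            f"<pubDate>{to_rfc822(post['date'])}</pubDate>"
            f"<description>{escape(post['body'])}</description>"
            "</item>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<rss version="2.0"><channel>'
        f"<title>{escape(feed_title)}</title>"
        f"<link>{escape(feed_link)}</link>"
        f"<description>{escape(feed_title)}</description>"
        + "".join(items)
        + "</channel></rss>"
    )


if __name__ == "__main__":
    print(build_rss(scrape_posts(SAMPLE_PAGE), "Example feed", "http://example.invalid/"))
```

Regular expressions only hold up here because the page is assumed to be regular; on messier markup a proper HTML parser would be the safer choice.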

I've spent the last few weeks experimenting with regular expressions on the Pied Piper archives, so it wasn't hard to parse the site. It was actually far harder to find a way to escape or remove the special Unicode characters, HTML tags and other stuff that aggregators don't like. It validates now, however. I present:

The Baghdad Burning newsfeed,
brought to you by feeds.ermarian.net!

Comments

I put the fool in Aranfoolcaytar

I failed to realize that the best-coded feed scraper would not work if some server did not tirelessly call the script that updates the feed! Sheer luck had me checking the blog itself today, realizing that Riverbend had already posted yesterday and the feed hadn't yet updated.

Oh well.

--
Arancaytar

PHP as CGI and Unicode

With the host move, PHP 5 has improved a lot of things, but the fact that it now runs as CGI has caused a big mess. Among the problems is the one with Unicode characters, which I haven't managed to completely mask/remove/filter/replace.

I'll have to work on this for a bit.

Also, I sincerely hope that this work will actually become meaningful again, once she returns unharmed.

--
Arancaytar

Just in time!

She returned, and, as I said, I got to work on the problem of the Invisible Pink Unicode. Seriously, what with PHP's handling of character sets, that's what they should call it.

Turns out that, first of all, turning off the Unicode character set for MySQL results means PHP stops trying to fit a single high-bit character into two ASCII slots. What character set is used instead, I have no idea. Typographical quotes are numbered 147 and 148; the single-character ellipsis (whose inventor needs to be beaten around the head with three periods) is 133, and so on. But at least we're now dealing with single characters, which are much easier to search for.

By running the string through a filter that looks up these values in a map (an "associative array"), I can turn these high-bit characters into more mundane ones - like three periods, a simple ", or a simple apostrophe.
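
A minimal sketch of such a lookup filter, again in Python rather than PHP. The values 133, 147 and 148 come from the comment above and happen to match the Windows-1252 code page; the single-quote entries (145 and 146, taken from that same code page), the `?` placeholder for unmapped bytes, and the names `HIGH_BIT_MAP` and `to_plain_ascii` are assumptions added for illustration.

```python
# Map from single high-bit byte values to plain ASCII replacements.
HIGH_BIT_MAP = {
    133: "...",  # single-character ellipsis
    147: '"',    # opening typographical quote
    148: '"',    # closing typographical quote
    145: "'",    # assumption: opening single quote in Windows-1252
    146: "'",    # assumption: closing single quote / apostrophe in Windows-1252
}


def to_plain_ascii(raw: bytes) -> str:
    """Replace mapped high-bit bytes, keep ASCII, mark anything else with '?'."""
    pieces = []
    for byte in raw:
        if byte in HIGH_BIT_MAP:
            pieces.append(HIGH_BIT_MAP[byte])
        elif byte < 128:
            pieces.append(chr(byte))
        else:
            pieces.append("?")  # unmapped high-bit byte; placeholder is an assumption
    return "".join(pieces)


if __name__ == "__main__":
    print(to_plain_ascii(b"\x93Quoted\x94 text\x85"))  # prints "Quoted" text...
```

Because every special character is a single byte at this point, one pass over the string with a dictionary lookup is enough; no multi-byte decoding is needed.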

By making these changes in the database, I save on runtime of the feed script as well.

I just managed to make the feed validate here for the first time in what must be a month or more. So I'm almost as happy about that as I am that Riverbend is okay.

--
Arancaytar