Clean URI schemes using mod_rewrite and PHP (Part one)

I recently read an article over at w3.org, titled Choose URIs Wisely. The gist of it is that web addresses should be concise, easy to spell and remember, and be persistent even when the site is reorganized.

Now, unlike the XHTML standards and rules for valid markup, the advice given by the W3C is more a set of guidelines than actual rules. It shows how the internet should ideally work. In that respect, I would compare web development more to an art than a craft. These standards are a way of life: they take a lot of effort to implement, but they will reap great rewards in time - if you stick to them persistently.

As an example of how a URL should look, it might be best to examine that of the article itself:

http://www.w3.org/QA/Tips/uri-choose

The directory path doesn't look very spectacular, but it makes sense on a logical level. The article is part of the Tips series, which is part of QA (Quality Assurance), which is a resource the W3C offers.

What is unusual for a URL is the absence of a file extension. Extensions are ubiquitous in addresses: almost all web pages end in .html, .htm or .php.

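However, this violates the W3C ideal that a resource address should contain only information about the identity of the resource, not any information on how it is stored or displayed. This has several great disadvantages. For one, the address changes, and existing links break, as soon as a page moves from, say, static .html to .php.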

The idea is that file extensions should stay out of the URI. Your ideal web address looks like http://web.site.net/directory/directory/page.


That was the rant, now comes the useful stuff. How do you actually implement this ideal? How do you handle resource locators that have nothing to do with where the data is stored on your web server?

The easiest way would be Drupal (which powers this site). Drupal is the first CMS I have seen that supports this ideal for a URI scheme. It combines clean URLs (URLs that contain no query parameters such as "?q=...") with URL aliases (URIs that are stored in a lookup table in MySQL, giving a resource a human-readable name). Together, these allow you to run any blog or forum according to the URI ideals given by the W3C.

It gets harder when for whatever reason you can't have your website powered by a Drupal engine (I haven't yet come up with something that couldn't be done with it, but it is admittedly hard to migrate certain sites). You have to come up with a URI scheme yourself, and also implement it.


I do this with a combination of mod_rewrite and PHP. Both are things that require you to have reasonably high access to your web server, so doing this on Freewebs or Geocities is pretty much out of the question.

Top level directory

Purpose

You should use a top-level directory that all resources are stored beneath. This actually runs slightly counter to the ideal, since it needlessly makes the URI longer, but it makes the resource handling much easier to implement. Also, should you ever add new resources that are "outside" this resource handler, the top-level directory is an easy way to tell the two kinds of resources apart.

The top-level directory should of course be as short as possible while remaining uniquely identifying. Wikipedia uses a top directory named /wiki/. You will notice that there are resources outside this directory, the 503 error page for example. /wiki/ identifies a certain resource handler - in this case a MediaWiki installation. Anything that comes after /wiki/ is the name of the resource.
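The split between the handler prefix and the resource name can be sketched in a few lines of PHP. This is only an illustration under assumptions: the function name resource_name and its parameters are hypothetical, not part of MediaWiki or any other software mentioned here.

```php
<?php
// Hypothetical helper: returns the resource name that follows the
// top-level directory, or null when the path lies outside of it.
function resource_name(string $path, string $topDir): ?string
{
    // Build the prefix with both slashes, e.g. "/wiki/", so that a path
    // like "/wikipedia/x" is not mistaken for part of the directory.
    $prefix = '/' . trim($topDir, '/') . '/';

    if (strncmp($path, $prefix, strlen($prefix)) !== 0) {
        return null; // not handled by this resource handler
    }

    // Everything after the prefix is the name of the resource.
    return (string) substr($path, strlen($prefix));
}
```

With "wiki" as the top-level directory, the path "/wiki/Main_Page" yields "Main_Page", while a path outside the directory yields null. In a real request the path could be taken from parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH).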

Implementation

For this, you will need the ability to write an .htaccess file. This is a configuration file for your Apache webserver, which allows you to set certain preferences for a specific directory rather than the entire server.

The .htaccess file must be uploaded to the root directory of your web site - i.e., the same directory you would put index.html in.

This is what your .htaccess file should contain (if you already have one, this is what you should add to it):

RewriteEngine on
RewriteRule ^name/ handler.php

"name" is the top-level directory (e.g. "wiki").

"handler.php" is the PHP program that handles all page requests.

Note: A loosely written pattern may cause conflicts. Specifically, if the pattern is not anchored with "^" and another resource that does not use this resource handler (one that lies outside this top directory) still contains "name" somewhere within its URL, that request is rewritten as well. With the anchor in place this is not a problem. Resources inside this top directory are not affected either way - a URL in the "name" directory that contains "name" more than once will still work.

Handler

The handler program is a PHP file that looks up the resource identifier in the MySQL database and displays the appropriate content. It is important that this file does not do or display anything else. This way, the same handler can handle many different resources and content types.
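To make the idea of a handler that does nothing but look up and display a little more concrete, here is a minimal sketch. It is not the implementation this series builds: the function handle_resource, the $lookup callback and the sample table are all hypothetical, and the array merely stands in for the MySQL lookup table so that the example runs on its own.

```php
<?php
// Hypothetical sketch: resolve a resource name to a status code and body.
// $lookup stands in for the database query; it returns the stored content
// or null when the name is unknown.
function handle_resource(string $name, callable $lookup): array
{
    $content = $lookup($name);
    if ($content === null) {
        return [404, 'Not found'];
    }
    return [200, $content];
}

// Stand-in for the MySQL lookup table, for illustration only.
$table = ['Tips/uri-choose' => 'Choose URIs wisely'];
$lookup = function (string $name) use ($table): ?string {
    return $table[$name] ?? null;
};

// Resolve one known resource name.
[$status, $body] = handle_resource('Tips/uri-choose', $lookup);
```

Keeping the lookup behind a callback means the handler itself stays free of storage details, which fits the requirement that it should not do or display anything beyond the requested resource.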

I will explain how to implement the handler (and the lookup table) in the next post, tomorrow.