WordPress XHTML Validator Plugin

WordPress is very good about supporting standards and producing valid markup, and at least when I started using it, it had a link in the standard theme to proclaim the validity of its pages and prove it to you by taking you to the W3C Markup Validation Service. A lot of people never pay this any attention and promptly produce a bunch of non-complying pages, all the while shamefully leaving the boastful link in the sidebar or footer of their site.

Being a retentive sort about many things, I’ve always worked to ensure my pages validate correctly, looking for the below highly satisfying message on each and every one of these posts that I strive mightily to create for you:

Screenshot of W3C Markup Validation Page

(Versus the alternative which is highlighted with red instead of green, and tells you that you are not just inadequate but are a bad person as well.)

But this can be a tedious chore. To validate by URI, your page must already be published so that the W3C server can reach it. This will never do. We must have compliance before publication! I work on my blog locally, so I can publish the page there and capture the HTML source to paste into the Direct Input Validator, but this isn’t at all convenient, especially if you have errors and need to fix them and repeat the cycle.

WordPress XHTML Validator Plugin

I considered making a WordPress plugin to give me a button on the edit screen which would let me send the preview page source HTML directly to the W3C validator. Then I could perform a quick verification after composing a post. But I’m not versed in WordPress plugins and not keen on spending too much time in that area right now. I’d previously looked for plugins without finding a good one, but recently tried again. So often you can sweat over something and then find out someone has already done the work for you two years ago.

For example, this time my search turned up the WordPress XHTML Validator from rudd-o.com. It’s been waiting patiently since 2005 for me to find it, and it’s perfect for what I’m trying to do. I’ve been using it for a week now and it works great in WordPress 2.0.x. (As with most software I write about here, it’s free software, and I’m grateful to Manuel Amador for creating and freely sharing it.)

It uses two popular command line utilities, xmllint and html tidy, and validates every time you press the “Save and Continue Editing” button. There is also a feature to check all posts and pages and produce a list of problem pages.
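As a rough illustration of what "shelling out to xmllint on every save" could look like, here is a minimal Python sketch. It is not the plugin's code (the plugin is PHP, and I haven't studied how it calls the tools); the function names build_xmllint_command and check_file are made up for this example. It only checks well-formedness, using xmllint's --noout (don't echo the document) and --nonet (never fetch anything over the network) options.

```python
import shutil
import subprocess
from typing import List, Optional, Tuple


def build_xmllint_command(path: str) -> List[str]:
    """Build an xmllint call that checks well-formedness without network access."""
    # --noout: report problems only, do not print the parsed document.
    # --nonet: refuse to fetch DTDs or entities over the network.
    return ["xmllint", "--noout", "--nonet", path]


def check_file(path: str) -> Optional[Tuple[bool, str]]:
    """Run xmllint on a file; return (ok, messages), or None if xmllint is missing."""
    if shutil.which("xmllint") is None:
        return None  # tool not installed on this machine
    result = subprocess.run(
        build_xmllint_command(path),
        capture_output=True,
        text=True,
    )
    # xmllint exits with 0 on success and writes any complaints to stderr.
    return result.returncode == 0, result.stderr


if __name__ == "__main__":
    print(build_xmllint_command("post.xhtml"))
```

A save hook would write the post to a temporary file, call something like check_file on it, and show the messages next to the editor if the first element of the result is False.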

xmllint

I think I’ve heard of this program before but wasn’t familiar with what it can do. Not surprisingly given the topic of this post, it can validate XML. It was already installed on my Ubuntu 7.04/Feisty Fawn system. I thought it would be excessive and slow things down to validate against the W3C web site on every save, but it doesn’t use the external DTD. (However, I so far haven’t determined where it gets the DTD from.) A character entity file is included to let it correctly parse things like &nbsp;. Since it has low overhead, I like the idea of validating every time and keeping it clean as I go. Saves an extra step of checking later and finding and fixing several things at once.
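To illustrate why a local entity file matters, here is a minimal Python sketch. It is not the plugin's code and it doesn't use xmllint; it assumes Python's standard library XML parser as a stand-in, and the names NBSP_DECLARATION and parse_fragment are made up for this example. A bare XML parser only knows the five built-in entities, so &nbsp; is an error unless something declares it. Declaring it in an internal DTD subset keeps everything local, with no trip to the W3C site.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-entity stand-in for a full character entity file.
NBSP_DECLARATION = '<!DOCTYPE div [<!ENTITY nbsp "&#160;">]>'


def parse_fragment(content, declare_entities=True):
    """Parse post content wrapped in a <div>, optionally declaring &nbsp; locally."""
    prolog = NBSP_DECLARATION if declare_entities else ""
    return ET.fromstring(prolog + "<div>" + content + "</div>")


if __name__ == "__main__":
    # With the declaration, the entity becomes a real no-break space (U+00A0).
    print(repr(parse_fragment("a&nbsp;b").text))
    # Without it, the parser rejects the same content.
    try:
        parse_fragment("a&nbsp;b", declare_entities=False)
    except ET.ParseError as err:
        print("without the declaration:", err)
```

A real entity file would declare the whole XHTML set (&nbsp;, &copy;, &mdash; and so on) the same way, one declaration per entity.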

html tidy

This one isn’t installed on my machine but I could add it easily enough with sudo apt-get install tidy. I suspect it can be used to quickly fix problems with the html, but I’m just as happy managing this myself, so I haven’t installed it.

You can run the plugin with either program. It’s useful to me with only xmllint; I’m not sure how well it would do with only html tidy.

Web Host Considerations

If you want to run this plugin, you’ll need xmllint and/or html tidy installed on your web hosting server. It also relies on PHP5. My host, SurpassHosting, has xmllint, but is currently at PHP4, so I can’t use it there. But it works fine on my local machine which has PHP5, and this is where I compose my posts, so it’s not really necessary to run the plugin on my “real” blog.

Although I’d like to be able to use the feature to check all pages, just to be sure, speaking of which…

OCD Considerations

Even though I’ve tried validating all my pages along the way, I ran into a problem when checking my old posts. When WordPress generates pages, it fixes many things for you. So even though the whole page might validate against the W3C site, the post content itself on a whole bunch of pages showed up with errors in the plugin report. It didn’t like unencoded ampersands in URLs and missing </p> tags, among other things. So I felt compelled to fix all of these, because I want to be able to run all the pages and get a clean bill of health.
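The two complaints above are really well-formedness errors, which any XML parser will catch before a DTD even comes into it. Here is a minimal Python sketch of that kind of check. It is an assumption-laden stand-in, not what the plugin does: it uses Python's standard library parser rather than xmllint, it checks well-formedness only (no DTD validation), and check_fragment is a name I made up.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def check_fragment(content: str) -> Optional[str]:
    """Return None if the post content is well-formed XML, else the parser's message."""
    # Post content usually has several top-level tags, so give it a single root.
    wrapped = "<div>" + content + "</div>"
    try:
        ET.fromstring(wrapped)
    except ET.ParseError as err:
        return str(err)
    return None


if __name__ == "__main__":
    samples = [
        '<p><a href="http://example.com/?a=1&b=2">raw ampersand</a></p>',
        '<p><a href="http://example.com/?a=1&amp;b=2">encoded ampersand</a></p>',
        "<p>first paragraph<p>second paragraph</p>",
    ]
    for sample in samples:
        problem = check_fragment(sample)
        print("OK" if problem is None else "ERROR: " + problem, "|", sample)
```

Only the second sample passes. The first trips on the bare & in the URL, and the third on the paragraph that is never closed.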

W3C DTD Netiquette

In related news, Slashdot posted a story about excessive DTD requests to W3C servers that I found interesting in light of all this XHTML validation.

From the W3C blog posting:

If you view the source code of a typical web page, you are likely to see something like this near the top:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

and/or

<html xmlns="http://www.w3.org/1999/xhtml" ...>

These refer to HTML DTDs and namespace documents hosted on W3C’s site.

Note that these are not hyperlinks; these URIs are used for identification. This is a machine-readable way to say “this is HTML”. In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven’t changed in years.

[…]

In one case we noticed, a number of IP addresses at one company were requesting DTDs from our site more than three hundred thousand times per day each, per IP address.

–Ted Guild, “W3C’s Excessive DTD Traffic”

With that said, I feel better about a solution that doesn’t continually check with the W3C web site!
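For software that really does need the DTD, the polite fix is to fetch it once and keep it. Here is a minimal Python sketch of that idea. It is only an illustration: DtdCache and fake_fetch are hypothetical names, the cache lives in memory only, and the fetcher is a fake so the example runs without touching the network.

```python
from typing import Callable, Dict


class DtdCache:
    """Fetch each DTD URL at most once and reuse the stored copy afterwards."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                  # function that really retrieves a URL
        self._store: Dict[str, bytes] = {}   # URL -> DTD bytes already retrieved
        self.fetch_count = 0                 # how many real fetches have happened

    def get(self, url: str) -> bytes:
        if url not in self._store:
            self._store[url] = self._fetch(url)
            self.fetch_count += 1
        return self._store[url]


def fake_fetch(url: str) -> bytes:
    # Stand-in for a network call so the example runs offline.
    return ("<!-- pretend DTD for " + url + " -->").encode("utf-8")


if __name__ == "__main__":
    cache = DtdCache(fake_fetch)
    url = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
    for _ in range(1000):
        cache.get(url)
    print(cache.fetch_count)  # prints 1: a thousand lookups, one fetch
```

A longer-lived program would want to keep the copy on disk as well, since these files haven't changed in years.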

But then again, if you look on their Markup Validation page, they encourage us like so:

Congratulations

The document located at <http://www.movingtofreedom.org/> was checked and found to be valid XHTML 1.0 Transitional. This means that the resource in question identified itself as “XHTML 1.0 Transitional” and that we successfully performed a formal validation using an SGML or XML Parser (depending on the markup language used).

To show your readers that you have taken the care to create an interoperable Web page, you may display this icon on any page that validates. Here is the HTML you could use to add this icon to your Web page:

Valid XHTML 1.0 Transitional

<p>
  <a href="http://validator.w3.org/check?uri=referer"><img
      src="http://www.w3.org/Icons/valid-xhtml10"
      alt="Valid XHTML 1.0 Transitional" height="31" width="88" /></a>
</p>

So they’re just asking for it. :-)

But to be fair, random requests from curious readers aren’t going to amount to much in the face of companies that make hundreds of thousands of requests every day.


Comments

  1. My guess is that the W3C servers get so many requests for these DTDs because of poorly-written (or just lazily-written) spiders that treat everything resembling a URL in page source as a link to follow. In particular, the one company whose computers generate 300,000 requests a day each is almost certainly running a spider out of its data center.

    The W3C could put a robots.txt on its site denying bot access to those URLs but a) a lot of the bots hammering it probably also don’t respect robots.txt; lazy is lazy; b) there may be legitimate need for bots to access those pages, including validators, and c) if it did work, they’d just get 300,000 requests a day for robots.txt from each of those machines and any similar ones.

    P.S. Why are you blocking the ability to post from IPs belonging to Bell Canada, AND making it lie to these users about why they cannot post? (It incorrectly claims that they got the spam validation wrong, even when it obviously wasn’t, e.g. was “12” in response to “What is the sum of 5 + 7?”.)

  2. Hi, Somebody.

    First, this is likely a long-standing problem with Akismet that you’ve previously reported. It is unfortunate and I’m sorry that it happens, but I’m not planning to discontinue using Akismet at this time. (However, your comment did make it through without my having to approve it, so either you came from a different IP range, or something has changed with Akismet.)

    Second, “lie?” There may be some other technical issue, but is it necessary to imply duplicity is at work? I haven’t heard of any problems with the math comment plugin, and don’t know why it may have rejected a correct answer. Since your comment was allowed, it must have worked at least once.

    So, two things are at work. Like most WordPress blogs, I’m using Akismet for spam filtering. I’m also using the math comment spam plugin. (I previously used the Bad Behavior plugin but haven’t for a while now.) Many comment posting problems would be caused by one or the other of these components. I have little control over what Akismet does, and from experience, it doesn’t report anything when it swallows up a comment. The math comment plugin is so simple that I would be surprised if it suddenly started failing. Independent of Akismet, it will complain when it thinks you’ve entered a wrong answer. It’s certainly possible that it has a bug of some kind, but I assure you there is no intentional misdirection going on.

    Back to the main part of your comment, I agree that bad spiders are a likely culprit, and the Slashdot discussion goes over other causes.

  3. I did use a different IP range to post.

    The math problem was “What is the sum of 5 + 7?” on the failed post. The answer I gave being 12, of course. It follows that it was not rejected because the math was incorrect, although that is the reason claimed in the error message. It also wasn’t a transient glitch; I resubmitted it a couple of times with identical results.

    In fact, the originating IP and the added P.S. were the only things to differ between the failed and the successful post. Which means that the only thing that the failed post contained that the successful one didn’t was a Bell Canada IP address. Which means that that was the criterion for rejection.

    Which means that somewhere, some software is configured to reject posts based on originating IP with an error message that incorrectly claims that they got the math problem wrong. Configured, in other words, to lie.

    It seems unlikely to be Akismet, which apparently silently rejects posts rather than producing any error message at all.

    I’d suggest you examine closely the source code for the math test plugin. It may be that it has some extra “features” besides simply testing for the answer to be correct. Likely it accessed a remote database to decide what IPs to block, since it seems you didn’t yourself tell it to start blocking Bell Canada IPs, yet its starting to do so must have been triggered by something.

    The extra “features” are probably well-intentioned spam-blocking measures, but obviously have been implemented naively. Most likely, they used a blacklist intended for blocking e-mail spam that preemptively includes the ranges of large ISPs’ DHCP servers, since email originating directly from an ISP user’s computer instead of going through their ISP’s MX box is likely spam. It should be obvious why blocking such addresses is smart for email despamming but stupid for policing blog comments. :) As for why my ISP’s DHCP ranges would have been added only recently to whatever blacklist was used, that will likely remain a mystery. :P
