Stuff that's irritating me at the moment


  • It doesn't fetch robots.txt, and therefore has a hard time following it. I finally managed to stop it indexing my WordPress testbed by blocking it manually.
  • So all's fine, right? Well, not really. It's stopped picking up new posts properly. To test, I added myself to my Technorati Favorites [sic]. It turns out I updated 13 hours ago, but haven't made any new posts for 52 days.
  • Oh, and the title of the top post on that page was parsed incorrectly. I think it picked that up from my RSS2 feed; see below.
  • Not everything with a feed is a blog. That is all.


Like it or not, robots.txt is a web standard. It's well supported and pretty unambiguous (at least, the "Disallow: /" form is). It doesn't matter whether you're parsing HTML or RSS: if you're indexing data and using it in a search engine, you should follow robots.txt.
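For what it's worth, honouring the file isn't much work. Here's a minimal sketch in Python using the standard library's urllib.robotparser. The bot name "ExampleFeedBot", the example.com URLs, and the may_index helper are all made up for illustration; this is not how any particular crawler does it.

```python
from urllib.robotparser import RobotFileParser


def may_index(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to fetch url.

    robots_txt is the body of the site's robots.txt, fetched however
    the crawler likes (the fetching is deliberately left out here).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The unambiguous form: nobody gets to index anything.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# A narrower rule, e.g. for keeping a testbed out of an index.
BLOCK_TESTBED = "User-agent: *\nDisallow: /testbed/\n"

# The check applies to feed URLs just as much as to HTML pages.
print(may_index(BLOCK_ALL, "ExampleFeedBot", "http://example.com/feed/"))
print(may_index(BLOCK_TESTBED, "ExampleFeedBot", "http://example.com/testbed/feed/"))
print(may_index(BLOCK_TESTBED, "ExampleFeedBot", "http://example.com/blog/feed/"))
```

The first two calls should come back False and the third True. The point is that the same check applies whether the URL is a page or a feed.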


Overzealous web standards advocates

XHTML pages are better than HTML pages. If you validate your pages they'll work. Firefox supports web standards perfectly.

I'm not going to go over all the old arguments here. I don't feel I really need to prove that the statements in the above paragraph are false. It's not even a complete list; those are just the three that annoy me most.

  • I still use XHTML in one place on this site - the Atom feed. It's helpful here, because it means that somebody parsing my feed only needs one parser, and, if they do it properly, is at less risk of screwing up things like entities. In practice of course they'd need a tagsoup parser for other feeds, but that's not my problem (yet).
  • IE doesn't support XHTML. Mozilla supports it poorly. And of course Opera and Mozilla are locked out of my pages completely if I make a stupid mistake, so until I'm perfect, or can hire someone to check every page on the site regularly, I'll stick to HTML.
  • That's not to say that HTML4 is supported perfectly (it isn't, and I wouldn't expect it to be), but it's more of a case of knowing where I stand. There's a reasonably clear subset of HTML that's been widely supported for about a decade, and just using that seems like the best strategy.
  • See the text below the search form to the right? It's too small in Mozilla, because its parser thinks it's inside the form. This isn't a "Firefox sucks" point, just a reminder that there's no guarantee something valid will work in a "standards compliant" browser.
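Going back to the Atom point for a moment, here's roughly what "only needs one parser" looks like in practice. This is a sketch under assumptions: the feed below is a made-up example rather than my real one, entry_texts is a hypothetical helper, and it uses Python's standard xml.etree.ElementTree. Because the content is XHTML, the same XML parser that reads the feed also reads the post body, and it resolves character references along the way.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
XHTML = "{http://www.w3.org/1999/xhtml}"

# A made-up feed with one entry whose content is inline XHTML.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>Fish &amp; chips</title>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml"><p>Caf&#233; <em>notes</em></p></div>
    </content>
  </entry>
</feed>"""


def entry_texts(feed_xml):
    """Return (title, body text) pairs using nothing but the XML parser."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title")
        # The XHTML wrapper div is just another element in the same tree.
        div = entry.find(ATOM + "content/" + XHTML + "div")
        body = "".join(div.itertext()) if div is not None else ""
        results.append((title, body))
    return results


print(entry_texts(FEED))
```

A feed carrying escaped tag soup instead would need a second, more forgiving HTML parser for the body, which is exactly the extra step this avoids.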


Okay, I know I'm late. I've finally got fed up of RSS2. Specifically, the people who can't read the spec and don't want it clarified.

The blurb on my blog page now describes Atom as the recommended format, RSS3 as the simple format, and RSS2 as the one-to-use-if-you're-insane format. This is to amuse myself more than anything, but I'll probably be putting in some redirects and denying all existence of the RSS feed at some point.

