There has been some activity recently regarding the liberal parsing of incorrect data. Mark Pilgrim published an article on O'Reilly about liberal parsing of RSS [additional comments]. Around the same time I came across a pointer on Usenet to a master's thesis on the parsing of incorrect HTML [PDF version, also in PS format].
So, now you’ve read all that, you can see what my point is.
You can’t? OK, I’ll explain then. The master's thesis presents an analysis of how much incorrectly written HTML is out there: from a representative sample of 2.4 million URIs [harvested from DMOZ], only about 14.5 thousand of the roughly 2.05 million documents that could be validated were actually valid, or 0.71% (0.61% of all requests). I’ve included a data table based on the results of the analysis below. Luckily RSS isn’t in as bad a state as HTML, is it? Will the trend towards more liberal parsers lead to more authors never learning RSS properly and just crudely hacking it together, as happens with HTML at present? Does RSS need to be hackable in the way HTML is in order to gain wider acceptance?
| Category | Number of documents | % of attempted validations (2 d.p.) | % of total requests (2 d.p.) |
|---|---:|---:|---:|
| Invalid HTML documents | 2,034,788 | 99.29 | 84.85 |
| Not downloaded | 225,516 | N/A | 9.40 |
| Unknown DTD | 123,359 | N/A | 5.14 |
| Valid HTML documents | 14,563 | 0.71 | 0.61 |
| Grand total | 2,398,226 | 100.00 | 100.00 |
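To make the strict-versus-liberal distinction concrete, here is a minimal sketch in Python. It is my own illustration, not anything from the thesis or Mark Pilgrim's article: the sample feed, the `strict_titles` and `liberal_titles` names, and the regex fallback are all invented for the example. A strict parser rejects a feed containing an unescaped ampersand outright; a liberal one scrapes what it can.

```python
# Strict vs. liberal handling of a malformed RSS feed (illustrative only).
import re
import xml.etree.ElementTree as ET

# An unescaped '&' in a title: not well-formed XML, but common in the wild.
BROKEN_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example & Sons</title>
    <item><title>First post</title></item>
  </channel>
</rss>"""

def strict_titles(feed):
    """Strict parsing: refuse anything that is not well-formed XML."""
    root = ET.fromstring(feed)  # raises ParseError on the bare '&'
    return [t.text for t in root.iter("title")]

def liberal_titles(feed):
    """Liberal parsing: fall back to a crude regex scrape when XML parsing fails."""
    try:
        return strict_titles(feed)
    except ET.ParseError:
        return re.findall(r"<title>(.*?)</title>", feed, re.DOTALL)

if __name__ == "__main__":
    try:
        print(strict_titles(BROKEN_FEED))
    except ET.ParseError as err:
        print("strict parser rejected the feed:", err)
    print("liberal parser scraped:", liberal_titles(BROKEN_FEED))
```

The convenience of that fallback is exactly the worry: once consumers quietly tolerate the bare ampersand, producers have little incentive ever to fix it, which is how HTML ended up with the numbers in the table above.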
PS: I’ve just worked out a few bugs in my weblog's RSS feed, enjoy.