Atom and its discontents

A new post for discussion of markup issues in feeds.

Guys...

…you need to stop using malformed auto-inserted glossary entries that are six miles long. They make the posts completely unreadable. Try reading this in a newsreader and see if you can get past the second paragraph. I put a picture of how it looks here in case you don’t have an Atom reader.

The feed validator says it’s valid but “may cause problems” for some users, pointing to incorrectly-encoded non-ASCII characters, but the glossary entries regularly foobar the feed, and it’s not like this stuff needs to be harder to read. Please fix the glossaries, preferably by stripping the internal HTML and making them shorter than, say, War and Peace.

An aggravated reader thanks you in advance.

—Matt

RSS feeds and glossaries

Thanks for the image.

From the image, the reader is displaying data in the title attribute that appears after a carriage return. Normally, the value of the title attribute shouldn’t appear at all, assuming that the quotes are balanced properly, so it looks to me like the RSS reader isn’t parsing the HTML properly; i.e., the reader is broken, not the HTML.

What reader are you using?

Readers, does anyone else have this problem?

No authoritarians were tortured in the writing of this post.

HTML improperly escaped

lambert: it’s the market-leading RSS reader, actually with improvements that haven’t been released yet.

I’ll submit it as a bug when I get the time, but according to section 3.1.1.2 of RFC 4287 (the Atom 1.0 spec), markup inside a text element must be escaped, and the CorrenteWire Atom feed doesn’t do that, maybe because it’s in CDATA (which is not mentioned in the Atom spec, just the XML spec).

I’ll see what I can find out, but it seems interesting that the entries always seem to break where there are linebreaks or markup in the glossary entry.

—Matt

HTML properly escaped on the page

I looked at the RFC and just looked again at the markup for the title attribute. All the entities are properly escaped. There are no tags (“child elements”) in the data. Linebreaks are just whitespace; they do not need to be escaped in HTML.

Now, for all I know, the Atom module is producing, for reasons of its own, different markup from that which is produced by the page builder. But the markup on the page is correct as is.

Sounds to me like I should abolish Atom for the front page and move to RSS.

Can you tell me if the same problem happens on the individual blogs, which use RSS, vs the front page, which uses Atom, for historical reasons?

NOTE And can we take this offline? This isn’t a suitable post for this discussion. Thanks.

No authoritarians were tortured in the writing of this post.

Ooh, didn't know you could move things

But, well done.

It’s not just entities that need to be escaped, it’s markup. You can’t use “<” directly in an element like “summary”; it has to be escaped as “&lt;”. So the summary shouldn’t have text like “<p>This is a paragraph</p>” in it; it should be “&lt;p>This is a paragraph&lt;/p>”.

(To make this display right, I used entities, but my meaning is as that paragraph renders in a browser, not as you’d read it in the HTML source for this comment.)

The fact that they’re in a CDATA section may excuse that, but I’m not sure. The feed validator doesn’t report it as invalid, but I don’t know why. I can’t ask the guy who knows until Tuesday.
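For what it's worth, an XML parser should hand back the same character data whether the markup is entity-escaped or wrapped in a CDATA section, so either form can carry HTML in an Atom element marked type="html". A minimal Python sketch using only the standard library; the two sample summary elements are made up for illustration and are not taken from the CorrenteWire feed:

```python
import xml.etree.ElementTree as ET
from html import escape

ATOM_NS = "http://www.w3.org/2005/Atom"
html_fragment = "<p>This is a paragraph</p>"

# Form 1: markup escaped with entities ("&lt;p&gt;...").
escaped_xml = (
    f'<summary xmlns="{ATOM_NS}" type="html">'
    f"{escape(html_fragment)}</summary>"
)

# Form 2: the same markup left unescaped inside a CDATA section.
cdata_xml = (
    f'<summary xmlns="{ATOM_NS}" type="html">'
    f"<![CDATA[{html_fragment}]]></summary>"
)

# After parsing, both forms yield the original HTML string.
escaped_text = ET.fromstring(escaped_xml).text
cdata_text = ET.fromstring(cdata_xml).text

print(escaped_text == cdata_text == html_fragment)
```

If that holds, CDATA by itself would not explain a rendering problem; what matters is whether the HTML inside it is well formed.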

I can’t tell you about the individual blogs; I only read the main Atom feed (one of 416 in the mix right now), and just want it to be as clean and easily-readable as possible. :-)

—Matt

Thank you

I know what markup is, I know what the difference between a tag and an entity is, I know what CDATA is and how to mark it up, and I know how to do the escaping. That’s not the issue.

What you’re telling me, I think, is that in the feed, as opposed to the pages, there’s unescaped markup. That looks like a flaw in the markup generated by the Atom module, so perhaps I should simply disable it. (Unfortunately, the JPG image doesn’t allow anybody to determine what markup is actually being received by the reader, which may itself be broken, since it’s not reporting any errors.)

Fortunately, nobody else is reporting the problem. Readers?

No authoritarians were tortured in the writing of this post.

I see it too...

…the glossary entries are making baby jesus (and/or Shrook) cry. They also hose Sharpreader.

Looks like the title attribute of the glossary link is so long it’s swizzling the readers.

Also looks like the glossary entry’s being inserted into the feed twice, which may be the actual problem…

As an experiment, I removed carriage returns from Clusterfuck

The title attribute is still very long, and it has escaped markup in it, but it doesn’t appear in the Atom feed, at least when I go to that URL in Firefox.

Can somebody search on Clusterfuck in their version of the feed, in the post titled “Bush: ‘I put loyalty to the country third on the list,’” and see if the problem is fixed, for that one glossary entry?

It looks to me like either the “industry standard” RSS reader doesn’t handle carriage returns in a way that conforms to the XHTML spec, or the Atom RFC doesn’t.
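The experiment described in this comment amounts to flattening the whitespace in a glossary definition before it goes into the title attribute. A rough Python sketch of that idea; `glossary_title` is a hypothetical helper written for illustration and is not part of the site's glossary module:

```python
import re
from html import escape


def glossary_title(term: str, definition: str) -> str:
    """Build a single-line, attribute-safe title value (hypothetical helper)."""
    # Collapse every run of whitespace, including CR/LF and blank lines,
    # into one space so nothing downstream sees a paragraph break.
    flat = re.sub(r"\s+", " ", f"{term}: {definition}").strip()
    # Escape &, <, > and quotes so the value cannot end the attribute early.
    return escape(flat, quote=True)


sample = glossary_title(
    "Example", 'First line.\r\n\r\nSecond "line" <b>here</b>.\n'
)
print(sample)
```

This only removes the trigger. If the feed generator rewrites blank lines inside attributes, that is still a bug on the generator's side.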

It's improved

I posted a picture of how it looks now here. I’ll ask about linebreaks inside HTML attributes inside CDATA inside atom:summary elements.

—Matt

Hey, Lam...

…would it be possible to break this again by putting the line breaks back into the “clusterfuck” glossary entry for a day? The author of the newsreader would like to see the broken source in action so he can see what’s going on.

—Matt

I can't reproduce the original exactly

(I should have turned on versioning, sigh.)

Presumably it is not a problem now, however.

No authoritarians were tortured in the writing of this post.

Nope, it's still a problem

It shows up in all the recent posts where a glossary entry has a carriage return in it, but it seems the CR only triggers the problem, because it doesn’t happen on the Web pages, just in the Atom stuff. My thanks to Brent for showing me the error:

In the CorrenteWire Atom feeds, the “summary” and “content” elements are marked as of type “html” and contain a big “CDATA” section that’s supposed to be valid, displayable HTML. When that HTML includes a glossary entry that also includes a CR, the Atom generator is inserting bogus </p> tags that mess everything up.

(In the following examples, I’ve used entities so the XML will display properly; if you look at the source for this message, you’ll have to substitute a real < character everywhere you see “&lt;”, of course. I’ve also deliberately misspelled some glossary terms so that the site doesn’t turn the ones in examples into new glossary entries, making it even harder to follow.)

So, in today’s item “Hey, come on! Stop picking on LeadFoot!”, in a Web browser, we get the following source:


<p>
<a href="http://www.martinirevolution.com/?p=70">“Wattles,”</a> forsooth. Make with the Civillity<a href="/glossary#term62" title="Civillity: A. Having been bullied by the VVRWC. Example: CBS brought a new level of civility to American political discourse by cancelling a mini-series on Reagan that wingers deemed insufficiently hagiographical.

"><img src="sites/all/modules/glossary/glossary.gif" /></a>, guys!
</p>
<p>
Doesn’t <a href="http://highclearing.com/index.php/archives/2007/04/29/6318">she have enough to deal with, now that Bush is drinking again</a>?
</p>

However, in the Atom feed, the “summary” and “content” elements both have the following HTML in the CDATA:


<p>
<a href="http://www.martinirevolution.com/?p=70">“Wattles,”</a> forsooth. Make with the Civillity<a href="/glossary#term62" title="Civillity: A. Having been bullied by the VVRWC. Example: CBS brought a new level of civility to American political discourse by cancelling a mini-series on Reagan that wingers deemed insufficiently hagiographical.

"><img src="sites/all/modules/glossary/glossary.gif" /></a><a href="/glossary#term62" title=" CBS brought a new level of civility to American political discourse by cancelling a mini-series on Reagan that wingers deemed insufficiently hagiographical.</p>
<p>">
</p>
<p>
”><img src="sites/all/modules/glossary/glossary.gif" /></a>, guys!
</p>
<p>
Doesn’t <a href="http://highclearing.com/index.php/archives/2007/04/29/6318">she have enough to deal with, now that Bush is drinking again</a>?
</p>

You see it? In the Atom version, the blank line in the glossary entry (the “title” attribute on a link) somehow makes your Atom generator insert <p> and </p> tags inside the title attribute, which is absolutely positively non-legal and unhappy.

That’s why it renders wackily in newsreaders - it’s bad HTML, but only in the Atom version. The bogus tags aren’t in the version you see in a browser, so I suspect a bug in the Atom generator. Since carriage returns in glossary entries are the biggest trigger, that’s where I was focusing, but I suppose it could come up in other situations.
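One way this kind of damage can arise is a line-break filter that wraps blank-line-separated chunks in paragraph tags without looking at the HTML structure. Whether that is what the Atom generator actually does is an assumption; nothing in this thread confirms it. The Python sketch below uses a made-up `naive_autop` function to reproduce the symptom on a shortened sample link, plus a small checker (also hypothetical) that flags attribute values containing paragraph tags:

```python
import re
from html.parser import HTMLParser


def naive_autop(html: str) -> str:
    """Wrap blank-line-separated chunks in <p> tags, ignoring HTML structure.

    Stand-in for whatever filter the feed generator applies (assumption).
    """
    chunks = [c.strip() for c in re.split(r"\n\s*\n", html) if c.strip()]
    return "\n".join(f"<p>{c}</p>" for c in chunks)


class AttributeTagFinder(HTMLParser):
    """Collect attribute values that contain <p> or </p> tags."""

    def __init__(self):
        super().__init__()
        self.bad_values = []

    def handle_starttag(self, tag, attrs):
        for _name, value in attrs:
            if value and re.search(r"</?p\b", value):
                self.bad_values.append(value)


def tags_inside_attributes(html: str) -> list:
    """Return every attribute value in html that contains a paragraph tag."""
    finder = AttributeTagFinder()
    finder.feed(html)
    finder.close()
    return finder.bad_values


# Shortened sample: a title attribute with a blank line in the middle.
source = (
    'Be civil<a href="/glossary#term" '
    'title="Term: first line.\n\nSecond line."></a>, guys!'
)
broken = naive_autop(source)

print(tags_inside_attributes(source))
print(tags_inside_attributes(broken))
```

The clean source should report nothing, while the filtered version should report the title value with the stray tags in it. Running a check like this over the feed's summary and content HTML would catch the problem regardless of which input triggers it.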

—Matt

Hi Lambert, I am concerned

Hi Lambert,

I am concerned about your recent decision to allow only registered users to post comments. I think it will really cut down on new readers coming to Corrente, and that’s a shame, because you have a lot to offer. I actually registered in order to send you this note (though I really didn’t want to have to do that!).

Here’s why I hesitate to sign up as a registered user of any newspaper, blog, website, etc.:

1) I’m not that good at the technical stuff so I am hesitant to sign up and give away my anonymity. I don’t really understand the ins and outs of blogging, and I want to protect myself and family from unpleasant contact and the selling of my e-mail address. I like the anonymity, sometimes use different names to post stuff, but have made a personal commitment to never abuse other readers or their trust.

2) I don’t consider myself a “member” of a blog, don’t have the commitment to register to each blog I read, etc. Yet when I do comment, it’s generally respectful. I take my cues from the tone and content of the blog I’m looking at. So if you keep the level of discourse civil and informed here, won’t most of the commenters follow your lead? Personally, I avoid any blog whose readers attack one another, use profanity, etc.

3) I initially started reading at Corrente BECAUSE it allowed me the occasional comment. Now, I’m more or less a regular reader and willing (though reluctant) to register. But for the first few months I wouldn’t have bothered to register. And let’s face it, most people get more engaged if they do more than just read. Most people like a little back and forth. But they want to get to know their love interest before moving in.

So please reconsider the registered user only policy. Thanks.

I don't have the time or the energy to filter the crap right now

dana b:

Thanks for your note, and I hear you.

However, the checks I get from the Hillary campaign and George Soros don’t compensate for the aggravation of being on call during my waking hours. I feel like a furnace filter must feel when it needs to be changed.

I know the risks, but there you are. The shitstorm died down at Digby’s, at which point she re-opened. No doubt the same will happen here.

[x] Any (D) in the general. [ ] Any mullah-sucking billionaire-teabagging torture-loving pus-encrusted spawn of Cthulhu, bless his (R) heart.

Corrente and Privacy

For what it’s worth, Dana B, one of the big issues for Corrente is privacy, so I don’t think you should worry about your email address being sold or revealed.

Yahoo has free email...

… and all we need it for is as a point of contact for administrative functions like password renewal.

An email address is a pretty weak form of identity. You probably should be a lot more worried about our friends in the Stasi NSA than about us.

[x] Any (D) in the general. [ ] Any mullah-sucking billionaire-teabagging torture-loving pus-encrusted spawn of Cthulhu, bless his (R) heart.

Asking registration to comment is reasonable

No one would be offended at being asked to identify themselves before someone opens their door, we’d just chalk it up to reasonable caution. Same thing with a blog, which to my mind is very like being allowed into a discussion group in someone’s living room. No one has a “right” to be admitted and everyone should expect to behave and be treated with at least minimally civilized behavior.

That, and IMHO anyone too lazy to take the time to register is also unlikely to put enough effort into their comment to make it worth reading. Call me an elitist, but fewer comments with a higher level of content is a fair tradeoff. On the other hand I don’t have to worry about traffic numbers so keeping my nose in the air comes without any personal cost.

You’re doing fine with this matter, Lambert; stick with your own best judgment.