Sporkmonger

purveyor of fabulously ambiguous eating utensils

Directory Of Feed Parsers

Posted by sporkmonger
Written February 27th, 2006

I’m only doing a comparison of parsers here, not feed readers or parsers embedded within feed readers that aren’t available as a separate download, although I suppose… I’m really using a very loose interpretation of the word “parser” here.


Parser Language Rating
13th RSS 1.0 to Anything PHP Useless
Only supports RSS 1.0. As such, not very useful.
Atom.NET .NET Poor
Supports only Atom 0.3. Badly.
CaRP / Grouper Evolution PHP Free: Useless; Commercial: Not Sure
CaRP has built-in caching support, but the cache can be difficult to set up. The free version of CaRP is decent for just displaying someone else’s content, but utterly useless for anyone who actually wants a proper parser. Apparently, Grouper Evolution has support for Atom, and if you want access to the actual data, the non-free API will give it to you.
FeedTools Ruby I’m Biased
I wrote it. I think it’s pretty good, and there’s a bunch of people who use it and seem to like it. It’s far from perfect, but it does a lot better than most. Which isn’t saying much.
Informa Java Halfway Decent
I haven’t used it, but from what I’ve seen of output from programs that do, it does a fairly good job. However, it doesn’t support Atom 1.0.
Jakarta FeedParser Java Halfway Decent
I haven’t used it. Output of programs that do seems to be pretty decent. Supports Atom up through the version 0.5 draft.
lastRSS.php PHP Useless
Doesn’t support any version of Atom and uses regular expressions to parse.
Magpie PHP Decent
Exposes the data pretty well, but can be difficult to use.
PEAR::Package::XML_Feed_Parser PHP Not Sure
Looks like one of the better PHP parsers around, at least on paper, but I haven’t used it, so I don’t want to call it “good” unless someone wants to vouch for it.
PEAR::Package::XML_RSS PHP Useless
Only supports RSS 1.0. As such, not very useful.
PyFeed Python Fair
Support for RSS 2.0 and Atom 1.0 parsing and generation. The code is still at a very early stage. Also supports OPML.
RSS.NET .NET Poor
No support for Atom of any kind.
Ruby Standard Library RSS Parser Ruby Poor
No support for Atom of any kind.
RDF (RSS) Parser PHP Useless
Only supports RSS 1.0. As such, not very useful.
Rome Java Good
Supports all of the major feed formats, including Atom 1.0. It’s a solid contender.
RSS-Parser PHP Useless
Doesn’t support Atom. Has caching support that requires MySQL.
rss2array PHP Useless
Judging from the code, this script will die on redirects. That’s kinda bad.
SimplePie PHP Good
Passes many of the Atom conformance tests, and can read all but the most obscure Atom edge cases. Parses RSS quite well. It’s almost certainly the best PHP-based parser right now.
Simple RSS Ruby Halfway Decent
Very, very flexible, but also easy to break.
Suttree PHP RSS parser PHP Useless
Not really a proper parser. Doesn’t seem to handle Atom.
TailRank FeedParser Java Decent
An upgraded version of the Jakarta FeedParser that handles Atom 1.0, among other things.
Universal Feed Parser Python Excellent
Considered by some to be the “golden standard in complete liberal feed parsing”, it offers more unit tests per square inch than all of the competing solutions combined. It’s still not quite perfect though, and does fail a few of the Atom conformance tests at the moment.
Untitled RSS Parser PHP Useless
Doesn’t support Atom, would require effort to adapt to other settings.
XML::RSS Perl Not Sure
I’ve never used it, and know nothing of its capabilities.
XML::RSS::Parser Perl Not Sure
I’ve never used it, and know nothing of its capabilities.
XML::RSSLite Perl Not Sure
I’ve never used it, and know nothing of its capabilities.

Or, more accurately, a list of things which claim to parse feeds but generally do a bloody terrible job of it. There are, however, a few on the list that actually manage to do reasonably well.

I will try to keep this list up to date, so let me know if I’ve missed a parser or if the links to any of the parsers on the list go dead. Feel free to argue with my ratings, since they’re completely subjective, or suggest a rating for the ones I’ve marked “Not Sure”, but be aware that I consider any parser that fails to parse Atom (or similarly, if it fails to parse RSS) to be “poor” unless the author also wrote an adequate sibling parser that does parse Atom.

Y2038 Feed

Posted by sporkmonger
Written December 28th, 2005

So, I’m curious, how many parsers/feed readers show an incorrect date for this feed or just blow up for that matter?

So far in my testing, NetNewsWire has been the only parser that correctly displays the date given in the feed. FeedTools can’t parse the date and reverts to the current date and time instead. So do Google Reader and Bloglines, as well as most of the online feed readers I tried it out with.

The Feed Validator does helpfully warn me of an “implausible date”, but the feed is perfectly valid Atom 1.0, so far as I can tell.

I’m honestly not too worried about the issue of parsers not being able to handle the concept of 2038 when it rolls around. By 2038, the concept of feeds will likely seem utterly obsolete. But I can’t help but wonder if some parsers will end up tossing an exception on this feed. I couldn’t find anything in Mark Pilgrim’s UFP test suite for dates being out of range. The largest date among the tests seems to be some time in 2004.
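For anyone wondering why 2038 in particular: a signed 32-bit Unix timestamp runs out of room on January 19th of that year. Here is a minimal sketch of the boundary, using only Ruby's standard `time` library. The `fits_in_int32?` helper and the sample `updated` value are made up for illustration; I am not claiming this is the reason any particular reader fails on the feed.

```ruby
require 'time'

# Largest second count a signed 32-bit time_t can hold.
MAX_INT32_SECONDS = 2**31 - 1

# The last moment representable in a signed 32-bit Unix timestamp.
last_32bit_moment = Time.at(MAX_INT32_SECONDS).utc
puts last_32bit_moment.xmlschema # 2038-01-19T03:14:07Z

# Hypothetical helper: would this time survive storage in a 32-bit time_t?
def fits_in_int32?(time)
  seconds = time.to_i
  seconds >= -(2**31) && seconds <= MAX_INT32_SECONDS
end

# Hypothetical Atom <updated> value, one second past the limit.
updated = Time.xmlschema('2038-01-19T03:14:08Z')
puts fits_in_int32?(updated)
```

A parser that stores dates in a 32-bit integer would have to reject, clamp, or wrap a value like `updated`, which is one plausible way to end up falling back to the current time.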

Chameleon Classes

Posted by sporkmonger
Written December 7th, 2005

One of the big advantages that Mark Pilgrim’s UFP has over FeedTools is in its *_detail methods. The UFP basically gives you a bunch more information without having to dive into XPath or some such. In a couple of obvious places, like the author field, FeedTools also does this, but it used to do it with an interesting trick.

I really didn’t like the idea of having two different methods to retrieve what was essentially the same piece of information. But I also didn’t like forcing people who were used to element names from RSS to use a less familiar Atom-like class structure.

So a while back, I made it so that there were these special funky classes, like the FeedTools::Author class, that were basically a structure containing the author’s name, email, url, and the raw data in the case of an RSS feed. So you could write:

require 'feed_tools'

feed = FeedTools::Feed.open(
  'http://sporkmonger.com/xml/atom10/feed.xml')
feed.author.name
# => 'Bob Aman'
feed.author.email
# => 'bob@sporkmonger.com'
feed.author.url
# => 'http://sporkmonger.com/'

But… you could also write:

feed = FeedTools::Feed.open(
  'http://sporkmonger.com/xml/atom10/feed.xml')
feed.author
# => 'Bob Aman'

This worked because I had used method_missing to relay any missing methods on the class to the name object, and had overridden the inspect method and a few others to do the same.
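To make the trick concrete, here is a rough sketch of what such a class can look like. This is not the code that shipped in FeedTools; `ChameleonAuthor` is a made-up name, and the set of overridden methods is just my guess at a reasonable minimum.

```ruby
# Hypothetical stand-in for a FeedTools::Author-style class.
class ChameleonAuthor
  attr_accessor :name, :email, :url

  def initialize(name, email = nil, url = nil)
    @name = name
    @email = email
    @url = url
  end

  # Pretend to be the name string when printed or inspected.
  def to_s
    @name.to_s
  end

  def inspect
    @name.inspect
  end

  # Relay anything this class does not define to the name string.
  def method_missing(method_name, *args, &block)
    if @name.respond_to?(method_name)
      @name.send(method_name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with what method_missing will accept.
  def respond_to_missing?(method_name, include_private = false)
    @name.respond_to?(method_name, include_private) || super
  end
end

author = ChameleonAuthor.new(
  'Bob Aman', 'bob@sporkmonger.com', 'http://sporkmonger.com/')
puts author.email  # a structured field defined on the class
puts author.upcase # not defined here, so it is relayed to the name string
puts author        # prints like the plain name
```

Note that the object still is not a String: `author.class` and `author.is_a?(String)` give the game away, which is part of what makes this kind of class surprising.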

I eventually removed it because the code was quite repetitive (though I know how to fix that at this point), but also because I was worried that someone might actually need one of those overridden methods for some reason, and also because I thought ‘chameleon classes’ might be surprising to some people. Actually, quite possibly very surprising. And you know, we supposedly want ‘least surprise.’

Now, the reason I’m bringing this up is because I wanted to get more information attached to the content element. I could change the API and have the content method return a class instead of a string, I could leave it the way it is and go with the Mark Pilgrim method, or I could return to the ‘chameleon classes’ only better written the second time round.

My personal preference is actually the shape changing niftiness, despite my reservations. Aesthetically, I think it’s the nicest looking solution, because it keeps the API much, much closer to the names of the elements in the respective syndication formats. Remember, our fearless leaders are of the opinion that beauty and productivity go hand in hand. But I worry, so I’m just going to ask you guys, what do you prefer, and why?

Oh yeah, and I’m strongly considering following Mark’s example and using ‘href’ everywhere for consistency.

Iconv issues

Posted by sporkmonger
Written September 25th, 2005

I released a new version of FeedTools to address a couple of nagging issues. First, I finally tracked down the second issue with uninitialized constants (the first one was RUBYOPT). Thanks Benoit Domingue for all the help with that! Anyways, apparently, you also get the same error (or a similar one) if the iconv library is missing. FeedTools now emits a warning if you’re missing the iconv library, instead of the completely unrelated ActiveSupport error you got with 0.2.8 and earlier.
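The general pattern is simple enough to sketch. This is not the actual FeedTools code; `optional_require` and the `ICONV_AVAILABLE` constant are hypothetical names, and the warning text is my own wording.

```ruby
# Hypothetical helper: try to load an optional library, and warn with a
# relevant message instead of letting an unrelated error surface later.
def optional_require(library, consequence)
  require library
  true
rescue LoadError
  warn "Warning: could not load '#{library}'; #{consequence}"
  false
end

# Record whether iconv loaded so later code can check before using it.
ICONV_AVAILABLE = optional_require(
  'iconv', 'character encoding conversion will be unavailable.')
```

Code that needs iconv can then check `ICONV_AVAILABLE` and skip the conversion rather than tripping over a missing constant.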

I also updated the feed generation code to fix some issues with missing namespaces. You should now always see validating output from the generation methods.

There is also now a dependency on UUIDTools because of the need to have proper unique ids for Atom output.
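For the curious, this is roughly the shape of id I mean. The sketch below uses `SecureRandom` from Ruby's standard library as a stand-in, so it does not show the UUIDTools API, and `new_atom_id` is a made-up helper name.

```ruby
require 'securerandom'

# Hypothetical helper: build a urn:uuid: identifier for an Atom <id> element.
# SecureRandom stands in for UUIDTools here.
def new_atom_id
  "urn:uuid:#{SecureRandom.uuid}"
end

entry_id = new_atom_id
puts entry_id
```

An Atom id has to stay the same for the life of the entry, so a random id like this should be generated once and stored, not regenerated each time the feed is built.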