Sporkmonger

purveyor of fabulously ambiguous eating utensils

Directory Of Feed Parsers

Posted by sporkmonger
Written February 27th, 2006

I’m only doing a comparison of parsers here, not feed readers or parsers embedded within feed readers that aren’t available as a separate download, although I suppose… I’m really using a very loose interpretation of the word “parser” here.


Parser | Language | Rating
13th RSS 1.0 to Anything | PHP | Useless
Only supports RSS 1.0. As such, not very useful.
Atom.NET | .NET | Poor
Supports only Atom 0.3. Badly.
CaRP / Grouper Evolution | PHP | Free: Useless; Commercial: Not sure
CaRP has built-in caching support, but the cache can be difficult to set up. The free version of CaRP is decent for just displaying someone else’s content, but utterly useless for anyone who actually wants a proper parser. Apparently, Grouper Evolution has support for Atom, and if you want access to the actual data, the non-free API will give it to you.
FeedTools | Ruby | I’m Biased
I wrote it. I think it’s pretty good, and there’s a bunch of people who use it and seem to like it. It’s far from perfect, but it does a lot better than most. Which isn’t saying much.
Informa | Java | Halfway Decent
I haven’t used it, but from what I’ve seen of output from programs that do, it does a fairly good job. However, it doesn’t support Atom 1.0.
Jakarta FeedParser | Java | Halfway Decent
I haven’t used it. Output of programs that do seems to be pretty decent. Supports Atom up through the version 0.5 draft.
lastRSS.php | PHP | Useless
Doesn’t support any version of Atom and uses regular expressions to parse.
Magpie | PHP | Decent
Exposes the data pretty well, but can be difficult to use.
PEAR::Package::XML_Feed_Parser | PHP | Not Sure
Looks like one of the better PHP parsers around, at least on paper, but I haven’t used it, so I don’t want to call it “good” unless someone wants to vouch for it.
PEAR::Package::XML_RSS | PHP | Useless
Only supports RSS 1.0. As such, not very useful.
PyFeed | Python | Fair
Support for RSS 2.0 and Atom 1.0 parsing and generation. The code is still at a very early stage. Also supports OPML.
RSS.NET | .NET | Poor
No support for Atom of any kind.
Ruby Standard Library RSS Parser | Ruby | Poor
No support for Atom of any kind.
RDF (RSS) Parser | PHP | Useless
Only supports RSS 1.0. As such, not very useful.
Rome | Java | Good
Supports all of the major feed formats, including Atom 1.0. It’s a solid contender.
RSS-Parser | PHP | Useless
Doesn’t support Atom. Has caching support that requires MySQL.
rss2array | PHP | Useless
Judging from the code, this script will die on redirects. That’s kinda bad.
SimplePie | PHP | Good
Passes many of the Atom conformance tests, and can read all but the most obscure Atom edge cases. Parses RSS quite well. It’s almost certainly the best PHP-based parser right now.
Simple RSS | Ruby | Halfway Decent
Very, very flexible, but also easy to break.
Suttree PHP RSS parser | PHP | Useless
Not really a proper parser. Doesn’t seem to handle Atom.
TailRank FeedParser | Java | Decent
An upgraded version of the Jakarta FeedParser that handles Atom 1.0, among other things.
Universal Feed Parser | Python | Excellent
Considered by some to be the “golden standard in complete liberal feed parsing”, it offers more unit tests per square inch than all of the competing solutions combined. It’s still not quite perfect though, and does fail a few of the Atom conformance tests at the moment.
Untitled RSS Parser | PHP | Useless
Doesn’t support Atom, would require effort to adapt to other settings.
XML::RSS | Perl | Not Sure
I’ve never used it, and know nothing of its capabilities.
XML::RSS::Parser | Perl | Not Sure
I’ve never used it, and know nothing of its capabilities.
XML::RSSLite | Perl | Not Sure
I’ve never used it, and know nothing of its capabilities.

Or, more accurately, a list of things which claim to parse feeds but generally do a bloody terrible job of it. There are, however, a few on the list that actually manage to do reasonably well.

I will try to keep this list up to date, so let me know if I’ve missed a parser or if the links to any of the parsers on the list go dead. Feel free to argue with my ratings, since they’re completely subjective, or suggest a rating for the ones I’ve marked “Not Sure”, but be aware that I consider any parser that fails to parse Atom (or, similarly, one that fails to parse RSS) to be “poor” unless the author also wrote an adequate sibling parser that handles the missing format.

Item Sorting Order

Posted by sporkmonger
Written October 18th, 2005

I just released FeedTools 0.2.16 with some handy new goodies. It should hopefully work with Ruby 1.8.3 now, and should properly sort items by date. (It used to occasionally leave the time field as nil, which would cause those entries to end up getting sorted to the very end.)

The parser now assumes that feeds are published in reverse chronological order, and so if a timestamp is missing, it tries to fix the problem and assign a timestamp of 1 second after the previous entry in order to maintain proper sorting order. It’s all nicely unit tested and should work for all feeds that aren’t intentionally designed to break it. (For example, cases where items 1 and 3 have no timestamp, but item 2 does.) I’ll probably deal with that situation in the next release, but since it’s highly unlikely to ever happen and has only mild side-effects, I’m not overly concerned. At worst, some items show up in the wrong order.
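The timestamp repair described above can be sketched roughly as follows. This is a minimal illustration, not FeedTools’ actual code: the `Entry` struct and both method names are hypothetical, and the sketch assumes that “previous entry” means the chronologically previous one, which in a newest-first feed is the next item down the list.

```ruby
# Hypothetical entry type for illustration; FeedTools' real item class differs.
Entry = Struct.new(:title, :time)

# Fill in missing timestamps, assuming +entries+ is in feed order
# (newest first). An entry with no time gets the time of the entry
# just below it (its older neighbour) plus one second.
def fill_missing_times(entries)
  # Walk from the bottom of the feed (oldest) up to the top (newest),
  # so that runs of consecutive undated entries are filled in one pass.
  (entries.length - 2).downto(0) do |i|
    entry = entries[i]
    older = entries[i + 1]
    next unless entry.time.nil? && older.time
    entry.time = older.time + 1 # Time + 1 adds one second
  end
  entries
end

# Sort newest first; anything still undated ends up at the very end.
def sort_newest_first(entries)
  dated, undated = entries.partition { |e| e.time }
  dated.sort_by { |e| e.time }.reverse + undated
end

base = Time.utc(2005, 10, 18, 12, 0, 0)
entries = [
  Entry.new("newest", nil),      # timestamp missing in the feed
  Entry.new("middle", base),
  Entry.new("oldest", base - 60)
]

fill_missing_times(entries)
puts sort_newest_first(entries).map(&:title).join(", ")
# prints: newest, middle, oldest
```

An undated entry at the very bottom of the feed has no older neighbour to borrow from, so in this sketch it stays nil and sorts last.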

More importantly, if you’re playing around with the Universal Subscription Mechanism, FeedTools should properly figure out what the real url of the feed is. It will override the default feed url with the url within the feed if you gave FeedTools a url that doesn’t use http or https as the protocol. For example, a feed accessed via the file protocol that’s been stored in some temporary folder somewhere. Obviously, if you can retrieve the feed over http, you don’t really need to worry about correcting the subscription url, so it won’t override it in those cases.
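As a rough sketch of that rule (not the library’s real implementation), the decision comes down to checking the scheme of the URL you started with. The method name `effective_feed_url` and the example URLs are made up for illustration.

```ruby
require "uri"

# Decide which URL to treat as the feed's subscription URL.
# +requested_url+ is what the caller passed in; +url_in_feed+ is the
# self-reference found inside the feed document (may be nil).
def effective_feed_url(requested_url, url_in_feed)
  scheme = URI.parse(requested_url).scheme
  # Retrieved over http or https: the requested URL is already usable.
  return requested_url if %w[http https].include?(scheme&.downcase)
  # Anything else (file, for example): prefer the URL the feed declares.
  url_in_feed || requested_url
end

puts effective_feed_url("file:///tmp/feeds/example.xml",
                        "http://example.com/feed.xml")
puts effective_feed_url("http://example.org/index.xml",
                        "http://example.com/feed.xml")
```

The first call picks the URL from inside the feed, the second keeps the URL that was requested.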

Atom 0.3 became a headache this time round. As it turns out, there’s a namespace issue of sorts between Atom 1.0 and Atom 0.3. Or well, at least, they have different namespaces, but the same element names, so I have to check for the elements twice, once with the Atom 1.0 namespace, and then again with the Atom 0.3 namespace. Kind of irritating. But less so than, say, RSS, where there’s no namespace at all.

Speaking of which, I thought I should mention that, even if someone has messed up namespaces, FeedTools usually still won’t break. FeedTools checks for the elements it’s looking for, both in a namespace aware mode, as well as a namespace ignorant mode.
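To give a feel for the double lookup, here is a small sketch using REXML. It is an assumption-laden illustration rather than what FeedTools does internally: the method name `find_feed_title` is hypothetical, and it only looks for a feed-level title in an Atom document.

```ruby
require "rexml/document"

ATOM_10_NS = "http://www.w3.org/2005/Atom"
ATOM_03_NS = "http://purl.org/atom/ns#"

# Look for the feed-level <title>, first namespace-aware (Atom 1.0,
# then Atom 0.3), then ignoring namespaces altogether.
def find_feed_title(xml)
  doc = REXML::Document.new(xml)

  # Namespace-aware pass: same element names, two different namespaces.
  [ATOM_10_NS, ATOM_03_NS].each do |ns|
    node = REXML::XPath.first(doc, "/atom:feed/atom:title", { "atom" => ns })
    return node.text if node
  end

  # Namespace-ignorant pass: match on the local element name only, so a
  # feed with a wrong or missing namespace still yields a title.
  node = REXML::XPath.first(doc, "/*/*[local-name()='title']")
  node && node.text
end

atom10 = %(<feed xmlns="#{ATOM_10_NS}"><title>One Point Oh</title></feed>)
atom03 = %(<feed xmlns="#{ATOM_03_NS}" version="0.3"><title>Oh Point Three</title></feed>)
broken = %(<feed xmlns="http://example.com/not-atom"><title>Messed Up</title></feed>)

[atom10, atom03, broken].each { |xml| puts find_feed_title(xml) }
```

Checking the namespaced forms first means a correctly namespaced element wins over a same-named element from some unrelated extension.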

Anyways.

Namespace Awareness Week

Posted by sporkmonger
Written October 1st, 2005

FeedTools 0.2.15 includes the beginnings of namespace support. Namespace awareness was one of the last few features on my list for 0.2. Once those are fully in place, I’ll be moving on to more advanced, higher-level features like subscription management with OPML import/export.

Version 0.3 will likely include some significant API changes, just as the jump from 0.1 to 0.2 did. Just a heads-up.

Still remaining on the to-do list for 0.2:

  • full namespace awareness
  • proper encoding/charset handling with iconv
  • dealing with relative urls
  • dealing with malformed feeds with multiple root nodes

(i.e. All the low-hanging fruit has already been picked.)

Patches are, of course, welcome! The parts of the code that need work are marked clearly with TODO in big all-caps letters. The unit tests for Atom support could really use some work too. By which I mean, there are only one or two tests right now. It’s definitely the most undertested portion of the library.

Oh, yeah. And the API page has returned after a long hiatus due to lighttpd config file woes. (I made the mistake of copying from an example config file that contained a couple of unnecessary url rewrites. Those rewrites munged up the index files and the redirects and generally wreaked havoc on all things not-Rails.) Apologies for not getting that back into working order sooner.

Iconv issues

Posted by sporkmonger
Written September 25th, 2005

I released a new version of FeedTools to address a couple of nagging issues. First, I finally tracked down the second issue with uninitialized constants (the first one was RUBYOPT). Thanks to Benoit Domingue for all the help with that! Anyways, apparently, you also get the same error (or a similar one) if the iconv library is missing. FeedTools now issues a warning if the iconv library is missing, instead of raising the completely unrelated ActiveSupport error that you get with 0.2.8 and earlier.
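The general pattern for this kind of soft dependency looks something like the sketch below. The helper name `optional_require` and the warning text are invented for illustration; FeedTools’ own code may be arranged quite differently.

```ruby
# Try to load an optional library. Returns true if it loaded, or false
# (after printing a warning to stderr) if it is not installed, so a
# missing library shows up by name instead of as an unrelated error.
def optional_require(library)
  require library
  true
rescue LoadError
  warn "Warning: the '#{library}' library could not be loaded; " \
       "features that depend on it will be unavailable."
  false
end

ICONV_AVAILABLE = optional_require("iconv")
puts "iconv available: #{ICONV_AVAILABLE}"
```

Code that needs character set conversion can then check `ICONV_AVAILABLE` and skip the conversion, rather than hitting an uninitialized constant somewhere further down the line.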

I also updated the feed generation code to fix some issues with missing namespaces. You should now always see validating output from the generation methods.

There is also now a dependency on UUIDTools because of the need to have proper unique ids for Atom output.