Sporkmonger

purveyor of fabulously ambiguous eating utensils

FeedUpdater and FeedTools

Posted by sporkmonger
Written April 13th, 2006

Sitting here at Canada on Rails… Just released two new gems, FeedTools 0.2.24 and FeedUpdater 0.1.0.

I rewrote the HTTP retrieval code for FeedTools, and it now has full support for HTTP proxies as well as basic auth. Even better, I pulled all the HTTP stuff out of the load_remote_feed! method and put it into its own helper module so that the advanced HTTP stuff is available throughout FeedTools. Additionally, somewhere in the midst of the rewrite, the strange timeouts that would sometimes occur when caching was enabled went away as if by magic. Yay.

The other release, FeedUpdater, is a handy new tool I wrote for correctly using FeedTools in a Rails app. It doesn’t have to be used within a Rails app, but that’s certainly where it will work best. It probably won’t work on Windows though, due to a lack of fork().

To use it, run sudo gem install feedupdater, then cd to your Rails app and run feed_updater install. This will install the feed_updater script into your Rails app’s script directory, add a new config file for controlling it, and unpack the gems for FeedTools and FeedUpdater into your vendor directory. You will also need to write a new updater script for FeedUpdater. You’ll probably want to put this in your lib folder and point to it from the config/feed_updater.yml config file.

Note: The install command will overwrite any existing feedtools or feedupdater directory in the vendor directory.

Here’s an example file (included with FeedUpdater, in the example directory):

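class CustomUpdater < FeedTools::FeedUpdater
  # Copy data from the parsed feed into your own tables here.
  on_update do |feed, seconds|
    logger.info(
      "Loaded '#{feed.href}'.")
    logger.info(
      "=> Updated (#{feed.title}) in #{seconds} seconds.")
  end

  # Log the href that failed, followed by the error itself.
  on_error do |href, error|
    logger.info("Error updating '#{href}':")
    logger.info(error)
  end

  # Nothing to do on completion in this example.
  on_complete do |updated_feed_hrefs|
  end
end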

You would use the on_update method to copy data from the parsed feed into whatever database tables or other locations where it’s required.
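
As a rough sketch of what the body of such an on_update block might do, the snippet below copies a couple of fields from a feed into a plain hash keyed by href. The FakeFeed struct, the FEED_STORE hash, and the copy_feed_fields helper are all stand-ins invented for illustration; a real updater would receive a parsed FeedTools feed and would most likely write to one of your own models. Only href and title are taken from the example above.

```ruby
# Stand-in for the feed object passed to on_update (an assumption for
# illustration). Only href and title appear in the example above.
FakeFeed = Struct.new(:href, :title)

# Hypothetical in-memory "table", keyed by feed href. In a Rails app
# this would be one of your own models instead.
FEED_STORE = {}

# Copy the fields we care about out of the feed and into our own storage.
def copy_feed_fields(feed, store = FEED_STORE)
  row = (store[feed.href] ||= {})
  row[:title] = feed.title
  row[:updated_at] = Time.now
  row
end

# Inside a custom updater this would be called as:
#   on_update do |feed, seconds|
#     copy_feed_fields(feed)
#   end
copy_feed_fields(FakeFeed.new('http://example.com/feed.xml', 'Example Feed'))
```

Keying on the href means a later update of the same feed overwrites the existing row rather than adding a second one.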

Then once you’ve got your custom updater script, cd to your Rails app and run script/feed_updater start.

I will likely update FeedUpdater again fairly soon.

Update: Now, with multithreading!

  1. Mike B:
    Written April 13th, 2006 at 04:03 PM

    Thanks! I just started using the last version of Feedtools for an app I’m building, and it is working well so far.

  2. Jan S:
    Written April 13th, 2006 at 04:49 PM

    That’s great! I was wondering how I would better refactor updating feeds in my rails app, and this seems just excellent.

  3. Written April 13th, 2006 at 09:12 PM

    That’s great, because we were unable to get your feed from RubyCorner because of the proxy issue. I hope that we will be testing the new version soon.

  4. Written April 14th, 2006 at 05:19 PM

    Great, I should really look into refactoring my code with that. I’ve been using your feedtools-0.2.18 for some time at my pet project http://feedmo.com .

  5. Written April 14th, 2006 at 05:53 PM

    Oh yikes, 0.2.18? That’s veritably ancient at this point. Definitely need to upgrade that…

  6. Ross Karchner:
    Written April 15th, 2006 at 12:50 PM

    I’ll post again if I figure it out, but feed_updater start (pointed at the example updater) keeps giving me:

    “Unexpected LoadError, it is likely that you don’t have one of the libraries installed correctly.”

    This happens before it parses the example script (if I point load_script: to a fake path, I get the above error AND “Could not locate “fake/custom_updater.rb” file.”).

    I’m still poking around the source for any gems I’m missing.

    Are there dependencies that aren’t defined in the gem?

  7. Ross Karchner:
    Written April 15th, 2006 at 01:07 PM

    Figured it out—

    script/feed_updater is hardcoded with:

    #!/usr/local/bin/ruby

    which on my system is the wrong ruby executable, since I’m using darwinports (/opt/local)

  8. Ross Karchner:
    Written April 15th, 2006 at 03:11 PM

    Last comment here, I promise…

    It would be cool if:

    • feeds that return 304 Not Modified would NOT call the on_update block.
    • given the above, delay actually parsing the feed until you know that you will be calling on_update (feed_updater would drive the http request and manipulate the cache directly)

    (I can send you a patch that accomplishes the first item in a very blunt, ugly way)

  9. Written April 15th, 2006 at 03:42 PM

    I agree. I’ll put it in right now.

  10. Written April 17th, 2006 at 01:34 PM

    Oh, and Ross, FeedTools does lazy parsing, so the delay thing is already in there to some extent.

  11. Stefan:
    Written May 5th, 2006 at 05:35 PM

    Hi Bob,

    I’m trying to get feedupdater running, without success. The page for feedupdater on your blog seems to be down, and as I am a newbie to Ruby on Rails it’s quite hard for me.

    The log always says:

    Not connected to the feed cache. Please create database.yml

    database.yml is there and seems to be ok. Do I need to create a special table for the feeds ?

    Stefan

  12. Written May 5th, 2006 at 05:42 PM

    Stefan, yes, you need to set up the cache for FeedTools to use the FeedUpdater. Use the migration file supplied with FeedTools in the db directory to create the table.

  13. zidoing:
    Written May 15th, 2006 at 01:45 PM

    I use feedtools-0.2.24, but I get the following error:

    undefined method `configurations’ for FeedTools:Module

    C:/ruby/lib/ruby/gems/1.8/gems/feedtools-0.2.24/lib/feed_tools/feed.rb:63:in `open'
    #{RAILS_ROOT}/app/controllers/rss_controller.rb:7:in `index'
    -e:4:in `load'
    -e:4

    why?

  14. Written May 15th, 2006 at 02:39 PM

    zidoing: That should never happen. The only way I can imagine that it might occur is if you did require 'feed_tools/feed' but failed to do require 'feed_tools'. You may want to make sure that all of your require lines completed successfully and that no LoadError exceptions were thrown.

  15. Kevin:
    Written September 19th, 2006 at 12:08 AM

    I’m not sure if I understand exactly how FeedUpdater works in conjunction with FeedTools.

    Am I right in assuming that FeedUpdater checks for feed updates, and then, if there’s an update, loads the new version in the cache, and when a web page that uses FeedTools loads, it pulls the feed information from the cache?

    I’ve noticed if a feed has a new entry added, cached versions in the database won’t update until someone loads the page using FeedTools in their browsers. Is this how it should work?

  16. Written September 19th, 2006 at 12:48 PM

    Kevin:

    Close, but not exactly.

    FeedUpdater is a daemon that is designed to remove all of the feed parsing work from the request/response part of your application. FeedUpdater allows you to define a custom updater that gives your application hooks so that you are able to copy the parsed fields that you need into separate database tables. You may have noticed that FeedTools doesn’t store very much parsed information in its cache. This is because the intent has been for the applications to copy the parsed data into their own tables. FeedUpdater merely makes it much easier to do that outside of the request/response part of your application so that users never see a delay caused by feed retrieval and parsing.

  17. Written September 19th, 2006 at 05:28 PM

    Been using this, had some code changes to offer to get the config to load, and use rails app model objects in your custom script…but comments won’t allow – so posted them to my blog:

    http://kookster.blogspot.com/2006/09/feedtools-and-feedupdater-fun.html

  18. Written September 20th, 2006 at 02:22 AM

    AndrewK:

    That’s some decently good advice for end-users, though it would work just as well to put the require line at the top of the custom updater file. It doesn’t have to be in on_begin.

    I am a little confused about why it was necessary to navigate up the directory tree from within the vendor installation, since the executable script that contains the logic for finding the config file should be in the script directory. I added something similar to what you had anyways, just in case though.

  19. Written September 20th, 2006 at 08:59 AM

    Gimme a little credit – I tried putting it at the start of the custom updater first, but then feedtools didn’t work – for example, it no longer found the db cache table. Couldn’t figure out why; figured it must have to do with the order of the require calls, or changes to the Ruby path.

    Yes, the feed_updater is in the app’s script directory, but the feed_updater.rb file is down in the vendor directory, so when it goes on the ‘hunt’ for a relative path, or uses __FILE__, that is the directory it was looking in, so I had to get it up a few extra directories. Perhaps it behaves differently in another OS/Ruby version? I am using 1.8.4 on Mac OS.

  20. Written September 20th, 2006 at 11:05 AM

    To add a little detail to the problem with the file loading – what I found was that the FileStalker defined on line 35 of feed_updater does find files relative to the scripts dir. But then line 60 hits, and the feed_updater module is loaded, and the FileStalker used to find the feed_updater.yml is no longer the FileStalker class defined in the feed_updater script, it has been redefined to be the FileStalker from the vendor/feed_updater/lib/feed_updater.rb, and so now files found are relative to that file. As far as I can tell, that was what happened.

  21. Written September 20th, 2006 at 04:35 PM

    AndrewK:

    Oh, wow, yeah, you’re right… Good call. Easy fix though.

  22. Written September 20th, 2006 at 04:41 PM

    AndrewK:

    As for why it lost the database.yml file, check line 353 of feed_updater.rb.

  23. Ryan:
    Written October 2nd, 2006 at 04:07 PM

    I’m relatively new to Rails and I’m not sure exactly how to get FeedUpdater working. If I want to simply display a feed on a web page, what code do I need to put in the custom updater, and then in the controller for my web page, for this to work properly? Thanks in advance.

  24. Written October 2nd, 2006 at 08:31 PM

    Ryan:

    Depends on what you’re trying to accomplish. It won’t be as simple as copy-paste, and probably shouldn’t be something you try right off the bat if you’re completely new to Ruby and Rails. In fact, if you’re still just learning Ruby, I’d cut FeedUpdater out of the mix until you know what you’re doing.

    Displaying a feed in Rails using only FeedTools was covered by the tutorial entry I wrote awhile back. Just search the site for “Tutorial”.

    If you’re sure you really do need FeedUpdater though, you should create a new model for storing the fields from the feeds you’re loading. Then in your custom updater, you should require the Rails boot file and the relevant model files. The on_update block should then copy the fields of the feed passed into the block into the fields you set up on your model. Then back in the Rails environment, display the model just like you would any other model.

  25. Brent:
    Written December 13th, 2006 at 05:19 PM

    What specifically do you mean by “Update: Now, with multithreading!”? FeedTools has a note “Update: Apparently, threads aren’t so hot when a database-backed cache is involved because it opens up tons of extra connections to the database.”, has that been resolved?

  26. Written December 13th, 2006 at 08:20 PM

    Brent:

    FeedTools does not work well in a multithreaded Rails environment because Rails itself doesn’t work well in a multithreaded Rails environment.

    FeedUpdater, however, is not a Rails environment. Unlike Rails, FeedUpdater doesn’t work in a request-response cycle, so it only opens as many connections as there are threads. Rails tends to open connections for every thread and then often doesn’t close them when the thread exits, or something to that effect. I don’t remember the exact details. But the effect is that you have 100 open connections to the database. FeedUpdater will only open 5 connections if the number of threads is set to 5.

  27. Jason:
    Written December 26th, 2006 at 11:28 AM

    From the on_update() function in the custom updater script, is it possible to access the feed’s ID record?

    My custom updater loads all feeds from feeds table. On update, it creates or updates the feed_items table. However, I need to assign feed_items.feed_id to the feed table’s ID.

    Thanks!

  28. Written December 26th, 2006 at 12:35 PM

    Jason:

    I can’t think of any reason that wouldn’t be possible. The on_update block gives you a reference to the updated feed, and you can get the id from there and assign it to whatever you need.

  29. Jason:
    Written December 26th, 2006 at 05:34 PM

    Thanks Bob for the reply!

    In my custom updater script, on_begin() sets self.feed_href_list with an array of URLs. However I think I need to set the feed’s IDs here somehow so that the on_update() hook can know which feed is being updated. By ID, I mean the feed’s primary key ID record, not the ID returned in the returned XML feed.

    My structure is as follows:

    • feeds – has_many :feed_items
    • feed_items – belongs_to :feeds
    • cached_feeds

    Hope this makes sense, and I can supply some actual code examples if that will help. Thanks so much!

  30. Written December 27th, 2006 at 02:12 PM

    Jason:

    If you have the hrefs stored in your feeds table (which you should), you can just do a find_by_href to get the right row.

  31. Jason:
    Written December 27th, 2006 at 04:19 PM

    Thanks Bob. With the additional Feed.find_by_href(), I was able to get the record I needed.

    However, my next hurdle is figuring out how to keep the list of feed URLs current. (feed_href_list)

    When starting the daemon, it loads them all properly and holds the list in memory. But—if this list changes (such as a user adding a new feed URL), this new URL never gets polled by the daemon until it’s restarted again. Is there a way to hook into FeedUpdater so that I can reload feed_href_list each time it polls, to keep this list synchronized with the db. (I’m obviously storing my Feed URLs in a database that will be updated by users)

    Thanks again.

  32. Written December 28th, 2006 at 11:36 AM

    Jason:

    on_begin is run every time an update goes. Just update the list there.

  33. Brent:
    Written January 4th, 2007 at 05:20 PM

    I have a question similar to Jason’s, but with a slightly different slant.

    1) on_begin loads self.feed_href_list with all the feed HREFs from my feed table
    2) on_update
      - finds my feed by HREF
      - creates feed_items in my table

    However, it looks like if FeedTools follows a redirect for the feed, then the feed_tool.href is updated to the redirected HREF and doesn’t match the one I passed it, so I can’t do a find_by_href on my table. Any suggestions?

  34. Written January 5th, 2007 at 01:54 PM

    Brent:

    Um, that’s a good point. Given that, I’d recommend a) making sure you retrieve the feed at least once before putting the URI of the feed into your database so that the actual URI is resolved ahead of time and b) don’t key off the URI if you can help it since obviously, it’s subject to redirections and changes. The feed’s id, however, should never change. At least in theory. In practice, you’re going to have to key off both of them. Search for the URI first, and if you can’t find it, then search for the id.

  35. Brent:
    Written January 9th, 2007 at 10:47 AM

    Got another issue. Is this the right place to post these? I’m basically running the generic custom_updater.rb and FeedUpdater doesn’t seem to reliably timeout for unreachable URLs. Some timeout properly but custom_updater just hung indefinitely this morning on http://rss.csmonitor.com/feeds/books when it seemed to be briefly unreachable. Is there a way to set another timeout regardless of the connection error? Or do you have any debugging suggestions?

  36. Written January 10th, 2007 at 05:49 PM

    Brent:

    I don’t see an obvious solution for dealing with this issue. I could add another timeout block around the open method… but technically, it already has one around the HTTP request itself, so I don’t think that’d help. Are you sure it’s not a fluke?

  37. Brent:
    Written January 12th, 2007 at 01:15 PM

    Bob, I did some further research and actually my host is killing the batch process. Can anyone suggest a reasonably priced Rails host that allows long-running processes such as this?

  38. Written January 12th, 2007 at 05:46 PM

    Brent:

    That doesn’t surprise me at all. However, if you’re paying less than $40 a month, you’re not likely to find one because of the adverse effect a system like this can have on your neighbors.

  39. Robert:
    Written January 23rd, 2007 at 08:26 AM

    Bob,

    In post #24 above, you said,

    ”. . . you should create a new model for storing the fields from the feeds you’re loading. Then in your custom updater, you should require the Rails boot file and the relevant model files.”

    Could you be more specific about how to do this? I’m getting an error ‘uninitialized constant CustomUpdater::FeedThing’ in the feed_updater log file. FeedThing refers to the model/database table I am attempting to save the parsed feed to. I believe the correct ‘require’ code will fix my problem.

  40. Written January 23rd, 2007 at 05:15 PM

    Robert:

    Yeah, everyone seems to run into that. I should probably make this easier.

    I believe this should do the trick:

    $:.unshift(File.expand_path(
      File.join(File.dirname(__FILE__), '..')))
    $:.uniq!
    require 'config/environment'
    require 'app/models/feed_thing'
    
  41. Robert:
    Written January 24th, 2007 at 02:44 AM

    Bob, Thanks for the ‘require’ code above. I really appreciate the help. Based on the above I was able to get feed_updater working.

    However, in my case I was unable to use the line:

    require 'config/environment'

    Instead I used this line in its place:

    require 'config/boot'

    I am placing these ‘require’ lines above the ‘class CustomUpdater < . . .’ line.

    If I used the ‘environment’ line above, feed_updater would give the following error:

    FeedUpdater Using environment: development
    FeedUpdater Not connected to the feed cache.
    FeedUpdater The FeedTools cache table is missing.
    FeedUpdater Daemon stopped.

    My Ruby on Rails is not that great yet, so I am unsure why I was not able to use the ‘environment’ line.

    Also, to help anyone else who might be having trouble getting this up and running. For the longest time I was trying to start feed_updater like I started a mongrel daemon. Don’t do that. Make sure you run it with this command from your application directory:

    ruby script/feed_updater start

    Again, Thanks. And thanks for FeedTools.

  42. Written April 7th, 2008 at 07:39 AM

    Hi,

    I’m having trouble installing feed_updater in my app directory. I get the following error:
    steve-odoms-computer:~/Development/blog steveodom$ feed_updater install
    /usr/local/lib/ruby/gems/1.8/gems/feedupdater-0.2.5/lib/feed_updater.rb:60: undefined method `require_gem' for main:Object (NoMethodError)
            from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `gem_original_require'
            from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `require'
            from /usr/local/lib/ruby/gems/1.8/gems/feedupdater-0.2.5/bin/feed_updater:60
            from /usr/local/bin/feed_updater:19:in `load'
            from /usr/local/bin/feed_updater:19
    
    I looked at line 60 of feed_updater.rb and that line is referencing the daemons gem. My version of that gem is 1.0.10.

    FeedTools is working fine.

    any suggestions?

    Thanks.
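
The lookup order suggested in comment 34 (search for the URI first, and if that fails, search for the feed’s id) could be sketched roughly as follows. Everything here is hypothetical: FeedRow stands in for a row of your own feeds table, find_feed_row is an invented helper, and in a Rails app the two Array#find calls would be model lookups such as find_by_href. Writing the new URI back onto the row is an extra step that the thread does not mention.

```ruby
# Hypothetical row of an application's own feeds table:
# primary key, stored URI, and the id taken from the feed itself.
FeedRow = Struct.new(:id, :href, :feed_id)

# Look the feed up by URI first. If a redirect has changed the URI,
# fall back to the feed's own id and remember the new URI (an assumption,
# added so that the next lookup by URI succeeds).
def find_feed_row(rows, href, feed_id)
  row = rows.find { |r| r.href == href }
  return row if row

  row = rows.find { |r| r.feed_id == feed_id }
  row.href = href if row
  row
end

rows = [FeedRow.new(1, 'http://example.com/old.xml', 'tag:example.com,2006:feed')]
moved = find_feed_row(rows, 'http://example.com/new.xml', 'tag:example.com,2006:feed')
puts moved.id
```

If neither lookup matches, the helper returns nil, which leaves it to the caller to decide whether an unknown feed should be created or ignored.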
