FeedUpdater and FeedTools
Sitting here at Canada on Rails… Just released two new gems, FeedTools 0.2.24 and FeedUpdater 0.1.0.
I rewrote the HTTP retrieval code for FeedTools, and it now has full support for HTTP proxies as well as basic auth. Even better, I pulled all the HTTP stuff out of the load_remote_feed! method and put it into its own helper module so that the advanced HTTP stuff is available throughout FeedTools. Additionally, in the midst of the rewrite, the strange timeouts that would sometimes occur when caching was enabled went away as if by magic. Yay.
There is also a nice handy new tool I wrote for correctly using FeedTools in a Rails app. It doesn’t have to be used within a Rails app, but certainly, that’s where it will work best. It probably won’t work on Windows though, due to a lack of fork().
To use it, run sudo gem install feedupdater, then cd to your Rails app and run feed_updater install. This installs the feed_updater script into your Rails app’s script directory, adds a new config file for controlling it, and unpacks the gems for FeedTools and FeedUpdater into your vendor directory. You will also need to write a new updater script for FeedUpdater. You’ll probably want to put this in your lib folder, and point to it with the config/feed_updater.yml config file.
Note: The install command will overwrite the currently existing feedtools or feedupdater directory in the vendor directory.
Here’s an example file (included with FeedUpdater, in the example directory):
    class CustomUpdater < FeedTools::FeedUpdater
      on_update do |feed, seconds|
        logger.info("Loaded '#{feed.href}'.")
        logger.info("=> Updated (#{feed.title}) in #{seconds} seconds.")
      end

      on_error do |href, error|
        logger.info("Error updating '#{href}':")
        logger.info(error)
      end

      on_complete do |updated_feed_hrefs|
      end
    end
You would use the on_update method to copy data from the parsed feed into whatever database tables or other locations where it’s required.
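As a rough sketch of what that copying step might look like, the snippet below pulls a couple of fields out of a feed object into a plain hash that could then be handed to whatever model or table your app uses. The StubFeed struct and the feed_attributes helper are hypothetical names made up for illustration and are not part of FeedTools or FeedUpdater; only the href and title accessors come from the example above.

```ruby
# Hypothetical stand-in for a parsed feed. It only responds to the two
# accessors used in the example updater above (href and title).
stub_feed = Struct.new(:href, :title)

# Hypothetical helper: collect the fields you care about into a hash
# that can be passed to your own storage layer.
def feed_attributes(feed)
  {
    :href  => feed.href,
    :title => feed.title.to_s.strip
  }
end

feed = stub_feed.new("http://example.com/feed.xml", "  Example Feed ")
attrs = feed_attributes(feed)
puts attrs.inspect
```

Inside an on_update block, the same hash could be passed to your own model's create or update call.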
Then once you’ve got your custom updater script, cd to your Rails app and run script/feed_updater start.
I will likely update FeedUpdater again fairly soon.
Update: Now, with multithreading!
Thanks! I just started using the last version of Feedtools for an app I’m building, and it is working well so far.
That’s great! I was wondering how I would better refactor updating feeds in my rails app, and this seems just excellent.
That’s great, because we were unable to get your feed from RubyCorner because of the proxy issue. I hope that we will be testing the new version soon.
Great, I should really look into refactoring my code with that. I’ve been using your feedtools-0.2.18 for some time at my pet project http://feedmo.com .
Oh yikes, 0.2.18? That’s veritably ancient at this point. Definitely need to upgrade that…
I’ll post again if I figure it out, but feed_updater start (pointed at the example updater) keeps giving me:
“Unexpected LoadError, it is likely that you don’t have one of the libraries installed correctly.”
This happens before it parses the example script (if I point load_script: to a fake path, I get the above error AND “Could not locate “fake/custom_updater.rb” file.”).
I’m still poking around the source for any gems I’m missing.
Are there dependencies that aren’t defined in the gem?
Figured it out—
script/feed_updater is hardcoded with: #!/usr/local/bin/ruby which on my system is the wrong ruby executable, since I’m using darwinports (/opt/local)
Last comment here, I promise…
It would be cool if:
(I can send you a patch that accomplishes the first item in a very blunt, ugly way)
I agree. I’ll put it in right now.
Oh, and Ross, FeedTools does lazy parsing, so the delay thing is already in there to some extent.
Hi Bob,
I’m trying to get feedupdater running without success. The page for feedupdater on your blog seems to be down, and as I am a newbie on Ruby on Rails it’s quite hard for me.
The log always says:
Not connected to the feed cache. Please create database.yml
database.yml is there and seems to be ok. Do I need to create a special table for the feeds ?
Stefan
Stefan, yes, you need to set up the cache for FeedTools to use the FeedUpdater. Use the migration file supplied with FeedTools in the db directory to create the table.
I use feedtools-0.2.24, but I get the following errors:
undefined method `configurations' for FeedTools:Module
Application Trace | Framework Trace | Full Trace
C:/ruby/lib/ruby/gems/1.8/gems/feedtools-0.2.24/lib/feed_tools/feed.rb:63:in `open'
#{RAILS_ROOT}/app/controllers/rss_controller.rb:7:in `index'
-e:4:in `load'
-e:4
why?
zidoing: That should never happen. The only way I can imagine that it might occur is if you did require 'feed_tools/feed' but failed to do require 'feed_tools'. You may want to make sure that all of your require lines completed successfully and that no LoadError exceptions were thrown.
I’m not sure if I understand exactly how FeedUpdater works in conjunction with FeedTools.
Am I right in assuming that FeedUpdater checks for feed updates, and then, if there’s an update, loads the new version in the cache, and when a web page that uses FeedTools loads, it pulls the feed information from the cache?
I’ve noticed if a feed has a new entry added, cached versions in the database won’t update until someone loads the page using FeedTools in their browsers. Is this how it should work?
Kevin:
Close, but not exactly.
FeedUpdater is a daemon that is designed to remove all of the feed parsing work from the request/response part of your application. FeedUpdater allows you to define a custom updater that gives your application hooks so that you are able to copy the parsed fields that you need into separate database tables. You may have noticed that FeedTools doesn’t store very much parsed information in its cache. This is because the intent has been for the applications to copy the parsed data into their own tables. FeedUpdater merely makes it much easier to do that outside of the request/response part of your application so that users never see a delay caused by feed retrieval and parsing.
Been using this, had some code changes to offer to get the config to load, and use rails app model objects in your custom script…but comments won’t allow – so posted them to my blog:
http://kookster.blogspot.com/2006/09/feedtools-and-feedupdater-fun.html
AndrewK:
That’s some decently good advice for end-users, though it would work just as well to put the require line at the top of the custom updater file. It doesn’t have to be in on_begin.
I am a little confused about why it was necessary to navigate up the directory tree from within the vendor installation, since the executable script that contains the logic for finding the config file should be in the script directory. I added something similar to what you had anyways, just in case though.
Gimme a little credit – I tried putting it at the start of the custom updater first, but then feedtools didn’t work – for example, it no longer found the db cache table. Couldn’t figure out why; figured it must have to do with the order of the require calls, or changes to the ruby path.
Yes, the feed_updater script is in the app scripts directory, but the feed_updater.rb file is down in the vendor directory, so when it goes on the ‘hunt’ for a relative path, or uses __FILE__, it was looking in that directory, so I had to go up a few extra directories. Perhaps it behaves differently in another os/ruby version? I am using 1.8.4 on mac os.
To add a little detail to the problem with the file loading – what I found was that the FileStalker defined on line 35 of feed_updater does find files relative to the scripts dir. But then line 60 hits, and the feed_updater module is loaded, and the FileStalker used to find the feed_updater.yml is no longer the FileStalker class defined in the feed_updater script, it has been redefined to be the FileStalker from the vendor/feed_updater/lib/feed_updater.rb, and so now files found are relative to that file. As far as I can tell, that was what happened.
AndrewK:
Oh, wow, yeah, you’re right… Good call. Easy fix though.
AndrewK:
As for why it lost the database.yml file, check line 353 of feed_updater.rb.
I’m relatively new to Rails and I’m not sure exactly how to get FeedUpdater working. If I want to simply display a feed on a web page, what code do I need to put in the custom updater, and then in the controller for my web page, for this to work properly? Thanks in advance.
Ryan:
Depends on what you’re trying to accomplish. It won’t be as simple as copy-paste, and probably shouldn’t be something you try right off the bat if you’re completely new to Ruby and Rails. In fact, if you’re still just learning Ruby, I’d cut FeedUpdater out of the mix until you know what you’re doing.
Displaying a feed in Rails using only FeedTools was covered by the tutorial entry I wrote awhile back. Just search the site for “Tutorial”.
If you’re sure you really do need FeedUpdater though, you should create a new model for storing the fields from the feeds you’re loading. Then in your custom updater, you should require the Rails boot file and the relevant model files. The on_update block should then copy the fields of the feed passed into the block into the fields you set up on your model. Then back in the Rails environment, display the model just like you would any other model.
What specifically do you mean by “Update: Now, with multithreading!”? FeedTools has a note “Update: Apparently, threads aren’t so hot when a database-backed cache is involved because it opens up tons of extra connections to the database.”, has that been resolved?
Brent:
FeedTools does not work well in a multithreaded Rails environment because Rails itself doesn’t work well in a multithreaded Rails environment.
FeedUpdater, however, is not a Rails environment. Unlike Rails, FeedUpdater doesn’t work in a request-response cycle, so it only opens as many connections as there are threads. Rails tends to open connections for every thread and then often doesn’t close them when the thread exits, or something to that effect. I don’t remember the exact details. But the effect is that you have 100 open connections to the database. FeedUpdater will only open 5 connections if the number of threads is set to 5.
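To illustrate the one-connection-per-thread idea, here is a minimal sketch of a fixed pool of worker threads that each open a single connection and reuse it for every feed they pull off a shared queue. This is not FeedUpdater's actual implementation; open_connection is a hypothetical stand-in that only counts how many connections were opened, and the feed URLs are made up.

```ruby
require 'thread'

thread_count = 5
hrefs = (1..20).map { |i| "http://example.com/feed#{i}.xml" }

# Hypothetical stand-in for opening a database connection. It only
# counts how many connections were opened.
connections_opened = 0
lock = Mutex.new
open_connection = lambda do
  lock.synchronize { connections_opened += 1 }
  Object.new
end

queue = Queue.new
hrefs.each { |href| queue << href }
processed = Queue.new

workers = Array.new(thread_count) do
  Thread.new do
    connection = open_connection.call # opened once, reused for every feed
    loop do
      begin
        href = queue.pop(true) # non-blocking pop raises ThreadError when empty
      rescue ThreadError
        break
      end
      processed << [connection.object_id, href]
    end
  end
end
workers.each(&:join)

puts "connections opened: #{connections_opened}"
puts "feeds processed: #{processed.size}"
```

The number of connections is bounded by the number of workers rather than by the number of feeds, which is the property described above.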
From the on_update() function in the custom updater script, is it possible to access the feed’s ID record?
My custom updater loads all feeds from feeds table. On update, it creates or updates the feed_items table. However, I need to assign feed_items.feed_id to the feed table’s ID.
Thanks!
Jason:
I can’t think of any reason that wouldn’t be possible. The on_update block gives you a reference to the updated feed, and you can get the id from there and assign it to whatever you need.
Thanks Bob for the reply!
In my custom updater script, on_begin() sets self.feed_href_list with an array of URLs. However I think I need to set the feed’s IDs here somehow so that the on_update() hook can know which feed is being updated. By ID, I mean the feed’s primary key ID record, not the ID returned in the returned XML feed.
My structure is as follows:
feeds – has_many :feed_items
feed_items – belongs_to :feeds
cached_feeds
Hope this makes sense, and I can supply some actual code examples if that will help. Thanks so much!
Jason:
If you have the hrefs stored in your feeds table (which you should), you can just do a find_by_href to get the right row.
Thanks Bob. With the additional Feed.find_by_href(), I was able to get the record I needed.
However, my next hurdle is figuring out how to keep the list of feed URLs current. (feed_href_list)
When starting the daemon, it loads them all properly and holds the list in memory. But—if this list changes (such as a user adding a new feed URL), this new URL never gets polled by the daemon until it’s restarted again. Is there a way to hook into FeedUpdater so that I can reload feed_href_list each time it polls, to keep this list synchronized with the db. (I’m obviously storing my Feed URLs in a database that will be updated by users)
Thanks again.
Jason:
on_begin is run every time an update goes. Just update the list there.
I have a question similar to Jason’s but with a slightly different slant.
1) on_begin loads self.feed_href_list with all the feed HREFs from my feed table
2) on_update
- finds my feed by HREF
- creates feed_items in my table
However, it looks like if FeedTools follows a redirect for the feed, then the feed_tool.href is updated to the redirected HREF and doesn’t match the one I passed it, so I can’t do a find_by_href on my table. Any suggestions?
Brent:
Um, that’s a good point. Given that, I’d recommend a) making sure you retrieve the feed at least once before putting the URI of the feed into your database so that the actual URI is resolved ahead of time and b) don’t key off the URI if you can help it since obviously, it’s subject to redirections and changes. The feed’s id, however, should never change. At least in theory. In practice, you’re going to have to key off both of them. Search for the URI first, and if you can’t find it, then search for the id.
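The "search for the URI first, then the id" fallback could be sketched roughly like this. The in-memory rows array, the feed_row struct and the find_feed_row helper are all hypothetical stand-ins for a real feeds table; in a Rails app the first lookup would be the find_by_href mentioned earlier, and the second would be a similar lookup on whatever column holds the feed's own id.

```ruby
# Hypothetical in-memory stand-in for a feeds table. Each row keeps the
# href it was stored under and the id taken from the feed itself.
feed_row = Struct.new(:pk, :href, :feed_id)
rows = [
  feed_row.new(1, "http://example.com/old-feed.xml", "tag:example.com,2006:feed"),
  feed_row.new(2, "http://example.org/atom.xml", "tag:example.org,2006:feed")
]

# Look up by href first. If a redirect changed the href, fall back to
# the id carried by the feed.
def find_feed_row(rows, href, feed_id)
  rows.find { |row| row.href == href } ||
    rows.find { |row| row.feed_id == feed_id }
end

# The feed was stored under old-feed.xml, but a redirect changed its href.
match = find_feed_row(rows, "http://example.com/new-feed.xml",
                      "tag:example.com,2006:feed")
puts match.pk
```

Once the row is found through the id, it may be worth updating the stored href so the next lookup hits on the first try.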
Got another issue. Is this the right place to post these? I’m basically running the generic custom_updater.rb and FeedUpdater doesn’t seem to reliably timeout for unreachable URLs. Some timeout properly but custom_updater just hung indefinitely this morning on http://rss.csmonitor.com/feeds/books when it seemed to be briefly unreachable. Is there a way to set another timeout regardless of the connection error? Or do you have any debugging suggestions?
Brent:
I don’t see an obvious solution for dealing with this issue. I could add another timeout block around the open method… but technically, it already has one around the HTTP request itself, so I don’t think that’d help. Are you sure it’s not a fluke?
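For reference, an outer timeout of the kind being discussed could look something like the sketch below, using Ruby's standard timeout library. The fetch_with_deadline helper is a hypothetical name, and the sleep call only simulates a slow fetch; this is not code from FeedTools or FeedUpdater.

```ruby
require 'timeout'

# Hypothetical wrapper: run any block with a hard upper bound on
# wall-clock time, returning nil instead of hanging.
def fetch_with_deadline(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  nil
end

fast = fetch_with_deadline(1) { :ok }
slow = fetch_with_deadline(0.1) { sleep 2; :never_reached }
puts fast.inspect
puts slow.inspect
```

A wrapper like this may not be able to interrupt every kind of blocking call, so it is best treated as a safety net rather than a guarantee.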
Bob, I did some further research and actually my host is killing the batch process. Can anyone suggest a reasonably priced Rails host that allows long-running procs such as this?
Brent:
That doesn’t surprise me at all. However, if you’re paying less than $40 a month, you’re not likely to find one because of the adverse effect a system like this can have on your neighbors.
Bob,
In post #24 above, you said,
”. . . you should create a new model for storing the fields from the feeds you’re loading. Then in your custom updater, you should require the Rails boot file and the relevant model files.”
Could you be more specific about how to do this? I’m getting an error ‘uninitialized constant CustomUpdater::FeedThing’ in the feed_updater log file. The FeedThing refers to the model/database table I am attempting to save the parsed feed to. I believe the correct ‘require’ code will fix my problem.
Robert:
Yeah, everyone seems to run into that. I should probably make this easier.
I believe this should do the trick:
Bob, Thanks for the ‘require’ code above. I really appreciate the help. Based on the above I was able to get feed_updater working.
However, in my case I was unable to use the line:
require 'config/environment'
Instead I used this line in its place:
require 'config/boot'
I am placing these ‘require’ lines above the ‘class CustomUpdater < . . .’ line.
If I used the ‘environment’ line above, feed_updater would give the following error:
FeedUpdater Using environment: development
FeedUpdater Not connected to the feed cache.
FeedUpdater The FeedTools cache table is missing.
FeedUpdater Daemon stopped.
My Ruby on Rails is not that great yet, so I am unsure why I was not able to use the ‘environment’ line.
Also, to help anyone else who might be having trouble getting this up and running. For the longest time I was trying to start feed_updater like I started a mongrel daemon. Don’t do that. Make sure you run it with this command from your application directory:
ruby script/feed_updater start
Again, Thanks. And thanks for FeedTools.
Hi,
I’m having trouble installing feed_updater in my app directory. I get the following error:
steve-odoms-computer:~/Development/blog steveodom$ feed_updater install
/usr/local/lib/ruby/gems/1.8/gems/feedupdater-0.2.5/lib/feed_updater.rb:60: undefined method `require_gem' for main:Object (NoMethodError)
from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `gem_original_require'
from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `require'
from /usr/local/lib/ruby/gems/1.8/gems/feedupdater-0.2.5/bin/feed_updater:60
from /usr/local/bin/feed_updater:19:in `load'
from /usr/local/bin/feed_updater:19
I looked at line 60 of feed_updater.rb and that line is referencing the daemons gem. My version of that gem is 1.0.10. FeedTools is working fine.
any suggestions?
Thanks.