Sporkmonger

purveyor of fabulously ambiguous eating utensils

GentleCMS Development Log: Part 3

Posted by sporkmonger
Written July 8th, 2006

The extract method is basically done. I’m sure it could be improved further, but it seems fairly effective. I added a few features beyond the original URI class’s capabilities, such as supplying a base URI to resolve relative URIs against. It can also return the parsed URIs instead of strings, at no extra processing cost, since it has to parse each URI internally anyway. I tried it out on Sam Ruby’s feed (as you may have noticed, currently my favorite chunk of text to try just about everything out on) and it seems to have gone OK:


>> GentleCMS::URI.extract(text,
  :base => "http://www.intertwingly.net/blog/index.atom")
=> ["http://www.w3.org/2005/Atom",
 "http://purl.org/syndication/thread/1.0",
 "http://www.intertwingly.net/blog/index.atom",
 "http://www.intertwingly.net/blog/index.atom",
 "tag:intertwingly.net,2004:2340",
 "http://www.w3.org/1999/xhtml",
 "http://www.tbray.org/ongoing/When/200x/2006/07/07/With-Bloglines-to-Atom",
 "http://www.w3.org/1999/xhtml",
 "http://www.tbray.org/ongoing/When/200x/2006/07/07/With-Bloglines-to-Atom",
 "http://www.w3.org/2000/svg",
 "http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.link",
 "http://www.atomenabled.org/developers/syndication/atom-format-spec.php#rfc.section.3.1.1",
 "http://www.w3.org/TR/2001/REC-xmlbase-20010627/",
 "http://www.bloglines.com/preview?siteid=235142",
 "http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.author",
 "http://www.bloglines.com/preview?siteid=235142",
 "http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.source",
 "http://www.bloglines.com/preview?siteid=5319444",
 "http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.updated",
 "http://www.bloglines.com/preview?siteid=2375595",
 "http://www.bloglines.com/preview?siteid=50",
 "http://www.bloglines.com/preview?siteid=2438392",
 "http://weblog.philringnalda.com/2005/12/18/who-knows-a-title-from-a-hole-in-the-ground",
 "http://www.niallkennedy.com/blog/archives/2006/07/google-sitemaps-2.html",
 "http://www.stephenduncanjr.com/2006/06/atom-10-and-blogger.shtml",
 "tag:intertwingly.net,2004:2339",
 "http://www.w3.org/1999/xhtml",
 "http://www.1060.org/blogxter/entry?publicid=8A0DC194929914711F1C0470FFDB7B73",
 "http://www.intertwingly.net/slides/2005/xmlconf/",
 "http://www.intertwingly.net/slides/2005/etcon/",
 "tag:intertwingly.net,2004:2338",
 "http://www.w3.org/1999/xhtml",
 "http://www.w3.org/2000/svg",
 "http://en.wikipedia.org/wiki/Penrose_tiling",
 "http://intertwingly.net/stories/2006/07/06/penroseTiling.svg",
 "tag:intertwingly.net,2004:2337",
 "http://www.w3.org/1999/xhtml",
 "http://www.w3.org/2000/svg",
 "http://www.unto.net/unto/work/on-rss-and-atom/",
 "http://www.unto.net/unto/opensearch/more-on-rss-and-atom/",
 "tag:intertwingly.net,2004:2336",
 "http://www.w3.org/1999/xhtml",
 "http://intertwingly.net/stories/2006/07/04/clean_utf8_for_xml.c",
 "http://www.intertwingly.net/blog/",
 "http://www.intertwingly.net/blog/2006/07/08/Bloglines-Edge-Cases",
 "http://www.intertwingly.net/blog/2340.atom",
 "http://www.intertwingly.net/blog/2006/07/06/Blame-Somebody",
 "http://www.intertwingly.net/blog/2339.atom",
 "http://www.intertwingly.net/blog/2006/07/06/Penrose-Tiling",
 "http://www.intertwingly.net/blog/2338.atom",
 "http://www.intertwingly.net/blog/2006/07/04/Just-a-Technical-Detail",
 "http://www.intertwingly.net/blog/2337.atom",
 "http://www.intertwingly.net/blog/2006/07/04/Clean-utf-8-for-XML",
 "http://www.intertwingly.net/blog/2336.atom"]
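
For anyone curious how the base-URI resolution plays out, it can be approximated with nothing but the standard library. This is only a minimal sketch, not the GentleCMS code: `resolve_all` and its `parsed:` flag are names made up for illustration, and the resolving is done by the stdlib’s `URI.join`.

```ruby
require "uri"

# Hypothetical helper (not part of GentleCMS): resolve each reference
# against a base URI. With parsed: true the URI objects are returned
# instead of strings, which costs nothing extra because URI.join has
# to build the object anyway.
def resolve_all(references, base:, parsed: false)
  references.map do |ref|
    uri = URI.join(base, ref) # stdlib relative-reference resolution
    parsed ? uri : uri.to_s
  end
end

base = "http://www.intertwingly.net/blog/index.atom"
relative = [".", "2340.atom", "2006/07/08/Bloglines-Edge-Cases"]

puts resolve_all(relative, base: base)
# "."         resolves to http://www.intertwingly.net/blog/
# "2340.atom" resolves to http://www.intertwingly.net/blog/2340.atom
```

Absolute URIs pass through `URI.join` unchanged, so in a sketch like this only the relative references are affected by the base.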

The original’s output:


>> URI.extract(text)
=> ["http://www.w3.org/2005/Atom",
 "xmlns:thr=",
 "http://purl.org/syndication/thread/1.0",
 "http://www.intertwingly.net/blog/index.atom",
 "http://www.intertwingly.net/blog/index.atom",
 "T20:30:05-04:00",
 "tag:intertwingly.net,2004:2340",
 "thr:count=",
 "thr:when=",
 "T20:30:01-04:00",
 "http://www.w3.org/1999/xhtml",
 "http://www.tbray.org/ongoing/When/200x/2006/07/07/With-Bloglines-to-Atom",
 "http://www.w3.org/1999/xhtml",
 "http://www.tbray.org/ongoing/When/200x/2006/07/07/With-Bloglines-to-Atom",
 "http://www.w3.org/2000/svg",
 "float:right",
 "out:",
 "http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.link",
 "http://www.atomenabled.org/developers/syndication/atom-format-spec.php#rfc.section.3.1.1",
 "http://www.w3.org/TR/2001/REC-xmlbase-20010627/",
 "xml:base",
 "http://www.bloglines.com/preview?siteid=235142",
 "http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.author",
 "http://www.bloglines.com/preview?siteid=235142",
 "http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.source",
 "http://www.bloglines.com/preview?siteid=5319444",
 "http://www.atomenabled.org/developers/syndication/atom-format-spec.php#element.updated",
 "http://www.bloglines.com/preview?siteid=2375595",
 "http://www.bloglines.com/preview?siteid=50",
 "http://www.bloglines.com/preview?siteid=2438392",
 "http://weblog.philringnalda.com/2005/12/18/who-knows-a-title-from-a-hole-in-the-ground",
 "http://www.niallkennedy.com/blog/archives/2006/07/google-sitemaps-2.html",
 "http://www.stephenduncanjr.com/2006/06/atom-10-and-blogger.shtml",
 "T18:06:55-04:00",
 "tag:intertwingly.net,2004:2339",
 "thr:count=",
 "thr:when=",
 "T12:45:00-04:00",
 "http://www.w3.org/1999/xhtml",
 "http://www.1060.org/blogxter/entry?publicid=8A0DC194929914711F1C0470FFDB7B73",
 "http://www.intertwingly.net/slides/2005/xmlconf/",
 "http://www.intertwingly.net/slides/2005/etcon/",
 "T21:07:59-04:00",
 "tag:intertwingly.net,2004:2338",
 "thr:count=",
 "thr:when=",
 "T19:56:01-04:00",
 "http://www.w3.org/1999/xhtml",
 "http://www.w3.org/2000/svg'",
 "float:right",
 "http://en.wikipedia.org/wiki/Penrose_tiling",
 "http://intertwingly.net/stories/2006/07/06/penroseTiling.svg",
 "T17:55:35-04:00",
 "tag:intertwingly.net,2004:2337",
 "thr:count=",
 "thr:when=",
 "T08:45:19-04:00",
 "http://www.w3.org/1999/xhtml",
 "http://www.w3.org/2000/svg",
 "float:right",
 "http://www.unto.net/unto/work/on-rss-and-atom/",
 "http://www.unto.net/unto/opensearch/more-on-rss-and-atom/",
 "T12:15:13-04:00",
 "T21:19:04-04:00",
 "tag:intertwingly.net,2004:2336",
 "thr:count=",
 "thr:when=",
 "T22:27:59-04:00",
 "http://www.w3.org/1999/xhtml",
 "http://intertwingly.net/stories/2006/07/04/clean_utf8_for_xml.c",
 "T08:59:42-04:00"]

Here are the diffs:


>> (uri_result - gentle_uri_result)
=> ["xmlns:thr=",
 "T20:30:05-04:00",
 "thr:count=",
 "thr:when=",
 "T20:30:01-04:00",
 "float:right",
 "out:",
 "xml:base",
 "T18:06:55-04:00",
 "thr:count=",
 "thr:when=",
 "T12:45:00-04:00",
 "T21:07:59-04:00",
 "thr:count=",
 "thr:when=",
 "T19:56:01-04:00",
 "http://www.w3.org/2000/svg'",
 "float:right",
 "T17:55:35-04:00",
 "thr:count=",
 "thr:when=",
 "T08:45:19-04:00",
 "float:right",
 "T12:15:13-04:00",
 "T21:19:04-04:00",
 "thr:count=",
 "thr:when=",
 "T22:27:59-04:00",
 "T08:59:42-04:00"]
>> (gentle_uri_result - uri_result)
=> [".",
 "2006/07/08/Bloglines-Edge-Cases",
 "2340.atom",
 "2006/07/06/Blame-Somebody",
 "2339.atom",
 "2006/07/06/Penrose-Tiling",
 "2338.atom",
 "2006/07/04/Just-a-Technical-Detail",
 "2337.atom",
 "2006/07/04/Clean-utf-8-for-XML",
 "2336.atom"]
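
A quick note on reading those diffs: they are plain array differences, and `Array#-` drops every element that appears in the right-hand array but keeps duplicates from the left-hand one. That is why entries like "thr:count=" show up several times. A tiny sketch, with made-up variable names and a handful of strings borrowed from the output above:

```ruby
# Hypothetical, shortened stand-ins for the two result arrays.
original_result = [
  "http://www.w3.org/2005/Atom",
  "thr:count=",
  "float:right",
  "thr:count="
]
gentle_result = [
  "http://www.w3.org/2005/Atom",
  "2340.atom"
]

only_in_original = original_result - gentle_result
only_in_gentle   = gentle_result - original_result

p only_in_original       # duplicates from the left-hand side survive
p only_in_original.uniq  # collapse them when only distinct entries matter
p only_in_gentle
```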

The extract code was designed to work especially well with SGMLish text and Textile-formatted text. The regular expressions should work just as well with BBCode and Markdown, though I haven’t tested either yet.

I do admit that I totally cheated and threw out basically all of those false positives specifically for this example, but I’ll probably keep expanding the rejection list as time goes on, since it’s a fairly lightweight check. Good enough for my purposes anyhow.
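
To give a rough idea of what a rejection list like that could look like, here is a minimal sketch. It is not the actual GentleCMS list: `REJECT_PATTERNS` and `reject_false_positives` are hypothetical names, and the patterns are simply written to catch the false positives shown in the diff above.

```ruby
# Hypothetical rejection list, tuned only to the false positives in this post.
REJECT_PATTERNS = [
  /\AT\d{2}:\d{2}:\d{2}[-+]\d{2}:\d{2}\z/, # time fragments such as "T20:30:05-04:00"
  /\A[a-z]+:[a-z]*=\z/i,                    # attribute residue such as "thr:count=" or "xmlns:thr="
  /\Afloat:/,                               # inline CSS declarations such as "float:right"
  /\Axml:/,                                 # attribute names such as "xml:base"
  /\A[a-z]+:\z/i                            # a bare "word:" with nothing after it, such as "out:"
].freeze

# Drop every candidate that matches at least one rejection pattern.
def reject_false_positives(candidates)
  candidates.reject { |c| REJECT_PATTERNS.any? { |re| re.match?(c) } }
end

candidates = [
  "http://www.w3.org/2005/Atom",
  "xmlns:thr=",
  "T20:30:05-04:00",
  "thr:count=",
  "float:right",
  "out:",
  "xml:base",
  "tag:intertwingly.net,2004:2340"
]

p reject_false_positives(candidates)
```

Each check is a single anchored regular expression match per candidate, which is what keeps this kind of filter cheap. The trade-off is that a list like this only knows about the junk it has already seen.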
