plagg, an RSS aggregator

0. What is this?

plagg is a weblog/news aggregator that works in conjunction with Rael Dornfest's blosxom. It can be easily extended to support other blogging tools.

plagg reads an OPML file containing a list of RSS or Atom feeds and generates blosxom blog entries from these feeds. The items of each feed are written to their own directory (that is, their own blosxom category), which lets you read the news either all at once or per feed.

You can see examples of plagg's output on my news page.

1. Installation

  1. Download plagg
  2. Untar the distribution file to a directory of your choice
  3. Run python setup.py install as root
  4. Set up an OPML file containing the feeds you'd like to read
  5. Run plagg -d newsdir opmlfile as often as you like from a cron job, where newsdir is somewhere within your blosxom data directory
  6. Enjoy your personalized news feed!

2. Usage

2.1. Synopsis

plagg -fFnvVh [-d newsdir] [opmlfile [nickname ...]]

2.2. Options

2.3. Arguments

The default arguments for opmlfile and destdir can be set in the plagg script.

3. The OPML file

The distribution contains my OPML file as an example.

The basic OPML syntax is defined in the OPML specification.

3.1. RSS/Atom feeds

Set the type attribute to "rss". This is the default feed type. Plagg reads the feed given by the xmlUrl attribute and generates news items from its content.

Example:

<outline text="Linux Weekly News" nick="lwn" type="rss"
    htmlUrl="http://lwn.net/" xmlUrl="http://lwn.net/headlines/rss"
/>

The htmlUrl attribute is not used by plagg itself, but by opml.xsl, which I use to generate my blogroll.
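To check which feeds an OPML file declares, the outline elements can be listed with a few lines of Python. This is a minimal sketch using only the standard library, not plagg's own parser; the second outline (example.org) is a hypothetical feed added to show the defaults described in this document: type defaults to "rss" and nick defaults to the lowercase text attribute.

```python
import xml.etree.ElementTree as ET

# Sample OPML; the second outline is hypothetical and omits nick and type.
OPML = """<opml version="1.1">
  <head><title>Example feeds</title></head>
  <body>
    <outline text="Linux Weekly News" nick="lwn" type="rss"
        htmlUrl="http://lwn.net/" xmlUrl="http://lwn.net/headlines/rss"/>
    <outline text="Example" xmlUrl="http://example.org/feed.xml"/>
  </body>
</opml>"""


def list_feeds(opml_text):
    """Return a (nick, type, xmlUrl) tuple for every outline element."""
    root = ET.fromstring(opml_text)
    feeds = []
    for outline in root.iter("outline"):
        text = outline.get("text", "")
        # Documented defaults: lowercase text as nick, "rss" as type.
        nick = outline.get("nick") or text.lower()
        feed_type = outline.get("type", "rss")
        feeds.append((nick, feed_type, outline.get("xmlUrl")))
    return feeds


feeds = list_feeds(OPML)
for feed in feeds:
    print(feed)
```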

If you need support for Atom 1.0 feeds and have installed a feedparser older than version 4, please apply the patch from http://fucoder.com/wp-content/feedparser/feedparser-atom10.patch.

3.2. HTML scraping

Set the type to "x-plagg-html". In this case, plagg reads the HTML page whose URL is in the htmlUrl attribute. It then uses the regex attribute to extract an item title, a link and, optionally, a body. I use this type to grab a few comics off sites that don't provide an RSS feed.

There are two ways to specify the regex:

In both cases, by the time the regex is matched against the page's HTML source, all relative URLs have already been converted to absolute ones. This means that you can't simply copy a relative URL from the page's HTML source into the regex; write the regex against the absolute URL instead.

Example:

<outline text="Dilbert" type="x-plagg-html" htmlUrl="http://www.dilbert.com/"
    regex="&lt;img src=&quot;(http://www\.dilbert\.com/comics/dilbert/archive/images/dilbert(\d+)\.[gj][ip][fg])&quot;"
    hours="8-10"
/>
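The effect of the URL conversion can be tried out in Python before putting a regex into the OPML file. The following is a sketch under stated assumptions: the page fragment is made up, and absolutize() is a simplified, hypothetical stand-in for whatever plagg does internally. The regex is the unescaped form of the one in the Dilbert example.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "http://www.dilbert.com/"
# Hypothetical page fragment with a relative image URL.
RAW_HTML = '<p><img src="/comics/dilbert/archive/images/dilbert2006123.gif"></p>'

# The regex attribute from the example above, with XML escapes resolved.
REGEX = (r'<img src="(http://www\.dilbert\.com/comics/dilbert/archive/'
         r'images/dilbert(\d+)\.[gj][ip][fg])"')


def absolutize(html, base):
    """Rewrite src/href values to absolute URLs (simplified stand-in)."""
    def fix(match):
        return '%s="%s"' % (match.group(1), urljoin(base, match.group(2)))
    return re.sub(r'(src|href)="([^"]*)"', fix, html)


html = absolutize(RAW_HTML, PAGE_URL)
match = re.search(REGEX, html)
print(match.group(1))  # the absolute image URL
print(match.group(2))  # the digits captured from the file name

# The same regex does not match the raw page source with its relative URL.
assert re.search(REGEX, RAW_HTML) is None
```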

3.3. Computed items

Set type to "x-plagg-computed", and set the commands attribute to the Python commands that should be executed. These commands should set self.itemLink and, optionally, self.itemTitle and self.itemBody (cf. Garfield).

Example:

<outline text="Garfield" type="x-plagg-computed" link="http://garfield.ucomics.com"
    commands="import time&#10;tm = time.gmtime()&#10;self.itemTitle = '%02d%02d%02d' % (tm[0] % 100, tm[1], tm[2])&#10;self.itemLink = 'http://images.ucomics.com/comics/ga/%d/ga%s.gif' % (tm[0], self.itemTitle)"
    hours="8"
/>

Please keep in mind that in the actual OPML file, the linefeeds between commands have to be escaped as &#10;.
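Commands can be tested outside plagg before they go into the OPML file. The sketch below assumes that the commands are executed with self bound to an object carrying the item attributes; the Item class and run_commands() are hypothetical stand-ins, not plagg's code. The last lines show one way to produce the escaped attribute value with the standard library, which writes linefeeds as &#10;.

```python
from xml.sax.saxutils import quoteattr

# The Garfield commands from the example above, one command per line.
COMMANDS = (
    "import time\n"
    "tm = time.gmtime()\n"
    "self.itemTitle = '%02d%02d%02d' % (tm[0] % 100, tm[1], tm[2])\n"
    "self.itemLink = 'http://images.ucomics.com/comics/ga/%d/ga%s.gif'"
    " % (tm[0], self.itemTitle)"
)


class Item:
    """Hypothetical stand-in for the object plagg binds to self."""
    itemTitle = None
    itemLink = None
    itemBody = None


def run_commands(item, commands):
    """Execute the commands with `self` bound to the given item."""
    exec(commands, {"self": item})
    return item


item = run_commands(Item(), COMMANDS)
print(item.itemTitle)
print(item.itemLink)

# quoteattr() escapes linefeeds as &#10; and adds the surrounding quotes.
attr = quoteattr(COMMANDS)
print("commands=" + attr)
```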

3.4. Saving scraped bits

This feature is available for HTML-scraped and computed items.

Using the savePath and saveUrl attributes, it is possible to save whatever the link points to. savePath names the directory in the local file system where the file should be saved; saveUrl defines the URL that replaces the original URL in the generated item (cf. Userfriendly).

Example: savepath="/home/myself/www/news/uf" saveurl="/news/uf"

If necessary, you can define the referrer attribute which will be passed in the HTTP request. The default referrer is either the link attribute, or, if empty, the item link itself.
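The relationship between the two attributes can be pictured as follows. This is only a sketch: it assumes that the saved file keeps the last path component of the original link, which this document does not state, and both function names and the example.org link are hypothetical. The referrer default follows the rule given above.

```python
import posixpath
from urllib.parse import urlsplit


def saved_locations(link, save_path, save_url):
    """Return (local file path, replacement URL) for a scraped link.

    Assumption: the file name is the last path component of the link.
    """
    filename = posixpath.basename(urlsplit(link).path)
    return (posixpath.join(save_path, filename),
            posixpath.join(save_url, filename))


def default_referrer(link_attr, item_link):
    """Use the outline's link attribute, or the item link if it is empty."""
    return link_attr or item_link


# Hypothetical item link; the paths are the ones from the example above.
link = "http://example.org/cartoons/uf009999.gif"
local_path, new_url = saved_locations(
    link, "/home/myself/www/news/uf", "/news/uf")
print(local_path)
print(new_url)
print(default_referrer("", link))
```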

3.5. OPML extensions

I have extended the <outline> element with four attributes and an optional, repeatable child element.

3.5.1. Time restrictions

This is a new attribute of <outline>.

hours="string"

Defines a set of hours of the day. The feed is read only during these hours. Values are in 24-hour format relative to UTC. Ranges may be given as from-to; separate simple values or ranges with a comma. The range includes both from and to values.

Example: hours="8-10,15,22"
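The hours syntax is simple enough to check by hand, but a short sketch makes the inclusive ranges and the UTC comparison explicit. The function names parse_hours() and is_due() are hypothetical and not taken from plagg.

```python
import time


def parse_hours(spec):
    """Expand a spec such as "8-10,15,22" into a set of hours."""
    hours = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            # Ranges include both the from and the to value.
            hours.update(range(int(start), int(end) + 1))
        else:
            hours.add(int(part))
    return hours


def is_due(spec, now=None):
    """Return True if the UTC hour of `now` (default: current time) is listed."""
    return time.gmtime(now).tm_hour in parse_hours(spec)


print(sorted(parse_hours("8-10,15,22")))
print(is_due("8-10,15,22", 9 * 3600))   # 09:00 UTC on 1970-01-01
print(is_due("8-10,15,22", 11 * 3600))  # 11:00 UTC on 1970-01-01
```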

3.5.2. Overriding a feed's directory name

This is a new attribute of <outline>.

nick="string"

Sets the "nickname" of a feed. The nickname is used as directory name under newsdir and when selectively updating with the nickname command line argument. The default nickname of an outline element is the lowercase text attribute.

3.5.3. Replacements: <replace>

This is a new, repeatable child element of <outline>.

<replace what="body" from="regex" to="string"/>

Defines a replacement inside an item's element. Use this to remove ads from an item, for example (cf. Engadget). The allowed values for what are "body", "link" and "title".

Example:

<replace what="body" from="(?s)&lt;div&gt;&lt;span&gt;.*?&lt;/span&gt;.*?&lt;/div&gt;" to=""/>

This deletes every <div> that immediately begins with a <span>.

The to attribute is optional. If omitted, the text matched by the from regex is deleted.

Please keep in mind that in the actual OPML file, the "less than", "greater than" and "quote" signs in attribute values have to be escaped as &lt;, &gt; and &quot;, respectively.
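A replacement can be tried out on a sample body before it is added to the OPML file. The sketch below assumes that a replacement behaves like Python's re.sub(), with an omitted to attribute treated as the empty string; apply_replace() is a hypothetical helper and the sample body is made up. The last lines produce the escaped form of the regex for use as an attribute value.

```python
import re
from xml.sax.saxutils import escape


def apply_replace(text, from_regex, to=""):
    """Replace every match of from_regex in text; delete it if to is empty."""
    return re.sub(from_regex, to, text)


# The from regex of the example above, with XML escapes resolved.
FROM = r"(?s)<div><span>.*?</span>.*?</div>"

# Made-up item body: an ad block followed by a regular <div>.
body = ("<p>News text.</p>"
        "<div><span>Ad</span>\nBuy now!</div>"
        "<div>Keep me.</div>")

cleaned = apply_replace(body, FROM)
print(cleaned)

# escape() turns < and > into &lt; and &gt; for the OPML attribute.
escaped = escape(FROM)
print(escaped)
```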

3.5.4. Ignoring the entry date

This is a new attribute of <outline>.

ignoredate="yes"

If "yes", the entry's date is ignored and the current time is used instead. This is useful for feeds that are published with delays of more than one hour.

3.5.5. Tidying the feed HTML

This is a new attribute of <outline>.

tidy="no"

The default value is "yes". If "yes", the entry body is run through tidy, an external tool that cleans up HTML. Tidy must be installed in /usr/bin.

3.6. Rendering the OPML file as XHTML

The distribution tar file contains the XSL style sheet opml.xsl, which transforms an OPML file into XHTML. Place your OPML file and the style sheet (or symbolic links to them) in a directory that your HTTP server can access, and you can display a properly indented view of your OPML file just by entering its URL in your browser. Modern browsers understand enough XML to apply the XSL file before displaying the page.

If you want to adjust the resulting XHTML, you have to adjust the location and name of your CSS style sheet in the XSL file, as well as the CSS class names of the generated <div> tags.

4. Changelog

5. TODO

6. Author

Beat Bolli <me+plagg@drbeat.li>, http://drbeat.li/py/plagg