
Tales from the Gryphon

Archives for 2007/08


Arch Hook

Posted early Tuesday morning, August 21st, 2007

All the version control systems I am familiar with run scripts on checkout and commit to take additional site-specific actions, and arch is no different. Well, actually, arch is perhaps different in the sense that it runs a script on almost all actions, namely the ~/.arch-params/hook script. Enough information is passed in to make this mechanism one of the most flexible I have had the pleasure to work with.

In my hook script, I do the following things:

  • On a commit, or an initial import
    • For my publicly replicated repositories (and only for my public repositories), the script creates a cached full source tree in the repository for every 20th commit. This can speed up later calls to get for that revision and the ones that follow, since users do not have to fetch the base version and every intervening patch.
    • For the public repositories, the script removes older cached versions, keeping two cached versions in place. I assume there is not much demand for versions more than 40 patches out of date, so having to download a few extra patches in that uncommon case is not a big issue.
    • If it is an ikiwiki commit, the script makes sure that it updates the checked out sources of the wiki on the webserver, and rebuilds the wiki.
    • If this is a commit to an archive for which I have a corresponding -MIRROR defined, the script updates the mirror now, and logs an informational message to the screen.
    • There is special handling for my Debian packages.
      * If the category matches one of my packages, the script
        looks to see if any bugs have been closed in this commit,
        and, if so, sends the log to the bug, and tags it fixed.
      * If the category being checked in is one that corresponds
        to one of my Debian packages, or to the `./debian`
        directory that belongs to one of my packages, then the
        script sends a cleaned-up change log by mail to
        *packages.qa.debian.org*. People can subscribe to the
        mailing list set up for each package to get commit logs,
        if they so desire.
      * Arch has the concept of a grab file: people can get all
        the components of a software package by just feeding arch
        the grab file (either a local file or an http URL). The
        script makes sure that an arch config file is created, as
        well as a grab file (using the script
        [arch\_create\_config](/software/misc/arch_create_config.html)),
        and uploads the grab file (using the script
        [arch\_upload\_grab](/software/misc/arch_upload_grab.html))
        to the public location mentioned in `./debian/control` for
        all my packages.
      * For commits to the Debian policy package, the script also
        sends mail to the policy list with full commit logs. This
        is a group-maintained package, so changes to it are
        disseminated a little more widely.
      * Whenever a new category, branch, or version is added to
        the repository corresponding to the Debian policy package,
        the script sends mail to the policy list. Again, changes
        to the Policy repository are fairly visible.
      
    • The script sends me mail, for archival purposes, whenever a new category or branch is created in any of my repositories (but not for every revision).
    • Additional action is taken to ensure that versions are cached in the local revision library. I am no longer sure if this is strictly needed.
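
For anyone who has not written one of these hooks before, a rough sketch of the overall dispatch may be useful. The fragment below is Python rather than my actual hook, and the details it leans on (the action name arriving as the first argument, revision information arriving in ARCH_* environment variables, the "public" naming test, and the ikiwiki paths) are assumptions from memory rather than something to copy verbatim:

    #!/usr/bin/python
    # A rough, hypothetical sketch of an ~/.arch-params/hook dispatcher,
    # not my actual script.  I recall tla passing the action name as the
    # first argument and describing the revision in ARCH_* environment
    # variables; verify the names against your tla before relying on them.
    import os
    import subprocess
    import sys

    def run(*cmd, **kwargs):
        """Run an external command, letting failures propagate noisily."""
        subprocess.check_call(list(cmd), **kwargs)

    def main():
        action   = sys.argv[1] if len(sys.argv) > 1 else ""
        archive  = os.environ.get("ARCH_ARCHIVE", "")
        category = os.environ.get("ARCH_CATEGORY", "")
        version  = os.environ.get("ARCH_VERSION", "")   # category--branch--N.N
        revision = os.environ.get("ARCH_REVISION", "")  # e.g. patch-40

        if action in ("commit", "import"):
            # Cache a full source tree every 20th commit, but only in the
            # archives I replicate publicly (the naming test is made up).
            if "public" in archive and revision.startswith("patch-"):
                if int(revision.split("-", 1)[1]) % 20 == 0:
                    run("tla", "cacherev",
                        "%s/%s--%s" % (archive, version, revision))

            # If a corresponding -MIRROR archive is registered, push to it.
            mirrors = subprocess.check_output(["tla", "archives"]).decode()
            if archive + "-MIRROR" in mirrors:
                print("Updating mirror for %s" % archive)
                run("tla", "archive-mirror", archive)

            # ikiwiki commits: refresh the checked out sources and rebuild
            # (both paths below are illustrative).
            if category == "ikiwiki":
                run("tla", "update", cwd="/srv/www/wiki")
                run("ikiwiki", "--setup",
                    os.path.expanduser("~/ikiwiki.setup"))

        sys.exit(0)

    if __name__ == "__main__":
        main()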

I'd be happy to hear about what other people add to their commit scripts, to see if I have missed out on anything.

Manoj


Mail Filtering with CRM114: Part 4

Posted early Monday morning, August 20th, 2007

Training the Discriminators

It has been a while since I posted in this category -- actually, it has been a long while since my last blog post. When I last left you, I had mail (mbox format) folders called ham and/or junk, which were ready to be used for training either CRM114 or Spamassassin or both.

Setting up Spamassassin

This post lays the groundwork for the training, and details how things are set up. The first part is setting up Spamassassin. One of the things that bothered me about the default settings for Spamassassin was how swiftly Bayes information was expired; indeed, it seems really eager to dump the Bayes information (don't they trust their engine?). I have spent some effort building a large corpus and keeping it clean, but Spamassassin would discard most of the information from the DB after training over my corpus, and the decrease in accuracy was palpable. To prevent this information from leaching away, I increased the maximum size of the database and turned off automatic expiration, by putting the following lines into ~/.spamassassin/user_prefs:

    bayes_expiry_max_db_size  4000000
    bayes_auto_expire         0

I also keep regularly updated spam rules from the SpamAssassin Rules Emporium to improve the effectiveness of the rule set; my current user_prefs is available as an example.

Initial training

I keep my Spam/Ham corpus under the directory /backup/classify/Done, in the subdirectories Ham and Spam. At the time of writing, I have approximately 20,000 mails in each of these subdirectories, for a total of 41,000+ emails.

I have created a couple of scripts to train the discriminators from scratch using the extant Spam corpus; these scripts are also used for re-learning, for instance, when I moved from a 32-bit machine to a 64-bit one, or when I change CRM114 discriminators. I generally run them from the ~/.spamassassin/ and ~/var/lib/crm114 (which contains my CRM114 setup) directories.

I have found that training Spamassassin works best if you alternate Spam and Ham message chunks; and this Spamassassin learning script delivers chunks of 50 messages for learning.
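
The idea is simple enough to sketch. The fragment below is Python rather than my actual script, with the corpus location hard-coded and the chunk size set to 50; only the sa-learn --ham and sa-learn --spam invocations are standard Spamassassin usage, the rest is illustrative:

    #!/usr/bin/python
    # Illustrative sketch of alternating ham/spam training for Spamassassin,
    # fifty messages at a time; the paths and chunk size follow the setup
    # described above, everything else is made up.
    import os
    import subprocess
    from itertools import zip_longest

    CORPUS = "/backup/classify/Done"
    CHUNK = 50

    def messages(kind):
        """Return full paths of the one-message-per-file corpus entries."""
        d = os.path.join(CORPUS, kind)
        return [os.path.join(d, f) for f in sorted(os.listdir(d))]

    def chunks(seq, n):
        for i in range(0, len(seq), n):
            yield seq[i:i + n]

    def main():
        ham  = list(chunks(messages("Ham"), CHUNK))
        spam = list(chunks(messages("Spam"), CHUNK))
        # Alternate the chunks so the Bayes DB never sees a long one-sided run.
        for ham_chunk, spam_chunk in zip_longest(ham, spam, fillvalue=[]):
            if spam_chunk:
                subprocess.check_call(["sa-learn", "--spam"] + spam_chunk)
            if ham_chunk:
                subprocess.check_call(["sa-learn", "--ham"] + ham_chunk)

    if __name__ == "__main__":
        main()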

With CRM114, I have discovered that it is not a good idea to stop learning based simply on the number of times the corpus has been gone over, since stopping before all messages in the corpus are correctly handled is also disastrous. So I set the repeat count to a ridiculously high number, and tell mailtrainer to continue training until a streak larger than the sum of Spam and Ham messages has occurred. This CRM114 trainer script does the job nicely; running it under screen is highly recommended.
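
For the curious, the gist of that trainer script looks something like the following sketch; the mailtrainer.crm option spellings (--spam, --good, --repeat, --streak) are how I remember them, so check them against the CRM114 documentation rather than trusting this:

    #!/usr/bin/python
    # Sketch of "train until the streak exceeds the corpus size".  The
    # mailtrainer.crm options used here are from memory and may need
    # adjusting; the corpus and working directory match the setup above.
    import os
    import subprocess

    CORPUS = "/backup/classify/Done"
    CRMDIR = os.path.expanduser("~/var/lib/crm114")

    def count(kind):
        return len(os.listdir(os.path.join(CORPUS, kind)))

    def main():
        total = count("Ham") + count("Spam")
        subprocess.check_call(
            ["crm", "mailtrainer.crm",
             "--spam=%s/Spam/" % CORPUS,
             "--good=%s/Ham/" % CORPUS,
             "--repeat=1000",               # absurdly high; never the stop condition
             "--streak=%d" % (total + 1)],  # stop only after a full clean pass
            cwd=CRMDIR)

    if __name__ == "__main__":
        main()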

Routine updates

Coming back to where we left off, we had mail (mbox format) folders called ham and/or junk sitting in the local mail delivery directory, which were ready to be used for training either CRM114 or Spamassassin or both.

There are two scripts that help me automate the training. The first script, called mail-process, does most of the heavy lifting. It processes a bunch of mail folders, which are supposed to contain mail that is either all ham or all spam, as indicated by the command line arguments. The script looks through every mail, and for any mail where either the CRM114 or the Spamassassin judgement was not what we expected, it strips out the mail-gathering headers, saves the mail, one message per file, and trains the appropriate filter. This ensures that we only train on error, and it does not matter if we accidentally try to train on correctly classified mail, since that would be a no-op (apart from increasing the size of the corpus).
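
Condensed into a sketch, the train-on-error loop looks roughly like this; the header names it inspects (X-Spam-Status, X-CRM114-Status), the sa-learn and mailreaver.crm training invocations, and the corpus layout are assumptions about a fairly typical Spamassassin plus CRM114 setup, not a transcript of mail-process itself:

    #!/usr/bin/python
    # Condensed sketch of the train-on-error idea behind mail-process.
    # The header names and training commands below are assumptions about
    # a typical Spamassassin + CRM114 setup, not a copy of my script.
    import mailbox
    import subprocess
    import sys
    import time

    CORPUS = "/backup/classify/Done"

    def verdicts(msg):
        """Return (sa_said_spam, crm_said_spam) from delivery-time headers."""
        sa  = msg.get("X-Spam-Status", "").lower().startswith("yes")
        crm = "spam" in msg.get("X-CRM114-Status", "").lower()
        return sa, crm

    def strip_filter_headers(msg):
        for header in list(msg.keys()):
            if header.startswith(("X-Spam-", "X-CRM114-")):
                del msg[header]
        return msg

    def process(folder, expect_spam):
        kind = "Spam" if expect_spam else "Ham"
        for index, msg in enumerate(mailbox.mbox(folder)):
            sa, crm = verdicts(msg)
            if sa == expect_spam and crm == expect_spam:
                continue                     # both filters got it right; no-op
            msg = strip_filter_headers(msg)
            path = "%s/%s/%d-%d" % (CORPUS, kind, int(time.time()), index)
            with open(path, "wb") as f:      # save one message per file
                f.write(msg.as_bytes())
            if sa != expect_spam:            # train only the filter that erred
                subprocess.run(["sa-learn",
                                "--spam" if expect_spam else "--ham", path])
            if crm != expect_spam:           # flag names assumed, check yours
                subprocess.run(["crm", "mailreaver.crm",
                                "--learnspam" if expect_spam
                                else "--learnnonspam"],
                               input=msg.as_bytes())

    if __name__ == "__main__":
        process(sys.argv[1], expect_spam=(sys.argv[2] == "spam"))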

The second script, called mproc, is a convenience front end; it takes no arguments itself, and just calls mail-process with the proper command line arguments, feeding it the ham and junk folders in sequence. So, after human classification, just calling mproc takes care of the training.
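
Under the same assumptions as the previous sketch, the wrapper amounts to little more than two fixed calls:

    #!/usr/bin/python
    # mproc-style front end: no arguments, just two fixed calls.  The
    # folder locations and the mail-process calling convention follow the
    # sketch above, not the real scripts.
    import os
    import subprocess

    MAILDIR = os.path.expanduser("~/Mail")

    def main():
        subprocess.check_call(["mail-process",
                               os.path.join(MAILDIR, "ham"), "ham"])
        subprocess.check_call(["mail-process",
                               os.path.join(MAILDIR, "junk"), "spam"])

    if __name__ == "__main__":
        main()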

This pretty much finishes the series of posts I had in mind about spam filtering; I hope it has been useful.

Manoj

