Tales from the Gryphon

Archives for 2006/12

Sunday 24 December

Posted early Sunday morning, December 24th, 2006

Arch, Ikiwiki, blogging

One of the reasons I have only blogged 21 times in thirty months is the very kludgey work flow I had for blogging: I had to create the file manually, scp it over by hand, and ensure that any ancillary files were in place on the remote machine that serves up my blog.

After moving to ikiwiki, and thus arch, there would be even more overhead, were it not so amenable to scripting. Since this is arch, and creating branches and merging is therefore both easy and natural, I have two sets of branches -- one set for the templates and actual blog content that I serve on my local development box, and a parallel set of branches that I publish. The devel branches are used by ikiwiki on my local box; the remote ikiwiki uses the publish branches. So I can make changes to my heart's content on the devel branches, and then merge into my publish branches. When I commit to the publish branches, the hook functions ensure that there is a fresh checkout of the publish branch on the remote server, and that ikiwiki is run to regenerate the web pages to reflect the new commit.

The hook functions are nice, but not quite enough to make blogging as effortless as it could be. With the move to ikiwiki, and the dissociation of classification and tagging from the file system layout, I have followed the lead of Roland Mas and organized my blog layout by date; posts are put in blog/$year/$month/$escaped_title. The directory hierarchy might not exist for a new year or month. A blog posting may also show up in two different archive indices: the annual archive index for the year, and a monthly index page created for every month I blog in. However, at the time of writing, there is no annual index for the next year (2007), or for the next month (January 2007). These have to be created as required.

All this would get quite tedious, and indeed, would frequently remain undone -- were it not for automation. To make my life easier, I have blogit!, which takes care of the niggling details. When called with the title of the prospective post, this script figures out the date; ensures that the blog directory structure exists, creating path components and adding them to the repository as required; creates a blog entry template; adds the blog entry to the repository; creates the annual or the monthly archive index and adds those to the repository as needed; and finally, calls emacs on the blog posting file. Whew.
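The bookkeeping part of that list can be sketched in a few lines. This is not blogit! itself, only a minimal illustration of the steps described above: the escaping rule, the .mdwn extension, and the archives/ index file names are all assumptions made for the example, and the repository and emacs steps are left out.

```python
import re
from datetime import date
from pathlib import Path


def escape_title(title: str) -> str:
    """Hypothetical escaping rule: runs of non-alphanumerics become '_'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")


def ensure_post(root: Path, title: str, day: date) -> list[Path]:
    """Create blog/$year/$month/$escaped_title plus any missing archive indices.

    Returns every path that had to be created, in creation order, so the
    caller knows what still needs adding to the repository.
    """
    created = []
    year, month = f"{day:%Y}", f"{day:%m}"

    # Walk down blog/, blog/$year/, blog/$year/$month/, creating as needed.
    directory = root
    for part in ("blog", year, month):
        directory = directory / part
        if not directory.exists():
            directory.mkdir()
            created.append(directory)

    # Assumed names for the annual and monthly index pages.
    archive_root = root / "archives"
    indices = [archive_root / f"{year}.mdwn", archive_root / year / f"{month}.mdwn"]
    post = directory / f"{escape_title(title)}.mdwn"
    for path in [*indices, post]:
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text("")  # placeholder for the real entry or index template
            created.append(path)
    return created
```

Returning the list of newly created paths keeps the file system work separate from the version control step: whatever comes back is exactly what needs to be added to the repository.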


Saturday 23 December

Posted early Saturday morning, December 23rd, 2006

Mail Filtering With CRM114: Part 1

The first step is to configure the CRM114 files, and these are now in pretty good shape as shipped with Debian. All that I needed to do was set a password, say that I'd use openssl base64 -d, and stick with the less verbose defaults (so no munging subjects, no saving all mail, no saving rejects, etc, since I have other mechanisms that do all that). The comments in the configuration files are plenty good enough. This part went off like a breeze; the results can be found here (most of the values are still the upstream default).

The next step was to create new, empty .css files. I have noticed that creating more buckets makes crm114 perform better, so I go with larger than normal crm114 .css files. I have no idea if this is the right thing to do, but I make like Nike and just do it. At some point I'll ask on the crm114-general mailing list.

    % cssutil -b -r -S 4194000 spam.css
    % cssutil -b -r -S 4194000 nonspam.css

Now we have a blank slate; at this time the filter knows nothing, and is equally likely to call something Spam or non-Spam. We are now ready to learn. So, I girded my loins, and set about feeding my whole mail corpus to the filter:

    % /usr/share/crm114/mailtrainer.crm          \
        --spam=/backup/classify/Done/Spam/       \
        --good=/backup/classify/Done/Ham/        \
        --repeat=100 --streak=35000 |            \
        egrep -i '^ +|train|Excell|Running'

And this failed spectacularly (see Debian bug #399306). Faced with unexpected segmentation violations, and not being proficient in crm114's rather arcane syntax, I was forced to speculate: I assumed (as it turns out, incorrectly) that if you throw too many changes at the crm114 database, things rapidly escalate out of control. I went on to postulate that, as my mail corpus was gathered over a period of years, the characteristics of Spam had drifted over time, and what I consider Spam had also evolved. So, some parts of the early corpus are at variance with the more recent bits.

Based on this assumption, I created a wrapper script which did what Clint has called training to exhaustion -- it iterated over the corpus several times, starting with a small chunk size and increasing it geometrically. Given the premise I was working under, it does a good job of training crm114 on a localized window of Spam: it feeds chunks of the corpus to the trainer, with each chunk overlapping the previous and succeeding chunks, ensuring that crm114 is happy with any given chunk of the corpus. Then it doubles the chunk size, and goes at it again. All very clever, and all pretty wrong.
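For the record, the windowing scheme looks roughly like this. It is a sketch of the idea and not the wrapper script itself; the 50% overlap and the final whole-corpus pass are assumptions made for the example, and each (start, end) pair stands for a slice of the date-ordered corpus that would be handed to the trainer.

```python
from typing import Iterator


def overlapping_windows(total: int, size: int) -> Iterator[tuple[int, int]]:
    """Yield (start, end) slices of `size` items, each overlapping half of the previous one."""
    step = max(1, size // 2)  # assumed 50% overlap between neighbouring chunks
    start = 0
    while True:
        end = min(start + size, total)
        yield start, end
        if end == total:
            break
        start += step


def training_schedule(total: int, first_size: int) -> Iterator[tuple[int, int]]:
    """Sweep the corpus repeatedly, doubling the chunk size after each sweep."""
    if first_size < 1:
        raise ValueError("first_size must be at least 1")
    size = first_size
    while size < total:
        yield from overlapping_windows(total, size)
        size *= 2
    yield 0, total  # assumed final pass over the whole corpus
```

With ten messages and an initial chunk of four, the schedule is (0, 4), (2, 6), (4, 8), (6, 10), then (0, 8), (4, 10), and finally (0, 10).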

I also created another script to retrain crm114, which was less exhaustive than the previous one, but did incorporate any further drift in the nature of Spam. I no longer use these scripts; but I would like to record them for posterity as an indication of how far one can take an hypothesis.

What it did do was have crm114 learn without segfaulting -- and show me that there was a problem in the corpus. I noticed that in some cases the trainer would find a pair of mail messages, classify them wrongly, and retrain and refute -- iteration after iteration, back and forth. I noticed this when I added the egrep filter above, and was no longer drowning in the needless chatter from the trainer. It turns out I had very similar emails (sometimes even the same email) in both the Ham and the Spam corpus, so no wonder crm114 was throwing hissy fits. Having small chunks ensured that I did not have too many such errors in any chunk; and crm114 did try to forget a mail classified differently in an older chunk and learn whatever the current chunk was trying to teach it. The downside was that the count of the differences between ham and Spam went down, and the similarities increased -- which meant that the filter was not as good at separating ham and Spam as it could have been.

So my much vaunted mail corpus was far from clean -- over the years, I had misclassified mails, been wishy-washy, and changed my mind about what was and was not Spam. I have a script that uses md5sums to find duplicate files in a directory, and found, to my horror, that there were scores of duplicates in the corpus. After eliminating outright duplicates, I started examining anything that showed up with an ER (error, refute) tag in the trainer output; on the second iteration of the training script these were likely to be misclassifications. I spent days examining my corpus and cleaning it out, and was gratified to see the ratio of differences to similarities between the ham and Spam css files climb from a shade under 3 to around 7.35.
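The duplicate hunt is simple enough to sketch. This is a stand-in written for illustration, not the script mentioned above; the function name is made up, and it assumes one message per file, as in the Ham and Spam directories.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(*directories: Path) -> list[list[Path]]:
    """Group regular files under the given directories by the MD5 digest of their bytes."""
    by_digest = defaultdict(list)
    for directory in directories:
        for path in sorted(directory.rglob("*")):
            if path.is_file():
                digest = hashlib.md5(path.read_bytes()).hexdigest()
                by_digest[digest].append(path)
    # Only digests seen more than once indicate duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Passing both corpus directories in one call is what exposes a message filed under both Ham and Spam. Note that this only catches byte-identical files; the same mail delivered twice with different headers hashes differently and still has to be found by eye.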

Next post we'll talk about lessons learned about training, and how a nominal work flow of training on errors and training when classifiers disagree can be set up.


Tuesday 19 December

Posted early Tuesday morning, December 19th, 2006

Mail Filtering With CRM114: Introduction

I have a fairly sophisticated Spam filtering mechanism set up, using MIMEDefang for SMTP level rejection of Spam, grey-listing for mails not obviously ham or Spam, and both crm114 and spamassassin for discrimination, since they compensate for each other when one of them can't tell what it is looking at. People have often asked me to write up the details of my setup, and the support infrastructure, and I have decided to oblige. What follows is a series of blog posts detailing how I set about redoing my crm114 setup, with enough detail that interested parties can tag along.

I noticed that the new crm114 packages have a new facility called mailtrainer, which can be used to set up the initial css database files. I liked the fact that it can run over the training data several times, it can keep back portions of the training data as a control group, and you can tell it to keep on going until it gets a certain number of mails discriminated correctly. This is cool, since I have a corpus of about 15K Spam and 16K ham messages, mostly accumulated through my previous train-on-error practice (whenever crm114 classified a message incorrectly, I trained it, and stored all training emails). I also train whenever crm114 and spamassassin disagree with each other.

This would also be a good time to switch from crm114's mailfilter to the new mailreaver, which is the new, 3rd Generation mail filter script for crm114. It is easier to maintain, and since it flags mails for which it is unsure, and you ought to train all such mails anyway, it learns faster.

Also, since the new packages come with brand new discrimination algorithms, which are supposed to be as accurate but also faster, but which may store data in incompatible ways, I figured that it might be time to redo my CRM mail filter from the ground up. The default now is "osb unique microgroom". This change requires me to empty out the css files.

I also decided to change my learning scripts so that they no longer send commands via email to a 'testcrm' user; instead, I now train the filter from the command line. Apart from saving incorrectly classified mails into the folders ham and junk (Spam was already taken), I have a script that grabs mails which crm114 and spamassassin classify differently from my incoming mail spool and saves them into a grey.mbox file, which I can manually separate out into ham and junk. Then I have a processing script that takes the ham and junk folders and trains spamassassin, or crm114, or both, and stashes the training files away in my cached corpus for the future.
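The test for whether a mail belongs in grey.mbox boils down to comparing two verdicts. Here is a minimal sketch, assuming each classifier has already stamped its verdict into a header: the header names and value formats checked below (X-CRM114-Status starting with SPAM, X-Spam-Status starting with Yes) are assumptions for the example, so check them against what your own filters actually write. An unsure crm114 verdict is treated as not Spam here.

```python
from email import message_from_string
from email.message import Message


def crm114_says_spam(msg: Message) -> bool:
    # Assumed header shape: "X-CRM114-Status: SPAM ( pR: ... )"
    return msg.get("X-CRM114-Status", "").strip().upper().startswith("SPAM")


def spamassassin_says_spam(msg: Message) -> bool:
    # Assumed header shape: "X-Spam-Status: Yes, score=..."
    return msg.get("X-Spam-Status", "").strip().lower().startswith("yes")


def classifiers_disagree(raw: str) -> bool:
    """True when exactly one of the two filters called the message Spam."""
    msg = message_from_string(raw)
    return crm114_says_spam(msg) != spamassassin_says_spam(msg)
```

Anything for which classifiers_disagree returns true would be copied aside for manual sorting into ham and junk.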

In subsequent blog postings, I'll walk people through how I set up and initialized my filter, and provide examples of the scripts I used, along with various and sundry missteps and lessons learned.


Monday 18 December

Posted early Monday morning, December 18th, 2006

I am now an Ikiwiki user!

Well, this is the first post. I have managed to migrate my blog over to Ikiwiki, including all the historical posts. The reason for the migration was that development on my older blogging mechanism, Blosxom, entered a hiatus, though recently it has been revived on sourceforge. I like the fact that IkiWiki is based on a revision control system, and that I know the author pretty darned well :-).

One of my primary requirements for the migration was that I be able to replicate all the functionality of my existing blog, and this included its look and feel (which I do happen to like, despite the wincing I see from some visitors to my pages). This meant replicating the page template and CSS from my blog.

I immediately ran into problems: for example, the CSS for my blog was based on being able to mark up components of the date of the entry (day, day of week, month, etc.) and achieve fancy effects, and there was no easy way to use preexisting functionality of IkiWiki to present this information to the page template. Thus was born the varioki plugin, which attempts to provide a means to add variables for use in ikiwiki templates, based on variables set by the user in the ikiwiki configuration file. This is fairly powerful, allowing for uses like:

    varioki => {
      'motto'    => '"Manoj\'s musings"',
      'toplvl'   => 'sub {return $page eq "index"}',
      'date'     => 'sub { return POSIX::strftime("%d", gmtime((stat(srcfile($pagesources{$page})))[9])); }',
      'arrayvar' => '[0, 1, 2, 3]',
      'hashvar'  => '{1, 1, 2, 2}'
    },

The next major stumbling block was archive browsing for older postings; Blosxom has a nice calendar plugin that uses a calendar interface to let the user navigate to older blog postings. Since I really liked the way this looks, I set about scratching this itch as well; and now ikiwiki has attained parity with Blosxom vis-à-vis calendar plugins.

The calendar plugin, and the archive index pages, led me to start thinking about the physical layout of the blog entries on the file system. Since the tagging mechanism used in ikiwiki does not depend on the location in the file system (an improvement over my Blosxom system), I could lay out the blog postings in a more logical fashion. I ended up taking Roland Mas' advice and arranging for the blog postings to be created in files like:


The archives contain annual and monthly indices, and the calendar front end provides links to recent postings and to recent monthly indices. So, a few additions to the arch hook scripts, and perhaps a script to automatically create the directory structure for new posts and the annual and monthly indices as needed, and I'll have a low-effort blogging work flow, and I might manage to blog more often than the two blog postings I have had all through the year so far.


Webmaster <webmaster@golden-gryphon.com>
Last commit: terribly early Sunday morning, June 8th, 2014
Last edited terribly early Sunday morning, June 8th, 2014