
Tales from the Gryphon

Archives for 2006

Sunday 24 December
Link: Arch, Ikiwiki, blogging

Posted early Sunday morning, December 24th, 2006

Arch, Ikiwiki, blogging

One of the reasons I have blogged only 21 times in thirty months is the very kludgey work flow I had for blogging; I had to create the file manually, scp it over by hand, and ensure that any ancillary files were in place on the remote machine that serves up my blog.

After moving to ikiwiki, and thus arch, there would be even more overhead, were it not so amenable to scripting. Since this is arch, and therefore creating branches and merging is both easy and natural, I have two sets of branches -- one set related to the templates and actual blog content I serve on my local, development box, and a parallel set of branches that I publish. The devel branches are used by ikiwiki on my local box; the remote ikiwiki uses the publish branch. So I can make changes to my heart's content on the devel branch, and then merge into my publish branch. When I commit to the publish branches, the hook functions ensure that there is a fresh checkout of the publish branch on the remote server, and that ikiwiki is run to regenerate the web pages to reflect the new commit.
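The publish-on-commit logic can be sketched like this (a hedged sketch only: the branch naming convention, the `PUBLISH_CMD` indirection, and the remote commands in the comment are assumptions, not the actual hook):

```shell
# Commit-hook helper (sketch): arch hooks get the branch name; only a
# commit on a publish branch should refresh the remote checkout and
# rerun ikiwiki. Devel branches need no action, since the local
# ikiwiki reads them directly.
on_commit () {
    case "$1" in
        *--publish--*)
            # the real hook would do something like:
            #   ssh webhost 'tla update ... && ikiwiki --setup blog.setup'
            ${PUBLISH_CMD:-echo publish} "$1" ;;
        *)
            : ;;    # devel branches: nothing to do remotely
    esac
}
```

It would be called from the hook with the fully qualified branch name, e.g. `on_commit blog--publish--1.0`.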

The hook functions are nice, but not quite enough to make blogging as effortless as it could be. With the move to ikiwiki, and the dissociation of classification and tagging from the file system layout, I have followed the lead of Roland Mas and organized my blog layout by date; posts are put in blog/$year/$month/$escaped_title. The directory hierarchy might not exist for a new year or month. A blog posting may also show up in two different archive indices: the annual archive index for the year, and a monthly index page created for every month I blog in. However, at the time of writing, there is no annual index for the next year (2007), or the next month (January 2007). These have to be created as required.

All this would get quite tedious, and indeed, would frequently remain undone -- were it not for automation. To make my life easier, I have blogit!, which takes care of the niggling details. When called with the title of the prospective post, this script figures out the date; ensures that the blog directory structure exists, creating path components and adding them to the repository as required; creates a blog entry template; adds the blog entry to the repository; creates the annual or monthly archive index and adds those to the repository as needed; and finally, calls emacs on the blog posting file. Whew.
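A minimal sketch of what such a script does, under stated assumptions (the .mdwn extension, the calendar-directive index contents, and the stub template are illustrative; the real blogit! also runs "tla add" on each new file and launches emacs):

```shell
# blogit (sketch): given a post title, ensure the dated directory tree
# and the annual/monthly archive indices exist, then create a stub
# entry. File extensions and index contents here are assumptions.
blogit () {
    title=$1
    blogdir=${BLOGDIR:-$HOME/blog}
    year=$(date +%Y) month=$(date +%m)
    escaped=$(printf '%s' "$title" | tr ' ' '_')    # escape title for the path

    mkdir -p "$blogdir/$year/$month"                # missing path components

    # annual and monthly archive indices, created only as required
    [ -f "$blogdir/$year.mdwn" ] ||
        printf '[[!calendar type=year year=%s]]\n' "$year" > "$blogdir/$year.mdwn"
    [ -f "$blogdir/$year/$month.mdwn" ] ||
        printf '[[!calendar type=month month=%s]]\n' "$month" > "$blogdir/$year/$month.mdwn"

    post="$blogdir/$year/$month/$escaped.mdwn"
    [ -f "$post" ] || printf '[[!meta title="%s"]]\n\n' "$title" > "$post"
    printf '%s\n' "$post"       # real script: tla add the files, emacs "$post"
}
```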


Saturday 23 December
Link: Mail Filtering With CRM114: Part 1

Posted early Saturday morning, December 23rd, 2006

Mail Filtering With CRM114: Part 1

The first step is to configure the CRM114 files, and these are now in pretty good shape as shipped with Debian. All that I needed to do was set a password, say that I'd use openssl base64 -d, and stick with the less verbose defaults (so no munging subjects, no saving all mail, no saving rejects, etc, since I have other mechanisms that do all that). The comments in the configuration files are plenty good enough. This part went off like a breeze; the results can be found here (most of the values are still the upstream default).

The next step was to create new, empty .css files. I have noticed that creating more buckets makes crm114 perform better, so I go with larger-than-normal crm114 .css files. I have no idea if this is the right thing to do, but I make like Nike and just do it. At some point I'll ask on the crm114-general mailing list.

    % cssutil -b -r -S 4194000 spam.css
    % cssutil -b -r -S 4194000 nonspam.css

Now we have a blank slate; at this time the filter knows nothing, and is equally likely to call something Spam or non-Spam. We are now ready to learn. So, I girded my loins, and set about feeding my whole mail corpus to the filter:

 /usr/share/crm114/mailtrainer.crm               \
   --spam=/backup/classify/Done/Spam/           \
   --good=/backup/classify/Done/Ham/            \
   --repeat=100 --streak=35000 |                \
         egrep -i '^ +|train|Excell|Running'

And this failed spectacularly (see Debian bug #399306). Faced with unexpected segment violations, and not being proficient in crm114's rather arcane syntax, I was forced to speculate: I assumed (as it turns out, incorrectly) that if you throw too many changes at the crm114 database, things rapidly escalate out of control. I went on to postulate that as my mail corpus was gathered over a period of years, the characteristics of Spam drifted over time, and what I consider Spam has also evolved. So, some parts of the early corpus are at variance with the more recent bits.

Based on this assumption, I created a wrapper script which did what Clint has called training to exhaustion -- it iterated over the corpus several times, starting with a small and geometrically increasing chunk size. Given the premise I was working under, it does a good job of training crm114 on a localized window of Spam: it feeds chunks of the corpus to the trainer, with each successive chunk overlapping the previous and succeeding chunks, and ensuring that crm114 is happy at any given chunk of the corpus. Then it doubles the chunk size, and goes at it again. All very clever, and all pretty wrong.
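The shape of that wrapper, as a hedged sketch: the chunk numbers here are offsets into a sorted corpus listing, and TRAINER is a stand-in for the real mailtrainer.crm invocation with its --spam/--good directories, so treat the details as assumptions rather than the original script.

```shell
# Training-to-exhaustion (sketch): walk the corpus in chunks, each
# chunk half-overlapping its neighbours, and double the chunk size
# once a full pass completes.
train_in_chunks () {
    corpus_size=$1
    chunk=${2:-256}                      # starting chunk size (assumption)
    while [ "$chunk" -lt $((corpus_size * 2)) ]; do
        step=$((chunk / 2))              # 50% overlap between chunks
        start=0
        while [ "$start" -lt "$corpus_size" ]; do
            ${TRAINER:-echo train} "$start" $((start + chunk))
            start=$((start + step))
        done
        chunk=$((chunk * 2))             # geometric growth
    done
}
```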

I also created another script to retrain crm114, which was less exhaustive than the previous one, but did incorporate any further drift in the nature of Spam. I no longer use these scripts, but I would like to record them for posterity as an indication of how far one can take a hypothesis.

What it did do was have crm114 learn without segfaulting -- and show me that there was a problem in the corpus. I noticed that in some cases the trainer would find a pair of mail messages, classify them wrongly, and retrain and refute -- iteration after iteration, back and forth. I noticed this once I had added the egrep filter above and was no longer drowning in the needless chatter from the trainer. It turns out I had very similar emails (sometimes even the same email) in both the Ham and the Spam corpus, and no wonder crm114 was throwing hissy fits. Having small chunks ensured that there were not too many such errors in any one chunk; and crm114 would try to forget a mail classified differently in an older chunk and learn whatever the current chunk was trying to teach it. The downside was that the count of the differences between ham and Spam went down, and the similarities increased -- which meant that the filter was not as good at separating ham and Spam as it could have been.

So my much vaunted mail corpus was far from clean -- over the years, I had misclassified mails, been wishy-washy and changed my mind about what was and was not Spam. I have a script that uses md5sums to find duplicate files in a directory, and found, to my horror, that there were scores of duplicates in the corpus. After eliminating outright duplicates, I started examining anything that showed up with an ER (error, refute) tag in the trainer output; on the second iteration of the training script these were likely to be misclassifications. I spent days examining my corpus and cleaning it out; and was gratified to see the ratio of differences to similarities between ham and Spam css files climb from a shade under 3 to around 7.35.
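The duplicate hunt itself needs only a few lines of shell; this is a sketch of the idea (hash, sort, flag repeats), not the author's actual script, and it assumes corpus file names contain no whitespace:

```shell
# Find duplicate files by md5sum (sketch): hash every file, sort so
# identical hashes are adjacent, and print the second and later file
# seen for each hash.
find_dups () {
    find "$1" -type f -exec md5sum {} + |
        sort |
        awk 'seen[$1]++ { print $2 }'
}
```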

Next post we'll talk about lessons learned about training, and how a nominal work flow of training on errors and training when classifiers disagree can be set up.


Tuesday 19 December
Link: Mail Filtering With CRM114: Introduction

Posted early Tuesday morning, December 19th, 2006

Mail Filtering With CRM114: Introduction

I have a fairly sophisticated Spam filtering mechanism set up, using MIMEDefang for SMTP-level rejection of Spam, grey-listing for mails not obviously ham or Spam, and both crm114 and spamassassin for discrimination, since they compensate for each other when one of them can't tell what it is looking at. People have often asked me to write up the details of my setup and the supporting infrastructure, and I have decided to oblige. What follows is a series of blog posts detailing how I set about redoing my crm114 setup, with enough detail that interested parties can tag along.

I noticed that the new crm114 packages have a new facility called mailtrainer, which can be used to set up the initial css database files. I liked the fact that it can run over the training data several times, that it can keep back portions of the training data as a control group, and that you can tell it to keep going until it gets a certain number of mails discriminated correctly. This is cool, since I have a corpus of about 15K Spam and 16K ham messages, mostly saved thanks to my previous train-on-error practice (whenever crm114 classified a message incorrectly, I trained it, and stored all training emails). I train whenever crm114 and spamassassin disagree with each other.

This would also be a good time to switch from crm114's mailfilter to the new mailreaver, which is the new, 3rd Generation mail filter script for crm114. It is easier to maintain, and since it flags mails for which it is unsure, and you ought to train all such mails anyway, it learns faster.

Also, since the new packages come with a brand new discrimination algorithm, which is supposed to be just as accurate but faster, though it may store data in incompatible ways, I figured that it might be time to redo my CRM114 mail filter from the ground up. The default now is "osb unique microgroom". This change requires me to empty out the css files.

I also decided to change my learning scripts to stop sending commands via email to a 'testcrm' user; instead, I now train the filter using a command line mode. Apart from saving incorrectly classified mails into the folders ham and junk (Spam was already taken), I have a script that grabs mails which crm114 and spamassassin classify differently from my incoming mail spool and saves them into a grey.mbox file, which I can manually separate into ham and junk. Then I have a processing script that takes the ham and junk folders and trains spamassassin, or crm114, or both; and stashes the training files away into my cached corpus for the future.
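The disagreement check at the heart of that grabbing script can be sketched like this; the header patterns are assumptions based on the X-CRM114-Status and X-Spam-Flag headers the two filters conventionally add, not the script's actual logic:

```shell
# Flag a message for grey.mbox when the two classifiers disagree
# (sketch; header patterns are assumptions).
disagree () {
    msg=$(cat)                # one message on stdin
    crm=$(printf '%s' "$msg" | grep -c '^X-CRM114-Status: SPAM')
    sa=$(printf '%s' "$msg" | grep -c '^X-Spam-Flag: YES')
    [ "$crm" -ne "$sa" ]      # exit 0 (i.e. disagreement) when verdicts differ
}
```

A hypothetical use: `disagree < message && cat message >> grey.mbox`.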

In subsequent blog postings, I'll walk people through how I setup and initialized my filter, and provide examples of the scripts I used, along with various and sundry missteps and lessons learned and all.


Monday 18 December
Link: I am now an Ikiwiki user!

Posted early Monday morning, December 18th, 2006

I am now an Ikiwiki user!


Well, this is first post. I have managed to migrate my blog over to Ikiwiki, including all the historical posts. The reason for the migration was that development on my older blogging mechanism, Blosxom, had entered a hiatus, though recently it has been revived on SourceForge. I like the fact that IkiWiki is based on a revision control system, and that I know the author pretty darned well :-).

One of my primary requirements for the migration was that I be able to replicate all the functionality of my existing blog, including the look and feel (which I do happen to like, despite the wincing I see from some visitors to my pages). This meant replicating the page template and CSS from my blog.

I immediately ran into problems: for example, my CSS markup for my blog was based on being able to mark up components of the date of the entry (day, day of week, month, etc.) and achieve fancy effects; and there was no easy way to use preexisting functionality of IkiWiki to present that information to the page template. Thus was born the varioki plugin, which provides a means to add variables for use in ikiwiki templates, based on variables set by the user in the ikiwiki configuration file. This is fairly powerful, allowing for uses like:

    varioki => {
      'motto'    => '"Manoj\'s musings"',
      'toplvl'   => 'sub {return $page eq "index"}',
      'date'     => 'sub { return POSIX::strftime("%d", gmtime((stat(srcfile($pagesources{$page})))[9])); }',
      'arrayvar' => '[0, 1, 2, 3]',
      'hashvar'  => '{1, 1, 2, 2}',
    },

The next major stumbling block was archive browsing for older postings; Blosxom has a nice calendar plugin that uses a calendar interface to let the user navigate to older blog postings. Since I really liked the way this looks, I set about scratching this itch as well; and now ikiwiki has attained parity with Blosxom vis-à-vis calendar plugins.

The calendar plugin, and the archive index pages, led me to start thinking about the physical layout of the blog entries on the file system. Since the tagging mechanism used in ikiwiki does not depend on the location in the file system (an improvement over my Blosxom system), I could lay out the blog postings in a more logical fashion. I ended up taking Roland Mas' advice and arranging for the blog postings to be created in files like blog/$year/$month/$escaped_title.


The archives contain annual and monthly indices, and the calendar front end provides links to recent postings and to recent monthly indices. So, with a few additions to the arch hook scripts, and perhaps a script to automatically create the directory structure for new posts and the annual and monthly indices as needed, I'll have a low-threshold-of-effort work flow for blog entries, and I might manage to blog more often than the two postings I have had all through the year so far.


Wednesday 22 November
Link: Burninating spam

Posted at midnight, November 22nd, 2006

Burninating spam

It is funny that people say no one has written software that does Bayesian filtering in order to reject Spam at the SMTP level, just when I felt like writing up my mimedefang setup. I am of the school of thought that the only acceptable method of dealing with spam is to reject it at the SMTP protocol level. To accomplish this, I run crm114 and spamassassin from my mimedefang-filter, which runs as a milter while my sendmail is processing mail. I also do conditional grey listing; more on that below.

The reason for running two different Spam detection mechanisms is that while one or the other is often fooled and mis-categorizes mail, the score or confidence level of such a mis-categorization is low, and the mechanism that is not fooled enables the combined system to do the right thing in most cases. I tend to quarantine and keep even the messages I reject for about a month, and do spot checks; out of a volume of roughly 800 mails a day, I get about one or two spam messages a day, and never rejected a mail I would manually classify as ham -- not in the 6 months in which I double checked every decision taken by my system, nor in the spot checks I continue to perform. While this is not as good a record as the author of crm114 reports, it is good enough for me.

The grey listing: since I try to be conservative, and opt to lean towards not discarding any legitimate email at the expense of not trapping all spam, and since I do not discard unread any mail I accept based on an automated decision, I was still getting more Spam than I cared for. So, I added the grey listing layer -- but only for mails that were not unambiguously spam or viruses (which are rejected), or ham (which is accepted for delivery). Most of the Spam that was slipping through came from this grey area, so I implemented grey listing only for these mails.

Grey listing happens on a triplet of the class C network of the sender's IP, the sender, and the recipient. A grey listed email encounters a temporary processing problem: it is rejected with a 451 for 30 minutes, and then retries are allowed through for 4 hours, in which case the triplet is white-listed for 35 days (this catches monthly announcement messages). Known Spam resets the white-listing, and retrying past the window does too.
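The triplet itself is simple to derive; here is a sketch (the key format is an assumption -- in a real implementation, the 30-minute, 4-hour, and 35-day timers above would be tracked against this key):

```shell
# Derive the greylisting triplet key: the class C network of the
# sender's IP, plus envelope sender and recipient. Key format is an
# illustrative assumption.
triplet_key () {
    ip=$1 sender=$2 recipient=$3
    classc=$(printf '%s' "$ip" | cut -d. -f1-3)   # class C network part
    printf '%s.0|%s|%s\n' "$classc" "$sender" "$recipient"
}
```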

Doing it this way, known good correspondents suffer no delay, unambiguous mail from strangers flows right through, and I improve my filters based on the results all the time, so my discriminator improves.

My filter is available for examination. See also the older posting; in the time honoured tradition of /., this posting is a duplicate.


Monday 02 January
Link: In vain the sage with retrospective eye --Pope

Posted early Monday morning, January 2nd, 2006

In vain the sage with retrospective eye --Pope

It has been a tumultuous decade of involvement with Debian for me. I had been on the mailing list since mid 1994, but I was reasonably happy with my SLS system (installed using 40 floppies, including about a dozen for just X11 alone), and while I found Debian intriguing, I was not about to go through the pain of a brand new install until I felt that the new project was viable in the long term :-)

I actually jumped ship in the spring of '95 and installed 0.93R5. The next step came with Bug 1766, my very first bug report: Bug in script checksecurity in package cron, on 25 Oct 1995.

Once in, I rapidly went on to the next phase. Here is the sum total of the NM process I went through: my Hello, World mail. Those were the days :-). There was nothing between my ITP and the upload.

My first significant package was kernel-package, since I was always missing something in the series of steps needed to build a kernel, and I started getting into it in the summer of '96. This is where the second part of my apprenticeship started -- even though I had 3-4 packages in the archive, my kernel images were not yet trusted, so I sent my images to my sponsor (Hi Simon), who then uploaded them to master.

Somewhat later, I also was involved in the early design stages of apt, and the dependency sorting algorithms.

While I was fairly silent during the whole DFSG/SC debates (to the extent I was labelled a mindless follower of Bruce --heh), I took an active part in the constitution debates (possibly due to the fallout of the beach story). Anyway, I seemed to want to get us a constitution, after the effort had seemed stalled for months.

It is interesting to note how the technical committee was initially set up: there was a proposal, and then Ian Jackson responded by saying he wanted to appoint the ctte members, since he had been around for a while and was also the DPL. The earliest data I can find is from June of '98, as seen in a mail later that year, when the initial list of committee members was created (I also seem to recall that I was not on the list initially, but was added in the early days, before the committee was actually formed).

I was interested in the technical policy fairly early, taking the stance that policy was more than just a bunch of ignorable guidelines. Eventually, a thread that tried to reach a compromise gave us something like the views we (well, I) hold today -- including what happens when packages do not follow policy, and about policy editors. And abuse of power. About this time, the policy Czar resigned, burned out, and hounded by accusations of delusions of power.

So I first proposed the whole policy editor and consensus approach to letting policy evolve, which eventually led to a formal delegation recently.

Brian Basset was our very first secretary. My first involvement with things that would lead to the secretary's job started with trying to set up voting for policy. Then, around 2000, our secretary and treasurer Darren Benham went MIA, and Raul, as the chair of the technical committee, had to take over and run the DPL election for 2001 -- and forgot that DPL elections are supposed to be secret ballots. Ben Collins tapped me for the secretary's position mid-2001, as I recall.

It is hard, in a decade or so, to find anything I have not touched -- but NM is one such area. Apart from an early pre-current-NM mail, I have not been very involved in NM. Or the Debian installer. Or Debconf. Hmm. I seem to have drifted away from things that Joey Hess is involved in, which is a pity; he is high on the list of people I respect in the project, and this lack of interaction as time goes on irks me.


Webmaster <webmaster@golden-gryphon.com>
Last commit: terribly early Sunday morning, June 8th, 2014
Last edited terribly early Sunday morning, June 8th, 2014