
Tales from the Gryphon

Archives for 2007/01

Sunday 14 January 2007
Link: Mail Filtering with CRM114: Part 3

Posted early Sunday morning, January 14th, 2007

Mail Filtering with CRM114: Part 3

Uphold, maintain, sustain: life in the trenches

Now that I have a baseline filter, how do I continue to train it without putting in too much effort? There are two separate activities here: first, selecting the mails to be used in training, and second, automating the training and saving to the mail corpus. Ongoing training is essential; Spam mutates, even ham changes over time, and well trained filters drift. However, if training disrupts the normal work-flow, it won't happen; so a minimally intrusive set of tools is critical.

Selecting mail to train filters

There are three broad categories of mail that are candidates for training:

1. Misclassified mail

This is where human judgement comes in, to separate the wheat from the chaff.

  • misclassified Spam. I do nothing special for this category -- I assume that I would notice these mails, and when I do, I just save them in a junk folder for later processing. The volume of such messages has fallen to about one or two a month, and having them slip through is not a major problem in the first place.
  • misclassified ham is far more critical, and, unfortunately, somewhat harder to get right, since you do want to reject the worst of the Spam at the SMTP level. A mistake here is worse than a false negative: all that happens with a false negative is that you curse, save the mail to the junk folder for later retraining, and move on. With missed Ham, you never know what you might have missed -- and can only hope it was nothing important.

    • If one of the two filters did the right thing, then the mlg script can catch it -- more on that below. The only thing to remember is to look carefully at the grey.mbox folder that is produced. While it is not good to have misclassified ham, at least this sub-category is easy to detect.
    • If both filters misclassified it, then a human would have to catch the error. The good news is that I haven't had very many mails fall into this category (the last one I know about was in the late summer of 2005). How can one be sure that there are not ham messages falling through the cracks all the time? The idea is to accept mail with scores one would normally discard, and treat this as a canary in a coal mine: no ham should ever show up in these realspam messages. My schema for handling Spam is shown in the figure below.

    I try to keep all the mail that I have rejected in quarantine for about a month, so I can retrieve a mail if informed about a mistaken rejection. I also do spot checks once in a while, though as time has gone on with no known false positives, the frequency of my checks has dropped.
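
    For those spot checks, something as small as the sketch below is enough. This is purely a hypothetical helper, not part of the actual filtering pipeline; the realspam path and the assumption that it is an mbox folder are mine. It just prints the sender and subject of each message in realspam, so a quick glance can confirm the canary is still silent.

        #!/usr/bin/env python3
        # Hypothetical spot-check helper: print the sender and subject of every
        # message in the realspam folder so a human can scan for stray ham.
        import mailbox

        REALSPAM = "Mail/realspam"   # assumed location of the realspam mbox

        for msg in mailbox.mbox(REALSPAM):
            sender = msg.get("From", "<no sender>")
            subject = msg.get("Subject", "<no subject>")
            print("%-40.40s %s" % (sender, subject))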

2. Partially misclassified mail

This is mail that is correctly classified overall, but misclassified by one of crm114 or spamassassin (not both). This is an early warning sign, and is more common than outright misclassified mail, since the filter that is wrong is usually only weakly wrong. This is the point where training should occur, so that the filter does not drift to the point where mails are misclassified. Again, the mlg script catches this.

3. Mail that crm114 is unsure about

This is mail that is correctly classified, but that mailreaver is unsure about -- and this category is why mailreaver learns faster than mailtrainer.

Spam handling and disposition

[Figure: Spam handling schema]

At this point I should say something about how I generally handle mails scored as Spam by the filters. As you can see from the schema, the handling is simple, and depends on the combined score given to the mail by the filters. The handling rules are:

  • score <= 5.0: Accept unconditionally
  • 5.0 < score <= 15: Grey list
  • 15 < score: reject

So, any mail with a score of 15 or less is accepted, potentially after grey-listing. The disposition is then done according to the following set of rules (sketched in code after the list):

  • score <= 0: Classify into folder based on origin
  • 0 < score <= 10: file into Spam (some of this survived grey-listing)
  • 10 < score: file into realspam (Must have survived grey-listing)
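
In code form, the two rule sets look roughly like the sketch below. The function names and folder labels are mine, purely to make the thresholds concrete; the real work is of course done in the MTA and delivery setup, not in Python.

    # Rough sketch of the handling and disposition rules above; the function
    # names and folder labels are illustrative, not the real delivery setup.

    def smtp_action(score):
        """SMTP-level handling for a given combined filter score."""
        if score <= 5.0:
            return "accept"
        elif score <= 15:
            return "greylist"      # accepted only if it survives grey-listing
        else:
            return "reject"

    def disposition(score):
        """File an accepted mail (score <= 15) into a folder by score."""
        if score <= 0:
            return "folder by origin"
        elif score <= 10:
            return "Spam"          # some of this survived grey-listing
        else:
            return "realspam"      # must have survived grey-listing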

In the last 18+ months, I have not seen a Ham mail in my realspam folder; the chances of Ham being rejected are pretty low. My Spam folder gets a ham message every few months, but these are usually cases of spamassassin misclassifying things, and mlg detects those. I have not seen one of these in the last 6 months. So my realspam canary has done wonders for my peace of mind. With ideally trained filters, the Spam and realspam folders would be empty.

mail list grey

I have created a script called mlg ("Mail List Grey") that I run periodically over my mail folders; it picks out mails that either (a) are classified differently by spamassassin and crm114, or (b) are marked as unsure by mailreaver. The script saves these mails into a grey.mbox folder. I tend to run it over Spam and non-Spam folders in separate runs, so that in the vast majority of cases the grey.mbox folder can simply be renamed to either ham or junk. Only for misclassified mails do I have to pick out the misplaced email individually and classify it separately from the rest of the emails in that batch.
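
The script itself is not reproduced here, but its logic is roughly the outline below. The header names (X-Spam-Status from spamassassin, X-CRM114-Status from mailreaver) and the folder paths are assumptions about my local setup, so treat this as a sketch of the idea rather than as mlg itself.

    #!/usr/bin/env python3
    # Rough outline of what mlg does: pull out mails where spamassassin and
    # crm114 disagree, or where mailreaver is unsure, and save them into
    # grey.mbox for later human classification.  Header names and paths are
    # assumptions, not a drop-in replacement for the real script.
    import mailbox

    SOURCE = "Mail/inbox"        # folder to scan (run separately over Spam folders)
    GREY = "Mail/grey.mbox"      # where candidates for retraining end up

    src = mailbox.mbox(SOURCE)
    grey = mailbox.mbox(GREY)

    for msg in src:
        sa_spam = msg.get("X-Spam-Status", "").lower().startswith("yes")
        crm = msg.get("X-CRM114-Status", "").lower()
        crm_spam = crm.startswith("spam")
        unsure = crm.startswith("unsure")

        # (a) the two filters disagree, or (b) mailreaver is unsure
        if unsure or sa_spam != crm_spam:
            grey.add(msg)

    grey.flush()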

At this point, I should have mail mbox folders called ham and/or junk, which are now ready to be used for training either crm114 or spamassassin or both. Processing these folders is the subject of the next article in this series.

Manoj

Thursday 11 January 2007
Link: Mail Filtering with CRM114: Part 2

Posted early Thursday morning, January 11th, 2007

Mail Filtering with CRM114: Part 2

Or, Cleanliness is next to godliness

The last time I blogged about Spam fighting, in Mail Filtering With CRM114 Part 1, I left y'all with visions of non-converging learning, various ingenious ways of working around an unclean corpus, and perhaps a sinking feeling that this whole thing was more fragile than it ought to be.

During this period of flailing around, trying to get mailtrainer to learn the full corpus correctly, I upgraded to an unpackaged version of crm114. Thanks to the excellent packaging effort by the maintainers, this was dead easy: get the Debian sources using apt-get source crm114, download the new tarball from crm114 upstream, cp the debian dir over, and just edit the changelog file to reflect the new version. I am currently running my own, statically linked 20061103-Blame Dalkey.src-1.

Cleaning the corpus made a major difference to the quality of discrimination. As mentioned earlier, I examined every mail that was initially incorrectly classified during learning. There are two ways this can happen: either the piece of mail was correctly placed in the corpus, but had a feature that was different from those learned before; or it was wrongly classified by me. When I started, the two causes were about equally likely; I have now hopefully eliminated most of the misclassifications. When mailtrainer goes into cycles, retraining on a couple of emails round after round, you are almost certainly trying to train it in conflicting ways. Cyclic retraining is almost always a human's error in classification.

Some of the errors discovered were not just misclassifications: some were things that were inappropriate mail, but not Spam. For instance, there was the whole conversation where someone subscribed debian-devel to another mailing list -- the challenge, the subscription notice, the un-subscription, the challenge again, and the notice -- all of which were inappropriate, interrupted the flow, and contributed to the noise, but were not really Spam. I had, in a fit of pique, labelled them as Spam; but they were really like any other mailing list subscription conversation, which I certainly want to see for my own subscriptions. crm114 did register the conversations as Spam and non-Spam, as requested, but that increased the similarity between Spam and non-Spam features -- and probably decreased the accuracy. I've since decided to train only on Spam, not on inappropriate mails, and to let Gnus keep inappropriate mails from my eyes.

I've also waffled over time about whether to treat newsletters from Dr. Dobbs Journal or USENIX as Spam -- now my rule of thumb is that since I signed up for them at some point, they are not Spam -- though I don't feel guilty about letting mailagent squirrel them away, mostly out of sight.

A few tips about using mailtrainer:

  • Make sure you have a clean corpus
  • Try to make it so that you have roughly equal numbers of Ham and Spam
  • Don't let mailtrainer quit after a set number of rounds. I use a ridiculous repeat count of about 100 -- never expecting to get anywhere close to that. Instead, I set the streak to the number of Spam mails plus the number of Ham mails plus 10 (see the sketch after this list). This means that mailtrainer does not quit until it has processed every email in the corpus correctly, without needing to retrain. Letting mailtrainer quit after it had repeated the corpus twice, but before it got a streak of correct classifications, left me with a filter with horrible accuracy.
  • Make sure you have a clean corpus (yes, really)
  • While the ratio of similarities between .css files to the differences between .css files is a fun metric, and I use it as an informal, rough benchmark while training, it is not really correlated with accuracy, so don't get hung up on it, like I was. Usually, the value I get when training from scratch on my corpus is somewhere around 7.5; but over time the ratio degrades (falling to about 5), while the filter accuracy keeps increasing.
  • Train only on errors. The good news is that both mailtrainer and mailreaver already do so, so this is an easy rule to follow. The reason is that you should only be doing minimal training, in case you have to modify or change the underlying rules in the future. So your filter should be trained to correctly classify the mails you get, but don't shoot for overwhelming scores.
  • Use mailreaver. It has the nice behaviour of asking for retraining when it is unsure about some mail, and it caches mail for some time, which really helps in training and rebuilding the .css files. The reason mailreaver learns faster than mailtrainer is, I think, just this feature.
  • Stash the mails you train on in your corpus, and don't hesitate to re-run mailtrainer over the corpus after you have been training on unsure emails. When you train on single emails, the changed filter may no longer correctly classify some email already in the corpus; running mailtrainer over the whole corpus adjusts the filter to correctly classify every mail again. I generally retrain about once a week, though I no longer retrain from scratch unless I change the classification algorithm.
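
To make the streak rule concrete, here is a small sketch of how one might compute it and kick off mailtrainer. The directory layout (one mail per file under a ham and a spam directory) and the option names shown (--good, --spam, --repeat, --streak) are assumptions on my part, so check mailtrainer.crm's own documentation for the exact invocation before copying anything.

    #!/usr/bin/env python3
    # Sketch: set the streak to (number of Spam + number of Ham + 10) and
    # invoke mailtrainer with it.  Directory layout, flag names, and the
    # invocation path are assumptions; consult mailtrainer.crm's docs.
    import os
    import subprocess

    HAM_DIR = "corpus/ham"       # assumed layout: one mail per file
    SPAM_DIR = "corpus/spam"

    n_ham = len(os.listdir(HAM_DIR))
    n_spam = len(os.listdir(SPAM_DIR))
    streak = n_ham + n_spam + 10         # quit only after a full clean pass

    subprocess.run([
        "crm", "mailtrainer.crm",        # invocation may differ on your system
        "--good=%s/" % HAM_DIR,          # assumed flag names, see above
        "--spam=%s/" % SPAM_DIR,
        "--repeat=100",                  # ridiculously high; the streak ends it first
        "--streak=%d" % streak,
    ], check=True)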

Manoj

