Tales from the Gryphon/ categories/

Tales from the Gryphon

Blog postings related to fighting spam

Manoj's hackergotchi
RSS Atom Add a new post titled:
Tuesday 05 May
Link: Debian list spam reporting the Gnus way

Posted early Tuesday morning, May 5th, 2009

License: GPL

Debian list spam reporting the Gnus way

#+TITLE: Debian spam reporting the Gnus way #+AUTHOR: Manoj Srivastava #+EMAIL: srivasta@debian.org #+DATE: #+LANGUAGE: en #+OPTIONS: H:0 num:nil toc:nil \n:nil @:t ::t |:t ^:t -:t f:t *:t TeX:t LaTeX:t skip:nil d:nil tags:not-in-toc #+INFOJS_OPT: view:showall toc:nil ltoc:nil mouse:underline buttons:nil path:http://orgmode.org/org-info.js #+LINK_UP: http://www.golden-gryphon.com/blog/manoj/ #+LINK_HOME: http://www.golden-gryphon.com/ So, recently our email overlords graciously provided means for us minions to help them in their toils and help clean up the spammish clutter in the mailing lists by helping report the spam. And the provided us with a dead simple means of reporting such spam to them. Now, us folks who knoweth that there is but one editor, the true editor, and its, err, proponent is RMS, use Gnus to follow the emacs mailing lists, either directly, or through [[http://www.gmane.org][gmane]]. There are plenty of examples out there showing how to automate reporting spam to /gmane/, so I won't bore y'all with the details. Here I only show how one serves our list overlords, and smite the spam at the same time. Some background, from the Gnus info page. I'll try to keep it brief. There is far more functionality present if you read the documentation, but you can see that for yourself. #+BEGIN_QUOTE The Spam package provides Gnus with a centralized mechanism for detecting and filtering spam. It filters new mail, and processes messages according to whether they are spam or ham. There are two "contact points" between the Spam package and the rest of Gnus: checking new mail for spam, and leaving a group. Checking new mail for spam is done in one of two ways: while splitting incoming mail, or when you enter a group. Identifying spam messages is only half of the Spam package's job. The second half comes into play whenever you exit a group buffer. At this point, the Spam package does several things: it can add the contents of the ham or spam message to the dictionary of the filtering software, and it can report mail to various places using different protocols. #+END_QUOTE All this is very plugin and modular. The advantage is, that you can use various plugin front ends to identify spam and ham, or mark messages as you go through a group, and when you exit the group, spam is reported, ham and spam messages are copied to special destinations for future training of your filter. Since you inspect the marks put into the group buffer as you read the messages, there is a human involved in the processing, but as much as possible can be automated away. /Do/ read the info page on the Spam package in Gnus, it is edifying. Anyway, here is a snippet from my ~etc/emacs/news/gnusrc.el~ file, which can help automate the tedium of reporting spam. This is perhaps more like how Gnus does things than having to press a special key for every spam, and which does nothing to help train your filter. #+BEGIN_HTML [[!syntax language="Common Lisp" linenumbers=1 bars=1 text=""" (add-to-list 'gnus-parameters '("^nnml:\\(debian-.*\\)$" (to-address . "\\1@lists.debian.org") (to-list . "\\1@lists.debian.org") (admin-address . "\\1-request@lists.debian.org") (spam-autodetect . t) (spam-autodetect-methods spam-use-gmane-xref spam-use-hashcash spam-use-BBDB) (spam-process '(spam spam-use-resend)) (spam-report-resend-to . "report-listspam@lists.debian.org") (subscribed . t) (total-expire . t) )) """]] #+END_HTML


Sunday 25 November
Link: Filtering Accuracy: Brown paper bag time

Posted early Sunday morning, November 25th, 2007

License: GPL

Filtering Accuracy: Brown paper bag time

After posting about filtering accuracy I got to thinking about the test I was using. It appeared to me that there should be no errors in the mails that crm114 had already been trained upon -- but here I was, coming up with errors when I trained the css files until there were no errors, and then used a reg exp that tried to find the accuracy of classification for all the files, not just new ones. This did not make sense.

The only explanation was that my css files were not properly created -- and I thought to try an experiment where isntead of trying to throw my whole corpus as one chunk at a blank css file, I would feed the corpus in chunks. I cam up with an initialization script to feed my corpus to a blank css file in 200 mail chunks; and, while it was at it, renumber the corpus mails (over the years, as I cleaned the corpus, gaps had appeared in the numbering). I have also updated the retraining script

Voila. I am getting a new set of css files which do not appear to show any errors for mails crm114 has already learned about -- in other words, for mails it has seen, the accuracy is now 100%, not 99.5% as it was a couple of days ago.

While it is good news in that my classification accuracy is better than it was last week; the bad news is that I no longer have concrete number on accuracy for crm114 anymore -- the mechanism used now gives 100% accuracy all the time. The funny thing is, I recall going through this analysis a couple of years ago, where I predicted that one could only check for accuracy with a test corpus that had the same characteristics as real mail inflow, and which had not been used for learning. That wold mean I would have classified testing corpus that could improve the efficiency of my filter, but was not being used to provide accuracy numbers -- I have gone for improving the filter, at the cost of knowing how accurate they actually are.


Monday 05 November
Link: Filtering accuracy: Hard numbers

Posted early Monday morning, November 5th, 2007

License: GPL

Filtering accuracy: Hard numbers

UPDATE: This posting has severe flaws, which were discovered subsequently. Please ignore.

I have often posted on the accuracy of my mail filtering mechanisms on the mailing lists (I have not had a false positive in years, and I stash all discards/rejects locally, and do spot checks frequently; and I went through 6 months of exhaustive checks when I put this system in place). False negatives are down to about 3-4 a month (0.019%). Yes, that is right: I am claiming that my classification correctness record is 99.92 (99.98% accuracy for messages my classifiers are sure about). Incorrectly classified unsure ham is about 3-4(0.019%) a month; incorrectly classified unsure Spam is roughly the same, perhaps a little higher. Adding these to the incorrect classification, my best estimate of not confidently classified mail is 0.076%, based on the last 60 days of data (which is what gets you the 99.92%).

I get unsure/retrain messages at the rate of about 20 a day (about 3.2% of non-spam email) -- about 2/3'rds of which are classified correctly; but either SA and crm114 disagree, or crm114 is unsure. So I have to look at about 20 messages a day to see if a ham message slipped in there; and train my filters based on these; and the process is highly automated (just uses my brain as a classifier). The mail statistics can be seen on my mail server.

Oh, my filtering front end also switches between reject/discard and turns grey listing on and off based on whether or not the mail is coming from mailing lists/newsletters I have authorized; mimedefang-filter

However, all these numbers are manually gathered, and I still have not gotten around to automating my setup's overall accuracy, but now I have some figures on one of the two classifies in my system. Here is the data from CRM114. I'll update the numbers below via cron.

UPDATE: The css files used below were malformed, and the process of creating them detailed below is flawed. Please see newer postings in this category.

First, some context: when training CRM114 using the mailtrainer command, one can specify to leave out a certain percentage of the training set in the learn phase, and run a second pass over the mails so skipped to test the accuracy of the training. The way you do this is by specifying a regular expression to match the file names. Since my training set has message numbers, it was simple to use the least significant two digits as a regexp; but I did not like the idea of always leaving out the same messages. So I now generate two sets of numbers for every training run, and leave out messages with those two trailing digits, in effect reserving 2% of all mails for the accuracy run.

An interesting thing to note is the assymetry in the accuracy: CRM114 has never identified a Spam message incorrectly. This is because the training mechanism is skewed towards letting a few spam messages slip through, rather than let a good message slip into the spam folder. I like that. So, here are the accuracy numbers for CRM114; adding in Spamassassin into the mix only improves the numbers. Also, I have always felt that a freshly learned css file is somewhat brittle -- in the sense that if one trains an unsure message, and then tried to TUNE (Train Until No Errors) the css file, a large number of runs through the training set are needed until the thing stabilizes. So it is as if the learning done initially was minimalistic, and adding the information for the new unsure message required all kinds of tweaking. After a while TOEing (Training on Errors) and TUNEing, this brittleness seems to get hammered out of the CSS files. I also expect to see accuracy rise as the css files get less brittle -- The table below starts with data from a newly minted .css file.

Accuracy number and validation regexp
Date Corpus Ham Spam Overall Validation
  Size Count Correct Accuracy Count Correct Accuracy Count Correct Accuracy Regexp
Wed Oct 31 10:22:23 UTC 2007 43319 492 482 97.967480 374 374 100.000000 866 856 98.845270 [1][6][_][_]|[0][3][_][_]
Wed Oct 31 17:32:44 UTC 2007 43330 490 482 98.367350 378 378 100.000000 868 860 99.078340 [3][7][_][_]|[2][3][_][_]
Thu Nov 1 03:01:35 UTC 2007 43334 491 483 98.370670 375 375 100.000000 866 858 99.076210 [2][0][_][_]|[7][9][_][_]
Thu Nov 1 13:47:55 UTC 2007 43345 492 482 97.967480 376 376 100.000000 868 858 98.847930 [1][2][_][_]|[0][2][_][_]
Sat Nov 3 18:27:00 UTC 2007 43390 490 480 97.959180 379 379 100.000000 869 859 98.849250 [4][1][_][_]|[6][4][_][_]
Sat Nov 3 22:38:12 UTC 2007 43394 491 482 98.167010 375 375 100.000000 866 857 98.960740 [3][1][_][_]|[7][8][_][_]
Sun Nov 4 05:49:45 UTC 2007 43400 490 483 98.571430 377 377 100.000000 867 860 99.192620 [4][6][_][_]|[6][8][_][_]
Sun Nov 4 13:35:15 UTC 2007 43409 490 485 98.979590 377 377 100.000000 867 862 99.423300 [3][7][_][_]|[7][9][_][_]
Sun Nov 4 19:22:02 UTC 2007 43421 490 486 99.183670 379 379 100.000000 869 865 99.539700 [7][2][_][_]|[9][4][_][_]
Mon Nov 5 05:47:45 UTC 2007 43423 490 489 99.795920 378 378 100.000000 868 867 99.884790 [4][0][_][_]|[8][3][_][_]

As you can see, the accuracy numbers are trending up, and already are nearly up to the values observed on my production system.


Monday 20 August
Link: Mail Filtering with CRM114: Part 4

Posted early Monday morning, August 20th, 2007

License: GPL

Mail Filtering with CRM114: Part 4

Training the Discriminators

It has been a while since I posted on this category -- actually, it has been a long while since my last blog. When I last left you, I had mail (mbox format) folders called ham and/or junk, which were ready to be used for training either CRM114 or Spamassassin or both.

Setting up Spamassassin

This post lays the groundwork for the training, and details how things are set up. The first part is setting up Spamassassin. One of the things that bothered me about the default settings for Spamassassin was how swiftly Bayes information was expired; indeed, it seems really eager to dumb the Bayes information (don't they trust their engine?). I have spent some effort building a large corpus, and keeping ti clean, but Spamassassin would discard most of the information from the DB after training over my corpus, and the decrease in accuracy was palpable. To prevent this information from leeching away, I firstly increased the size of the database, and turned off automatic expiration, by putting the following lines into ~/.spamassassin/user_prefs:

bayes_expiry_max_db_size  4000000
bayes_auto_expire         0

I also have regularly updated spam rules from the spamassassin rules emporium to improve the efficiency of the rules; my current user_prefs is available as an example.

Initial training

I keep my Spam/Ham corpus under the directory /backup/classify/Done, in the subdirectories Ham and Spam. At the time of writing, I have approximately 20,000 mails in each of these subdirectories, for a total of 41,000+ emails.

I have created a couple of scripts to train the discriminators from scratch using the extant Spam corpus; and these scripts are also used for re-learning, for instance, when I moved from a 32-bit machine to a 64-bit one, or when I change CRM114 discrimators. I generally run them from ~/.spamassassin/ and ~/var/lib/crm114 (which contains my CRM114 setup) directories.

I have found that training Spamassassin works best if you alternate Spam and Ham message chunks; and this Spamassassin learning script delivers chunks of 50 messages for learning.

With CRM114, I have discovered that it is not a good idea to stop learning based on the number of times the corpus has been gone over; since stopping before all messages i the Corpus are correctly handled is also disastrous. So I set the repeat count to a ridiculously high number, and tell mailtrainer to continue training until a streak larger than the sum of Spam and Ham messages has occurred. This CRM114 trainer script does the hob nicely; running it under screen is highly recommend.

Routine updates

Coming back to where we left off, we had mail (mbox format) folders called ham and/or junk sitting in the local mail delivery directory, which were ready to be used for training either CRM114 or Spamassassin or both.

There are two scripts that help me automate the training. The first script, called mail-process, does most of the heavy listing. This processes a bunch of mail folders, which are supposed to contain mail which is either all ham or all spam, indicated by the command line arguments. We go looking though every mail, and any mail where either the CRM114 or the Spamassassin judgement was not what we expected, we strip out mail gathering headers, and then we save the mail, one to a file, and we train the approprite filter. This ensures that we only train on error, and it does not matter if we accidentally try to train on correctly classified mail, since that would be a no-op (apart from increasing the size of the corpus).

The second script, called mproc is a convenience front-end; it just calls mail-process with the proper command line arguments, and feeds them the ham and junk in sequence; and takes no arguments. So, after human classification, just calling mproc does the classification.

This pretty much finishes the series of posts I had in mind about spam filtering, I hope it has been useful.


Sunday 14 January
Link: Mail Filtering with CRM114: Part 3

Posted early Sunday morning, January 14th, 2007

License: GPL

Mail Filtering with CRM114: Part 3

Uphold, maintain, sustain: life in the trenches

Now that I have a baseline filter, how do I continue to train it, without putting too much of an effort? There are two separate activities here, firstly selecting the mails to be used in training, and secondly, automating the training and saving to the mail corpus. On going training is essential; Spam mutates, and even ham changes over time, and well trained filters drift. However, if training disrupts normal work-flow, it won't happen; so a minimally intrusive set of tools is critical

Selecting mail to train filters

There are three broad categories of mails that fit the criteria:

1. Misclassified mail

This is where human judgement comes in, to separate the wheat from the chaff.

  • misclassified Spam. I do nothing special for this category -- I assume that I would notice these mails, and when I do, I just save them in a junk folder, for latter processing. The volume of such messages has fallen to about one or two a month, and having them slip though is not a major problem in the first place
  • misclassified ham is far more critical, and, unfortunately, somewhat harder to get right, since you do want to reject the worst of the Spam at the SMTP level. A mistake here is worse than a false negative: all that happens with a false negative is that you curse, save to the junk folder for later retraining, and mode on. With missed Ham, you never know what you might have missed -- and hope it is nothing important.

    • If one of the two filters did the right thing, then the mlg script can catch it -- more on it below. The only thing to remember to do is to look carefully at the grey.mbox folder that is produced. While this is not good to have misclassified ham, at least this sub-category is easy to detect.
    • If both filters misclassified it, this would then mean a human would have to catch the error. The good news is that I haven't had very many mails fall into this category (the last one I know about was in the late summer of 2005). How can one be sure that there are not ham messages falling through the cracks all the time? The idea is to accept mail with scores one would normally discard, and treat this as a canary in a mine: no ham should ever show up in these realspam messages. My schema for handling Spam is shown in the figure below.

    I try and keep all the mail that I have rejected in quarantine for about a month or so, and so can retrieve a mail if informed about a mistaken rejection. I also do spot checks once in a while, though as time has gone on with no known false positives, the frequency of my checks has dropped.

2. Partially misclassified mail

This is mail correctly classified overall, but misclassified by either crm114, or spamassassin, but not both. This is an early warning sign, and is more common than mail that is misclassified, since usually the filter that is wrong is wrong weakly. But this is the point where training should occur, so that the filter does not drift to the point that mails are misclassified. Again, the mlg script catches this.

3. Mail that crm114 is unsure about

This is Mail correctly classified, but something that mailreaver is unsure about -- and this category is why mailreaver learns faster than mailtrainer.

Spam handling and disposition

Spam handling schema.

At this point I should say something about how I generally handle mails scored as Spam by the filters. As you can see, the mail handling is simple; depending on the combined score given to the mail by the filters. The handling rules are:

  • score <= 5.0: Accept unconditionally
  • 5.0 < score <= 15: Grey list
  • 15 < score: reject

So, any mail with score less than 15 is accepted, potentially after grey-listing. The disposition is done according to the following set of rules:

  • score <= 0: Classify into folder based on origin
  • 0 < score <= 10: file into Spam (some of this survived grey-listing)
  • 10 < score: file into realspam (Must have survived grey-listing)

In the last 18+months, I have not seen a Ham mail in my realspam folder; chances of Ham being rejected are pretty low. My Spam folder gets a ham message every few months, but these are usually spamassassin misclassifying things; and mlg detects those. I have not seen one of these in the last 6 months. So my realspam canary has done wonders for my peace of mind. With ideally trained filters, spam and realspam folders would be empty.

mail list grey

I have created a script called mlg ("Mail List Grey") that I run periodically over my mail folder, that picks out mails that either (a) are classified differently by spamassassin and crm114, or (b) are marked as unsure by mailreaver. The script takes these mails and saves them into a grey.mbox folder. I tend to run them over Spam and non-Spam folders in different runs, so that the grey.mbox folder can be renamed to either ham or junk, in the vast majority of the cases. Only for misclassified mails do I have to individually pick the misplaced email and classify it separately from the rest of the emails in that batch.

At this point, I should have mail mbox folders called ham and/or junk, which are now ready to be used for training either crm114 or spamassassin or both. Processing these folders is the subject of the next article in this series.


Thursday 11 January
Link: Mail Filtering with CRM114: Part 2

Posted early Thursday morning, January 11th, 2007

License: GPL

Mail Filtering with CRM114: Part 2

Or, Cleanliness is next to godliness

The last time when I blogged about Spam fighting Mail Filtering With CRM114 Part 1, I left y'all with visions of non-converging learning, various ingenious ways of working around a unclean corpus, and perhaps a sinking feeling that this whole thing was more fragile than it ought to be.

During this eriod of flailing around, trying to get mailtrainer to learn the full corpus correctly, I upgraded to an unpackaged version of crm114. Thanks to the excellent packaging effort by the maintianers, this was dead easy: get the debian sources using apt-get source crm114, download tghe new tarball from crm114 upstream, cp the debian dir over, and just edit the changelog file to reflect the new version. I am currently running my own, statically linked 20061103-Blame Dalkey.src-1.

Cleaning the corpus made a major difference to the quality of discrimination. As mentioned earlier, I examined every mail that was initially incorrectly classified during learning. Now, there are two ways this can happen: That the piece of mail was correctly placed in the corpus, but had a feature that was different from those learned before; or that it was wrongly classified by me. When I started the chances were almost equally likely; I have now hopefully eliminated most of the misclassifications. When mailtrainer goes into cycles, retraining on a couple of emails round after round, you almost certainly are trying to train in conflicting ways. Cyclic retraining is almost always a human's error in classification.

Some of the errors discovered were not just misclassifications: some where things that were inappropriate mail, but not Spam; for instance there was the whole conversation where someone one subscribed debian-devel to another mailing list, there was the challenge, the subscription notice, the un-subscription, challenge, and notice -- all of which were inappropriate, and interrupted the flow, and contributed to the noise -- but were not really Spam. I had, in a fit of pique, labelled them as Spam; but they were really like any other mailing list subscription conversations, which I certainly want to see for my subscriptions. crm114 did register the conversations as Spam and non-Spam, as requested, but that increased the similarity between Spam and non-Spam features -- and probably decreased the accuracy. I've since decided to train only on Spam, not on inappropriate mails; and let Gnus keep inappropriate mails from my eyes.

I've also waffled over time about whether or not to treat newsletters from Dr. Dobbs Journal or USENIX as Spam or not -- now my rule of thumb is that since I signed up for them at some point, they are not Spam -- though I don't feel guilty about letting mailagent squirrel them away mostly out of sight.

A few tips about using mailtrainer:

  • Make sure you have a clean corpus
  • Try and make it so that your have roughly equal numbers of Ham and Spam
  • Don't let mailtrainer quit after a set number of rounds. I use a ridiculous repeat count of about 100 -- never expecting to reach anywhere close to that. Instead, I set the streak to a number = number of Spam + number of Ham + 10. This means that mailtrainer does not quit until it has processed every email in the corpus correctly without needing to retrain. Letting mailtrainer quit after it repeated the corpus twice, but before it got a streak of correct classifications left me with a filter with horrible accuracy.
  • Make sure you have a clean corpus (yes, really)
  • While the ratio of similarities between .css files to the differences between .css files is a fun metric, and I use it as an informal, rough benchmark while training, it is not really correlated to accuracy, so don't get hung up on it, like I was. Usually, the value I get when training from scratch on my corpus is somewhere around 7.5; but over time the ratio degrades (falling to about 5), while the filter accuracy keeps increasing.
  • Train only on errors. The good news is that both mailtrainer and mailreaver already do so, so this is easy rule to follow. The reason is that you should only be doing minimal training, in case you have to modify/change the underlying rule in the future. So your filter should be trained to correctly classify the mails you get, but don't shoot for overwhelming scores.
  • Use mailreaver. It has the nice behaviour of asking for retraining when it is unsure of some mail, and it caches mail for some time which really helps in training and rebuilding .css files. The reason mailreaver learns faster than mailtrainer is just this feature, I think.
  • Stash mails you train on in your corpus, and don't hesitate to re-run mailtrainer over your corpus again after you have been training the unsure emails. When you train single emails, the changed filter may no longer correctly classify some email in the corpus you train with. Running mailtrainer over the corpus adjusts the filter to correctly classify every mail again. I generally retrain about once a week, though I no longer retrain from scratch, unless I change the classification algorithm.


Saturday 23 December
Link: Mail Filtering With CRM114: Part 1

Posted early Saturday morning, December 23rd, 2006

License: GPL

Mail Filtering With CRM114: Part 1

The first step is to configure the CRM114 files, and these are now in pretty good shape as shipped with Debian. All that I needed to do was set a password, say that I'd use openssl base64 -d, and stick with the less verbose defaults (so no munging subjects, no saving all mail, no saving rejects, etc, since I have other mechanisms that do all that). The comments in the configuration files are plenty good enough. This part went off like a breeze; the results can be found here (most of the values are still the upstream default).

The next step was to create new, empty .css files. I have noticed that creating more buckets makes crm114 perform better, so I go with larger than norm crm114 .css files. I have no idea if this is the right thing to do, but I make like Nike and just do it. At some point I'll ask on the crm114-general mailing list.

    % cssutil -b -r -S 4194000 spam.css
    % cssutil -b -r -S 4194000 nonspam.css

Now we have a blank slate; at this time the filter knows nothing, and is equally likely to call something Spam or non-Spam. We are now ready to learn. So, I girded my loins, and set about feeding my whole mail corpus to the filter:

 /usr/share/crm114/mailtrainer.crm               \
   --spam=/backup/classify/Done/Spam/           \
   --good=/backup/classify/Done/Ham/            \
   --repeat=100 --streak=35000 |                \
         egrep -i '^ +|train|Excell|Running'

And this failed spectacularly (see Debian bug #399306). Faced with unexpected segment violations, and not being proficient in crm114's rather arcane syntax, I was forced to speculation: I assumed (as it turns out, incorrectly) that if if you throw too many changes at the crm114 database, things rapidly escalate out of control. I went o to postulate that as my mail corpus was gathered over a period of errors, the characteristic of Spam drifted over time, and what I consider Spam has also evolved. So, some parts of the early corpus are at variance with the more recent bits.

Based on this assumption, I created a wrapper script which did what Clint has called training to exhaustion -- it iterated over the corpus several times, starting with a small and geometrically increasing chunk size. Given the premise I was working under, it does a good job of training crm114 on a localized window of Spam: it feeds chunks of the corpus to the trainer, with each successive chunk overlapping the previous and succeeding chunks, and ensuring that crm114 is happy at any given chunk of the corpus. Then it doubles the chunk size, and tries goes at it again. All very clever, and all pretty wrong.

I also created another script to retrain crm114, which was less exhaustive than the previous one, but did incorporate any further drift in the nature of Spam. I no longer use these scripts; but I would like to record them for posterity as an indication of how far one can take an hypothesis.

What it did to was have crm114 learn without segfaulting -- and show me that there was a problem in the corpus. I noticed that in some cases the trainer would find a pair of mail messages, classify them wrongly, and retrain and refute -- iteration after iteration, back and forth. I noticed this when I added the egrep filter above, and was not drowning in the needless chatter from the trainer. It turns out, I had very similar emails (sometimes, even the same email) in the Ham and the Spam corpus, and no wonder crm114 was throwing hissy fits. Having small chunks ensured that I had not too many such errors in any chunk;and crm114 did try to forget a mail differently classified in an older chunk and learn whatever this chunk was trying to teach it. The downside was that the count of the differences between ham nd Spam went down, and the similarities increased -- which meant that the filter was not as good at separating ham and Spam as it could have been.

So my much vaunted mail corpus was far from clean -- over the years, I had misclassified mails, been wishy-washy and changed my mind about what was and was not Spam. I have a script that uses md5sums to find duplicate files in a directory, and found, to my horror, that there were scores of duplicates in the corpus. After eliminating outright duplicates, I started examining anything that showed up with an ER (error, refute) tag in the trainer output; on the second iteration of the training script these were likely to be misclassification. I spent days examining my corpus and cleaning it out; and was gratified to see the ratio of differences to similarities between ham and Spam css files climb from a shade under 3 to around 7.35.

Next post we'll talk about lessons learned about training, and how a nominal work flow of training on errors and training when classifiers disagree can be set up.


Tuesday 19 December
Link: Mail Filtering With CRM114: Introduction

Posted early Tuesday morning, December 19th, 2006

License: GPL

Mail Filtering With CRM114: Introduction

I have a fairly sophisticated Spam filtering mechanism setup, using MIMeDefang for SMTP level rejection of Spam, grey-listing for mails not obviously ham or Spam, and using both crm114 and spamassassin for discrimination, since they compensate for each other when one of them can't tell what it is looking at. People have often asked me to write up the details of my setup, and the support infrastructure, and I have decided to oblige. What follows is a series of blog posts detailing how I set about redoing my crm114 setup, with enough detail that interested parties can tag along.

I noticed that the new crm114 packages have a new facility called mailtrainer, which can be used to setup initial css database files. I liked the fact that it can run over the training data several times, it can keep back portions of the training data as a control group, and you can tell it to keep on going until it gets a certain number of mails discriminated correctly. This is cool, since I have a corpus of about 15K Spam and 16K ham messages, mostly stored on my previous train-on-error practice (whenever crm114 classified a message incorrectly, I train it, and store all training emails). I train whenever crm114 and spamassassin disagree with each other.

This would also be a good time to switch from crm114's mailfilter to the new mailreaver, which is the new, 3rd Generation mail filter script for crm114. It is easier to maintain, and since it flags mails for which it is unsure, and you ought to train all such mails anyway, it learns faster.

Also, since the new packages come with a brand new discrimination algorithms which are supposed to be as accurate but also faster, but may store data in incompatible ways, I figured that it might be time to redo my CRM mail filter from the ground up. The default now is "osb unique microgroom". This change requires me to empty out the css files.

I also decided to change my learning scripts to not send command via email to a 'testcrm' user, instead, now I train the filter using a command line mode. Apart from saving mails incorrectly classified into folders ham and junk (Spam was already taken), I have a script that grabs and saves mails which are classified differently by crm114 and spamassassin from my incoming mail spool and saves it into a grey.mbox file, which I can manually separate out into ham and junk. Then I have a processing script, that takes the ham and junk folders and trains spamassassin, or crm114, or both; and stashes the training files away into my cached corpus for the future.

In subsequent blog postings, I'll walk people through how I setup and initialized my filter, and provide examples of the scripts I used, along with various and sundry missteps and lessons learned and all.


Wednesday 22 November
Link: Burninating spam

Posted at midnight, November 22nd, 2006

License: GPL

Burninating spam

It is funny that people say that no one has written software that does Beysian filtering in order to reject Spam at the SMTP level, just when I felt like writing up my mimedefang setup. I am of the school of thought that the only acceptable method of dealing with spam is to reject it at the SMTP protocol level. To accomplish this, I run crm114 and spamassassin from my mimedefang-filter, which runs as a milter while my sendmail is processing mail. I also do conditional grey listing, more on that below.

The reason for running two different Spam detection mechanisms is that while one or the other is often fooled and mis-categorizes mail, the score or confidence level of such mis-categorization is low, and the mechanism that is not fooled enables the combined system to do the right thing in most cases. I tend to quarantine and keep even the messages I reject for about a month, and do spot checks, so out of roughly a 800 mails a day volume, I get about one or two spam messages a day, and did not ever reject a mail I manually would classify as ham in the 6 months n which I double checked every decision taken by my system, and in the spot checks I continue to perform. While this is not as good a record as the author of crm114 reports, it is good enough for me.

The grey listing: since I try to be conservative, and opt to lean towards not discarding any email that is legitimate at the expense of not trapping all spam, and since I do not discard any mail I accept unread based on an automated decision, I was getting more Spam than I cared for still. So, I added the grey listing layer -- but only for mail that were not ambiguously spam or viruses (which are rejected), or ham, which are accepted for delivery. Most of the Spam that was slipping through came from the grey area. So, I implemented grey listing only for these mails.

Grey listing happens on a triplet of the class C domain of the senders IP, the sender, and recipient. A grey listed email encounters processing problems, and is temporarily rejected with a 503 for 30 minutes, and then retries are allowed through for 4 hours, in which case the triplet is white-listed for 35 days (catches monthly announcement messages). Known Spam resets the white-listing, and retrying past the window does too.

Doing it this way, known good correspondents suffer no delay, unambiguous mail from strangers flows right through, and I improve my filters based on the results all the time, so my discriminator improves.

My filter is available for examination. See also the older posting, like the time honoured tradition of /., this posting is a duplicate.


Monday 02 January
Link: CRM114, Spamassassin, MIMEDefang, Sendmail, MailAgent -- and GreyListing

Posted early Monday morning, January 2nd, 2006

License: GPL

CRM114, Spamassassin, MIMEDefang, Sendmail, MailAgent -- and GreyListing

I have been fairly comfortable with my current mail filtering solution. Unlike some other blogs on Planet, when last I was doing exhaustive checks, I had not had a false positive in six months (now I just do random spot checks). The false negatives creep up from a few a week to a few a day (with a mail feed of about 1k emails a day). And the stuff that is classified as a definite spam is REJECTed (I do keep a copy), which means that a legitimate correspondent would have an idea something went wrong (unless they are ignoring such bounces form their own MAILER-DAEMON).

Part of the reason for this satisfactory performance is that I use a layered approach, with several tools playing off each other, ameliorating each others mistakes. Admittedly, this took painful training -- I created a testcrm user, and bounced a copy of every mail I got to it using a .forward. Then, over a course of two months, I would painstakingly go over the classification, training on error, until the failure rates dropped to levels I felt comfortable with, and moved the CRM114 and Spamassassin configuration over for my own use.

I recently added a hand crafted Greylisting implementation to MIMEDefang -- in good MIMEDefang tradition, this is a heavily tweaked version of an implementation on the mailing list, using PostgreSQL. I have modified it to not greylist every single new email that comes my way, but to only greylist stuff that CRM114 and Spamassassin have been unable to classify strongly as Spam or ham. mail that has been strongly classified already, in turn, affects greylisting.

Here is the SQL code for the implementation, and the mimedefang-filter itself, showing how I integrate CRM114 and Spamassassin along with greylisting in a Sendmail Milter.

Have fun.


Webmaster <webmaster@golden-gryphon.com>
Last commit: in the wee hours of Friday night, May 31st, 2014
Last edited in the wee hours of Friday night, May 31st, 2014