Posted in the wee hours of Monday night, May 5th, 2009
License: GPL
So, recently our email overlords graciously provided means for us minions to help them in their toils and clean up the spammish clutter in the mailing lists by reporting the spam. And they provided us with a dead simple means of reporting such spam to them. Now, us folks who knoweth that there is but one editor, the true editor, and its, err, proponent is RMS, use Gnus to follow the emacs mailing lists, either directly, or through gmane. There are plenty of examples out there showing how to automate reporting spam to gmane, so I won’t bore y’all with the details. Here I only show how one serves our list overlords, and smites the spam at the same time.
Some background, from the Gnus info page. I’ll try to keep it brief. There is far more functionality present if you read the documentation, but you can see that for yourself.
The Spam package provides Gnus with a centralized mechanism for detecting and filtering spam. It filters new mail, and processes messages according to whether they are spam or ham. There are two “contact points” between the Spam package and the rest of Gnus: checking new mail for spam, and leaving a group.
Checking new mail for spam is done in one of two ways: while splitting incoming mail, or when you enter a group. Identifying spam messages is only half of the Spam package’s job. The second half comes into play whenever you exit a group buffer. At this point, the Spam package does several things: it can add the contents of the ham or spam message to the dictionary of the filtering software, and it can report mail to various places using different protocols.
All this is very pluggable and modular. The advantage is that you can use various plugin front ends to identify spam and ham, or mark messages as you go through a group, and when you exit the group, spam is reported, and ham and spam messages are copied to special destinations for future training of your filter. Since you inspect the marks put into the group buffer as you read the messages, there is a human involved in the processing, but as much as possible can be automated away. Do read the info page on the Spam package in Gnus; it is edifying.
Anyway, here is a snippet from my
etc/emacs/news/gnusrc.el file, which can help automate
the tedium of reporting spam. This is perhaps closer to how Gnus
does things than pressing a special key for every spam, which
does nothing to help train your filter.
(add-to-list
 'gnus-parameters
 '("^nnml:\\(debian-.*\\)$"
   (to-address . "\\1@lists.debian.org")
   (to-list . "\\1@lists.debian.org")
   (admin-address . "\\1-request@lists.debian.org")
   (spam-autodetect . t)
   (spam-autodetect-methods spam-use-gmane-xref spam-use-hashcash spam-use-BBDB)
   (spam-process '(spam spam-use-resend))
   (spam-report-resend-to . "report-listspam@lists.debian.org")
   (subscribed . t)
   (total-expire . t)
   ))
After posting about filtering accuracy I got to thinking about the test I was using. It appeared to me that there should be no errors in the mails that crm114 had already been trained upon, but here I was, coming up with errors when I trained the css files until there were no errors, and then used a regexp that tried to find the accuracy of classification for all the files, not just new ones. This did not make sense.
The only explanation was that my css files were not properly created, and I thought to try an experiment where, instead of trying to throw my whole corpus as one chunk at a blank css file, I would feed the corpus in chunks. I came up with an initialization script to feed my corpus to a blank css file in 200 mail chunks; and, while it was at it, renumber the corpus mails (over the years, as I cleaned the corpus, gaps had appeared in the numbering). I have also updated the retraining script.
Voila. I am getting a new set of css files which do not appear to show any errors for mails crm114 has already learned about — in other words, for mails it has seen, the accuracy is now 100%, not 99.5% as it was a couple of days ago.
While it is good news in that my classification accuracy is better than it was last week, the bad news is that I no longer have concrete numbers on accuracy for crm114: the mechanism used now gives 100% accuracy all the time. The funny thing is, I recall going through this analysis a couple of years ago, where I predicted that one could only check for accuracy with a test corpus that had the same characteristics as real mail inflow, and which had not been used for learning. That would mean I would have a classified testing corpus that could improve the efficiency of my filter, but was held back to provide accuracy numbers instead. I have gone for improving the filter, at the cost of knowing how accurate it actually is.
UPDATE: This posting has severe flaws, which were discovered subsequently. Please ignore.
I have often posted on the accuracy of my mail filtering mechanisms on the mailing lists (I have not had a false positive in years, and I stash all discards/rejects locally, and do spot checks frequently; and I went through 6 months of exhaustive checks when I put this system in place). False negatives are down to about 3-4 a month (0.019%). Yes, that is right: I am claiming that my classification correctness record is 99.92% (99.98% accuracy for messages my classifiers are sure about). Incorrectly classified unsure ham is about 3-4 (0.019%) a month; incorrectly classified unsure Spam is roughly the same, perhaps a little higher. Adding these to the incorrect classification, my best estimate of not confidently classified mail is 0.076%, based on the last 60 days of data (which is what gets you the 99.92%).
I get unsure/retrain messages at the rate of about 20 a day (about 3.2% of non-spam email), about two-thirds of which are classified correctly; but either SA and crm114 disagree, or crm114 is unsure. So I have to look at about 20 messages a day to see if a ham message slipped in there, and train my filters based on these; the process is highly automated (it just uses my brain as a classifier). The mail statistics can be seen on my mail server.
Oh, my filtering front end also switches between reject/discard and turns grey listing on and off based on whether or not the mail is coming from mailing lists/newsletters I have authorized; see my mimedefang-filter.
However, all these numbers are manually gathered, and I still have not gotten around to automating the measurement of my setup’s overall accuracy, but now I have some figures on one of the two classifiers in my system. Here is the data from CRM114. I’ll update the numbers below via cron.
UPDATE: The css files used below were malformed, and the process of creating them detailed below is flawed. Please see newer postings in this category.
First, some context: when training CRM114 using the
mailtrainer command, one can specify to leave out a
certain percentage of the training set in the learn phase, and run
a second pass over the mails so skipped to test the accuracy of the
training. The way you do this is by specifying a regular expression
to match the file names. Since my training set has message numbers,
it was simple to use the least significant two digits as a regexp;
but I did not like the idea of always leaving out the same
messages. So I now generate two sets of numbers for every training
run, and leave out messages with those two trailing digits, in
effect reserving 2% of all mails for the accuracy run.
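To make the hold-out selection concrete, here is a minimal sketch of picking two trailing-digit pairs and turning them into a validation regexp. It is not the script used for the runs below. The bracketed output form copies the Regexp column of the table further down, while the file naming (a message number followed by two underscores), the function name `validation_regexp` and the fixed seed are assumptions made for illustration.

```python
import random
import re


def validation_regexp(rng: random.Random) -> str:
    """Pick two distinct two-digit suffixes and build a hold-out pattern."""
    first, second = rng.sample(range(100), 2)
    parts = []
    for number in (first, second):
        tens, units = divmod(number, 10)
        # Assumed naming: the digits are followed by two underscores.
        parts.append(f"[{tens}][{units}][_][_]")
    return "|".join(parts)


pattern = validation_regexp(random.Random(2007))
held_out = re.compile(pattern)

# Hypothetical corpus file names: message number plus "__".
names = [f"{n:05d}__" for n in range(1000)]
reserved = [name for name in names if held_out.search(name)]

# Two suffixes out of a hundred reserve 2% of the files.
print(pattern, len(reserved), len(names))
```

Drawing the two suffixes afresh for every run is what keeps the same messages from always being left out.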
An interesting thing to note is the asymmetry in the accuracy:
CRM114 has never identified a Spam message incorrectly. This is
because the training mechanism is skewed towards letting a few spam
messages slip through, rather than let a good message slip into the
spam folder. I like that. So, here are the accuracy numbers for
CRM114; adding in Spamassassin into the mix only improves the
numbers. Also, I have always felt that a freshly learned css file
is somewhat brittle, in the sense that if one trains an unsure
message, and then tries to TUNE (Train Until No
Errors) the css file, a large number of runs through the training
set are needed until the thing stabilizes. So it is as if the
learning done initially was minimalistic, and adding the
information for the new unsure message required all kinds of
tweaking. After a while TOEing (Training on Errors) and TUNEing,
this brittleness seems to get hammered out of the CSS files. I also
expect to see accuracy rise as the css files get less brittle; the
table below starts with data from a newly minted .css file.
| Date | Corpus Size | Ham Count | Ham Correct | Ham Accuracy (%) | Spam Count | Spam Correct | Spam Accuracy (%) | Overall Count | Overall Correct | Overall Accuracy (%) | Validation Regexp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Wed Oct 31 10:22:23 UTC 2007 | 43319 | 492 | 482 | 97.967480 | 374 | 374 | 100.000000 | 866 | 856 | 98.845270 | [1][6][_][_]\|[0][3][_][_] |
| Wed Oct 31 17:32:44 UTC 2007 | 43330 | 490 | 482 | 98.367350 | 378 | 378 | 100.000000 | 868 | 860 | 99.078340 | [3][7][_][_]\|[2][3][_][_] |
| Thu Nov 1 03:01:35 UTC 2007 | 43334 | 491 | 483 | 98.370670 | 375 | 375 | 100.000000 | 866 | 858 | 99.076210 | [2][0][_][_]\|[7][9][_][_] |
| Thu Nov 1 13:47:55 UTC 2007 | 43345 | 492 | 482 | 97.967480 | 376 | 376 | 100.000000 | 868 | 858 | 98.847930 | [1][2][_][_]\|[0][2][_][_] |
| Sat Nov 3 18:27:00 UTC 2007 | 43390 | 490 | 480 | 97.959180 | 379 | 379 | 100.000000 | 869 | 859 | 98.849250 | [4][1][_][_]\|[6][4][_][_] |
| Sat Nov 3 22:38:12 UTC 2007 | 43394 | 491 | 482 | 98.167010 | 375 | 375 | 100.000000 | 866 | 857 | 98.960740 | [3][1][_][_]\|[7][8][_][_] |
| Sun Nov 4 05:49:45 UTC 2007 | 43400 | 490 | 483 | 98.571430 | 377 | 377 | 100.000000 | 867 | 860 | 99.192620 | [4][6][_][_]\|[6][8][_][_] |
| Sun Nov 4 13:35:15 UTC 2007 | 43409 | 490 | 485 | 98.979590 | 377 | 377 | 100.000000 | 867 | 862 | 99.423300 | [3][7][_][_]\|[7][9][_][_] |
| Sun Nov 4 19:22:02 UTC 2007 | 43421 | 490 | 486 | 99.183670 | 379 | 379 | 100.000000 | 869 | 865 | 99.539700 | [7][2][_][_]\|[9][4][_][_] |
| Mon Nov 5 05:47:45 UTC 2007 | 43423 | 490 | 489 | 99.795920 | 378 | 378 | 100.000000 | 868 | 867 | 99.884790 | [4][0][_][_]\|[8][3][_][_] |
As you can see, the accuracy numbers are trending up, and already are nearly up to the values observed on my production system.
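For readers who want to check the arithmetic, the accuracy columns are simply correct divided by count, expressed as a percentage. The short sketch below recomputes the first row of the table from its counts; the function name `accuracy` is made up for this illustration.

```python
def accuracy(correct: int, count: int) -> float:
    """Percentage of held-out messages that were classified correctly."""
    return 100.0 * correct / count


# Counts copied from the first table row (Wed Oct 31 10:22:23 UTC 2007).
ham_count, ham_correct = 492, 482
spam_count, spam_correct = 374, 374

ham_accuracy = accuracy(ham_correct, ham_count)
spam_accuracy = accuracy(spam_correct, spam_count)
# The overall figure pools both classes; it is not the mean of the two.
overall_accuracy = accuracy(ham_correct + spam_correct, ham_count + spam_count)

print(f"ham {ham_accuracy:.5f}  spam {spam_accuracy:.5f}  overall {overall_accuracy:.5f}")
```

Because the overall number pools ham and spam, it sits between the two per-class figures, weighted by how many of each were held out.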
It has been a while since I posted in this category; actually, it has been a long while since my last blog post. When I last left you, I had mail (mbox format) folders called ham and/or junk, which were ready to be used for training either CRM114 or Spamassassin or both.
This post lays the groundwork for the training, and details how things are set up. The first part is setting up Spamassassin. One of the things that bothered me about the default settings for Spamassassin was how swiftly Bayes information was expired; indeed, it seems really eager to dump the Bayes information (don’t they trust their engine?). I have spent some effort building a large corpus, and keeping it clean, but Spamassassin would discard most of the information from the DB after training over my corpus, and the decrease in accuracy was palpable. To prevent this information from leaching away, I firstly increased the size of the database, and turned off automatic expiration, by putting the following lines into ~/.spamassassin/user_prefs:
bayes_expiry_max_db_size 4000000
bayes_auto_expire 0
I also have regularly updated spam rules from the spamassassin rules emporium to improve the efficiency of the rules; my current user_prefs is available as an example.
I keep my Spam/Ham corpus under the directory
/backup/classify/Done, in the subdirectories
Ham and Spam. At the time of writing, I
have approximately 20,000 mails in each of these subdirectories,
for a total of 41,000+ emails.
I have created a couple of scripts to train the discriminators
from scratch using the extant Spam corpus; and these scripts are
also used for re-learning, for instance, when I moved from a 32-bit
machine to a 64-bit one, or when I change CRM114
discriminators. I generally run them from
~/.spamassassin/ and ~/var/lib/crm114
(which contains my CRM114 setup) directories.
I have found that training Spamassassin works best if you alternate Spam and Ham message chunks; and this Spamassassin learning script delivers chunks of 50 messages for learning.
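The alternation is easy to picture with a small sketch. This is not the learning script mentioned above, only an illustration of the batching order under the assumption that each batch is a list of message files; the helper names `chunks` and `alternating_batches` are hypothetical.

```python
from itertools import zip_longest
from typing import Iterator, List, Sequence, Tuple


def chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def alternating_batches(
    spam: Sequence[str], ham: Sequence[str], size: int = 50
) -> Iterator[Tuple[str, List[str]]]:
    """Interleave spam and ham chunks: spam, ham, spam, ham, ..."""
    for spam_chunk, ham_chunk in zip_longest(chunks(spam, size), chunks(ham, size)):
        if spam_chunk:
            yield ("spam", spam_chunk)
        if ham_chunk:
            yield ("ham", ham_chunk)


# Hypothetical file names standing in for the corpus.
spam_files = [f"Spam/{n:05d}" for n in range(120)]
ham_files = [f"Ham/{n:05d}" for n in range(60)]

plan = [(label, len(batch)) for label, batch in alternating_batches(spam_files, ham_files)]
print(plan)
```

Each batch could then be handed to `sa-learn --spam` or `sa-learn --ham` as appropriate; when one class runs out first, the remaining chunks of the other simply follow one after another.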
With CRM114, I have discovered that it is not a
good idea to stop learning based on the number of times the corpus
has been gone over, since stopping before all messages in the corpus
are correctly handled is disastrous. So I set the repeat count
to a ridiculously high number, and tell mailtrainer to
continue training until a streak larger than the sum of Spam and
Ham messages has occurred. This CRM114 trainer
script does the job nicely; running it under
screen is highly recommended.
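As a rough illustration of that stopping rule (and not the trainer script itself), the sketch below derives a streak one larger than the corpus size and assembles a mailtrainer command line. The script path and the `--spam`, `--good`, `--repeat` and `--streak` options are the ones in the mailtrainer invocation quoted elsewhere on this page; the message counts, the repeat value and the function name are assumptions.

```python
import shlex
from typing import List


def mailtrainer_args(
    spam_dir: str, ham_dir: str, spam_count: int, ham_count: int, repeat: int = 1000
) -> List[str]:
    """Build a mailtrainer command that only stops after a full clean pass."""
    # A streak longer than the whole corpus means every message was
    # handled correctly at least once in a row.
    streak = spam_count + ham_count + 1
    return [
        "/usr/share/crm114/mailtrainer.crm",
        f"--spam={spam_dir}",
        f"--good={ham_dir}",
        f"--repeat={repeat}",
        f"--streak={streak}",
    ]


# Hypothetical counts; in practice they would come from counting the files.
args = mailtrainer_args(
    "/backup/classify/Done/Spam/", "/backup/classify/Done/Ham/", 20500, 20500
)
print(shlex.join(args))
```

The command is only printed here, so the sketch can be tried without touching any css files.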
Coming back to where we left off, we had mail (mbox format) folders called ham and/or junk sitting in the local mail delivery directory, which were ready to be used for training either CRM114 or Spamassassin or both.
There are two scripts that help me automate the training. The first script, called mail-process, does most of the heavy lifting. This processes a bunch of mail folders, which are supposed to contain mail which is either all ham or all spam, indicated by the command line arguments. We go looking through every mail, and any mail where either the CRM114 or the Spamassassin judgement was not what we expected, we strip out mail gathering headers, and then we save the mail, one to a file, and we train the appropriate filter. This ensures that we only train on error, and it does not matter if we accidentally try to train on correctly classified mail, since that would be a no-op (apart from increasing the size of the corpus).
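The train-on-error decision at the heart of that description can be sketched in a few lines. This is not mail-process; it only shows which filters would be retrained for a given mail, and the `Verdicts` type, the verdict strings and the function name are all invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdicts:
    crm114: str        # assumed values: "ham", "spam" or "unsure"
    spamassassin: str  # assumed values: "ham" or "spam"


def filters_to_train(expected: str, verdicts: Verdicts) -> List[str]:
    """Return the filters whose judgement differs from the human label."""
    wrong = []
    if verdicts.crm114 != expected:
        wrong.append("crm114")
    if verdicts.spamassassin != expected:
        wrong.append("spamassassin")
    return wrong


# A ham folder: crm114 was unsure, spamassassin got it right.
needs_training = filters_to_train("ham", Verdicts(crm114="unsure", spamassassin="ham"))
# A junk folder: both filters already agree with the human.
already_fine = filters_to_train("spam", Verdicts(crm114="spam", spamassassin="spam"))
print(needs_training, already_fine)
```

An empty result is the no-op case: the mail was classified as expected, so nothing is trained.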
The second script, called mproc, is a convenience front-end;
it just calls mail-process with the proper command
line arguments, and feeds it the ham and junk
folders in sequence; it takes no arguments. So, after human
classification, just calling mproc takes care of the
training.
This pretty much finishes the series of posts I had in mind about spam filtering; I hope it has been useful.
Now that I have a baseline filter, how do I continue to train it, without putting in too much effort? There are two separate activities here: firstly, selecting the mails to be used in training, and secondly, automating the training and saving to the mail corpus. Ongoing training is essential; Spam mutates, and even ham changes over time, and well trained filters drift. However, if training disrupts normal work-flow, it won’t happen; so a minimally intrusive set of tools is critical.
There are three broad categories of mails that fit the criteria:
This is where human judgement comes in, to separate the wheat from the chaff.
Misclassified ham is far more critical, and, unfortunately, somewhat harder to get right, since you do want to reject the worst of the Spam at the SMTP level. A mistake here is worse than a false negative: all that happens with a false negative is that you curse, save to the junk folder for later retraining, and move on. With missed Ham, you never know what you might have missed, and hope it is nothing important.
I try and keep all the mail that I have rejected in quarantine for about a month or so, and so can retrieve a mail if informed about a mistaken rejection. I also do spot checks once in a while, though as time has gone on with no known false positives, the frequency of my checks has dropped.
This is mail correctly classified overall, but misclassified by either crm114, or spamassassin, but not both. This is an early warning sign, and is more common than mail that is misclassified, since usually the filter that is wrong is wrong weakly. But this is the point where training should occur, so that the filter does not drift to the point that mails are misclassified. Again, the mlg script catches this.
This is Mail correctly classified, but something that mailreaver is unsure about — and this category is why mailreaver learns faster than mailtrainer.
At this point I should say something about how I generally handle mails scored as Spam by the filters. The mail handling is simple, and depends on the combined score given to the mail by the filters. The handling rules are:
So, any mail with score less than 15 is accepted, potentially after grey-listing. The disposition is done according to the following set of rules:
In the last 18+ months, I have not seen a Ham mail in my realspam folder; chances of Ham being rejected are pretty low. My Spam folder gets a ham message every few months, but these are usually spamassassin misclassifying things; and mlg detects those. I have not seen one of these in the last 6 months. So my realspam canary has done wonders for my peace of mind. With ideally trained filters, spam and realspam folders would be empty.
I have created a script called mlg (“Mail List Grey”) that I run periodically over my mail folder, that picks out mails that either (a) are classified differently by spamassassin and crm114, or (b) are marked as unsure by mailreaver. The script takes these mails and saves them into a grey.mbox folder. I tend to run them over Spam and non-Spam folders in different runs, so that the grey.mbox folder can be renamed to either ham or junk, in the vast majority of the cases. Only for misclassified mails do I have to individually pick the misplaced email and classify it separately from the rest of the emails in that batch.
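The selection rule is simple enough to sketch. The code below is not mlg; it assumes the verdicts can be read back from an `X-CRM114-Status` header (first word Good, SPAM or UNSURE) and an `X-Spam-Status` header (starting with Yes or No), which may not match every setup, and the function names are hypothetical.

```python
from email import message_from_string
from email.message import Message


def crm114_verdict(msg: Message) -> str:
    """Map the first word of the assumed X-CRM114-Status header to a verdict."""
    words = (msg.get("X-CRM114-Status") or "").split()
    word = words[0].lower() if words else ""
    # Anything other than a clear Good/SPAM, including a missing header,
    # is treated as unsure.
    return {"good": "ham", "spam": "spam"}.get(word, "unsure")


def spamassassin_verdict(msg: Message) -> str:
    """Read the assumed X-Spam-Status header: 'Yes, ...' means spam."""
    status = (msg.get("X-Spam-Status") or "").strip().lower()
    return "spam" if status.startswith("yes") else "ham"


def is_grey(msg: Message) -> bool:
    """Grey means: crm114 is unsure, or the two classifiers disagree."""
    crm = crm114_verdict(msg)
    return crm == "unsure" or crm != spamassassin_verdict(msg)


agree = message_from_string(
    "X-CRM114-Status: Good ( pR: 25.1 )\nX-Spam-Status: No, score=-2.0\n\nhello\n"
)
disagree = message_from_string(
    "X-CRM114-Status: SPAM ( pR: -14.2 )\nX-Spam-Status: No, score=1.3\n\nhello\n"
)
unsure = message_from_string(
    "X-CRM114-Status: UNSURE ( pR: 0.4 )\nX-Spam-Status: No, score=0.1\n\nhello\n"
)
print([is_grey(m) for m in (agree, disagree, unsure)])
```

Running such a check over one mbox folder at a time and appending the grey messages to grey.mbox would mirror the workflow described above.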
At this point, I should have mail mbox folders called ham and/or junk, which are now ready to be used for training either crm114 or spamassassin or both. Processing these folders is the subject of the next article in this series.
The last time I blogged about Spam fighting (Mail Filtering With CRM114 Part 1), I left y’all with visions of non-converging learning, various ingenious ways of working around an unclean corpus, and perhaps a sinking feeling that this whole thing was more fragile than it ought to be.
During this period of flailing around, trying to get mailtrainer to learn the full corpus correctly, I upgraded to an unpackaged version of crm114. Thanks to the excellent packaging effort by the maintainers, this was dead easy: get the debian sources using apt-get source crm114, download the new tarball from crm114 upstream, cp the debian dir over, and just edit the changelog file to reflect the new version. I am currently running my own, statically linked 20061103-Blame Dalkey.src-1.
Cleaning the corpus made a major difference to the quality of discrimination. As mentioned earlier, I examined every mail that was initially incorrectly classified during learning. Now, there are two ways this can happen: That the piece of mail was correctly placed in the corpus, but had a feature that was different from those learned before; or that it was wrongly classified by me. When I started the chances were almost equally likely; I have now hopefully eliminated most of the misclassifications. When mailtrainer goes into cycles, retraining on a couple of emails round after round, you almost certainly are trying to train in conflicting ways. Cyclic retraining is almost always a human’s error in classification.
Some of the errors discovered were not just misclassifications: some were things that were inappropriate mail, but not Spam. For instance, there was the whole conversation where someone subscribed debian-devel to another mailing list: there was the challenge, the subscription notice, the un-subscription, challenge, and notice, all of which were inappropriate, interrupted the flow, and contributed to the noise, but were not really Spam. I had, in a fit of pique, labelled them as Spam; but they were really like any other mailing list subscription conversations, which I certainly want to see for my subscriptions. crm114 did register the conversations as Spam and non-Spam, as requested, but that increased the similarity between Spam and non-Spam features, and probably decreased the accuracy. I’ve since decided to train only on Spam, not on inappropriate mails; and let Gnus keep inappropriate mails from my eyes.
I’ve also waffled over time about whether or not to treat newsletters from Dr. Dobbs Journal or USENIX as Spam or not — now my rule of thumb is that since I signed up for them at some point, they are not Spam — though I don’t feel guilty about letting mailagent squirrel them away mostly out of sight.
A few tips about using mailtrainer:
The first step is to configure the CRM114 files, and these are now in pretty good shape as shipped with Debian. All that I needed to do was set a password, say that I’d use openssl base64 -d, and stick with the less verbose defaults (so no munging subjects, no saving all mail, no saving rejects, etc, since I have other mechanisms that do all that). The comments in the configuration files are plenty good enough. This part went off like a breeze; the results can be found here (most of the values are still the upstream default).
The next step was to create new, empty .css files. I have noticed that creating more buckets makes crm114 perform better, so I go with larger than normal crm114 .css files. I have no idea if this is the right thing to do, but I make like Nike and just do it. At some point I’ll ask on the crm114-general mailing list.
% cssutil -b -r -S 4194000 spam.css
% cssutil -b -r -S 4194000 nonspam.css
Now we have a blank slate; at this time the filter knows nothing, and is equally likely to call something Spam or non-Spam. We are now ready to learn. So, I girded my loins, and set about feeding my whole mail corpus to the filter:
/usr/share/crm114/mailtrainer.crm \
--spam=/backup/classify/Done/Spam/ \
--good=/backup/classify/Done/Ham/ \
--repeat=100 --streak=35000 | \
egrep -i '^ +|train|Excell|Running'
And this failed spectacularly (see Debian bug #399306). Faced with unexpected segmentation violations, and not being proficient in crm114’s rather arcane syntax, I was forced to speculation: I assumed (as it turns out, incorrectly) that if you throw too many changes at the crm114 database, things rapidly escalate out of control. I went on to postulate that as my mail corpus was gathered over a period of years, the characteristic of Spam drifted over time, and what I consider Spam has also evolved. So, some parts of the early corpus are at variance with the more recent bits.
Based on this assumption, I created a wrapper script which did what Clint has called training to exhaustion: it iterated over the corpus several times, starting with a small and geometrically increasing chunk size. Given the premise I was working under, it does a good job of training crm114 on a localized window of Spam: it feeds chunks of the corpus to the trainer, with each successive chunk overlapping the previous and succeeding chunks, and ensuring that crm114 is happy at any given chunk of the corpus. Then it doubles the chunk size, and goes at it again. All very clever, and all pretty wrong.
I also created another script to retrain crm114, which was less exhaustive than the previous one, but did incorporate any further drift in the nature of Spam. I no longer use these scripts; but I would like to record them for posterity as an indication of how far one can take an hypothesis.
What it did do was have crm114 learn without segfaulting, and show me that there was a problem in the corpus. I noticed that in some cases the trainer would find a pair of mail messages, classify them wrongly, and retrain and refute, iteration after iteration, back and forth. I noticed this when I added the egrep filter above, and was not drowning in the needless chatter from the trainer. It turns out I had very similar emails (sometimes, even the same email) in the Ham and the Spam corpus, and no wonder crm114 was throwing hissy fits. Having small chunks ensured that I had not too many such errors in any chunk; and crm114 did try to forget a mail differently classified in an older chunk and learn whatever this chunk was trying to teach it. The downside was that the count of the differences between ham and Spam went down, and the similarities increased, which meant that the filter was not as good at separating ham and Spam as it could have been.
So my much vaunted mail corpus was far from clean: over the years, I had misclassified mails, been wishy-washy and changed my mind about what was and was not Spam. I have a script that uses md5sums to find duplicate files in a directory, and found, to my horror, that there were scores of duplicates in the corpus. After eliminating outright duplicates, I started examining anything that showed up with an ER (error, refute) tag in the trainer output; on the second iteration of the training script these were likely to be misclassifications. I spent days examining my corpus and cleaning it out; and was gratified to see the ratio of differences to similarities between ham and Spam css files climb from a shade under 3 to around 7.35.
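The duplicate hunt is the sort of thing that fits in a dozen lines. What follows is not the md5sum script mentioned above, just a minimal sketch of the same idea: hash every file and group paths by digest. It only catches byte-identical files, and the function name and the throwaway demo files are assumptions.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def find_duplicates(*directories: str) -> List[List[Path]]:
    """Group files with identical contents across the given directories."""
    by_digest: Dict[str, List[Path]] = defaultdict(list)
    for directory in directories:
        for path in sorted(Path(directory).iterdir()):
            if path.is_file():
                digest = hashlib.md5(path.read_bytes()).hexdigest()
                by_digest[digest].append(path)
    # Only digests seen more than once are duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]


# Demo on a throwaway directory tree rather than a real corpus.
root = Path(tempfile.mkdtemp())
(root / "Ham").mkdir()
(root / "Spam").mkdir()
(root / "Ham" / "0001").write_bytes(b"Subject: hi\n\nsame mail\n")
(root / "Spam" / "0002").write_bytes(b"Subject: hi\n\nsame mail\n")
(root / "Ham" / "0003").write_bytes(b"Subject: hi\n\nanother mail\n")

groups = find_duplicates(str(root / "Ham"), str(root / "Spam"))
for group in groups:
    print([str(path.relative_to(root)) for path in group])
```

Passing both the Ham and the Spam directory in one call is what surfaces the nasty case, the same mail filed under both labels; near-duplicates with differing headers would still need the ER-tag inspection described above.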
Next post we’ll talk about lessons learned about training, and how a nominal work flow of training on errors and training when classifiers disagree can be set up.
Posted at lunch time on Tuesday, December 19th, 2006
License: GPL
I have a fairly sophisticated Spam filtering mechanism set up, using MIMEDefang for SMTP level rejection of Spam, grey-listing for mails not obviously ham or Spam, and using both crm114 and spamassassin for discrimination, since they compensate for each other when one of them can’t tell what it is looking at. People have often asked me to write up the details of my setup, and the support infrastructure, and I have decided to oblige. What follows is a series of blog posts detailing how I set about redoing my crm114 setup, with enough detail that interested parties can tag along.
I noticed that the new crm114 packages have a new facility called mailtrainer, which can be used to set up initial css database files. I liked the fact that it can run over the training data several times, it can keep back portions of the training data as a control group, and you can tell it to keep on going until it gets a certain number of mails discriminated correctly. This is cool, since I have a corpus of about 15K Spam and 16K ham messages, mostly gathered through my previous train-on-error practice (whenever crm114 classified a message incorrectly, I trained it, and stored all training emails). I also train whenever crm114 and spamassassin disagree with each other.
This would also be a good time to switch from crm114’s mailfilter to the new mailreaver, which is the new, 3rd Generation mail filter script for crm114. It is easier to maintain, and since it flags mails for which it is unsure, and you ought to train all such mails anyway, it learns faster.
Also, since the new packages come with brand new discrimination algorithms which are supposed to be as accurate but also faster, but may store data in incompatible ways, I figured that it might be time to redo my CRM mail filter from the ground up. The default now is “osb unique microgroom”. This change requires me to empty out the css files.
I also decided to change my learning scripts to not send commands via email to a ‘testcrm’ user; instead, I now train the filter using a command line mode. Apart from saving mails incorrectly classified into folders ham and junk (Spam was already taken), I have a script that grabs mails which are classified differently by crm114 and spamassassin from my incoming mail spool and saves them into a grey.mbox file, which I can manually separate out into ham and junk. Then I have a processing script that takes the ham and junk folders and trains spamassassin, or crm114, or both; and stashes the training files away into my cached corpus for the future.
In subsequent blog postings, I’ll walk people through how I set up and initialized my filter, and provide examples of the scripts I used, along with various and sundry missteps and lessons learned and all.
It is funny that people say that no one has written software that does Bayesian filtering in order to reject Spam at the SMTP level, just when I felt like writing up my mimedefang setup. I am of the school of thought that the only acceptable method of dealing with spam is to reject it at the SMTP protocol level. To accomplish this, I run crm114 and spamassassin from my mimedefang-filter, which runs as a milter while my sendmail is processing mail. I also do conditional grey listing, more on that below.
The reason for running two different Spam detection mechanisms is that while one or the other is often fooled and mis-categorizes mail, the score or confidence level of such mis-categorization is low, and the mechanism that is not fooled enables the combined system to do the right thing in most cases. I tend to quarantine and keep even the messages I reject for about a month, and do spot checks. Out of a volume of roughly 800 mails a day, I get about one or two spam messages a day, and I did not ever reject a mail I manually would classify as ham in the 6 months in which I double checked every decision taken by my system, nor in the spot checks I continue to perform. While this is not as good a record as the author of crm114 reports, it is good enough for me.
The grey listing: since I try to be conservative, and opt to lean towards not discarding any email that is legitimate at the expense of not trapping all spam, and since I do not discard any mail I accept unread based on an automated decision, I was still getting more Spam than I cared for. So, I added the grey listing layer, but only for mail that was not unambiguously spam or viruses (which are rejected), or ham (which is accepted for delivery). Most of the Spam that was slipping through came from the grey area. So, I implemented grey listing only for these mails.
Grey listing happens on a triplet of the class C network of the sender’s IP, the sender, and the recipient. A grey listed email encounters processing problems, and is temporarily rejected with a 503 for 30 minutes, and then retries are allowed through for 4 hours, in which case the triplet is white-listed for 35 days (catches monthly announcement messages). Known Spam resets the white-listing, and retrying past the window does too.
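The timing rules above amount to a small state machine, sketched here with an in-memory dictionary instead of the real filter's storage. The 30 minute, 4 hour and 35 day figures come from the description; everything else is an assumption made for illustration: the class and function names, IPv4-only handling, the retry window starting when the initial delay ends, and the SMTP reply code being left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

DELAY = 30 * 60                # initial temporary rejection: 30 minutes
RETRY_WINDOW = 4 * 60 * 60     # retries are let through for the next 4 hours
WHITELIST = 35 * 24 * 60 * 60  # a good retry white-lists the triplet for 35 days

Triplet = Tuple[str, str, str]


@dataclass
class Entry:
    first_seen: float
    whitelisted_until: float = 0.0


def make_triplet(ip: str, sender: str, recipient: str) -> Triplet:
    """Key on the first three octets of an IPv4 address plus both addresses."""
    network = ".".join(ip.split(".")[:3])
    return (network, sender.lower(), recipient.lower())


class Greylist:
    def __init__(self) -> None:
        self.entries: Dict[Triplet, Entry] = {}

    def check(self, key: Triplet, now: float) -> str:
        """Return "accept" or "tempfail" for a delivery attempt at time `now`."""
        entry = self.entries.get(key)
        if entry is not None and now < entry.whitelisted_until:
            return "accept"
        if entry is None:
            self.entries[key] = Entry(first_seen=now)
            return "tempfail"
        age = now - entry.first_seen
        if age < DELAY:
            return "tempfail"
        if age <= DELAY + RETRY_WINDOW:
            entry.whitelisted_until = now + WHITELIST
            return "accept"
        # Retrying past the window starts the clock over.
        self.entries[key] = Entry(first_seen=now)
        return "tempfail"

    def saw_spam(self, key: Triplet) -> None:
        """Known Spam resets any white-listing for the triplet."""
        self.entries.pop(key, None)


greylist = Greylist()
key = make_triplet("192.0.2.17", "Someone@Example.org", "me@example.net")
outcomes = [greylist.check(key, t) for t in (0, 600, 1800, 90000)]
print(key, outcomes)
```

In the setup described here this check would only ever be consulted for mail the classifiers could not place firmly as ham or Spam; strongly classified mail bypasses it, and strongly classified Spam calls the reset.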
Doing it this way, known good correspondents suffer no delay, unambiguous mail from strangers flows right through, and I improve my filters based on the results all the time, so my discriminator improves.
My filter is available for examination. See also the older posting; in the time honoured tradition of /., this posting is a duplicate.
Posted mid-morning Saturday, May 21st, 2005
License: GPL
I have been fairly comfortable with my current mail filtering
solution. Unlike some other blogs on Planet, when last I was doing
exhaustive checks, I had not had a false positive in six months
(now I just do random spot checks). The false negatives creep up
from a few a week to a few a day (with a mail feed of about 1k
emails a day). And the stuff that is classified as a definite spam
is REJECTed (I do keep a copy), which means that a
legitimate correspondent would have an idea something went wrong
(unless they are ignoring such bounces from their own
MAILER-DAEMON).
Part of the reason for this satisfactory performance is that I
use a layered approach, with several tools playing off each other,
ameliorating each other's mistakes. Admittedly, this took painful
training — I created a testcrm user, and bounced a
copy of every mail I got to it using a .forward. Then,
over a course of two months, I would painstakingly go over the
classification, training on error, until the failure rates dropped
to levels I felt comfortable with, and moved the CRM114 and
Spamassassin configuration over for my own use.
I recently added a hand crafted Greylisting implementation to MIMEDefang; in good MIMEDefang tradition, this is a heavily tweaked version of an implementation posted on the mailing list, using PostgreSQL. I have modified it to not greylist every single new email that comes my way, but to only greylist stuff that CRM114 and Spamassassin have been unable to classify strongly as Spam or ham. Mail that has been strongly classified already, in turn, affects greylisting.
Here is the SQL code for the implementation, and the mimedefang-filter itself, showing how I integrate CRM114 and Spamassassin along with greylisting in a Sendmail Milter.
Have fun.