License: GPL
UPDATE: This posting has severe flaws, which were discovered subsequently. Please ignore.
I have often posted on the mailing lists about the accuracy of my mail filtering mechanisms. I have not had a false positive in years: I stash all discards and rejects locally and spot-check them frequently, and I went through 6 months of exhaustive checks when I put this system in place. False negatives are down to about 3-4 a month (0.019%). Yes, that is right: I am claiming that my classification correctness record is 99.92% (99.98% accuracy for messages my classifiers are sure about). Incorrectly classified unsure ham is about 3-4 a month (0.019%); incorrectly classified unsure spam is roughly the same, perhaps a little higher. Adding these to the incorrect classifications, my best estimate of mail that is not confidently classified is 0.076%, based on the last 60 days of data (which is what gets you the 99.92%).
I get unsure/retrain messages at the rate of about 20 a day (about 3.2% of non-spam email). About two thirds of these are classified correctly, but either SpamAssassin (SA) and crm114 disagree, or crm114 is unsure. So I have to look at about 20 messages a day to see whether a ham message slipped in there, and I train my filters based on these. The process is highly automated (it just uses my brain as a classifier). The mail statistics can be seen on my mail server.
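The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch using the percentages quoted above. The midpoint of "3-4 a month" (3.5) and the idea that both percentages refer to the same mail stream are assumptions made for illustration, not measurements.

```python
# Percentages quoted in the text above.
not_confident_pct = 0.076   # mail that is not confidently classified
false_negative_pct = 0.019  # false negatives among confident classifications

overall_accuracy = 100.0 - not_confident_pct     # about 99.92
confident_accuracy = 100.0 - false_negative_pct  # about 99.98

# Implied volumes. ASSUMPTION: 3.5 is the midpoint of "3-4 a month".
implied_monthly_mail = 3.5 / (false_negative_pct / 100.0)
# 20 unsure messages a day are quoted as about 3.2% of non-spam mail.
implied_daily_nonspam = 20 / 0.032

print(f"overall accuracy:      {overall_accuracy:.2f}%")
print(f"confident accuracy:    {confident_accuracy:.2f}%")
print(f"implied monthly mail:  {implied_monthly_mail:.0f}")
print(f"implied daily nonspam: {implied_daily_nonspam:.0f}")
```

If the two percentages do refer to the same stream, the monthly and daily volumes they imply are of the same order, which is a useful sanity check on manually gathered numbers.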
Oh, my filtering front end also switches between reject and discard, and turns greylisting on and off, based on whether or not the mail is coming from mailing lists or newsletters I have authorized; see mimedefang-filter.
However, all these numbers are gathered manually, and I still have not gotten around to automating the measurement of my setup’s overall accuracy. I do now have some figures for one of the two classifiers in my system: here is the data from CRM114. I’ll update the numbers below via cron.
UPDATE: The css files used below were malformed, and the process of creating them detailed below is flawed. Please see newer postings in this category.
First, some context: when training CRM114 using the mailtrainer command, one can choose to leave a certain percentage of the training set out of the learn phase, and then run a second pass over the skipped mails to test the accuracy of the training. You do this by specifying a regular expression that matches the file names to skip. Since the file names in my training set contain message numbers, it was simple to use the two least significant digits as the regexp; but I did not like the idea of always leaving out the same messages. So I now generate two two-digit numbers for every training run, and leave out messages whose numbers end in either of them, in effect reserving 2% of all mails for the accuracy run.
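The hold-out selection described above can be sketched roughly as follows. This is not the script used for the runs reported here. The file naming scheme (a zero-padded message number followed by two underscores), the helper names `holdout_regex` and `split_corpus`, and the use of random selection are all assumptions made for illustration; only the shape of the generated regexp follows the "Validation" column of the table.

```python
import random
import re

def holdout_regex(rng):
    """Build a regexp from two distinct two-digit suffixes.

    Each character gets its own bracket expression and is followed by two
    literal underscores, e.g. "[1][6][_][_]|[0][3][_][_]".
    """
    first, second = rng.sample(range(100), 2)
    alternatives = []
    for suffix in (f"{first:02d}", f"{second:02d}"):
        alternatives.append("".join(f"[{ch}]" for ch in suffix + "__"))
    return "|".join(alternatives)

def split_corpus(names, pattern):
    """Split file names into (training, held_out) using the regexp."""
    matcher = re.compile(pattern)
    training, held_out = [], []
    for name in names:
        (held_out if matcher.search(name) else training).append(name)
    return training, held_out

# HYPOTHETICAL corpus: 10000 messages named like "00116__msg".
names = [f"{n:05d}__msg" for n in range(10000)]
pattern = holdout_regex(random.Random())
training, held_out = split_corpus(names, pattern)
print(pattern, len(training), len(held_out))
```

With evenly distributed message numbers, each two-digit suffix matches 1% of the files, so two distinct suffixes hold out 2% of the corpus, and a fresh pair on each run avoids always validating on the same messages.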
An interesting thing to note is the asymmetry in the accuracy: CRM114 has never identified a spam message incorrectly. This is because the training mechanism is skewed towards letting a few spam messages slip through, rather than letting a good message slip into the spam folder. I like that. So, here are the accuracy numbers for CRM114; adding SpamAssassin into the mix only improves the numbers.
Also, I have always felt that a freshly learned css file is somewhat brittle, in the sense that if one trains an unsure message and then tries to TUNE (Train Until No Errors) the css file, a large number of runs through the training set are needed before the thing stabilizes. It is as if the initial learning was minimalistic, and adding the information for the new unsure message required all kinds of tweaking. After a while of TOEing (Training on Errors) and TUNEing, this brittleness seems to get hammered out of the css files. I also expect to see accuracy rise as the css files get less brittle. The table below starts with data from a newly minted .css file.
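To make the two training regimes concrete, here is a minimal sketch of TOE and TUNE as control loops. The `ToyClassifier` is a perceptron-style stand-in invented for this illustration; it is not how CRM114 works internally, and the four-message corpus is made up.

```python
from collections import defaultdict

class ToyClassifier:
    """Perceptron-style stand-in for a real filter (NOT CRM114's algorithm)."""

    def __init__(self):
        self.weights = defaultdict(float)

    def classify(self, text):
        score = sum(self.weights.get(tok, 0.0) for tok in text.lower().split())
        return "spam" if score > 0 else "ham"

    def learn(self, text, label):
        delta = 1.0 if label == "spam" else -1.0
        for tok in text.lower().split():
            self.weights[tok] += delta

def toe_pass(clf, corpus):
    """Training on Errors: learn a message only if it is misclassified."""
    errors = 0
    for text, label in corpus:
        if clf.classify(text) != label:
            clf.learn(text, label)
            errors += 1
    return errors

def tune(clf, corpus, max_passes=50):
    """Train Until No Errors: repeat TOE passes until one is error free.

    Returns the number of passes used, or None if max_passes was not enough.
    """
    for passes in range(1, max_passes + 1):
        if toe_pass(clf, corpus) == 0:
            return passes
    return None

# HYPOTHETICAL toy corpus of (text, label) pairs.
corpus = [
    ("cheap pills buy now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("buy cheap watches", "spam"),
    ("lunch on monday", "ham"),
]
clf = ToyClassifier()
print("passes to stabilize:", tune(clf, corpus))

# Adding one awkward message forces further passes over the whole corpus.
corpus.append(("buy lunch now", "ham"))
print("passes after new message:", tune(clf, corpus))
```

The number of passes `tune` needs after a new message is added is one rough way to quantify the brittleness described above.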
| Date | Corpus size | Ham count | Ham correct | Ham accuracy (%) | Spam count | Spam correct | Spam accuracy (%) | Overall count | Overall correct | Overall accuracy (%) | Validation regexp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Wed Oct 31 10:22:23 UTC 2007 | 43319 | 492 | 482 | 97.967480 | 374 | 374 | 100.000000 | 866 | 856 | 98.845270 | `[1][6][_][_]\|[0][3][_][_]` |
| Wed Oct 31 17:32:44 UTC 2007 | 43330 | 490 | 482 | 98.367350 | 378 | 378 | 100.000000 | 868 | 860 | 99.078340 | `[3][7][_][_]\|[2][3][_][_]` |
| Thu Nov 1 03:01:35 UTC 2007 | 43334 | 491 | 483 | 98.370670 | 375 | 375 | 100.000000 | 866 | 858 | 99.076210 | `[2][0][_][_]\|[7][9][_][_]` |
| Thu Nov 1 13:47:55 UTC 2007 | 43345 | 492 | 482 | 97.967480 | 376 | 376 | 100.000000 | 868 | 858 | 98.847930 | `[1][2][_][_]\|[0][2][_][_]` |
| Sat Nov 3 18:27:00 UTC 2007 | 43390 | 490 | 480 | 97.959180 | 379 | 379 | 100.000000 | 869 | 859 | 98.849250 | `[4][1][_][_]\|[6][4][_][_]` |
| Sat Nov 3 22:38:12 UTC 2007 | 43394 | 491 | 482 | 98.167010 | 375 | 375 | 100.000000 | 866 | 857 | 98.960740 | `[3][1][_][_]\|[7][8][_][_]` |
| Sun Nov 4 05:49:45 UTC 2007 | 43400 | 490 | 483 | 98.571430 | 377 | 377 | 100.000000 | 867 | 860 | 99.192620 | `[4][6][_][_]\|[6][8][_][_]` |
| Sun Nov 4 13:35:15 UTC 2007 | 43409 | 490 | 485 | 98.979590 | 377 | 377 | 100.000000 | 867 | 862 | 99.423300 | `[3][7][_][_]\|[7][9][_][_]` |
| Sun Nov 4 19:22:02 UTC 2007 | 43421 | 490 | 486 | 99.183670 | 379 | 379 | 100.000000 | 869 | 865 | 99.539700 | `[7][2][_][_]\|[9][4][_][_]` |
| Mon Nov 5 05:47:45 UTC 2007 | 43423 | 490 | 489 | 99.795920 | 378 | 378 | 100.000000 | 868 | 867 | 99.884790 | `[4][0][_][_]\|[8][3][_][_]` |
As you can see, the accuracy numbers are trending up, and are already close to the values observed on my production system.
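For reference, the accuracy columns are simply the correct count divided by the message count, as a percentage. The sketch below recomputes them for the first and last runs in the table; the function name `accuracy` and the layout of `runs` are choices made for this example, not part of the reporting script.

```python
def accuracy(correct, count):
    """Percentage of validation messages classified correctly."""
    return 100.0 * correct / count

# (ham count, ham correct, spam count, spam correct), copied from the table.
runs = {
    "Wed Oct 31 10:22:23 UTC 2007": (492, 482, 374, 374),
    "Mon Nov 5 05:47:45 UTC 2007": (490, 489, 378, 378),
}

for date, (ham_n, ham_ok, spam_n, spam_ok) in runs.items():
    ham_pct = accuracy(ham_ok, ham_n)
    spam_pct = accuracy(spam_ok, spam_n)
    overall_pct = accuracy(ham_ok + spam_ok, ham_n + spam_n)
    print(f"{date}: ham {ham_pct:.4f}%  spam {spam_pct:.4f}%  overall {overall_pct:.4f}%")
```

The overall figure is a count-weighted combination of the ham and spam figures, so with spam at 100% the trend in the overall column is driven entirely by the ham column.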




