Tales from the Gryphon

Filtering accuracy: Hard numbers

Monday 05 November 2007

License: GPL

UPDATE: This posting has severe flaws, which were discovered subsequently. Please ignore.

I have often posted on the mailing lists about the accuracy of my mail filtering mechanisms. I have not had a false positive in years: I stash all discards/rejects locally and do frequent spot checks, and I went through 6 months of exhaustive checks when I put this system in place. False negatives are down to about 3-4 a month (0.019%). Yes, that is right: I am claiming that my classification correctness record is 99.92% (99.98% accuracy for messages my classifiers are sure about). Incorrectly classified unsure ham is about 3-4 (0.019%) a month; incorrectly classified unsure spam is roughly the same, perhaps a little higher. Adding these to the incorrect classifications, my best estimate of mail that is not confidently classified is 0.076%, based on the last 60 days of data (which is what gets you the 99.92%).
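As a quick sanity check on the arithmetic, here is a small Python sketch. It assumes that both percentages are fractions of the same mail volume, and that the 99.98% figure is simply 100 minus the false negative rate; neither assumption is spelled out above.

```python
# Error rates quoted above, in percent of mail (taken from the text).
NOT_CONFIDENT_PCT = 0.076   # wrong or not confidently classified
FALSE_NEGATIVE_PCT = 0.019  # spam that slipped through (assumed basis for 99.98%)


def complement(pct):
    """Return the percentage left over once `pct` is removed from 100%."""
    return 100.0 - pct


overall = complement(NOT_CONFIDENT_PCT)
sure_only = complement(FALSE_NEGATIVE_PCT)

print(f"overall: {overall:.2f}%")               # overall: 99.92%
print(f"sure messages only: {sure_only:.2f}%")  # sure messages only: 99.98%
```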

I get unsure/retrain messages at the rate of about 20 a day (about 3.2% of non-spam email). About two thirds of these are classified correctly, but either SpamAssassin (SA) and crm114 disagree, or crm114 is unsure. So I have to look at about 20 messages a day to see if a ham message slipped in there, and train my filters based on these. The process is highly automated (it just uses my brain as a classifier). The mail statistics can be seen on my mail server.

Oh, my filtering front end also switches between reject and discard, and turns greylisting on and off, based on whether or not the mail is coming from mailing lists or newsletters I have authorized (see mimedefang-filter).

However, all these numbers are manually gathered, and I still have not gotten around to automating the measurement of my setup’s overall accuracy, but now I have some figures on one of the two classifiers in my system. Here is the data from CRM114. I’ll update the numbers below via cron.

UPDATE: The css files used below were malformed, and the process of creating them detailed below is flawed. Please see newer postings in this category.

First, some context: when training CRM114 using the mailtrainer command, one can choose to leave a certain percentage of the training set out of the learn phase, and run a second pass over the mails so skipped to test the accuracy of the training. The way you do this is by specifying a regular expression to match the file names. Since my training set has message numbers, it was simple to use the least significant two digits as a regexp; but I did not like the idea of always leaving out the same messages. So I now generate two two-digit numbers for every training run, and leave out messages whose numbers end in either of them, in effect reserving 2% of all mails for the accuracy run.
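A minimal Python sketch of how such a validation regexp could be generated. The function names, the use of Python's random module, and the zero-padded file naming scheme are all assumptions for illustration; only the bracketed shape of the regexp is taken from the table further down. The sketch picks two distinct suffixes so that the held-out share is exactly 2%.

```python
import random
import re


def holdout_regexp(rng):
    """Pick two distinct two-digit suffixes and build a validation regexp.

    The bracketed shape mirrors the regexps in the accuracy table; the
    trailing [_][_] is copied from there as-is.
    """
    parts = []
    for suffix in rng.sample(range(100), 2):
        tens, ones = divmod(suffix, 10)
        parts.append(f"[{tens}][{ones}][_][_]")
    return "|".join(parts)


def held_out_count(pattern, total=10000):
    """Count hypothetical file names that the validation regexp selects."""
    rx = re.compile(pattern)
    # Hypothetical naming scheme: zero-padded message number, then "__".
    names = (f"{n:05d}__" for n in range(total))
    return sum(1 for name in names if rx.search(name))


pattern = holdout_regexp(random.Random())
print(pattern)                  # same shape as [1][6][_][_]|[0][3][_][_]
print(held_out_count(pattern))  # 200, i.e. 2% of 10000 message numbers
```

Two distinct suffixes out of a hundred select 2 in every 100 message numbers, which is where the 2% comes from.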

An interesting thing to note is the asymmetry in the accuracy: CRM114 has never identified a spam message incorrectly. This is because the training mechanism is skewed towards letting a few spam messages slip through, rather than letting a good message slip into the spam folder. I like that. So, here are the accuracy numbers for CRM114; adding SpamAssassin to the mix only improves the numbers.

Also, I have always felt that a freshly learned css file is somewhat brittle, in the sense that if one trains an unsure message, and then tries to TUNE (Train Until No Errors) the css file, a large number of runs through the training set are needed until the thing stabilizes. So it is as if the learning done initially was minimalistic, and adding the information for the new unsure message required all kinds of tweaking. After a while of TOEing (Training on Errors) and TUNEing, this brittleness seems to get hammered out of the css files. I also expect to see accuracy rise as the css files get less brittle. The table below starts with data from a newly minted .css file.
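To make the TOE and TUNE terminology concrete, here is a toy Python sketch. The ToyClassifier class, its token counting, and the four-message corpus are all hypothetical stand-ins; CRM114's real learning is far more involved and is not modelled here. Breaking ties in favour of ham is also an assumption, chosen to echo the skew described above.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real filter: counts tokens per class."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}

    def classify(self, text):
        tokens = text.lower().split()
        ham = sum(self.counts["ham"][t] for t in tokens)
        spam = sum(self.counts["spam"][t] for t in tokens)
        # Ties go to ham, so doubtful mail is never filed as spam.
        return "spam" if spam > ham else "ham"

    def learn(self, text, label):
        self.counts[label].update(text.lower().split())


def toe_pass(clf, corpus):
    """One TOE pass: learn only the messages that are misclassified."""
    errors = 0
    for text, label in corpus:
        if clf.classify(text) != label:
            clf.learn(text, label)
            errors += 1
    return errors


def tune(clf, corpus, max_passes=50):
    """TUNE: repeat TOE passes until one completes without errors."""
    for passes in range(1, max_passes + 1):
        if toe_pass(clf, corpus) == 0:
            return passes
    raise RuntimeError("did not stabilize within max_passes")


corpus = [
    ("meeting notes attached", "ham"),
    ("cheap pills online", "spam"),
    ("lunch meeting tomorrow", "ham"),
    ("win cheap prize", "spam"),
]
clf = ToyClassifier()
passes = tune(clf, corpus)
print(passes)  # 2: one pass that learns from an error, one clean pass
```

On a real corpus the number of passes can be much larger, which is the brittleness described above.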

Accuracy numbers and validation regexp
                              Corpus  Ham                       Spam                      Overall                   Validation
Date                            size  Count Correct   Accuracy  Count Correct   Accuracy  Count Correct   Accuracy  Regexp
Wed Oct 31 10:22:23 UTC 2007   43319    492     482  97.967480    374     374 100.000000    866     856  98.845270  [1][6][_][_]|[0][3][_][_]
Wed Oct 31 17:32:44 UTC 2007   43330    490     482  98.367350    378     378 100.000000    868     860  99.078340  [3][7][_][_]|[2][3][_][_]
Thu Nov 1 03:01:35 UTC 2007    43334    491     483  98.370670    375     375 100.000000    866     858  99.076210  [2][0][_][_]|[7][9][_][_]
Thu Nov 1 13:47:55 UTC 2007    43345    492     482  97.967480    376     376 100.000000    868     858  98.847930  [1][2][_][_]|[0][2][_][_]
Sat Nov 3 18:27:00 UTC 2007    43390    490     480  97.959180    379     379 100.000000    869     859  98.849250  [4][1][_][_]|[6][4][_][_]
Sat Nov 3 22:38:12 UTC 2007    43394    491     482  98.167010    375     375 100.000000    866     857  98.960740  [3][1][_][_]|[7][8][_][_]
Sun Nov 4 05:49:45 UTC 2007    43400    490     483  98.571430    377     377 100.000000    867     860  99.192620  [4][6][_][_]|[6][8][_][_]
Sun Nov 4 13:35:15 UTC 2007    43409    490     485  98.979590    377     377 100.000000    867     862  99.423300  [3][7][_][_]|[7][9][_][_]
Sun Nov 4 19:22:02 UTC 2007    43421    490     486  99.183670    379     379 100.000000    869     865  99.539700  [7][2][_][_]|[9][4][_][_]
Mon Nov 5 05:47:45 UTC 2007    43423    490     489  99.795920    378     378 100.000000    868     867  99.884790  [4][0][_][_]|[8][3][_][_]

As you can see, the accuracy numbers are trending up, and are already close to the values observed on my production system.
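The accuracy columns can be re-derived from the counts. Here is a short Python sketch that does so for the first and last rows of the table; the accuracy helper and the row layout are my own naming for this example, not part of the actual cron job.

```python
def accuracy(correct, count):
    """Percentage of validation messages that were classified correctly."""
    return 100.0 * correct / count


# (date, ham count, ham correct, spam count, spam correct), from the table.
rows = [
    ("Wed Oct 31 10:22:23 UTC 2007", 492, 482, 374, 374),
    ("Mon Nov 5 05:47:45 UTC 2007", 490, 489, 378, 378),
]

for date, ham_n, ham_ok, spam_n, spam_ok in rows:
    total_n = ham_n + spam_n
    total_ok = ham_ok + spam_ok
    print(
        f"{date}: ham {accuracy(ham_ok, ham_n):.4f}%, "
        f"spam {accuracy(spam_ok, spam_n):.4f}%, "
        f"overall {accuracy(total_ok, total_n):.4f}%"
    )
```

The results agree with the table to about seven significant digits, which appears to be the precision at which the table values were recorded.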

Manoj