Training accuracy for CRM114

I have often blogged about the efficiency of my spam filtering setup. I have claimed that my combined CRM114 and SpamAssassin setup is 99.92% accurate (99.98% when both of my classifiers are sure), but I have had very little data to back that up. I still have not gotten around to automating the measurement of my setup's overall accuracy, but I now have some figures for one of the two classifiers in my system. Here is the data from CRM114.

First, some context: when training CRM114 using the mailtrainer command, one can tell it to leave a certain percentage of the training set out of the learn phase, and then run a second pass over the skipped mails to test the accuracy of the training. You do this by specifying a regular expression that matches the file names to hold back. Since my training set has message numbers, it was simple to use the least significant two digits as a regexp, but I did not like the idea of always leaving out the same messages. So I now generate four sets of numbers for every training run, and can optionally reserve 0%, 1%, 2%, 4%, or 10% of all mails for the accuracy run. Usually I train with 0% reserved. When the css files are new and still changing a lot, I measure accuracy with 2% reserved; later on, I reserve 10% in the accuracy test runs.
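As a rough illustration of the hold-out selection described above, here is a minimal Python sketch that builds regexps in the same shape as the ones listed in the table further down. The function names are hypothetical, and the trailing `[_][_]` reflects an assumption read off those regexps (that the message number in each file name is followed by two underscores); this is not the actual script used for the runs.

```python
import random
import re

def build_regexp(suffixes):
    """Turn digit suffixes such as ["16", "03"] into a hold-out regexp.

    Each character is wrapped in its own bracket class, mirroring the
    style of the regexps in the table. The "[_][_]" tail is an assumption
    about the file name layout.
    """
    return "|".join(
        "".join(f"[{ch}]" for ch in suffix) + "[_][_]" for suffix in suffixes
    )

def pick_suffixes(percent, rng=random):
    """Choose random digit suffixes reserving roughly `percent` of the mails."""
    if percent == 0:
        return []  # nothing held back; do not build a regexp at all
    if percent == 10:
        # one trailing digit selects about one message in ten
        return [str(rng.randrange(10))]
    if percent in (1, 2, 4):
        # each distinct two-digit suffix selects about one message in a hundred
        return [f"{n:02d}" for n in rng.sample(range(100), percent)]
    raise ValueError("supported percentages: 0, 1, 2, 4, 10")

regexp = build_regexp(["16", "03"])
print(regexp)
# "0043316__example" is a made-up file name used only to exercise the regexp
print(bool(re.search(regexp, "0043316__example")))
```

Picking fresh suffixes for every run is what keeps the same messages from always being left out of training.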

An interesting thing to note is the asymmetry in the accuracy: CRM114 has never identified a spam message incorrectly. This is because the training mechanism is skewed towards letting a few spam messages slip through, rather than letting a good message slip into the spam folder. I like that. So, here are the accuracy numbers for CRM114; adding SpamAssassin to the mix only improves the numbers.

Also, I have always felt that a freshly learned css file is somewhat brittle, in the sense that if one trains an unsure message and then tries to TUNE (Train Until No Errors) the css file, a large number of runs through the training set are needed before the thing stabilizes. It is as if the initial learning was minimalistic, and adding the information for the new unsure message required all kinds of tweaking. After a while of TOEing (Training on Errors) and TUNEing, this brittleness seems to get hammered out of the css files. I also expect accuracy to rise as the css files get less brittle. The table below starts with data from a newly minted .css file, and as you can see, the accuracy climbs, especially after I switch to reserving 10% of mails for the accuracy run.
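For readers unfamiliar with the two terms, the difference between TOE and TUNE can be sketched in a few lines of Python. This is a generic illustration, not CRM114's implementation: `classify` and `learn` are hypothetical callables, and the toy "model" below is a plain dictionary that memorizes messages.

```python
def tune(corpus, classify, learn, max_passes=20):
    """Repeat train-on-error passes until one full pass has no errors.

    corpus   -- list of (text, label) pairs
    classify -- hypothetical callable: text -> predicted label
    learn    -- hypothetical callable: (text, label) -> None, updates the model
    Returns the number of errors seen in each pass.
    """
    errors_per_pass = []
    for _ in range(max_passes):
        errors = 0
        for text, label in corpus:
            if classify(text) != label:  # TOE: train only on mistakes
                learn(text, label)
                errors += 1
        errors_per_pass.append(errors)
        if errors == 0:  # TUNE: stop once a whole pass is clean
            break
    return errors_per_pass

# Toy stand-in for a classifier: remembers exact texts, defaults to "good".
memory = {}
corpus = [
    ("cheap pills", "spam"),
    ("meeting at noon", "good"),
    ("win money", "spam"),
]
history = tune(corpus, lambda text: memory.get(text, "good"), memory.__setitem__)
print(history)  # [2, 0]
```

A real statistical classifier does not memorize messages, so learning one message can disturb the classification of others. That is why many passes may be needed before a clean one, which is the brittleness described above.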

Accuracy numbers and validation regexps
Columns, in order: Date; Corpus size; Ham count, correct, accuracy (%); Spam count, correct, accuracy (%); Overall count, correct, accuracy (%); Validation regexp
Wed Oct 31 10:22:23 UTC 2007 43319 492 482 97.967480 374 374 100.000000 866 856 98.845270 [1][6][_][_]|[0][3][_][_]
Wed Oct 31 17:32:44 UTC 2007 43330 490 482 98.367350 378 378 100.000000 868 860 99.078340 [3][7][_][_]|[2][3][_][_]
Thu Nov 1 03:01:35 UTC 2007 43334 491 483 98.370670 375 375 100.000000 866 858 99.076210 [2][0][_][_]|[7][9][_][_]
Thu Nov 1 13:47:55 UTC 2007 43345 492 482 97.967480 376 376 100.000000 868 858 98.847930 [1][2][_][_]|[0][2][_][_]
Sat Nov 3 18:27:00 UTC 2007 43390 490 480 97.959180 379 379 100.000000 869 859 98.849250 [4][1][_][_]|[6][4][_][_]
Sat Nov 3 22:38:12 UTC 2007 43394 491 482 98.167010 375 375 100.000000 866 857 98.960740 [3][1][_][_]|[7][8][_][_]
Sun Nov 4 05:49:45 UTC 2007 43400 490 483 98.571430 377 377 100.000000 867 860 99.192620 [4][6][_][_]|[6][8][_][_]
Sun Nov 4 13:35:15 UTC 2007 43409 490 485 98.979590 377 377 100.000000 867 862 99.423300 [3][7][_][_]|[7][9][_][_]
Sun Nov 4 19:22:02 UTC 2007 43421 490 486 99.183670 379 379 100.000000 869 865 99.539700 [7][2][_][_]|[9][4][_][_]
Mon Nov 5 05:47:45 UTC 2007 43423 490 489 99.795920 378 378 100.000000 868 867 99.884790 [4][0][_][_]|[8][3][_][_]
Mon Nov 5 16:39:54 UTC 2007 43441 493 489 99.188640 376 376 100.000000 869 865 99.539700 [1][2][_][_]|[6][7][_][_]
Mon Nov 5 18:56:18 UTC 2007 43445 2459 2444 99.390000 1890 1890 100.000000 4349 4334 99.655090 [0][_][_]
Tue Nov 6 05:56:47 UTC 2007 43461 2457 2434 99.063900 1892 1892 100.000000 4349 4326 99.471140 [4][_][_]
Tue Nov 6 07:55:09 UTC 2007 43461 2458 2440 99.267700 1884 1884 100.000000 4342 4324 99.585440 [8][_][_]
Tue Nov 6 16:32:17 UTC 2007 43478 2461 2442 99.227960 1886 1886 100.000000 4347 4328 99.562920 [3][_][_]
Tue Nov 6 17:21:45 UTC 2007 43479 2461 2440 99.146690 1885 1885 100.000000 4346 4325 99.516800 [9][_][_]
Tue Nov 6 18:17:36 UTC 2007 43490 2459 2436 99.064660 1892 1892 100.000000 4351 4328 99.471390 [4][_][_]
Tue Nov 6 20:46:07 UTC 2007 43492 2462 2440 99.106420 1885 1885 100.000000 4347 4325 99.493900 [9][_][_]
Tue Nov 6 21:44:01 UTC 2007 43492 2458 2444 99.430430 1891 1891 100.000000 4349 4335 99.678090 [1][_][_]
Wed Nov 7 03:21:18 UTC 2007 43505 2464 2444 99.188310 1887 1887 100.000000 4351 4331 99.540340 [3][_][_]
Sat Nov 10 06:48:07 UTC 2007 43583 2469 2449 99.189960 1887 1887 100.000000 4356 4336 99.540860 [6][_][_]
Sat Nov 10 07:35:07 UTC 2007 43584 2467 2447 99.189300 1889 1889 100.000000 4356 4336 99.540860 [9][_][_]
Sat Nov 10 08:08:54 UTC 2007 43585 2468 2452 99.351700 1890 1890 100.000000 4358 4342 99.632860 [7][_][_]
Sat Nov 10 16:28:48 UTC 2007 43595 2468 2448 99.189630 1892 1892 100.000000 4360 4340 99.541280 [3][_][_]
Sat Nov 10 19:38:36 UTC 2007 43600 2468 2453 99.392220 1894 1894 100.000000 4362 4347 99.656120 [5][_][_]
Sat Nov 10 21:01:03 UTC 2007 43601 2468 2448 99.189630 1892 1892 100.000000 4360 4340 99.541280 [3][_][_]
Sat Nov 10 21:44:27 UTC 2007 43601 2467 2452 99.391970 1897 1897 100.000000 4364 4349 99.656280 [0][_][_]
Sun Nov 11 08:26:01 UTC 2007 43603 2470 2450 99.190280 1889 1889 100.000000 4359 4339 99.541180 [2][_][_]
Mon Nov 12 23:05:50 UTC 2007 43638 2469 2455 99.432970 1899 1899 100.000000 4368 4354 99.679490 [0][_][_]
Tue Nov 13 17:04:17 UTC 2007 43655 2472 2453 99.231390 1891 1891 100.000000 4363 4344 99.564520 [6][_][_]
Tue Nov 13 20:27:50 UTC 2007 43655 2466 2448 99.270070 1899 1899 100.000000 4365 4347 99.587630 [1][_][_]
Tue Nov 13 21:20:38 UTC 2007 43655 2471 2452 99.231080 1894 1894 100.000000 4365 4346 99.564720 [7][_][_]
Wed Nov 14 07:37:41 UTC 2007 43670 2469 2451 99.270960 1894 1894 100.000000 4363 4345 99.587440 [8][_][_]
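Each accuracy figure in the table is the number of correctly classified messages divided by the number of messages tested, expressed as a percentage. The following Python sketch re-derives the figures for the first row; the parsing assumes the whitespace-separated layout shown above (six tokens of date, then the corpus size, then the nine numeric columns, then the regexp), and the function names are illustrative only.

```python
def accuracy(correct, count):
    """Percentage of tested messages that were classified correctly."""
    return 100.0 * correct / count

def parse_row(line):
    """Split one table row into named fields (layout assumed as described)."""
    tokens = line.split()
    return {
        "date": " ".join(tokens[:6]),
        "corpus_size": int(tokens[6]),
        "ham": (int(tokens[7]), int(tokens[8]), float(tokens[9])),
        "spam": (int(tokens[10]), int(tokens[11]), float(tokens[12])),
        "overall": (int(tokens[13]), int(tokens[14]), float(tokens[15])),
        "regexp": tokens[16],
    }

row = parse_row(
    "Wed Oct 31 10:22:23 UTC 2007 43319 492 482 97.967480 "
    "374 374 100.000000 866 856 98.845270 [1][6][_][_]|[0][3][_][_]"
)

for name in ("ham", "spam", "overall"):
    count, correct, reported = row[name]
    print(f"{name}: reported {reported:.4f}, recomputed {accuracy(correct, count):.4f}")

# Share of the corpus that was held back for this validation run
print(f"reserved: {100.0 * row['overall'][0] / row['corpus_size']:.2f}%")
```

For this row the held-back share works out to about 2% of the corpus, which matches the two two-digit alternatives in its regexp.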