I have often blogged about the efficiency of my spam filtering setup. I've claimed that my combined CRM114 and Spamassassin setup is 99.92% accurate (99.98% when both my classifiers are sure), but I have had very little data to back that up. I still have not gotten around to automating the measurement of my setup's overall accuracy, but I now have some figures for one of the two classifiers in my system. Here is the data from CRM114.
First, some context: when training CRM114 using the
mailtrainer command, one can hold back a certain percentage
of the training set during the learn phase, and then run a
second pass over the held-back mails to test the accuracy
of the training. You do this by specifying a regular
expression to match the file names. Since my training set has
message numbers, it was simple to use the least significant two
digits as a regexp; but I did not like the idea of always
leaving out the same messages. So I now generate four sets of
numbers for every training run, and can optionally reserve 0%,
1%, 2%, 4%, or 10% of all mails for the accuracy run. Usually
I train with 0% reserved. When the css files are new and still
changing a lot, I measure accuracy with 2% reserved, and later
on, I reserve 10% in the accuracy test runs.
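As a rough illustration of the idea, here is a minimal Python sketch that picks a random set of two-digit suffixes for each run and builds a regexp from them. It is not my actual script: the function name `holdout_regex`, the `msg-NNNNNN` file names, and the `(?:..|..)$` pattern shape are all assumptions made for the example, and it says nothing about how mailtrainer itself expects the expression to be passed.

```python
import random
import re


def holdout_regex(percent, rng=random):
    """Return a compiled regex matching names whose last two digits
    fall in a randomly chosen set (hypothetical helper).

    Each two-digit suffix covers 1% of sequentially numbered messages,
    so reserving `percent` percent means choosing `percent` suffixes.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent == 0:
        return None  # nothing reserved: train on everything
    suffixes = rng.sample(range(100), percent)
    alternatives = "|".join(f"{s:02d}" for s in sorted(suffixes))
    return re.compile(rf"(?:{alternatives})$")


# Hypothetical file names: a prefix plus a zero-padded message number.
names = [f"msg-{n:06d}" for n in range(10000)]

pattern = holdout_regex(2)
reserved = [name for name in names if pattern.search(name)]
print(len(reserved))  # 200 of the 10000 names, i.e. 2%
```

Because the suffixes are drawn afresh on every call, a different slice of the corpus is held back on each run, which is the point of not hard-coding the expression.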
An interesting thing to note is the asymmetry in the accuracy:
CRM114 has never identified a Spam message incorrectly. This is
because the training mechanism is skewed towards letting a few
spam messages slip through, rather than letting a good message slip
into the spam folder. I like that. So, here are the accuracy
numbers for CRM114; adding Spamassassin into the mix only
improves the numbers. Also, I have always felt that a freshly
learned css file is somewhat brittle, in the sense that if one
trains on an unsure message and then tries to TUNE (Train
Until No Errors) the css file, a large number of runs through
the training set are needed before the thing stabilizes. So it
is as if the learning done initially was minimalistic, and
adding the information for the new unsure message required all
kinds of tweaking. After a while of TOEing (Training on Errors)
and TUNEing, this brittleness seems to get hammered out of the
css files. I also expect to see accuracy rise as the css files
get less brittle. The table below starts with data from a newly
minted .css file, and as you can see, the accuracy climbs,
especially after I switch to reserving 10% of mails for the
accuracy run.
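For readers who have not met TOE and TUNE before, the following Python sketch shows the shape of the two loops. It is only a toy: `ToyClassifier` is a made-up word-count classifier standing in for CRM114, the four-message corpus is invented, and `tune` is a hypothetical helper, not anything mailtrainer provides.

```python
from collections import Counter


class ToyClassifier:
    """Made-up stand-in for a real classifier: scores by word counts."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}

    def classify(self, text):
        tokens = text.split()
        score = {
            label: sum(seen[t] for t in tokens)
            for label, seen in self.counts.items()
        }
        # Ties go to ham, so doubtful mail is never called spam.
        return "spam" if score["spam"] > score["ham"] else "ham"

    def learn(self, text, label):
        self.counts[label].update(text.split())


def tune(classifier, corpus, max_passes=50):
    """Train Until No Errors: repeat train-on-error passes over the
    corpus until one full pass makes no mistakes.

    Returns the number of passes used, or None if it never settled.
    """
    for passes in range(1, max_passes + 1):
        errors = 0
        for text, label in corpus:
            # Train On Errors: learn only what was misclassified.
            if classifier.classify(text) != label:
                classifier.learn(text, label)
                errors += 1
        if errors == 0:
            return passes
    return None


corpus = [
    ("cheap pills now", "spam"),
    ("meeting at noon", "ham"),
    ("cheap flights meeting", "ham"),
    ("pills pills cheap", "spam"),
]
classifier = ToyClassifier()
passes_needed = tune(classifier, corpus)
print(passes_needed)  # 2: one pass with errors, one clean pass
```

A single iteration of the outer loop is one TOE pass; TUNE simply keeps going until a pass is clean, which is why a brittle model can need many runs through the training set.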
| Date | Corpus Size | Ham Count | Ham Correct | Ham Accuracy (%) | Spam Count | Spam Correct | Spam Accuracy (%) | Overall Count | Overall Correct | Overall Accuracy (%) | Validation Regexp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Wed Oct 31 10:22:23 UTC 2007 | 43319 | 492 | 482 | 97.967480 | 374 | 374 | 100.000000 | 866 | 856 | 98.845270 | [1][6][_][_]\|[0][3][_][_] |
| Wed Oct 31 17:32:44 UTC 2007 | 43330 | 490 | 482 | 98.367350 | 378 | 378 | 100.000000 | 868 | 860 | 99.078340 | [3][7][_][_]\|[2][3][_][_] |
| Thu Nov 1 03:01:35 UTC 2007 | 43334 | 491 | 483 | 98.370670 | 375 | 375 | 100.000000 | 866 | 858 | 99.076210 | [2][0][_][_]\|[7][9][_][_] |
| Thu Nov 1 13:47:55 UTC 2007 | 43345 | 492 | 482 | 97.967480 | 376 | 376 | 100.000000 | 868 | 858 | 98.847930 | [1][2][_][_]\|[0][2][_][_] |
| Sat Nov 3 18:27:00 UTC 2007 | 43390 | 490 | 480 | 97.959180 | 379 | 379 | 100.000000 | 869 | 859 | 98.849250 | [4][1][_][_]\|[6][4][_][_] |
| Sat Nov 3 22:38:12 UTC 2007 | 43394 | 491 | 482 | 98.167010 | 375 | 375 | 100.000000 | 866 | 857 | 98.960740 | [3][1][_][_]\|[7][8][_][_] |
| Sun Nov 4 05:49:45 UTC 2007 | 43400 | 490 | 483 | 98.571430 | 377 | 377 | 100.000000 | 867 | 860 | 99.192620 | [4][6][_][_]\|[6][8][_][_] |
| Sun Nov 4 13:35:15 UTC 2007 | 43409 | 490 | 485 | 98.979590 | 377 | 377 | 100.000000 | 867 | 862 | 99.423300 | [3][7][_][_]\|[7][9][_][_] |
| Sun Nov 4 19:22:02 UTC 2007 | 43421 | 490 | 486 | 99.183670 | 379 | 379 | 100.000000 | 869 | 865 | 99.539700 | [7][2][_][_]\|[9][4][_][_] |
| Mon Nov 5 05:47:45 UTC 2007 | 43423 | 490 | 489 | 99.795920 | 378 | 378 | 100.000000 | 868 | 867 | 99.884790 | [4][0][_][_]\|[8][3][_][_] |
| Mon Nov 5 16:39:54 UTC 2007 | 43441 | 493 | 489 | 99.188640 | 376 | 376 | 100.000000 | 869 | 865 | 99.539700 | [1][2][_][_]\|[6][7][_][_] |
| Mon Nov 5 18:56:18 UTC 2007 | 43445 | 2459 | 2444 | 99.390000 | 1890 | 1890 | 100.000000 | 4349 | 4334 | 99.655090 | [0][_][_] |
| Tue Nov 6 05:56:47 UTC 2007 | 43461 | 2457 | 2434 | 99.063900 | 1892 | 1892 | 100.000000 | 4349 | 4326 | 99.471140 | [4][_][_] |
| Tue Nov 6 07:55:09 UTC 2007 | 43461 | 2458 | 2440 | 99.267700 | 1884 | 1884 | 100.000000 | 4342 | 4324 | 99.585440 | [8][_][_] |
| Tue Nov 6 16:32:17 UTC 2007 | 43478 | 2461 | 2442 | 99.227960 | 1886 | 1886 | 100.000000 | 4347 | 4328 | 99.562920 | [3][_][_] |
| Tue Nov 6 17:21:45 UTC 2007 | 43479 | 2461 | 2440 | 99.146690 | 1885 | 1885 | 100.000000 | 4346 | 4325 | 99.516800 | [9][_][_] |
| Tue Nov 6 18:17:36 UTC 2007 | 43490 | 2459 | 2436 | 99.064660 | 1892 | 1892 | 100.000000 | 4351 | 4328 | 99.471390 | [4][_][_] |
| Tue Nov 6 20:46:07 UTC 2007 | 43492 | 2462 | 2440 | 99.106420 | 1885 | 1885 | 100.000000 | 4347 | 4325 | 99.493900 | [9][_][_] |
| Tue Nov 6 21:44:01 UTC 2007 | 43492 | 2458 | 2444 | 99.430430 | 1891 | 1891 | 100.000000 | 4349 | 4335 | 99.678090 | [1][_][_] |
| Wed Nov 7 03:21:18 UTC 2007 | 43505 | 2464 | 2444 | 99.188310 | 1887 | 1887 | 100.000000 | 4351 | 4331 | 99.540340 | [3][_][_] |
| Sat Nov 10 06:48:07 UTC 2007 | 43583 | 2469 | 2449 | 99.189960 | 1887 | 1887 | 100.000000 | 4356 | 4336 | 99.540860 | [6][_][_] |
| Sat Nov 10 07:35:07 UTC 2007 | 43584 | 2467 | 2447 | 99.189300 | 1889 | 1889 | 100.000000 | 4356 | 4336 | 99.540860 | [9][_][_] |
| Sat Nov 10 08:08:54 UTC 2007 | 43585 | 2468 | 2452 | 99.351700 | 1890 | 1890 | 100.000000 | 4358 | 4342 | 99.632860 | [7][_][_] |
| Sat Nov 10 16:28:48 UTC 2007 | 43595 | 2468 | 2448 | 99.189630 | 1892 | 1892 | 100.000000 | 4360 | 4340 | 99.541280 | [3][_][_] |
| Sat Nov 10 19:38:36 UTC 2007 | 43600 | 2468 | 2453 | 99.392220 | 1894 | 1894 | 100.000000 | 4362 | 4347 | 99.656120 | [5][_][_] |
| Sat Nov 10 21:01:03 UTC 2007 | 43601 | 2468 | 2448 | 99.189630 | 1892 | 1892 | 100.000000 | 4360 | 4340 | 99.541280 | [3][_][_] |
| Sat Nov 10 21:44:27 UTC 2007 | 43601 | 2467 | 2452 | 99.391970 | 1897 | 1897 | 100.000000 | 4364 | 4349 | 99.656280 | [0][_][_] |
| Sun Nov 11 08:26:01 UTC 2007 | 43603 | 2470 | 2450 | 99.190280 | 1889 | 1889 | 100.000000 | 4359 | 4339 | 99.541180 | [2][_][_] |
| Mon Nov 12 23:05:50 UTC 2007 | 43638 | 2469 | 2455 | 99.432970 | 1899 | 1899 | 100.000000 | 4368 | 4354 | 99.679490 | [0][_][_] |
| Tue Nov 13 17:04:17 UTC 2007 | 43655 | 2472 | 2453 | 99.231390 | 1891 | 1891 | 100.000000 | 4363 | 4344 | 99.564520 | [6][_][_] |
| Tue Nov 13 20:27:50 UTC 2007 | 43655 | 2466 | 2448 | 99.270070 | 1899 | 1899 | 100.000000 | 4365 | 4347 | 99.587630 | [1][_][_] |
| Tue Nov 13 21:20:38 UTC 2007 | 43655 | 2471 | 2452 | 99.231080 | 1894 | 1894 | 100.000000 | 4365 | 4346 | 99.564720 | [7][_][_] |
| Wed Nov 14 07:37:41 UTC 2007 | 43670 | 2469 | 2451 | 99.270960 | 1894 | 1894 | 100.000000 | 4363 | 4345 | 99.587440 | [8][_][_] |
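The accuracy columns are simply correct/count expressed as a percentage, with the overall figure computed from the summed ham and spam counts. Here is a small Python sketch that reproduces the first row; the function name `accuracy` is just for illustration, and since the table's figures appear to carry about seven significant digits, the comparison is made to five decimal places.

```python
def accuracy(correct, count):
    """Percentage of messages classified correctly."""
    return 100.0 * correct / count


# Counts taken from the first row of the table (Wed Oct 31 10:22:23).
ham_count, ham_correct = 492, 482
spam_count, spam_correct = 374, 374

ham = accuracy(ham_correct, ham_count)
spam = accuracy(spam_correct, spam_count)
overall = accuracy(ham_correct + spam_correct, ham_count + spam_count)

print(f"{ham:.5f} {spam:.5f} {overall:.5f}")
# prints: 97.96748 100.00000 98.84527
```

Note that the overall figure is weighted by message counts, so it is not the plain average of the ham and spam percentages.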