Tales from the Gryphon

Archives for 2007

Thursday 31 January
2008
Link: Fear is the key

Posted early Thursday morning, January 31st, 2008

Fear is the key

This is an old favorite from Alistair MacLean, and I'm revisiting this after some 20 odd years since the last time I read it. This book is from a time period towards the end of his good phase, and is heavier on the action and lighter on plot, and the amount of suspension of disbelief required is correspondingly higher than for his best, but this is still very good. Perhaps I don't like it just because it is so sad. Anyway, this is one of a set of MacLean novels I picked up recently, and some of the others are more of a guilty pleasure than this one.

Manoj

Thursday 31 January
2008
Link: The Messenger

Posted early Thursday morning, January 31st, 2008

The Messenger

Yet another of Silva's books on Allon. This is the first one in which Allon is active in the Office from the beginning, and it has one of the most complex plot lines for Silva. Allon's relationships advance as well. A very enjoyable caper.

Manoj

Thursday 31 January
2008
Link: The Prince of Fire

Posted early Thursday morning, January 31st, 2008

The Prince of Fire

This novel brings Allon's past, and present, deeper into the limelight; and Silva's writing is beginning to take on the human touch that his earliest works lacked. He invites the reader to invest in the character far more now than he used to, and I think that makes his work better. I like how he is also beginning to bring Leah back alive again, a little at a time. And how this plays with Chiara.

Manoj

Monday 24 December
2007
Link: A death in Vienna

Posted early Monday morning, December 24th, 2007

A death in Vienna

The final part of the trilogy on the Nazi escapees and the collaborators that helped them. This book takes aim at the Swiss banking system, flush with Jewish gold, and the ways in which war criminals were helped and ensconced into high finance society post-WW-II. Again, I think this trilogy leads back to the original book I read by Silva (the most recent, it turns out).

Silva's writing definitely seems to be maturing as he goes along, and the characters get to be more human with every book. This is a writer still coming into his own; and while the trade craft might not yet be up to Smiley's standards, it is not very far off.

Manoj

Monday 24 December
2007
Link: The Choice of the cat

Posted early Monday morning, December 24th, 2007

The Choice of the cat

The second in the Vampire Earth series by E. E. Knight, this book has our protagonist moving to the next step in the genetically augmented "hunter" clans of the human resistance. Well written, though the next stage of the character development would require the people to actually get into some relationships. Apart from that, the series is moving well along, with other team members being brought into the fold.

Manoj

Monday 24 December
2007
Link: The English assassin

Posted early Monday morning, December 24th, 2007

The English assassin

This is the first of a trilogy that Silva has written around the Nazi looting of the wealth of the Jews of Europe, specifically the art, in this book. This was an excellent read; I think Silva might have found his groove talking about the broader historical context of the Israeli intelligence service operations and the hunting of those responsible for the holocaust.

I certainly recommend this book.

Manoj

Monday 24 December
2007
Link: The confessor

Posted early Monday morning, December 24th, 2007

The confessor

In this, the second of his trilogy exploring the various ways in which Nazi war criminals tried to escape the consequences of their actions during the holocaust, Silva explores the inaction, or perhaps even the tacit cooperation, of the Catholic church, during the holocaust. I suppose this is likely to be controversial. It was a compelling read.

Manoj

Monday 24 December
2007
Link: The kill artist

Posted early Monday morning, December 24th, 2007

The kill artist

The first Gabriel Allon novel by Daniel Silva. It feels much better than the Michael Osborne series; I can begin to feel that Gabriel is human. Tariq feels a little too super human; but the book is fast paced, and better than the run of the mill fare, though only marginally so. The bad guy is still one of the jihadists like those that we met in The Secret Servant.

Manoj

Monday 24 December
2007
Link: The marching season

Posted early Monday morning, December 24th, 2007

The marching season

Yet another book by Daniel Silva, and I am even more disappointed. The pacing has fallen off; the characters are even more pasteboard; and the book is far less compelling than the Gabriel Allon book that drew me to Silva. This is a far, far cry from Smiley's circus.

Manoj

Monday 24 December
2007
Link: The mark of the assassin

Posted early Monday morning, December 24th, 2007

The mark of the assassin

Based on The Secret Servant, I ordered a whole slew of books from the same author, including this one. I am still impressed by the geopolitical insight; a few years before 9/11, the author sets up a reasonable facsimile by having evil doers blow up a plane taking off from New York; though the evil doers in question are bunches of cold war nostalgic members of the military-intelligence-industrial collective.

The action is still fast paced, though I am far less impressed by the characters than I was in The Secret Servant. Michael and Elizabeth seem to be two dimensional cut outs, not people; the person I felt most in touch with was the evil doer assassin.

Manoj

Monday 24 December
2007
Link: The way of the wolf

Posted early Monday morning, December 24th, 2007

The way of the wolf

This is a Sci-Fi series from E. E. Knight. It is not often that one comes across a brand new series in fantasy or science fiction; and more rarely still when it has the quality of this one (The Vampire Earth series). This is a post-apocalyptic novel, the apocalypse being a virus that killed most of the human population, unleashed by a gate-travelling extra-solar species to disrupt human resistance as they took over. Ostensibly about vampires, it provides an interesting back story to explain "master" vampires and their reaper thralls.

What was captivating about this book is the detailed and generally coherent world building, the swaths of land under outsider control, where there is law and order and culling of humans for food; and the rag tag resistance. The characters are fairly well developed (though the author shies away from romantic relationships of any kind).

Not since the Recluse novels have I felt this way about a new series.

Manoj

Saturday 15 December
2007
Link: Ankur

Posted early Saturday morning, December 15th, 2007

Ankur

This is one of the very first "art" films from Indian cinema, and in some senses the very antithesis of a Bollywood movie. It explores things like caste, parent child relationships, and aspects of rural society in India, and much, much, more. And while doing so, it manages to tell a tale that draws you into it, and into caring for the characters. Pretty amazing.

Read all about Ankur, thanks to Wikipedia.

Manoj

Saturday 15 December
2007
Link: Flags of our fathers

Posted early Saturday morning, December 15th, 2007

Flags of our fathers

This was a good movie. Perhaps I'm guilty of heresy in saying this, but it was not a great movie. It did come as somewhat of a surprise to me that the movie was about the flag raising (a humdrum chore when it was done) photograph, not about Iwo Jima or the Marine Corps or much about the war (apart from what led to the flag raising -- mostly to provide a contrast to the actual flag raising). The impression I took away from it, the thing that made the most impact on me, was the sheer desperation of the fund raising, trying to get a weary and cynical population to buy war bonds.

Clint Eastwood is to be commended on an unusual take on the sordid details of war -- there are no drum beats going in this movie.

Manoj

Sunday 25 November
2007
Link: 300, and the history channel perspective.

Posted early Sunday morning, November 25th, 2007

300, and the history channel perspective.

Yes, this is about a movie based on a comic based on a movie from the 50's. And they did a wonderful job of conveying the comic book feel -- and yet, though you could appreciate the abstract, stylized presentation of the comic, most of the movie still came straight from Herodotus. The training of the Spartans, the throwing of the Persian emissaries into a pit and a well -- this cleaving to the historic details was a pleasant surprise. The history channel presentation is recommended for the perspective it brings to the tale.

There were some poetic licenses -- the whole bit about a highly placed Spartan traitor was made out of whole cloth; and the current conventional wisdom is that Leonidas went to Thermopylae because of his religious beliefs, and conviction about the sacred prophecy of the oracle at Delphi, not because he thought Persia would destroy Greece (remember, Xerxes won, and sacked Athens). Indeed, there was little concept of "Greece" at that point.

Indeed, the whole shtick about the last stand at Thermopylae saving democracy seems suspect -- the stand bloodied Persia's nose, and delayed them by perhaps 5 days -- in an advance that took the better part of a year that the Greeks knew about. No, it was the combination of Marathon, Thermopylae, Salamis, Plataea -- over the course of half a century -- that ensured that the no name David of the Greek city states survived against the Goliath of Persia. And, then, of course, came the boy wonder out of Macedonia.

Highly recommended.

Manoj

Sunday 25 November
2007
Link: Filtering Accuracy: Brown paper bag time

Posted early Sunday morning, November 25th, 2007

Filtering Accuracy: Brown paper bag time

After posting about filtering accuracy I got to thinking about the test I was using. It appeared to me that there should be no errors in the mails that crm114 had already been trained upon -- but here I was, coming up with errors when I trained the css files until there were no errors, and then used a reg exp that tried to find the accuracy of classification for all the files, not just new ones. This did not make sense.

The only explanation was that my css files were not properly created -- and I thought to try an experiment where instead of trying to throw my whole corpus as one chunk at a blank css file, I would feed the corpus in chunks. I came up with an initialization script to feed my corpus to a blank css file in 200 mail chunks; and, while it was at it, renumber the corpus mails (over the years, as I cleaned the corpus, gaps had appeared in the numbering). I have also updated the retraining script.
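A minimal sketch of the two ideas in that paragraph, renumbering a corpus so it has no gaps and then handing it out in 200 mail batches. It is not the initialization script mentioned above: the function names, the zero-padded five-digit naming, and the sample corpus are all assumptions made for illustration, and the call that would hand each batch to the trainer is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def renumber(names: Iterable[str], width: int = 5) -> List[Tuple[str, str]]:
    """Pair each old mail name with a new, gap-free, zero-padded number."""
    ordered = sorted(names)
    return [(old, f"{i:0{width}d}") for i, old in enumerate(ordered, start=1)]


def chunks(items: Iterable[str], size: int = 200) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Hypothetical corpus whose numbering has developed gaps over time.
    corpus = [f"{n:05d}" for n in (1, 2, 5, 9, 10, 42)]
    mapping = renumber(corpus)
    for batch in chunks([new for _, new in mapping], size=4):
        # A real script would pass each batch to the trainer at this point.
        print(len(batch), batch[0], batch[-1])
```

Feeding in batches keeps each training pass small, which is the property the experiment above relies on; the batch size of 200 comes from the post, everything else is a placeholder.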

Voila. I am getting a new set of css files which do not appear to show any errors for mails crm114 has already learned about -- in other words, for mails it has seen, the accuracy is now 100%, not 99.5% as it was a couple of days ago.

While it is good news in that my classification accuracy is better than it was last week; the bad news is that I no longer have concrete numbers on accuracy for crm114 -- the mechanism used now gives 100% accuracy all the time. The funny thing is, I recall going through this analysis a couple of years ago, where I predicted that one could only check for accuracy with a test corpus that had the same characteristics as real mail inflow, and which had not been used for learning. That would mean I would have a classified testing corpus that could improve the efficiency of my filter, but was being held back to provide accuracy numbers -- I have gone for improving the filter, at the cost of knowing how accurate it actually is.

Manoj

Sunday 25 November
2007
Link: The children of men

Posted early Sunday morning, November 25th, 2007

The children of men

A nicely paced movie about a bleak future, and how people cope with despair and desperate times. While it did not quite come together in the details (anything outside of England was a big unknown blur), and the London of 2027 seemed not much different from any current day city under semi martial law (technology, for instance, seems to have frozen at today's levels), it was still fast paced, and enjoyable, and anyway, this is not primarily a sci-fi flick.

Recommended.

Manoj

Sunday 18 November
2007
Link: Eragon

Posted early Sunday morning, November 18th, 2007

Eragon

I liked the book. Sure, it is "The Lord of the Rings" meets "Star Wars", but, the book had a nice flow -- and it was written by a fifteen year old, fer gawds sake. The very fact that he can turn out a page turner of a book when others of his age can't string together a grammatical sentence spelled correctly is amazing. Overall, derivative, unoriginal, and simplistic though the book is, it has an original charm -- a very good book for children, and one that adults can read through as well.

So I went to this movie with high hopes. What a let down. This was merely a notch above the Beowulf debacle. Lackluster performances, a bland drudge of a movie, with all kinds of interesting elements and nuances from the book removed. Crude, unimaginative, ham handed performances all around. The plot line, which did not follow the book, was dumbed down, there were implications that the Elven princess was a potential love interest (faugh), and the refreshing pace of the book fell off to a plodding soporific caricature. It is an offense to the book, and to the author.

I was going to point out the differences between the movie and the book; and why the differences made the movie worse, but after 30 or so items this post would have gotten to be too big. And, having written it, I have the release of the rant, so I no longer have to include it here. Anyway, Wikipedia says that the film came in at #235 in the all time worldwide box office chart but was met with dismal critical reviews, scoring only a 16% composite score on Rotten Tomatoes.

I feel sorry for you if you suffered through this, as did I.

Manoj

Sunday 18 November
2007
Link: The movie vaguely resembling Beowulf: an IMAX 3d experience

Posted early Sunday morning, November 18th, 2007

The movie vaguely resembling Beowulf: an IMAX 3d experience

This should really be titled "A movie vaguely representing Beowulf, but all sexed up with various salacious elements". Hrothgar was treated much better in the original; and all the blatant and gratuitous sexuality brought into the movie was a turn off. But then, I might be in the minority of the audience who had any familiarity with the poem.

The characters in the movie seemed two dimensional caricatures (the only compelling performance was from Grendel's mother). And the changes made to the story line also lost the prowling menace of the latter years of the king of the Geats.

After watching Hollywood debacles like this one, I am driven to wonder about why Hollywood writers seem to think they can so improve upon the work of writers whose story has stood the test of time. Making Beowulf into a boastful liar and cheat (even in the tale of the sea monsters -- his men imply that they knew their lord was a liar) -- in an age where honor and battle prowess were everything -- I mean, what were the producers thinking?

Most certainly not a movie I am going to recommend.

I had not researched the movie much before I went into the show, and it was a surprise to me to see that this was an animated movie a la "Final Fantasy", and while I was impressed with the computer graphics (reflections in general, and reflections of ripples in the water were astounding), the not a cartoon but not a realistic movie experience was a trifle distracting, and detracted from telling the tale.

I like IMAX 3d, and the glasses are improving.

Manoj

Tuesday 13 November
2007
Link: Deeds of Paksenarrion: III

Posted early Tuesday morning, November 13th, 2007

Deeds of Paksenarrion: III

Oath of Gold rounds up this excellent fantasy series from Elizabeth Moon. It is a pity that she never came back to this character (though she wrote a couple of prequels), despite the fact that the ending paragraph leaves ample room for sequels "... when the call of Gird came, Paksenarrion left for other lands."

This is high fantasy in true Tolkien manner, but faster paced, more gritty, and with characters one could relate to. I am already looking forward to my next re-read of the series.

Manoj

Monday 12 November
2007
Link: Deeds of Paksenarrion: II

Posted early Monday morning, November 12th, 2007

Deeds of Paksenarrion: II

Divided Allegiance is the middle of the trilogy, the one that I hate reading. Not because Ms Moon's book is bad, which it is not; it is still as gripping as the others, and comes closer to the high fantasy of Tolkien -- it is just that I hate what happens to Paks in the book, and the fact that the book ends, leaving her in that state. I guess I am a wimp when it comes to some things that happen to characters I am identifying with. However, it has been so long since I read the series that I have begun to forget the details, so I went through and read it anyway.

This is a transition book: the Deeds of Paksenarrion was about Paksenarrion the line warrior, and the final book is where she becomes the stuff of legends. I usually read the first and last here.

Manoj

Saturday 10 November
2007
Link: Adventures in the windy city

Posted early Saturday morning, November 10th, 2007

Adventures in the windy city

I have just come back from a half week stay at the Hilton Indian Lakes resort (which is the second time in a month that I have stayed at a golf resort and club and proceeded to spend 9 hours a day in a windowless conference room). On Thursday night, an ex Chicago native wanted to show us the "traditional" Chicago pizza (which can be delivered, half cooked, and frozen, via Fed-Ex, anywhere in the lower 48). Google Maps to the rescue! One of the attendees had a car, and we piled in and drove to the nearest pizzeria. It was take out only. We headed to the next on the list, again to be met with disappointment; since making the pizza takes the best part of an hour, and we did not want to be standing out in a chilly parking lot while they made our pizza. So, I strongly advocated going to Tapas Valencia instead, since I had never had tapas before.

Somewhat to our disappointment, they served tapas only as an appetizer, and had a limited selection; so we ended up ordering one tapas dish (I had beef kabobs with a garlic horseradish sauce and caramelized onions), and my very first paella (paella valencia), with shrimp, mussels, clams, chicken, and veggies. We ate well, and headed back to the hotel. As we parked, and started for the gate, I realized I no longer had my wallet with me -- so back to the restaurant we went. The waiter had not found the wallet. Nor had the busboy. The owner/hostess suggested perhaps it was in the parking lot? So we all went and combed the parking lot -- once, twice.

At this point I am beginning to think about the consequences --- I can't get home, because I can't get into the airport, since I have no ID. I have no money, but Judy can't wire the money to me via western union -- because I have no ID. I need money to buy greyhound tickets to get home on a bus ... and then there is the cancelling credit cards, etc. Panic city.

While I was on my fourth circuit of the parking lot, the owner went back -- and checked the laundry chute. I had apparently carelessly draped the napkin over my wallet when paying the tab, and walked away -- and the busboy just grabbed all the napkins, wallet and all, and dumped it down the chute. Judy suggests I carry an alternate form of ID and at least one credit card in a different location than my wallet for future trips.

If that was not excitement enough, yesterday, I got on the plane home, uneventfully enough. We took off, and I was dozing comfortably, when there were two loud bangs, and the plane juddered and listed to port. There was a smell of burning rubber, and we stopped gaining altitude. After making a rough about turn with the left wing down, the pilot came on the intercom to say "We just lost our left engine, and we are returning to O'Hare. We should be in the ground in two minutes". Hearing the "in", a guy up front started hyperventilating, and his wife was rubbing his back. My feelings were mostly of exasperation: I had just managed to get myself situated comfortably, and now lord only knows when we would get another aircraft. When we landed, the nervous dude reached over and kissed his wife like he had just escaped the jaws of death. And he asked if any of us knew statistics, and if we were fine now. (I was tempted to state that statistics are not really predictive, but hey). It was all very pretty, with six fire engines rushing over and spraying us with foam and all. When we got off the plane the nervous dude headed straight to some chairs in the terminal, and said his legs would not carry him further. He did make it to the replacement plane later, though.

Turns out it was a bird flying into the engine that caused the flameout. Well, at least I have a story to tell, though it delayed getting home by about three hours.

Manoj

Saturday 10 November
2007
Link: The Secret Servant

Posted early Saturday morning, November 10th, 2007

The Secret Servant

I bought this book by Daniel Silva last week at SFO, faced with a long wait for the red eye back home, since I recalled hearing about it on NPR, and reading a review in Time magazine, or was it the New Yorker? Anyway, the review said he is his generation's finest writer of international intrigue, one of America's most gifted spy novelists ever. I guess Graham Greene and John le Carre belong to an older generation. Anyway, everything I read or heard about it was very positive.

Daniel Silva is far less cynical than Le Carre, and his world does not gel quite as well, to my ears, as Smiley's circus did. The hero, Gabriel Allon, does have some super human traits, but, thank the lord, is not James Bond. I was impressed by Silva's geo-politics, though - paragraphs from the book seem to appear almost verbatim in current event reports in the International Herald Tribune and BBC stories.

I like this book (to the extent of ordering another 7 from this author from Amazon today), and appreciate the influx of new blood in the international espionage market. Lately, the genre has been treated to lackluster, mediocre knock offs of the Bourne Identity -- and the engaging pace of the original has never been successfully replicated in the sequels. And Silva's writing is better than Ludlum's.

Manoj

Thursday 08 November
2007
Link: Deeds of Paksenarrion

Posted early Thursday morning, November 8th, 2007

Deeds of Paksenarrion

Sheepfarmer's Daughter is an old favourite, which I have read lord only knows how many times. Elizabeth Moon has written a gritty, enthralling story of the making of a Paladin. This is the first book of a trilogy, and introduces us to a new universe through the eyes of a young innocent (which is a great device to introduce us to a universe from the viewpoint of someone who is not seeing it through eyes jaundiced by experience).

For me, books have always been an escape from the humdrum mundanity of everyday existence. Putting myself in the shoes of a character in the story is the whole point; and this story excels there: it is very believable. Not many people can tell a tale that comes alive, and Ms Moon is one of them. Moon is an ex-marine, and much of the detail of the military life of Paks has been drawn from her own military experience. More than just that, the world is richly drawn, and interesting.

I read this book in a hotel room in Chicago, since, as usual, there was nothing really interesting on TV, and I don't "get" the whole bar scene.

Manoj

Tuesday 06 November
2007
Link: Continuous Automated Build and Integration Environment

Posted early Tuesday morning, November 6th, 2007

Continuous Automated Build and Integration Environment

One of the things I have been tasked to do in my current assignment is to create a dashboard of the status of various software components created by different contractors (participating companies) in the program. These software components are built by different development groups, utilizing unlike toolsets, languages and tools -- though I was able to get an agreement on the VCS (subversion -- yuck). Specifically, one should be able to tell which components pass pre-build checks, compile, can be installed, and pass unit and functional tests. There should be nightly builds, as well as builds whenever someone checks in code on the "release" branches. And, of course, the dashboard should be HTTP accessible, and be bright and, of course, shiny.

My requirements were that since the whole project is not Java, there should be no dependencies on maven or ant or eclipse projects (or make, for that matter); that it should be able to do builds on multiple machines (license constraints restrict some software to Solaris or Windows), not suck up too much time from my real job (this is a service, if it is working well, you get no credit, if it fails, you are on the hot seat). And it should be something I can easily debug, so no esoteric languages (APL, haskell -- and Python :P)
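The per-component status described above (pre-build checks, compile, install, unit and functional tests) can be pictured as a list of stages, each just a command line whose exit code decides pass or fail. The sketch below is an illustration of that idea only and is not how Cabie or any of the other tools in this post works; `run_stages`, the stage list, and the placeholder commands are all hypothetical.

```python
import subprocess
import sys
from typing import Dict, List, Sequence, Tuple

# A stage is a name plus the command line that implements it.
Stage = Tuple[str, Sequence[str]]


def run_stages(stages: List[Stage]) -> Dict[str, str]:
    """Run stages in order; stages after the first failure are skipped."""
    status: Dict[str, str] = {}
    failed = False
    for name, command in stages:
        if failed:
            status[name] = "skipped"
            continue
        result = subprocess.run(command, capture_output=True, text=True)
        status[name] = "pass" if result.returncode == 0 else "fail"
        failed = result.returncode != 0
    return status


if __name__ == "__main__":
    # Placeholder commands that simply exit with 0 (success) or 1 (failure).
    ok = [sys.executable, "-c", "raise SystemExit(0)"]
    bad = [sys.executable, "-c", "raise SystemExit(1)"]
    component = [
        ("pre-build checks", ok),
        ("compile", ok),
        ("install", bad),
        ("unit tests", ok),
        ("functional tests", ok),
    ]
    for stage, state in run_stages(component).items():
        print(f"{stage:18} {state}")
```

Because each stage is an arbitrary command line, nothing here depends on maven, ant, or make, which is the constraint the requirements above spell out.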

So, using continuous integration as a google search term, I found the comparison matrix at Damage Control

I looked at anthill, and cruisecontrol, and the major drawback people seemed to think it had, that configuration was done by editing an XML file as opposed to a (by some accounts, buggy) UI, is not much of a factor for me. (See this, and also this.) I also like the fact that it seems easier to plug in other components. I am uncomfortable with free software that has a "commercial" sibling; we have been burned once by UML software with those characteristics.

Cruisecontrol, Damagecontrol, Tinderbox1 & tinderbox2, Continuum, and Sin match. I tried to see the demo versions; Sin's link led me to a site selling Myrtle Beach condos, never a good sign. Continuum and Damagecontrol were currently down, so I could not do an evaluation. So, here are the ones I could get to with working demo pages: http://cclive.thoughtworks.com/ and http://tinderbox.mozilla.org/showbuilds.cgi?tree=SeaMonkey

Cruisecontrol takes full control, checking things out of source control; and running the tests; which implies that all the software does build and run on the same machine -- this is not the case for me. Also, CC needs to publish the results/logs in XML; which seems to be a good fit for the java world; but might be a constraint for my use case.

I like the tinderbox dashboard better, based on the information presented; but that is not a major issue. It also might be better suited for a distributed, open source development model; cruisecontrol seems slightly more centralized -- more on this below. cruisecontrol is certainly more mature; and we have some experience with it. Tinderbox has a client/server model, and communicates via EMAIL to a number of machines where the actual build/testing is done. This seems good.

Then there is flamebox -- nice dashboard, derivative of tinderbox2; and seems pretty simple (perhaps too simple); and easily modifiable.

However, none of these seemed right. There was too much of an assumption of a build and test model -- and few of them seemed to be a good fit for a distributed, Grid-based software development; so I continued looking.

I finally decided on CABIE

Continuous Automated Build and Integration Environment. Cabie is a multi-platform, multi-cm client/server based application providing both command line and web-based access to real time build monitoring and execution information. Cabie builds jobs based upon configuration information stored in MySQL and will support virtually any build that can be called from the command line. Cabie provides a centralized collection point for all builds providing web based dynamic access, the collector is SQL based and provides information for all projects under Cabie's control. Cabie can be integrated with bug tracking systems and test systems with some effort depending on the complexity of those systems. With the idea in mind that most companies create build systems from the ground up, Cabie was designed to not have to re-write scripted builds but instead to integrate existing build scripts into a smart collector. Cabie provides rapid email notification and RSS integration to quickly handle build issues. Cabie provides the ability to run builds in parallel, series, to poll jobs or to allow the use of scripted nightly builds. Cabie is perfect for agile development in an environment that requires multiple languages and tools. Cabie supports Perforce, Subversion and CVS. The use of a backend broker allows anyone with perl skills to write support for additional CM systems.

The nice people at Yo Linux have provided a Tutorial for the process. I did have to make some changes to get things working (mostly in line with the changes recommended in the tutorial, but not exactly the same). I have sent the patches upstream, but upstream is not sure how much of it they can use, since there has been major progress since the last release.

The upstream is nice and responsive, and has added support in unreleased versions for using virtual machines to run the builds in (they use that to do the solaris/windows builds), improved the web interface using (shudder) PHP, and added all kinds of neat stuff.

Manoj

Monday 05 November
2007
Link: Filtering accuracy: Hard numbers

Posted early Monday morning, November 5th, 2007

Filtering accuracy: Hard numbers

UPDATE: This posting has severe flaws, which were discovered subsequently. Please ignore.

I have often posted on the accuracy of my mail filtering mechanisms on the mailing lists (I have not had a false positive in years, and I stash all discards/rejects locally, and do spot checks frequently; and I went through 6 months of exhaustive checks when I put this system in place). False negatives are down to about 3-4 a month (0.019%). Yes, that is right: I am claiming that my classification correctness record is 99.92% (99.98% accuracy for messages my classifiers are sure about). Incorrectly classified unsure ham is about 3-4 (0.019%) a month; incorrectly classified unsure Spam is roughly the same, perhaps a little higher. Adding these to the incorrect classification, my best estimate of not confidently classified mail is 0.076%, based on the last 60 days of data (which is what gets you the 99.92%).

I get unsure/retrain messages at the rate of about 20 a day (about 3.2% of non-spam email) -- about 2/3rds of which are classified correctly; but either SA and crm114 disagree, or crm114 is unsure. So I have to look at about 20 messages a day to see if a ham message slipped in there; and train my filters based on these; and the process is highly automated (just uses my brain as a classifier). The mail statistics can be seen on my mail server.

Oh, my filtering front end also switches between reject/discard and turns grey listing on and off based on whether or not the mail is coming from mailing lists/newsletters I have authorized; mimedefang-filter

However, all these numbers are manually gathered, and I still have not gotten around to automating the measurement of my setup's overall accuracy, but now I have some figures on one of the two classifiers in my system. Here is the data from CRM114. I'll update the numbers below via cron.

UPDATE: The css files used below were malformed, and the process of creating them detailed below is flawed. Please see newer postings in this category.

First, some context: when training CRM114 using the mailtrainer command, one can specify to leave out a certain percentage of the training set in the learn phase, and run a second pass over the mails so skipped to test the accuracy of the training. The way you do this is by specifying a regular expression to match the file names. Since my training set has message numbers, it was simple to use the least significant two digits as a regexp; but I did not like the idea of always leaving out the same messages. So I now generate two sets of numbers for every training run, and leave out messages with those two trailing digits, in effect reserving 2% of all mails for the accuracy run.
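A minimal sketch of the hold-out selection just described: pick two distinct two-digit endings at random and build a pattern in the same shape as the validation regexps listed in the table further down. The file naming (a five-digit message number followed by two underscores) is an assumption inferred from those regexps, and `holdout_pattern` and `split_corpus` are hypothetical helpers, not part of CRM114 or mailtrainer.

```python
import random
import re
from typing import List, Optional, Tuple


def holdout_pattern(rng: Optional[random.Random] = None) -> str:
    """Build a pattern shaped like '[1][6][_][_]|[0][3][_][_]'."""
    rng = rng or random.Random()
    # Two distinct endings reserve 2 out of every 100 message numbers.
    endings = rng.sample(range(100), 2)
    parts = []
    for ending in endings:
        tens, ones = divmod(ending, 10)
        parts.append(f"[{tens}][{ones}][_][_]")
    return "|".join(parts)


def split_corpus(names: List[str], pattern: str) -> Tuple[List[str], List[str]]:
    """Split file names into (training, held_out) using the pattern."""
    rx = re.compile(pattern)
    held_out = [n for n in names if rx.search(n)]
    training = [n for n in names if not rx.search(n)]
    return training, held_out


if __name__ == "__main__":
    # Hypothetical names: zero-padded message number, then '__'.
    names = [f"{n:05d}__.mail" for n in range(1, 1001)]
    pattern = holdout_pattern(random.Random(2007))
    training, held_out = split_corpus(names, pattern)
    print(pattern, len(training), len(held_out))
```

Drawing fresh endings for every run means a different 2% of the corpus is held out each time, which addresses the concern about always leaving out the same messages.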

An interesting thing to note is the asymmetry in the accuracy: CRM114 has never identified a Spam message incorrectly. This is because the training mechanism is skewed towards letting a few spam messages slip through, rather than let a good message slip into the spam folder. I like that. So, here are the accuracy numbers for CRM114; adding Spamassassin into the mix only improves the numbers. Also, I have always felt that a freshly learned css file is somewhat brittle -- in the sense that if one trains an unsure message, and then tries to TUNE (Train Until No Errors) the css file, a large number of runs through the training set are needed until the thing stabilizes. So it is as if the learning done initially was minimalistic, and adding the information for the new unsure message required all kinds of tweaking. After a while of TOEing (Training on Errors) and TUNEing, this brittleness seems to get hammered out of the CSS files. I also expect to see accuracy rise as the css files get less brittle -- the table below starts with data from a newly minted .css file.

Accuracy number and validation regexp

Date                          Corpus  Ham                         Spam                        Overall                     Validation
                              Size    Count  Correct  Accuracy %  Count  Correct  Accuracy %  Count  Correct  Accuracy %  Regexp
Wed Oct 31 10:22:23 UTC 2007  43319   492    482      97.967480   374    374      100.000000  866    856      98.845270   [1][6][_][_]|[0][3][_][_]
Wed Oct 31 17:32:44 UTC 2007  43330   490    482      98.367350   378    378      100.000000  868    860      99.078340   [3][7][_][_]|[2][3][_][_]
Thu Nov 1 03:01:35 UTC 2007   43334   491    483      98.370670   375    375      100.000000  866    858      99.076210   [2][0][_][_]|[7][9][_][_]
Thu Nov 1 13:47:55 UTC 2007   43345   492    482      97.967480   376    376      100.000000  868    858      98.847930   [1][2][_][_]|[0][2][_][_]
Sat Nov 3 18:27:00 UTC 2007   43390   490    480      97.959180   379    379      100.000000  869    859      98.849250   [4][1][_][_]|[6][4][_][_]
Sat Nov 3 22:38:12 UTC 2007   43394   491    482      98.167010   375    375      100.000000  866    857      98.960740   [3][1][_][_]|[7][8][_][_]
Sun Nov 4 05:49:45 UTC 2007   43400   490    483      98.571430   377    377      100.000000  867    860      99.192620   [4][6][_][_]|[6][8][_][_]
Sun Nov 4 13:35:15 UTC 2007   43409   490    485      98.979590   377    377      100.000000  867    862      99.423300   [3][7][_][_]|[7][9][_][_]
Sun Nov 4 19:22:02 UTC 2007   43421   490    486      99.183670   379    379      100.000000  869    865      99.539700   [7][2][_][_]|[9][4][_][_]
Mon Nov 5 05:47:45 UTC 2007   43423   490    489      99.795920   378    378      100.000000  868    867      99.884790   [4][0][_][_]|[8][3][_][_]

As you can see, the accuracy numbers are trending up, and already are nearly up to the values observed on my production system.
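
For readers who want to check the arithmetic, the accuracy columns are just the correct count divided by the total count of held-out messages. A small Python sketch (the function name is made up for illustration) using the counts from the first row of the table:

```python
def accuracy(correct, count):
    """Percentage of held-out messages that were classified correctly."""
    return 100.0 * correct / count


# Counts from the first row: 492 ham (482 correct), 374 spam (374 correct).
ham = accuracy(482, 492)
spam = accuracy(374, 374)
overall = accuracy(482 + 374, 492 + 374)

if __name__ == "__main__":
    print(f"ham {ham:.4f}  spam {spam:.4f}  overall {overall:.4f}")
```

The overall figure is computed from the summed counts, not by averaging the two percentages, which is why it sits closer to the ham accuracy when there are more ham than spam messages in the hold-out set.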

Manoj

Sunday 04 November
2007
Link: The White Company

Posted early Sunday morning, November 4th, 2007

The White Company

I had somehow managed to miss out on The White Company while I was growing up and devouring all of the Sherlock Holmes stories and The Lost World. This is a pity, since I would have liked this bit of the Hundred Years' War much better when I was young and uncritical.

Oh, I do like the book. The pacing is fast, if somewhat predictable. The book is well researched, leads you from one historic event to the other, and is peppered with all kinds of historical figures; I believe it to be quite authentic in its period settings. Unfortunately, there is very little character development, and though the characters are deftly sketched, they all lack depth, which would not have bothered the young me. Also, Sir John Hawkwood, of the White Company, is mentioned only briefly in passing.

This compares unfavourably with Walter Scott's Quentin Durward, set in a period less than 80 years in the future; but then, I have always had a weakness for Scott. As for Conan Doyle, The Lost World was far more gripping.

I am now looking for books about Hawkwood, a mercenary captain mentioned in this book, as well as Dickson's Childe Cycle books. The only books I have found so far on the golden age of the Condottieri are so darned expensive.

Manoj

Tuesday 21 August
2007
Link: Arch Hook

Posted early Tuesday morning, August 21st, 2007

Arch Hook

All the version control systems I am familiar with run scripts on checkout and commit to take additional site specific actions, and arch is no different. Well, actually, arch is perhaps different in the sense that arch runs a script on almost all actions, namely, the ~/.arch-params/hook script. Enough information is passed in to make this mechanism one of the most flexible I have had the pleasure to work with.

In my hook script, I do the following things:

  • On a commit, or an initial import
    • For my publicly replicated repositories (and only for my public repositories), the script creates a full source tree in the repository for every 20th commit. This can speed up subsequent calls to get for that and subsequent revisions, since users do not have to get the base version and all patches.
    • For the public repositories, the script removes older cached versions, keeping two cached versions in place. I assume there is not much demand for versions more than 40 patches out of date; and so having to download a few extra patches in that uncommon case is not a big issue.
    • If it is an ikiwiki commit, the script makes sure that it updates the checked out sources of the wiki on the webserver, and rebuilds the wiki.
    • If this is a commit to an archive for which I have a corresponding -MIRROR defined, the script updates the mirror now, and logs an informational message to the screen.
    • There is special handling for my Debian packages.
      • If the category matches one of my packages, the script looks to see if any bugs have been closed in this commit, and, if so, sends the log to the bug, and tags it fixed.
      • If the category being checked in is one that corresponds to one of my Debian packages, or to the ./debian directory that belongs to one of my packages, then the script sends a cleaned up change log by mail to packages.qa.debian.org. People can subscribe to the mailing list set up for each package to get commit logs, if they so desire.
      • Arch has the concept of a grab file, and people can get all the components of a software package by just feeding arch the grab file (either locally, or via an HTTP URL). The script makes sure that an arch config file is created, as well as a grab file (using the script arch_create_config, /software/misc/arch_create_config.html), and uploads the grab file to a public location (using the script arch_upload_grab, /software/misc/arch_upload_grab.html) mentioned in ./debian/control for all my packages.
      • For commits to the Debian policy package, the script also sends mail to the policy list with full commit logs. This is a group maintained package, so changes to this are disseminated slightly more volubly.
      • Whenever a new category, branch, or version is added to the repository corresponding to the Debian policy package, the script sends mail to the policy list. Again, changes to the Policy repository are fairly visible.
    • The script sends me mail, for archival purposes, whenever a new category or branch is created in any of my repositories (but not for every revision).
    • Additional action is taken to ensure that versions are cached in the local revision library. I am no longer sure if this is strictly needed.
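
The caching policy for public repositories (a full tree every 20th commit, keeping only the two newest cached trees) can be sketched in a few lines of Python. This only models the decision logic; it is not the hook script itself, the function names are hypothetical, and it does not call arch.

```python
CACHE_INTERVAL = 20  # cache a full source tree every 20th commit
KEEP_CACHED = 2      # number of cached trees to leave in place


def should_cache(patch_level, interval=CACHE_INTERVAL):
    """True when this patch level should get a cached full tree."""
    return patch_level % interval == 0


def stale_caches(cached_levels, keep=KEEP_CACHED):
    """Return the cached patch levels that can be removed, oldest first."""
    ordered = sorted(cached_levels)
    return ordered[:-keep] if keep > 0 else ordered


if __name__ == "__main__":
    print(should_cache(40), stale_caches([0, 20, 40, 60]))
```

With these numbers, anyone fetching a recent revision needs at most a cached tree plus fewer than 20 patches, and only revisions more than about 40 patches old fall back to a longer patch chain.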

I'd be happy to hear about what other people add to their commit scripts, to see if I have missed out on anything.

Manoj

Monday 20 August
2007
Link: Mail Filtering with CRM114: Part 4

Posted early Monday morning, August 20th, 2007

Mail Filtering with CRM114: Part 4

Training the Discriminators

It has been a while since I posted on this category -- actually, it has been a long while since my last blog. When I last left you, I had mail (mbox format) folders called ham and/or junk, which were ready to be used for training either CRM114 or Spamassassin or both.

Setting up Spamassassin

This post lays the groundwork for the training, and details how things are set up. The first part is setting up Spamassassin. One of the things that bothered me about the default settings for Spamassassin was how swiftly Bayes information was expired; indeed, it seems really eager to dump the Bayes information (don't they trust their engine?). I have spent some effort building a large corpus, and keeping it clean, but Spamassassin would discard most of the information from the DB after training over my corpus, and the decrease in accuracy was palpable. To prevent this information from leaching away, I first increased the size of the database, and turned off automatic expiration, by putting the following lines into ~/.spamassassin/user_prefs:

bayes_expiry_max_db_size  4000000
bayes_auto_expire         0

I also have regularly updated spam rules from the SpamAssassin Rules Emporium to improve the efficiency of the rules; my current user_prefs is available as an example.

Initial training

I keep my Spam/Ham corpus under the directory /backup/classify/Done, in the subdirectories Ham and Spam. At the time of writing, I have approximately 20,000 mails in each of these subdirectories, for a total of 41,000+ emails.

I have created a couple of scripts to train the discriminators from scratch using the extant Spam corpus; these scripts are also used for re-learning, for instance, when I moved from a 32-bit machine to a 64-bit one, or when I change CRM114 discriminators. I generally run them from the ~/.spamassassin/ and ~/var/lib/crm114 (which contains my CRM114 setup) directories.

I have found that training Spamassassin works best if you alternate Spam and Ham message chunks; and this Spamassassin learning script delivers chunks of 50 messages for learning.
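
A minimal Python sketch of the alternating-chunk idea, for illustration; the learning script itself is not reproduced here, and the function names are made up. Each batch would then be handed to something like `sa-learn --spam` or `sa-learn --ham`; the sketch only builds the batches.

```python
from itertools import zip_longest


def chunks(items, size=50):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def alternating_batches(spam, ham, size=50):
    """Interleave spam and ham chunks: spam, ham, spam, ham, ..."""
    batches = []
    for spam_chunk, ham_chunk in zip_longest(chunks(spam, size), chunks(ham, size)):
        if spam_chunk:
            batches.append(("spam", spam_chunk))
        if ham_chunk:
            batches.append(("ham", ham_chunk))
    return batches


if __name__ == "__main__":
    spam_files = [f"Spam/{i}" for i in range(120)]  # made-up file names
    ham_files = [f"Ham/{i}" for i in range(60)]
    for kind, batch in alternating_batches(spam_files, ham_files):
        print(kind, len(batch))
```

When one side of the corpus is larger, the leftover chunks of that side simply follow each other at the end.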

With CRM114, I have discovered that it is not a good idea to stop learning based on the number of times the corpus has been gone over, since stopping before all messages in the corpus are correctly handled is disastrous. So I set the repeat count to a ridiculously high number, and tell mailtrainer to continue training until a streak larger than the sum of Spam and Ham messages has occurred. This CRM114 trainer script does the job nicely; running it under screen is highly recommended.

Routine updates

Coming back to where we left off, we had mail (mbox format) folders called ham and/or junk sitting in the local mail delivery directory, which were ready to be used for training either CRM114 or Spamassassin or both.

There are two scripts that help me automate the training. The first script, called mail-process, does most of the heavy lifting. It processes a bunch of mail folders, each of which is supposed to contain mail that is either all ham or all spam, as indicated by the command line arguments. We go looking through every mail, and for any mail where either the CRM114 or the Spamassassin judgement was not what we expected, we strip out mail gathering headers, save the mail, one to a file, and train the appropriate filter. This ensures that we only train on error, and it does not matter if we accidentally try to train on correctly classified mail, since that would be a no-op (apart from increasing the size of the corpus).
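
The train-on-error decision can be sketched in Python as follows. This is not mail-process itself, only a model of the rule it applies; the function name and the verdict strings are assumptions made for the example.

```python
def filters_to_train(expected, crm114_verdict, spamassassin_verdict):
    """Return the filters whose verdict differs from the human label.

    `expected` is "ham" or "spam"; a verdict such as "unsure" counts as an
    error, since it is not what we expected.
    """
    verdicts = {"crm114": crm114_verdict, "spamassassin": spamassassin_verdict}
    return [name for name, verdict in verdicts.items() if verdict != expected]


if __name__ == "__main__":
    print(filters_to_train("ham", "ham", "spam"))
```

A mail that both filters already got right yields an empty list, which is the no-op case mentioned above.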

The second script, called mproc, is a convenience front-end; it takes no arguments, and just calls mail-process with the proper command line arguments, feeding it the ham and junk folders in sequence. So, after human classification, just calling mproc does the training.

This pretty much finishes the series of posts I had in mind about spam filtering; I hope it has been useful.

Manoj

Sunday 14 January
2007
Link: Mail Filtering with CRM114: Part 3

Posted early Sunday morning, January 14th, 2007

Mail Filtering with CRM114: Part 3

Uphold, maintain, sustain: life in the trenches

Now that I have a baseline filter, how do I continue to train it, without putting in too much effort? There are two separate activities here: firstly, selecting the mails to be used in training, and secondly, automating the training and saving to the mail corpus. Ongoing training is essential; Spam mutates, and even ham changes over time, and well trained filters drift. However, if training disrupts normal work-flow, it won't happen; so a minimally intrusive set of tools is critical.

Selecting mail to train filters

There are three broad categories of mails that fit the criteria:

1. Misclassified mail

This is where human judgement comes in, to separate the wheat from the chaff.

  • misclassified Spam. I do nothing special for this category -- I assume that I would notice these mails, and when I do, I just save them in a junk folder, for later processing. The volume of such messages has fallen to about one or two a month, and having them slip through is not a major problem in the first place.
  • misclassified ham is far more critical, and, unfortunately, somewhat harder to get right, since you do want to reject the worst of the Spam at the SMTP level. A mistake here is worse than a false negative: all that happens with a false negative is that you curse, save to the junk folder for later retraining, and move on. With missed Ham, you never know what you might have missed -- and hope it is nothing important.

    • If one of the two filters did the right thing, then the mlg script can catch it -- more on it below. The only thing to remember to do is to look carefully at the grey.mbox folder that is produced. While it is not good to have misclassified ham, at least this sub-category is easy to detect.
    • If both filters misclassified it, this would then mean a human would have to catch the error. The good news is that I haven't had very many mails fall into this category (the last one I know about was in the late summer of 2005). How can one be sure that there are not ham messages falling through the cracks all the time? The idea is to accept mail with scores one would normally discard, and treat this as a canary in a mine: no ham should ever show up in these realspam messages. My schema for handling Spam is shown in the figure below.

    I try and keep all the mail that I have rejected in quarantine for about a month or so, and so can retrieve a mail if informed about a mistaken rejection. I also do spot checks once in a while, though as time has gone on with no known false positives, the frequency of my checks has dropped.

2. Partially misclassified mail

This is mail correctly classified overall, but misclassified by either crm114, or spamassassin, but not both. This is an early warning sign, and is more common than mail that is misclassified, since usually the filter that is wrong is wrong weakly. But this is the point where training should occur, so that the filter does not drift to the point that mails are misclassified. Again, the mlg script catches this.

3. Mail that crm114 is unsure about

This is mail correctly classified, but something that mailreaver is unsure about -- and this category is why mailreaver learns faster than mailtrainer.

Spam handling and disposition

Spam handling schema.

At this point I should say something about how I generally handle mails scored as Spam by the filters. As you can see, the mail handling is simple, depending on the combined score given to the mail by the filters. The handling rules are:

  • score <= 5.0: Accept unconditionally
  • 5.0 < score <= 15: Grey list
  • 15 < score: reject

So, any mail with a score of 15 or less is accepted, potentially after grey-listing. The disposition is done according to the following set of rules:

  • score <= 0: Classify into folder based on origin
  • 0 < score <= 10: file into Spam (some of this survived grey-listing)
  • 10 < score: file into realspam (Must have survived grey-listing)
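
The two rule sets above can be written as a pair of small Python functions. This is a sketch of the thresholds only, not the actual filter configuration; the function names and the "by-origin" label are made up for illustration.

```python
def smtp_action(score):
    """What to do with a mail at SMTP time, based on the combined score."""
    if score <= 5.0:
        return "accept"
    if score <= 15:
        return "greylist"
    return "reject"


def delivery_folder(score):
    """Where an accepted mail is filed, based on the combined score."""
    if score <= 0:
        return "by-origin"  # classified into a folder based on origin
    if score <= 10:
        return "Spam"
    return "realspam"


if __name__ == "__main__":
    for score in (-2.0, 3.0, 7.5, 12.0, 20.0):
        print(score, smtp_action(score), delivery_folder(score))
```

Note that a mail scoring between 10 and 15 is grey-listed, and if it survives it lands in realspam, so anything in that folder was accepted only after grey-listing.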

In the last 18+ months, I have not seen a Ham mail in my realspam folder; chances of Ham being rejected are pretty low. My Spam folder gets a ham message every few months, but these are usually spamassassin misclassifying things; and mlg detects those. I have not seen one of these in the last 6 months. So my realspam canary has done wonders for my peace of mind. With ideally trained filters, spam and realspam folders would be empty.

mail list grey

I have created a script called mlg ("Mail List Grey") that I run periodically over my mail folder, that picks out mails that either (a) are classified differently by spamassassin and crm114, or (b) are marked as unsure by mailreaver. The script takes these mails and saves them into a grey.mbox folder. I tend to run it over Spam and non-Spam folders in different runs, so that the grey.mbox folder can be renamed to either ham or junk, in the vast majority of the cases. Only for misclassified mails do I have to individually pick the misplaced email and classify it separately from the rest of the emails in that batch.
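
The selection rule that mlg applies can be sketched in Python like this. It is a model of the two conditions only, not the mlg script; the function name and the verdict strings ("spam", "good", "unsure") are assumptions made for the example.

```python
def is_grey(spamassassin_says_spam, crm114_verdict):
    """True when a mail belongs in grey.mbox.

    A mail is grey if crm114 (via mailreaver) is unsure about it, or if the
    two filters disagree about whether it is spam.
    """
    if crm114_verdict == "unsure":
        return True
    return spamassassin_says_spam != (crm114_verdict == "spam")


if __name__ == "__main__":
    print(is_grey(False, "good"), is_grey(True, "good"), is_grey(False, "unsure"))
```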

At this point, I should have mail mbox folders called ham and/or junk, which are now ready to be used for training either crm114 or spamassassin or both. Processing these folders is the subject of the next article in this series.

Manoj

Thursday 11 January
2007
Link: Mail Filtering with CRM114: Part 2

Posted early Thursday morning, January 11th, 2007

Mail Filtering with CRM114: Part 2

Or, Cleanliness is next to godliness

The last time I blogged about Spam fighting (Mail Filtering With CRM114: Part 1), I left y'all with visions of non-converging learning, various ingenious ways of working around an unclean corpus, and perhaps a sinking feeling that this whole thing was more fragile than it ought to be.

During this period of flailing around, trying to get mailtrainer to learn the full corpus correctly, I upgraded to an unpackaged version of crm114. Thanks to the excellent packaging effort by the maintainers, this was dead easy: get the debian sources using apt-get source crm114, download the new tarball from crm114 upstream, cp the debian dir over, and just edit the changelog file to reflect the new version. I am currently running my own, statically linked 20061103-Blame Dalkey.src-1.

Cleaning the corpus made a major difference to the quality of discrimination. As mentioned earlier, I examined every mail that was initially incorrectly classified during learning. Now, there are two ways this can happen: either the piece of mail was correctly placed in the corpus, but had a feature that was different from those learned before; or it was wrongly classified by me. When I started, the two were almost equally likely; I have now hopefully eliminated most of the misclassifications. When mailtrainer goes into cycles, retraining on a couple of emails round after round, you almost certainly are trying to train in conflicting ways. Cyclic retraining is almost always a human's error in classification.

Some of the errors discovered were not just misclassifications: some were things that were inappropriate mail, but not Spam; for instance there was the whole conversation where someone subscribed debian-devel to another mailing list, there was the challenge, the subscription notice, the un-subscription, challenge, and notice -- all of which were inappropriate, and interrupted the flow, and contributed to the noise -- but were not really Spam. I had, in a fit of pique, labelled them as Spam; but they were really like any other mailing list subscription conversations, which I certainly want to see for my subscriptions. crm114 did register the conversations as Spam and non-Spam, as requested, but that increased the similarity between Spam and non-Spam features -- and probably decreased the accuracy. I've since decided to train only on Spam, not on inappropriate mails; and let Gnus keep inappropriate mails from my eyes.

I've also waffled over time about whether or not to treat newsletters from Dr. Dobbs Journal or USENIX as Spam -- now my rule of thumb is that since I signed up for them at some point, they are not Spam -- though I don't feel guilty about letting mailagent squirrel them away mostly out of sight.

A few tips about using mailtrainer:

  • Make sure you have a clean corpus
  • Try and make it so that you have roughly equal numbers of Ham and Spam
  • Don't let mailtrainer quit after a set number of rounds. I use a ridiculous repeat count of about 100 -- never expecting to reach anywhere close to that. Instead, I set the streak to a number = number of Spam + number of Ham + 10. This means that mailtrainer does not quit until it has processed every email in the corpus correctly without needing to retrain. Letting mailtrainer quit after it repeated the corpus twice, but before it got a streak of correct classifications left me with a filter with horrible accuracy.
  • Make sure you have a clean corpus (yes, really)
  • While the ratio of similarities between .css files to the differences between .css files is a fun metric, and I use it as an informal, rough benchmark while training, it is not really correlated to accuracy, so don't get hung up on it, like I was. Usually, the value I get when training from scratch on my corpus is somewhere around 7.5; but over time the ratio degrades (falling to about 5), while the filter accuracy keeps increasing.
  • Train only on errors. The good news is that both mailtrainer and mailreaver already do so, so this is an easy rule to follow. The reason is that you should only be doing minimal training, in case you have to modify/change the underlying rule in the future. So your filter should be trained to correctly classify the mails you get, but don't shoot for overwhelming scores.
  • Use mailreaver. It has the nice behaviour of asking for retraining when it is unsure of some mail, and it caches mail for some time which really helps in training and rebuilding .css files. The reason mailreaver learns faster than mailtrainer is just this feature, I think.
  • Stash mails you train on in your corpus, and don't hesitate to re-run mailtrainer over your corpus again after you have been training the unsure emails. When you train single emails, the changed filter may no longer correctly classify some email in the corpus you train with. Running mailtrainer over the corpus adjusts the filter to correctly classify every mail again. I generally retrain about once a week, though I no longer retrain from scratch, unless I change the classification algorithm.
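
The stopping rule from the tips above (a huge repeat count, and a streak of corpus size plus a little slack) can be modelled with a toy loop in Python. This is a sketch of the idea, not of mailtrainer's internals: the function name is hypothetical, and `classify` and `learn` stand in for whatever the real filter does.

```python
def train_until_streak(corpus, classify, learn, max_rounds=100, slack=10):
    """Train on errors until a long enough streak of correct answers.

    `corpus` is a list of (message, label) pairs. The streak target is the
    corpus size plus `slack`, so it can only be reached after a full pass
    over every message without a single retraining.
    """
    target = len(corpus) + slack
    streak = 0
    for _ in range(max_rounds):
        for message, label in corpus:
            if classify(message) == label:
                streak += 1
                if streak >= target:
                    return True
            else:
                learn(message, label)  # train on error only
                streak = 0
    return False  # repeat count exhausted before the filter converged


if __name__ == "__main__":
    memory = {}  # toy "filter": remembers labels, guesses ham otherwise
    toy_corpus = [("a", "ham"), ("b", "spam"), ("c", "ham"), ("d", "spam")]
    done = train_until_streak(
        toy_corpus, lambda m: memory.get(m, "ham"), memory.__setitem__, slack=1
    )
    print(done, memory)
```

If the corpus contains the same message under conflicting labels, a loop like this never reaches its streak and ends by exhausting the repeat count, which matches the cyclic retraining symptom described earlier.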

Manoj


Webmaster <webmaster@golden-gryphon.com>
Last commit: terribly early Sunday morning, June 8th, 2014
Last edited terribly early Sunday morning, June 8th, 2014