Monday 24 December
Link: The mark of the assassin

Posted early Monday morning, December 24th, 2007

The mark of the assassin

Based on The Secret Servant, I ordered a whole slew of books from the same author, including this one. I am still impressed by the geopolitical insight; a few years before 9/11, the author sets up a reasonable facsimile by having evildoers blow up a plane taking off from New York; though the evildoers in question are a bunch of Cold War nostalgic members of the military-intelligence-industrial collective.

The action is still fast paced, though I am far less impressed by the characters than I was in The Secret Servant. Michael and Elizabeth seem to be two-dimensional cutouts, not people; the person I felt most in touch with was the evildoer assassin.


Sunday 25 November
Link: 300, and the history channel perspective.

Posted early Sunday morning, November 25th, 2007

300, and the history channel perspective.

Yes, this is about a movie based on a comic based on a movie from the 60s. And they did a wonderful job of conveying the comic book feel -- and yet, though you could appreciate the abstract, stylized presentation of the comic, most of the movie still came straight from Herodotus. The training of the Spartans, the throwing of the Persian emissaries into a pit and a well -- this cleaving to the historic details was a pleasant surprise. The history channel presentation is recommended for the perspective it brings to the tale.

There was some poetic license -- the whole bit about a highly placed Spartan traitor was made out of whole cloth; and the current conventional wisdom is that Leonidas went to Thermopylae because of his religious beliefs and his conviction about the sacred prophecy of the oracle at Delphi, not because he thought Persia would destroy Greece (remember, Xerxes won, and sacked Athens). Indeed, there was little concept of "Greece" at that point.

Indeed, the whole shtick about the last stand at Thermopylae saving democracy seems suspect -- the stand bloodied Persia's nose, and delayed them by perhaps 5 days -- in an advance that took the better part of a year, and that the Greeks knew about. No, it was the combination of Marathon, Thermopylae, Salamis, Plataea -- over the course of half a century -- that ensured that the no-name David of the Greek city states survived against the Goliath of Persia. And, then, of course, came the boy wonder out of Macedonia.

Highly recommended.


Sunday 25 November
Link: Filtering Accuracy: Brown paper bag time

Posted early Sunday morning, November 25th, 2007

Filtering Accuracy: Brown paper bag time

After posting about filtering accuracy I got to thinking about the test I was using. It seemed to me that there should be no errors for the mails that crm114 had already been trained upon -- but here I was, coming up with errors when I trained the css files until there were no errors, and then used a regexp that measured the classification accuracy for all the files, not just new ones. This did not make sense.

The only explanation was that my css files were not properly created -- and I thought to try an experiment where, instead of throwing my whole corpus as one chunk at a blank css file, I would feed the corpus in chunks. I came up with an initialization script to feed my corpus to a blank css file in 200 mail chunks and, while it was at it, renumber the corpus mails (over the years, as I cleaned the corpus, gaps had appeared in the numbering). I have also updated the retraining script.
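The chunk-and-renumber idea can be sketched roughly as below. This is not the actual initialization script: the function names `chunks` and `renumber`, the six digit zero padding, and the choice of Python are all assumptions made for illustration, and the step that actually hands each chunk to crm114 is deliberately left out.

```python
from typing import Dict, Iterator, List, Sequence


def chunks(items: Sequence[str], size: int = 200) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items (200 mails per chunk)."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def renumber(names: Sequence[str], width: int = 6) -> Dict[str, str]:
    """Map each old name to a gap-free, zero-padded number, keeping the order."""
    # The padding width is a hypothetical choice, not taken from the real corpus.
    return {old: f"{new:0{width}d}" for new, old in enumerate(names, start=1)}
```

A corpus of 450 mails would be fed as chunks of 200, 200 and 50, and old message numbers with gaps such as 3, 7, 12 would become 000001, 000002, 000003.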

Voila. I am getting a new set of css files which do not appear to show any errors for mails crm114 has already learned about -- in other words, for mails it has seen, the accuracy is now 100%, not 99.5% as it was a couple of days ago.

While the good news is that my classification accuracy is better than it was last week, the bad news is that I no longer have concrete numbers on accuracy for crm114 -- the mechanism used now gives 100% accuracy all the time. The funny thing is, I recall going through this analysis a couple of years ago, where I predicted that one could only check for accuracy with a test corpus that had the same characteristics as real mail inflow, and which had not been used for learning. That would mean I would have a classified testing corpus that could improve the efficiency of my filter, but was held back from training in order to provide accuracy numbers -- I have gone for improving the filter, at the cost of knowing how accurate it actually is.
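For what it is worth, the held-out approach described above might look something like the sketch below. Everything here is hypothetical: `split_corpus`, `held_out_accuracy`, the 10% test fraction and the `classify` callable are stand-ins for illustration, not part of crm114 or of my scripts.

```python
import random
from typing import Callable, List, Sequence, Tuple

Labeled = Tuple[str, str]  # (message name, true label such as "ham" or "spam")


def split_corpus(corpus: Sequence[Labeled], test_fraction: float = 0.1,
                 seed: int = 0) -> Tuple[List[Labeled], List[Labeled]]:
    """Shuffle reproducibly and return (training set, held-out test set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(corpus)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def held_out_accuracy(test_set: Sequence[Labeled],
                      classify: Callable[[str], str]) -> float:
    """Percentage of held-out messages whose predicted label matches the truth."""
    if not test_set:
        raise ValueError("test set is empty")
    hits = sum(1 for name, label in test_set if classify(name) == label)
    return 100.0 * hits / len(test_set)
```

The point of the design is that the test set is never learned from, so its accuracy figure says something about unseen mail; the price is that those messages do nothing to improve the filter.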


Sunday 25 November
Link: The children of men

Posted early Sunday morning, November 25th, 2007

The children of men

A nicely paced movie about a bleak future, and how people cope with despair and desperate times. While it did not quite come together in the details (anything outside of England was a big unknown blur), and the London of 2027 seemed not much different from any current day city under semi-martial law (technology, for instance, seems to have frozen at today's levels), it was still fast paced, and enjoyable, and anyway, this is not primarily a sci-fi flick.



Sunday 18 November
Link: Eragon

Posted early Sunday morning, November 18th, 2007


I liked the book. Sure, it is "The Lord of the Rings" meets "Star Wars", but, the book had a nice flow -- and it was written by a fifteen year old, fer gawds sake. The very fact that he can turn out a page turner of a book when others of his age can't string together a grammatical sentence spelled correctly is amazing. Overall, derivative, unoriginal, and simplistic though the book is, it has an original charm -- a very good book for children, and one that adults can read through as well.

So I went to this movie with high hopes. What a letdown. This was merely a notch above the Beowulf debacle. Lackluster performances in a bland drudge of a movie, with all kinds of interesting elements and nuances from the book removed. Crude, unimaginative, ham-handed performances all around. The plot line, which did not follow the book, was dumbed down, there were implications that the Elven princess was a potential love interest (faugh), and the refreshing pace of the book fell off to a plodding soporific caricature. It is an offense to the book, and to the author.

I was going to point out the differences between the movie and the book, and why the differences made the movie worse, but after 30 or so items this post would have gotten to be too big. And, having written it, I have the release of the rant, so I no longer have to include it here. Anyway, Wikipedia says that the film came in at #235 in the all time worldwide box office chart but was met with dismal critical reviews, scoring only a 16% composite score on Rotten Tomatoes.

I feel sorry for you if you suffered through this, as did I.


Sunday 18 November
Link: The movie vaguely resembling Beowulf: an IMAX 3d experience

Posted early Sunday morning, November 18th, 2007

The movie vaguely resembling Beowulf: an IMAX 3d experience

This should really be titled "A movie vaguely representing Beowulf, but all sexed up with various salacious elements". Hrothgar was treated much better in the original; and all the blatant and gratuitous sexuality brought into the movie was a turn off. But then, I might be in the minority of the audience who had any familiarity with the poem.

The characters in the movie seemed two dimensional caricatures (the only compelling performance was from Grendel's mother). And the changes made to the story line also lost the prowling menace of the latter years of the king of the Geats.

After watching Hollywood debacles like this one, I am driven to wonder why Hollywood writers seem to think they can so improve upon the work of writers whose story has stood the test of time. Making Beowulf into a boastful liar and cheat (even in the tale of the sea monsters -- his men imply that they knew their lord was a liar) -- in an age where honor and battle prowess were everything -- I mean, what were the producers thinking?

Most certainly not a movie I am going to recommend.

I had not researched the movie much before I went into the show, and it was a surprise to me to see that this was an animated movie a la "Final Fantasy". While I was impressed with the computer graphics (reflections in general, and reflections of ripples in the water, were astounding), the not-quite-a-cartoon, not-quite-a-realistic-movie experience was a trifle distracting, and detracted from telling the tale.

I like IMAX 3d, and the glasses are improving.


Tuesday 13 November
Link: Deeds of Paksenarrion: III

Posted early Tuesday morning, November 13th, 2007

Deeds of Paksenarrion: III

Oath of Gold rounds out this excellent fantasy series from Elizabeth Moon. It is a pity that she never came back to this character (though she wrote a couple of prequels), despite the fact that the ending paragraph leaves ample room for sequels: "... when the call of Gird came, Paksenarrion left for other lands."

This is high fantasy in true Tolkien manner, but faster paced, more gritty, and with characters one could relate to. I am already looking forward to my next re-read of the series.


Monday 12 November
Link: Deeds of Paksenarrion: II

Posted early Monday morning, November 12th, 2007

Deeds of Paksenarrion: II

Divided Allegiance is the middle of the trilogy, the one that I hate reading. Not because Ms Moon's book is bad, which it is not; it is still as gripping as the others, and comes closer to the high fantasy of Tolkien -- it is just that I hate what happens to Paks in the book, and the fact that the book ends leaving her in that state. I guess I am a wimp when it comes to some things that happen to characters I am identifying with. However, it has been so long since I read the series that I have begun to forget the details, so I went through and read it anyway.

This is a transition book: the Deeds of Paksenarrion was about Paksenarrion the line warrior, and the final book is where she becomes the stuff of legends. I usually read the first and last here.


Saturday 10 November
Link: Adventures in the windy city

Posted early Saturday morning, November 10th, 2007

Adventures in the windy city

I have just come back from a half week stay at the Hilton Indian Lakes resort (which is the second time in a month that I have stayed at a golf resort and club and proceeded to spend 9 hours a day in a windowless conference room). On Thursday night, an ex Chicago native wanted to show us the "traditional" Chicago pizza (which can be delivered, half cooked, and frozen, via Fed-Ex, anywhere in the lower 48). Google Maps to the rescue! One of the attendees had a car, and we piled in and drove to the nearest pizzeria. It was take out only. We headed to the next on the list, again to be met with disappointment, since making the pizza takes the better part of an hour, and we did not want to be standing out in a chilly parking lot while they made our pizza. So, I strongly advocated going to Tapas Valencia instead, since I had never had tapas before.

Somewhat to our disappointment, they served tapas only as an appetizer, and had a limited selection; so we ended up ordering one tapas dish (I had beef kabobs with a garlic horseradish sauce and caramelized onions), and my very first paella (paella valencia), with shrimp, mussels, clams, chicken, and veggies. We ate well, and headed back to the hotel. As we parked, and started for the gate, I realized I no longer had my wallet with me -- so back to the restaurant we went. The waiter had not found the wallet. Nor had the busboy. The owner/hostess suggested perhaps it was in the parking lot? So we all went and combed the parking lot -- once, twice.

At this point I am beginning to think about the consequences -- I can't get home, because I can't get into the airport, since I have no ID. I have no money, but Judy can't wire the money to me via Western Union -- because I have no ID. I need money to buy Greyhound tickets to get home on a bus ... and then there is the cancelling of credit cards, etc. Panic city.

While I was on my fourth circuit of the parking lot, the owner went back -- and checked the laundry chute. I had apparently carelessly draped the napkin over my wallet when paying the tab, and walked away -- and the busboy just grabbed all the napkins, wallet and all, and dumped it down the chute. Judy suggests I carry an alternate form of ID and at least one credit card in a different location than my wallet for future trips.

If that was not excitement enough, yesterday, I got on the plane home, uneventfully enough. We took off, and I was dozing comfortably, when there were two loud bangs, and the plane juddered and listed to port. There was a smell of burning rubber, and we stopped gaining altitude. After making a rough about turn with the left wing down, the pilot came on the intercom to say "We just lost our left engine, and we are returning to O'Hare. We should be in the ground in two minutes". Hearing the "in", a guy up front started hyperventilating, and his wife was rubbing his back. My feelings were mostly of exasperation: I had just managed to get myself situated comfortably, and now lord only knows when we would get another aircraft. When we landed, the nervous dude reached over and kissed his wife like he had just escaped the jaws of death. And he asked if any of us knew statistics, and if we were fine now. (I was tempted to state that statistics are not really predictive, but hey.) It was all very pretty, with six fire engines rushing over and spraying us with foam and all. When we got off the plane the nervous dude headed straight to some chairs in the terminal, and said his legs would not carry him further. He did make it to the replacement plane later, though.

Turns out it was a bird flying into the engine that caused the flameout. Well, at least I have a story to tell, though it delayed getting home by about three hours.


Saturday 10 November
Link: The Secret Servant

Posted early Saturday morning, November 10th, 2007

The Secret Servant

I bought this book by Daniel Silva last week at SFO, faced with a long wait for the red eye back home, since I recalled hearing about it on NPR, and reading a review in Time magazine, or was it the New Yorker? Anyway, the review said he is his generation's finest writer of international intrigue, one of America's most gifted spy novelists ever. I guess Graham Greene and John le Carre belong to an older generation. Anyway, everything I read or heard about it was very positive.

Daniel Silva is far less cynical than Le Carre, and his world does not gel quite as well, to my ears, as Smiley's Circus did. The hero, Gabriel Allon, does have some super human traits, but, thank the lord, is not James Bond. I was impressed by Silva's geo-politics, though -- paragraphs from the book seem to appear almost verbatim in current event reports in the International Herald Tribune and BBC stories.

I like this book (to the extent of ordering another 7 from this author from Amazon today), and appreciate the influx of new blood in the international espionage market. Lately, the genre has been treated to lackluster, mediocre knock offs of the Bourne Identity -- and the engaging pace of the original has never been successfully replicated in the sequels. And Silva's writing is better than Ludlum's.


Thursday 08 November
Link: Deeds of Paksenarrion

Posted early Thursday morning, November 8th, 2007

Deeds of Paksenarrion

Sheepfarmer's Daughter is an old favourite, which I have read lord only knows how many times. Elizabeth Moon has written a gritty, enthralling story of the making of a Paladin. This is the first book of a trilogy, and introduces us to a new universe through the eyes of a young innocent (which is a great device to introduce us to a universe from the viewpoint of someone who is not seeing it through eyes jaundiced by experience).

For me, books have always been an escape from the humdrum mundanity of everyday existence. Putting myself in the shoes of a character in the story is the whole point; and this story excels there: it is very believable. Not many people can tell a tale that comes alive, and Ms Moon is one of them. Moon is an ex-Marine, and much of the detail of Paks's military life has been drawn from her own military experience. More than just that, the world is richly drawn, and interesting.

I read this book in a hotel room in Chicago, since, as usual, there was nothing really interesting on TV, and I don't "get" the whole bar scene.


Tuesday 06 November
Link: Continuous Automated Build and Integration Environment

Posted early Tuesday morning, November 6th, 2007

Continuous Automated Build and Integration Environment

One of the things I have been tasked to do in my current assignment is to create a dashboard of the status of various software components created by different contractors (participating companies) in the program. These software components are built by different development groups, using different toolsets and languages -- though I was able to get an agreement on the VCS (Subversion -- yuck). Specifically, one should be able to tell which components pass pre-build checks, compile, can be installed, and pass unit and functional tests. There should be nightly builds, as well as builds whenever someone checks in code on the "release" branches. And, of course, the dashboard should be HTTP accessible, and be bright and, of course, shiny.
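To make the dashboard requirement a little more concrete, here is a minimal sketch of the per-component status it would need to display. The stage names come from the list above; the function `component_status`, the "pass"/"fail"/"skipped" labels, and the idea of modelling each stage as a callable are assumptions for illustration and do not describe any of the tools discussed in this post.

```python
from typing import Callable, Dict

# Stage order follows the requirement: checks, compile, install, then tests.
STAGES = ("pre-build checks", "compile", "install", "unit tests", "functional tests")


def component_status(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run the stages in order; once one fails, all later stages are skipped."""
    status: Dict[str, str] = {}
    failed = False
    for stage in STAGES:
        check = checks.get(stage)
        if failed or check is None:
            # Either an earlier stage failed or no check was supplied for this one.
            status[stage] = "skipped"
        elif check():
            status[stage] = "pass"
        else:
            status[stage] = "fail"
            failed = True
    return status
```

In a real setup each callable would wrap whatever command the contractor's build already uses, which keeps the status model independent of maven, ant, make, or anything else.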

My requirements were that, since the whole project is not Java, there should be no dependencies on maven or ant or eclipse projects (or make, for that matter); that it should be able to do builds on multiple machines (license constraints restrict some software to Solaris or Windows); and that it should not suck up too much time from my real job (this is a service: if it is working well, you get no credit; if it fails, you are on the hot seat). And it should be something I can easily debug, so no esoteric languages (APL, Haskell -- and Python :P).

So, using "continuous integration" as a Google search term, I found the comparison matrix at Damage Control.

I looked at anthill and cruisecontrol. The major drawback people seemed to think cruisecontrol had, that configuration is done by editing an XML file as opposed to using a (by some accounts, buggy) UI, is not much of a factor for me. (See this, and also this.) I also like the fact that it seems easier to plug in other components. I am uncomfortable with free software that has a "commercial" sibling; we have been burned once by UML software with those characteristics.

Cruisecontrol, Damagecontrol, Tinderbox1 & tinderbox2, Continuum, and Sin match. I tried to see the demo versions; Sin's link led me to a site selling Myrtle Beach condos, never a good sign. Continuum and Damagecontrol were down at the time, so I could not do an evaluation. So, here are the ones I could get to with working demo pages: http://cclive.thoughtworks.com/ and http://tinderbox.mozilla.org/showbuilds.cgi?tree=SeaMonkey

Cruisecontrol takes full control, checking things out of source control; and running the tests; which implies that all the software does build and run on the same machine -- this is not the case for me. Also, CC needs to publish the results/logs in XML; which seems to be a good fit for the java world; but might be a constraint for my use case.

I like the tinderbox dashboard better, based on the information presented; but that is not a major issue. It also might be better suited for a distributed, open source development model; cruisecontrol seems slightly more centralized -- more on this below. Cruisecontrol is certainly more mature; and we have some experience with it. Tinderbox has a client/server model, and communicates via email with a number of machines where the actual build/testing is done. This seems good.

Then there is flamebox -- nice dashboard, derivative of tinderbox2; and seems pretty simple (perhaps too simple); and easily modifiable.

However, none of these seemed right. There was too much of an assumption of a build and test model -- and few of them seemed to be a good fit for distributed, Grid-based software development; so I continued looking.

Cabie screen shot.

I finally decided on CABIE:

Continuous Automated Build and Integration Environment. Cabie is a multi-platform, multi-cm client/server based application providing both command line and web-based access to real time build monitoring and execution information. Cabie builds jobs based upon configuration information stored in MySQL and will support virtually any build that can be called from the command line. Cabie provides a centralized collection point for all builds providing web based dynamic access, the collector is SQL based and provides information for all projects under Cabie's control. Cabie can be integrated with bug tracking systems and test systems with some effort depending on the complexity of those systems. With the idea in mind that most companies create build systems from the ground up, Cabie was designed to not have to re-write scripted builds but instead to integrate existing build scripts into a smart collector. Cabie provides rapid email notification and RSS integration to quickly handle build issues. Cabie provides the ability to run builds in parallel, series, to poll jobs or to allow the use of scripted nightly builds. Cabie is perfect for agile development in an environment that requires multiple languages and tools. Cabie supports Perforce, Subversion and CVS. The use of a backend broker allows anyone with perl skills to write support for additional CM systems.

The nice people at Yo Linux have provided a Tutorial for the process. I did have to make some changes to get things working (mostly in line with the changes recommended in the tutorial, but not exactly the same). I have sent the patches upstream, but upstream is not sure how much of it they can use, since there has been major progress since the last release.

The upstream is nice and responsive, and has added support in unreleased versions for using virtual machines to run the builds in (they use that to do the Solaris/Windows builds), improved the web interface using (shudder) PHP, and added all kinds of neat stuff.


Monday 05 November
Link: Filtering accuracy: Hard numbers

Posted early Monday morning, November 5th, 2007

Filtering accuracy: Hard numbers

UPDATE: This posting has severe flaws, which were discovered subsequently. Please ignore.

I have often posted on the accuracy of my mail filtering mechanisms on the mailing lists (I have not had a false positive in years, and I stash all discards/rejects locally, and do spot checks frequently; and I went through 6 months of exhaustive checks when I put this system in place). False negatives are down to about 3-4 a month (0.019%). Yes, that is right: I am claiming that my classification correctness record is 99.92% (99.98% accuracy for messages my classifiers are sure about). Incorrectly classified unsure ham is about 3-4 (0.019%) a month; incorrectly classified unsure Spam is roughly the same, perhaps a little higher. Adding these to the incorrect classification, my best estimate of not confidently classified mail is 0.076%, based on the last 60 days of data (which is what gets you the 99.92%).

I get unsure/retrain messages at the rate of about 20 a day (about 3.2% of non-spam email) -- about two-thirds of which are classified correctly, but either SA and crm114 disagree, or crm114 is unsure. So I have to look at about 20 messages a day to see if a ham message slipped in there, and train my filters based on these; the process is highly automated (it just uses my brain as a classifier). The mail statistics can be seen on my mail server.

Oh, my filtering front end also switches between reject/discard and turns grey listing on and off based on whether or not the mail is coming from mailing lists/newsletters I have authorized; see mimedefang-filter.

However, all these numbers are manually gathered, and I still have not gotten around to automating the measurement of my setup's overall accuracy, but now I have some figures on one of the two classifiers in my system. Here is the data from CRM114. I'll update the numbers below via cron.

UPDATE: The css files used below were malformed, and the process of creating them detailed below is flawed. Please see newer postings in this category.

First, some context: when training CRM114 using the mailtrainer command, one can specify to leave out a certain percentage of the training set in the learn phase, and run a second pass over the mails so skipped to test the accuracy of the training. The way you do this is by specifying a regular expression to match the file names. Since my training set has message numbers, it was simple to use the least significant two digits as a regexp; but I did not like the idea of always leaving out the same messages. So I now generate two sets of numbers for every training run, and leave out messages with those two trailing digits, in effect reserving 2% of all mails for the accuracy run.
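The regexp generation just described can be sketched along these lines. This is an illustration rather than the script I actually run: the names `pattern_for` and `validation_regexp` are made up, and the trailing `[_][_]` is simply copied from the shape of the regexps recorded in the table further down, without spelling out the file naming convention behind it.

```python
import random


def pattern_for(two_digits: int) -> str:
    """Build one alternative, e.g. 16 -> "[1][6][_][_]"."""
    if not 0 <= two_digits <= 99:
        raise ValueError("expected a value from 0 to 99")
    return f"[{two_digits // 10}][{two_digits % 10}][_][_]"


def validation_regexp(rng: random.Random) -> str:
    """Pick two distinct two-digit endings and join them with '|'."""
    # Two endings out of the hundred possible ones reserve roughly 2% of the mails.
    first, second = rng.sample(range(100), 2)
    return f"{pattern_for(first)}|{pattern_for(second)}"
```

Drawing a fresh pair for every run means a different 2% of the corpus is held back each time, which is the whole reason for not hard coding the digits.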

An interesting thing to note is the asymmetry in the accuracy: CRM114 has never identified a Spam message incorrectly. This is because the training mechanism is skewed towards letting a few spam messages slip through, rather than let a good message slip into the spam folder. I like that. So, here are the accuracy numbers for CRM114; adding SpamAssassin into the mix only improves the numbers. Also, I have always felt that a freshly learned css file is somewhat brittle -- in the sense that if one trains an unsure message, and then tries to TUNE (Train Until No Errors) the css file, a large number of runs through the training set are needed until the thing stabilizes. So it is as if the learning done initially was minimalistic, and adding the information for the new unsure message required all kinds of tweaking. After a while of TOEing (Training on Errors) and TUNEing, this brittleness seems to get hammered out of the CSS files. I also expect to see accuracy rise as the css files get less brittle -- the table below starts with data from a newly minted .css file.

Accuracy numbers and validation regexp

                              Corpus  Ham                         Spam                        Overall                     Validation
Date                            Size  Count  Correct    Accuracy  Count  Correct    Accuracy  Count  Correct    Accuracy  Regexp
Wed Oct 31 10:22:23 UTC 2007   43319    492      482   97.967480    374      374  100.000000    866      856   98.845270  [1][6][_][_]|[0][3][_][_]
Wed Oct 31 17:32:44 UTC 2007   43330    490      482   98.367350    378      378  100.000000    868      860   99.078340  [3][7][_][_]|[2][3][_][_]
Thu Nov  1 03:01:35 UTC 2007   43334    491      483   98.370670    375      375  100.000000    866      858   99.076210  [2][0][_][_]|[7][9][_][_]
Thu Nov  1 13:47:55 UTC 2007   43345    492      482   97.967480    376      376  100.000000    868      858   98.847930  [1][2][_][_]|[0][2][_][_]
Sat Nov  3 18:27:00 UTC 2007   43390    490      480   97.959180    379      379  100.000000    869      859   98.849250  [4][1][_][_]|[6][4][_][_]
Sat Nov  3 22:38:12 UTC 2007   43394    491      482   98.167010    375      375  100.000000    866      857   98.960740  [3][1][_][_]|[7][8][_][_]
Sun Nov  4 05:49:45 UTC 2007   43400    490      483   98.571430    377      377  100.000000    867      860   99.192620  [4][6][_][_]|[6][8][_][_]
Sun Nov  4 13:35:15 UTC 2007   43409    490      485   98.979590    377      377  100.000000    867      862   99.423300  [3][7][_][_]|[7][9][_][_]
Sun Nov  4 19:22:02 UTC 2007   43421    490      486   99.183670    379      379  100.000000    869      865   99.539700  [7][2][_][_]|[9][4][_][_]
Mon Nov  5 05:47:45 UTC 2007   43423    490      489   99.795920    378      378  100.000000    868      867   99.884790  [4][0][_][_]|[8][3][_][_]

As you can see, the accuracy numbers are trending up, and already are nearly up to the values observed on my production system.
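For anyone wanting to check the arithmetic, each accuracy figure is just correct divided by count, as a percentage. The helper names below are hypothetical; the sample numbers are taken from the first row of the table.

```python
from typing import Tuple


def accuracy(correct: int, count: int) -> float:
    """Percentage of messages classified correctly."""
    if count <= 0:
        raise ValueError("count must be positive")
    return 100.0 * correct / count


def overall_accuracy(ham: Tuple[int, int], spam: Tuple[int, int]) -> float:
    """Combine (count, correct) pairs for ham and spam into one percentage."""
    count = ham[0] + spam[0]
    correct = ham[1] + spam[1]
    return accuracy(correct, count)


# First row of the table: 482 of 492 ham and 374 of 374 spam were correct.
ham_acc = accuracy(482, 492)                            # about 97.9675
overall = overall_accuracy((492, 482), (374, 374))      # about 98.8453
```

Note that the overall figure is weighted by message counts (856 of 866), so it is not the plain average of the ham and spam percentages.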


Sunday 04 November
Link: The White Company

Posted early Sunday morning, November 4th, 2007

The White Company

I had somehow managed to miss out on The White Company while I was growing up and devouring all of the Sherlock Holmes stories and The Lost World. This is a pity, since I would have liked this bit of the Hundred Years' War much better when I was young and uncritical.

Oh, I do like the book. The pacing is fast, if somewhat predictable. The book is well researched, and leads you from one historic event to the other, and is peppered with all kinds of historical figures, and I believe it to be quite authentic in its period settings. Unfortunately, there is very little character development, and though the characters are deftly sketched, they all lack depth, which would not have bothered the young me. Also, Sir John Hawkwood, of the White Company, is mentioned only briefly in passing.

This compares less favourably with Walter Scott's Quentin Durward, set in a period less than 80 years in the future. But then, I have always had a weakness for Scott. As for Conan Doyle, The Lost World was far more gripping.

I am now looking for books about Hawkwood, a mercenary captain mentioned in this book, as well as Dickson's Childe Cycle books. The only books I have found so far on the golden age of the Condottieri are so darned expensive.


