Saturday, July 3, 2010

The iPhone4 antenna and our industry's approach...

Apple's scrambling around the iPhone4 issues reminds me of something from years back. In days of yore, we programmed on large computers in languages like PL/I. I was pretty knowledgeable about PL/I - I understood how it used memory and what quirks there were in IBM's pretty amazing optimizing compiler. Occasionally debugging required us to examine that thing of beauty - the core dump.

So one day, I was teaching a "how to read a core dump" class when a production problem arose that the on-duty team could not resolve. The dump was duly presented to me to interpret, so I made it a class exercise. The symptom was simple - the program raised a "division by zero" exception. The PL/I language has a special construct called an "On Unit" which is invoked when a matching exception is raised. The general error on unit was invoked, the system produced its dump and all seemed to make sense. Except that there was no division being performed in the expression.

More PL/I history here - PL/I does automatic conversion of values in expressions to ensure that it manages the precision correctly. And just occasionally, it converts from the internal packed decimal format (where each half byte stores a digit 0-9 and the rightmost half byte holds a value A-F to denote the sign) to a pure binary value. There is even a special machine instruction for doing this.
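For readers who have never met that digit-per-half-byte layout (usually called packed decimal), here is a minimal Python sketch of the decoding. The function name is made up for illustration, and the sign convention (B and D mean minus, the other values A-F mean plus) is the usual IBM one as I remember it, so treat it as an assumption rather than a quote from a manual.

```python
def unpack_decimal(packed: bytes) -> int:
    """Decode a packed-decimal field (hypothetical helper) into an int."""
    # Split every byte into its two half bytes (nibbles), high one first.
    nibbles = []
    for byte in packed:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)

    # Every nibble is a digit except the rightmost one, which is the sign.
    *digits, sign = nibbles
    if any(d > 9 for d in digits):
        raise ValueError("digit nibble must be 0-9")
    if sign < 0xA:
        raise ValueError("sign nibble must be A-F")

    value = 0
    for d in digits:
        value = value * 10 + d

    # Assumed convention: B and D are minus, A, C, E and F are plus.
    return -value if sign in (0xB, 0xD) else value


print(unpack_decimal(bytes([0x12, 0x3C])))  # 123
print(unpack_decimal(bytes([0x04, 0x5D])))  # -45
```

The hardware does all of this in a single instruction, which is why the compiler was happy to use it behind our backs.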

So on inspecting the actual place the program stopped, we found the machine instruction that does a convert to binary (and not a division instruction). Weird. On further examination of the oracle (the IBM Messages and Codes manual), we saw that the Convert To Binary instruction will raise a Division By Zero exception in the event that the decimal number is too big. Strange exception for that condition, but OK, we know now. Of course, by the time the exception is raised to the PL/I language layer, it is still treated as a "zerodivide" exception. And that's what gets reported.

Naturally enough we raised this as an issue with IBM, expecting either some clever fix in the language bindings, like "before actually raising zerodivide we will check the opcode, and if it is convert to binary we will raise an overflow exception instead of zerodivide", or an "Oh, interesting" reaction: the hardware should probably not raise division by zero, it should raise overflow instead. No such luck. The documentation was fixed instead. It now reads (under the zerodivide section) that sometimes zerodivide can be raised when converting operands in an expression to binary. (Not a verbatim statement; this all happened a long time ago.)
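To make the language-binding fix we were hoping for concrete, here is a rough Python sketch. Everything in it is a toy model: the class and function names are invented, the 32-bit limit is what I would expect for a register-sized result, and mapping the condition to FIXEDOVERFLOW is my guess at what "overflow" would have meant in PL/I terms. It is not how IBM's runtime actually worked.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


class HardwareInterrupt(Exception):
    """Hypothetical stand-in for a program interruption from the machine."""

    def __init__(self, kind: str, opcode: str):
        super().__init__(f"{kind} at {opcode}")
        self.kind = kind
        self.opcode = opcode


def convert_to_binary(value: int) -> int:
    """Toy model of Convert To Binary: the result must fit in 32 bits."""
    if not INT32_MIN <= value <= INT32_MAX:
        # The surprise: "too big" is signalled as a divide exception.
        raise HardwareInterrupt("fixed-point divide", "CVB")
    return value


def divide(a: int, b: int) -> int:
    """Toy model of an integer divide instruction (truncates toward zero)."""
    if b == 0:
        raise HardwareInterrupt("fixed-point divide", "DR")
    quotient = abs(a) // abs(b)
    return quotient if (a < 0) == (b < 0) else -quotient


def pli_condition(interrupt: HardwareInterrupt) -> str:
    """The hoped-for fix: look at the opcode before naming the condition."""
    if interrupt.kind == "fixed-point divide":
        return "FIXEDOVERFLOW" if interrupt.opcode == "CVB" else "ZERODIVIDE"
    return "ERROR"


def run(operation, *args):
    """Run an operation; return its result or the PL/I condition raised."""
    try:
        return operation(*args)
    except HardwareInterrupt as interrupt:
        return pli_condition(interrupt)


print(run(divide, 1, 0))                # ZERODIVIDE
print(run(convert_to_binary, 2**31))    # FIXEDOVERFLOW
```

A dozen lines in the language layer, in other words. Changing a paragraph in the manual was cheaper still.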

So what's the parallel with Apple? Maybe none; maybe it really is some overly aggressive bars calculation. But if there is a serious underlying problem, it is a whole lot easier to fix at the documentation level (you are holding it wrong), or failing that at the software layer (we will recalculate the bars), than at the fundamental hardware or platform level (we designed the antenna wrong).

It is natural to look for the least costly and least invasive fix to a problem, but sometimes it backfires.
