What's with the unhealthy obsession with the Therac-25 incidents?

Okay, well, you see, it's not really an unhealthy obsession; it's more like a healthy interest in the lessons to be learned from the whole thing. Nancy Leveson has written an excellent piece on the subject. You may want to start by reading this document. It's a bit long, but a good read.

In a nutshell, the Therac-25 was responsible for massive radiation overdoses of six people in the mid-1980s, at least three of whom died as a result. The causes were rooted in poor software development practices and poor management.

After issues were reported to the manufacturer (AECL of Canada), investigations were made into the cause of the malfunctions. Oddly enough, the software for the Therac-25 was not inspected, because no one believed that the software could possibly be the cause. The software had evolved from that used in earlier models with different hardware features. It was later discovered that the previous models could exhibit the same malfunctions, but because they relied on hardware safety interlocks rather than software "locks", the problems never caused harm. After several ineffectual investigations and failed attempts to recreate the reported malfunctions, the company continued to claim that overdoses were impossible and that such a software error had an infinitesimally small probability of occurring (on the order of 10^-5).

To make matters worse, malfunctions were common and routinely ignored. Technicians and operators simply reset the machine to clear common error conditions. The error messages gave little or no indication of what had gone wrong, and the codes were poorly documented (there was a sheet on the side of the machine that tersely explained the error codes but gave no indication of their causes or of any remedial steps to take).

Finally, after several tragedies had already occurred, a diligent hospital physicist discovered that the most mysterious and most deadly of the malfunctions, Malfunction 54, was triggered by the speed with which treatment data was entered and edited on the control screen. The reason AECL engineers could not reproduce the error condition was that they were not entering data rapidly enough on their test system (the entry and edits had to be completed within about 8 seconds).
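
To make the failure mode concrete, here is a minimal sketch of the same class of hazard, written in Python purely for illustration (the real Therac-25 software was custom assembly, and every name below is invented): a setup routine latches shared treatment parameters, a fast operator edit changes them before setup finishes, and nothing re-checks the latched values before the "beam" fires.

    import threading
    import time

    # Shared treatment parameters, edited by the operator "screen" and read by
    # the setup routine with no locking or re-validation. This illustrates the
    # hazard class, not the actual Therac-25 code.
    params = {"mode": "xray"}

    def setup_beam():
        latched_mode = params["mode"]   # latch the mode once at setup start
        time.sleep(0.5)                 # stand-in for the ~8-second setup window
        # If the operator edited the mode during setup, the latched value is
        # stale, but nothing here notices before the "beam" fires.
        print(f"firing with mode={latched_mode!r}; screen now shows {params['mode']!r}")

    def fast_operator_edit():
        time.sleep(0.1)                 # a fast operator corrects the entry mid-setup
        params["mode"] = "electron"

    setup = threading.Thread(target=setup_beam)
    edit = threading.Thread(target=fast_operator_edit)
    setup.start(); edit.start()
    setup.join(); edit.join()

Run as-is, this will almost always report firing in the stale x-ray mode while the screen shows electron mode. That is the essence of the timing bug: the cure is a design that locks or re-validates shared state before acting on it, not a patch to one code path.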

Even after verifying that these malfunctions were the result of program errors, AECL continued to propose less-than-adequate mitigations. One of the suggested remedies was to remove the "up arrow" (edit) key and cover its switch contacts with electrical tape. This would prevent the operator from making the rapid edits that bypassed one of the software checks and triggered the malfunction. The users and the FDA demanded more. Eventually, hardware interlocks were added to ensure that future malfunctions would physically halt the system and prevent injury to patients.

Okay, so why is all this important, or even interesting? Well, it says a lot about the state of software engineering, and it underscores the importance of testing and of proper software practices and principles.

From Leveson, section 3.5.3, The Software "Bug":

A lesson to be learned from the Therac-25 story is that focusing on particular software "bugs" is not the way to make a safe system. Virtually all complex software can be made to behave in an unexpected fashion under some conditions. The basic mistakes here involved poor software engineering practices and building a machine that relies on the software for safe operation. Furthermore, the particular coding error is not as important as the general unsafe design of the software overall.

I feel that this case in particular should serve as a warning to everyone in the software industry, particularly those who design software that interacts with people. Please stop designing software as if only programmers or other computers will interact with it (unless, of course, other computers are the target of your software). Please test software thoroughly, and do it at many levels: module, unit, integration, regression. This implies writing code that is, in fact, testable (see the sketch below). This becomes even more critical as we, as a society, rely more heavily on software to control life-critical systems.
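
As one small, hedged illustration of what "testable" means in practice (a sketch only, in Python; the function, its threshold, and the test names are all invented, not anything from the Therac-25 codebase), keeping a safety check in a small pure function lets a unit test exercise it directly:

    import unittest

    def dose_is_safe(prescribed_cgy: float, max_cgy: float = 200.0) -> bool:
        """Accept only a positive dose at or below the allowed maximum."""
        return 0.0 < prescribed_cgy <= max_cgy

    class DoseIsSafeTests(unittest.TestCase):
        def test_accepts_typical_dose(self):
            self.assertTrue(dose_is_safe(180.0))

        def test_rejects_zero_and_negative(self):
            self.assertFalse(dose_is_safe(0.0))
            self.assertFalse(dose_is_safe(-5.0))

        def test_rejects_overdose(self):
            self.assertFalse(dose_is_safe(250.0))

    if __name__ == "__main__":
        unittest.main()

The same idea scales up: module, integration, and regression tests are only practical when the safety-relevant behavior is reachable and observable from outside the code that implements it.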

So, I chose therac25 as a reminder, at least to myself, to think about what I do and what I create.

<End of Transmission>