From A Gift of Fire by Sara Baase, Prentice-Hall, 1997, pp. 125-129

 

4.2 CASE STUDY. THE THERAC‑25

 

4.2.1 Therac‑25 Radiation Overdoses

 

The Therac‑25 was a software‑controlled radiation therapy machine used to treat people with cancer. Between 1985 and 1987, Therac‑25 machines at four medical centers gave massive overdoses of radiation to six patients. In some cases, the operator repeated an overdose because the machine's display said that no dose had been given. Medical personnel later estimated that some patients received between 13,000 and 25,000 rads,* where the intended dose was in the 100‑200 rad range. These incidents caused severe injuries and the deaths of three patients.

 

What went wrong?

Studies of the Therac‑25 incidents showed that many factors were involved in causing the injuries and deaths. The factors include lapses in good safety design, insufficient testing, bugs in the software that controlled the machines, and an inadequate system of reporting and investigating the accidents. (Articles by computer scientists Nancy Leveson, Clark Turner, and Jonathan Jacky are the main sources for this discussion.26)

To understand the discussion of the problems, it will help to know a little about the machine. The Therac‑25 is a dual‑mode machine; that is, it can generate an electron beam or an X‑ray photon beam. The type of beam to be used depends on the tumor being treated. The machine's linear accelerator produces a high‑energy electron beam (25 million electron volts) that is dangerous. Patients are not to be exposed to the raw beam. The computer monitors and controls movement of a turntable on which three sets of devices are mounted. Depending on whether the treatment is electron or X‑ray, a different set of devices is rotated in front of the beam to spread it and make it safe. It is essential that the proper protective device be in place when the electron beam is on. A third position of the turntable may be used with the electron beam off and a light beam on instead, to help the operator position the beam in precisely the correct place on the patient's body. There were several weaknesses in the design of the Therac‑25 that contributed to the accidents (including some in the physical design that we will not mention here).

 

4.2.2 Software and Design Problems

 

Design Flaws

The Therac‑25, developed in the late 1970s, followed earlier machines called the Therac‑6 and Therac‑20. It differed from them in that it was designed to be fully computer controlled. The older machines had hardware safety interlock mechanisms, independent of the computer, that prevented the beam from firing in unsafe conditions, for example, if the beam‑attenuating devices were not in the correct position. Many of these hardware safety features were eliminated in the design of the Therac‑25. Some software from the Therac‑20 and Therac‑6 was reused in the Therac‑25. This software was apparently assumed to be functioning correctly. This assumption was wrong. When new operators used the Therac‑20, there were frequent shutdowns and blown fuses, but no overdoses. The Therac‑20 software had bugs, but the hardware safety mechanisms were doing their job. Either the manufacturers did not know of the problems with the Therac‑20, or they completely missed their serious implications.

The Therac‑25 malfunctioned frequently. One facility said there were sometimes 40 dose rate malfunctions in a day, generally underdoses. Thus operators became used to error messages appearing often, with no indication that there might be safety hazards.

There were a number of weaknesses in the design of the operator interface. The error messages that appeared on the display were simply error numbers or obscure messages ("Malfunction 54" or "H‑tilt"). This was not unusual for computer programs in the 1970s, when computers had much less memory and mass storage than they have now. One had to look up each error number in a manual for more explanation. The operator's manual for the Therac‑25, however, did not include any explanation of the error messages. Even the maintenance manual did not explain them. The machine distinguished the severity of errors by the amount of effort needed to continue operation. For certain error conditions, the machine paused and the operator could proceed (turn on the electron beam) by pressing one key. For other kinds of errors, the machine suspended operation and had to be completely reset. One would presume that the one‑key resumption would be allowed only after minor, not safety‑related, errors. Yet this was the situation that occurred in some of the accidents in which patients received multiple overdoses.
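The presumption stated above, that one‑key resumption should be available only after minor errors, can be written down as a simple policy. The following is a minimal sketch under stated assumptions, not a description of the Therac‑25 software: the error codes, messages, and severity classes are all invented for illustration.

```python
from enum import Enum


class Severity(Enum):
    MINOR = "minor"
    SAFETY = "safety-related"


# Hypothetical error table. These codes, messages, and classifications are
# invented for illustration; they are not the machine's actual error list.
ERROR_TABLE = {
    "E-01": ("Printer not ready", Severity.MINOR),
    "E-02": ("Delivered dose differs from prescribed dose", Severity.SAFETY),
}


def may_resume_with_one_key(code):
    """Allow quick resumption only for known errors classed as minor."""
    entry = ERROR_TABLE.get(code)
    if entry is None:
        return False  # an unknown error is treated as safety-related
    _message, severity = entry
    return severity is Severity.MINOR


print(may_resume_with_one_key("E-01"))  # True
print(may_resume_with_one_key("E-02"))  # False
print(may_resume_with_one_key("E-99"))  # False
```

The choice that matters in this sketch is the default: an error the table does not recognize requires a full reset rather than a single keystroke.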

Investigators studying the accidents found that there was very little documentation produced during development of the program concerning the software specifications or the testing plan. Although the manufacturer of the machine, Atomic Energy of Canada, Ltd. (AECL), a Canadian government corporation, claimed that it was tested extensively, it appeared that the test plan was inadequate.

 

Bugs

Investigators were able to trace some of the overdoses to two specific software errors. Because many readers of this book are computer science students, I will describe the bugs. These descriptions illustrate the importance of using good programming techniques. However, some readers have little or no programming knowledge, so I will simplify the descriptions.

After treatment parameters are entered by the operator at a control console, a software procedure, Set‑Up Test, is called to perform a variety of checks to be sure the machine is positioned correctly, and so on. If anything is not ready, the routine schedules itself to be executed again so that the checks are done again after the problem is resolved. (It may simply have to wait for the turntable to move into place.) The Set‑Up Test routine may be called several hundred times while setting up for one treatment. When a particular flag variable is zero, it indicates that a specific device on the machine is positioned correctly. To ensure that the device is checked, each time the Set‑Up Test routine runs, it increments the variable to make it nonzero. The problem was that the flag variable was stored in one byte. When the routine was called the 256th time, the flag overflowed and showed a value of zero. (If you are not familiar with programming, think of this as an odometer rolling over to zero after reaching the highest number it can show.) If everything else happened to be ready at that point, the device position was not checked, and the treatment could proceed. Investigators believe that in some of the accidents, this bug allowed the electron beam to be turned on when the turntable was positioned for use of the light beam, and there was no protective device in place to attenuate the beam.

Part of the tragedy in this case is that the error was such a simple one, with a simple correction. No good student programmer should have made this error. The solution is to set the flag variable to a fixed value, say 1, when entering Set‑Up Test, rather than incrementing it.
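The overflow and its correction can be modeled in a few lines. The following is a minimal sketch, not the machine's actual code: the function name and the `BYTE_MASK` constant are hypothetical, and the one‑byte variable is imitated by masking a Python integer to 8 bits.

```python
# Hypothetical model of the one-byte flag described in the text.
# All names are illustrative; none come from the Therac-25 source.

BYTE_MASK = 0xFF  # a one-byte variable can hold only the values 0..255


def passes_skipping_check(num_passes, increment=True):
    """Return the pass numbers on which the flag reads zero after its update.

    A zero flag is taken to mean "device positioned correctly", so on
    those passes the position check would be skipped.
    """
    flag = 0
    skipped = []
    for call in range(1, num_passes + 1):
        if increment:
            flag = (flag + 1) & BYTE_MASK  # buggy: wraps to 0 on every 256th pass
        else:
            flag = 1  # corrected: assign a fixed nonzero value
        if flag == 0:
            skipped.append(call)
    return skipped


buggy = passes_skipping_check(600, increment=True)
fixed = passes_skipping_check(600, increment=False)
print(buggy)  # [256, 512]
print(fixed)  # []
```

With incrementing, the check is skipped on passes 256 and 512 of a 600‑pass set‑up; with assignment, it is never skipped, however many times the routine runs.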

In a real‑time system where physical machinery is controlled, status is determined, and an operator enters, and may modify, input (a multitasking system), there are many complex factors that can contribute to subtle, intermittent, and hard‑to‑detect bugs. Programmers working on such systems must learn to be aware of the potential problems and to program using good techniques to avoid them. In some of the accidents, a set of bugs allowed the machine to ignore changes or corrections made by the operator at the console. When the operator typed in all the necessary information for a treatment, the program began moving various devices into place. This process could take several seconds. The software was written to check for editing of the input by the operator during this time and to restart the set‑up if editing was detected. However, because of bugs in this section of the program, some parts of the program learned of the edited information while others did not. This led to machine settings that were incorrect and inconsistent with safe treatment. According to the later investigation by the Food and Drug Administration (FDA), there appeared to be no consistency checks in the program. The error was most likely to occur if the operator was experienced and quick at editing input.
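A consistency check of the kind the FDA found missing can be sketched as follows. This is an illustration under stated assumptions, not a reconstruction of the Therac‑25 program: the `Prescription` type, the two "subsystem" copies, and the example values are all hypothetical, and the timing problem is imitated by taking one copy before the edit and one after.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prescription:
    mode: str        # "electron" or "xray" (illustrative values)
    dose_rads: int


def settings_consistent(console, *subsystem_copies):
    """True only if every subsystem holds exactly what the console now shows."""
    return all(copy == console for copy in subsystem_copies)


# The operator types an entry; set-up begins and one subsystem copies it.
console = Prescription(mode="xray", dose_rads=200)
beam_copy = console

# The operator quickly edits the entry while devices are still moving.
console = Prescription(mode="electron", dose_rads=200)
turntable_copy = console  # this subsystem sees the edit; the other does not

ok_to_fire = settings_consistent(console, beam_copy, turntable_copy)
print(ok_to_fire)  # False
```

Here the stale copy disagrees with the console, so the check fails and the set‑up would have to restart before the beam could be turned on. A real system would also need to guard against edits arriving between the check and the firing of the beam, which this sketch does not attempt.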

 

4.2.3 Why So Many Incidents?

 

There were six known Therac‑25 overdoses. You may wonder why the machine continued to be used after the first one.

The Therac‑25 had been in service for up to two years at some clinics. It was not pulled from service after the first few accidents because it was not known immediately that it was the cause of the injuries. Medical staff members considered various other explanations. The staff at the site of the first incident said that one reason they were not certain of the source of the patient's injuries was that they had never seen such a massive radiation overdose before. The manufacturer was questioned about the possibility of overdoses, but responded (after the first, third, and fourth accidents) that the patient injuries could not have been caused by the machine. According to the Leveson and Turner investigative report, they also told the facilities that there had been no similar cases of injuries.

After the second accident, AECL investigated and found several problems related to the turntable (not including any of the ones we described). They made some changes in the system and recommended operational changes. They declared that the safety of the machine had been improved by five orders of magnitude, although they told the FDA that they were not certain of the exact cause of the accident. That is, they did not know if they had found the problem that caused the accident or if they had just found other problems. In making decisions about continued use of the machines, the hospitals and clinics had to consider the costs of removing the expensive machine from service (in lost income and loss of treatment for patients who needed it), the uncertainty about whether the machine was the cause of the injuries, and later, when that was clear, the manufacturer's assurances that the problem had been solved. After some of the later accidents, machines were removed from service. They were returned to service after modifications by the manufacturer, but the modifications had not fixed all the bugs.

A Canadian government agency and some hospitals using the Therac‑25 made recommendations for many more changes to enhance safety; they were not implemented. After the fifth accident, the FDA declared the machine defective and ordered AECL to inform users of the problems. The FDA and AECL spent about a year (during which the sixth accident occurred) negotiating about changes to be made in the machine. The final plan included more than two dozen changes. The critical hardware safety interlocks were eventually installed, and most of the machines remain in use with no new incidents of overdoses since 1987.

 

4.2.4 Overconfidence

 

In the first overdose incident, when the patient told the machine operator that she had been "burned," the operator told her that was impossible. This was one of many indications that the makers and some users of the Therac‑25 were overconfident about the safety of the system. The most obvious and critical indication of overconfidence in software was the decision to eliminate the hardware safety mechanisms. A safety analysis of the machine done by AECL years before the accidents suggests that they did not expect significant problems from software errors. In one case where a clinic added its own hardware safety features to the machine, AECL told them it was not necessary. (None of the accidents occurred at that facility.)

The hospitals using the machine assumed that it worked safely, an understandable assumption. Some of their actions, though, suggest overconfidence, or at least practices that should be avoided, for example, ignoring error messages because the machine produced so many of them. A camera in the treatment room and an intercom system enabled the operator to monitor the treatment and communicate with the patient. (The treatment room is shielded, and the console used by the operator is outside the room.) On the day of an accident at one facility, neither the video monitor nor the intercom was functioning. The operator did not see or hear the patient try to get up after an overdose; he received a second overdose before he reached the door and pounded on it. This facility had successfully treated more than 500 patients with the machine before the accident.

 

4.2.5 Conclusion and Perspective

 

From design decisions all the way to responding to the overdose accidents, the manufacturer of the Therac‑25 did a poor job. Minor design and implementation errors might be expected in any complex system, but the number and pattern of problems in this case, and the way they were handled, suggest irresponsibility that merits high awards to the families of the victims and possibly, some observers believe, criminal charges. This case illustrates many of the things that a responsible, ethical software developer should not do. It illustrates the importance of following good procedures in software development. It is a stark reminder of the consequences of carelessness, cutting corners, unprofessional work, and attempts to avoid responsibility. It reminds us that a complex system may work correctly hundreds of times with a bug that shows up only in unusual circumstances; hence the importance of always following good safety procedures in operation of potentially dangerous equipment. This case also illustrates the importance of individual initiative and responsibility. Recall that some facilities installed hardware safety devices on their Therac‑25 machines. They recognized the risks and took action to reduce them. The hospital physicist at one of the facilities where the Therac‑25 overdosed patients spent many hours working with the machine to try to reproduce the conditions under which the overdoses occurred. With little support or information from the manufacturer, he was able to figure out the cause of some of the malfunctions.

Even if the Therac‑25 case was unusual,** we must deal with the fact that the machine was built and used, and it killed people. There have been enough accidents in safety‑critical applications to indicate that significant improvement is needed. Should we not trust computers for such applications at all? Or, if we continue to use computers for safety‑critical applications, what can be done to reduce the incidence of failures? We will discuss some approaches in the next section.

To put the Therac‑25 in some perspective, it is helpful to remember that failures and other accidents have always occurred and continue to occur in systems that do not use computers. Two other linear accelerator radiation‑treatment machines seriously overdosed patients. Three patients received overdoses in one day at a London hospital in 1966 when safety controls failed. Twenty‑four patients received overdoses from a malfunctioning machine at a Spanish hospital in 1991; three patients died. Neither of these machines had computer controls. Two news reporters reviewed more than 4000 cases of radiation overdoses reported to the U.S. government. The Therac‑25 incidents were included, but most of the cases did not involve computers. Here are a few of the overdose incidents they describe. A technician started a treatment, then left the patient for 10‑15 minutes to attend an office party. A technician failed to carefully check the prescribed treatment time. A technician failed to measure the radioactive drugs administered; she just used what looked like the right amount. In at least two cases, technicians confused microcuries and millicuries.*** The general problems were carelessness, lack of appreciation for the risk involved, poor training, and lack of sufficient penalty to encourage better practices. In most cases, the medical facilities paid small fines or were not fined at all. (One radiation oncologist severely injured five women. He was eventually sued.)

Some of these problems might have been prevented by good computer systems. Many could have occurred even if a computer were in use. None excuse the Therac‑25. They suggest, however, that individual and management responsibility, good training, and accountability are more important factors than whether or not a computer is used.


* A rad is the unit used to quantify radiation doses. It stands for "radiation absorbed dose."

 

** Sadly, some software safety experts say the poor design and lack of attention to safety in this case are not unusual.

 

*** A curie is a measure of radioactivity. A millicurie is one thousand times as much as a microcurie.