From The Gift of Fire by Baase, Prentice-Hall, 1997, pp. 125-129
4.2 CASE STUDY. THE THERAC‑25
4.2.1 Therac‑25 Radiation Overdoses
The Therac‑25 was a software‑controlled radiation therapy machine used
to treat people with cancer. Between 1985 and 1987, Therac‑25 machines at
four medical centers gave massive overdoses of radiation to six patients. In
some cases, the operator repeated an overdose because the machine's display
said that no dose had been given. Medical personnel later estimated that some
patients received between 13,000 and 25,000 rads,* where the intended dose was
in the 100-200 rad range. These incidents caused severe
injuries and the deaths of three patients.
What went wrong?
Studies of the Therac‑25 incidents
showed that many factors were involved in causing the injuries and deaths. The
factors include lapses in good safety design, insufficient testing, bugs in the
software that controlled the machines, and an inadequate system of reporting
and investigating the accidents. (Articles by computer scientists Nancy
Leveson, Clark Turner, and Jonathan Jacky are the main sources for this
discussion.26)
To understand the discussion of the
problems, it will help to know a little about the machine. The Therac‑25
is a dual‑mode machine; that is, it can generate an electron beam or an X‑ray
photon beam. The type of beam to be used depends on the tumor being treated.
The machine's linear accelerator produces a high‑energy electron beam (25
million electron volts) that is dangerous. Patients are not to be exposed to
the raw beam. The computer monitors and controls movement of a turntable on
which three sets of devices are mounted. Depending on whether the treatment is
electron or X‑ray, a different set of devices is rotated in front of the
beam to spread it and make it safe. It is essential that the proper protective
device be in place when the electron beam is on. A third position of the
turntable may be used with the electron beam off and a light beam on instead,
to help the operator position the beam in precisely the correct place on the
patient's body. There were several weaknesses in the design of the Therac‑25
that contributed to the accidents (including some in the physical design that
we will not mention here).
4.2.2 Software and Design Problems
Design Flaws
The Therac‑25, developed in the
late 1970s, followed earlier machines called the Therac‑6 and Therac‑20.
It differed from them in that it was designed to be fully computer controlled.
The older machines had hardware safety interlock mechanisms, independent of the
computer, that prevented the beam from firing in unsafe conditions, for
example, if the beam‑attenuating devices were not in the correct
position. Many of these hardware safety features were eliminated in the design
of the Therac‑25. Some software from the Therac-20 and Therac‑6 was
reused in the Therac‑25. This software was apparently assumed to be
functioning correctly. This assumption was wrong. When new operators used the
Therac-20, there were frequent shutdowns and blown fuses, but no overdoses. The
Therac‑20 software had bugs, but the hardware safety mechanisms were
doing their job. Either the manufacturers did not know of the problems with
the Therac‑20, or they completely missed their serious implications.
The Therac‑25 malfunctioned
frequently. One facility said there were sometimes 40 dose rate malfunctions in
a day, generally underdoses. Thus operators became used to error messages
appearing often, with no indication that there might be safety hazards.
There were a number of weaknesses in the
design of the operator interface. The error messages that appeared on the
display were simply error numbers or obscure messages ("Malfunction
54" or "H‑tilt"). This was not unusual for computer
programs in the 1970s when computers had much less memory and mass storage than
they have now. One had to look up each error number in a manual for more
explanation. The operator's manual for the Therac‑25, however, did not
include any explanation of the error messages. Even the maintenance manual did
not explain them. The machine distinguished between the severity of errors by
the amount of effort needed to continue operation. For certain error
conditions, the machine paused and the operator could proceed (turn on the
electron beam) by pressing one key. For other kinds of errors, the machine
suspended operation and had to be completely reset. One would presume that the
one‑key resumption would be allowed only after minor, not safety‑related,
errors. Yet this was the situation that occurred in some of the accidents in
which patients received multiple overdoses.
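The presumption stated above, that a one‑key resumption should be allowed only after errors that are not safety‑related, can be written down as a small policy. The sketch below is purely illustrative: the names Severity and allowed_recovery are hypothetical, and nothing here is taken from the Therac‑25 software, which by the account above did not follow such a rule.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical error classes, invented for this illustration."""
    MINOR = "minor"              # not safety-related
    SAFETY_RELATED = "safety"    # could affect the dose delivered


def allowed_recovery(severity: Severity) -> str:
    """Return the least effort the operator must spend before continuing."""
    if severity is Severity.MINOR:
        return "one-key resume"
    # Anything that might be safety-related requires a complete reset.
    return "full reset"


print(allowed_recovery(Severity.MINOR))           # one-key resume
print(allowed_recovery(Severity.SAFETY_RELATED))  # full reset
```

The point of the design is that the easy path is the exception: an error is treated as requiring a full reset unless it has been explicitly classified as minor.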
Investigators studying the accidents
found that there was very little documentation produced during development of
the program concerning the software specifications or the testing plan.
Although the manufacturer of the machine, Atomic Energy of Canada, Ltd. (AECL),
a Canadian government corporation, claimed that it was tested extensively, it
appeared that the test plan was inadequate.
Bugs
Investigators were able to trace some of
the overdoses to two specific software errors. Because many readers of this
book are computer science students, I will describe the bugs. These
descriptions illustrate the importance of using good programming techniques.
However, some readers have little or no programming knowledge, so I will
simplify the descriptions.
After treatment parameters are entered by
the operator at a control console, a software procedure, Set‑Up Test, is
called to perform a variety of checks to be sure the machine is positioned
correctly, and so on. If anything is not ready, the routine schedules itself to
be executed again so that the checks are done again after the problem is
resolved. (It may simply have to wait for the turntable to move into place.)
The Set‑Up Test routine may be called several hundred times while setting
up for one treatment. When a particular flag variable is zero, it indicates
that a specific device on the machine is positioned correctly. To ensure that
the device is checked, each time the Set‑Up Test routine runs, it
increments the variable to make it nonzero. The problem was that the flag
variable was stored in one byte. When the routine was called the 256th time,
the flag overflowed and showed a value of zero. (If you are not familiar with
programming, think of this as an odometer rolling over to zero after reaching
the highest number it can show.) If everything else happened to be ready at
that point, the device position was not checked, and the treatment could
proceed. Investigators believe that in some of the accidents, this bug allowed
the electron beam to be turned on when the turntable was positioned for use of
the light beam, and there was no protective device in place to attenuate the
beam.
Part of the tragedy
in this case is that the error was such a simple one, with a simple
correction. No good student programmer should have made this error. The
solution is to set the flag variable to a fixed value, say 1, when entering Set‑Up
Test, rather than incrementing it.
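For readers who program, the rollover and its correction can be simulated in a few lines. This is a simplified sketch of the behavior described above, not the original code: the function names are invented, and the one‑byte storage is imitated by masking the value to the range 0 to 255.

```python
BYTE_MASK = 0xFF  # a one-byte variable can hold only the values 0..255


def buggy_pass(flag: int) -> int:
    """One pass of the routine as described: increment the one-byte flag."""
    return (flag + 1) & BYTE_MASK


def fixed_pass(flag: int) -> int:
    """The suggested correction: store a fixed nonzero value instead."""
    return 1


def passes_that_skip_check(one_pass, passes: int) -> list:
    """Return the pass numbers after which the flag reads zero.

    A zero flag is read as "device positioned correctly", so on those
    passes the position check would be skipped.
    """
    flag = 0
    skipped = []
    for n in range(1, passes + 1):
        flag = one_pass(flag)
        if flag == 0:
            skipped.append(n)
    return skipped


print(passes_that_skip_check(buggy_pass, 600))  # [256, 512]
print(passes_that_skip_check(fixed_pass, 600))  # []
```

With the increment, the check is silently skipped on every 256th pass; with the fixed assignment, the flag never reads zero by accident, no matter how many times the routine runs.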
In a real‑time system where physical
machinery is controlled, status is determined, and an operator enters, and may
modify, input (a multitasking system), there are many complex factors that can
contribute to subtle, intermittent, and hard‑to‑detect bugs. Programmers
working on such systems must learn to be aware of the potential problems and to
program using good techniques to avoid them. In some of the accidents, a set of
bugs allowed the machine to ignore changes or corrections made by the operator
at the console. When the operator typed in all the necessary information for a
treatment, the program began moving various devices into place. This process
could take several seconds. The software was written to check for editing of
the input by the operator during this time and to restart the set-up if editing was detected. However, because of bugs in this section of
the program, some parts of the program learned of the edited information while
others did not. This led to machine settings that were incorrect and
inconsistent with safe treatment. According to the later investigation by the
Food and Drug Administration (FDA), there appeared to be no consistency checks
in the program. The error was most likely to occur if the operator was
experienced and quick at editing input.
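The kind of consistency check that the FDA found missing can be sketched as follows. This is a hypothetical model, not a reconstruction of the Therac‑25 program: the class names, the version counter, and the choice of which part misses the edit are all assumptions made for illustration. Each part of the program remembers the input it configured itself from, and treatment would be allowed only if every part has seen the current input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreatmentInput:
    version: int  # incremented on every operator edit (assumed scheme)
    mode: str     # "xray" or "electron"


class Subsystem:
    """A part of the program that configures itself from operator input."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.seen = None  # the input this part last configured itself from

    def configure(self, entered: TreatmentInput) -> None:
        self.seen = entered


def consistent(current: TreatmentInput, subsystems: list) -> bool:
    """True only if every part configured itself from the current input."""
    return all(s.seen == current for s in subsystems)


turntable = Subsystem("turntable")
beam = Subsystem("beam")

first = TreatmentInput(version=1, mode="xray")
turntable.configure(first)
beam.configure(first)

# The operator edits the input while set-up is still in progress and,
# in this modeled bug, only one part learns of the change.
edited = TreatmentInput(version=2, mode="electron")
turntable.configure(edited)

print(turntable.seen.mode, beam.seen.mode)     # electron xray
print(consistent(edited, [turntable, beam]))   # False

# Restarting set-up for every part restores a consistent state.
beam.configure(edited)
print(consistent(edited, [turntable, beam]))   # True
```

A check of this kind does not remove the underlying bug, but it turns a silent mismatch between parts of the program into a condition that can block treatment until set-up is repeated.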
4.2.3 Why So Many Incidents?
There were six known Therac‑25 overdoses. You may wonder why the machine
continued to be used after the first one.
The Therac‑25 had been in service
for up to two years at some clinics. It was not pulled from service after the
first few accidents because it was not known immediately that it was the cause
of the injuries. Medical staff members considered various other explanations.
The staff at the site of the first incident said that one reason they were not
certain of the source of the patient's injuries was that they had never seen
such a massive radiation overdose before. The manufacturer was questioned about
the possibility of overdoses, but responded (after the first, third, and fourth
accidents) that the patient injuries could not have been caused by the machine.
According to the Leveson and Turner investigative report, they also told the
facilities that there had been no similar cases of injuries.
After the second accident, AECL
investigated and found several problems related to the turntable (not including
any of the ones we described). They made some changes in the system and
recommended operational changes. They
declared that the safety of the machine had been improved by five orders of
magnitude, although they told the FDA that they were not certain of the exact
cause of the accident. That is, they
did not know if they had found the problem that caused the accident or if they
had just found other problems. In
making decisions about continued use of the machines, the hospitals and clinics
had to consider the costs of removing
the expensive machine from service (in lost income and loss of treatment for
patients who needed it), the uncertainty about whether the machine was the
cause of the injuries, and later, when that was clear, the manufacturer's
assurances that the problem had been solved.
After some of the later accidents, machines were removed from service.
They were returned to service after modifications by the manufacturer, but the
modifications had not fixed all the bugs.
A Canadian government agency and some
hospitals using the Therac‑25 made recommendations for many more changes
to enhance safety; they were not implemented. After the fifth accident, the
FDA declared the machine defective and ordered AECL to inform users of the
problems. The FDA and AECL spent about a year (during which the sixth accident
occurred) negotiating about changes to be made in the machine. The final plan
included more than two dozen changes. The critical hardware safety interlocks
were eventually installed, and most of the machines remain in use with no new
incidents of overdoses since 1987.
4.2.4 Overconfidence
In the first overdose incident, when the patient told the machine operator that
she had been "burned," the operator told her that was impossible.
This was one of many indications that the makers and some users of the Therac‑25
were overconfident about the safety of the
system. The most obvious and critical indication of overconfidence in
software was the decision to eliminate the hardware safety mechanisms. A
safety analysis of the machine done by
AECL years before the accidents suggests that they did not expect significant
problems from software errors. In one case where a clinic added its own
hardware safety features to the machine, AECL told them it was not necessary.
(None of the accidents occurred at that facility.)
The hospitals using the machine assumed
that it worked safely, an understandable assumption. Some of their actions,
though, suggest overconfidence, or at least practices that should be avoided,
for example, ignoring error messages because the machine produced so many of
them. A camera in the treatment room and an intercom system enabled the
operator to monitor the treatment and communicate with the patient. (The
treatment room is shielded, and the console used by the operator is outside the
room.) On the day of an accident at one facility, neither the video monitor nor
the intercom was functioning. The operator did not see or hear the patient try
to get up after an overdose; he received a second overdose before he reached
the door and pounded on it. This facility had successfully treated more than
500 patients with the machine before the accident.
4.2.5 Conclusion and Perspective
From design decisions all the way to responding to the overdose accidents, the
manufacturer of the Therac‑25 did a poor job. Minor design and
implementation errors might be expected in any complex system, but the number
and pattern of problems in this case, and the way they were handled, suggest
irresponsibility that merits high awards to the families of the victims and
possibly, some observers believe, criminal charges. This case illustrates many
of the things that a responsible, ethical software developer should not do. It
illustrates the importance of following good procedures in software
development. It is a stark reminder of the consequences of carelessness,
cutting corners, unprofessional work, and attempts to avoid responsibility. It
reminds us that a complex system may work correctly hundreds of times with a
bug that shows up only in unusual circumstances; hence the importance of
always following good safety procedures in operation of potentially dangerous
equipment. This case also illustrates the importance of individual initiative
and responsibility. Recall that some facilities installed hardware safety
devices on their Therac‑25 machines. They recognized the risks and took
action to reduce them. The hospital physicist at one of the facilities where
the Therac‑25 overdosed patients spent many hours working with the
machine to try to reproduce the conditions under which the overdoses occurred.
With little support or information from the manufacturer, he was able to figure
out the cause of some of the malfunctions.
Even if the Therac‑25 case was
unusual,** we must deal with the fact that the machine was built and used, and
it killed people. There have been enough accidents in safety critical
applications to indicate that significant improvement is needed. Should we not
trust computers for such applications at all? Or, if we continue to use
computers for safety critical applications, what can be done to reduce the
incidence of failures? We will discuss some approaches in the next section.
To put the Therac‑25 in some perspective,
it is helpful to remember that failures and other accidents have always
occurred and continue to occur in systems that do not use computers. Two other
linear accelerator radiation‑treatment machines seriously overdosed
patients. Three patients received overdoses in one day at a London hospital in
1966 when safety controls failed. Twenty‑four patients received overdoses
from a malfunctioning machine at a Spanish hospital in 1991; three patients
died. Neither of these machines had computer controls. Two news reporters
reviewed more than 4000 cases of radiation overdoses reported to the U.S.
government. The Therac‑25 incidents were included, but most of the cases
did not involve computers. Here are a few of the overdose incidents they describe.
A technician started a treatment, then left the patient for 10-15
minutes to attend an office party. A technician failed to carefully check the
prescribed treatment time. A technician failed to measure the radioactive
drugs administered; she just used what looked like the right amount. In at
least two cases, technicians confused microcuries and millicuries.*** The general problems were carelessness, lack
of appreciation for the risk involved, poor training, and lack of sufficient
penalty to encourage better practices.
In most cases, the medical facilities paid small fines or were not fined
at all. (One radiation oncologist severely injured five women. He was
eventually sued.)
Some of these problems might have been
prevented by good computer systems. Many could have occurred even if a computer
were in use. None excuse the Therac‑25. They suggest, however, that
individual and management responsibility, good training, and accountability are
more important factors than whether or not a computer is used.
* A rad is the
unit used to quantify radiation doses.
It stands for "radiation absorbed dose."
** Sadly, some software safety experts say the poor design and lack of attention
to safety in this case are not unusual.
*** A curie is a measure of radioactivity. A millicurie is one thousand times as much
as a microcurie.