Episode 13 – Therac-25 and Software Safety

This episode discusses the Therac-25 accidents, and includes an interview with software safety researcher Richard Hawkins.

Despite the widespread use of software in critical applications such as aircraft, rail systems, automobiles, weapons and medical devices, it is actually very rare to find examples where fatalities can be directly linked to a software error. Many of the examples we cite when talking about software safety are not actually accidents in the strict sense of the word. They involve extensive property damage, but no unintended harm to humans.

Therac-25 stands out as a clear-cut case of a software bug leading directly to death. Like all accidents, the causes are not simple. As we talk about Therac-25 we will discuss problems with hazard analysis, hardware design, human performance, through-life safety management, and incident reporting. All of these are enablers – systematic faults in a system that allowed a simple software bug known as a race condition to shorten the lives of five people.

Medical Devices

The safety of medical devices highlights a fundamental conflict between the way different types of evidence are used in different fields of human endeavour.

The field of evidence-based medicine places heavy emphasis on data from randomized controlled trials, aggregated through systematic reviews which compare the data from multiple large and well designed studies. This approach is not perfect, particularly when not all data from all trials is available, but it generally works very well for drugs, where randomizing and controlling are straightforward. Continued monitoring is also statistical, collecting large group data on efficacy and side effects.

The field of product safety engineering places heavy emphasis on data from the processes used to produce a product, and the test and analysis of that product. This approach is not perfect either, particularly when human interaction is a key variable in the safety of the product, but it generally works very well for physical devices, where test and analysis are straightforward. Continued monitoring is somewhat statistical, but also incorporates detailed investigation and analysis of single incidents and anomalies.

The field of safety management places heavy emphasis on accumulated experience and understanding of the way organisations work, and the way they become dysfunctional. It draws on methods such as case studies and action research from the social sciences. It works well for situations where problems and solutions cannot be differentiated from the environment in which they take place, but lacks authority on strictly empirical questions, particularly where numbers are involved.

Medical devices introduce an engineered product into a hospital management system, for the purpose of treating patients. Arguments about the right type of safety assurance are inevitable. We don’t really know the right answer, but there is a wrong answer, which is to ignore one of the three fields altogether.

Therac-25

Therac-25, produced by a company called AECL, was a medical linear accelerator. Linear accelerators are one way of providing radiation therapy for cancer. Electrons are accelerated to produce a high-energy beam which burns away tumours while leaving the surrounding healthy tissue largely untouched.

The machine had three operating modes, depending on which accessory was placed in front of the electron beam. Field light mode, with no accessory, was used to line up the machine and the patient. Electron mode used magnets to spread a raw electron beam to the right therapeutic concentration. X-ray mode used a metal target to convert electrons into x-rays, and a flattening filter to spread the x-rays.

The existence of three operating modes created an inherent hazard. X-ray mode required a much stronger electron beam than electron mode, and field light mode required no beam at all. If the wrong accessory was in place, the patient would be zapped by a beam that was much too powerful.

The logical solution to this hazard is to put in place hardware interlocks which physically limit the amount of electron beam power based on the position of the accessories. For example, the highest power beam should be available only if the x-ray target and flattening filter are locked in position. This is indeed the way Therac-25's predecessors worked. Therac-6 and Therac-20 both used hardware interlocks. Both were computer controlled, but the automation was added on top of a physically safe hardware design.
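As a hypothetical sketch only (the names and values are invented, not the actual Therac-6 or Therac-20 circuitry), the interlock principle can be expressed as a guard in which the maximum beam power is derived from the physically sensed accessory position, so neither software nor the operator can command an unsafe combination:

```python
# Hypothetical sketch of the hardware-interlock principle. The mode
# names and power levels here are illustrative, not from any Therac
# design documentation.

def max_safe_power(accessory: str) -> str:
    """Map the sensed accessory position to the highest permissible beam."""
    limits = {
        "target_and_filter": "high",  # x-ray mode: target and flattening filter in place
        "magnets": "low",             # electron mode: spreading magnets in place
        "none": "off",                # field light mode: no beam at all
    }
    return limits[accessory]

def command_beam(requested: str, accessory: str) -> str:
    """Grant the requested power only up to the interlock limit."""
    order = ["off", "low", "high"]
    limit = max_safe_power(accessory)
    return requested if order.index(requested) <= order.index(limit) else limit
```

With this structure, requesting the high-power beam while only the magnets are in place yields at most the low-power beam, regardless of what the control software believes the mode to be.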

Therac-25, on the other hand, was designed from the ground up to be computer controlled. Safe operation required correct operation of the software. Unfortunately, the software was not safe. Two different but related software bugs, known as race conditions, were involved in six separate overdose accidents.
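To make the class of bug concrete, here is a minimal, hypothetical sketch of a check-then-act race, loosely modelled on the pattern in which an operator's edit arrives while beam setup is already under way. None of the names or steps come from AECL's code, and the interleaving is simulated deterministically rather than with real threads:

```python
# Hypothetical sketch of a check-then-act race condition, not AECL's
# actual code. Two setup steps each read the shared "mode" variable;
# an operator edit that lands between them leaves the machine in an
# inconsistent, hazardous state.

from dataclasses import dataclass

@dataclass
class Machine:
    mode: str = "xray"       # operator's requested mode
    beam: str = "off"        # beam power actually configured
    accessory: str = "none"  # accessory actually positioned

def run(edit_before_step: int) -> Machine:
    """Run the two beam-setup steps, applying an operator edit
    (x-ray -> electron) just before the given step index."""
    m = Machine()
    steps = [
        # step 0: set beam power from the mode as currently read
        lambda: setattr(m, "beam", "high" if m.mode == "xray" else "low"),
        # step 1: position the accessory from the mode as currently read
        lambda: setattr(m, "accessory",
                        "target" if m.mode == "xray" else "magnets"),
    ]
    for i, step in enumerate(steps):
        if i == edit_before_step:
            m.mode = "electron"  # the edit lands mid-sequence
        step()
    return m

safe = run(0)    # edit arrives before setup starts: low beam, magnets
hazard = run(1)  # edit arrives mid-setup: high beam, but no x-ray target
```

The hazardous interleaving leaves the machine configured for a high-power beam with no x-ray target in the beam path, which is exactly the combination a hardware interlock would have physically prevented.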

The post Episode 13 – Therac-25 and Software Safety appeared first on DisasterCast Safety Podcast.

Episode 12 – Piper Alpha

Piper Alpha Overview

On the 25th anniversary of the destruction of the Piper Alpha oil platform, everyone is discussing the importance of not forgetting the lessons of Piper Alpha. What are those lessons, though? Hindsight bias can lead us to believe that accidents are caused by extreme incompetence or reckless disregard for safety. These simplistic explanations convince us that disaster could never happen to us. After all, we do care about safety. We do try to do the right thing. We have good safety management, don't we? The scary truth is that what we believe about our own organisations, Occidental Petroleum believed about Piper Alpha.

As well as the usual description of the accident, this episode separately delves into the design and management of Piper Alpha. In each segment, we extract themes and patterns repeated across multiple systems, multiple procedures, and multiple people.


From a design point of view, there were four major failings on Piper Alpha, all teaching lessons that are still relevant.

  1. Failure to include protection against unlikely but foreseeable events
  2. An assumption that everything would work, with no backup provision if it didn't
  3. Inadequate independence, particularly with respect to physical co-location of equipment
  4. A design that didn't support the human activity it required in order to be safe


There are three strong patterns in the management failings of Piper Alpha.

  1. A lack of feedback loops, and an assumption that not hearing any bad information meant that things were working
  2. A tendency to seek simple, local explanations for problems, rather than using small events as clues to what was wrong with the system
  3. An unwillingness to share and discuss information about things that went wrong

Additionally, there were severe problems with the regulator – not a shortage of regulation, but a shortage of good regulation.

Transcript for this episode is here.
