Episode 21 – Safety Integrity Levels

What do electric cars, steel capped boots, and balloons bursting in crowded lecture theatres have in common? Not much, except that they all feature on this episode of DisasterCast. When it comes to achieving safety, one of the key questions is “How Much is Enough?” There will always come a point where the amount of risk you are facing doesn’t justify taking further measures to reduce it. Beyond this point, we can receive better return on our safety investment by spending our efforts and money elsewhere. We may even be destroying the benefits we get by trying too hard to be safe.

When we’re designing systems, certain aspects of safety can be expressed in numbers. This is particularly the case when we are concerned about random failures. Random failures are what we usually think about when we consider a car, train or aircraft breaking down or doing something unsafe. One minute a component is working, then it fails, after which it is no longer working. We can express the random side of things as a probability. We can reduce the likelihood of random failures by using better components, and we can reduce the impact of random failures by building redundancy into our systems.
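
To put rough numbers on the redundancy idea, here is a minimal sketch (not from the episode) assuming identical channels that fail independently; the single-channel failure probability is an invented illustrative figure.

```python
# Minimal sketch (illustrative figures only): probability that an entire
# redundant set of channels fails, assuming each channel fails independently
# with the same probability.

def prob_all_channels_fail(p_single: float, n_channels: int) -> float:
    """Probability that all n identical, independent channels fail together."""
    return p_single ** n_channels

p_single = 1e-3  # assumed failure probability for one channel (made-up number)
for n in (1, 2, 3):
    print(f"{n} channel(s): P(all fail) = {prob_all_channels_fail(p_single, n):.0e}")
```

The improvement relies entirely on the channels failing independently, which is exactly the assumption that a shared design flaw destroys.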

Random failures aren’t the only type of failures though. We call the other sorts of failures “systematic”. Redundancy doesn’t help here, because no matter how many widgets we have, if they’ve all got the same design flaw then under the wrong conditions they’ll all fail at once.

How much redundancy we need is something we can determine mathematically. How much protection we need against systematic failures is more nebulous. Software is a good example: we never know how many errors there are in a piece of software, because any time we find an error we fix it. We can reduce the number of errors by putting a lot of effort into finding and fixing them, but this still doesn’t help us count them.

The question “How safe is safe enough?” turns into “How hard do I need to keep looking for systematic failures?”. This is where the concept of safety integrity levels comes in.
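
As a rough sketch of how that question gets bracketed in practice, the snippet below maps a target failure probability onto an integrity level using the IEC 61508 low-demand bands. Treat the band boundaries as an assumption quoted from memory of the standard, not something asserted in the episode.

```python
# Sketch: map a target average probability of failure on demand (PFDavg) to a
# safety integrity level, using the IEC 61508 low-demand bands as an assumption.
# Check the standard itself before relying on these boundaries.

def sil_for_target_pfd(pfd_avg: float) -> str:
    if 1e-5 <= pfd_avg < 1e-4:
        return "SIL 4"
    if 1e-4 <= pfd_avg < 1e-3:
        return "SIL 3"
    if 1e-3 <= pfd_avg < 1e-2:
        return "SIL 2"
    if 1e-2 <= pfd_avg < 1e-1:
        return "SIL 1"
    return "outside the SIL bands"

print(sil_for_target_pfd(5e-4))  # -> SIL 3
```

The number only sets a target; in the episode’s terms, the integrity level is really a statement about how hard you have to keep looking for systematic failures.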

Partial transcript is available here.

Episode 20 – An Unexpected Risk Assessment

There is a fine line between confidence and stupidity. In the 1970s the London Ambulance Service tried to implement a computer aided despatch system, and failed because they couldn’t get the system’s users to support the change. In the late 1980s they tried again, but the system couldn’t cope with the expected load.

Clearly, implementing a system of this sort involved significant managerial and technical challenges. What better way to handle it, then, than to appoint a skeleton management team and saddle them with an impossible delivery timetable?

The London Ambulance Service Computer Aided Despatch System and Management Aided Disaster is described in this episode by George Despotou. George also talks about the safety challenges of connected health.

Episode 20 transcript is here.

References

  1. ZDNet News Item about 999 System outage
  2. London Ambulance Service Press Release
  3. Anthony Finkelstein LASCAD page with an academic paper, the full report and case study notes
  4. University of Kent LASCAD case study notes [pdf]
  5. Caldicott Report mentioned in George’s Connected Health piece
  6. The Register news article mentioned in George’s piece
  7. BBC News article on hacking heart pumps
  8. George’s Dependable Systems Blog

Episode 19 – Star Trek Transporters and Through Life Safety

Have you ever noticed that very few people get hurt during the design of a system? From precarious assemble-at-home microlight aircraft to the world’s most awesome super-weapons, the hazards that can actually occur at design time are those of a typical office environment – power sockets, trips, falls and repetitive strain injury. Our safety effort during this time is all predictive. We don’t usually call it prediction, but that’s what modelling, analysis, and engineering judgement ultimately are. We’re trying to anticipate, imagine and control a future world.

And even though it’s easy to be cynical about the competence and diligence of people in charge of dangerous systems, I really don’t think that there are evil masterminds out there authorising systems in the genuine belief that they are NOT safe. At the time a plant is commissioned or a product is released, there is a mountain of argument and evidence supporting the belief of the designers, the testers, the customers and the regulators that the system is safe. Why, then, do accidents happen?

That’s what this episode is about. We’ll look at some of the possible reasons and how to manage them, then discuss an accident, the disaster that befell Alaska Airlines Flight 261. Just in case you’ve got a flight to catch afterwards, we’ll reset our personal risk meters by discussing an alternate way to travel, the transporters and teleportation devices from Star Trek and similar Sci Fi experiences.

Transcript is available here.

References

  1. Memory Alpha (Star Trek Wiki) article on Transporters.
  2. NTSB Report on the Alaska Airlines 261 Crash.

Episode 18 – Friendly Fire

This episode is about military fratricide accidents, also known as friendly fire, blue-on-blue, and the reason why your allies are sometimes scarier than your enemies.

Friendly fire accidents are a prime example of why system safety isn’t just an activity for practice and peacetime. When warfighters can’t trust their own weapons or their own allies, it puts a serious dent in their operational capability, and that’s generally considered a bad thing. There’s a reason why Wikipedia has a page dedicated specifically to United States Friendly Fire Incidents with British Victims. It’s actually not a long list, but the cultural and strategic impact makes it feel much longer. Blue-on-blue incidents lead to distrust, lack of communication and lack of cooperation. Given that lack of communication and coordination is often cited as a cause of friendly fire, you can probably already picture the cycle of unintentional violence that can spiral from one or two incidents.

At a tactical level, friendly fire incidents occur for one of three reasons:

1) Misidentifying a friendly unit as a valid target;
2) Firing at a location other than intended; or
3) A friendly unit moving into an area where indiscriminate firing is occurring.

Since technology is increasingly being used to help identify targets, aim weapons and navigate, it is inevitable that technology will be complicit in a growing number of friendly fire accidents. In some respects the role of technology in these accidents is similar to medical device failures – accidents would occur at a higher rate without the technology; the technology just isn’t a perfect solution. This isn’t an excuse not to make the technology better, though. In particular, when friendly fire accidents happen because our electronic devices have unexpected failure modes, that’s a sign that better safety analysis has an important role to play.

In this episode we are going to look at three friendly fire incidents. Apart from the use of technology and the nationality of the perpetrators, see if you can spot the common thread.

The episode transcript is available here.

Episode 17 – Glenbrook and Waterfall

In 1999, at a place called Glenbrook, just outside of Sydney, Australia, two trains collided, killing seven people.
In 2003, at a place called Waterfall, just outside of Sydney, Australia, a train derailed, killing seven people.
Same operator, same regulator, same state government, same judge leading the inquiry. Justice Peter Aloysius McInerney was not impressed to find that his first lot of recommendations hadn’t been followed.

Episode Transcript is here.

References

  1. Special Commission of Inquiry into Glenbrook.
  2. Independent Transport Safety Regulator Waterfall Reports.

Episode 16 – Certain Questions

Honesty and humility about uncertainty are an important part of safety. At one end of the spectrum is false certainty about safety, and at the other is dogmatism about particular ways of achieving safety. Both involve overconfidence in designs, methods, and the correctness of the person making the judgement. The main feature of this episode is an interview with senior safety researcher Michael Holloway. The episode also covers the 1971 Iraq Grain Disaster.

Episode 16 Transcript is here.

References

  1. Project Syndicate Report on Iraqi Disasters
  2. Science Magazine Article on Iraq Poison Grain Disaster
  3. Bulletin of the World Health Organisation article on the Iraqi Poison Grain Disaster

Episode 15 – Disowning Fukushima

Sociologist John Downer talks about his recent paper, “Disowning Fukushima: Managing the Credibility of Nuclear Reliability Assessment in the Wake of Disaster”. If you’re in the business of producing or relying on quantitative risk assessment, what do you do when an event such as Fukushima occurs? Do you say that the event didn’t happen? Do you claim that the risk assessment wasn’t wrong? Do you say that their risk assessment was wrong, but yours isn’t? Maybe you admit that there was a problem, but claim that everything has now been sorted out.

Episode 14 – Three Mile Island and Normal Accidents

This episode of DisasterCast covers the Three Mile Island nuclear accident, and “Normal Accidents”, one possible explanation for why disasters like Three Mile Island occur.

Normal Accidents is the brainchild of the sociologist Charles Perrow. Even if you haven’t heard of him or of Normal Accidents, you’ve probably still encountered the basic ideas, which often appear in the press when major accidents are discussed. If you read or hear someone saying that we need to think “possibilistically” instead of “probabilistically”, it’s likely that they’ve been influenced, at least in part, by Normal Accidents. In particular, a number of news articles written after Fukushima invoked Normal Accidents as an explanation.

Risk assessment is not a science. Whilst we can study risk assessment using scientific methods, just as we can study any human activity, risk assessment itself doesn’t make testable predictions. This may seem counter-intuitive. Consider nuclear power. We’ve had lots of nuclear reactors for a long time – isn’t that enough to tell us how safe they are? Put simply, no. The probabilities that the reactor safety studies predict are so low that we would need tens of thousands of years of operational evidence to actually test those predictions. None of this is controversial. Perrow goes a step further though. He says that the reason we have not had more accidents is simply that nuclear reactors haven’t been around long enough to have them. In other words, he goes beyond believing that the risk assessments are unreliable, to claiming that they significantly underestimate the risk.
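
To see roughly where “tens of thousands of years” comes from, here is a back-of-the-envelope sketch. The predicted frequency is an illustrative assumption, not a figure from any particular reactor safety study.

```python
import math

# Back-of-the-envelope sketch with an assumed predicted accident frequency.
# If accidents follow a Poisson process with this rate, how many accident-free
# reactor-years are needed before "zero accidents" would be surprising
# (less than 5% likely) if the true rate were this high?

predicted_rate = 1e-4   # assumed accidents per reactor-year (illustrative)
confidence = 0.95

reactor_years_needed = -math.log(1 - confidence) / predicted_rate
print(f"{reactor_years_needed:,.0f} reactor-years")  # roughly 30,000
```

At an assumed prediction of one accident per ten thousand reactor-years, even 30,000 accident-free reactor-years would only just start to test the claim.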

The theory of Normal Accidents is essentially Perrow’s explanation of where that extra risk is coming from. His starting point is not something that we should consider controversial. Blaming the operators for an accident such as Three Mile Island misses the point. Sure, the operators made mistakes, but we need to work out what it was about the system and the environment that caused those mistakes. Blaming the operators for stopping the high-pressure injectors would be like blaming the coolant for flowing out through the open valve.

Perrow points to two system features, which he calls “interactive complexity” and “tight coupling”, that make it hard for operators to form an accurate mental model of the systems they are operating. The bulk of his book consists of case studies examining how these arise in various ways, and how they contribute to accidents.

Episode 13 – Therac-25 and Software Safety

This episode discusses the Therac-25 accidents, and includes an interview with software safety researcher Richard Hawkins.

Despite the widespread use of software in critical applications such as aircraft, rail systems, automobiles, weapons and medical devices, it is actually very rare to find examples where fatalities can be directly linked to a software error. Many of the examples we cite when talking about software safety are not actually accidents in the strict sense of the word. They involve extensive property damage, but no unintended harm to humans.

Therac-25 stands out as a clear-cut case of a software bug leading directly to death. Like all accidents, the causes are not simple. As we talk about Therac-25 we will discuss problems with hazard analysis, hardware design, human performance, through-life safety management, and incident reporting. All of these are enablers – systematic faults in a system that allowed a simple software bug known as a race condition to shorten the lives of five people.

Medical Devices

The safety of medical devices highlights a fundamental conflict between the way different types of evidence are used in different fields of human endeavour.

The field of evidence-based medicine places heavy emphasis on data from randomized controlled trials, aggregated through systematic reviews which compare the data from multiple large and well designed studies. This approach is not perfect, particularly when not all data from all trials is available, but it generally works very well for drugs, where randomizing and controlling are straightforward. Continued monitoring is also statistical, collecting large group data on efficacy and side effects.

The field of product safety engineering places heavy emphasis on data from the processes used to produce a product, and the test and analysis of that product. This approach is not perfect either, particularly when human interaction is a key variable in the safety of the product, but it generally works very well for physical devices, where test and analysis are straightforward. Continued monitoring is somewhat statistical, but also incorporates detailed investigation and analysis of single incidents and anomalies.

The field of safety management places heavy emphasis on accumulated experience and understanding of the way organisations work, and the way they become dysfunctional. It draws on methods such as case studies and action research from the social sciences. It works well for situations where problems and solutions cannot be differentiated from the environment in which they take place, but lacks authority on strictly empirical questions, particularly where numbers are involved.

Medical devices introduce an engineered product into a hospital management system, for the purpose of treating patients. Arguments about the right type of safety assurance are inevitable. We don’t really know the right answer, but there is a wrong answer, which is to ignore one of the three fields altogether.

Therac-25

Therac-25, produced by a company called AECL, was a medical linear accelerator. Linear accelerators are one way of providing radiation therapy for cancer. Electrons are accelerated to produce a high-energy beam which burns away tumours, leaving healthy tissue untouched.

The machine had three operating modes, depending on which accessory was placed in front of the electron beam. Field light mode, with no accessory, was used to line up the machine and the patient. Electron mode used magnets to spread a raw electron beam to the right therapeutic concentration. X-ray mode used a metal target to convert electrons into x-rays, and a flattening filter to spread the x-rays.

The existence of three operating modes created an inherent hazard. X-ray mode required a much stronger electron beam than electron mode, and field light mode required no beam at all. If the wrong accessory was in place, the patient would be zapped by a beam that was much too powerful.

The logical solution to this hazard is to put in place hardware interlocks which physically limit the amount of electron beam power based on the position of the accessories. For example, the highest power beam should be available only if the x-ray target and flattening filter are locked in position. This is indeed the way Therac-25’s predecessors worked. Therac-6 and Therac-20 both used hardware interlocks. They were both computer controlled, but the automation was added to a physically safe hardware design.
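
To make the interlock idea concrete, here is a sketch of the kind of permissive logic a hardware interlock enforces. The function and mode names are illustrative, and this shows the principle described above rather than AECL’s actual design.

```python
# Sketch of interlock-style permissive logic (illustrative, not AECL's design):
# the full-power beam is only available when the x-ray target and flattening
# filter are physically locked in position.

def max_permitted_beam(xray_target_locked: bool, flattening_filter_locked: bool) -> str:
    if xray_target_locked and flattening_filter_locked:
        return "full power (x-ray mode)"
    return "low therapeutic power only"

print(max_permitted_beam(xray_target_locked=False, flattening_filter_locked=False))
# -> "low therapeutic power only": without the attenuating accessories in place,
#    the high-power beam is simply not available.
```

In a genuine hardware interlock that decision is made by the physical position of the accessories, so no software defect can deliver the full-power beam with the target out of the way.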

Therac-25, on the other hand, was designed from the ground up to be computer controlled. Safe operation required correct operation of the software. Unfortunately, the software was not safe. Two different but related software bugs, both race conditions, were involved in six separate overdose accidents.
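
For readers who have not met the term, here is a deliberately generic illustration of a race condition. It is not the Therac-25 code, just the shape of the bug class: two threads share mutable state, and the result depends on how their operations interleave.

```python
import threading

# Generic race condition (not the Therac-25 software): two threads perform a
# non-atomic read-modify-write on shared state, so updates can be lost
# depending on how the threads interleave. Run it a few times and compare.

counter = 0

def bump(n: int) -> None:
    global counter
    for _ in range(n):
        counter += 1  # read, add, write back -- another thread can slip in between

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 200000 only if no updates were lost
```

In the Therac-25 accidents the shared state was, broadly speaking, the treatment setup, with the operator’s rapid keyboard edits racing the tasks that configured the beam.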

Episode 12 – Piper Alpha

Piper Alpha Overview

On the 25th Anniversary of the destruction of the Piper Alpha oil platform, everyone is discussing the importance of not forgetting the lessons of Piper Alpha. What are those lessons though? Hindsight bias can often let us believe that accidents are caused by extreme incompetence or reckless disregard for safety. These simplistic explanations convince us that disaster could never happen to us. After all, we do care about safety. We do try to do the right thing. We have good safety management, don’t we? The scary truth is that what we believe about our own organisations, Occidental Petroleum believed about Piper Alpha.

As well as the usual description of the accident, this episode separately delves into the design and management of Piper Alpha. In each segment, we extract themes and patterns repeated across multiple systems, multiple procedures, and multiple people.

Design

From a design point of view, there were four major failings on Piper Alpha, all teaching lessons that are still relevant.

  1. Failure to include protection against unlikely but foreseeable events
  2. An assumption that everything would work, with no backup provision if things didn’t work
  3. Inadequate independence, particularly with respect to physical co-location of equipment
  4. A design that didn’t support the human activity that the design required to be safe

Organisation

There are three strong patterns in the management failings of Piper Alpha.

  1. A lack of feedback loops, and an assumption that not hearing any bad information meant that things were working
  2. A tendency to seek simple, local explanations for problems, rather than using small events as clues for what was wrong with the system
  3. An unwillingness to share and discuss information about things that went wrong

Additionally, there were severe problems with the regulator – not a shortage of regulation, but a shortage of good regulation.

Transcript for this episode is here.
