(Last modified Thu Apr 17 22:42 2008)


In4matx 115
Software Specification and
Quality Engineering
Spring 2008
Some well-known failures to achieve quality

1.  What would have been the first space shuttle launch (1981)

The launch was called off at T-20 minutes. To improve reliability, the shuttle carried five on-board computers running its control software: four identical primary computers running the same software, and one backup running a different implementation written by a different team at a different company. An unrelated bug fix had introduced a new fault, which left the backup computer one clock cycle behind the primaries about one time in every 67 that the computers were initialized. Unfortunately, that case occurred at the launch. The backup computer rejected commands from the primary computers as noise because they all appeared to arrive one cycle too soon (I am not familiar with all the details of this). Consequently, the commands from the primary computers telling the backup system to start its software were never accepted. The flight controllers could not determine why the backup software was not running, so they called off the launch. [Neu95 p20]

2.  Bank of New York $24 billion overdraft (1985)

The software that processed incoming credits used a 16-bit variable that overflowed, and the overflow was not checked. Most of the variables were 32 bits, but not this one. The overflow prevented the bank from processing the credits, while the Federal Reserve software processing debits from the bank kept working just fine. The bank was forced to borrow $24 billion for one day while the software was being fixed. [Neu95 p67]
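The failure mode can be illustrated with a minimal Python sketch. It assumes an unsigned 16-bit field, and the function names (add_unchecked_u16, add_checked_u16) are hypothetical; the bank's actual code, and what the variable counted, are not described here.

```python
MAX_U16 = 0xFFFF  # largest value an unsigned 16-bit field can hold (assumed width)


def add_unchecked_u16(counter: int, n: int) -> int:
    """Mimic unchecked 16-bit arithmetic: the result silently wraps around."""
    return (counter + n) & MAX_U16


def add_checked_u16(counter: int, n: int) -> int:
    """Same addition, but report the overflow instead of wrapping."""
    total = counter + n
    if total > MAX_U16:
        raise OverflowError(f"{total} does not fit in 16 bits")
    return total


if __name__ == "__main__":
    # One more item than the field can represent wraps back to zero.
    print(add_unchecked_u16(MAX_U16, 1))
```

The unchecked version keeps running with a wrong value, which is what makes this kind of fault hard to notice until downstream processing fails; the checked version stops at the point of the error.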

3.  Strategic Defense Initiative (1985)

I believe trillions of dollars have been spent on this boondoggle, which has always relied upon a hypothetical software system. That system would have to process satellite data on a ballistic missile attack, possibly involving hundreds or thousands of simultaneous launches plus a much larger number of decoys (potentially several orders of magnitude larger), and coordinate a response that shot down all the real missiles, all within a few minutes. Even assuming such a vastly complex software system could be produced, how could it be tested? [many sources]

4.  Therac-25 (1985-87)

The Therac-25 was a computer-controlled electron-accelerator radiation-therapy device.  Eleven were installed, in locations spread across the country.  An extensive investigation after the fact showed that three faults contributed to its failure. 

  1. An operator could edit a command line in a way that changed the state of the machine, and a radiation command could then be executed before the state change had completed. The significant state change was from high-intensity operation to low-intensity operation.
  2. Some software safety checks were unintentionally bypassed whenever a particular 6-bit counter reached zero (a chance of one in 64). 
  3. Some of the hardware safety interlocks had been removed because their function was to be performed by the software. 
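The second fault can be sketched in Python as follows. This is only an illustration under stated assumptions: it takes the 6-bit width from the text above, and it assumes the counter doubled as a "check needed" flag that was incremented on each pass rather than set to a fixed nonzero value. The names (COUNTER_BITS, passes_that_skip_check) are hypothetical and are not taken from the Therac-25 software.

```python
COUNTER_BITS = 6                    # width as stated in the text above
MASK = (1 << COUNTER_BITS) - 1      # 0b111111 == 63


def passes_that_skip_check(num_passes: int) -> list:
    """Return the pass numbers on which the safety check would be bypassed.

    Assumption: a nonzero flag means "run the safety check", and the flag
    is incremented on every pass, so it wraps to zero periodically.
    """
    flag = 0
    skipped = []
    for pass_number in range(1, num_passes + 1):
        flag = (flag + 1) & MASK    # wraps to 0 after 63
        if flag == 0:
            skipped.append(pass_number)   # check silently bypassed
    return skipped


if __name__ == "__main__":
    print(passes_that_skip_check(200))
```

Under these assumptions the check is skipped on one pass in every 64, which matches the odds quoted above and helps explain why tests after each accident could appear to show nothing wrong.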

Three patients died as a direct result of radiation overdoses, and three others were injured, one very seriously.  Tests after each accident seemed to show nothing was wrong. 

Leveson and Turner's article on the accidents noted: "Most accidents involving complex technology are caused by a combination of organizational, managerial, technical, and, sometimes, sociological or political factors; preventing accidents requires paying attention to all the root causes, not just the precipitating event in a particular circumstance."  [Neu95 p68]

5.  AT&T 4ESS phone switch (1990)

New software was installed in 114 of these large switches that (it was hoped) reduced their signalling overhead by eliminating a signal that had been used to show a switch was ready to receive traffic again; other switches were expected to recognize that the switch was ready without this signal, based on the switch resuming activity. The software that recognized this condition had a latent fault.

About a month after the upgrade, a switch signalled that it couldn't accept traffic, reset and recovered itself, and resumed operation. A second switch received the signal and adjusted where it was sending traffic. The second switch didn't detect when the first one resumed operation, and failed to handle a command from the first one. The second switch then reset itself, triggering the same sequence of events in a third switch. Eventually all 114 switches were involved.

AT&T ran simulations to try to figure out the cause. After nine hours during which many long-distance calls were blocked nationwide (because the phone system backbone was spending so much time resetting itself), they stopped the cycle by forcing a reduction in the number of calls accepted into the network. The software bug was found and fixed later. It was traced to a misplaced break statement inside an if clause within a switch statement, which incidentally violated AT&T's coding standards. Around 5 million calls were estimated to have been blocked during the nine hours. [Neu95 p14]
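The way a single reset can spread through the whole network can be shown with a toy Python model. This is a rough sketch, not a reconstruction of the 4ESS software: the ring topology, the function names (ring, simulate_cascade), and the rule that every switch mishandles a neighbour's recovery exactly once are all simplifying assumptions.

```python
from collections import deque


def ring(n: int) -> dict:
    """Hypothetical topology: switch i talks to its two ring neighbours."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def simulate_cascade(neighbors: dict, first: int, fault_present: bool = True) -> list:
    """Return the order in which switches reset, starting from `first`.

    Assumption: when the fault is present, a switch that sees a neighbour
    recover mishandles the event and resets itself, which its own
    neighbours then see in turn. Each switch is counted only once.
    """
    reset_order = [first]
    seen = {first}
    pending = deque([first])
    while pending:
        recovering = pending.popleft()
        for other in neighbors[recovering]:
            if fault_present and other not in seen:
                seen.add(other)
                reset_order.append(other)
                pending.append(other)
    return reset_order


if __name__ == "__main__":
    print(len(simulate_cascade(ring(114), first=0)))
    print(simulate_cascade(ring(114), first=0, fault_present=False))
```

In this model one reset reaches all 114 switches when the fault is present and stays confined to the first switch when it is not. In the real incident switches reset repeatedly rather than once, which this sketch does not attempt to capture.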

6.  London Ambulance System (1992)

7.  Denver Airport baggage handling system (1994)

8.  Mars Climate Orbiter (1999)

9.  Virtual Case File system (2000-05)

References

[Neu95]  Peter G. Neumann.  Computer-Related Risks.  ACM Press, 1995.

Thomas A. Alspaugh, UC Irvine
Assistant Professor, Informatics Dept.
School of Information and Computer Sciences