(Last modified Thu Apr 17 22:42 2008)
The launch was called off at T-20 minutes. To improve reliability, the shuttle carried five on-board computers running its control software: four identical primary computers running the same software, and one backup running a different implementation written by a different team at a different company. An unrelated bug fix introduced a new fault that left the backup computer one clock cycle behind in one out of every 67 initializations. Unfortunately, the initialization for this launch was one of those times. The backup computer rejected commands from the primary computers as noise because they all appeared to arrive one cycle too soon (I am not familiar with all the details of this). Consequently, the commands from the primary computers telling the backup system to start its software were never accepted. The flight controllers could not determine why the backup software was not running, so they called off the launch. [Neu95 p20]
Software processing incoming credits had a 16-bit variable that overflowed and was not checked. Most of the variables were 32 bits, but not this one. The overflow prevented the bank from processing the credits, but the Federal Reserve software processing debits from the bank was working just fine. The bank was forced to borrow $24 billion for one day while the software was being fixed. [Neu95 p67]
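A minimal sketch of this kind of fault, in C. The source does not say what the 16-bit variable held, so the counter below and the names next_id_unchecked and next_id_checked are hypothetical illustrations, not the bank's code. The unchecked version silently wraps from 65,535 back to 0; the checked version reports the condition to the caller instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical unchecked 16-bit counter: unsigned arithmetic wraps
 * modulo 65,536, so incrementing 65,535 silently yields 0. */
uint16_t next_id_unchecked(uint16_t current)
{
    return (uint16_t)(current + 1u);
}

/* Hypothetical checked variant: refuses to wrap and tells the caller
 * that the counter is exhausted. *next is written only on success. */
bool next_id_checked(uint16_t current, uint16_t *next)
{
    if (current == UINT16_MAX) {
        return false;
    }
    *next = (uint16_t)(current + 1u);
    return true;
}
```

With the checked variant the caller has to decide what to do when the counter runs out, which is exactly the decision the unchecked code never made.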
I believe trillions of dollars have been spent on this boondoggle, which has always relied upon a hypothetical software system that could process satellite data on a ballistic missile attack, possibly involving hundreds or thousands of simultaneous launches as well as a much larger number of decoys (potentially several orders of magnitude larger), and coordinate a response that shoots down all the real missiles, all within a few minutes. Even assuming such a vastly complex software system could be produced, how could it be tested? [many sources]
The Therac-25 was a computer-controlled electron-accelerator radiation-therapy device. Eleven were installed, in locations spread across the country. An extensive investigation after the fact showed that three faults contributed to its failure.
Three patients died as a direct result of radiation overdoses, and three others were injured, one very seriously. Tests after each accident seemed to show nothing was wrong.
Leveson and Turner's article on the accidents noted "Most accidents involving complex technology are caused by a combination of organizational, managerial, technical, and, sometimes, sociological or political factors; preventing accidents requires paying attention to all the root causes, not just the precipitating event in a particular circumstance." [Neu95 p68]
New software was installed in 114 of these large switches
that (it was hoped) reduced their signalling overhead
by eliminating a signal that had been used to show a switch
was ready to receive traffic again;
other switches were expected to be able to recognize when the switch was ready
without this signal, based on the switch resuming activity.
The software to recognize this condition had a latent fault.
About a month after the upgrade,
a switch signalled it couldn't accept traffic,
reset and recovered itself, and resumed operation.
A second switch received the signal and adjusted where it was sending traffic.
The second switch didn't detect when the first one resumed operation,
and failed to handle a command from the first one.
The second switch then reset itself,
triggering the same sequence of events in a third switch.
All 114 switches eventually were involved.
AT&T ran simulations to try to figure out the cause;
after nine hours during which many long-distance calls were blocked nationwide
(because the phone system backbone was spending so much time resetting itself),
they stopped the cycle by forcing a reduction
in the number of calls accepted into the network.
The software bug was found and fixed later,
and was traced to an erroneous break statement inside an if
nested within a switch statement,
which incidentally violated AT&T's coding standards.
Around 5 million calls were estimated to have been blocked
during the nine hours.
[Neu95 p14]
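The C pitfall behind this kind of bug can be sketched as follows. This is not AT&T's code: the function names, the message kind, and the DID_* flags are all hypothetical. In C, a break inside an if does not leave the if; it leaves the nearest enclosing switch (or loop), so any code after the if within the same case is skipped.

```c
#include <stdbool.h>

/* Hypothetical flags recording which steps a handler performed. */
enum { DID_BUSY_PATH = 1, DID_IDLE_PATH = 2, DID_FINISH = 4 };

/* Buggy sketch: the break inside the if exits the whole switch,
 * so the DID_FINISH step is skipped whenever peer_busy is true. */
int handle_message_buggy(int kind, bool peer_busy)
{
    int done = 0;
    switch (kind) {
    case 1:
        if (peer_busy) {
            done |= DID_BUSY_PATH;
            break;              /* leaves the switch, not just the if */
        } else {
            done |= DID_IDLE_PATH;
        }
        done |= DID_FINISH;     /* never reached on the busy path */
        break;
    default:
        break;
    }
    return done;
}

/* Fixed sketch: without the inner break, both paths fall through
 * to the finishing step before the case ends. */
int handle_message_fixed(int kind, bool peer_busy)
{
    int done = 0;
    switch (kind) {
    case 1:
        if (peer_busy) {
            done |= DID_BUSY_PATH;
        } else {
            done |= DID_IDLE_PATH;
        }
        done |= DID_FINISH;
        break;
    default:
        break;
    }
    return done;
}
```

The two versions behave identically on the idle path, which is one reason a fault like this can stay latent until the rarely taken branch is exercised.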
[Neu95] Peter G. Neumann. Computer-Related Risks. ACM Press, 1995.