RISKS-LIST: RISKS-FORUM Digest  Wednesday 5 April 1989  Volume 8 : Issue 49

FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS
ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  An unusual "common mode failure" in B-1B aircraft (PGN)
  Gripen crash caused by flight control software (Mitchell Charity, Mike Nutley)
  Airbus A320 article plus some comments (Nancy Leveson) [long]

----------------------------------------------------------------------

Date: Wed, 5 Apr 1989 10:44:35 PDT
From: Peter Neumann
Subject: An unusual "common mode failure" in B-1B aircraft

A rather bizarre common mode failure has been detected in the recent inspection of grounded B-1B bombers: there was a shortage of lubricant in a critical gearbox in 70 of the 80 planes inspected (with 17 more still to go). The problem was found on the plane whose wing swept into the fuel tank (RISKS-8.46), which resulted in two shafts fractured and a leak along a fuel tank seam. [San Francisco Chronicle, 5 April 1989, p. A7]

------------------------------

Date: Wed, 5 Apr 89 01:15:31 EDT
From: mcharity@ATHENA.MIT.EDU
Subject: Gripen crash caused by flight control software

(quotes&inserts from FLIGHT INTERNATIONAL, 25 March 1989)

On Feb 2 the 1st prototype (of 5) of Sweden's Saab JAS39 Gripen fighter crashed on landing after its 6th test flight. It impacted, broke left main gear, bounced, skidded and flipped.

``Gripen is naturally unstable and has a triplex digital fly-by-wire system with a triplex analogue backup.''

Initial flight was ``some 18 months behind schedule'' and this was ``attributed to difficulties in proving the software for the flight control system.''

After the 1st flight, test pilot ``remarked that the control system seemed too sensitive and that the control laws would probably need to be changed.'' On all flights ``the aircraft experienced problems with lateral oscillations.'' [On the] ``last flight oscillation in pitch was also apparent.''

The accident investigation committee chairman ``confirms earlier assumptions that the flight control system was at fault.''

Chairman: ``The accident was caused by the aircraft experiencing increasing pitch oscillations (divergent dynamic instability) in the final stage of landing, the oscillations becoming uncontrollable. This was because movement of the stick in the pitch axis exceeded the values predicted when designing the flight control system, whereby the stability margins were exceeded at the critical frequency.''

Separate investigation by the JAS Industry Group: ``The control laws implemented in the flight-control system's computer had deficiencies with respect to controlling the pitch axis at low speed. In this case, the pilot's control commands were subjected to such a delay that he was out of phase with the aircraft's motion.''

``the company hopes to fly JAS39-2 before the end of the year.''

``Delivery of the first production aircraft [...] is now expected in [1993, although typo said `1933'], instead of 1992.''

------------------------------

Subject: Swedish Gripen Fighter Crash
Date: Wed, 5 Apr 89 17:09:44 BST
From: jpff@maths.bath.ac.uk
Sender: jpff@maths.bath.ac.uk

From Datalink, April 3 1989 (a British paper for system/software) quoted in full without permission.

Swedish wind cuts fly-by-wires

Flight-control software has been blamed for the crash of the prototype Swedish Gripen fighter last February.
The preliminary report from the Swedish government's crash-investigation commission identified the software's inability to cope with gusting winds and the oversensitivity of the control system as the prime reasons for the accident.

According to a spokesman for the commission, problems with the £3.2 billion project first arose in an earlier flight test. "The preceding test flight had shown up problems, but it's not a problem with the aircraft or with the flight control systems. It's a software problem.

"The whole of the control system was too sensitive for the pilot; it operated too fast. It was too easy for the pilot to go outside the flight-control envelope into unstable flight."

In common with many fighters currently being developed, the JAS39 Gripen is designed to be inherently unstable to increase its manoeuvrability. It relies on the software to keep it under control.

"There were limitations on the flight control systems, but during the landing phase the wind was stronger than allowed for by these limitations. The pilot had to try to overcome them."

A final report into the crash is due in May, but work has already started on the second prototype aircraft, including a modified version of the flight-control software.

Mike Nutley

------------------------------

Date: Wed, 05 Apr 89 13:54:29 -0400
From: leveson@electron.LCS.MIT.EDU
Subject: Airbus A320 article plus some comments

Here is the full Washington Post article, interspersed with a few of my comments.

WASHINGTON POST: OUTLOOK, 04/02/89
Copyright (c) 1989 The Washington Post Co.
By Jim Beatson

[Jim Beatson writes on aviation issues for the Guardian and some other British newspapers. He is currently living in Canada. NGL]

[Apparently either Beatson or the Post removed some more controversial items from the original British appearance of this material. PGN]

IN JUNE, a new plane hits the American skies. Northwest Airlines will become the first U.S.
carrier to take delivery of the European Airbus A320 -- the most advanced passenger aircraft in the world, and already one of the most controversial. In use since last May by British Airways and Air France, the medium-sized 150-seat twin-engine jet is the first airliner to have every function, from flight controls to toilet operation, directed by computer.

On June 26, 1988, two days after the third A320 went into service, it crashed while performing a low-level pass at a French air show. A woman and two children on board were killed. An investigation blamed the accident on pilot error, but the pilot faulted a number of factors including the aircraft's computers for providing incorrect altitude information. (The pilot, a senior Air France captain, was subsequently dismissed.)

Since then, various unsettling reports have appeared in the European press, regarding: engines unexpectedly throttling up on final approach; inaccurate altimeter readings; sudden power loss prior to landing; steering problems while taxiing.

[NGL: It is interesting that the pilot was never believed about the altimeter although there is now plenty of evidence to back up his story. I have noticed several things about evaluation of accidents in general:

1) Human error is always the first ascribed cause whenever a human is involved in the system where an accident occurred. However, most accidents are multi-factorial. If the altimeter is indeed inaccurate, then the accident was only partially caused by the pilot. Humans tend to want simple answers to complex problems and to be able to ascribe blame to some single cause. There are, of course, other factors at work in these oversimplifications such as liability issues and misplaced faith in technology. But seldom are accidents the result of only one thing going wrong. Actually, the few times I have found this to be true (i.e., one thing is at fault), it is a computer that is the primary agent.
Perhaps engineers expect other things to fail and therefore design systems so that a single failure cannot lead to an accident. But since (as engineers often tell me or write in system safety evaluations) computer software does not fail...

2) If a human cannot be blamed, then the hardware is. The first incident involving the Therac 25 occurred in Hamilton, Ontario. The accident was blamed on a faulty microswitch (a "transient" failure since nothing could be found wrong with the microswitch). The fix for the problem was to put in a duplicate microswitch to detect when the filter was not in place to correctly filter the X-ray beam.

When the next incident occurred in Tyler, Texas (again involving the misalignment of the filter), it was believed that the burn suffered by the patient (who died from his injuries 6 months later) was electrical. Nobody believed that he could have suffered an overdose or that the computer could be involved. The electrical system was checked out and found to be OK so the machine was deemed safe. Two weeks later another man was overdosed in Tyler (he died two weeks after this) and FINALLY, someone (at the hospital) decided the computer might be involved. It was the physicist at the hospital who was able to reproduce the problem and raise an alarm about the computer. He had some difficulty convincing anyone else about this.

The Therac 25 victim in Georgia had great trouble convincing anyone that the Therac was responsible for her severe burns. This was true also for the first overdose in Yakima. Finally, when the second person was overdosed in Yakima (and all the prior incidents had occurred including the detection of an error in the software that could have caused the incidents), people were willing to examine the possibility that this was a software error (a different software error was given the blame this time).

Why are people so reluctant to believe that the computer may be at fault?]

[returning to the Washington Post article]

Of course, the introduction of any new aircraft entails shake-out problems of one kind or another. But the A320's extensive use of computers raises a new set of questions: Are we ready to rely so heavily on complex software systems for such safety-critical applications as commercial flight?

Bird on a Wire

The control system employed by the A320 is known as "fly by wire." FBW replaces the conventional stick and rudder controls with a series of computers and miles of electronic cables. Instead of the familiar control-column, the pilots use "side-sticks," a single lever resembling the joy sticks used in video games.

Sensing devices which gauge the aircraft's flight characteristics pass the information to the six color monitors that replace nearly all the traditional analog instruments and result, Airbus says, in 75 percent fewer instruments than conventional configurations.

On the uncluttered flight deck, the pilot on the right uses the side-stick with the right hand while the pilot on the left has a left-handed version. (On the left, pilots tend to push the aircraft to the right owing to the position of the forearm and wrist; that side-stick was adjusted to compensate.)

But the computer system actually directs the control surfaces. Only the rudder and horizontal stabilizer -- both on the tail -- can be mechanically directed by the pilot. All other flight controls are managed by the electrical flight-control system (EFCS), which contains three spoiler/elevator computers (SEC), two elevator/aileron computers (ELAC) and the flight-augmentation computer that oversees stability, limiting and protection functions. The engines and throttles are managed by the full-authority digital engine-control (FADEC) computers.

The EFCS uses "dissimilar redundancy."
That is, computers that are designed to back each other up are of different brands, have different microprocessor types and are supplied by different vendors -- all to minimize the likelihood of identical hardware parts failing at the same time. And different programmers were employed to write each of the parallel sets of software. Moreover, each computer is divided into two physically separate units with "segregated" power supplies.

[NGL: There were different programmers. Were there different requirements specifications? How about design specifications? How much detailed design information was provided to the programmers?]

The EFCS is designed to fly within a theoretical "flight envelope" -- permissible ranges for various maneuvers -- thus providing computer-monitored protection against windshear forces, overload or overspeed conditions. If the pilot were to, say, allow the speed to drop toward the stall point, the computer would sound alarms and automatically increase the power.

In the event that two computers should disagree, one automatically shuts itself down and its tasks are carried out by the other. For example, if one unit directed the flaps to be partly extended and its monitoring software expected full flap extension, then the first unit would automatically shut itself down and its functions would be passed over to the other. The pilots' display monitors would tell them what had happened. Finally, each of the five flight-control-surface computers is capable of performing all of the essential tasks of the others as well as its own tasks.

[NGL: If two computers disagree, how is it determined which computer to shut down? It does not sound like the pilots do this, they are just told about the event afterward (and may not have the information necessary to make this decision anyway). So how is the decision made? How do they know that the monitor is correct and the other one is not?]
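
[Sketch, not from the article: the shutdown-on-disagreement scheme described above can be illustrated with a toy model. Everything in it is an assumption made for illustration; the names SelfCheckingUnit and fly, the control laws and the tolerance are hypothetical and are not taken from the A320's EFCS. Each unit pairs a command channel with an independently written monitor; when the two differ by more than the tolerance, the unit shuts itself down and the next unit takes over. The sketch shows two things the comments above ask about: a unit whose command is right but whose monitor is faulty still shuts down, because a bare comparison cannot say which side is wrong, and a fault shared by both channels produces no disagreement at all.]

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SelfCheckingUnit:
    """Hypothetical command/monitor pair; not the real EFCS design."""
    name: str
    command: Callable[[float], float]   # computes the surface command
    monitor: Callable[[float], float]   # independently written cross-check
    tolerance: float = 0.5              # assumed allowable difference
    active: bool = True

    def step(self, stick: float) -> Optional[float]:
        """Return the command, or None if the unit is (or becomes) shut down."""
        if not self.active:
            return None
        cmd = self.command(stick)
        if abs(cmd - self.monitor(stick)) > self.tolerance:
            # The comparison only detects a mismatch; it cannot tell
            # whether the command or the monitor is the faulty side.
            self.active = False
            return None
        return cmd


def fly(units: List[SelfCheckingUnit], stick: float) -> Optional[Tuple[str, float]]:
    """Use the first unit that is still active and self-consistent."""
    for unit in units:
        out = unit.step(stick)
        if out is not None:
            return unit.name, out
    return None  # every unit has shut itself down


def intended_law(stick: float) -> float:
    return 2.0 * stick            # assumed "correct" control law


def faulty_law(stick: float) -> float:
    return 2.0 * stick + 3.0      # assumed faulty implementation


# Case 1: unit A has a correct command but a faulty monitor.
unit_a = SelfCheckingUnit("A", command=intended_law, monitor=faulty_law)
unit_b = SelfCheckingUnit("B", command=intended_law, monitor=intended_law)
takeover = fly([unit_a, unit_b], stick=1.0)

# Case 2: both channels of unit C share the same fault (e.g. a common
# specification error), so they agree and the wrong command is accepted.
unit_c = SelfCheckingUnit("C", command=faulty_law, monitor=faulty_law)
common_mode = fly([unit_c], stick=1.0)

print(takeover, unit_a.active, common_mode)
```

In the first case unit A is lost even though its command channel was right, and unit B carries on. In the second case the comparison passes and the faulty command goes through unchallenged, which is the situation the question about shared requirements specifications is probing.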

The Airbus A320, of course, is not the first civilian aircraft to use computerized control. Boeing's 757 and 767, for example, have computer-activated spoilers; and Boeing had planned to use FBW technology in the 7J7 but subsequently deferred development.

Joe Sutter, Boeing's chief engineer for the past 20 years, believes that "fly-by-wire is way overstated as to its benefits"; and as for the side-stick system, "we have some reservations -- like what one pilot is doing is not obvious to the other."

"The main benefit of FBW," he says, "is to reduce weight and increase range. It will really boost safety. But fooling around with FBW to reduce [something like] tail size goes against the design philosophy I have always urged -- that you've got to design an aircraft which one day for some reason or other is going to get into a hell of a lot of trouble." That means mechanical back-up systems for the main control surfaces. "What happens with FBW when the aircraft gets outside its control laws? It's going to leave the pilot in one hell of a lot of trouble -- for what? One-percent fuel burn?"

A great deal more than that, says Airbus, which believes it now enjoys a significant competitive advantage over Boeing and McDonnell Douglas in fuel and weight savings. An Air France official says that the Airbus A320 is 40 percent more fuel efficient than the old Boeing 727s they have replaced. He was expecting 8 to 9 percent better, "but it's a good result anyway."

How Safe Is Safe?

But for all FBW's advantages, critics argue that its sophisticated computer system may be too far ahead of its time because of our relatively limited ability to test the reliability of software.

Airbus Industry executive Robert Alizart believes that the duplicate architecture "reduces the chances of a total system loss to an absolute minimum."
But Martyn Thomas, chairman of Praxis Systems, which produces special high-reliability software for Britain's Air Force, believes such precautions offer no guarantees. "Errors get through," Thomas says. "There may be common sources of error, such as a faulty specification, which cause the same mistakes in every version of the program. Identical errors may be made by independent teams. Testing only exercises a small proportion of the possible situations that the program may have to handle."

Peter Neumann, a computer scientist at S.R.I. International, a Menlo Park, Calif., think tank, is a specialist in software engineering who has documented hundreds of software failure cases in the aerospace and other industries. Neumann says, "There are very serious risks in reliance on software in safety-critical applications. A seemingly innocuous addition to the software could have disastrous effects not discovered in testing. Never trust anyone who says such failures can never happen."

The task facing testers is prodigious. "For even small amounts of software," says Thomas, "the number of possible paths far exceeds the number which could realistically be tested. For example, a recent module comprising 100 lines of assembly code was analyzed and found to contain 38 million possible paths, of which 500,000 could be followed with valid input data."

Mike Hennell, head of Computational Mathematics at Liverpool University -- an authority on software reliability -- has not examined the A320's software code. Still, he says: "I wouldn't get into an Airbus A320 or any fly-by-wire aircraft."

"We don't have the technology yet to tell if the programs have been adequately tested. We don't know what 'adequately tested' means. We can't predict what errors are left after testing, what their frequency is or what their impact will be. If, after testing over a long period, the program has not crashed, then it is assumed to be okay.
That presupposes that they will have generated all of the sort of data that will come at it in real life -- and it is not clear that that will be true."

Indeed, scientists have been working for 15 years on software reliability models, writes John Musa of AT&T's Bell Laboratories in the February issue of IEEE Spectrum. And they are now "moving into practice and starting to pay off." But they "deal with average rather than specific behavior, since the random nature of program usage and fault introduction generates failures at random." In the case of an airline reservation system, for example, "it is impossible to predict the next specific input and hence the next specific failure. Average behavior, however, can be characterized."

The international design standard for airborne software systems (RTCA DO-178A) was developed by the Washington-based Radio Technical Commission for Aeronautics. Nancy Leveson, a specialist in software safety research and currently a visiting professor at MIT, says that DO-178A is "not adequate for certifying commercial aircraft software. It lacks any mention of formal verification of safety, as required, for example, by the Department of Defense" which demands safety and hazard analysis.

The FAA does, however, oblige developers "to use certain accepted concepts for design and development," says Mike DeWalt, an aircraft computer software specialist with the FAA. Although FAA officials do not see all the programming ("obviously there's no way in the world that a review agency could look at that much code"), they do demand adequate testing and quality evaluation, and even sample the programmers' work.

"Basically, we take a slice through the whole system," says DeWalt. That is, pick a function like left aileron control and "follow it all the way down through testing and configuration management."

"I don't want to imply that manufacturers and subcontractors will not do their best," Leveson says.
"After all, they have the liability, and I'm sure they are decent human beings who care about human life. The problem is that without external review, we are depending on the competence of the employees of these companies, and I am less sanguine about the general state of software engineering knowledge and practice in industry than I am about the good intentions of humans." Daryl Pederson, deputy director of the FAA's Aircraft Certification Division and the man charged with certifying the A320, says of DO-178A, "The document recognizes that you can't test every situation you encounter." His British counterpart, Brian Perry, head of Avionics and Electrical Systems at the Civil Aviation Authority, agrees: "It's true that we are not able to establish to a fully verifiable level that the A320 software has no errors. It's not satisfactory, but it's a fact of life." Computers in the Sky Nonetheless, FBW offers the pilot some real gains. In extreme situations such as suddenly encountering strong windshear, the computers instantaneously compensate. Gordon Gorbes, chief test pilot for Airbus, says, "If a pilot has to make violent changes to the aircraft's attitude in an emergency, then the computer will prevent the pilot pushing it past design strengths. For example, the computer would prevent the pilot putting it into a dive that might break off the tail." And FBW saves money for the plane's owner, by reducing hardware costs, keeping the aircraft at optimum fuel-saving trim and facilitating the switch from three- to two-person flight crews. Many pilots flying the A320 have been enthusiastic in praising its handling and flying qualities. But some have complained about software problems and control irregularities. (The number of such complaints, according to Airbus' technical director, Bernard Ziegler, is small.) 

One problem reported by Air France, in a memo dated July 10, 1988 to Airbus, noted a software bug in its altimeter which measures the aircraft's height, a problem which has also been observed with British Airways' A320s. It is this problem that the pilot of the A320 that crashed at the small French airport at Mulhouse last June claimed contributed to the accident.

[NGL: And which no one believed at the time.]

There are various ways to fix a bug or add to a plane's installed software. Complete boxes containing replacement hardware and software can be exchanged by Airbus Industries. For carriers like Northwest, with 100 aircraft on order, this option would be expensive. So reprogramming could take place at a keyboard in the aircraft, conducted by Airbus or Northwest engineers.

With over 640 aircraft on order around the world using two different makes of engine and a variety of sub-systems, the problem of "configuration management," as it is termed in the computer industry, becomes apparent.

[NGL: Note that a configuration management problem involving a navigation computer was implicated in the Antarctica crash of the Air New Zealand plane into Mount Erebus. Of course, planes are not sent back to the factory for all of the hardware design changes that occur -- usually the maintenance crew handles them. Is the problem different for software?]

So does the problem of anticipating a near-infinitude of real-life contingencies. In 1983 a United Airlines Boeing 767 went into a four-minute powerless glide after the pilot was compelled to shut down both engines because of overheating. The National Transportation Safety Board discovered that the plane's computerized engine-management system had ordered the engines to run at a relatively slow speed to optimize fuel efficiency. In the flight's particular atmospheric circumstances, however, this had allowed ice to build up on some engine surfaces, reducing the flow of air and causing the engines to work harder and overheat.
"The problem is that the designer didn't anticipate all the possible demands the software would face," says Hennell. "The computer will always do something. But it will only do the correct thing if it has been programmed for that situation." ------------------------------ End of RISKS-FORUM Digest 8.49 ************************ -------