RISKS-LIST: RISKS-FORUM Digest  Wednesday 5 April 1989   Volume 8 : Issue 49

        FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS
   ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  An unusual "common mode failure" in B-1B aircraft (PGN)
  Gripen crash caused by flight control software (Mitchell Charity, Mike Nutley)
  Airbus A320 article plus some comments (Nancy Leveson) [long]

----------------------------------------------------------------------

Date: Wed, 5 Apr 1989 10:44:35 PDT
From: Peter Neumann <neumann@csl.sri.com>
Subject: An unusual "common mode failure" in B-1B aircraft

A rather bizarre common mode failure has been detected in the recent inspection
of grounded B-1B bombers: there was a shortage of lubricant in a critical
gearbox in 70 of the 80 planes inspected (with 17 more still to go).  The
problem was found on the plane whose wing swept into the fuel tank
(RISKS-8.46), which resulted in two shafts fractured and a leak along a fuel
tank seam.  [San Francisco Chronicle, 5 April 1989, p. A7]

------------------------------

Date: Wed, 5 Apr 89 01:15:31 EDT
From: mcharity@ATHENA.MIT.EDU
Subject: Gripen crash caused by flight control software

(quotes&inserts from FLIGHT INTERNATIONAL, 25 March 1989)

On Feb 2 the 1st prototype (of 5) of Sweden's Saab JAS39 Gripen fighter crashed
on landing after its 6th test flight.  It impacted, broke left main gear,
bounced, skidded and flipped.

``Gripen is naturally unstable and has a triplex digital fly-by-wire system
with a triplex analogue backup.''

Initial flight was ``some 18 months behind schedule'' and this was ``attributed
to difficulties in proving the software for the flight control system.''

After the 1st flight, test pilot ``remarked that the control system seemed
too sensitive and that the control laws would probably need to be changed.''
On all flights ``the aircraft experienced problems with lateral oscillations.''
[On the] ``last flight oscillation in pitch was also apparent.''

The accident investigation committee chairman ``confirms earlier assumptions
that the flight control system was at fault.''

Chairman:
``The accident was caused by the aircraft experiencing increasing pitch
 oscillations (divergent dynamic instability) in the final stage of landing,
 the oscillations becoming uncontrollable.  This was because movement
 of the stick in the pitch axis exceeded the values predicted when
 designing the flight control system, whereby the stability margins were
 exceeded at the critical frequency.''

Separate investigation by the JAS Industry Group:
``The control laws implemented in the flight-control system's computer had
 deficiencies with respect to controlling the pitch axis at low speed.
 In this case, the pilot's control commands were subjected to such a delay
 that he was out of phase with the aircraft's motion.''
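
The delay effect described by the JAS Industry Group can be illustrated
with a toy model.  The sketch below is not the Gripen control law.  It
assumes a hypothetical pilot who commands a pitch rate proportional to the
pitch error he saw `delay_steps` samples earlier; the gain, time step and
delays are chosen only for illustration.

```python
def simulate_pitch(delay_steps, gain=2.0, dt=0.05, steps=600):
    """Toy loop: theta[k+1] = theta[k] - gain*dt*theta[k - delay_steps].

    All parameters are hypothetical and are not taken from the JAS39.
    """
    # Start from a constant pitch error of 1.0 over the whole delay window.
    theta = [1.0] * (delay_steps + 1)
    for _ in range(steps):
        delayed_error = theta[-1 - delay_steps]  # what the "pilot" reacts to
        theta.append(theta[-1] - gain * dt * delayed_error)
    return theta[delay_steps:]


def late_peak(trace, tail=100):
    """Largest pitch error magnitude over the last `tail` samples."""
    return max(abs(x) for x in trace[-tail:])


if __name__ == "__main__":
    for delay in (0, 4, 30):  # 0 s, 0.2 s and 1.5 s of delay at dt = 0.05 s
        print(delay, late_peak(simulate_pitch(delay)))
```

In this model the same gain that damps the error with a 0.2 s delay
produces growing oscillations with a 1.5 s delay, because the correction
arrives out of phase with the motion it is meant to oppose.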

``the company hopes to fly JAS39-2 before the end of the year.''
``Delivery of the first production aircraft [...] is now expected
 in [1993, although typo said `1933'], instead of 1992.''

------------------------------

Subject: Swedish Gripen Fighter Crash
Date: Wed, 5 Apr 89 17:09:44 BST
From: jpff@maths.bath.ac.uk
Sender: jpff@maths.bath.ac.uk

From Datalink, April 3 1989 (a British paper for system/software)
quoted in full without permission.

                Swedish wind cuts fly-by-wires

Flight-control software has been blamed for the crash of the prototype
Swedish Gripen fighter last February.  The preliminary report from the
Swedish government's crash-investigation commission identified the
software's inability to cope with gusting winds and the oversensitivity of
the control system as the prime reasons for the accident.

According to a spokesman for the commission, problems with the £3.2
billion project first arose in an earlier flight test.  "The preceding test
flight had shown up problems, but it's not a problem with the aircraft or
with the flight control systems.  It's a software problem.

"The whole of the control system was too sensitive for the pilot; it
operated too fast.  It was too easy for the pilot to go outside the
flight-control envelope into unstable flight."

In common with many fighters currently being developed, the JAS39
Gripen is designed to be inherently unstable to increase its
manoeuvrability.  It relies on the software to keep it under control.

"There were limitations on the flight control systems, but during the
landing phase the wind was stronger than allowed for by these
limitations.  The pilot had to try to overcome them."

A final report into the crash is due in May, but work has already
started on the second prototype aircraft, including a modified version
of the flight-control software.
                                         Mike Nutley

------------------------------

Date: Wed, 05 Apr 89 13:54:29 -0400
From: leveson@electron.LCS.MIT.EDU
Subject: Airbus A320 article plus some comments

Here is the full Washington Post article, interspersed with a few of my
comments.

WASHINGTON POST: OUTLOOK, 04/02/89
Copyright (c) 1989 The Washington Post Co.
By Jim Beatson

    [Jim Beatson writes on aviation issues for the Guardian and some other
     British newspapers.  He is currently living in Canada.  NGL]
         [Apparently either Beatson or the Post removed some more controversial
         items from the original British appearance of this material.  PGN]

   IN JUNE, a new plane hits the American skies. Northwest Airlines will become
the first U.S. carrier to take delivery of the European Airbus A320 -- the most
advanced passenger aircraft in the world, and already one of the most
controversial. In use since last May by British Airways and Air France, the
medium-sized 150-seat twin-engine jet is the first airliner to have every
function, from flight controls to toilet operation, directed by computer.
   On June 26, 1988, two days after the third A320 went into service, it
crashed while performing a low-level pass at a French air show. A woman and two
children on board were killed. An investigation blamed the accident on pilot
error, but the pilot faulted a number of factors including the aircraft's
computers for providing incorrect altitude information. (The pilot, a senior
Air France captain, was subsequently dismissed.) Since then, various unsettling
reports have appeared in the European press, regarding: engines unexpectedly
throttling up on final approach; inaccurate altimeter readings; sudden power
loss prior to landing; steering problems while taxiing.

       [NGL:  It is interesting that the pilot was never believed about
       the altimeter although there is now plenty of evidence to back up
       his story.  I have noticed several things about evaluation of
       accidents in general:

       1) Human error is always the first ascribed cause whenever a human
       is involved in the system where an accident occurred.  However,
       most accidents are multi-factorial.  If the altimeter is indeed
       inaccurate, then the accident was only partially caused by the
       pilot.  Humans tend to want simple answers to complex problems and
       to be able to ascribe blame to some single cause.  There are, of
       course, other factors at work in these oversimplifications such
       as liability issues and misplaced faith in technology.  But seldom
       are accidents the result of only one thing going wrong.  Actually,
       the few times I have found this to be true (i.e., one thing is at
       fault), it is a computer that is the primary agent.  Perhaps engineers
       expect other things to fail and therefore design systems so that a
       single failure cannot lead to an accident.  But since (as engineers
       often tell me or write in system safety evaluations) computer software
       does not fail...

       2) If a human cannot be blamed, then the hardware is.  The first
       incident involving the Therac 25 occurred in Hamilton, Ontario.
       The accident was blamed on a faulty microswitch (a "transient"
       failure since nothing could be found wrong with the microswitch).
       The fix for the problem was to put in a duplicate microswitch to
       detect when the filter was not in place to correctly filter
       the X-ray beam.  When the next incident occurred in Tyler,
       Texas (again involving the misalignment of the filter), it was
       believed that the burn suffered by the patient (who died from his
       injuries 6 months later) was electrical.  Nobody believed that he
       could have suffered an overdose or that the computer could be
       involved.  The electrical system was checked out and found to be
       OK so the machine was deemed safe.  Two weeks later another man
       was overdosed in Tyler (he died two weeks after this) and FINALLY,
       someone (at the hospital) decided the computer might be involved.
       It was the physicist at the hospital who was able to reproduce the
       problem and raise an alarm about the computer.  He had some difficulty
       convincing anyone else about this.  The Therac 25 victim in Georgia
       had great trouble convincing anyone that the Therac was responsible
       for her severe burns.  This was true also for the first overdose in
       Yakima.  Finally, when the second person was overdosed in Yakima
       (and all the prior incidents had occurred including the detection
       of an error in the software that could have caused the incidents),
       people were willing to examine the possibility that this was a
       software error (a different software error was given the blame
       this time).  Why are people so reluctant to believe that the
       computer may be at fault?]

[returning to the Washington Post article]

   Of course, the introduction of any new aircraft entails shake-out
problems of one kind or another. But the A320's extensive use of
computers raises a new set of questions: Are we ready to rely so heavily
on complex software systems for such safety-critical applications as
commercial flight?

   Bird on a Wire

The control system employed by the A320 is known as "fly by wire." FBW
replaces the conventional stick and rudder controls with a series of
computers and miles of electronic cables. Instead of the familiar
control-column, the pilots use "side-sticks," a single lever resembling
the joy sticks used in video games.
   Sensing devices which gauge the aircraft's flight characteristics
pass the information to the six color monitors that replace nearly all
the traditional analog instruments and result, Airbus says, in 75
percent fewer instruments than conventional configurations. On the
uncluttered flight deck, the pilot on the right uses the side-stick with
the right hand while the pilot on the left has a left-handed version.
(On the left, pilots tend to push the aircraft to the right owing to the
position of the forearm and wrist; that side-stick was adjusted to
compensate.) But the computer system actually directs the control surfaces.
Only the rudder and horizontal stabilizer -- both on the tail -- can be
mechanically directed by the pilot.
   All other flight controls are managed by the electrical flight-
control system (EFCS), which contains three spoiler/elevator computers
(SEC), two elevator/aileron computers (ELAC) and the flight-augmentation
computer that oversees stability, limiting and protection functions. The
engines and throttles are managed by the full-authority digital engine-
control (FADEC) computers. The EFCS uses "dissimilar redundancy." That
is, computers that are designed to back each other up are of different
brands, have different microprocessor types and are supplied by
different vendors -- all to minimize the likelihood of identical hardware
parts failing at the same time. And different programmers were employed
to write each of the parallel sets of software. Moreover, each computer
is divided into two physically separate units with "segregated" power
supplies.

      [NGL:  There were different programmers.  Were there different
       requirements specifications?  How about design specifications?
       How much detailed design information was provided to the programmers?]

   The EFCS is designed to fly within a theoretical "flight
envelope" -- permissible ranges for various maneuvers -- thus providing
computer-monitored protection against windshear forces, overload or
overspeed conditions. If the pilot were to, say, allow the speed to drop
toward the stall point, the computer would sound alarms and
automatically increase the power.
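
The envelope idea can be sketched in a few lines.  This is a minimal
illustration, not the A320's logic: the `Envelope` fields, the limit
values and the alarm names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Hypothetical limits; the values used below are illustrative only."""
    min_speed_kt: float
    min_pitch_deg: float
    max_pitch_deg: float


def protect(env, speed_kt, pitch_cmd_deg, thrust_cmd):
    """Return (pitch, thrust, alarms) after applying the envelope limits."""
    alarms = []
    # Clamp the pilot's pitch demand into the permitted range.
    pitch = min(max(pitch_cmd_deg, env.min_pitch_deg), env.max_pitch_deg)
    if pitch != pitch_cmd_deg:
        alarms.append("PITCH LIMIT")
    # Near the stall speed, raise an alarm and override the thrust setting.
    thrust = thrust_cmd
    if speed_kt < env.min_speed_kt:
        thrust = 1.0
        alarms.append("LOW SPEED")
    return pitch, thrust, alarms


if __name__ == "__main__":
    env = Envelope(min_speed_kt=120.0, min_pitch_deg=-15.0, max_pitch_deg=30.0)
    print(protect(env, speed_kt=110.0, pitch_cmd_deg=45.0, thrust_cmd=0.3))
```

The limits are only as good as the assumptions behind them: a command
outside the range is overridden whether or not the situation is one the
designers foresaw.
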
   In the event that two computers should disagree, one automatically
shuts itself down and its tasks are carried out by the other. For
example, if one unit directed the flaps to be partly extended and its
monitoring software expected full flap extension, then the first unit
would automatically shut itself down and its functions would be passed
over to the other. The pilots' display monitors would tell them what had
happened. Finally, each of the five flight-control-surface computers is
capable of performing all of the essential tasks of the others as well
as its own tasks.

      [NGL:  If two computers disagree, how is it determined which computer
       to shut down?  It does not sound like the pilots do this, they are
       just told about the event afterward (and may not have the information
       necessary to make this decision anyway).   So how is the decision
       made?  How do they know that the monitor is correct and the other
       one is not?]
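
One possible reading of the shutdown rule described in the article is a
command/monitor pair inside each unit.  The sketch below assumes that
arrangement; the function names, the tolerance and the flap values are
hypothetical and do not come from Airbus documentation.

```python
def run_unit(command_fn, monitor_fn, inputs, tolerance):
    """Return the unit's command, or None if its two halves disagree."""
    command = command_fn(inputs)
    expected = monitor_fn(inputs)
    if abs(command - expected) > tolerance:
        return None  # the unit shuts itself down
    return command


def select_output(units, inputs, tolerance=0.5):
    """Use the first unit whose command and monitor halves agree."""
    shut_down = []
    for name, command_fn, monitor_fn in units:
        output = run_unit(command_fn, monitor_fn, inputs, tolerance)
        if output is not None:
            return name, output, shut_down
        shut_down.append(name)
    raise RuntimeError("all units shut down")


if __name__ == "__main__":
    units = [
        # Unit A commands partial flap while its monitor expects full flap.
        ("unit_A", lambda x: 20.0, lambda x: 40.0),
        ("unit_B", lambda x: 40.0, lambda x: 40.0),
    ]
    print(select_output(units, inputs={}))
```

Under this scheme nothing decides which half of a unit was wrong.  The
whole unit drops out on any disagreement, and the next unit's output is
trusted only because its own two halves happen to agree.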

   The Airbus A320, of course, is not the first civilian aircraft to use
computerized control. Boeing's 757 and 767, for example, have computer-
activated spoilers; and Boeing had planned to use FBW technology in the
7J7 but subsequently deferred development. Joe Sutter, Boeing's chief
engineer for the past 20 years, believes that "fly-by-wire is way
overstated as to its benefits"; and as for the side-stick system, "we
have some reservations -- like what one pilot is doing is not obvious to
the other."
   "The main benefit of FBW," he says, "is to reduce weight and increase
range. It will really boost safety. But fooling around with FBW to
reduce [something like] tail size goes against the design philosophy I
have always urged -- that you've got to design an aircraft which one day
for some reason or other is going to get into a hell of a lot of trouble."
That means mechanical back-up systems for the main control surfaces.
"What happens with FBW when the aircraft gets outside its control laws?
It's going to leave the pilot in one hell of a lot of trouble -- for what?
One-percent fuel burn?"
   A great deal more than that, says Airbus, which believes it now
enjoys a significant competitive advantage over Boeing and McDonnell
Douglas in fuel and weight savings. An Air France official says that the
Airbus A320 is 40 percent more fuel efficient than the old Boeing 727s
they have replaced. He was expecting 8 to 9 percent better, "but it's a
good result anyway."

   How Safe Is Safe?

But for all FBW's advantages, critics argue that its sophisticated
computer system may be too far ahead of its time because of our
relatively limited ability to test the reliability of software.
   Airbus Industry executive Robert Alizart believes that the duplicate
architecture "reduces the chances of a total system loss to an absolute
minimum." But Martyn Thomas, chairman of Praxis Systems, which produces special
high-reliability software for Britain's Air Force, believes such precautions
offer no guarantees. "Errors get through," Thomas says.  "There may be common
sources of error, such as a faulty specification, which cause the same mistakes
in every version of the program. Identical errors may be made by independent
teams. Testing only exercises a small proportion of the possible situations
that the program may have to handle."
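
Thomas's point about a shared specification can be made concrete with a
small voting example.  Everything in it is hypothetical: it assumes the
intent was to warn at or below 120 knots, while the written specification
said "below 120", and that three teams each implemented the specification
faithfully in their own way.

```python
from collections import Counter

STALL_WARNING_KT = 120  # hypothetical threshold from the (faulty) spec


def version_a(speed_kt):
    return speed_kt < STALL_WARNING_KT


def version_b(speed_kt):
    return not speed_kt >= STALL_WARNING_KT


def version_c(speed_kt):
    return STALL_WARNING_KT - speed_kt > 0


def version_c_buggy(speed_kt):
    # An independent coding slip: off by one knot.
    return speed_kt < STALL_WARNING_KT - 1


def vote(outputs):
    """Return the majority value, or raise if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority")
    return value


def voted_warning(versions, speed_kt):
    return vote([version(speed_kt) for version in versions])


if __name__ == "__main__":
    # The independent slip in one version is outvoted.
    print(voted_warning([version_a, version_b, version_c_buggy], 119.5))
    # The shared specification error is not: all three agree on the wrong answer.
    print(voted_warning([version_a, version_b, version_c], 120))
```

Voting masks the fault that only one version contains, but at exactly 120
knots all three versions agree and no warning is given.
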
   Peter Neumann, a computer scientist at S.R.I. International, a Menlo Park,
Calif., think tank, is a specialist in software engineering who has documented
hundreds of software failure cases in the aerospace and other industries.
Neumann says, "There are very serious risks in reliance on software in
safety-critical applications. A seemingly innocuous addition to the software
could have disastrous effects not discovered in testing. Never trust anyone who
says such failures can never happen."
   The task facing testers is prodigious. "For even small amounts of software,"
says Thomas, "the number of possible paths far exceeds the number which could
realistically be tested. For example, a recent module comprising 100 lines of
assembly code was analyzed and found to contain 38 million possible paths, of
which 500,000 could be followed with valid input data."
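
The scale of the path count is easy to reproduce under a simplifying
assumption.  The sketch below assumes a module made of independent two-way
branches in sequence, with no loops; the structure of the module Thomas
refers to is not given, so this only shows the order of magnitude.

```python
from itertools import product


def enumerate_paths(branches):
    """Count paths through `branches` two-way decisions by listing each one."""
    return sum(1 for _ in product((False, True), repeat=branches))


def branches_needed(paths):
    """Smallest number of independent two-way branches giving >= `paths` paths."""
    return (paths - 1).bit_length()


if __name__ == "__main__":
    print(enumerate_paths(10))          # ten branches in sequence
    print(branches_needed(38_000_000))  # branches needed to reach 38 million
```

Under these assumptions 26 branches already yield more than 38 million
paths, so the figure quoted for a 100-line assembly module is plausible.
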
   Mike Hennell, head of Computational Mathematics at Liverpool University --
an authority on software reliability -- has not examined the A320's software
code. Still, he says: "I wouldn't get into an Airbus A320 or any fly-by-wire
aircraft."
   "We don't have the technology yet to tell if the programs have been
adequately tested. We don't know what 'adequately tested' means. We can't
predict what errors are left after testing, what their frequency is or what
their impact will be. If, after testing over a long period, the program has not
crashed, then it is assumed to be okay. That presupposes that they will have
generated all of the sort of data that will come at it in real life -- and it
is not clear that that will be true."
   Indeed, scientists have been working for 15 years on software
reliability models, writes John Musa of AT&T's Bell Laboratories in the
February issue of IEEE Spectrum. And they are now "moving into practice
and starting to pay off." But they "deal with average rather than
specific behavior, since the random nature of program usage and fault
introduction generates failures at random." In the case of an airline
reservation system, for example, "it is impossible to predict the next
specific input and hence the next specific failure. Average behavior,
however, can be characterized."
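
One well-known model of this kind is Musa's basic execution-time model.
The article does not say which model is meant, so its use here is an
assumption, as are the parameter values.  The model describes only the
expected number of failures and the failure intensity after a given amount
of execution time, not when any particular failure will occur.

```python
import math


def expected_failures(tau, lambda0, nu0):
    """Mean failures experienced by execution time tau.

    lambda0: initial failure intensity (failures per unit execution time)
    nu0:     total failures expected over unlimited execution time
    """
    return nu0 * (1.0 - math.exp(-lambda0 * tau / nu0))


def failure_intensity(tau, lambda0, nu0):
    """Failure intensity at execution time tau; decays as faults are removed."""
    return lambda0 * math.exp(-lambda0 * tau / nu0)


if __name__ == "__main__":
    # Hypothetical program: 10 failures/hour initially, 100 failures in total.
    for hours in (0, 5, 20, 50):
        print(hours, expected_failures(hours, 10.0, 100.0),
              failure_intensity(hours, 10.0, 100.0))
```

The two functions are linked: the intensity equals
lambda0 * (1 - expected_failures / nu0), so it falls as failures are
experienced and the underlying faults are corrected.
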
   The international design standard for airborne software systems (RTCA
DO-178A) was developed by the Washington-based Radio Technical Commission for
Aeronautics. Nancy Leveson, a specialist in software safety research and
currently a visiting professor at MIT, says that DO-178A is "not adequate for
certifying commercial aircraft software. It lacks any mention of formal
verification of safety, as required, for example, by the Department of Defense"
which demands safety and hazard analysis.
   The FAA does, however, oblige developers "to use certain accepted concepts
for design and development," says Mike DeWalt, an aircraft computer software
specialist with the FAA. Although FAA officials do not see all the programming
("obviously there's no way in the world that a review agency could look at that
much code"), they do demand adequate testing and quality evaluation, and even
sample the programmers' work.  "Basically, we take a slice through the whole
system," says DeWalt. That is, pick a function like left aileron control and
"follow it all the way down through testing and configuration management."
   "I don't want to imply that manufacturers and subcontractors will not do
their best," Leveson says. "After all, they have the liability, and I'm sure
they are decent human beings who care about human life. The problem is that
without external review, we are depending on the competence of the employees of
these companies, and I am less sanguine about the general state of software
engineering knowledge and practice in industry than I am about the good
intentions of humans."
   Daryl Pederson, deputy director of the FAA's Aircraft Certification Division
and the man charged with certifying the A320, says of DO-178A, "The document
recognizes that you can't test every situation you encounter." His British
counterpart, Brian Perry, head of Avionics and Electrical Systems at the Civil
Aviation Authority, agrees: "It's true that we are not able to establish to a
fully verifiable level that the A320 software has no errors. It's not
satisfactory, but it's a fact of life."

   Computers in the Sky

Nonetheless, FBW offers the pilot some real gains. In extreme situations
such as suddenly encountering strong windshear, the computers
instantaneously compensate. Gordon Gorbes, chief test pilot for Airbus,
says, "If a pilot has to make violent changes to the aircraft's attitude
in an emergency, then the computer will prevent the pilot pushing it
past design strengths. For example, the computer would prevent the pilot
putting it into a dive that might break off the tail." And FBW saves
money for the plane's owner, by reducing hardware costs, keeping the
aircraft at optimum fuel-saving trim and facilitating the switch from
three- to two-person flight crews.
   Many pilots flying the A320 have been enthusiastic in praising its
handling and flying qualities. But some have complained about software
problems and control irregularities. (The number of such complaints,
according to Airbus' technical director, Bernard Ziegler, is small.) One
problem reported by Air France, in a memo dated July 10, 1988 to Airbus,
noted a software bug in its altimeter which measures the aircraft's
height, a problem which has also been observed with British Airways'
A320s. It is this problem that the pilot of the A320 that crashed at the
small French airport at Mulhouse last June claimed contributed to the
accident.

      [NGL:  And which no one believed at the time.]

   There are various ways to fix a bug or add to a plane's installed
software. Complete boxes containing replacement hardware and software
can be exchanged by Airbus Industries. For carriers like Northwest, with
100 aircraft on order, this option would be expensive. So reprogramming
could take place at a keyboard in the aircraft, conducted by Airbus or
Northwest engineers. With over 640 aircraft on order around the world
using two different makes of engine and a variety of sub-systems, the
problem of "configuration management," as it is termed in the computer
industry, becomes apparent.

    [NGL:  Note that a configuration management problem involving
     a navigation computer was implicated in the Antarctica crash of
     the Air New Zealand plane into Mount Erebus.  Of course, planes
     are not sent back to the factory for all of the hardware design
     changes that occur -- usually the maintenance crew handles
     them.  Is the problem different for software?]

   So does the problem of anticipating a near-infinitude of real-life
contingencies. In 1983 a United Airlines Boeing 767 went into a four-
minute powerless glide after the pilot was compelled to shut down both
engines because of overheating. The National Transportation Safety Board
discovered that the plane's computerized engine-management system had
ordered the engines to run at a relatively slow speed to optimize fuel
efficiency. In the flight's particular atmospheric circumstances,
however, this had allowed ice to build up on some engine surfaces,
reducing the flow of air and causing the engines to work harder and
overheat.
   "The problem is that the designer didn't anticipate all the possible
demands the software would face," says Hennell. "The computer will
always do something. But it will only do the correct thing if it has
been programmed for that situation."

------------------------------

End of RISKS-FORUM Digest 8.49
************************
-------