Safety is defined as "freedom from accidents or losses," whereas reliability is "a stochastic measure of the
availability of services from a system." Both concerns are managed through the careful application of redundancy.
The safety and reliability architecture is concerned with correct functioning in the presence of faults and errors.
Heterogeneous redundancy (also known as diverse redundancy) is used to provide protection from failures and errors.
Reliability is a measure of the up-time or availability of a system; specifically, it is the probability that a
computation will successfully complete before the system fails. It is normally estimated from the mean time between
failures (MTBF).
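To make the link between MTBF and "probability of completing before failure" concrete, the sketch below assumes a constant failure rate (the common exponential model, where the failure rate is 1 / MTBF). That model is an assumption for illustration, not something the text prescribes, and the 10,000-hour MTBF is a hypothetical value.

```python
import math

def reliability(mission_time_h: float, mtbf_h: float) -> float:
    """Probability of running mission_time_h hours without failure, assuming
    a constant failure rate lambda = 1 / MTBF (exponential model)."""
    if mtbf_h <= 0:
        raise ValueError("MTBF must be positive")
    failure_rate = 1.0 / mtbf_h
    return math.exp(-failure_rate * mission_time_h)

if __name__ == "__main__":
    # Hypothetical component with an MTBF of 10,000 hours.
    for hours in (10, 100, 1_000, 10_000):
        print(f"R({hours} h) = {reliability(hours, 10_000):.4f}")
```

Under this model a computation lasting as long as the MTBF completes without failure only about 37% of the time (e^-1), which is why MTBF alone says little until it is related to the mission duration.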
Redundancy is one design approach that increases availability: if one component fails, another takes its place.
Of course, redundancy improves reliability only to the extent that the failures of the redundant components are
independent; a common-mode fault that disables every copy at once defeats it. The reliability of a component does not
depend on what happens after the component fails: whether the system fails safely or not, its reliability is the same.
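The effect of independence can be quantified. The sketch below compares an ideal dual-redundant (1-out-of-2) pair with one that suffers common-cause failures, using the beta-factor model: a fraction `beta` of each channel's failure rate is assumed to strike both channels at once. The beta-factor model, the function names, and the numbers are illustrative assumptions, not taken from the text.

```python
import math

def parallel_reliability(r_channel: float, n: int) -> float:
    """At-least-one-of-n redundancy, assuming channel failures are independent:
    the system fails only if every channel fails."""
    return 1.0 - (1.0 - r_channel) ** n

def dual_channel_beta(failure_rate: float, beta: float, t: float) -> float:
    """1-out-of-2 reliability under the (assumed) beta-factor model: a fraction
    beta of each channel's failure rate is common-cause and disables both."""
    r_indep = math.exp(-(1.0 - beta) * failure_rate * t)  # per-channel independent part
    r_common = math.exp(-beta * failure_rate * t)          # shared part, defeats redundancy
    return r_common * parallel_reliability(r_indep, 2)

if __name__ == "__main__":
    lam, t = 1e-4, 1_000.0  # hypothetical: failures per hour, mission hours
    print(f"single channel:     {math.exp(-lam * t):.4f}")
    for beta in (0.0, 0.1, 1.0):
        print(f"dual, beta = {beta:.1f}: {dual_channel_beta(lam, beta, t):.4f}")
```

With `beta = 0` the pair gets the full benefit of redundancy; with `beta = 1` every failure is common-mode and the pair is no more reliable than a single channel.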
Safety is distinct from reliability. A safe system is one that does not incur unacceptable risk to persons or
equipment. A risk is an event or condition that can occur but is undesirable. Risk is the product of the severity of the
incident and its probability.
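Because risk is the product of severity and probability, a frequent minor hazard can outrank a rare severe one. The sketch below ranks hazards by that product against an acceptance threshold. The severity scale, hazard names, probabilities, and threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Hazard:
    name: str
    severity: float     # hypothetical scale: 1 (negligible) .. 10 (catastrophic)
    probability: float  # hypothetical probability of occurrence per operating hour

    @property
    def risk(self) -> float:
        # Risk is the product of the severity of the incident and its probability.
        return self.severity * self.probability

def unacceptable(hazards: list[Hazard], max_risk: float) -> list[Hazard]:
    """Hazards whose risk exceeds a (hypothetical) acceptance threshold,
    highest risk first."""
    return sorted((h for h in hazards if h.risk > max_risk),
                  key=lambda h: h.risk, reverse=True)

if __name__ == "__main__":
    hazards = [
        Hazard("overpressure", severity=9, probability=1e-6),
        Hazard("sensor drift", severity=3, probability=1e-3),
        Hazard("display flicker", severity=1, probability=1e-2),
    ]
    for h in unacceptable(hazards, max_risk=5e-4):
        print(f"{h.name}: risk = {h.risk:.1e}")
```

Real hazard analyses usually use categorical severity and likelihood classes in a risk matrix rather than raw numbers; the numeric product is used here only because it mirrors the definition above.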
The key to managing both safety and reliability is redundancy. For improving reliability, redundancy allows the system
to continue to work in the presence of faults because other system elements can take up the work of the broken one. For
improving safety, additional elements are needed to monitor the system to ensure that it is operating properly; other
elements may be needed to either shut down the system in a safe way or take over the required functionality.
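One way to combine monitoring with safe shutdown is a controller paired with an independent monitor that reads its own sensor and can veto the controller's output. The sketch below is a minimal, hypothetical illustration of that idea: the class names, the proportional control law, the safe envelope, and the choice of 0.0 as the "safe" output are all assumptions.

```python
from enum import Enum, auto

class Mode(Enum):
    OPERATING = auto()
    SAFE_SHUTDOWN = auto()

class PrimaryController:
    """Hypothetical actuation channel: proportional control toward a setpoint."""
    def __init__(self, setpoint: float, gain: float = 0.5) -> None:
        self.setpoint = setpoint
        self.gain = gain

    def command(self, measured: float) -> float:
        return self.gain * (self.setpoint - measured)

class SafetyMonitor:
    """Hypothetical monitoring channel: checks an independent sensor reading
    against a safe envelope and latches into a safe state on violation."""
    def __init__(self, low: float, high: float) -> None:
        self.low, self.high = low, high
        self.mode = Mode.OPERATING

    def check(self, independent_reading: float) -> Mode:
        if not (self.low <= independent_reading <= self.high):
            self.mode = Mode.SAFE_SHUTDOWN  # latched: no automatic recovery
        return self.mode

def control_step(ctrl: PrimaryController, monitor: SafetyMonitor,
                 primary_reading: float, monitor_reading: float) -> float:
    """One control cycle in which the monitor may veto the controller."""
    if monitor.check(monitor_reading) is Mode.SAFE_SHUTDOWN:
        return 0.0  # assumed safe output: de-energize the actuator
    return ctrl.command(primary_reading)

if __name__ == "__main__":
    ctrl = PrimaryController(setpoint=100.0)
    mon = SafetyMonitor(low=0.0, high=150.0)
    print(control_step(ctrl, mon, 90.0, 91.0))   # 5.0
    print(control_step(ctrl, mon, 90.0, 160.0))  # 0.0, envelope violated
    print(control_step(ctrl, mon, 90.0, 91.0))   # 0.0, shutdown is latched
```

Latching the shutdown is a deliberate choice in this sketch: once the monitor has seen an unsafe condition, it does not trust the system to resume on its own. A design that instead needs continued operation would hand control to a backup element rather than stopping.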
Normally, this view is represented structurally with class or structure diagrams; the interaction of the system
elements to achieve safety and/or reliability goals is shown on sequence diagrams; the behavior of individual classes
or objects is depicted with state machines; the logical relation between faults, conditions, and failure events is
shown as a Fault Tree Analysis (FTA) diagram; the reliability as a function of failure modes is shown as a Failure
Modes and Effects Analysis (FMEA); and the relation between faults, corrective measures, risks, and hazards is shown in
a Hazard Analysis.
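Fault trees are usually drawn graphically, but their AND/OR logic can also be evaluated numerically. The sketch below computes the probability of a top-level failure event from basic fault events, assuming the basic events are statistically independent and that only AND and OR gates are used. The class names, event names, and probabilities are hypothetical.

```python
from dataclasses import dataclass
from math import prod

@dataclass
class Basic:
    name: str
    p: float  # probability of this basic fault event (hypothetical values below)

    def probability(self) -> float:
        return self.p

@dataclass
class And:
    inputs: list  # child events: Basic, And, or Or nodes

    def probability(self) -> float:
        # Output occurs only if every input occurs (independence assumed).
        return prod(i.probability() for i in self.inputs)

@dataclass
class Or:
    inputs: list  # child events: Basic, And, or Or nodes

    def probability(self) -> float:
        # Output occurs if any input occurs: 1 - P(no input occurs).
        return 1.0 - prod(1.0 - i.probability() for i in self.inputs)

if __name__ == "__main__":
    # Hypothetical tree: braking is lost if both redundant pumps fail,
    # or if the shared controller fails.
    pumps = And([Basic("primary pump fails", 1e-3), Basic("backup pump fails", 1e-3)])
    top = Or([pumps, Basic("controller fails", 1e-5)])
    print(f"P(loss of braking) = {top.probability():.3e}")
```

In this example the redundant pumps contribute about 1e-6 while the single shared controller contributes 1e-5, so the non-redundant element dominates the top event; surfacing that kind of imbalance is one of the main uses of an FTA.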