ATTENTION We strongly recommend that the system owner save, for reference, a file copy of the dumpconf listing that shows the complete and correct configuration of all system hardware devices after the system has been successfully brought up for the first time.
After any power interruption, reboot, or devctl -c reconfiguration command, compare the new dumpconf listing against this reference file to determine whether any devices failed to reestablish their links and disappeared from the system's configuration tree.
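The comparison described above can be automated. The following Python sketch is an illustration only, not a supplied tool. It assumes that each device occupies one line of the saved dumpconf listing and that the device name is the first whitespace-delimited field; the actual dumpconf layout may differ, so adjust the parsing to match your listing. The function names and the sample device names are hypothetical.

```python
def device_names(listing_text):
    """Collect device names from the text of a saved dumpconf listing.

    Assumption: one device per line, name in the first field.
    Blank lines are ignored. Header lines are collected too, but they
    appear in both listings and so cancel out in the comparison.
    """
    names = set()
    for line in listing_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def missing_devices(reference_text, current_text):
    """Return devices present in the reference listing but not the new one."""
    return sorted(device_names(reference_text) - device_names(current_text))


# Invented sample listings; real dumpconf output will look different.
reference = "fc0 up\nsd0 up\nsd1 up\n"
current = "fc0 up\nsd0 up\n"
print(missing_devices(reference, current))
```

In practice, read the two texts from the saved reference file and from a file holding the new listing. Any name the function reports is a candidate for a device that failed to reestablish its link.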
The problems described in this chapter involve more than one FC device of the I/O subsystem. Problems specific to a particular device are documented in a separate chapter for each device:
LP7000 FC Host Adapters, Chapter 3
LP6000 FC Host Adapters, Chapter 4
IBM 2109 and SilkWorm 2000 FC Switches, Chapter 5
SilkWorm 1000 FC Switch, Chapter 6
IBM FC Bridge, Chapter 7
An LP7000 FC Host Adapter, cabled to a SilkWorm 1000-family switch that is running V1.6c3 firmware, can intermittently fail to achieve an immediate login to that switch after a site power loss (also known as "lights out"), a switch reboot, or a system devctl -c reconfiguration command. The resulting delay caused by the timing cycle of the retry process can make the system appear to be stalled.
Workaround: A recent change in a default time-out value in a system file resulted in an extra-long retry window for host adapter fabric logins (FLOGIs). The system retries three times before marking the device as failed; therefore, wait at least 3.5 minutes (210 seconds) before attempting corrective action such as reinitializing the device.
ATTENTION Any unused FC Host Adapter in a system must have its optical loopback connector installed. Otherwise, each unterminated adapter causes its own 210-second time-out cycle, and the cycles run one at a time.
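Because the time-out cycles run one at a time, the apparent stall grows with the number of unterminated adapters. The short Python sketch below shows the arithmetic. It assumes the cycles are strictly sequential and do not overlap, as the note above suggests; the function name is hypothetical.

```python
TIMEOUT_CYCLE_SECONDS = 210  # 3.5 minutes per adapter, from the text above


def worst_case_wait(unterminated_adapters):
    """Seconds to wait before assuming the system is really stalled.

    Assumption: each unterminated adapter adds one full 210-second
    cycle and the cycles never overlap.
    """
    if unterminated_adapters < 0:
        raise ValueError("adapter count cannot be negative")
    return unterminated_adapters * TIMEOUT_CYCLE_SECONDS


print(worst_case_wait(2))  # 420 seconds, that is, 7 minutes
```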
The following problems occasionally occur when an EMC® Symmetrix® Storage Subsystem is connected to an FC Switch.
Under certain unknown I/O loads, an EMC Symmetrix subsystem running 5265 firmware can cause sequence time-outs. These errors are discovered by inspecting the DYNIX/ptx log files in the /usr/adm/ktlog directory for sequence time-out messages from EMC disks. As long as the NUMA-Q system is configured for a Level-2 or -3 resource domain, retries and rerouting protect against data loss.
The probable cause is interference from the internal environmental tests that the EMC system is programmed to run at a certain time of day. Check the messages in the log files in /usr/adm/ktlog to see whether the time-outs recur every day or only intermittently over many days; the incidence appears to be load-dependent. Ask the EMC field engineer to check the scheduled times of the internal environmental tests within the EMC system. Check specifically the time scheduled for "Environmental Test 1," the most likely culprit. If the time-out messages in the log files in /usr/adm/ktlog occur within 120 seconds after the internal EMC process completes, this is the most likely cause.
ATTENTION Check first for any discrepancies between the NUMA-Q system clock and the EMC system clock.
Workaround: Ask the EMC field engineer to reschedule those internal tests to a known low-load time of day. An alternative is to upgrade the EMC subsystem to the first general availability release of the 5266 firmware.
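The 120-second correlation check described above, including the clock discrepancy noted in the ATTENTION paragraph, can be sketched in Python as follows. This is an illustration under stated assumptions: the time stamps of the sequence time-out messages have already been extracted from the files in /usr/adm/ktlog (the log format is not parsed here), and the completion time of the internal test comes from the EMC field engineer. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=120)  # window named in the text above


def timeouts_after_test(timeout_times, test_end, clock_offset=timedelta(0)):
    """Return the time-outs logged within 120 seconds after the test ends.

    timeout_times: datetimes of time-out messages, by the NUMA-Q clock.
    test_end:      completion time of the EMC test, by the EMC clock.
    clock_offset:  NUMA-Q clock minus EMC clock (assumed to be known).
    """
    end_on_host = test_end + clock_offset  # convert to NUMA-Q clock time
    return [t for t in timeout_times
            if end_on_host <= t <= end_on_host + WINDOW]


# Invented sample values for illustration only.
test_end = datetime(2000, 1, 1, 2, 0, 0)
logged = [datetime(2000, 1, 1, 2, 0, 45), datetime(2000, 1, 1, 9, 30, 0)]
print(len(timeouts_after_test(logged, test_end)))  # 1
```

If most logged time-outs fall inside the window on several days, the scheduled test is the likely cause and rescheduling it is worth requesting.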
Under certain unknown I/O loads in a Level-2 or -3 resource domain, a disk in a SCSI-ported EMC subsystem can be missed when an FC Bridge is configured back into the system (devctl -c) following a service procedure such as a replacement. Under these loads, an EMC disk responds to the probe with a "SCSI busy" signal, which the operating system does not recognize as a retryable event, and that disk is not included in the I/O configuration for that path. This error is discovered by checking the dumpconf output after the procedure to verify that all devices were found.
ATTENTION As long as the NUMA-Q system is configured for a Level-2 or -3 resource domain, retries and rerouting protect against data loss.
Workaround: When a previously configured disk is found to be missing from a scsibus connected to the FC Bridge that was reconfigured, issue a devctl -c scsibusx command (where x identifies the affected bus) to recover the attached disk.
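When several disks are missing, it helps to work out which buses need to be reconfigured. The Python sketch below only builds the command strings; it does not run them. It assumes you keep a record of which scsibus each disk belongs to, taken from the reference dumpconf listing. That record, the function name, and the sample device names are hypothetical.

```python
def recovery_commands(parent_bus_of, disks_found):
    """Build one "devctl -c scsibusx" command per bus with a missing disk.

    parent_bus_of: dict mapping each expected disk name to its scsibus
                   name (assumed to come from the reference listing).
    disks_found:   set of disk names present in the new listing.
    """
    buses = {bus for disk, bus in parent_bus_of.items()
             if disk not in disks_found}
    return ["devctl -c " + bus for bus in sorted(buses)]


# Invented sample names for illustration only.
expected = {"sd4": "scsibus2", "sd5": "scsibus2", "sd9": "scsibus3"}
print(recovery_commands(expected, {"sd5", "sd9"}))
```

Each bus appears once even if it has lost more than one disk, so the command is issued only once per bus.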
In NUMA-Q clusters, if a hardware-error interrupt (B_INT) occurs on one of the nodes during certain unknown load conditions, an EMC I/O may fail on the other nodes of the cluster, causing soft and hard I/O errors on those nodes.
Workaround: There is no known workaround at this time. The probability of a B_INT on a NUMA-Q system is very low.