If you do not have EES installed on your NUMA-Q host, the sesmond daemon will write messages only to ktlog. If you do have EES, sesmond will also write the log messages to /var/ees/eeslog. We recommend that you install EES and consult the messages in /var/ees/eeslog; this log not only contains more information but also structures that information more consistently, making the log entries easier to filter when you are searching for specific subsets of the information.
The sesmond daemon sends an initial message to the log for each DAE unit it is monitoring:
Started monitoring DAE on fabricm and fabricn with address x
In this message, fabricm and fabricn are strings identifying the arbitrated loops to which the DAE unit is connected, and x is the DAE enclosure address (a value from 0 through 11). The DAE enclosure address is set by a switch on the front panel of the DAE unit.
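The location phrase in this message, which also appears in the fault messages later in this section, can be extracted mechanically. The following Python sketch is illustrative only: the function name parse_dae_location and the regular expression are assumptions based solely on the sample messages in this document, not part of ptx/SESMON.

```python
import re

# Hypothetical pattern, derived only from sample messages in this document:
#   "... DAE on fabric2 and fabric3 with address 1"   (dual-ported)
#   "... DAE on fabric4 with address 1"               (single-ported)
_DAE_LOCATION = re.compile(
    r"DAE on (?P<first>\S+)(?: and (?P<second>\S+))? with address (?P<addr>\d+)"
)

def parse_dae_location(message):
    """Return (loops, enclosure_address), or None if no valid location is found."""
    match = _DAE_LOCATION.search(message)
    if match is None:
        return None
    loops = [match.group("first")]
    if match.group("second"):
        loops.append(match.group("second"))
    address = int(match.group("addr"))
    # The text above gives 0-11 as the possible enclosure addresses.
    if not 0 <= address <= 11:
        return None
    return loops, address
```

A single-ported message yields a one-element loop list, so callers can tell the two connection types apart by the list length.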
ATTENTION To view the messages in the file /var/ees/eeslog, you must use the ees_view utility, which converts the file contents from a compressed binary format into readable text.
From the time sesmond starts until the DAE unit is shut down, sesmond writes messages concerning faults on the DAE enclosure and its FRUs to the ptx/EES log /var/ees/eeslog. Messages are sent both when faults are discovered and when they are corrected.
All EES messages present their information in a sequence of standard fields, as follows:
binary timestamp product_name product_version
host_name provider_name area_name file_name#lineno
EES_event_id sequence number severity description timestamp
The following example shows how these fields might look when fully populated:
8904950989:ptx/SESMON:1.2.0A:elm2a89:sesmond::clariion_dae.c#150: e=23:s=482:major:The recv diag configuration page for disk sd12 is not that of a DAE, but the disk inquiry is for a DAE disk : Fri Jul 07 16:16:29 2000
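As a rough illustration of how the standard fields line up with the example entry, the following Python sketch splits one text-form entry into named fields. It is not an EES tool: the function name parse_ees_entry, the dictionary keys, and the assumption that the description and the trailing timestamp are separated by " : " are all inferred from the single example above.

```python
# The example entry from this section, used here as sample input.
SAMPLE = (
    "8904950989:ptx/SESMON:1.2.0A:elm2a89:sesmond::clariion_dae.c#150: "
    "e=23:s=482:major:The recv diag configuration page for disk sd12 is not "
    "that of a DAE, but the disk inquiry is for a DAE disk : "
    "Fri Jul 07 16:16:29 2000"
)

def parse_ees_entry(line):
    """Split one text-form EES entry into a dict of named fields."""
    # The first ten fields are colon-delimited; the remainder holds the
    # description and the plain-text timestamp, which itself contains colons.
    parts = line.split(":", 10)
    if len(parts) != 11:
        raise ValueError("not enough fields for an EES entry")
    (binary_ts, product, version, host, provider,
     area, source, event, sequence, severity, rest) = parts
    # Assumption: the description and timestamp are separated by " : ".
    description, sep, timestamp = rest.rpartition(" : ")
    if not sep:
        description, timestamp = rest, ""
    file_name, _, lineno = source.partition("#")
    return {
        "binary_timestamp": binary_ts.strip(),
        "product_name": product.strip(),
        "product_version": version.strip(),
        "host_name": host.strip(),
        "provider_name": provider.strip(),
        "area_name": area.strip(),
        "file_name": file_name.strip(),
        "lineno": int(lineno) if lineno.strip() else None,
        "event_id": int(event.split("=", 1)[1]),
        "sequence_number": int(sequence.split("=", 1)[1]),
        "severity": severity.strip(),
        "description": description.strip(),
        "timestamp": timestamp.strip(),
    }
```

In the example entry the area_name field is empty, which is why two colons appear together after the provider name.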
For more details concerning EES logging and filtering, see the EES User's Guide.
Messages logged to the ktlog file will not contain as much information as those written to /var/ees/eeslog. Also, the order of the fields will differ between the two types of logs. The following is a sample of a typical sesmond entry in ktlog:
362b72e9 10:12:09 tolog/warn p6763 ptx/SESMON:25:v1.2.0: sesmond::ees_major:disk fault for disk sd16 in slot 6 in DAE on fabric0 and fabric1 with address 0
Note that the timestamp in plain text immediately follows the binary timestamp. In addition, the severity level is written as "tolog/warn" instead of "warning," and comes earlier in the message than it does in the EES message format. Most importantly, the name of the host computer to which the message was sent, the name of the source file logging the event, and the line number of the message within that source file are not captured.
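The different field order can be seen by picking the sample ktlog entry apart. The Python sketch below is based only on that one sample; the pattern, the function name parse_ktlog_entry, and the field names are assumptions. In particular, reading "p6763" as a process identifier and "25" as the EES event ID is an inference from the sample, not something this document states.

```python
import re

# Hypothetical pattern inferred from the single ktlog sample in this section.
_KTLOG_ENTRY = re.compile(
    r"^(?P<binary_timestamp>[0-9a-fA-F]+) "
    r"(?P<time>\d\d:\d\d:\d\d) "
    r"(?P<level>\S+) "
    r"(?P<process>\S+) "
    r"(?P<product>[^:]+):(?P<event_id>\d+):(?P<version>[^:]+):\s*"
    r"(?P<provider>[^:]*):(?P<area>[^:]*):(?P<severity>[^:]+):"
    r"(?P<text>.*)$"
)

def parse_ktlog_entry(line):
    """Return a dict of fields for a sesmond ktlog line, or None if it does not match."""
    match = _KTLOG_ENTRY.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["event_id"] = int(fields["event_id"])
    fields["text"] = fields["text"].strip()
    return fields
```

Note that the resulting dictionary has no host name, source file, or line number, which matches the omissions described above.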
The textual portions of the messages logged by sesmond do not differ significantly between ktlog and /var/ees/eeslog. This section presents examples of the messages that may appear, interprets them, and suggests appropriate corrective action. In these examples, the target DAE unit is assumed to be dual-ported.
Messages are divided into the categories "critical," "major," "minor," "debug," and "info." Within those categories, the messages are sorted according to their EES event ID numbers, in ascending order.
The error messages that you will encounter in the normal course of operation are those that report on the status of hardware devices within the DAE unit. Refer to the EES event ID of the message listed here, to find interpretive information and corrective action:
(EES ID 25) major: disk fault for disk in slot 0 in DAE on fabric2 and fabric3 with address 1
(EES ID 27) major:power failure for power supply A in DAE on fabric2 and fabric3 with address 1
ATTENTION This is a critical event if the failing power supply is the only one installed in the system. By default, the DAE comes with redundant power supplies.
(EES ID 33) major: LCC failure for LCC A in DAE on fabric2 and fabric3 with address 1
(EES ID 35) major: primary MIA failure for LCC B in DAE on fabric2 and fabric3 with address 1
Critical messages concern problems that render ptx/SESMON or the DAE unit unusable; for example, a problem that causes the sesmond daemon to exit inadvertently is critical. The following critical messages may be logged by sesmond.
This is the first of three critical error messages that may be issued when the parent sesmond daemon detects that the child daemon has exited unexpectedly. The cause is a hardware or software bug.
Corrective action: Check logs for any previous errors as possible causes of the exit. Verify that the daemon has restarted, and contact NUMA-Q Customer Support.
This message is output when the (child) sesmond daemon detects an error and exits with a non-zero value.
Corrective action: Check logs for any previous errors, and verify that the daemon has restarted and is running. Contact NUMA-Q Customer Support.
The child sesmond daemon exited with a non-zero value, meaning that it encountered an unexpected error such as a memory-allocation failure.
Corrective action: Check logs for any previous errors as possible causes of the exit. Verify that the daemon has restarted, and contact NUMA-Q Customer Support.
The child sesmond daemon was restarted more than 5 times within the last (20 * poll_period) seconds. The default time period is 20*30 seconds = 600 seconds or 10 minutes.
Corrective action: Check logs for any previous errors and try to fix or work around the errors found. Otherwise, the DAE cannot be monitored; contact NUMA-Q Customer Support.
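The restart limit described above is simple arithmetic, sketched here in Python. This is not sesmond's implementation: the function names are hypothetical, and treating "within the last (20 * poll_period) seconds" as an inclusive sliding window is an assumption.

```python
def restart_window_seconds(poll_period=30):
    """Length of the restart-counting window: 20 * poll_period seconds."""
    return 20 * poll_period

def too_many_restarts(restart_times, now, poll_period=30, limit=5):
    """True if more than `limit` restarts fall inside the window ending at `now`.

    restart_times and now are in seconds on the same clock.
    Assumption: the window boundary itself counts as inside the window.
    """
    window = restart_window_seconds(poll_period)
    recent = [t for t in restart_times if 0 <= now - t <= window]
    return len(recent) > limit
```

With the default poll period of 30 seconds the window is 600 seconds, so a sixth restart within 10 minutes would trip the limit.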
EES event 47 displays a variety of text messages and can be either a critical or major event, depending upon the context. When this event is a critical error, it is accompanied by EES ID 24, critical: exiting. The first portion of the text string is the text of the failure, including the system or library call that failed; the second portion is the errno text. The following text variants are possible:
sigset failed, errno
fork failed, errno
gettimeofday failed, errno
wait failed, errno
MPTIOCINQ failed, errno
MPTIOCRECDIAG failed, errno
FIOCADDR ioctl failed, errno
cfg_sys call failed, errno
setitimer failed, errno
could not open disk_name
could not open fabric_name
Corrective action: Determine what this error means in the context of the failed system or library call. Check for errors with more information following these events, such as the disk name to which the failure applies.
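When scanning a log for these events, it can help to separate the failed call from the errno text. The Python sketch below handles the two shapes listed above ("... failed, errno" and "could not open ..."); the function name parse_event_47 and the returned keys are assumptions made for illustration.

```python
def parse_event_47(text):
    """Split an EES ID 47 message text into the failing item and errno text.

    Returns a dict, or None if the text matches neither listed shape.
    """
    text = text.strip()
    prefix = "could not open "
    if text.startswith(prefix):
        # Variant naming a disk or fabric that could not be opened.
        return {"kind": "open", "target": text[len(prefix):].strip(),
                "errno_text": None}
    head, sep, errno_text = text.partition(" failed, ")
    if sep:
        # Variant naming the system or library call, followed by errno text.
        return {"kind": "call", "target": head.strip(),
                "errno_text": errno_text.strip()}
    return None
```

For example, a text beginning "FIOCADDR ioctl failed," would produce the target "FIOCADDR ioctl", which can then be looked up in the context of that call.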
EES defines major errors as those that indicate a problem with noticeable impact on the program under report (the "provider"). A range of problems can be grouped under this condition: certain functionality within the provider is not available; the provider is performing poorly; or the provider is malfunctioning in some other way. For ease of reference, the major errors listed here are sorted by their EES ID number, in ascending order. Almost all of the error messages for ptx/SESMON V1.2.0 have variants for both single-port and dual-port connection of the DAE enclosure to the NUMA-Q host. The message examples shown below, unless otherwise noted, pertain to dual-port connection.
This message is output whenever an error occurs that may have prevented complete identification of a DAE.
Corrective action: Check logs for any previous errors. It is possible that the DAE is not being monitored; check the log to see which DAE units are monitored. Then contact NUMA-Q Customer Support.
Corrective action: Check logs for any previous errors. Then contact NUMA-Q Customer Support.
This is an internal error.
Corrective action: Try restarting sesmond or rerunning sesmonid. Contact NUMA-Q Customer Support.
ptx/SESMON received an unexpected value for the Fibre Channel address of a disk in the DAE.
Corrective action: Check logs for any previous errors. Then contact NUMA-Q Customer Support.
The status code returned for a device is not on the list of predefined possible values.
Corrective action: Check logs for any previous errors. Then contact NUMA-Q Customer Support.
An error occurred while sesmond was trying to determine what DAE units were configured on the system.
Corrective action: Check logs for any previous errors. Then contact NUMA-Q Customer Support.
The sesmond daemon no longer has access to any of the disk drives in slots 0-3; access to at least one of these devices is needed for SES access to the DAE. The loss of access may have been caused by a failed connection to the DAE, or by the failure of the disks in slots 0-3.
Corrective action: Verify that at least one of the four disk drives in slots 0-3 is capable of sending and receiving I/O. If not, replace the failed drive or correct the failed connection.
This error is output the second time that the polling interval is exceeded.
Corrective action: Check for hung configuration (devctl) commands, and check to see if sesmond is using excessive CPU time.
The signal-based monitor timer's poll period expired while sesmond was still monitoring or trying to monitor. This event can easily occur when a device is being configured or deconfigured from the system.
Corrective action: Shut down sesmond while deconfiguring, in order to avoid interactions with devctl, and restart sesmond after device configuration or deconfiguration is complete.
The cfg call failed, probably because it could not allocate memory.
Corrective action: Check system memory resources.
If the system is unable to communicate with a DAE disk in slots 0-3 that is visible to the operating system, this message will be output. Two of these disks normally have access to enclosure status information for each LCC. If just one of them fails, ptx/SESMON will use the other disk to access the enclosure status information. In that case, this message will be seen only once. If both disks with access to a given LCC fail (both disks in slots 0 and 2, or both disks in slots 1 and 3), this message will be seen every poll period until the failed disks are recovered or taken offline.
Corrective action: Check the status of the disk and its connections. If necessary, replace the disk drive, which is replaceable with the DAE unit online. See the NUMA-Q Installation Guide for CLARiiON Disk Arrays.
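The difference between a single occurrence of this message and one that repeats every poll period comes down to whether both disks of a slot pair have failed. The Python sketch below models only that rule as stated above; the names are hypothetical, and the document does not say which pair serves which LCC, so the sketch does not either.

```python
# Slot pairs that share access to one LCC's enclosure status, per the text:
# slots 0 and 2 form one pair, slots 1 and 3 the other.
LCC_ACCESS_PAIRS = ((0, 2), (1, 3))

def pairs_without_access(failed_slots):
    """Return the slot pairs in which both disks have failed."""
    failed = set(failed_slots)
    return [pair for pair in LCC_ACCESS_PAIRS if failed.issuperset(pair)]

def message_repeats_every_poll(failed_slots):
    """True if at least one LCC has lost both of its access disks."""
    return bool(pairs_without_access(failed_slots))
```

Losing slots 0 and 1 together, for instance, leaves one working disk in each pair, so under this rule the message would not repeat.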
A malloc call failed.
Corrective action: Check system memory resources.
There are disks in a DAE unit configured into the system, but no disks in slots 0-3 are configured, so the DAE cannot be monitored.
Corrective action: Check logs for previous errors, and check that the proper disks were configured.
This is an internal error.
Corrective action: Check logs for previous errors. Contact NUMA-Q Customer Support.
This is an internal error.
Corrective action: Try restarting the sesmond daemon or rerunning the sesmonid command. Contact NUMA-Q Customer Support.
Corrective action: Check logs for any previous errors. Contact NUMA-Q Customer Support.
The probable cause is that hardware that supports UPS is paired with software that does not. Another possible cause is a hardware or software bug.
Corrective action: Contact NUMA-Q Customer Support.
This message probably results either from a bug or from an unsupported configuration, such as having connected to the NUMA-Q host a disk that has DAE-specific firmware but is not in a DAE unit.
Corrective action: Contact NUMA-Q Customer Support.
The disk drive in slot 0 in the DAE unit with the enclosure address 1, dual-connected to Fibre Channel arbitrated loops fabric3 and fabric4, has failed.
Corrective action: Replace the disk drive, which is replaceable with the DAE unit online. See the section entitled "Replacing or Adding a Disk Module in the DAE" in the NUMA-Q® Installation Guide for CLARiiON® Disk Arrays.
Power supply A in the DAE unit with the enclosure address 1, dual-ported to Fibre Channel arbitrated loops fabric3 and fabric4, has failed.
Corrective action: Replace power supply A. You can do this with the DAE unit online, provided that power supply B is functioning properly. See the section entitled "Replacing a Power Supply Module in the DAE" in the NUMA-Q® Installation Guide for CLARiiON® Disk Arrays.
Power supply A in the DAE unit with the enclosure address 1, dual-ported to Fibre Channel arbitrated loops fabric3 and fabric4, has exceeded its rated maximum operating temperature and has therefore shut down. The DAE can continue to operate if the other power supply is functioning properly.
Corrective action: Replace the faulty power supply at your earliest convenience. See the NUMA-Q® Installation Guide for CLARiiON® Disk Arrays.
One of the three fans in the fan pack has failed. The DAE unit can continue operating; the other two fans will speed up in order to compensate for the fault. From this point on, the DAE unit no longer has redundant cooling; if one of these remaining fans fails, power to all disk drives in the DAE unit will be shut down after two minutes.
Corrective action: Replace the fan pack at your earliest convenience. See the section entitled "Replacing the Drive Fan Pack in the DAE" in the NUMA-Q® Installation Guide for CLARiiON® Disk Arrays.
At least two of the three fans in the fan pack have failed. Power to all disk drives in the DAE unit will be shut down two minutes after this event occurs.
Corrective action: Replace the fan pack. See the section entitled "Replacing the Drive Fan Pack in the DAE" in the NUMA-Q® Installation Guide for CLARiiON® Disk Arrays.
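The two fan-pack cases above differ only in how many of the three fans have failed. As a compact restatement, the Python sketch below maps a failure count to the consequence described in the text; the function name and the state labels are hypothetical.

```python
def fan_pack_status(failed_fans):
    """Map the number of failed fans (0-3) to (state, seconds_until_drive_power_off).

    The second element is None unless a drive power shutdown is pending.
    """
    if not 0 <= failed_fans <= 3:
        raise ValueError("the fan pack has three fans")
    if failed_fans == 0:
        return ("ok", None)
    if failed_fans == 1:
        # Remaining fans speed up; cooling is no longer redundant.
        return ("no_redundancy", None)
    # Two or more failures: drive power is shut down after two minutes.
    return ("drive_power_off_pending", 120)
```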
The DAE unit with enclosure address 1, dual-connected to the Fibre Channel loops named fabric3 and fabric4, has experienced a failure of LCC A. If this DAE unit had only one LCC connected to the host over the loop named fabric4 and that LCC failed, or a previous secondary connection to the DAE failed, you would see only the message, major:could not get recv diag status for DAE on fabric4 with address 1 (EES ID 16), up to four times.
Corrective action: Replace the faulty LCC. See the section entitled "Replacing a Link Control Card in the DAE" in the NUMA-Q® Installation Guide for CLARiiON® Disk Arrays.
The Media Interface Adapter (MIA) connected to the primary (PRI) port on LCC B of the DAE unit with enclosure address 1, on Fibre Channel fabric 4, has failed. If this DAE unit had only one LCC connected to the host over the loop named fabric4 and that LCC failed, or a previous secondary connection to the DAE failed, you would see only the message, major:could not get recv diag status for DAE on fabric4 with address 1 (EES ID 16), up to four times.
Corrective action: Replace the MIA, as described in the first three procedural steps in the section entitled "Replacing a Link Control Card in the DAE" in the NUMA-Q® Installation Guide for CLARiiON® Disk Arrays, where the figure entitled "Optical Cable Connection to the Link Control Card (LCC)" illustrates the MIA's connection to the LCC and to the fiber optic cable from the NUMA-Q host.
An error occurred while getting status (monitor) information.
Corrective action: Check logs for any previous errors; then contact NUMA-Q Customer Support.
The parent sesmond daemon detected that the child daemon exited because it received a signal. If the child daemon was not killed by a command entered manually, the cause is probably a bug.
Corrective action: Check logs for any previous errors as possible causes of the exit. Verify that the daemon has restarted, and contact NUMA-Q Customer Support.
The listed disk drive probably has errors. This message will be output every poll period until the drive access is corrected or the drive is deconfigured from the kernel.
Corrective action: Check that the drive is accessible. To prevent excessive logging of error messages, use the sesmond option -x to exclude the device. Use the full pathname, /usr/bin/sesmond, when manually starting sesmond; otherwise, the daemon cannot be killed by the shutdown script, which recognizes the process only by the name /usr/bin/sesmond.
Events with EES ID 47 are classified as major (rather than critical) errors when not accompanied by an inadvertent exit of the sesmond child daemon. A variety of messages may be displayed under this event ID; see (EES ID 47) critical.
EES classifies as minor events those errors that have no noticeable impact on the functionality of the program under report (the "provider"), such as unrecognized debug parameters that are being ignored. For ease of reference, the minor errors listed here are sorted by their EES ID number, in ascending order.
Also classified as minor EES events are those messages that announce the correction of errors. The error-corrected messages are described in Section 2.9, "sesmond Error-Correction Entries in /var/ees/eeslog."
All devices in a DAE unit have been deconfigured, and so no monitoring can occur for the unit.
This message notifies the user that the child sesmond daemon, after exiting unexpectedly, has been restarted.
The command line option -d, -X, or -x was used, and a device was specified that does not exist in the configuration tree as a fabric device (all DAE disks are on a fabric). If sd100 is the only device selected in the command option, the daemon will continue running, monitoring all devices. If other devices that are in the configuration tree are selected, the sesmond daemon will monitor them.
Messages are logged with severity ees_debug when sesmond is run with verbose/debug mode enabled; that is, when sesmond is started with -V or when signal 17 is sent to the sesmond process. Signal 17 turns on the verbose/debug mode, and signal 16 turns it off. These messages are useful when trying to debug problems with sesmond.
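The signal-based toggle can be scripted. The Python sketch below is a minimal illustration under stated assumptions: the signal numbers 17 and 16 come from the text above and apply to the host described here (the same numbers may mean something else on other operating systems), the function names are hypothetical, and the caller must supply the process ID of the sesmond process, which this document does not explain how to find.

```python
import os

# Signal numbers taken from the text above; not portable to other systems.
VERBOSE_ON_SIGNAL = 17
VERBOSE_OFF_SIGNAL = 16

def verbose_signal(enable):
    """Return the signal number that turns verbose/debug mode on or off."""
    return VERBOSE_ON_SIGNAL if enable else VERBOSE_OFF_SIGNAL

def set_sesmond_verbose(pid, enable):
    """Send the toggle signal to the process with the given pid.

    Assumption: pid identifies the sesmond process that should change mode.
    """
    os.kill(pid, verbose_signal(enable))
```

Calling set_sesmond_verbose with enable set to False afterward returns the daemon to normal logging, which keeps the log from filling with ees_debug entries once debugging is finished.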
A message similar to this is output every time sesmond begins to monitor a DAE unit.
When faults involving FRUs within the DAE unit are corrected, sesmond notes the correction at the next polling interval and logs the appropriate error-correction entries.
ATTENTION Error-correction messages can be generated by transient errors that disappear without service intervention. Repeated instances of such events for a device may be early symptoms of total device failure; contact NUMA-Q Customer Support to determine whether preemptive servicing is warranted.
The following list describes the error-correction entry types. EES classifies these correction reports as "minor" events.
After monitoring had exceeded a previous polling period, the next monitoring period did not exceed the polling period.
A previous configuration check had failed, but the next has succeeded.
The disk drive in question either failed and was replaced, or suffered a transient fault that was corrected without user intervention.
The power supply in question either failed and was replaced, or suffered a transient fault that was corrected without user intervention.
The power supply in question shut down because it exceeded its maximum operating temperature; then, either the power supply was replaced or the overtemperature was corrected in some other way, such as lowering the ambient temperature near the DAE unit.
The fan pack experienced a single- or multiple-fan failure and was replaced, or suffered a transient fault that was corrected without user intervention.
The LCC in question either failed and was replaced, or suffered a transient fault that was corrected without user intervention.
The MIA on the primary LCC port either failed and was replaced, or suffered a transient fault that was corrected without user intervention.
The MIA on the expansion LCC port either failed and was replaced, or suffered a transient fault that was corrected without user intervention.