Chapter 2
sesmond Log Entries

If you do not have ptx/EES installed on your NUMA-Q host, the sesmond daemon will write messages only to ktlog. If you do have ptx/EES, sesmond will also write the log messages to /var/ees/eeslog. We recommend that you install ptx/EES and consult the messages in /var/ees/eeslog; this log not only contains more information but also structures that information more consistently, making the log entries easier to filter when you are searching for specific subsets of the information.


2.1 Format of Messages in the /var/ees/eeslog File

The sesmond daemon sends an initial message to the log for each DAE unit it is monitoring:

Started monitoring DAE on fabric-n with address x

In this message, fabric-n is the name of the arbitrated loop on which the DAE unit is located, and x is the DAE enclosure address (a value from 0 through 11). The DAE enclosure address is set by a switch on the front panel of the DAE unit.
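The start-up message can be picked apart with a short pattern match. The following Python sketch is illustrative only: the function name parse_started and the regular expression are assumptions, not part of ptx/SESMON, and the text is assumed to have already been converted to readable form.

```python
import re

# Pattern for the start-up message described above (hypothetical helper).
STARTED = re.compile(r"Started monitoring DAE on (fabric\d+)\s+with address (\d+)")

def parse_started(message):
    """Return (fabric_name, enclosure_address), or None if the text does not match."""
    match = STARTED.search(message)
    if match is None:
        return None
    fabric, address = match.group(1), int(match.group(2))
    # The enclosure address switch allows values 0 through 11.
    if not 0 <= address <= 11:
        raise ValueError("enclosure address out of range: %d" % address)
    return fabric, address

print(parse_started("Started monitoring DAE on fabric0 with address 0"))
```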


ATTENTION

To view the messages in the file /var/ees/eeslog, you must use the ees_view utility, which converts the file contents from a compressed binary format into readable text.


From the time sesmond starts until the DAE unit is shut down, sesmond writes messages concerning faults on the DAE enclosure and its FRUs to the ptx/EES log /var/ees/eeslog. Messages are sent both when faults are discovered and when they are corrected.

All EES messages present their information in a sequence of standard fields, as follows:

binary timestamp:product_name:product_version:
host_name:provider_name:area_name:file_name#lineno:
EES_event_id:sequence number:severity:description:timestamp

The following example shows how these fields might look when fully populated:

8904950989:ptx/SESMON:1.0.1A:elm2a89:sesmond::clariion_dae.c#150:
e=23:s=482:major:The recv diag configuration page for disk sd12
 is not that of a DAE, but the disk inquiry is for a DAE disk :
Fri Sep  3 16:16:29 1999

For more details concerning ptx/EES logging and filtering, see the EES User's Guide.
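As an illustration of the field sequence described in this section, the following Python sketch splits one readable EES entry into named fields. It is a minimal sketch under several assumptions: the wrapped lines of the entry have been joined into one string, the trailing timestamp has the same form as in the example above, and the function name and dictionary keys are hypothetical rather than part of ptx/EES.

```python
import re

# Leading colon-separated fields, in the order shown in the example entry.
FIELDS = ("binary_timestamp", "product_name", "product_version", "host_name",
          "provider_name", "area_name", "source", "event_id", "sequence", "severity")

# Trailing timestamp such as "Fri Sep  3 16:16:29 1999" (contains colons itself).
TRAILING_TIME = re.compile(
    r":\s*([A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d \d{4})\s*$")

def parse_ees_entry(text):
    """Split one readable EES entry into a dictionary of fields."""
    match = TRAILING_TIME.search(text)
    if match is None:
        raise ValueError("no trailing timestamp found")
    parts = text[:match.start()].split(":", len(FIELDS))
    if len(parts) != len(FIELDS) + 1:
        raise ValueError("unexpected number of fields")
    record = dict(zip(FIELDS, parts))
    record["description"] = parts[-1].strip()
    record["timestamp"] = match.group(1)
    # "file_name#lineno" becomes two separate values.
    file_name, _, lineno = record.pop("source").partition("#")
    record["file_name"] = file_name
    record["lineno"] = int(lineno) if lineno else None
    # "e=23" and "s=482" become integers.
    record["event_id"] = int(record["event_id"].partition("=")[2])
    record["sequence"] = int(record["sequence"].partition("=")[2])
    return record

SAMPLE = ("8904950989:ptx/SESMON:1.0.1A:elm2a89:sesmond::clariion_dae.c#150:"
          "e=23:s=482:major:The recv diag configuration page for disk sd12"
          " is not that of a DAE, but the disk inquiry is for a DAE disk :"
          "Fri Sep  3 16:16:29 1999")

record = parse_ees_entry(SAMPLE)
print(record["event_id"], record["severity"], record["file_name"], record["lineno"])
```

Once entries are in this form, selecting a subset (for example, all entries with a given event ID or severity) is a simple comparison on the corresponding key.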


2.2 Format of Messages in the ktlog File

Messages logged to the ktlog file will not contain as much information as those written to /var/ees/eeslog. Also, the order of the fields will differ between the two types of logs. The following is a sample of a typical sesmond entry in ktlog:

362b72e9 10:12:09 tolog/warn p6763 ptx/SESMON:25:v1.0.1:
sesmond::ees_major:disk fault for disk sd16 in slot 6 
in DAE on fabric0 with address 0

Note that the timestamp in plain text immediately follows the binary timestamp. In addition, the severity level is written as "tolog/warn" instead of "warning," and comes earlier in the message than it does in the EES message format. Most importantly, the name of the host computer to which the message was sent, the name of the source file logging the event, and the line number of the message within that source file are not captured.
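The ktlog layout can be split in a similar way. The Python sketch below is based only on the single sample entry shown above, so the field names are assumptions: in particular, treating "p6763" as a process tag and "25" as the EES event ID is a guess from the sample, not a documented format.

```python
def parse_ktlog_entry(text):
    """Split one sesmond ktlog entry into a dictionary (field names are assumed)."""
    flat = " ".join(text.split())          # rejoin wrapped lines
    binary_ts, clock, level, process, rest = flat.split(" ", 4)
    parts = [part.strip() for part in rest.split(":", 6)]
    if len(parts) != 7:
        raise ValueError("unexpected number of fields")
    product, event_id, version, provider, area, severity, description = parts
    return {
        "binary_timestamp": binary_ts,
        "time": clock,
        "level": level,            # for example "tolog/warn"
        "process": process,        # assumed to identify the logging process
        "product_name": product,
        "event_id": int(event_id), # assumed to be the EES event ID
        "product_version": version,
        "provider_name": provider,
        "area_name": area,
        "severity": severity,
        "description": description,
    }

KTLOG_SAMPLE = """362b72e9 10:12:09 tolog/warn p6763 ptx/SESMON:25:v1.0.1:
sesmond::ees_major:disk fault for disk sd16 in slot 6
in DAE on fabric0 with address 0"""

entry = parse_ktlog_entry(KTLOG_SAMPLE)
print(entry["event_id"], entry["severity"], entry["description"])
```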


2.3 Interpreting Error Messages Logged by sesmond

The textual portions of the messages logged by sesmond do not differ significantly between ktlog and /var/ees/eeslog. This section presents the messages that may appear, interprets them, and suggests appropriate corrective action.

Messages are divided into the categories "critical," "major," "minor," "debug," and "info." Within those categories, the messages are sorted according to their EES event ID numbers, in ascending order.

The error messages that you will encounter in the normal course of operation are those that report on the status of hardware devices within the DAE unit. To find interpretive information and corrective action for a message, look it up by its EES event ID in the sections that follow.


2.4 Critical Error Messages

Critical messages concern problems that render ptx/SESMON unusable; for example, a problem that causes the sesmond daemon to exit inadvertently is critical. The following critical messages may be logged by sesmond.


(EES ID 43) critical: sesmond unexpected error

This is the first of three critical error messages that may be issued when the parent sesmond daemon detects that the child daemon has exited unexpectedly. The cause is a hardware or software bug.

Corrective action: Check logs for any previous errors that are possible causes of the exit. Verify that the daemon has restarted, and contact your customer-support provider.


(EES ID 24) critical: exiting

This message is output when the (child) sesmond daemon detects an error and exits with a non-zero value.

Corrective action: Check logs for any previous errors, and verify that the daemon has restarted and is running. Contact your customer-support provider.


(EES ID 43) critical: sesmond exited with %u

The child sesmond daemon exited with a non-zero value, meaning that it encountered an unexpected error such as a memory-allocation failure.

Corrective action: Check logs for any previous errors as possible causes of the exit. Verify that the daemon has restarted, and contact your customer-support provider.


(EES ID 44) critical: too many sesmond failures, not respawning

The child sesmond daemon was restarted more than 5 times within the last (20 * poll_period) seconds. The default time period is 20*30 seconds = 600 seconds or 10 minutes.
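The arithmetic behind this limit can be written out as follows. This Python sketch only restates the rule given above (more than 5 restarts within 20 polling periods); the function names are hypothetical, and the way sesmond itself counts restarts is not documented here.

```python
MAX_RESTARTS = 5      # restarts tolerated inside the window
WINDOW_POLLS = 20     # window length, in polling periods

def respawn_window_seconds(poll_period=30):
    """Length of the restart-counting window, in seconds."""
    return WINDOW_POLLS * poll_period

def too_many_restarts(restart_times, now, poll_period=30):
    """True if more than MAX_RESTARTS restarts fall inside the window ending at 'now'."""
    window = respawn_window_seconds(poll_period)
    recent = [t for t in restart_times if now - t <= window]
    return len(recent) > MAX_RESTARTS

print(respawn_window_seconds())   # 20 * 30 = 600 seconds with the default poll period
print(too_many_restarts([0, 100, 200, 300, 400, 500], now=600))
```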

Corrective action: Check logs for any previous errors and try to fix or work around the errors found. Otherwise, the DAE cannot be monitored; contact your customer-support provider.


(EES ID 47) critical:

EES event 47 displays a variety of text messages and can be either a critical or major event, depending upon the context. When this event is a critical error, it is accompanied by EES ID 24, critical: exiting. The first portion of the text string is the text of the failure, including the system or library call that failed; the second portion is the errno text. The following text variants are possible:

sigset failed, errno
fork failed, errno
gettimeofday failed, errno
wait failed, errno
MPTIOCINQ failed, errno
MPTIOCRECDIAG failed, errno
FIOCADDR ioctl failed, errno
cfg_sys call failed, errno
setitimer failed, errno
could not open disk_name
could not open fabric_name

Corrective action: Determine what this error means in the context of the failed system or library call. Check for errors with more information following these events, such as the disk name to which the failure applies.



2.5 Major Error Messages

EES defines major errors as those that indicate a problem with noticeable impact on the program under report (the "provider"). A range of problems can be grouped under this condition: certain functionality within the provider is not available; the provider is performing poorly; or the provider is malfunctioning in some other way. For ease of reference, the major errors listed here are sorted by their EES ID number, in ascending order.


(EES ID 0) major: Error identifying the enclosure for device sd10

This message is output whenever an error occurs that may have prevented complete identification of a DAE.

Corrective action: Check logs for any previous errors. It is possible that the DAE is not being monitored; check the log to see which DAE units are being monitored. Then contact your customer-support provider.


(EES ID 1) major: Internal error: could not get fabric address

Corrective action: Check logs for any previous errors. Then contact your customer-support provider.


(EES ID 2) major: Unexpected fan speed code of 0x%x

This is an internal error.

Corrective action: Try restarting sesmond or rerunning sesmonid. Contact your customer-support provider.


(EES ID 3) major: Internal error: bad fibre channel disk address %d for device %s

ptx/SESMON received an unexpected value for the Fibre Channel address of a disk in the DAE.

Corrective action: Check logs for any previous errors. Then contact your customer-support provider.


(EES ID 5) major: Unexpected device status code %d

The status code returned for a device is not on the list of predefined possible values.

Corrective action: Check logs for any previous errors. Then contact your customer-support provider.


(EES ID 6) major: configuration check failed, monitoring could be incomplete

An error occurred while sesmond was trying to determine what DAE units were configured on the system.

Corrective action: Check logs for any previous errors. Then contact your customer-support provider.


(EES ID 8) major: Could not get an enclosure status page for DAE on fabric%s with address %s

The sesmond daemon no longer has access to any of the disk drives in slots 0-3; access to at least one of these devices is needed for SES access to the DAE. The loss of access may have been caused by a failed connection to the DAE, or by the failure of the disks in slots 0-3.

Corrective action: Verify that at least one of the four disk drives in slots 0-3 is capable of sending and receiving I/O. If not, replace the failed drive or correct the failed connection.


(EES ID 10) major: Monitoring exceeded the polling interval; monitoring is probably stalled

This error is output the second time that the polling interval is exceeded.

Corrective action: Check for hung configuration (devctl) commands, and check to see if sesmond is using excessive CPU time.


(EES ID 12) major: Monitoring exceeded the polling interval

The signal-based monitor timer's poll period expired while sesmond was still monitoring or trying to monitor. This event can easily occur when a device is being configured or deconfigured from the system.

Corrective action: Shut down sesmond while deconfiguring, in order to avoid interactions with devctl, and restart sesmond after device configuration or deconfiguration is complete.


(EES ID 15) major: Could not get config graph, cfg_info_init returned NULL

The cfg call failed, probably because it could not allocate memory.

Corrective action: Check system memory resources.


(EES ID 16) major: Could not get recv diag status page disk sd25 in DAE on fabric2 with address 0

If the system is unable to communicate with a DAE disk in slots 0-3 that is visible to the operating system, this message will be output. Two of these disks normally have access to enclosure status information for each LCC. If just one of them fails, ptx/SESMON will use the other disk to access the enclosure status information. In that case, this message will be seen only once. If both disks with access to a given LCC fail (both disks in slots 0 and 2, or both disks in slots 1 and 3), this message will be seen every poll period until the failed disks are recovered or taken offline.

Corrective action: Check status of the disk and its connections. If necessary, replace the disk drive, which is replaceable with the DAE unit online. See Section of the NUMA-Q Installation Guide for CLARiiON Disk Arrays.
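The slot pairing described for EES ID 16 can be expressed as a small check. This Python sketch is illustrative: the function name is hypothetical, and it encodes only the rule stated above, namely that the message repeats every poll period once both disks of a pair (slots 0 and 2, or slots 1 and 3) have failed.

```python
# Slot pairs that share access to one LCC's enclosure status, per the text above.
SES_SLOT_PAIRS = ((0, 2), (1, 3))

def message_repeats_every_poll(failed_slots):
    """True if both disks of at least one pair have failed."""
    failed = set(failed_slots)
    return any(failed.issuperset(pair) for pair in SES_SLOT_PAIRS)

print(message_repeats_every_poll({0}))      # one disk of a pair failed
print(message_repeats_every_poll({0, 2}))   # both disks of a pair failed
```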


(EES ID 17) major: Could not allocate %d bytes, monitoring aborted

A malloc call failed.

Corrective action: Check system memory resources.


(EES ID 18) major: The DAE on fabric2 with address 0 has no devices with SES access and cannot be monitored.

There are disks in a DAE unit configured into the system, but no disks in slots 0-3 are configured, so the DAE cannot be monitored.

Corrective action: Check logs for previous errors, and check that the proper disks were configured.


(EES ID 19) major: No matching function for element type %d

This is an internal error.

Corrective action: Check logs for previous errors. Contact your customer-support provider.


(EES ID 20) major: Could not find other power supply status

This is an internal error.

Corrective action: Try restarting the sesmond daemon or rerunning the sesmonid command. Contact your customer-support provider.


(EES ID 21) major: Internal error: No matching subelement count found for element of type %d

Corrective action: Check logs for any previous errors. Contact your customer-support provider.


(EES ID 22) major: UPS installed, but it is not supported

Probably hardware that supports UPS is matched with software that does not. Another possible cause is a hardware or software bug.

Corrective action: Contact your customer-support provider.


(EES ID 23) major: The recv diag configuration page for disk sd12 is not that of a DAE, but the disk inquiry is for a DAE disk

This message probably results either from a bug or from an unsupported configuration, such as a disk that has DAE-specific firmware but is not in a DAE unit being connected to the NUMA-Q host.

Corrective action: Contact your customer-support provider.


(EES ID 25) major: disk fault for disk in slot 0 in DAE on fabric4 with address 1

The disk drive in slot 0 in the DAE unit with the enclosure address 1 on Fibre Channel arbitrated loop 4 has failed.

Corrective Action: Replace the disk drive, which is replaceable with the DAE unit online. See the section entitled "Replacing or Adding a Disk Module in the DAE" in the NUMA-Q Installation Guide for CLARiiON Disk Arrays.


(EES ID 27) major: power failure for power supply A in DAE on fabric4 with address 1

Power supply A in the DAE unit with the enclosure address 1 on Fibre Channel arbitrated loop 4 has failed.

Corrective Action: Replace power supply A. You can do this with the DAE unit online, provided that power supply B is functioning properly. See the section entitled "Replacing a Power Supply Module in the DAE" in the NUMA-Q Installation Guide for CLARiiON Disk Arrays.


(EES ID 29) major: overtemperature failure for power supply A in DAE on fabric4 with address 1

Power supply A in the DAE unit with the enclosure address 1 on Fibre Channel arbitrated loop 4 has exceeded its rated maximum operating temperature and therefore shut down. The DAE can continue to operate if the other power supply is functioning properly.

Corrective Action: Replace the faulty power supply at your earliest convenience. See Section of the NUMA-Q Installation Guide for CLARiiON Disk Arrays.


(EES ID 31) major: single fan failure for cooling element in DAE on fabric4 with address 1

One of the three fans in the fan pack has failed. The DAE unit can continue operating; the other two fans will speed up in order to compensate for the fault. From this point on, the DAE unit no longer has redundant cooling; if one of these remaining fans fails, power to all disk drives in the DAE unit will be shut down after two minutes.

Corrective Action: Replace the fan pack at your earliest convenience. See the section entitled "Replacing the Drive Fan Pack in the DAE" in the NUMA-Q Installation Guide for CLARiiON Disk Arrays.


(EES ID 31) major: multiple fan failure for cooling element in DAE on fabric4 with address 1

At least two of the three fans in the fan pack have failed. Power to all disk drives in the DAE unit will be shut down two minutes after this event occurs.

Corrective Action: Replace the fan pack. See the section entitled "Replacing the Drive Fan Pack in the DAE" in the NUMA-Q Installation Guide for CLARiiON Disk Arrays.
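The consequences of the two fan-failure messages can be summarized in a short function. This Python sketch simply restates the behavior described above for a three-fan pack; the function name and the returned keys are hypothetical.

```python
TOTAL_FANS = 3
SHUTDOWN_DELAY_SECONDS = 120   # drive power is cut two minutes after a multiple failure

def fan_pack_outlook(failed_fans):
    """Describe the cooling state for a given number of failed fans."""
    if not 0 <= failed_fans <= TOTAL_FANS:
        raise ValueError("a fan pack has %d fans" % TOTAL_FANS)
    if failed_fans == 0:
        return {"redundant": True, "shutdown_after_seconds": None}
    if failed_fans == 1:
        # Remaining fans speed up; the DAE keeps running without redundancy.
        return {"redundant": False, "shutdown_after_seconds": None}
    # Two or more fans failed: drive power shuts down after the delay.
    return {"redundant": False, "shutdown_after_seconds": SHUTDOWN_DELAY_SECONDS}

print(fan_pack_outlook(1))
print(fan_pack_outlook(2))
```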


(EES ID 33) major: LCC failure for LCC A in DAE on fabric4 with address 1

LCC A of the DAE unit with enclosure address 1, on Fibre Channel fabric 4, has failed. This is not the LCC to which the NUMA-Q host is connected; if it were, you would see only the message, major: could not get recv diag status for DAE on fabric4 with address 1 (EES ID 16), up to four times.

Corrective Action: Replace the faulty LCC. See the section entitled "Replacing a Link Control Card in the DAE" in the NUMA-Q Installation Guide for CLARiiON Disk Arrays.


(EES ID 35) major: primary MIA failure for LCC B in DAE on fabric4 with address 1

The Media Interface Adapter (MIA) connected to the primary (PRI) port on LCC B of the DAE unit with enclosure address 1, on Fibre Channel fabric 4, has failed. This MIA is not on the LCC to which the NUMA-Q host is connected; if it were, you would see only the message, major: could not get recv diag status for DAE on fabric4 with address 1 (EES ID 16), up to four times.

Corrective Action: Replace the MIA, as described in the first three procedural steps in the section "Replacing a Link Control Card in the DAE" of the NUMA-Q Installation Guide for CLARiiON Disk Arrays, where the Figure "Optical Cable Connection to the Link Control Card (LCC)" illustrates the MIA's connection to the LCC and to the fiber optic cable from the NUMA-Q host.


(EES ID 37) major: expansion MIA failure for LCC B in DAE on fabric4 with address 1

The Media Interface Adapter (MIA) connected to the expansion (EXP) port on the DAE unit has failed. The message with EES ID 16, major: could not get recv diag status for DAE on fabric4 with address 1, will also be output if a DAE is connected to the expansion port.

Corrective Action: Replace the MIA, as described in the first three procedural steps in the section "Replacing a Link Control Card in the DAE" of the NUMA-Q Installation Guide for CLARiiON Disk Arrays, where the Figure "Optical Cable Connection to the Link Control Card (LCC)" illustrates the MIA's connection to the LCC and to the fiber optic cable from the NUMA-Q host.


(EES ID 40) major: An error occurred while monitoring DAE on fabric0 with address 0

An error occurred while getting status (monitor) information.

Corrective action: Check logs for any previous errors; then contact your customer-support provider.


(EES ID 43) major: sesmond got signal 9

The parent sesmond daemon detected that the child daemon exited because it received a signal. If the child daemon was not killed by a command entered manually, the cause is probably a bug.

Corrective action: Check logs for any previous errors that are possible causes of the exit. Verify that the daemon has restarted, and contact your customer-support provider.


(EES ID 46) major: Could not get inquiry page for sd26

The listed disk drive probably has errors. This message will be output every poll period until the drive access is corrected or the drive is deconfigured from the kernel.

Corrective action: Check that the drive is accessible. Use the sesmond option -x to exclude the use of the device. Use the full pathname, /usr/bin/sesmond, when manually starting sesmond; otherwise, the daemon cannot be killed by the shutdown script, which recognizes the process only by the name /usr/bin/sesmond.


(EES ID 47) major:

Events with EES ID 47 are classified as major (rather than critical) errors when not accompanied by an inadvertent exit of the sesmond child daemon. A variety of messages may be displayed under this event ID; see (EES ID 47) critical.



2.6 Minor Error Messages

The EES classifies as minor events those errors that have no noticeable impact on the functionality of the program under report (the "provider"), such as unrecognized debug parameters that are being ignored. For ease of reference, the minor errors listed here are sorted by their EES ID number, in ascending order.

Also classified as minor EES events are those messages that announce the correction of errors. The error-corrected messages are described in Section 2.9, "sesmond Error-Correction Entries in /var/ees/eeslog."


(EES ID 7) minor: Halted monitoring of DAE on fabric0 with address 0

All devices in a DAE unit have been deconfigured, and so no monitoring can occur for the unit.


(EES ID 45) minor: restarting sesmond

This message notifies the user that the child sesmond daemon, after exiting unexpectedly, has been restarted.


(EES ID 48) minor: No fabric device sd100 in the configuration

The command line option -d, -X, or -x was used, and a device was specified that does not exist in the configuration tree as a fabric device (all DAE disks are on a fabric). If sd100 is the only device selected in the command option, the daemon will continue running, monitoring all devices. If other devices that are in the configuration tree are selected, the sesmond daemon will monitor them.



2.7 Debug Message


(EES ID 49) debug: ses_get_ses_status: checking element at 7, type 1

Messages are logged with severity ees_debug when sesmond is run with verbose/debug mode enabled--that is, when sesmond is started with -V, or signal 17 is sent to the sesmond process. Signal 17 turns on the verbose/debug mode, and signal 16 turns it off. These messages are useful when trying to debug problems with sesmond.
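The signal toggle described above could be scripted as in the following Python sketch. The signal numbers 17 and 16 are taken from the text and apply to the host on which sesmond runs; the same numbers mean different things on other operating systems, so this is only meaningful when run on that host. The function name and the injectable "send" parameter are hypothetical conveniences for illustration.

```python
import os

DEBUG_ON_SIGNAL = 17    # turns verbose/debug mode on (number taken from the text)
DEBUG_OFF_SIGNAL = 16   # turns verbose/debug mode off (number taken from the text)

def set_sesmond_debug(pid, enable, send=os.kill):
    """Send the debug on/off signal to the sesmond process with the given PID."""
    signum = DEBUG_ON_SIGNAL if enable else DEBUG_OFF_SIGNAL
    send(pid, signum)
    return signum

# Dry run with a stand-in sender, so that no real signal is delivered.
sent = []
set_sesmond_debug(6763, True, send=lambda pid, signum: sent.append((pid, signum)))
print(sent)
```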



2.8 "Information" Message


(EES ID 9) info: Started monitoring DAE on fabric0 with address 0

This message is output every time sesmond begins to monitor a DAE unit.



2.9 sesmond Error-Correction Entries in /var/ees/eeslog

When faults involving FRUs within the DAE unit are corrected, sesmond notes the correction at the next polling interval and logs the appropriate error-correction entries.


ATTENTION

Error-correction messages can be generated by transient errors that disappear without service intervention. Repeated instances of such events for a device may be early symptoms of total device failure; contact your customer-support provider to determine whether preemptive servicing is warranted.


The following list describes the error-correction entry types. ptx/EES classifies these correction reports as "minor" events.


(EES ID 11) minor: Monitoring now normal

After monitoring had exceeded a previous polling period, the next monitoring period did not exceed the polling period.


(EES ID 13) minor: configuration check succeeded

A previous configuration check had failed, but the next has succeeded.


(EES ID 26) minor: disk fault corrected for disk sd27 in slot 1 in DAE on fabric4 with address 1

The disk drive in question either failed and was replaced, or suffered a transient fault that was corrected without user intervention.


(EES ID 28) minor: power failure corrected for power supply A or B in DAE on fabric4 with address 1

The power supply in question either failed and was replaced, or suffered a transient fault that was corrected without user intervention.


(EES ID 30) minor: overtemperature failure corrected for power supply A or B in DAE on fabric4 with address 1

The power supply in question shut down because it exceeded its maximum operating temperature; then, either the power supply was replaced or the overtemperature was corrected in some other way, such as lowering the ambient temperature near the DAE unit.


(EES ID 32) minor: fan failure corrected for cooling element in DAE on fabric4 with address 1

The fan pack experienced a single- or multiple-fan failure and was replaced, or suffered a transient fault that was corrected without user intervention.


(EES ID 34) minor: LCC failure corrected for LCC A in DAE on fabric4 with address 1

The LCC in question either failed and was replaced, or suffered a transient fault that was corrected without user intervention.


(EES ID 36) minor: primary MIA failure corrected for LCC B in DAE on fabric4 with address 1

The MIA on the primary LCC port either failed and was replaced, or suffered a transient fault that was corrected without user intervention.


(EES ID 38) minor: expansion MIA failure corrected for LCC B in DAE on fabric4 with address 1

The MIA on the expansion LCC port either failed and was replaced, or suffered a transient fault that was corrected without user intervention.
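Because each FRU fault message has a matching error-correction message, a log can be scanned for faults that have not yet been reported as corrected. The following Python sketch pairs the event IDs listed in this chapter (25/26, 27/28, 29/30, 31/32, 33/34, 35/36, 37/38). The function name is hypothetical, and the "component" label that identifies the affected FRU is assumed to be supplied by the caller, since fault and correction messages word the device differently.

```python
# FRU fault event ID -> event ID of the matching correction message.
CORRECTED_BY = {25: 26, 27: 28, 29: 30, 31: 32, 33: 34, 35: 36, 37: 38}
FAULT_FOR = {fixed: fault for fault, fixed in CORRECTED_BY.items()}

def outstanding_faults(events):
    """Return the set of (fault_id, component) pairs not yet reported as corrected.

    'events' is an iterable of (event_id, component) tuples in log order.
    """
    open_faults = set()
    for event_id, component in events:
        if event_id in CORRECTED_BY:
            open_faults.add((event_id, component))
        elif event_id in FAULT_FOR:
            open_faults.discard((FAULT_FOR[event_id], component))
    return open_faults

log = [(25, "fabric4/1/slot1"), (27, "fabric4/1/psA"), (26, "fabric4/1/slot1")]
print(outstanding_faults(log))   # only the power supply fault remains open
```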