Troubleshooting the cluster

This topic provides an overview of how to resolve problems with the SAN File System cluster.

A SAN File System cluster can contain from two to eight engines, each running a separate instance of a metadata server. A metadata server has one of the following roles:
  • Master

    The master metadata server manages system metadata for the entire cluster. It controls all operations involving system metadata such as allocation of storage space, coordination of most administrative operations, and access to the global namespace. In addition, the master metadata server can perform the same tasks that are performed by subordinate metadata servers: managing file metadata and workload for one or more filesets.

    Only one metadata server at a time can act as the master in a cluster.

  • Subordinate

    Subordinate metadata servers manage file metadata and workload for one or more filesets.

    Note: A fileset can be managed by only one metadata server.

To access the user data in a specific fileset, clients communicate with the metadata server that manages that fileset.
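The fileset-to-server relationship described above can be modeled in a short sketch. This is illustrative only, not SAN File System client code; the `ClusterMap` class and server names are hypothetical.

```python
class ClusterMap:
    """Tracks which metadata server manages each fileset.

    Invariants from the text: exactly one master at a time, and each
    fileset is managed by exactly one metadata server.
    """

    def __init__(self, master, subordinates):
        self.master = master
        self.servers = {master, *subordinates}
        self.fileset_owner = {}          # fileset name -> server name

    def assign(self, fileset, server):
        if server not in self.servers:
            raise ValueError(f"unknown metadata server: {server}")
        # A fileset has exactly one managing server; reassignment replaces it.
        self.fileset_owner[fileset] = server

    def server_for(self, fileset):
        """A client asks this before accessing user data in a fileset."""
        return self.fileset_owner[fileset]

cluster = ClusterMap("mds1", ["mds2", "mds3"])
cluster.assign("homes", "mds2")
cluster.assign("projects", "mds1")   # the master can also serve filesets
print(cluster.server_for("homes"))   # -> mds2
```

Note that the master appears in the map like any other server: as the text states, it can manage filesets in addition to its cluster-wide duties.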

Metadata server failures

Metadata servers in the cluster rely on a heartbeat mechanism to verify availability. When a metadata server becomes unresponsive or fails, the cause is typically one of the following:

  • Operating system crashes or hangs
  • Metadata server hangs
  • Local network connection fails
  • Network partition occurs
  • Hardware failure occurs on the metadata server
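A heartbeat check of the kind described above amounts to flagging any server whose most recent heartbeat is older than a timeout. The following minimal sketch assumes a hypothetical timeout value and function name; neither is taken from the product.

```python
import time

HEARTBEAT_TIMEOUT = 10.0   # seconds; illustrative only, not a product value

def unresponsive_servers(last_heartbeat, now=None):
    """Return the servers whose most recent heartbeat is too old.

    last_heartbeat maps server name -> timestamp of its last heartbeat.
    """
    now = time.monotonic() if now is None else now
    return [name for name, ts in last_heartbeat.items()
            if now - ts > HEARTBEAT_TIMEOUT]

beats = {"mds1": 100.0, "mds2": 95.0, "mds3": 89.0}
print(unresponsive_servers(beats, now=100.0))  # -> ['mds3']
```

Note that a missed heartbeat alone cannot distinguish among the causes listed above: a dead server and a network-partitioned server look identical to the monitor, which is why the containment logic below matters.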
If the operating system crashes or hangs, the metadata server hangs, or a hardware failure occurs, the failed metadata server is truly dead. It can no longer access metadata for filesets that it served.

If a local network connection fails or a network partition occurs, all metadata servers can function until they need to communicate with the master. When a metadata server continues to function but cannot be reached from the cluster, it can corrupt metadata unless it is stopped. Such a metadata server is referred to as a rogue metadata server.

In all failure scenarios, the cluster must guarantee that no rogue metadata server remains active. To do this, SAN File System uses one of two containment methods:

  1. SAN File System attempts to abort the failed metadata server, resulting in a server core file that is written to /usr/tank/server. This method is the least disruptive way to handle a rogue metadata server because only the failed metadata server, and not the engine, is affected.
  2. If the abort fails, SAN File System metadata servers use the RSA II cards and the RS-485 network to stop the engine on which the failed metadata server is running. This method is the most disruptive because the entire engine stops. The results of executing this operation are logged to /usr/tank/server/log/log.stopengine.
When the engine stops, the RSA II automatically tries to restart it. If the RSA II cannot restart the engine, you must manually fix the fault (as you must after a hardware failure).
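The two-step escalation above can be summarized as: try the least disruptive containment first, then escalate. In this sketch, `abort_server` and `stop_engine` are hypothetical stand-ins for the real mechanisms (the server abort, and the RSA II / RS-485 engine stop); only the ordering is taken from the text.

```python
def contain_rogue_server(abort_server, stop_engine):
    """Try the least disruptive containment first, then escalate.

    Mirrors the order described in the text: abort the metadata server
    first, and stop the whole engine only if the abort fails.
    """
    if abort_server():              # step 1: abort the metadata server only
        return "aborted"            # core file written; engine unaffected
    if stop_engine():               # step 2: stop the engine via the RSA II
        return "engine-stopped"     # most disruptive; entire engine stops
    return "manual-intervention"    # neither worked: administrator must act

# Simulate a server whose abort fails but whose engine stop succeeds.
print(contain_rogue_server(lambda: False, lambda: True))  # -> engine-stopped
```

The design choice this illustrates is fail-safe ordering: the cheap, targeted action is always attempted before the one that takes down the whole engine.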
While the failed metadata server is down, the filesets originally served by the failed metadata server fail over to a surviving metadata server. These filesets become available again to clients after a brief pause, typically from one to two minutes.
Note: During this time, active operations of some applications might time out. Whether additional errors occur is based on how client applications respond to a timeout situation.
If the failed metadata server was the master, SAN File System automatically elects a new master. When the engine successfully restarts, the metadata server automatically restarts and rejoins the cluster if the automatic restart service is enabled.
Note: Some failure scenarios might result in the automatic restart service being disabled. In these cases, the administrator must manually re-enable the automatic restart service. After it is re-enabled, the metadata server automatically restarts and rejoins the cluster.
Any filesets that were statically assigned to the restarted metadata server fail back to that server when it rejoins the cluster.
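The failover and failback behavior described above can be sketched as two reassignment passes over the fileset ownership map. The function and variable names here are hypothetical; only the behavior (filesets move to a survivor on failure, and statically assigned filesets return on rejoin) comes from the text.

```python
def fail_over(owner, failed, survivor):
    """Reassign the failed server's filesets to a surviving server."""
    for fileset, server in owner.items():
        if server == failed:
            owner[fileset] = survivor

def fail_back(owner, static_assignment, restarted):
    """Return statically assigned filesets to a server that rejoined."""
    for fileset, server in static_assignment.items():
        if server == restarted:
            owner[fileset] = restarted

static = {"homes": "mds2", "projects": "mds1"}   # static assignments
owner = dict(static)                             # current ownership

fail_over(owner, failed="mds2", survivor="mds1")
print(owner["homes"])   # -> mds1  (served by the survivor while mds2 is down)

fail_back(owner, static, restarted="mds2")
print(owner["homes"])   # -> mds2  (failed back to its static assignment)
```

Note that during the window between the two calls, clients of the "homes" fileset would experience the brief pause the text describes before the survivor begins serving it.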

Parent topic: Troubleshooting

Related reference
Commands

(C) Copyright IBM Corporation 2003, 2004. All Rights Reserved.
IBM TotalStorage SAN File System v2.2