Troubleshooting the local network

Use the information in this topic to troubleshoot problems that you are having with the local network.

Note: For more information on metadata server failures, refer to Troubleshooting the cluster.

Problem

A problem exists with the local network on which the metadata servers communicate.

The problem might be a network fault or network partition:
  • A network fault. A local network fault can occur if there is a bad Ethernet adapter in an engine or the Ethernet cable is not connected between the Ethernet adapter and the IP network. In a local network fault, the cluster reacts as if the metadata server on which the fault occurred is down.

    The master metadata server excludes the failed metadata server and reforms the cluster. The metadata server that was excluded is aborted, resulting in a server core file that goes to /usr/tank/server. If the abort fails, the RSA II stops and restarts the engine that is hosting the failed metadata server. Review the logs (log.std and log.stopengine) on the master and the log.std log on the failed metadata server.

  • A network partition. A local network partition can occur if there is a problem in the Ethernet network that causes two or more metadata servers to lose communication with the master metadata server or with the rest of the cluster. Both sides of the partition attempt to take over, but the side of the partition that contains the majority of the metadata servers generally survives and reforms the cluster. If the partition leaves an equal number of metadata servers on each side, the side of the partition that contains the master survives and reforms the cluster. On the side of the partition that does not survive, the metadata servers are aborted or their engines are shut down and restarted by the RSA II. Review the logs (log.std and log.stopengine) on the master of the surviving partition and the log.std log on all subordinate servers in the losing partition.
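The survival rule described for a network partition can be sketched in a few lines of Python. This is an illustration only, not SAN File System code: the function name, the side labels, and the server names are hypothetical, and the sketch returns None for any split that this topic does not describe (for example, three sides with no majority).

```python
from typing import Dict, List, Optional


def surviving_side(sides: Dict[str, List[str]], master: str) -> Optional[str]:
    """Return the label of the partition side expected to survive.

    sides  -- hypothetical side label mapped to the metadata servers on that side
    master -- name of the master metadata server before the partition
    """
    total = sum(len(servers) for servers in sides.values())

    # A side that holds more than half of the metadata servers generally survives.
    for label, servers in sides.items():
        if 2 * len(servers) > total:
            return label

    # Equal split: the side that contains the master survives.
    for label, servers in sides.items():
        if master in servers and 2 * len(servers) == total:
            return label

    # Any other layout is not covered by this topic.
    return None
```

For example, with servers split three to one, the side with three servers is returned regardless of where the master is. With a two-to-two split, the side that holds the master is returned.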

Investigation

If a local network fault occurs with one of the metadata servers, SAN File System performs the following actions to resolve the problem:

[Illustration: a local network fault in the cluster]
  1. The master aborts the failed metadata server or, if the abort fails, the RSA II automatically shuts down and restarts the engine hosting the failed metadata server. SAN File System fails over all filesets served by the failed subordinate metadata server to another metadata server.
  2. After the network fault is repaired, proceed as follows:
    1. If autorestart is enabled, the failed metadata server restarts and stays in the Initializing state while it cannot communicate with the master. After the network fault is fixed, the metadata server completes the initialization and automatically rejoins the cluster.
    2. If autorestart is disabled, run the following command from the command-line interface to restart the server: /usr/tank/admin/bin/sfscli startautorestart
    3. Wait for the cluster to be reformed to include this metadata server.
    4. Use the SAN File System console or the administrative CLI to verify that all metadata servers in the cluster are in an online state.
  3. SAN File System automatically fails back any static filesets assigned to the restored metadata server.
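
The "wait, then verify" steps above can be automated with a simple polling loop. The following Python sketch makes several assumptions: the function name is hypothetical, the caller supplies a get_states callable that returns each metadata server's current state (for example, parsed from the administrative CLI), and the state string "online" is assumed rather than taken from actual CLI output.

```python
import time
from typing import Callable, Dict


def wait_until_all_online(
    get_states: Callable[[], Dict[str, str]],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every metadata server reports an online state.

    get_states -- caller-supplied function returning {server name: state};
                  how the states are obtained is outside this sketch
    Returns True when all servers are online, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        states = get_states()
        # An empty result is treated as "not yet verified", not as success.
        if states and all(state.lower() == "online" for state in states.values()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The sleep parameter exists only so that the loop can be exercised without real delays. A False result means the cluster did not reform within the timeout, which is a cue to review the logs named earlier in this topic.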

If there is a network partition, SAN File System performs the following actions to resolve the problem:

[Illustration: a network partition]
  1. The master of the surviving partition attempts to abort each metadata server in the losing partition. If the abort fails, the RSA II automatically shuts down and restarts the engine hosting the failed metadata server. SAN File System fails over all filesets served by the failed metadata server to another metadata server.
  2. After the network partition is repaired, proceed as follows:
    1. If autorestart is enabled, the failed metadata server restarts and stays in the Initializing state while it cannot communicate with the master. After the network partition is fixed, the metadata server completes the initialization and automatically rejoins the cluster.
    2. If autorestart is disabled, run the following command from the command-line interface to restart the server: /usr/tank/admin/bin/sfscli startautorestart
    3. Wait for the cluster to be reformed to include this metadata server.
    4. Use the SAN File System console or the administrative CLI to verify that all metadata servers in the cluster are in an online state.
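
For either scenario, the log review described under Problem can be written down as a checklist. The Python sketch below is illustrative only: the function name and the server names are hypothetical, and the log file names (log.std and log.stopengine) are the only details taken from this topic. For a network fault, pass the cluster master and the single failed metadata server. For a network partition, pass the master of the surviving partition and the subordinate servers in the losing partition.

```python
from typing import List, Tuple


def logs_to_review(master: str, affected: List[str]) -> List[Tuple[str, str]]:
    """Return (server, log file) pairs to review after a fault or partition.

    master   -- master metadata server (of the surviving partition, if partitioned)
    affected -- the failed server, or the subordinate servers in the losing partition
    """
    # Both logs are reviewed on the master.
    checklist = [(master, "log.std"), (master, "log.stopengine")]
    # Only log.std is reviewed on each affected server.
    checklist += [(server, "log.std") for server in affected]
    return checklist
```

Where the log files reside on each engine is not stated in this topic, so the sketch returns file names only rather than full paths.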

Parent topic: Troubleshooting the cluster

Related tasks
Reassigning filesets to metadata servers
Taking a metadata server offline

(C) Copyright IBM Corporation 2003, 2004. All Rights Reserved.
IBM TotalStorage SAN File System v2.2