Failover

If a metadata server fails and the metadata-server automatic restart service cannot bring it back into the cluster, or if you manually stop the metadata server, SAN File System automatically and non-disruptively fails over the failed metadata server's workload by redistributing its filesets and, if necessary, reassigning the master role to other active metadata servers.

SAN File System also detects rogue metadata servers. A rogue metadata server is one that is not reachable from the cluster and fails to respond to requests, but might still be running or have latent queued I/O. When a rogue metadata server is detected, the cluster first attempts to communicate with it from disk so that it completes and quiesces any failing I/O activity. The cluster then stops the engine running the rogue metadata server before failing over its workload.

Tip:
  1. After a failover, review the workload reassignments.
  2. Administrative commands that are interrupted by a failover must be restarted manually against the new master metadata server.

Redistributing filesets

SAN File System reassigns filesets across the remaining active metadata servers according to a distribution algorithm. The algorithm first attempts to redistribute the static filesets to a spare metadata server that is set aside for failover. A spare metadata server is one that has no static filesets assigned to it. If exactly one spare exists, all static filesets assigned to the failed metadata server are distributed to that spare. If more than one spare exists, all static filesets assigned to a single failed metadata server are distributed as a unit to a single spare metadata server. For each successive metadata server failure, all static filesets assigned to the next failed metadata server are distributed as a unit to the next spare metadata server in a round-robin fashion. If no spares exist, the static filesets of the failed metadata server are distributed across all surviving metadata servers on a per-fileset basis in a round-robin fashion.

Dynamic filesets are distributed across all surviving metadata servers (including spares) on a per-fileset basis in a round-robin fashion.
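The distribution rules above can be illustrated with a short Python sketch. This is not the product's implementation: the function name redistribute, the shape of the assignments map, and the prior_failures counter (used to pick the next spare for successive failures) are all hypothetical, and the ordering of servers and filesets is an assumption made only so that the example is deterministic.

```python
from itertools import cycle


def redistribute(failed, assignments, survivors, prior_failures=0):
    """Return {fileset: new_server} for filesets owned by the failed server.

    assignments: {fileset: (server, kind)}, kind is "static" or "dynamic".
    survivors: ordered list of active metadata server names.
    prior_failures: hypothetical counter of earlier failures, used only to
        choose the next spare in round-robin order.
    """
    # A spare is a surviving server with no static filesets assigned to it.
    static_owners = {srv for srv, kind in assignments.values() if kind == "static"}
    spares = [s for s in survivors if s not in static_owners]

    failed_static = sorted(f for f, (srv, kind) in assignments.items()
                           if srv == failed and kind == "static")
    failed_dynamic = sorted(f for f, (srv, kind) in assignments.items()
                            if srv == failed and kind == "dynamic")

    moves = {}
    if spares:
        # Static filesets of one failed server move as a unit to one spare.
        target = spares[prior_failures % len(spares)]
        for fileset in failed_static:
            moves[fileset] = target
    else:
        # No spare: round-robin per fileset across all survivors.
        for fileset, srv in zip(failed_static, cycle(survivors)):
            moves[fileset] = srv

    # Dynamic filesets: round-robin across all survivors, spares included.
    for fileset, srv in zip(failed_dynamic, cycle(survivors)):
        moves[fileset] = srv
    return moves
```

In this sketch, a cluster in which one surviving server holds no static filesets sends every static fileset of the failed server to that one server, while the dynamic filesets are spread one at a time over all survivors.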

The failover is temporary for static filesets. A static fileset is a fileset that you manually assigned to a specific metadata server (using the mkfileset or setfilesetserver command). These filesets fail back to their statically assigned metadata server when that metadata server rejoins the cluster. Dynamic filesets, which are assigned to a metadata server by the system, are not reassigned to their previously assigned metadata server; however, they might be redistributed to rebalance the workload after the static filesets fail back.
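The failback behavior can be sketched in the same way. The function fail_back and its two maps are hypothetical names chosen for illustration; the sketch covers only the return of static filesets and does not model any later rebalancing of dynamic filesets.

```python
def fail_back(rejoined, static_home, current_owner):
    """Return a new owner map after `rejoined` returns to the cluster.

    static_home: {fileset: server} for statically assigned filesets only.
    current_owner: {fileset: server} for every fileset, as of now.
    """
    new_owner = dict(current_owner)  # leave the caller's map untouched
    for fileset, home in static_home.items():
        # A static fileset returns to the server it was assigned to.
        if home == rejoined:
            new_owner[fileset] = rejoined
    # Dynamic filesets do not appear in static_home, so they stay with
    # whichever server picked them up during failover.
    return new_owner
```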

Reassigning the master role

When a failure affects the master metadata server, the master role is reassigned to another metadata server according to a quorum algorithm. This algorithm uses a quorum disk and a majority voting procedure to assign the master role to a metadata server that is a member of the largest active, mutually connected group of metadata servers that all have access to the system storage pool.
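The selection criterion (largest mutually connected group with access to the system storage pool) can be illustrated as follows. This sketch deliberately leaves out the quorum disk and the voting procedure, whose details are not described here, and it does not choose a specific master within the group. The function name, the input shapes, and the decision to drop servers without pool access before comparing group sizes are assumptions.

```python
def candidate_master_group(groups, has_pool_access):
    """Return the largest group of servers that can host the master role.

    groups: iterable of sets of server names; each set is assumed to be an
        active, mutually connected group.
    has_pool_access: {server: bool}, True if the server can reach the
        system storage pool.
    Returns None when no server qualifies.
    """
    # Keep only the members of each group that can reach the storage pool.
    eligible = [{s for s in group if has_pool_access.get(s, False)}
                for group in groups]
    eligible = [group for group in eligible if group]
    if not eligible:
        return None
    # Ties go to the first group listed; real tie-breaking is not modeled.
    return max(eligible, key=len)
```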

Note: The quorum algorithm does not take into account the network connectivity between the metadata servers and the clients. If a network partition separates the clients from the metadata server, the chosen master might not be ideal.

Restriction: You cannot specify a preferred master metadata server as the failover target or predict the failover target. Reserve some spare capacity for the master role on each metadata server in the cluster. The master role requires only a small amount of processing.

(C) Copyright IBM Corporation 2003, 2004. All Rights Reserved.
IBM TotalStorage SAN File System v2.2