
3.3.4 Suppressing host resets

In a configuration in which three or more hosts monitor one another, if one of the hosts detects a host failure (such as a monitoring path failure on another host), operation might be switched over to a single host and all the other hosts might be reset, as shown in the figure below. In such a case, jobs might be terminated if the system requires at least a certain number of active hosts.

Figure 3‒18: Example in which operation is switched over to one host and all the other hosts are reset

[Figure]

This example assumes that no more than three active servers can run concurrently on a single host because of memory size limitations. If this system attempts to run five active servers on host 1, jobs will be terminated because the system cannot function.

In a system that requires at least a certain number of active hosts, HA Monitor can suppress host resets, thus preventing jobs from being terminated. The minimum number of hosts required for the system to run jobs is called the minimum number of active hosts. This subsection explains the operation and the environment settings that apply when host resets are suppressed.

Organization of this subsection

(1) Operation when host resetting is suppressed
(2) Failures for which host resetting can be suppressed
(3) Conditions under which host resetting is suppressed and examples
(4) Required environment settings

(1) Operation when host resetting is suppressed

This subsection explains the HA Monitor processing when host resetting is suppressed under each of the following scenarios:

  • A host failure occurs on one host, the host is reset, and then jobs continue

  • A monitoring path failure occurs on one host, host resetting is suppressed, and jobs continue

  • A host failure occurs in an N:2 configuration, the host is reset, and then jobs continue

  • A host failure occurs while three hosts are running in reduced operation mode, host resetting is suppressed, and jobs are terminated

These examples assume that the minimum number of active hosts is three and a maximum of two active servers can be run per host.

(a) A host failure occurs on one host, the host is reset, and then jobs continue

The following figure shows the processing when a host failure occurs on one host, the host is reset, and then jobs continue.

Figure 3‒19: Processing when a host failure occurs on one host, the host is reset, and then jobs continue

[Figure]

The following explains the processing, where the step numbers correspond to the numbers in the figure.

  1. Because a host failure occurred on host 4, alive message transmission stops. As a result, hosts 1, 2, 3, and 5 detect a failure on host 4.

  2. Hosts 1, 2, 3, and 5 transmit failure detection notifications among themselves.

  3. Hosts 1, 2, 3, and 5 count the received failure detection notifications and check if the minimum number of active hosts is satisfied. As a result, host 1, which has the highest reset priority, resets host 4.

  4. Hot standby processing is performed for the active server that was running on host 4. In this example, operation is switched over to the standby server running on host 5.

In this example, jobs continue because the minimum number of active hosts is satisfied after hot standby processing and no more than two active servers are running on one host.
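
The decision in steps 2 and 3 can be illustrated with the following minimal sketch. Python is used only for illustration; the names reset_allowed, notifications_received, and min_active_hosts are hypothetical, and whether a host counts itself when checking the minimum is an assumption of this sketch, not a statement about HA Monitor's implementation.

# Illustrative sketch only, not HA Monitor code: each surviving host counts the
# failure detection notifications it received and checks whether the minimum
# number of active hosts is still satisfied before the failed host is reset.
def reset_allowed(notifications_received: int, min_active_hosts: int) -> bool:
    surviving_hosts = notifications_received + 1  # notifications from other hosts + the local host (assumed)
    return surviving_hosts >= min_active_hosts

# Figure 3-19: five hosts, minimum number of active hosts = 3, host failure on host 4.
# Host 1 receives failure detection notifications from hosts 2, 3, and 5.
print(reset_allowed(3, 3))  # True: host 1, which has the highest reset priority, resets host 4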

(b) A monitoring path failure occurs on one host, host resetting is suppressed, and jobs continue

The following figure shows the processing when a monitoring path failure occurs on one host, host resetting is suppressed, and jobs continue.

Figure 3‒20: Processing when a monitoring path failure occurs on one host, host resetting is suppressed, and jobs continue

[Figure]

The following explains the processing, where the step numbers correspond to the numbers in the figure.

  1. Because a monitoring path failure occurred on host 1, alive message transmission stops. As a result, isolated host 1 detects failures on hosts 2 through 5. At the same time, hosts 2 through 5 detect a failure on host 1.

  2. Host 1 is placed in the reception wait state because it cannot transmit failure detection notifications due to the monitoring path failure. At the same time, hosts 2 through 5 transmit failure detection notifications among themselves.

  3. Host 1 cannot check whether the minimum number of active hosts is satisfied because it cannot receive failure detection notifications. Therefore, host 1 suppresses resetting of hosts 2 through 5.

  4. Hosts 2 through 5 count the received failure detection notifications and check if the minimum number of active hosts is satisfied. As a result, host 2, which has the highest reset priority, resets host 1 after 10 seconds have passed.

  5. Hot standby processing is performed for the active server that was running on host 1. In this example, operation is switched over to the standby server running on host 2.

In this example, jobs continue because the minimum number of active hosts is satisfied after hot standby processing and no more than two active servers are running on one host.
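
The isolated host's side of this decision can be sketched in the same way. As before, this is only an illustration of the rule described in the steps above, under the same assumptions, and not HA Monitor's implementation.

# Illustrative sketch only: a host that cannot receive failure detection
# notifications (host 1 in Figure 3-20, isolated by its own monitoring path
# failure) cannot confirm that the minimum number of active hosts is satisfied,
# so it suppresses resetting of the other hosts.
def decide(can_receive_notifications: bool, notifications_received: int,
           min_active_hosts: int) -> str:
    if not can_receive_notifications:
        return "suppress host resets"
    if notifications_received + 1 >= min_active_hosts:
        return "reset the failed host"  # performed by the host with the highest reset priority
    return "suppress host resets"

print(decide(False, 0, 3))  # host 1: suppress host resets
print(decide(True, 3, 3))   # hosts 2 through 5: reset the failed host (host 2 resets host 1)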

(c) A host failure occurs in an N:2 configuration, the host is reset, and then jobs continue

The following figure shows the processing when a host failure occurs in an N:2 configuration, the host is reset, and then jobs continue.

Figure 3‒21: Processing when a host failure occurs in an N:2 configuration, the host is reset, and then jobs continue

[Figure]

The following explains the processing, where the step numbers correspond to the numbers in the figure.

  1. Because a host failure occurred on host 1, alive message transmission stops. As a result, hosts 2 through 5 detect a failure on host 1.

  2. Hosts 2 through 5 transmit failure detection notifications among themselves.

  3. Hosts 2 through 5 count the received failure detection notifications and check if the minimum number of active hosts is satisfied. As a result, host 2, which has the highest reset priority, resets host 1.

  4. Hot standby processing is performed for the active server that was running on host 1. In this example, operation is switched over to the standby server running on host 4.

In this example, jobs continue because the minimum number of active hosts is satisfied after hot standby processing and only one active server is running on each host.

(d) A host failure occurs while three hosts are running in reduced operation mode, host resetting is suppressed, and jobs are terminated

The following figure shows the processing when a host failure occurs while three hosts are running in reduced operation mode, host resetting is suppressed, and jobs are terminated.

Figure 3‒22: Processing when a host failure occurs while three hosts are running in reduced operation mode, host resetting is suppressed, and jobs are terminated

[Figure]

The following explains the processing, where the step numbers correspond to the numbers in the figure.

  1. Because a host failure occurred on host 3, alive message transmission stops. As a result, hosts 1 and 2 detect a failure on host 3.

  2. Hosts 1 and 2 transmit failure detection notifications to each other.

  3. Hosts 1 and 2 suppress resetting of host 3 because the number of failure detection notifications that each of them receives does not satisfy the minimum number of active hosts.

  4. If host resetting is suppressed on all hosts#, hosts 1 and 2 cannot check the status of host 3. To prevent multiple active servers from running, the server statuses change as follows:

    • Standby server whose server priority is lower than that of the standby server on host 3: Terminated (SBY → inactive).

      This is because the status of the standby server on host 3 might change if a failure occurs on host 1 or 2.

    • Standby server that has the highest server priority among all hosts: Placed in the hot-standby wait state (SBY → ONL??).

      This is because the active server might be running on host 3.

    #

    You can use the following formula to determine the time required to suppress host resets on all hosts: total number of hosts making up the group × 10 seconds (varies depending on the system). For example, if there are 32 hosts, the time required for suppressing host resets on all hosts is as follows:

    32 × 10 = 320 seconds = 5 minutes, 20 seconds

In this example, jobs are terminated because the active server on host 3 is no longer running, and the minimum number of active hosts (3) is therefore no longer satisfied.
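
The state changes described in step 4 can be illustrated with the following minimal sketch. The server names and priority values are hypothetical, a larger number is assumed here to mean a higher server priority, and the sketch only restates the rule above; it is not HA Monitor's implementation.

# Illustrative sketch only: when host resetting is suppressed on all hosts and the
# status of host 3 cannot be checked, the standby server with the highest server
# priority among all hosts is placed in the hot-standby wait state (SBY -> ONL??),
# and standby servers with a lower priority than the standby server on host 3 are
# terminated (SBY -> inactive).
standby_on_host3 = ("server_C", 2)                   # status unknown (host 3 is unreachable)
other_standbys = [("server_A", 3), ("server_B", 1)]  # standby servers on hosts 1 and 2

highest = max(priority for _, priority in other_standbys + [standby_on_host3])
for name, priority in other_standbys:
    if priority == highest:
        print(name, "-> hot-standby wait (ONL??): the active server might still be running on host 3")
    elif priority < standby_on_host3[1]:
        print(name, "-> terminated (inactive): prevents two active servers if its status changes later")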

Depending on the availability of standby servers that can be switched over to a remote host, you might need to perform some action, such as executing the monact command. For details, see 7.2.1 Starting if the server is terminated, or see 7.4.1 Starting a server in the wait state and then restarting jobs if the server is placed in hot-standby wait state (ONL??).

If a host reset was suppressed because a monitoring path failure was detected as a host failure, use the monshow -c command to check the connection between HA Monitors after the monitoring path is restored. If any hosts are not connected, use the monlink command to connect them.

(2) Failures for which host resetting can be suppressed

Host resetting can be suppressed only for host failures that are detected by the stoppage of alive message transmission.

If a pair shutdown is reported by HA Monitor on a remote host, or if shared disk disconnection processing fails during hot standby processing, HA Monitor resets the remote host and performs hot standby processing without waiting for a failure detection notification. Similarly, if HA Monitor receives a failure notification from the hardware or if a server failure occurs, HA Monitor performs hot standby processing without waiting for a failure detection notification.

(3) Conditions under which host resetting is suppressed and examples

Host resetting can be suppressed only when all the following conditions are satisfied:

The following figures show examples of when host resetting can be suppressed.

Figure 3‒23: Example 1 of when host resetting can be suppressed

[Figure]

In this example, host resetting can be suppressed because the following conditions are satisfied:

Note

If you set the minimum number of active hosts to a value that is less than or equal to half of the total number of connected hosts, a network split-brain condition might leave the minimum number of active hosts satisfied on both sides of the split, in which case the hosts reset each other. Therefore, you must set the minimum number of active hosts to a value that is greater than half of the total number of connected hosts.
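
As a worked illustration of this note (the numbers are hypothetical): with six connected hosts, a minimum of three (exactly half) can still be satisfied on both sides of a 3:3 network split, so the two sides might reset each other, whereas a minimum of four can be satisfied by at most one side of any split.

# Illustrative sketch only: in the worst case, a network failure splits the
# connected hosts into two equal halves. If each half still satisfies the
# minimum number of active hosts, the two sides can reset each other.
total_hosts = 6
for min_active in (3, 4):
    mutual_reset_possible = (total_hosts // 2) >= min_active
    print(min_active, "->",
          "both sides can reset each other" if mutual_reset_possible
          else "at most one side can reset")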

(4) Required environment settings

To suppress host resets, specify the suppress_reset operand in the HA Monitor environment settings.