2.3.4 Detecting host failures

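This subsection describes how HA Monitor monitors hosts, checks the health of the reset paths and monitoring paths, retries message transmission, and handles a slowdown of the local system.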

Organization of this subsection

(1) Monitoring between hosts
(2) Health check of reset paths
(3) Health check of monitoring paths
(4) Retrying transmission of query response messages
(5) Recognizing a slowdown of the local system

(1) Monitoring between hosts

The HA Monitors in the active and standby systems monitor hosts by checking the alive messages that are reported mutually at a specified interval. This subsection explains monitoring between hosts by using alive messages.

(a) Detecting host failures by using alive messages

HA Monitor detects a host failure when alive messages stop arriving. In the patrol operand in the HA Monitor environment settings, specify how much time must pass after the last alive message is received before HA Monitor detects a host failure (the host failure monitoring time). In the alive_interval operand in the HA Monitor environment settings, specify the transmission interval for alive messages. If you set the host failure monitoring time to a small value, we strongly recommend that you change the alive message transmission interval from its default setting.

Transmission of alive messages starts when the HA Monitor in the remote system is contacted successfully and the hot standby operation becomes enabled. The alive messages are transferred via the monitoring path. If no alive message is received from the remote system within the specified host failure monitoring time, HA Monitor determines that a host failure has occurred in the remote system.

By default, the HA Monitors in the active and standby systems check mutually for alive messages at intervals of one second. You can change this alive message checking interval to 100 milliseconds by specifying the patrol_100ms operand in the HA Monitor environment settings. If you set the host failure monitoring time to a small value, we recommend that you set checking for alive messages to the interval of 100 milliseconds.
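
The effect of the checking interval on detection time can be illustrated with a small Python sketch. This is not HA Monitor code: the function name, the assumption that checks run on a fixed tick grid, and the rule that a failure is declared at the first check at or after the deadline are all assumptions made for illustration.

```python
def first_detection_ms(last_alive_ms: int, patrol_ms: int, check_interval_ms: int) -> int:
    """Return the time (ms) of the first periodic check that sees the timeout.

    Hypothetical model: checks run at multiples of check_interval_ms, and a
    host failure is declared at the first check at or after
    last_alive_ms + patrol_ms.
    """
    deadline_ms = last_alive_ms + patrol_ms
    # Round the deadline up to the next check tick (ceiling division).
    ticks = -(-deadline_ms // check_interval_ms)
    return ticks * check_interval_ms


# Last alive message at 2,350 ms, host failure monitoring time of 5 seconds.
print(first_detection_ms(2350, 5000, 1000))  # default 1-second checking
print(first_detection_ms(2350, 5000, 100))   # 100 ms checking (patrol_100ms)
```

In this model, one-second checking notices the timeout at 8,000 ms, whereas 100-millisecond checking notices it at 7,400 ms. The finer interval matters more as the host failure monitoring time gets smaller.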

(b) Alive message transmission method

HA Monitor uses the unicast method to send and receive alive messages. This method sends and receives as many alive messages as there are hosts connected per monitoring path.

If you use HA Monitor Extension to set the maximum number of hosts to 33 or greater, HA Monitor uses the multicast method to send alive messages. The multicast method sends and receives one alive message per monitoring path. This method can reduce the number of messages that are sent and received as well as the overhead involved in sending and receiving the messages, thereby minimizing the machine's CPU usage and the loading on LANs used as monitoring paths. For details about changing the maximum number of hosts, see 3.7.2 Changing the maximum number of hosts.
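
The difference in message volume between the two methods can be sketched as follows. The function names are hypothetical, and the sketch assumes that "hosts connected" means the number of hosts reachable on one monitoring path; only the threshold of 33 and the per-path counts come from the description above.

```python
MULTICAST_THRESHOLD = 33  # maximum number of hosts at which multicast is used


def transmission_method(max_hosts: int) -> str:
    """Pick the method from the configured maximum number of hosts."""
    return "multicast" if max_hosts >= MULTICAST_THRESHOLD else "unicast"


def alive_messages_per_path(connected_hosts: int, max_hosts: int) -> int:
    """Alive messages handled per monitoring path (illustrative model)."""
    if transmission_method(max_hosts) == "multicast":
        return 1                # one message per monitoring path
    return connected_hosts      # one message per connected host


print(transmission_method(32), alive_messages_per_path(8, 32))
print(transmission_method(64), alive_messages_per_path(40, 64))
```

With 8 connected hosts and a maximum of 32, the model counts 8 messages per path. With 40 connected hosts and a maximum of 64, it counts 1.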

(2) Health check of reset paths

To determine whether host reset is enabled in the event of a host failure, HA Monitor performs a health check on the status of all failure management processors in all remote systems that are connected. A health check starts when startup of the active and standby servers is completed and hot-standby switchover between the local and remote systems becomes possible.# Thereafter, a health check is performed every two minutes. You can change this interval by specifying the resetpatrol operand in the HA Monitor environment settings.

A health check is performed from both the active system and the standby system.

#: In either of the following cases, HA Monitor starts a health check when a connection is established between hosts, and continues to perform health checks while the hosts are connected:

The multi-standby function is used.
The reset_type operand is set to host in the HA Monitor environment settings.

If a failure is detected in the failure management processor or in the reset path, HA Monitor issues one of the following failure detection messages, and then suspends the reset path health check for the corresponding host:

KAMN399-E
KAMN624-E

For details about how to handle these messages, see the manual HA Monitor Cluster Software Messages. You can also use the reset path status display command (monrp command) to check the status of the failure management processor.

For details about the command, see 9. Commands. Note that after the host has recovered from the failure, you can use the reset path status display command (monrp command) to restart the suspended health check.

If the resetpath_retry operand is set to use or is omitted in the HA Monitor environment settings, HA Monitor automatically checks the status at regular intervals. When the reset path is restored, HA Monitor outputs the KAMN646-I message, and restarts the health check that was suspended.
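
The suspend-and-restart behavior described above can be pictured as a small state model. The class and method names below are hypothetical, the flag auto_recheck stands in for the resetpath_retry operand, and the timing of the checks is left out. It is a sketch of the described behavior, not the product's implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResetPathHealthCheck:
    """Hypothetical model of the reset path health check for one remote host."""

    auto_recheck: bool = True   # True when resetpath_retry is "use" or omitted
    suspended: bool = False
    log: List[str] = field(default_factory=list)

    def periodic_check(self, path_ok: bool) -> None:
        """One regular health check cycle."""
        if self.suspended:
            return  # a suspended health check does nothing until restarted
        if not path_ok:
            self.log.append("KAMN399-E or KAMN624-E")
            self.suspended = True

    def automatic_recheck(self, path_ok: bool) -> None:
        """Status check made at regular intervals while suspended."""
        if self.suspended and self.auto_recheck and path_ok:
            self.log.append("KAMN646-I")
            self.suspended = False

    def monrp_restart(self) -> None:
        """Operator restarts the suspended check with the monrp command."""
        self.suspended = False


check = ResetPathHealthCheck()
check.periodic_check(path_ok=False)      # failure detected, check suspended
check.automatic_recheck(path_ok=False)   # still failing, stays suspended
check.automatic_recheck(path_ok=True)    # restored, check restarted
print(check.suspended, check.log)
```

When auto_recheck is False in this model, the health check stays suspended until monrp_restart is called, which corresponds to restarting the check manually with the monrp command.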

Note: If you are using BladeSymphony, you must specify the resetpatrol_mode operand in the HA Monitor environment settings. For details, see 8.3.1 HA Monitor environment settings (sysdef).

(3) Health check of monitoring paths

HA Monitor can perform health checks on the status of the monitoring paths at a specified interval. The health check interval is specified in the pathpatrol operand in the HA Monitor environment settings. If a failure occurs on a monitoring path, HA Monitor issues a communication error message.

Once HA Monitor has started, it checks the status of the monitoring paths each time the health check interval specified in the environment settings has elapsed. The health check is performed on all monitoring paths connected to the host and terminates when HA Monitor is terminated. This health checking occurs only where a monitoring path is connected between local and remote systems and HA Monitor is running.

The user can see the status of the monitoring paths by using the monitoring path status display command (monpath command).

If a failure is detected on a monitoring path and the pathpatrol_retry operand is specified in the HA Monitor environment settings, HA Monitor rechecks the monitoring path's status. If the recheck also detects the failure, HA Monitor issues one of the following messages:

KAMN609-W
KAMN635-E
KAMN640-E
KAMN641-W

After the KAMN641-W message has been issued for any reason, or after the monitoring path status display command (monpath command) has detected a failure, the KAMN641-W message is not issued again until the system recovers from the failure.

For details about how to handle the messages, see the information provided in the manual HA Monitor Cluster Software Messages. Note that if the monitoring path status display command (monpath command) is executed while the monitoring path's status is being rechecked, the KAMN393-E message is output, and the status of monitoring paths cannot be checked.

(4) Retrying transmission of query response messages

In addition to alive messages, HA Monitor exchanges other messages with the remote system via the monitoring path. For example, the monitoring path is also used to transfer the query response messages that determine whether the same server is already running in the remote system.

If transmission of such a query response message fails, HA Monitor retries transmission of the message every three seconds until the message is sent successfully. The user can change this message transmission retry interval by specifying the message_retry operand in the HA Monitor environment settings.

If no monitoring is being performed between hosts, such as before transmission of alive messages begins, and a query response message cannot be sent successfully within 60 seconds,# HA Monitor determines that a host failure has occurred.

#: If the value of the message_retry operand in the HA Monitor environment settings is greater than 60 seconds and a query response message cannot be sent successfully within that amount of time, HA Monitor determines that a host failure has occurred.
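
The retry rule and its footnote can be worked through with a short simulation for the case where no monitoring is being performed between hosts. The function names and the callback try_send are hypothetical. The sketch also assumes that the first attempt is made at time 0 and that an attempt made exactly at the limit still counts.

```python
from typing import Callable, Tuple

DEFAULT_RETRY_SECONDS = 3   # default retry interval
BASE_LIMIT_SECONDS = 60     # limit before a host failure is assumed


def failure_limit_seconds(message_retry: int = DEFAULT_RETRY_SECONDS) -> int:
    """The limit is 60 seconds, or message_retry if that is larger."""
    return max(BASE_LIMIT_SECONDS, message_retry)


def simulate_send(try_send: Callable[[int], bool],
                  message_retry: int = DEFAULT_RETRY_SECONDS) -> Tuple[str, int]:
    """Retry until try_send(elapsed) succeeds or the limit passes."""
    limit = failure_limit_seconds(message_retry)
    elapsed = 0
    while True:
        if try_send(elapsed):
            return ("sent", elapsed)
        if elapsed + message_retry > limit:
            # No further attempt fits within the limit.
            return ("host_failure", limit)
        elapsed += message_retry


print(simulate_send(lambda t: t >= 9))    # remote answers after 9 seconds
print(simulate_send(lambda t: False))     # remote never answers
print(simulate_send(lambda t: False, 90)) # message_retry greater than 60
```

In this model, a remote system that never answers is judged to have failed after 60 seconds with the default interval, and after 90 seconds when message_retry is 90.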

(5) Recognizing a slowdown of the local system

This subsection explains how HA Monitor behaves when it is unable to run for longer than the host failure monitoring time specified in the patrol operand in the HA Monitor environment settings, for reasons such as high system load (a host slowdown).

If a slowdown occurs in the active system, HA Monitor performs hot standby switching to the standby system.

On the other hand, if a slowdown occurs in the standby system, the HA Monitor in the active system determines that a host failure has occurred in the standby system. In this case, when the standby system recovers from the slowdown, a status conflict occurs between the hosts because the HA Monitor in the standby system has not detected a host failure in the standby system.

Therefore, when the standby system has recovered from the slowdown, the HA Monitor in the standby system reconnects the hosts and then restarts the standby server. This automatically resolves the status conflict between the hosts and restarts monitoring of both the active and standby systems.
