Hitachi

For Linux(R) (x86) Systems HA Monitor Cluster Software


2.3.2 Detecting server failures (in the server mode)

When the server is in the server mode, HA Monitor monitors the server and performs the hot standby operation in the event of a server failure. This subsection explains how HA Monitor detects server failures and HA Monitor's processing after detection of a failure.


(1) How server failures are detected

Server failures can be classified into two types, those that can be detected by the server itself and those that cannot.

When a failure that can be detected by the server itself occurs:

HA Monitor receives a failure notification from the server and detects the server failure.

When a failure that cannot be detected by the server occurs:

On the active server, HA Monitor detects the failure. The server monitoring method depends on the program being used.

If TP1/Server Base or HiRDB is being used:

HA Monitor detects a failure by monitoring operation reports from the server. To enable this monitoring, specify A for the server_type operand in the server environment definition, or omit the server_type operand, and specify the server failure monitoring time in the patrol operand. The server produces operation reports on a regular basis, and HA Monitor checks at one-second intervals whether an operation report has arrived from the server. If HA Monitor receives no operation report from the server within the specified server failure monitoring time, it assumes that a server failure has occurred.
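The report-timeout check described above can be illustrated with a minimal Python sketch. This is not HA Monitor's implementation: the function name first_detection_time and its parameters are hypothetical, the simulation counts whole seconds instead of reading a real clock, and it assumes that the start of monitoring counts as the most recent report.

```python
from typing import Iterable, Optional


def first_detection_time(report_times: Iterable[int],
                         patrol_seconds: int,
                         run_seconds: int) -> Optional[int]:
    """Return the second at which a failure would be assumed, or None.

    report_times   -- seconds at which the server sent an operation report
                      (hypothetical input for this simulation)
    patrol_seconds -- server failure monitoring time (value of the patrol operand)
    run_seconds    -- how many seconds to simulate
    """
    reports = set(report_times)
    last_report = 0  # assumption: monitoring starts at t=0
    for now in range(1, run_seconds + 1):  # one check per second
        if now in reports:
            last_report = now
        elif now - last_report >= patrol_seconds:
            # No report within the monitoring time: assume a server failure.
            return now
    return None
```

For example, with reports at seconds 2, 4, and 6 and a monitoring time of 5 seconds, this sketch assumes a failure at second 11. A server that keeps reporting every 3 seconds never triggers detection.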

Note that HA Monitor does not detect failures on the standby server. The user must restart and terminate the standby server, if necessary.

(2) HA Monitor processing after detection of a server failure

When HA Monitor detects a server failure on the active server and hot standby is possible, HA Monitor performs one of the three types of processing described below. The user must specify the type of processing in advance in the switchtype operand in the server environment definition.

The following subsections, (a) through (c), provide details of each type of processing.

(a) Performs the hot standby operation

When HA Monitor detects a server failure, the active system's HA Monitor forcibly terminates the active server and performs the hot standby operation to switch over to the standby system. The following figure provides an example of the processing that is performed by HA Monitor when no operation report is received and a server failure is detected.

Figure 2‒13: HA Monitor processing when a server failure is detected (when hot standby operation is performed)

[Figure]

(b) Waits for a restart of the active server; if restart fails, performs the hot standby operation

When HA Monitor detects a server failure, it waits for the failed active server to restart. The state of the active server from detection of the failure until the server restarts is called the restart wait state.

If the active server restarts successfully, HA Monitor does not perform the hot standby operation, so the host that executes jobs does not change after the failure. If restart of the active server fails, the server program retries the restart a specified number of times. If the restart still fails and HA Monitor detects that the active server's restart limit has been reached, HA Monitor performs the hot standby operation to switch over to the standby system. HA Monitor processing during hot standby is the same as in (a) Performs the hot standby operation.

The active server's restart limit is the maximum number of attempts to be made to restart the active server. HA Monitor detects the following values as the active server's restart limit:

  • Program's retry count limit

    This is the maximum number of retry attempts. This value is specified in the program.

  • Active server's restart monitoring time limit

    This is the maximum amount of time that HA Monitor waits for the active server to restart after detecting a server failure on the active server. This value is specified in the server environment definition.
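The two restart limits can be sketched as a single check. This is an illustration only, assuming that whichever limit is reached first applies. The names RestartLimits and restart_limit_reached are hypothetical and do not correspond to HA Monitor operands or functions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartLimits:
    # Program's retry count limit (specified in the program).
    retry_count_limit: int
    # Restart monitoring time limit in seconds (specified in the server
    # environment definition); None means no time limit is in effect.
    monitoring_time_limit: Optional[int] = None


def restart_limit_reached(limits: RestartLimits,
                          failed_attempts: int,
                          elapsed_seconds: int) -> Optional[str]:
    """Return which limit was reached, or None if restarts may continue."""
    if failed_attempts >= limits.retry_count_limit:
        return "retry_count"
    if (limits.monitoring_time_limit is not None
            and elapsed_seconds >= limits.monitoring_time_limit):
        return "monitoring_time"
    return None
```

In this sketch, a return value other than None corresponds to the point at which HA Monitor would stop waiting for the restart.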

(c) Waits for a restart of the active server; if restart fails, waits for operator intervention

This subsection explains HA Monitor processing when HA Monitor waits for a restart of the active server after detecting a server failure and then, if the restart fails, waits for an operator to take action.

The active server restart processing is the same as in (b) Waits for a restart of the active server; if restart fails, performs the hot standby operation. Note that when HA Monitor is to wait for operator intervention, you can specify only the program's retry count as the active server's restart limit.

If restart of the active server fails and HA Monitor detects that the active server's restart limit has been reached, HA Monitor terminates the active server in the active system. In the standby system, HA Monitor terminates the standby server that corresponds to the failed active server and then restarts the standby server to place it in the active server start wait state. The user must either start the standby server that is in the active server start wait state as the active server or terminate it. For details about how to operate a server that is in the wait state, see 7.4.1 Starting a server in the wait state and then restarting jobs.
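The three types of processing in (a) through (c) can be summarized with a small Python sketch. It only restates the sequences described in this subsection. The enum FailurePolicy, its member names, and the action strings are hypothetical labels and are not the actual values of the switchtype operand.

```python
from enum import Enum
from typing import List


class FailurePolicy(Enum):
    # Hypothetical labels for processing types (a), (b), and (c).
    HOT_STANDBY = "a"
    RESTART_THEN_HOT_STANDBY = "b"
    RESTART_THEN_OPERATOR = "c"


def actions_after_failure(policy: FailurePolicy,
                          restart_succeeded: bool = False) -> List[str]:
    """List the steps taken after a failure is detected on the active server."""
    if policy is FailurePolicy.HOT_STANDBY:
        # (a): no restart wait; switch over immediately.
        return ["terminate active server", "hot standby to standby system"]

    actions = ["wait for active server restart"]
    if restart_succeeded:
        # (b) and (c): a successful restart keeps jobs on the same host.
        actions.append("continue jobs on the same host")
    elif policy is FailurePolicy.RESTART_THEN_HOT_STANDBY:
        # (b): restart limit reached, then the same processing as (a).
        actions += ["terminate active server", "hot standby to standby system"]
    else:
        # (c): restart limit reached, then wait for the operator.
        actions += ["terminate active server",
                    "restart standby server into active server start wait state",
                    "wait for operator intervention"]
    return actions
```

Calling actions_after_failure(FailurePolicy.RESTART_THEN_OPERATOR) returns the sequence that ends with waiting for operator intervention, which matches the processing described in (c).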