Hitachi

For Linux(R) (x86) Systems HA Monitor Cluster Software


7.4 Hot standby operation when a failure occurs

This section explains the actions to be taken by the operator after HA Monitor has detected a failure.

The procedure from detection of a failure to job recovery is explained below. During operation, you can verify from the messages output to syslog whether the system is running normally.

To take an appropriate action when an error message is issued:

  1. Check the error messages.

    Read the error messages that were output to syslog and identify what they report.

  2. Check the status of hosts and servers.

    Check whether HA Monitor has completed hot standby processing. Execute the server and host status display command (monshow command) and check whether one of the following is true:

    • The active server has restarted in the active system.

      You must check this if you have specified restart or manual in the switchtype operand in the server environment definition.

    • The active server has been switched over to the standby server in the standby system.

    If the active server is in the wait state and is not running, the job has terminated. Operator intervention is required to restart the job. For details about the operator intervention, see 7.4.1 Starting a server in the wait state and then restarting jobs.

  3. Make sure that the job has resumed without problems.

    Check that there are no problems with the job. The check items and methods depend on the nature of the job. For example, check the following:

    • Whether the server and client are able to communicate with each other

    • Whether the program used to execute the job is running correctly

  4. Resolve the cause of the failure at the host where the failure occurred.

    Resolve the cause of the failure on the basis of the messages that are output. If you need to manipulate shared resources, observe the notes in 7.2.3 Notes about maintaining shared resources. Also, collect HA Monitor error information, if necessary. For details, see 7.4.2 Collecting error information.

    The required actions and the failure handling procedures for the major causes of failures are explained in 7.5 Handling errors.

  5. Restart the host where the failure occurred as the standby system.

    If the failure occurred in the primary system, job execution continues in the secondary system. Restarting the host where the failure occurred as the standby system prepares you for a subsequent failure on the system that is currently executing jobs. For details, see 7.4.3 Restarting the host where a failure occurred as the standby system.

  6. Check the status of servers and hosts.

    Use HA Monitor commands to check that the server or host where the failure occurred has restarted successfully. For details, see 7.4.4 Checking the status of servers and hosts after handling an error.
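
Steps 1 and 2 of the procedure above can be partly scripted. The following is a minimal sketch in Python, not a part of HA Monitor. It assumes that HA Monitor message IDs in syslog have the form KAMN, a number, a hyphen, and a severity letter (for example, E for error); verify this against the messages on your system. The state labels used by next_action are hypothetical names chosen for this example and are not the text displayed by the monshow command.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: HA Monitor message IDs look like "KAMN<digits>-<severity letter>".
MESSAGE_ID = re.compile(r"\b(KAMN\d+)-([A-Z])\b")


def find_ha_messages(lines: Iterable[str],
                     severities: Tuple[str, ...] = ("E", "W")) -> List[Tuple[str, str]]:
    """Step 1: return (message ID, full line) for each syslog line whose
    HA Monitor message ID has one of the requested severity letters."""
    hits = []
    for line in lines:
        match = MESSAGE_ID.search(line)
        if match and match.group(2) in severities:
            hits.append((match.group(0), line.rstrip("\n")))
    return hits


def next_action(server_state: str) -> str:
    """Step 2: map an operator-determined server state to the next action.

    The state labels are hypothetical; the operator derives them by reading
    the monshow command output.
    """
    if server_state in ("restarted_in_active_system", "switched_to_standby_system"):
        return "Hot standby processing completed: go to step 3 and check the job."
    if server_state == "wait":
        return "Job terminated: operator intervention required (see 7.4.1)."
    return "State not recognized: recheck the monshow command output."


if __name__ == "__main__":
    # Illustrative lines only; real input would come from the syslog file.
    sample = [
        "Jan  1 00:00:00 host1 example: KAMN999-E illustrative error text\n",
        "Jan  1 00:00:01 host1 sshd: unrelated message\n",
    ]
    for message_id, text in find_ha_messages(sample):
        print(message_id, "|", text)
    print(next_action("wait"))
```

To scan a real log, pass an open file object to find_ha_messages. The location of the syslog file depends on the syslog configuration of the host.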

Organization of this section