Hitachi

For Linux(R) (x86) Systems HA Monitor Cluster Software


7.5.2 Handling server restart errors

This subsection explains how to handle server restart errors, that is, cases in which a server is set to restart automatically after a server failure but the restart fails. When to take action depends on the server's operating mode:

For a server in the server mode
  • Take action when the restart limit is detected after the server restart has been attempted as many times as specified.

For a server in the monitor mode
  • Take action when the server restart fails.

  • Take action when program restart fails.

(1) For a server in the server mode

This subsection explains the handling methods that apply in either of the cases noted below. If switch or restart is specified in the switchtype operand in the server environment definition, there is no need for the operator to take an action because HA Monitor performs hot standby processing.

When either of these conditions is applicable, the active server in the active system (where the failure occurred) is terminated and the standby server in the standby system is placed in the active server start wait state.

Action

Because the standby server is now in the active server start wait state, start it as the active server to resume the jobs.

To do this:

  1. Check the status of the active system.

    Check the following:

    • The active server has been terminated.

    • The shared resources that are used by the server have been disconnected.

  2. Start the standby server in the standby system as the active server.

    Execute the wait-state server start command (monact command) to start the standby server in the active server start wait state as the active server.

  3. Verify that the active server has started.

    Check that the KAMN251-I message has been issued.

  4. Resolve the cause of the server failure in the system where the failure occurred.

  5. From the active system, start the standby server.

    Execute the start command provided by the program.

To check after taking action:

  1. Verify that the system is ready for hot standby processing.

    Either of the following verifies this:

    • The KAMN252-I message has been issued.

    • The server and host status display command (monshow command) displays ONL as the active server's status and SBY as the standby server's status.
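The two readiness conditions above can be expressed as a small check. The following is a minimal sketch and not part of HA Monitor: the function name and its parameters are hypothetical, and it assumes that you have already collected the issued message lines and read the two status values from the monshow output by some other means.

```python
def ready_for_hot_standby(issued_messages, active_status=None, standby_status=None):
    """Return True if either readiness condition described above holds.

    issued_messages: iterable of message lines already collected (assumption).
    active_status / standby_status: status strings read from the monshow
    output (assumption; this sketch does not parse monshow output itself).
    """
    # Condition 1: the KAMN252-I message has been issued.
    if any("KAMN252-I" in line for line in issued_messages):
        return True
    # Condition 2: monshow shows ONL for the active server and SBY for the standby server.
    return active_status == "ONL" and standby_status == "SBY"


print(ready_for_hot_standby([], active_status="ONL", standby_status="SBY"))
```

Either condition alone is sufficient, so the function returns as soon as the message is found and consults the status values only otherwise.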

(2) For a server in the monitor mode

This subsection explains how to handle restart errors for a server in the monitor mode.

(a) If restart of the active server fails

If you are running servers in the monitor mode, the following might be the causes of an active server restart error:

  • The server start command executed by HA Monitor has failed.

  • The server termination command executed by HA Monitor has failed.

The handling methods in this subsection apply to the cases below. If a different message has been issued, the cause is probably different; take action according to the displayed message.

  • The server is set to be restarted in the event of a server failure.

  • The KAMN273-E message has been issued.

If the server start command specified in the name or actcommand operand in the server environment definition returns a nonzero value during server restart, restart processing is canceled and the server termination command specified in the termcommand operand in the server environment definition is not executed.

Action

Correct the server start command, the server termination command, or the server environment definition, and then restart the server.

  1. Resolve the cause of the error.

    Resolve the cause of the error on the basis of the error code displayed in the KAMN273-E message. For details about the error codes, see Table 7-5 List of error codes that can be displayed in the KAMN273-E message (if restart of the active server fails).

  2. From the standby system, start the active server.

    You can use either of the following methods:

    • In the active system (where the failure occurred), execute the monitor-mode server termination command (monend command) to terminate the active server, and then in the standby system, execute the monitor-mode server start command (monbegin command) to start the active server.

    • In the standby system, execute the monitor-mode server start command (monbegin command) to start the standby server, and then in the active system, execute the server hot-standby switchover command (monswap command) to perform planned hot-standby switchover to the standby system.

  3. From the active system, start the standby server.

    Execute the monitor-mode server start command (monbegin command) to start the standby server.
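The command order in steps 2 and 3 can be written down as data before anything is executed. The following is a rough sketch only: the method labels and the alias `server1` are hypothetical, passing a server alias name as the single argument of monend, monbegin, and monswap is an assumption to be checked against the command reference, and "active system" is read as the system where the failure occurred. The sketch builds the step list and runs nothing.

```python
FAILED = "system where the failure occurred"
STANDBY = "standby system"


def recovery_steps(method, alias):
    """Return ordered (system, command line) pairs for steps 2 and 3 above.

    method: "terminate-then-start" or "start-then-swap" (hypothetical labels).
    alias:  server alias name; the argument form of each command is an assumption.
    """
    if method == "terminate-then-start":
        # Terminate the active server, then start it in the standby system.
        steps = [(FAILED, f"monend {alias}"), (STANDBY, f"monbegin {alias}")]
    elif method == "start-then-swap":
        # Start the standby server, then perform a planned hot-standby switchover.
        steps = [(STANDBY, f"monbegin {alias}"), (FAILED, f"monswap {alias}")]
    else:
        raise ValueError(f"unknown method: {method!r}")
    # Step 3: start the standby server in the system where the failure occurred.
    steps.append((FAILED, f"monbegin {alias}"))
    return steps


for system, command in recovery_steps("start-then-swap", "server1"):
    print(f"{system}: {command}")
```

Keeping the plan as a list makes it easy to review which system each command must run on before an operator executes it by hand.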

To check after taking action:

  1. Verify that the system is ready for hot standby processing.

    Either of the following verifies this:

    • The KAMN252-I message has been issued.

    • The server and host status display command (monshow command) displays ONL as the active server's status and SBY as the standby server's status.

The following table lists the error codes that are displayed in the KAMN273-E message and describes the actions to be taken.

Table 7‒5: List of error codes that can be displayed in the KAMN273-E message (if restart of the active server fails)

| Cause code | Detail code | Explanation of code | Action |
|---|---|---|---|
| 1 | errno for the system call | A system error occurred when the server start command was executed. | Resolve the cause of the system call error. |
| 2 | Command's return value | The server start command returned a nonzero value. | Check the specified server start command and correct it. |
| 2 | 126 | Execution permissions for the file specified in the name or actcommand operand are not granted. | Grant execution permissions to the server start command. |
| 2 | 127 | The file specified in the name or actcommand operand was not found. | Check that the value specified in the name or actcommand operand in the server environment definition matches the storage location of the server start command. |
| 3 | errno for the system call | A system error occurred when the server termination command was executed. | Resolve the cause of the system call error. |
| 4 | Command's return value | The server termination command returned a nonzero value. | Check the specified server termination command and correct it. |
| 4 | 126 | The server termination command does not have the execution permissions for the file specified in the termcommand operand. | Grant execution permissions to the server termination command that was specified in the termcommand operand in the server environment definition. |
| 4 | 127 | The file specified in the termcommand operand was not found. | Check that the value specified in the termcommand operand in the server environment definition matches the storage location of the server termination command. |
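The table can also be read as a two-level lookup: the cause code selects the command involved, and the detail code selects the action. The following is a minimal sketch of that reading. The function name is hypothetical, the returned strings paraphrase the Action column, and how the two codes are extracted from the KAMN273-E message text is left out because the message layout is not shown here.

```python
def kamn273_active_action(cause_code, detail_code):
    """Map a KAMN273-E cause/detail pair (active server restart) to an action."""
    # Cause codes 1 and 2 concern the server start command; 3 and 4 the termination command.
    commands = {1: "start", 2: "start", 3: "termination", 4: "termination"}
    if cause_code not in commands:
        raise ValueError(f"cause code {cause_code} is not listed in the table")
    command = commands[cause_code]
    if cause_code in (1, 3):
        # For these causes the detail code is the errno of the failed system call.
        return f"Resolve the cause of the system call error (errno {detail_code})."
    if detail_code == 126:
        return f"Grant execution permissions to the server {command} command."
    if detail_code == 127:
        return f"Check that the operand value matches the location of the server {command} command."
    # Any other detail code is the command's own nonzero return value.
    return f"Check the server {command} command and correct it (return value {detail_code})."


print(kamn273_active_action(4, 127))
```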

(b) If restart of the standby server fails

If you are running servers in the monitor mode, restart of the standby server is canceled and the standby server stops in the following cases:

  • HA Monitor failed to execute any of the following commands:

    • The command specified for the sby_actcommand operand in the server environment definition

    • The command specified for the sby_termcommand operand in the server environment definition

  • HA Monitor executed either of the following commands and the command returned a nonzero value:

    • The command specified for the sby_actcommand operand in the server environment definition

    • The command specified for the sby_termcommand operand in the server environment definition

Action

Use the following procedure to restart the standby server:

  1. Resolve the cause of the error.

    Resolve the cause of the error on the basis of the error code displayed in the KAMN273-E message. For details about the error codes, see Table 7-6 List of error codes that can be displayed in the KAMN273-E message (if restart of the standby server fails).

  2. From the standby system, start the standby server.

    In the standby system (where the failure occurred), execute the monbegin command to start the standby server.

To check after taking action:

  1. Verify that the system is ready for hot standby processing.

    Either of the following verifies this:

    • The KAMN252-I message has been issued.

    • The server and host status display command (monshow command) displays ONL as the active server's status and SBY as the standby server's status.

The following table lists the error codes that are displayed in the KAMN273-E message and describes the actions to be taken.

Table 7‒6: List of error codes that can be displayed in the KAMN273-E message (if restart of the standby server fails)

| Cause code | Detail code | Explanation of code | Action |
|---|---|---|---|
| 1 | errno for the system call | A system error occurred during execution of the command specified for the sby_actcommand operand. | Resolve the cause of the system call error. |
| 2 | Command's return value | The command specified for the sby_actcommand operand returned a value other than 0 as a return value. | Check and correct the specified command. |
| 2 | 126 | Execution permissions are not granted to the command specified for the sby_actcommand operand. | Grant execution permissions to the command. |
| 2 | 127 | The command specified for the sby_actcommand operand does not exist. | Check whether the location of the command is correct. |
| 3 | errno for the system call | A system error occurred during execution of the command specified for the sby_termcommand operand. | Resolve the cause of the system call error. |
| 4 | Command's return value | The command specified for the sby_termcommand operand returned a value other than 0 as a return value. | Check and correct the specified command. |
| 4 | 126 | Execution permissions are not granted to the command specified for the sby_termcommand operand. | Grant execution permissions to the command. |
| 4 | 127 | The command specified for the sby_termcommand operand does not exist. | Check whether the location of the command is correct. |
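Detail codes 126 and 127 describe conditions that can be checked before the standby server is restarted. The following is a minimal sketch of such a pre-check. The function name and the example paths are hypothetical, and the sketch assumes that each operand holds a plain file path with no arguments. It checks permissions for the user running the script, which may differ from the user under which HA Monitor runs the command.

```python
import os


def predict_detail_code(command_path):
    """Return 127 if the file is missing, 126 if it is not executable, else None."""
    if not os.path.isfile(command_path):
        return 127  # corresponds to "the command ... does not exist"
    if not os.access(command_path, os.X_OK):
        return 126  # corresponds to "execution permissions are not granted"
    return None


# Hypothetical paths; substitute the values from your own server environment definition.
operands = {
    "sby_actcommand": "/opt/example/sby_start.sh",
    "sby_termcommand": "/opt/example/sby_stop.sh",
}
for operand, path in operands.items():
    print(operand, path, predict_detail_code(path))
```

A result of None only means that neither of the two file-level conditions was found; the command can still return a nonzero value of its own when it runs.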

(c) When program restart fails

If you create a program restart command when using the monitor-mode program management function, HA Monitor can restart the program alone, without restarting the server.

The following might be the causes of a program restart error:

  • HA Monitor failed to execute the program restart command.

  • The program restart command timed out before its processing was completed.

  • The UAP timed out before it issued the hamon_patrolstart function.

The handling methods in this subsection apply to the cases below. If a different message has been issued, the cause is probably different; take action according to the displayed message.

  • The program has been set to be restarted in the event of a program error.

  • The KAMN285-E message has been issued.

Action

Correct the monitor-mode program environment definition and then restart the server. Alternatively, correct the program restart command and then restart the program.

  1. Resolve the cause of the error.

    Resolve the cause of the error on the basis of the error code displayed in the KAMN285-E message. For details about the error codes, see Table 7-7 List of error codes displayed in the KAMN285-E message.

  2. Restart the program in the active system.

    You can use either of the following methods:

    • Restart the server. Execute the monitor-mode server termination command (monend command) to terminate the active server, and then execute the monitor-mode server start command (monbegin command) to start the active server.

    • Manually restart the UAP that resulted in the error.

To check after taking action:

  1. Verify that the program has started successfully.

    Either of the following verifies this:

    • The KAMN284-I message has been issued.

    • The server and host status display command (monshow -u command) displays ACT as the program status.

The following table lists the error codes that are displayed in the KAMN285-E message and describes the actions to be taken.

Table 7‒7: List of error codes displayed in the KAMN285-E message

| Cause code | Detail code | Explanation of code | Action |
|---|---|---|---|
| 1 | 1 | The program restart command was not completed before a timeout occurred. | Resolve the cause so that the program restart command will be completed. |
| 1 | 2 | The UAP did not issue the hamon_patrolstart function before a timeout occurred. | Resolve the cause so that the UAP issues the hamon_patrolstart function before a timeout occurs. |
| 1 | 3 | The program restart command was not completed and the UAP did not issue the hamon_patrolstart function before a timeout occurred. | Resolve the cause so that the program restart command is completed and the UAP issues the hamon_patrolstart function before a timeout occurs. |
| 2 | Command's return value | The program restart command returned a nonzero value. | Check the specified program restart command and correct it. |
| 2 | 126 | The program restart command does not have the execution permissions for the file specified in the restartcommand operand. | Grant execution permissions to the program restart command that was specified in the restartcommand operand in the monitor-mode program environment definition. |
| 2 | 127 | The file specified in the restartcommand operand was not found. | Check that the value specified in the restartcommand operand in the monitor-mode program environment definition matches the storage location of the program restart command. |
| 2 | 256 | A system error occurred while the program restart command was executing. | Eliminate the cause of the system call error. |
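The KAMN285-E codes can be handled with the same kind of lookup. The following is a minimal sketch under two assumptions: the function name is hypothetical, and detail codes 126, 127, and 256 are grouped under cause code 2, which is how the table reads here. The returned strings paraphrase the Action column.

```python
# Cause code 1: timeouts. The detail code says what failed to finish in time.
TIMEOUT_ACTIONS = {
    1: "Make sure the program restart command completes before the timeout.",
    2: "Make sure the UAP issues hamon_patrolstart before the timeout.",
    3: "Make sure the restart command completes and the UAP issues hamon_patrolstart before the timeout.",
}

# Cause code 2: detail codes with a specific meaning; anything else is a plain return value.
COMMAND_ACTIONS = {
    126: "Grant execution permissions to the program restart command.",
    127: "Check that the restartcommand operand matches the location of the program restart command.",
    256: "Eliminate the cause of the system call error.",
}


def kamn285_action(cause_code, detail_code):
    """Map a KAMN285-E cause/detail pair to an action."""
    if cause_code == 1:
        if detail_code not in TIMEOUT_ACTIONS:
            raise ValueError(f"detail code {detail_code} is not listed for cause code 1")
        return TIMEOUT_ACTIONS[detail_code]
    if cause_code == 2:
        return COMMAND_ACTIONS.get(
            detail_code, "Check the specified program restart command and correct it."
        )
    raise ValueError(f"cause code {cause_code} is not listed in the table")


print(kamn285_action(1, 3))
```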