Hitachi Advanced Database Setup and Operation Guide


16.16.10 Hardware-related problems (when the multi-node function is being used)

This subsection explains the steps to take when a hardware-related problem occurs.

(1) Steps to take when a disk failure occurs

For details about the action to take when a disk failure occurs, see 15.4 Disk-related problems.

(2) Steps to take when the disk becomes full

(a) When the space shortage is caused by the DB area files

When the disk becomes full because of an increase in the size of the DB area files, take the necessary steps based on the explanations in the following subsections:

(b) When the space shortage is caused by a factor other than the DB area files

(3) Steps to take when a communication failure, CPU failure, or power supply failure occurs

This subsection explains the steps to take when a communication failure, CPU failure, or power supply failure occurs.

(a) When a communication failure occurs between the HADB client and the HADB server on the master node

In this case, the applicable transaction returns an error indicating that a communication failure has occurred. If this happens, perform the following procedure.

Procedure:

  1. Separate the master node where the communication failure occurred from the multi-node configuration

    Switch over the master node by executing a command. The master node where the communication failure occurred is then separated from the multi-node configuration. For details about how to switch over the master node by using a command, see 16.7 Switching over the master node by using a command.

  2. Investigate and address the cause of the communication failure

  3. Return the node you separated in step 1 to the multi-node configuration

    For details about how to return nodes to a multi-node configuration, see 16.15.3 Returning a node to the multi-node configuration.

(b) When a communication failure occurs between the HADB client and the HADB server on a slave node

In this case, the applicable transaction returns an error indicating that a communication failure has occurred.

Investigate the cause of the communication failure. If you cannot take corrective action immediately, terminate the HADB server on the slave node where the failure occurred.
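The responses in (a) and (b) can be summarized as a small lookup, shown here as a minimal Python sketch. The names `NodeRole` and `client_comm_failure_steps` are hypothetical and are not part of HADB. The function only returns the operator steps as text and does not run any HADB command.

```python
from enum import Enum


class NodeRole(Enum):
    """Role of the node whose HADB server lost contact with the HADB client."""
    MASTER = "master"
    SLAVE = "slave"


def client_comm_failure_steps(role: NodeRole, can_fix_immediately: bool = True) -> list[str]:
    """Return the operator steps for a client-to-server communication failure.

    Illustrative helper only: the step wording is taken from this subsection.
    """
    if role is NodeRole.MASTER:
        # Master node: switch over first, fix the cause, then return the node.
        return [
            "Switch over the master node by using a command (see 16.7)",
            "Investigate and address the cause of the communication failure",
            "Return the separated node to the multi-node configuration (see 16.15.3)",
        ]
    # Slave node: investigate, and stop the HADB server only if no quick fix exists.
    steps = ["Investigate the cause of the communication failure"]
    if not can_fix_immediately:
        steps.append("Terminate the HADB server on the slave node where the failure occurred")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(client_comm_failure_steps(NodeRole.MASTER), start=1):
        print(f"{number}. {step}")
```

The `can_fix_immediately` flag is consulted only for a slave node, because the master-node procedure always starts with a switchover.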

(c) When a communication failure occurs between nodes

The following explains the action to take when a communication failure occurs between nodes. The action to take differs depending on whether you are using host reset or SCSI reservation for shared disk.

  • When using host reset

    If a communication error occurs between nodes, HA Monitor detects a node failure. At this time, HA Monitor decides which nodes to separate from the multi-node configuration. If the master node is one of the nodes that are separated from the multi-node configuration, the master node is switched over.

    The server machines of nodes that are separated from the multi-node configuration are stopped by HA Monitor (the power for the server machines is shut off). If a server machine does not stop, execute the HA Monitor monswap command, and then stop the server machine manually.

    Then, restart the server machine, and investigate and address the cause of the communication failure.

    After taking the aforementioned steps, return the relevant nodes to the multi-node configuration as necessary. For details about how to return nodes to a multi-node configuration, see 16.15.3 Returning a node to the multi-node configuration.

  • When using SCSI reservation for shared disk

    If a communication error occurs between nodes, HA Monitor detects a node failure. At this time, HA Monitor decides which nodes to separate from the multi-node configuration. If the master node is one of the nodes that are separated from the multi-node configuration, the master node is switched over. The server machines of nodes that are separated from the multi-node configuration are not stopped.

    In this case, investigate and address the cause of the communication failure. After taking the aforementioned steps, return the relevant nodes to the multi-node configuration as necessary. For details about how to return nodes to a multi-node configuration, see 16.15.3 Returning a node to the multi-node configuration.

    If the failure is on the monitoring path of HA Monitor, you need to recover the path by using the monlink command of HA Monitor. For details, see Hot standby operation when a failure occurs in the manual HA Monitor for Linux(R) (x86).
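The difference between the two configurations in (c) can be sketched as a checklist generator. This is an illustrative Python sketch only: `Fencing` and `internode_failure_steps` are hypothetical names, and the function only produces text. The `monswap` and `monlink` commands are named as in the text above, and no command options are assumed. The sketch also assumes that the `monlink` recovery applies to the SCSI reservation case, following the layout of this subsection.

```python
from enum import Enum


class Fencing(Enum):
    """How separated nodes are isolated, as distinguished in (c)."""
    HOST_RESET = "host reset"
    SCSI_RESERVATION = "SCSI reservation for shared disk"


def internode_failure_steps(
    fencing: Fencing,
    machine_stopped: bool = True,
    monitoring_path_failed: bool = False,
) -> list[str]:
    """Return the operator steps after HA Monitor has separated nodes.

    Illustrative helper only: the step wording is taken from this subsection.
    """
    steps: list[str] = []
    if fencing is Fencing.HOST_RESET:
        # With host reset, HA Monitor normally powers off the separated machines.
        if not machine_stopped:
            steps.append("Execute the HA Monitor monswap command, then stop the server machine manually")
        steps.append("Restart the server machine")
    # With SCSI reservation the machines keep running, so there is no restart step.
    steps.append("Investigate and address the cause of the communication failure")
    steps.append("Return the relevant nodes to the multi-node configuration as necessary (see 16.15.3)")
    if fencing is Fencing.SCSI_RESERVATION and monitoring_path_failed:
        steps.append("Recover the monitoring path by using the HA Monitor monlink command")
    return steps


if __name__ == "__main__":
    for step in internode_failure_steps(Fencing.HOST_RESET, machine_stopped=False):
        print("-", step)
```

The steps are listed in the order in which the text introduces them. The manual HA Monitor for Linux(R) (x86) remains the authority on when to run `monlink`.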

(d) When a CPU failure occurs

If a CPU failure occurs on a slave node, that slave node is separated from the multi-node configuration.

If a CPU failure occurs on the master node, the master node is switched over. The master node where the CPU failure occurred is separated from the multi-node configuration.

After restarting the OS, investigate and address the cause of the problem. After addressing the problem, return the separated nodes to the multi-node configuration as necessary. For details about how to return nodes to a multi-node configuration, see 16.15.3 Returning a node to the multi-node configuration.

(e) When a power supply failure occurs

If a power supply failure occurs on a slave node, that slave node is separated from the multi-node configuration.

If a power supply failure occurs on the master node, the master node is switched over. The master node where the power supply failure occurred is separated from the multi-node configuration.

After restarting the OS, investigate and address the cause of the problem. After addressing the problem, return the separated nodes to the multi-node configuration as necessary. For details about how to return nodes to a multi-node configuration, see 16.15.3 Returning a node to the multi-node configuration.
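Because (d) and (e) describe the same pattern, the outcome can be sketched with one function. This is a minimal Python sketch, and `hardware_failure_response` is a hypothetical name that is not part of HADB or HA Monitor. It separates what the text says happens to the node from what the operator does afterward.

```python
def hardware_failure_response(failed_node_is_master: bool) -> dict[str, list[str]]:
    """Summarize a CPU failure or power supply failure on one node.

    Illustrative helper only: the wording is taken from (d) and (e).
    """
    effects: list[str] = []
    if failed_node_is_master:
        # Only a failure on the master node triggers a switchover.
        effects.append("The master node is switched over")
    effects.append("The failed node is separated from the multi-node configuration")

    # The follow-up is the same for master and slave nodes.
    operator_steps = [
        "Restart the OS",
        "Investigate and address the cause of the problem",
        "Return the separated nodes to the multi-node configuration as necessary (see 16.15.3)",
    ]
    return {"effects": effects, "operator_steps": operator_steps}


if __name__ == "__main__":
    response = hardware_failure_response(failed_node_is_master=True)
    for key, items in response.items():
        print(key)
        for item in items:
            print("  -", item)
```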