Hitachi

For Linux(R) (x86) Systems HA Monitor Cluster Software


7.5.8 Handling device failures on a shared disk (while the active server is terminating) (using SCSI reservation for shared disk)

If an I/O error occurs on the device specified in the scsi_device or dmmp_device operand in the server environment definition while the active server is terminating, HA Monitor issues the KAMN725-W and KAMN726-E messages and terminates the OS. In a redundant configuration with multipath software, HA Monitor issues the KAMN726-E message only when all paths to the same shared disk have failed. If only some of the paths to the same shared disk have failed, HA Monitor resumes the server termination processing without terminating the OS.
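
The behavior described above can be sketched as a small decision function. This is an illustrative model only, not HA Monitor code: the names evaluate_paths and Outcome are hypothetical, and the sketch assumes that the KAMN725-W message is still issued when only some of the paths have failed, which the text above implies but does not state outright.

```python
from enum import Enum


class Outcome(Enum):
    NO_FAILURE = "no failure"
    RESUME_TERMINATION = "resume server termination processing"
    TERMINATE_OS = "terminate the OS"


def evaluate_paths(path_failed):
    """Model the reaction to I/O errors during active server termination.

    path_failed holds one boolean per path to the same shared disk
    (True means an I/O error occurred on that path). A single-path
    configuration is a list with one element.
    Returns a tuple (messages, outcome).
    """
    if not path_failed:
        raise ValueError("at least one path is required")
    if not any(path_failed):
        return [], Outcome.NO_FAILURE
    if all(path_failed):
        # Every path to the shared disk has failed: both messages, OS terminated.
        return ["KAMN725-W", "KAMN726-E"], Outcome.TERMINATE_OS
    # Some but not all paths failed: KAMN726-E is not issued.
    # Assumption: KAMN725-W is still issued for the failed path.
    return ["KAMN725-W"], Outcome.RESUME_TERMINATION


if __name__ == "__main__":
    print(evaluate_paths([True, False]))
    print(evaluate_paths([True, True]))
```

In this model a single-path configuration is the special case where one failed path already means that all paths have failed, so the OS is terminated.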

To recover from the device failure after the active server has been terminated:

  1. Determine the cause of the device failure.

    Determine the cause of the failure by referring to the KAMN725-W and KAMN726-E messages and the messages issued by the kernel, and by using hardware management tools.

  2. Resolve the cause of the device failure.

    Take appropriate action to resolve the cause, such as replacing the faulty device.

    In a redundant configuration with multipath software, do not restore the failed path to online status (failback) at this time.

  3. In a multipath configuration, restore the recovered path to online status (failback).

    Restore the recovered path to online status (failback) by using the appropriate command provided by the multipath software (HDLM, DMMP, or HFC-PCM). For details about how to restore paths to online status with HDLM, see the manual Hitachi Dynamic Link Manager Software User's Guide (for Linux(R) systems). For DMMP or HFC-PCM, see the documentation for that software.

    For a single-path configuration, or in a VMware ESXi-based virtualization environment (where DMMP is not used), this step is not necessary.

  4. Cancel the reservation.

    If the failure occurred while the active server was terminating normally, cancel the reservation as described in 7.5.12 Canceling SCSI reservation for shared disk.

    If the failure occurred while planned termination or abnormal termination of the active server was underway, there is no need to cancel the reservation because the standby server forcibly obtains a reservation.
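
Which of the steps above apply depends on the path configuration and on how the active server was terminating. The following sketch builds the applicable checklist. It is a planning aid under stated assumptions, not an HA Monitor interface: the function recovery_steps and its parameters (multipath, termination, esxi_without_dmmp) are hypothetical names introduced for this example.

```python
def recovery_steps(multipath, termination, esxi_without_dmmp=False):
    """Return the recovery steps that apply after the active server terminated.

    multipath         -- True for a redundant configuration with multipath software
    termination       -- "normal", "planned", or "abnormal"
    esxi_without_dmmp -- True for a VMware ESXi-based environment where DMMP is not used
    """
    if termination not in ("normal", "planned", "abnormal"):
        raise ValueError("termination must be 'normal', 'planned', or 'abnormal'")

    # Steps 1 and 2 always apply.
    steps = [
        "Determine the cause of the device failure.",
        "Resolve the cause of the device failure.",
    ]
    # Step 3 is skipped for single-path configurations and for
    # ESXi environments where DMMP is not used.
    if multipath and not esxi_without_dmmp:
        steps.append("Restore the recovered path to online status (failback).")
    # Step 4 applies only to normal termination; for planned or abnormal
    # termination the standby server forcibly obtains the reservation.
    if termination == "normal":
        steps.append("Cancel the reservation (see 7.5.12).")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(recovery_steps(True, "normal"), start=1):
        print(number, step)
```

For example, a single-path server that failed during planned termination needs only the first two steps, while a multipath server that failed during normal termination needs all four.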