Hitachi

For Linux(R) (x86) Systems HA Monitor Cluster Software


7.5.7 Handling device failures on a shared disk (while the active server is running) (using SCSI reservation for shared disk)

If an I/O error occurs, while the active server is running, on the device specified in the scsi_device or dmmp_device operand of the server environment definition, HA Monitor issues the KAMN725-W and KAMN726-E messages. In a redundant configuration with multipath software, HA Monitor issues the KAMN726-E message only if a failure occurs on all paths to the same shared disk.

This subsection explains the recovery procedure.

Note that when a device file name (/dev/sdx) is added due to replacement or addition of a host adapter (HBA) or addition of a shared disk (LU), you must terminate the active server in order to recover from the error.

  1. Determine the cause of the device failure.

    Determine the cause of the failure by referring to the KAMN725-W and KAMN726-E messages and the messages issued by the kernel, and by using hardware management tools.

  2. Terminate the standby server by using the standby server termination command (monsbystp command) (for a resource server, the standby resource server termination command (monressbystp command)).

    Make sure that you terminate the standby server before you start the recovery processing. Immediately after the recovery processing is completed, a reservation for the shared disk has not yet been obtained. If alive message transmission stops during this interval and hot standby processing is triggered, the shared disk might not be protected.

  3. Resolve the cause of the device failure.

    Resolve the cause of the device failure by taking appropriate action, such as by replacing the erroneous device.

    In a redundant configuration with multipath software, do not restore the failed path to online status (failback) at this time.

  4. Check the reservation status of the disk.

    The following shows the checking method:

    # /usr/bin/sg_persist --in --no-inquiry --read-reservation --device=device-name
    Legend:

    device-name: Value indicated in the KAMN725-W message. In a single-path configuration, or in a VMware ESXi-based virtualization environment (where DMMP is not used), this is a symbolic link. In a redundant configuration with multipath software, this is a physical device. If the device name displayed in the KAMN725-W message is followed by a path name, this is the path name.

    The following example specifies /dev/disk/by-id/scsi-360060e8010462fe004f2b6ae00000065 as the device name.

    Example
    # /usr/bin/sg_persist --in --no-inquiry --read-reservation --device=/dev/disk/by-id/scsi-360060e8010462fe004f2b6ae00000065
    Result 1:

    If the following text is displayed, check whether the Key value (0x1 in this example) is the same as the key that was output in the KAMN725-W message.

    PR generation=0xnn, Reservation follows:
      Key=0x1
      scope: LU_SCOPE,  type: Write Exclusive, registrants only

    If the value matches the key displayed in the KAMN725-W message, a reservation has been obtained. In a single-path configuration, in a VMware ESXi-based virtualization environment (where DMMP is not used), or in an HFC-PCM environment, go to step 9. In a redundant configuration with multipath software, go to step 5, because even if a reservation has been obtained, it might not have been obtained for the path that is to be recovered from the failure.

    If the value does not match the key displayed in the KAMN725-W message, hot standby processing might have already been performed on the remote host, or a reservation obtained by the remote host before the failure might still be in effect. The server running on the local host will issue the KAMN726-E message and will be forcibly terminated after a while. In such a case, skip step 5 and the subsequent steps, and perform the procedure described in (1) Handling when devices have already been reserved.

    Result 2:

    If the following message is issued, no reservation has been obtained. Go to step 5.

    PR generation=0xnn, there is NO reservation held
  5. Prepare for obtaining a reservation.

    The following shows how to prepare:

    # /usr/bin/sg_persist --out --no-inquiry --register-ignore --param-sark=key --device=device-name
    Legend:

    key, device-name: These are the values displayed in the KAMN725-W message. In a single-path configuration, or in a VMware ESXi-based virtualization environment (where DMMP is not used), the device name is a symbolic link, and in a redundant configuration with multipath software, it is a physical device. If the device name displayed in the KAMN725-W message is followed by a path name, this is the path name.

    The following example specifies /dev/disk/by-id/scsi-360060e8010462fe004f2b6ae00000065 as the device name.

    Example
    # /usr/bin/sg_persist --out --no-inquiry --register-ignore --param-sark=0x1 --device=/dev/disk/by-id/scsi-360060e8010462fe004f2b6ae00000065

    If nothing is displayed after execution, the preparation has been completed. Go to step 6.

    In a redundant configuration with multipath software, if a reservation has already been obtained in step 4, go to step 8.

    Note that the message shown below might be displayed. In this case also, the preparation has been completed, so go to step 6. In a redundant configuration with multipath software, if a reservation has already been obtained in step 4, go to step 8.

    persistent reserve out: scsi status: Reservation Conflict
    PR out: command failed
  6. Obtain a reservation.

    The following shows how to obtain a reservation. Make sure that you specify 5 for --prout-type.

    # /usr/bin/sg_persist --out --reserve --param-rk=key --prout-type=5 --device=device-name
    Legend:

    key, device-name: These are the values displayed in the KAMN725-W message. In a single-path configuration, or in a VMware ESXi-based virtualization environment (where DMMP is not used), the device name is a symbolic link, and in a redundant configuration with multipath software, it is a physical device. If the device name displayed in the KAMN725-W message is followed by a path name, this is the path name.

    The following example specifies /dev/disk/by-id/scsi-360060e8010462fe004f2b6ae00000065 as the device name.

    Example
    # /usr/bin/sg_persist --out --reserve --param-rk=0x1 --prout-type=5 --device=/dev/disk/by-id/scsi-360060e8010462fe004f2b6ae00000065

    If nothing is displayed after execution, a reservation has been obtained successfully. If any message is displayed, check the specified key and device name for any errors and then re-execute the command.

  7. Check whether a reservation has been obtained.

    The following shows how to check:

    # /usr/bin/sg_persist --in --no-inquiry --read-reservation --device=device-name
    Legend:

    device-name: This is the value indicated in the KAMN725-W message. In a single-path configuration, or in a VMware ESXi-based virtualization environment (where DMMP is not used), this is a symbolic link, and in a redundant configuration with multipath software, it is a physical device. If the device name displayed in the KAMN725-W message is followed by a path name, this is the path name.

    The following example specifies /dev/disk/by-id/scsi-360060e8010462fe004f2b6ae00000065 as the device name.

    Example
    # /usr/bin/sg_persist --in --no-inquiry --read-reservation --device=/dev/disk/by-id/scsi-360060e8010462fe004f2b6ae00000065

    If any text other than that shown below is displayed (for example, a message indicating that no reservation is held), review steps 5 and 6 and then re-execute the command.

    If the text shown below is displayed and the Key value is the same as the key displayed in the KAMN725-W message, a reservation has been obtained successfully. In a single-path configuration, in a VMware ESXi-based virtualization environment (where DMMP is not used), or in an HFC-PCM environment, go to step 9. In a redundant configuration with multipath software, go to step 8.

    PR generation=0xnn, Reservation follows:
      Key=0x1
      scope: LU_SCOPE,  type: Write Exclusive, registrants only
  8. In a redundant configuration with multipath software, restore the path that has been recovered from the failure to online status (failback).

    Restore a recovered path to online status (failback) by using the appropriate command provided by the multipath software (HDLM, DMMP, or HFC-PCM). For details about how to restore paths to online status, see the manual Hitachi Dynamic Link Manager Software User's Guide (for Linux(R) systems). Alternatively, see the documentation for DMMP or HFC-PCM.

    In a single-path configuration or in a VMware ESXi-based virtualization environment (where DMMP is not used), this step is not necessary.

  9. Restart the standby server that was terminated in step 2.
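
The check and branching in step 4 can be sketched as a pair of shell helpers. This is a minimal sketch under stated assumptions, not part of HA Monitor or sg3_utils: the function names classify_reservation and next_step and the configuration labels single and multipath are hypothetical. It also assumes that a key different from the one in the KAMN725-W message means that the disk is reserved by the remote host, and that keys are written in 0x-prefixed hexadecimal in both places.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only (not HA Monitor commands).

# classify_reservation KEY
# Reads the output of "sg_persist --in --read-reservation" on stdin and
# prints one of:
#   none      no reservation is held (Result 2 in step 4)
#   match     a reservation is held and its key equals KEY
#   mismatch  a reservation is held with a different key
#   unknown   the output was not recognized
classify_reservation() {
    expected=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    output=$(cat)
    case $output in
        *"there is NO reservation held"*)
            echo none
            return 0
            ;;
    esac
    # Extract the first "Key=0x..." value and lower-case its hex digits.
    held=$(printf '%s\n' "$output" |
        sed -n 's/^[[:space:]]*Key=\(0x[0-9A-Fa-f]*\).*$/\1/p' |
        head -n 1 | tr 'A-F' 'a-f')
    if [ -z "$held" ]; then
        echo unknown
    elif [ "$held" = "$expected" ]; then
        echo match
    else
        echo mismatch
    fi
}

# next_step STATE CONFIG
# STATE is the result of classify_reservation in step 4.
# CONFIG is "single" (single path, VMware ESXi without DMMP, or HFC-PCM)
# or "multipath" (any other redundant configuration).
next_step() {
    case "$1:$2" in
        match:single)    echo "step 9" ;;
        match:multipath) echo "step 5" ;;
        none:*)          echo "step 5" ;;
        mismatch:*)      echo "(1) Handling when devices have already been reserved" ;;
        *)               echo "undetermined"; return 1 ;;
    esac
}
```

For example, the output of the step 4 command could be piped into the first helper, with key and device-name taken from the KAMN725-W message:

    # /usr/bin/sg_persist --in --no-inquiry --read-reservation --device=device-name | classify_reservation key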