Hitachi

For Linux(R) (x86) Systems HA Monitor Cluster Software


7.5.6 Handling device failures on a shared disk (while the active server is starting) (using SCSI reservation for shared disk)

This subsection explains how to handle failures that occur, while the active server is starting, on a device that contains a shared disk to be reserved.

The failures discussed here are listed below. Note that the device referred to here is the device specified in the scsi_device or dmmp_device operand in the server environment definition.

For details about the handling of I/O errors that have occurred on some of the devices, see 7.5.7 Handling device failures on a shared disk (while the active server is running) (using SCSI reservation for shared disk).


(1) Handling when devices have already been reserved

If some or all of the devices have already been reserved, HA Monitor issues the KAMN725-W and KAMN726-E messages and the server startup processing fails. If this happens, determine how the reservation occurred, and then take appropriate action.

The following explains possible causes and the actions to be taken.

Cause 1
  • When the active server was terminated, a reservation could not be released for a reason such as an I/O error (the KAMN726-E message is issued)

  • A host failure occurred on the active system while hot standby was not possible, such as when the standby server was inactive

  • The OS was shut down while the active server was running

  • When hot standby processing was performed due to a host failure, the active server was terminated before the failed host and the standby server were started

Action 1

Release the reservation by using the reservation release command (monscsiclr command) or by following the procedure described in 7.5.11 Dealing with the server that does not release the reservation (using SCSI reservation for shared disk), and then restart the server.

Cause 2
  • A server for which the same shared disk is defined is running as the active server due to an HA Monitor definition error

  • The device has been reserved by a program other than HA Monitor

Action 2

If the device was reserved for one of these reasons or the cause is unknown, take appropriate action according to the figure shown below, then restart the server.

The numbers in the figure correspond to the step numbers in the detailed explanation below.

Figure 7‒4: Determining the cause and taking appropriate action when devices are already reserved

[Figure]

  1. Is the host address indicated by the KAMN725-W message the same as the address operand value in the local host's HA Monitor environment settings?

    Yes: Go to step 2

    No: Go to step 7

  2. Was the KAMN728-W message issued when the server indicated by the KAMN725-W message was stopped?

    Yes: Release the reservation by using the reservation release command (monscsiclr command) or by referencing 7.5.11 Dealing with the server that does not release the reservation (using SCSI reservation for shared disk).

    No: Go to step 3

  3. Did any one of the following events occur on the local host when the server indicated by the server alias name in the KAMN725-W message was running as the active server?

    • OS shutdown

    • Cancellation of reservation failed when the server indicated by the server alias name in the KAMN725-W message was terminating (the KAMN725-W message was issued).

    • The KAMN725-W message was issued due to a resource disconnection error on one of the servers in the active system and then HA Monitor was terminated.

    Yes: Release the reservation by using the reservation release command (monscsiclr command) or by referencing 7.5.11 Dealing with the server that does not release the reservation (using SCSI reservation for shared disk).

    No: Go to step 4

  4. Is there an HA Monitor (including one in a test environment) on a host that is not connected to the local host?

    Yes: Go to step 5

    No: Go to step 6

  5. Is the scsi_device or dmmp_device operand value in the server environment definition the same as that for the local host?

    Yes: Check the shared disk configuration and then correct the value of the scsi_device or dmmp_device operand.

    No: Go to step 6

  6. Have the devices been reserved by a software program other than HA Monitor?

    Yes: Terminate the program that reserved the devices.

    No: Cancel the reservation by referencing 7.5.12 Canceling SCSI reservation for shared disk.

  7. Was the KAMN728-W message issued when the server indicated by the KAMN725-W message was stopped on the system indicated by the KAMN725-W message?

    Yes: Release the reservation by using the reservation release command (monscsiclr command) or by referencing 7.5.11 Dealing with the server that does not release the reservation (using SCSI reservation for shared disk).

    No: Go to step 8

  8. Did any one of the following events occur on the active system when the server indicated by the server alias name in the KAMN725-W message was running as the active server on the host indicated by the host address in the same message?

    • OS shutdown

    • Cancellation of reservation failed when the server indicated by the server alias name in the KAMN725-W message was terminating (the KAMN725-W message was issued).

    • The KAMN725-W message was issued due to a resource disconnection error on one of the servers in the active system and then HA Monitor was terminated.

    Yes: Release the reservation by using the reservation release command (monscsiclr command) or by referencing 7.5.11 Dealing with the server that does not release the reservation (using SCSI reservation for shared disk).

    No: Go to step 9

  9. Is the device name specified in the scsi_device or dmmp_device operand for the host containing the server indicated by the server alias name in the KAMN725-W message the same as the scsi_device or dmmp_device operand value for the host indicated by the host address in the same message?

    Yes: Go to step 4

    No: Correct the scsi_device or dmmp_device operand values so that they are the same for both hosts.
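The nine steps above form a small decision graph. As a reading aid, the following is a minimal Python sketch of that graph, assuming the operator supplies a yes/no answer for each step. The names `FLOW`, `decide`, and the action constants are hypothetical and are not part of HA Monitor; the sketch only tells you which action the steps lead to and does not run any HA Monitor command.

```python
from typing import Callable, List, Tuple

# Terminal actions, paraphrased from the steps above (hypothetical constants).
RELEASE = "Release the reservation (monscsiclr command or the procedure in 7.5.11)"
FIX_OPERAND = "Check the shared disk configuration and correct scsi_device or dmmp_device"
STOP_PROGRAM = "Terminate the program that reserved the devices"
CANCEL = "Cancel the reservation by following 7.5.12"
ALIGN_OPERANDS = "Make scsi_device or dmmp_device the same for both hosts"

# step number -> (outcome for Yes, outcome for No)
# An int is the next step to go to; a str is the action to take.
FLOW = {
    1: (2, 7),
    2: (RELEASE, 3),
    3: (RELEASE, 4),
    4: (5, 6),
    5: (FIX_OPERAND, 6),
    6: (STOP_PROGRAM, CANCEL),
    7: (RELEASE, 8),
    8: (RELEASE, 9),
    9: (4, ALIGN_OPERANDS),
}


def decide(answer: Callable[[int], bool]) -> Tuple[str, List[Tuple[int, bool]]]:
    """Walk the steps from step 1 and return (action, visited steps).

    `answer(step)` must return True for Yes and False for No.
    The graph has no cycles, so the loop always ends in an action.
    """
    step = 1
    path: List[Tuple[int, bool]] = []
    while True:
        yes = answer(step)
        path.append((step, yes))
        outcome = FLOW[step][0 if yes else 1]
        if isinstance(outcome, str):
            return outcome, path
        step = outcome


# Example: the reservation was made on the local host, none of the release
# conditions in steps 2 and 3 applies, and an unconnected host defines the
# same device.
answers = {1: True, 2: False, 3: False, 4: True, 5: True}
action, path = decide(answers.__getitem__)
print(action)
```

Keeping the transitions in a table makes it easy to compare the sketch against the figure step by step. Whichever action results, restart the server afterward, as described in Action 2.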

(2) Handling when I/O errors have occurred on all devices

If I/O errors occur on all of the devices, HA Monitor issues the KAMN725-W and KAMN726-E messages and the server startup processing fails. To handle I/O errors:

  1. Determine the causes of the I/O errors.

    To do so, check the KAMN725-W and KAMN726-E messages and the messages issued by the kernel, and use hardware management tools.

  2. Resolve the causes of the I/O errors.

    Take measures such as replacing faulty devices.

  3. If you are using a multi-path configuration, restore the paths recovered from the I/O errors to online status (failback).

    Restore a recovered path to online status (failback) by using the appropriate command provided by the multipath software (HDLM, DMMP, or HFC-PCM). For details about how to restore paths to online status, see the manual Hitachi Dynamic Link Manager Software User's Guide (for Linux(R) systems). Alternatively, see the documentation for DMMP or HFC-PCM.

    For a single-path configuration, or in a VMware ESXi-based virtualization environment (where DMMP is not used), this step is not necessary.

  4. Restart the server.
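The only conditional part of this procedure is step 3. As a minimal sketch, the Python function below builds the list of steps that apply to a given configuration. The function name and its two parameters are hypothetical and serve only as an illustration; the sketch does not call any multipath or HA Monitor command.

```python
from typing import List


def io_error_recovery_steps(multipath: bool, esxi_without_dmmp: bool = False) -> List[str]:
    """Return the recovery steps that apply to the given configuration.

    multipath: True if the shared disk uses a multi-path configuration.
    esxi_without_dmmp: True for a VMware ESXi-based virtualization
        environment in which DMMP is not used.
    """
    steps = [
        "Determine the causes of the I/O errors",
        "Resolve the causes of the I/O errors",
    ]
    # Failback is needed only when recovered paths must be brought back online.
    if multipath and not esxi_without_dmmp:
        steps.append("Restore the recovered paths to online status (failback)")
    steps.append("Restart the server")
    return steps


for number, step in enumerate(io_error_recovery_steps(multipath=True), start=1):
    print(f"{number}. {step}")
```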