Hitachi

For Linux(R) (x86) Systems HA Monitor Cluster Software


2.3.5 Host reset

Host reset forcibly terminates the system on a host that has experienced a failure in order to stop that host's I/O operations. This prevents the same active server from running on multiple hosts at the same time.

You must decide whether to use this function. For details about how to decide, see 1.4 Hot-standby switchover methods and 3.1 List of functions supported by HA Monitor.

Host reset is achieved by linking HA Monitor with a failure management processor. If VMware ESXi-based virtualization is used, HA Monitor links with VMware ESXi instead of a failure management processor.

HA Monitor monitors hosts for failures. When it detects a host failure and hot standby operation is enabled, it issues a host reset request. Normally, the standby system resets the active system on which the failure occurred (resetting the active system). Depending on the type of hot-standby configuration, the active system can instead reset the standby system (resetting the standby system).

Note

For details about what to do if the host reset fails, see 7.5.3 Handling host reset errors.


(1) Resetting the active system

If the HA Monitor in the standby system detects a host failure in the active system, it issues an active system reset request to the failure management processor in the active system by sending a reset command via the reset path. If a system dump can be obtained during the reset operation, an OS function obtains a system dump of the reset system. When the reset operation is completed, HA Monitor starts the hot standby operation. If the reset operation fails, the user must reset the system manually and then obtain a system dump.
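The order of events on the standby side can be summarized as a small Python sketch. This is only a model of the sequence described above, not HA Monitor code: the function name, the `ResetOutcome` record, and the step labels are hypothetical, and the reset request itself is represented by a callback that returns whether the reset succeeded.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ResetOutcome:
    # Hypothetical record of what happened; not an HA Monitor data structure.
    steps: List[str] = field(default_factory=list)
    hot_standby_started: bool = False
    manual_reset_required: bool = False


def handle_active_host_failure(
    send_reset_request: Callable[[], bool],
    dump_available: bool,
) -> ResetOutcome:
    """Model the standby system's response to a host failure in the active system."""
    outcome = ResetOutcome()
    # The reset command goes to the failure management processor via the reset path.
    outcome.steps.append("reset request sent")
    if not send_reset_request():
        # Reset failed: the user must reset manually and then obtain a system dump.
        outcome.steps.append("reset failed")
        outcome.manual_reset_required = True
        return outcome
    if dump_available:
        # The dump is taken by an OS function during the reset, not by HA Monitor.
        outcome.steps.append("system dump taken by OS")
    # Only after the reset completes does hot standby switching begin.
    outcome.steps.append("hot standby started")
    outcome.hot_standby_started = True
    return outcome
```

The key point the sketch encodes is that hot standby starts only after a successful reset, which is what keeps the active server from running on two hosts at once.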

You must have Linux Tough Dump installed to obtain a system dump successfully. For details about Linux Tough Dump, contact a Hitachi sales representative.

The following figure shows host status monitoring and host reset.

[Figure 2‒14: Host status monitoring and host reset]

HA Monitor monitors the remote system by exchanging alive messages. If a host failure occurs in the active system, the HA Monitor in the standby system stops receiving alive messages from the active system, concludes that a host failure has occurred, and issues a reset request to the failure management processor#. The failure management processor# then sends a host reset instruction.
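The detection side of this exchange is a timeout on alive messages. The following minimal Python sketch shows the idea; the class and method names are hypothetical, the timeout stands in for the host failure monitoring time set in the patrol operand, and the strict "greater than" comparison is an assumption. Time is passed in explicitly so the logic can be exercised without waiting.

```python
class AliveMonitor:
    """Hypothetical sketch: declare a host failure when no alive message
    has arrived within the host failure monitoring time."""

    def __init__(self, patrol_seconds: float, now: float) -> None:
        # patrol_seconds plays the role of the patrol operand (an assumption).
        self.patrol_seconds = patrol_seconds
        self.last_alive = now

    def on_alive_message(self, now: float) -> None:
        # Every alive message from the remote system restarts the timer.
        self.last_alive = now

    def host_failure_detected(self, now: float) -> bool:
        # A failure is assumed once the silence exceeds the monitoring time.
        return now - self.last_alive > self.patrol_seconds


monitor = AliveMonitor(patrol_seconds=30.0, now=0.0)
monitor.on_alive_message(now=10.0)
if monitor.host_failure_detected(now=45.0):
    print("host failure: issue reset request")
```

In the usage lines, 35 seconds pass after the last alive message, which exceeds the 30-second monitoring time, so a reset request would be issued.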

#

If the server model is HA8000xN or later, HA Monitor links with the failure management processor on the management server.

If VMware ESXi-based virtualization is used, HA Monitor links with VMware ESXi instead of the failure management processor.

Because host reset is achieved by linking with hardware, you must use the address operand in the HA Monitor environment settings to specify a unique host address for each host. This host address is neither an IP address in TCP/IP nor a MAC address in OSI. The HA Monitor administrator can specify any value, provided that it is unique for each host.
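Because every host needs its own address value, a pre-deployment check for duplicates can catch configuration mistakes early. The sketch below is hypothetical: it assumes the address operand values for all hosts have been collected into a dictionary keyed by host name, and it treats the values as opaque strings, since their exact format is not described here.

```python
from collections import Counter
from typing import Dict, List


def duplicate_host_addresses(addresses: Dict[str, str]) -> List[str]:
    """Return address values that are assigned to more than one host.

    `addresses` maps a host name to the value of its address operand
    (a hypothetical collection format). The values are HA Monitor host
    addresses, not IP or MAC addresses, so no format check is done.
    """
    counts = Counter(addresses.values())
    return sorted(addr for addr, n in counts.items() if n > 1)


print(duplicate_host_addresses({"hostA": "1", "hostB": "2", "hostC": "1"}))
```

An empty result means every host has a distinct address; any value returned is shared by two or more hosts and must be changed.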

If the system being used is BladeSymphony, HA Monitor and the failure management processor perform the same processing, even if the standby system that issues the reset instruction is in the same chassis as the active system.

If VMware ESXi-based virtualization is used and a failure occurs in the virtual machine on which the HA Monitor in the active system is running, the HA Monitor in the standby system resets only the failed virtual machine. To do this, the HA Monitor in the standby system issues a reset request to the VMware ESXi that manages the failed virtual machine. Therefore, when VMware ESXi-based virtualization is used, you must use the HA Monitor environment setup command (monsetup command) to specify, in addition to the normal settings, the IP address of the VMware ESXi and the name of the virtual machine in the HA Monitor environment settings.
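The extra settings needed in a VMware ESXi environment can be expressed as a simple completeness check. In this hypothetical Python sketch, the field names (`address`, `esxi_ip`, `vm_name`) are illustrative labels only; the real values are entered with the monsetup command, whose options are not reproduced here. The IP check uses the standard `ipaddress` module.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostResetSettings:
    # Hypothetical field names; they are not monsetup option names.
    address: str                    # unique host address (address operand)
    uses_vmware_esxi: bool = False
    esxi_ip: Optional[str] = None   # IP address of the VMware ESXi
    vm_name: Optional[str] = None   # name of the virtual machine to reset


def missing_settings(s: HostResetSettings) -> List[str]:
    """List settings that are missing or invalid for host reset."""
    problems = []
    if not s.address:
        problems.append("address")
    if s.uses_vmware_esxi:
        # With ESXi, the reset request goes to ESXi, so it must be reachable.
        if not s.esxi_ip:
            problems.append("esxi_ip")
        else:
            try:
                ipaddress.ip_address(s.esxi_ip)
            except ValueError:
                problems.append("esxi_ip")
        # ESXi needs the VM name to reset only the failed virtual machine.
        if not s.vm_name:
            problems.append("vm_name")
    return problems
```

An empty list means the sketch found nothing missing; it does not replace the validation performed by monsetup itself.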

If host reset fails in a virtualization environment that uses Hitachi server virtualization (Virtage) or VMware ESXi, you can use the physical partition reset function to reset the entire processor. For details about the physical partition reset function, see 3.3.6 Physical partition reset function in a virtualization environment.
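The escalation path can be outlined as follows. This Python sketch is hypothetical: it assumes, for illustration only, that the physical partition reset is attempted right after a failed host reset, and it models both resets as callbacks returning success or failure. See 3.3.6 for how the physical partition reset function is actually configured and used.

```python
from typing import Callable


def reset_with_fallback(
    host_reset: Callable[[], bool],
    partition_reset: Callable[[], bool],
    virtualized: bool,
) -> str:
    """Try a normal host reset first; in a Virtage or VMware ESXi
    environment, fall back to a physical partition reset (assumed order)."""
    if host_reset():
        return "host reset"
    # The physical partition reset applies only to virtualization environments.
    if virtualized and partition_reset():
        return "physical partition reset"
    # Otherwise the user must reset the system manually.
    return "manual reset required"
```

Trying the narrower host reset first keeps the impact on other partitions to a minimum; the wider physical partition reset is kept as a last automated resort.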

(2) Resetting the standby system

Depending on the type of hot-standby configuration, a standby system might be reset by the active system or by another standby system.

(a) 1-to-1 switchover configuration

If no alive message is received from the standby system within the host failure monitoring time specified in the patrol operand in the HA Monitor environment settings, the HA Monitor in the active system determines that a host failure has occurred in the standby system. In this case, the HA Monitor in the active system does nothing because the active server can continue job processing.

If a failure occurs in the active system while the standby system is in host failure status, hot standby switching to the standby system is not possible. Therefore, you can have the HA Monitor in the active system reset the standby system when a host failure is detected in the standby system. You use the standbyreset operand in the HA Monitor environment settings to specify whether the standby system is to be reset in the event of a host failure. Do not perform standby system reset in any configuration other than the 1-to-1 switchover configuration.
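The active system's decision when the standby system stops sending alive messages can be sketched as a small function. The enum, the function name, and the choice to reject standbyreset outside a 1-to-1 switchover configuration are assumptions made for this illustration; they reflect the rule above that standby system reset must not be used in other configurations, not documented HA Monitor behavior.

```python
from enum import Enum


class Config(Enum):
    # Hypothetical labels for the hot-standby configuration types.
    ONE_TO_ONE = "1-to-1 switchover"
    MULTI_STANDBY = "multi-standby"


def active_action_on_standby_failure(config: Config, standbyreset: bool) -> str:
    """Decide what the active system does after detecting a standby host failure."""
    if standbyreset and config is not Config.ONE_TO_ONE:
        # This sketch treats the combination as a configuration error.
        raise ValueError("standbyreset is only for the 1-to-1 switchover configuration")
    if standbyreset:
        return "reset standby system"
    # The active server keeps running jobs; no reset is issued.
    return "continue job processing"
```

With standbyreset disabled, the active system simply keeps processing jobs, which matches the default behavior described in (a).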

If the standby system is reset in the event of a host failure in the standby system, the operator must determine the cause of the host failure and take appropriate action.

(b) Multi-standby configuration

In a multi-standby configuration, there are multiple standby systems per active system. If a host failure occurs in one of the standby systems, another standby system might reset the standby system that failed. For details about host reset in a multi-standby configuration, see 4.2.3 Host reset when there are multiple standby systems.