Hitachi

uCosminexus Service Platform Setup and Operation Guide


7.9.2 How to recover during cluster configuration

This subsection describes how to recover from a failure that occurs in a cluster configuration, separately for load-balancing cluster configurations and for HA cluster configurations.

(1) Failure and recovery in load-balancing cluster configuration

By provisioning the HCSC server machines in a load-balancing cluster with spare capacity, in preparation for the failure of an HCSC server in the cluster, you can keep the load on the HCSC servers that continue operating in the cluster at a fixed level.

Important note

Each HCSC server has its own reception queue for standard asynchronous reception (MDB (WS-R)/MDB (DB queue)). Therefore, when a failure occurs in a load-balancing cluster configuration, you cannot send the messages that are retained in the reception queue for standard asynchronous reception (MDB (WS-R)/MDB (DB queue)).

(a) Disconnecting HCSC server with a failure

When a failure occurs in an HCSC server that is part of a load-balancing cluster, disconnect that HCSC server from the load-balancing cluster. After disconnecting it, eliminate the failure so that the HCSC server can receive service component execution requests again. Then recover by returning the repaired HCSC server to operation in the load-balancing cluster.

(b) Restoring HCSC server with a failure

The steps to disconnect and restore an HCSC server in which a failure has occurred are as follows:

  1. Configure the load-balancer so that service component execution requests are not sent to the HCSC server in which the failure has occurred.

    Use the load-balancer settings to control where service component execution requests are sent. The setting method depends on the specifications of the load-balancer.

  2. If you are using CTM, lock the queue.

    For details about how to lock the queue, see "3.7.4 Locking and controlling requests for a schedule queue" in "Application Server Expansion Guide".

  3. Eliminate the failure in the affected HCSC server so that it can receive service component execution requests.

  4. If you are using CTM, release the queue lock.

    For details about how to release the queue lock, see "5.4.2 Methods of locking a service and stopping a J2EE application that can be executed for each system" in "Application Server Operation, Monitoring, and Linkage Guide".

  5. Set the load-balancer to send service component execution requests to the HCSC server from which the fault was eliminated.

    The setting method depends on the specifications of the load-balancer.
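
The order of the steps above can be sketched as follows. This is a minimal illustration only: the function name and the step descriptions are hypothetical and are not commands or APIs of Service Platform, CTM, or any load-balancer. The sketch shows only that the CTM queue lock and release (steps 2 and 4) apply when CTM is in use, and that they bracket the repair step.

```python
from typing import List


def load_balancing_recovery_steps(uses_ctm: bool) -> List[str]:
    """Return the recovery steps, in order, for a failed HCSC server.

    Hypothetical helper for illustration; each string stands for a manual
    operation described in the procedure above.
    """
    steps = ["Stop sending requests to the failed HCSC server (load-balancer setting)"]
    if uses_ctm:
        # Step 2 applies only when CTM is used.
        steps.append("Lock the CTM queue")
    steps.append("Eliminate the failure in the HCSC server")
    if uses_ctm:
        # Step 4 applies only when CTM is used.
        steps.append("Release the CTM queue lock")
    steps.append("Resume sending requests to the recovered HCSC server (load-balancer setting)")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(load_balancing_recovery_steps(uses_ctm=True), start=1):
        print(f"{number}. {step}")
```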

(2) Failure and recovery in HA cluster configuration

When a failure occurs in the executing node, the cluster software switches processing to the standby node. After the switch, eliminate the failure in the executing node, and then resume operations on the executing node.

Important note

When a failure occurs in the executing node and operations are switched to the standby node, you must eliminate the failure in the executing node and return operations to the executing node. You cannot continue to use the standby node as the executing node, nor can you repair the failed node and then use it as the standby node.

(a) Disconnecting node

The following figure shows how the cluster software switches processing to the standby node when a failure occurs in the executing node:

Figure 7‒137:  Switching to standby node when a failure occurs (HA cluster)

[Figure]

(b) Recovering executing node

When a failure occurs in the executing node and operations are switched to the standby node, you must eliminate the failure in the executing node and return operations to the executing node.

The steps to eliminate the failure in the executing node and return operations to the executing node are as follows:

  1. Disconnect the executing node and the standby node from the network for service requesters so that execution requests from service requesters are not received.

  2. Stop the receptions (standard reception and user-defined reception) on the standby node.

    For details about how to stop standard reception, see "5.3.33 Terminating the Standard Reception".

    For details about how to stop user-defined reception, see "5.3.34 Stopping the user-defined reception".

  3. Stop the HCSC server on the standby node.

    For details about how to stop the HCSC server, see "5.3.38 Terminating the HCSC Server".

  4. Eliminate the failure in the HCSC server on the executing node.

    For details about how to collect failure information and how to recover from the failure, see "7. Troubleshooting".

  5. Start the HCSC server on the executing node.

    For details about how to start the HCSC server, see "5.3.4 Starting HCSC server".

  6. Start the receptions (standard reception and user-defined reception) on the executing node.

    For details about how to start standard reception, see "5.3.9 Starting standard receptions".

    For details about how to start user-defined reception, see "5.3.8 Starting user-defined receptions".

  7. Restore operations by connecting the executing node and the standby node to the network for service requesters so that the executing node can receive execution requests from service requesters.
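
The seven steps above must be performed in order. As a minimal sketch of how that ordering might be enforced in an operator's own runbook script, the following example runs one caller-supplied action per step and stops at the first action that reports failure. The step names, `HA_RECOVERY_STEPS`, and `run_ha_recovery` are hypothetical and are not part of Service Platform or any cluster software; the actual operations are the ones described in the sections referenced in each step.

```python
from typing import Callable, Dict, List

# Hypothetical step names mirroring steps 1 to 7 of the procedure above.
HA_RECOVERY_STEPS: List[str] = [
    "disconnect_nodes_from_requester_network",
    "stop_standby_receptions",
    "stop_standby_hcsc_server",
    "repair_executing_hcsc_server",
    "start_executing_hcsc_server",
    "start_executing_receptions",
    "reconnect_nodes_to_requester_network",
]


def run_ha_recovery(actions: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run the supplied actions in procedure order.

    Each action returns True on success. Execution stops at the first
    action that returns False. Returns the names of the completed steps.
    """
    missing = [step for step in HA_RECOVERY_STEPS if step not in actions]
    if missing:
        raise KeyError(f"no action supplied for: {missing}")

    completed: List[str] = []
    for step in HA_RECOVERY_STEPS:
        if not actions[step]():
            break  # Do not continue past a failed step.
        completed.append(step)
    return completed


if __name__ == "__main__":
    # Placeholder actions that always succeed, for demonstration only.
    demo_actions = {step: (lambda: True) for step in HA_RECOVERY_STEPS}
    for done in run_ha_recovery(demo_actions):
        print("completed:", done)
```

Stopping at the first failed step is a design choice of this sketch: it prevents, for example, the nodes from being reconnected to the network for service requesters before the HCSC server and receptions on the executing node have started.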