OpenTP1 Version 7 Description


3.2.3 Node management in OpenTP1

Multiple instances of OpenTP1 communicate with one another using RPCs implemented over TCP/IP. A TCP/IP connection is established between the servers, and the RPCs travel over that connection.

If communication fails due to a network error, the OpenTP1 instances cannot detect that the connection has been lost. For this reason, RPCs issued after a network error may fail. The following describes the OpenTP1 facilities for preventing RPC errors after a connection failure.

Organization of this subsection
(1) Startup notification facility
(2) Node monitoring facility
(3) Facility for monitoring nodes registered in the RPC suppression list
(4) Node information display

(1) Startup notification facility

When OpenTP1 starts, the startup of the local node is reported to the name services of the OpenTP1 instances running on other nodes, and any connection that was already established is forcibly closed. This functionality can be used at system switchover, for example.

To enable notification of OpenTP1 startup to other nodes, specify Y in the name_notify operand of the system common definition on both the sending and receiving nodes.

OpenTP1 on both nodes must be version 05-02 or later to use this functionality.
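
For reference, the corresponding entry in the system common definition might look as follows (a minimal sketch; surrounding definition statements are omitted):

  set name_notify = Y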

Figure 3-22 shows an example of a system configuration when using the startup notification facility at system switchover.

Figure 3-22 Example of using the startup notification facility at system switchover

[Figure]

  1. OpenTP1-B goes down due to a server failure or other error, and a system switchover occurs. OpenTP1-A cannot detect the failure in OpenTP1-B, so the connection remains open.
  2. The systems are switched, and OpenTP1-C starts on the standby node.
  3. If the startup notification facility is enabled, notification that OpenTP1-C has started is sent to OpenTP1-A.
  4. OpenTP1-A forcibly closes its connection to OpenTP1-B.

Because communication among OpenTP1-A, OpenTP1-B, and OpenTP1-C resumes in this way over a newly established connection, processing continues without communication errors.

If, for any reason, OpenTP1-A cannot be notified of the startup, message KFCA00642-W is output on OpenTP1-C. In this case, you must execute the namunavl command on OpenTP1-A. By specifying the -l option of the namunavl command, you can find out which nodes could not be notified that OpenTP1-C had started.
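
As a sketch, listing the nodes that could not be notified looks as follows (only the option described above is shown):

  namunavl -l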

Note
The startup notification facility cannot be used when multiple instances of OpenTP1 are running on a monitored host, or when multiple instances of OpenTP1 run with the same IP address after a system switchover (an environment with only one LAN board).

(2) Node monitoring facility

The node monitoring facility polls nodes at regular intervals and detects communication failures.

Using the node monitoring facility, you can monitor the status of OpenTP1 on nodes specified in the all_node operand and all_node_ex operand in the system common definition. If an OpenTP1 node cannot be detected as active, this facility deletes all cached service information relating to the node and closes the connection.

Node monitoring minimizes RPC errors because failures are detected promptly and failed nodes are forcibly disconnected.

Figure 3-23 shows an example of monitoring other nodes by using the node monitoring facility.

Figure 3-23 Monitoring other nodes by using the node monitoring facility

[Figure]

The node monitoring facility at OpenTP1-A periodically polls OpenTP1-B, OpenTP1-C, and OpenTP1-D.

  1. If the OpenTP1-C node goes down, the node monitoring facility detects that OpenTP1-C cannot be reached.
  2. OpenTP1-C is disconnected and message KFCA00650-I is output.
  3. The failed node is registered in the RPC suppression list#. Service information about the node is deleted from the cached service information.
    #
    An RPC suppression list contains information about nodes on which the OpenTP1 system is inactive.

The node monitoring facility checks whether nodes are active at the intervals specified in the name_audit_interval operand of the name service definition. To use the node monitoring facility, specify 1 or 2 in the name_audit_conf operand of the name service definition.
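
As an illustration, the corresponding name service definition entries might look as follows (a minimal sketch; the interval value is illustrative only):

  set name_audit_conf = 1
  set name_audit_interval = 60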

The behavior of the node monitoring facility depends on the value specified in the name_audit_conf operand. While the facility is enabled, the action taken depends on the status of each monitored node, as summarized in Table 3-4.

The namalivechk command is another way of checking whether nodes are active (a command sketch follows Table 3-4). Table 3-4 describes the differences between node monitoring using the node monitoring facility and using the namalivechk command.

Table 3-4 Comparison of node monitoring using the node monitoring facility and the namalivechk command

Monitored nodes
  Node monitoring facility:
    • All nodes specified in the all_node operand of the system common definition (whether active or not)
    • All nodes specified in the all_node_ex operand of the system common definition (whether active or not)
  namalivechk command:
    • Nodes specified in the all_node operand of the system common definition on which OpenTP1 has not been detected as inactive
    • All nodes specified in the all_node_ex operand of the system common definition (whether active or not)

Operation when an inactive node is detected
  Node monitoring facility:
    • If the node is specified in the all_node operand, information about the node is entered in the RPC suppression list unless it has already been entered; if it has, no action is taken.
    • The connection with the inactive node is closed.
    • Cached service information about the inactive node is deleted.
  namalivechk command:
    • Information about any node specified in the all_node operand that is found to be inactive is entered in the RPC suppression list.
    • The connection with the inactive node is closed.
    • Cached service information about the inactive node is deleted.

Operation when an active node is detected
  Node monitoring facility: If the node is specified in the all_node operand and has been entered in the RPC suppression list, information about the node is deleted from the list.
  namalivechk command: No action
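
For an on-demand check, a minimal sketch of the invocation (shown without options; the output layout is omitted here):

  namalivechk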

Note
  • The node monitoring facility cannot be used when multiple instances of OpenTP1 are running on a monitored host, or when multiple instances of OpenTP1 run with the same IP address after a system switchover (an environment with only one LAN board).
  • To tune the sensitivity with which a node-down condition is detected by the node monitoring facility, change the following operands (a tuning sketch follows this note):
    If 1 is specified in the name_audit_conf operand:
    Change the ipc_conn_interval operand in the system common definition.
    If 2 is specified in the name_audit_conf operand:
    Change the name_audit_watch_time operand in the name service definition.
  • A maximum of 60 nodes can be monitored concurrently by the node monitoring facility. If more than 60 nodes are specified in the all_node and all_node_ex operands in the system common definition, the nodes are monitored in groups of 60 or fewer, one group at a time.
  • If a large number of nodes are specified in the all_node and all_node_ex operands in the system common definition, use of the node monitoring facility may affect the RPCs issued by UAPs. In this case, avoid specifying too small a value in the name_audit_interval operand, and avoid executing the namalivechk command again soon after the previous execution.
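
As an illustration, the tuning entries might look as follows (a minimal sketch; the values are illustrative only and should be chosen to suit your environment):

  In the system common definition, when 1 is specified in the name_audit_conf operand:
    set ipc_conn_interval = 30

  In the name service definition, when 2 is specified in the name_audit_conf operand:
    set name_audit_watch_time = 30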

(3) Facility for monitoring nodes registered in the RPC suppression list

The name service can check at 180-second intervals whether nodes registered in the RPC suppression list are active again. This facility is separate from the node monitoring facility. Specify whether to use this facility in the name_rpc_control_list operand of the name service definition.
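
For example, the name service definition entry might look as follows (a minimal sketch; the value Y is an assumption here, so confirm the accepted values for this operand):

  set name_rpc_control_list = Y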

Decide whether to use the facility for monitoring nodes registered in the RPC suppression list according to the setting for the node monitoring facility. For example, monitoring of nodes registered in the RPC suppression list should be disabled if either of the following occurs:

If the facility for monitoring nodes registered in the RPC suppression list is disabled and a value of 180 seconds or longer is specified in the name_audit_interval operand, it takes longer than usual for a recovered node to be deleted from the RPC suppression list.

We recommend that you set the node monitoring facility and the facility for monitoring nodes registered in the RPC suppression list as follows:

(4) Node information display

By executing the namsvinf command, you can view the IP address, activity status, and name-service port number of OpenTP1 nodes. This information can be displayed for the OpenTP1 nodes specified in the all_node and all_node_ex operands in the system common definition.
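
A minimal sketch of the invocation (shown without options; the output layout is omitted here):

  namsvinf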