8.12 Monitoring the status of server processes (message queue monitoring facility)

When server processing is down on a server, responses to UAPs might degrade and the system might even freeze. This section explains how to use the message queue monitoring facility to detect server processes that are down.

Server processing is considered down when a server process cannot perform any processing at all because of an extreme condition, such as degraded processing performance caused by a high CPU load or an I/O delay caused by an I/O error.

Organization of this section: (1) Overview of message queue monitoring facility; (2) Action to be taken when the warning message is issued; (3) Taking steps to prevent server processing from going down; (4) Remarks

(1) Overview of message queue monitoring facility

HiRDB uses a message queue when it allocates server processes. When server processing is down, messages can no longer be read from the message queue. If messages cannot be read from the message queue within a specified amount of time (the message queue monitoring time), HiRDB issues a warning message (KFPS00888-W) and an error message (KFPS00889-E). This capability is provided by the message queue monitoring facility. If these messages are issued, server processing might have gone down.

The message queue monitoring time is normally 600 seconds (10 minutes). You can change this time with the pd_queue_watch_time operand.
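
The timing rule described above can be illustrated with a small watchdog model. This is only a sketch of the idea, not HiRDB's implementation: the class `QueueWatch` and its methods are hypothetical names introduced here, and only the 600-second value and the two message IDs are taken from the text.

```python
from dataclasses import dataclass, field

WATCH_TIME_SECONDS = 600  # the normal monitoring time stated in the text


@dataclass
class QueueWatch:
    """Hypothetical model of a message queue monitor (not HiRDB code)."""

    watch_time: float = WATCH_TIME_SECONDS
    last_read: float = 0.0  # time of the most recent successful read
    issued: list = field(default_factory=list)  # message IDs issued so far

    def message_read(self, now: float) -> None:
        # A successful read restarts the monitoring interval.
        self.last_read = now

    def check(self, now: float) -> bool:
        """Return True if the monitoring time was reached without a read."""
        if now - self.last_read >= self.watch_time:
            # The text says a warning and an error message are issued here.
            self.issued += ["KFPS00888-W", "KFPS00889-E"]
            return True
        return False


watch = QueueWatch()
watch.message_read(now=100.0)
print(watch.check(now=400.0))  # 300 seconds since the last read
print(watch.check(now=700.0))  # 600 seconds since the last read
print(watch.issued)
```

The clock value is passed in explicitly so that the model can be exercised without waiting; a real monitor would read the system clock instead.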

(2) Action to be taken when the warning message is issued

When the warning message is issued, server processing might have gone down. Take one of the following actions:

Restart the unit
Cancel transactions

(a) Restarting the unit

You can restore the server process that went down by restarting the unit in which it was running. Normally, when the message queue monitoring time is exceeded, HiRDB forcibly terminates the unit whose server processing has gone down.

If you do not want the unit to be terminated forcibly, specify continue in the pd_queue_watch_timeover_action operand.

(b) Canceling transactions

If you do not or cannot take the action described in (a), use the pdcancel command to stop the transactions that are executing on the server whose processing is down. If there are no transactions, use the OS's kill command to terminate the server process that stopped responding. Afterwards, identify why the server process stopped responding and take appropriate action.
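
The choice between the actions in (a) and (b) can be summarized as a short decision function. This is a reading aid under stated assumptions, not an operational tool: the function name `choose_action` and its two boolean inputs are hypothetical, and the function only returns a description of the action rather than running any command.

```python
def choose_action(can_restart_unit: bool, transactions_running: bool) -> str:
    """Map the situation to the action described in (a) and (b)."""
    if can_restart_unit:
        # (a) Restarting the unit restores the server process.
        return "restart the unit"
    if transactions_running:
        # (b) Stop the transactions on the server whose processing is down.
        return "cancel transactions with pdcancel"
    # (b) No transactions exist, so terminate the unresponsive process.
    return "terminate the server process with the OS kill command"


print(choose_action(can_restart_unit=False, transactions_running=True))
```

In every case, the cause of the unresponsive server process still needs to be identified afterwards.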

(3) Taking steps to prevent server processing from going down

The following table shows the causes of message queue stagnation and the corrective measures to take for the server whose message queue is being monitored.

Table 8-12 Causes of message queue stagnation and corrective measures

Cause	Server process being monitored				Corrective measure
	FES	BES	DS	SDS	
Messages cannot be read from the message queue because of a high CPU load.	Y	Y	Y	Y	Check the cause of the CPU overload and take appropriate corrective action.
Messages cannot be read from the message queue because an I/O error is delaying input/output.	Y	Y	Y	Y	Check the cause of the I/O error and take appropriate corrective action.
If the number of concurrent connection requests exceeds the value of the pd_max_users operand, the number of processes available for reading messages from the message queue is insufficient (this tends to occur when the high-speed connection facility is used).	Y	--	--	Y	Either reduce the number of concurrent connection requests or connect normally without using the high-speed connection facility. Alternatively, increase the value of the pd_max_users operand.
The number of processes available for reading messages from the message queue is insufficient when the number of active back-end servers or dictionary servers is smaller than the number of active front-end servers or utility servers (this tends to occur in a multiple front-end server environment).	--	Y	Y	--	Ensure that the correct values are specified in the pd_max_bes_process, pd_max_dic_process, and pd_max_users operands. Alternatively, reduce the number of connections.

Legend:

Y: Applicable

--: Not applicable
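
When a warning is issued for a particular server, the table can be used as a lookup from server type to candidate causes. The following sketch encodes the Y columns of Table 8-12 as data. The names `STAGNATION_CAUSES` and `possible_causes` are hypothetical, and the cause descriptions are abbreviated from the table.

```python
# Table 8-12 in abbreviated form: (cause, server types marked "Y").
STAGNATION_CAUSES = [
    ("high CPU load", {"FES", "BES", "DS", "SDS"}),
    ("I/O delay caused by an I/O error", {"FES", "BES", "DS", "SDS"}),
    ("concurrent connection requests exceed pd_max_users", {"FES", "SDS"}),
    ("fewer active BES or DS than active FES or utility servers", {"BES", "DS"}),
]


def possible_causes(server_type: str) -> list[str]:
    """Return the causes from the table that apply to the given server type."""
    return [cause for cause, servers in STAGNATION_CAUSES if server_type in servers]


for cause in possible_causes("BES"):
    print(cause)
```

The lookup only narrows down which rows of the table to investigate; the corrective measures themselves remain those listed in the table.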

Reference note

The number of HiRDB server processes that can be active is restricted by the following operands:

pd_max_server_process
    If the number of active servers in a unit is large, carefully estimate the value to be specified for this operand. Also, if the standby-less system switchover facility is used, the estimate must take system switchover into account.
pd_max_bes_process
    If multiple front-end servers or the standby-less system switchover (1:1) facility is used, carefully estimate the value to be specified for this operand.
pd_max_dic_process
    If multiple front-end servers are used, carefully estimate the value to be specified for this operand.
pd_ha_max_server_process
    If the standby-less system switchover (effects distributed) facility is used, carefully estimate the value to be specified for this operand.
pd_max_users
    If the number of concurrent connections is large, an appropriate value must be specified.
