Nonstop Database, HiRDB Version 9 System Operation Guide


8.12 Monitoring the status of server processes (message queue monitoring facility)

When server processing is down on a server, responses to UAPs can degrade and the system might even freeze. This section explains how to use the message queue monitoring facility to detect server processes that are down.

"Server processing is down" means that a server process has entered a state in which it cannot perform any processing at all, because of degraded performance under a high CPU load, I/O delays caused by an I/O error, or some other extreme condition.

Organization of this section
(1) Overview of message queue monitoring facility
(2) Action to be taken when the warning message is issued
(3) Taking steps to prevent server processing from going down
(4) Remarks

(1) Overview of message queue monitoring facility

HiRDB's server process allocation processing uses a message queue. When server processing is down, messages can no longer be read from the message queue. If no message can be read from the message queue within a specified amount of time (called the message queue monitoring time), HiRDB issues a warning message (KFPS00888-W) and an error message (KFPS00889-E). This capability is provided by the message queue monitoring facility. When these messages are issued, server processing might have gone down.

The message queue monitoring time is normally 600 seconds (10 minutes). You can change this time with the pd_queue_watch_time operand.
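The timeout condition described above can be pictured with a minimal sketch. This is a conceptual illustration only, not HiRDB's implementation: the function name, its parameters, and the choice to treat "reached" as greater-than-or-equal are assumptions made for this example. Only the 600-second figure and the message IDs come from this section.

```python
from datetime import datetime, timedelta

# Normal monitoring time stated in this section (pd_queue_watch_time), in seconds.
NORMAL_WATCH_TIME_SECONDS = 600


def queue_read_timed_out(last_read, now, watch_time_seconds=NORMAL_WATCH_TIME_SECONDS):
    """Return True when no message has been read for at least the monitoring time.

    Hypothetical helper: in HiRDB this condition is what leads to the
    KFPS00888-W and KFPS00889-E messages; the check itself is internal to HiRDB.
    """
    return (now - last_read) >= timedelta(seconds=watch_time_seconds)


# Usage: the last read happened 10 minutes before "now".
last_read = datetime(2024, 1, 1, 12, 0, 0)
now = datetime(2024, 1, 1, 12, 10, 0)
timed_out = queue_read_timed_out(last_read, now)
```

With a shorter value passed as watch_time_seconds, the same gap between reads is reported as a timeout sooner, which is the effect of lowering the pd_queue_watch_time operand.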

(2) Action to be taken when the warning message is issued

When the warning message is issued, server processing might have gone down. Take one of the following actions:

(a) Restarting the unit

You can restore the server process that went down by restarting the unit in which it was running. Normally, when the message queue monitoring time is exceeded, HiRDB forcibly terminates the unit whose server processing has gone down.

If you do not want the unit to be terminated forcibly, specify continue in the pd_queue_watch_timeover_action operand.

(b) Canceling transactions

If you do not or cannot take the action described in (a) above, use the pdcancel command to stop the transactions that are executing on the server whose processing is down. If there are no transactions, use the OS's kill command to terminate the server process that stopped responding. Afterwards, identify why the server process stopped responding and take appropriate action.
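The choice between actions (a) and (b) can be summarized as a small decision helper. This is a sketch for illustration only: the function name, its two boolean parameters, and the returned strings are hypothetical, and the helper does not run any HiRDB or OS command.

```python
def choose_recovery_action(can_restart_unit, has_running_transactions):
    """Pick the recovery action described in (a) and (b) of this section.

    can_restart_unit: True if restarting the unit is acceptable and possible.
    has_running_transactions: True if transactions are executing on the
    server whose processing is down.
    """
    if can_restart_unit:
        # (a) Restarting the unit restores the server process that went down.
        return "restart the unit"
    if has_running_transactions:
        # (b) Stop the transactions executing on the affected server.
        return "cancel the transactions with the pdcancel command"
    # (b) No transactions: terminate the unresponsive server process.
    return "terminate the server process with the OS kill command"


# Usage: the unit cannot be restarted and transactions are still executing.
action = choose_recovery_action(can_restart_unit=False, has_running_transactions=True)
```

Whichever branch is taken, the cause of the stalled server process still needs to be identified afterwards.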

(3) Taking steps to prevent server processing from going down

The following table shows the causes of message queue stagnation and the corrective measures to take for the server whose message queue is being monitored.

Table 8-12 Causes of message queue stagnation and corrective measures

The FES, BES, DS, and SDS columns indicate the server process being monitored.

| Cause | FES | BES | DS | SDS | Corrective measure |
|---|---|---|---|---|---|
| Messages cannot be read from the message queue because of a high CPU load. | Y | Y | Y | Y | Check the cause of the CPU overload and take appropriate corrective action. |
| Messages cannot be read from the message queue because an I/O error is delaying input/output. | Y | Y | Y | Y | Check the cause of the I/O error and take appropriate corrective action. |
| If the number of concurrent connection requests exceeds the value of the pd_max_users operand, the number of processes available for reading messages from the message queue is insufficient (this tends to occur when the high-speed connection facility is used). | Y | -- | -- | Y | Either reduce the number of concurrent connection requests or connect normally without using the high-speed connection facility. Alternatively, increase the value of the pd_max_users operand. |
| The number of processes available for reading messages from the message queue is insufficient when the number of active back-end servers or dictionary servers is smaller than the number of active front-end servers or utility servers (this tends to occur in a multiple front-end server environment). | -- | Y | Y | -- | Ensure that the correct values are specified in the pd_max_bes_process, pd_max_dic_process, and pd_max_users operands. Alternatively, reduce the number of connections. |

Legend:
Y: Applicable
--: Not applicable

Reference note
The number of HiRDB server processes that can be active is restricted by the following operands:
  • pd_max_server_process
    If the number of active servers in a unit is large, carefully estimate the value to be specified for this operand. Also, if the standby-less system switchover facility is used, the estimate must take system switchover into account.
  • pd_max_bes_process
    If multiple front-end servers or the standby-less system switchover (1:1) facility is used, carefully estimate the value to be specified for this operand.
  • pd_max_dic_process
    If multiple front-end servers are used, carefully estimate the value to be specified for this operand.
  • pd_ha_max_server_process
    If the standby-less system switchover (effects distributed) facility is used, carefully estimate the value to be specified for this operand.
  • pd_max_users
    If the number of concurrent connections is large, an appropriate value must be specified.

(4) Remarks

You can use the pdls -d scd command to determine the time at which the last message was read from the message queue.
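For scripted checks, the command can be wrapped in a small helper. The sketch below only runs pdls -d scd and returns its raw output. It assumes that pdls is on the PATH of a user allowed to run HiRDB commands, and it does not parse the output, because the output format is not described in this section. The function name and the run parameter are hypothetical.

```python
import subprocess


def read_scd_status(run=subprocess.run):
    """Run `pdls -d scd` and return its standard output as text.

    The `run` parameter exists only so that the command runner can be
    replaced, for example in a test; by default subprocess.run is used.
    """
    result = run(["pdls", "-d", "scd"], capture_output=True, text=True, check=True)
    return result.stdout


# Usage on a host where HiRDB commands are available:
#     print(read_scd_status())
```

Because check=True is passed, a nonzero exit status from the command raises subprocess.CalledProcessError instead of returning partial output.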