Nonstop Database, HiRDB Version 9 System Operation Guide
A server on which server processing is down might experience degraded responses to UAPs, and even system freezes. This section explains how to use the message queue monitoring facility to monitor for server processes that are down.
Server processing down means that the server process has entered a state in which it is unable to perform any processing at all, due to CPU load-induced degradation of processing performance, I/O delay arising from an I/O error, or some other extreme condition.
HiRDB's server process allocation processing uses a message queue. When server processing is down, messages can no longer be read from the message queue. If no message can be read from the message queue for a specified length of time (called the message queue monitoring time), HiRDB issues a warning message (KFPS00888-W) and an error message (KFPS00889-E). This capability is provided by the message queue monitoring facility. When these messages are issued, server processing might have gone down.
The message queue monitoring time is normally 600 seconds (10 minutes). You can change this time with the pd_queue_watch_time operand.
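The timeout check described above can be pictured with a small model. This is only a sketch under stated assumptions: `QueueWatch` and its methods are hypothetical names, not a HiRDB interface, and treating the boundary as "elapsed time greater than or equal to the monitoring time" is an assumption. Only the 600-second default comes from the text.

```python
from dataclasses import dataclass

# Default monitoring time in seconds, as stated for pd_queue_watch_time.
DEFAULT_WATCH_TIME = 600


@dataclass
class QueueWatch:
    """Hypothetical model of one monitored message queue (not a HiRDB API)."""

    last_read_at: float                  # time of the last successful read (seconds)
    watch_time: int = DEFAULT_WATCH_TIME

    def record_read(self, now: float) -> None:
        """A message was read, so the stall timer starts over."""
        self.last_read_at = now

    def stalled_for(self, now: float) -> float:
        """Seconds elapsed since a message was last read."""
        return now - self.last_read_at

    def is_overdue(self, now: float) -> bool:
        """True once the stall has lasted for the whole monitoring time."""
        return self.stalled_for(now) >= self.watch_time


watch = QueueWatch(last_read_at=0.0)
print(watch.is_overdue(now=599.0))   # False: still inside the monitoring time
print(watch.is_overdue(now=600.0))   # True: monitoring time reached
watch.record_read(now=650.0)
print(watch.is_overdue(now=700.0))   # False: a read reset the timer
```

A shorter monitoring time reports a stall sooner but is more likely to flag a server that is merely busy, which is the trade-off to weigh when changing pd_queue_watch_time.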
When the warning message is issued, server processing might have gone down, so take one of the following actions:
(a) Restart the unit in which the server process that went down was running. This restores the server process. Normally, when the message queue monitoring time is exceeded, HiRDB forcibly terminates the unit whose server processing has gone down. If you do not want the unit to be forcibly terminated, specify continue in the pd_queue_watch_timeover_action operand.
(b) If you do not or cannot take the action described in (a), use the pdcancel command to stop the transactions that are executing on the server whose processing is down. If no transactions are executing, use the OS's kill command to terminate the server that stopped responding. Afterwards, identify why the server process stopped responding and take appropriate action.
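The choice between the recovery paths above can be summarized as a small decision function. This is an illustrative sketch only: the function name, its two boolean inputs, and the wording of the returned steps are hypothetical, and no pdcancel or kill arguments are shown because the text does not give them.

```python
from typing import List


def recommended_steps(can_restart_unit: bool, has_running_transactions: bool) -> List[str]:
    """Return the operator steps suggested by the text for a server that is down."""
    if can_restart_unit:
        # First path: restarting the unit restores the server process.
        return ["Restart the unit in which the server process was running"]

    # Second path: the unit is not (or cannot be) restarted.
    if has_running_transactions:
        steps = ["Use pdcancel to stop the transactions executing on the server"]
    else:
        steps = ["Use the OS's kill command to terminate the unresponsive server"]

    # Either way, the text asks for a follow-up investigation.
    steps.append("Identify why the server process stopped responding and correct it")
    return steps


for step in recommended_steps(can_restart_unit=False, has_running_transactions=True):
    print(step)
```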
The following table shows the causes of message queue stagnation and the corrective measures to take for the server whose message queue is being monitored.
Table 8-12 Causes of message queue stagnation and corrective measures
The FES, BES, DS, and SDS columns indicate the server process being monitored.

Cause | FES | BES | DS | SDS | Corrective measure |
---|---|---|---|---|---|
Messages cannot be read from the message queue because of a high CPU load. | Y | Y | Y | Y | Check the cause of the CPU overload and take appropriate corrective action. |
Messages cannot be read from the message queue because an I/O error is delaying input/output. | Y | Y | Y | Y | Check the cause of the I/O error and take appropriate corrective action. |
If the number of concurrent connection requests exceeds the value of the pd_max_users operand, the number of processes available for reading messages from the message queue is insufficient (this tends to occur when the high-speed connection facility is used). | Y | -- | -- | Y | Either reduce the number of concurrent connection requests or connect normally without using the high-speed connection facility. Alternatively, increase the value of the pd_max_users operand. |
The number of processes available for reading messages from the message queue is insufficient when the number of active back-end servers or dictionary servers is smaller than the number of active front-end servers or utility servers (this tends to occur in a multiple front-end server environment). | -- | Y | Y | -- | Ensure that the correct values are specified in the pd_max_bes_process, pd_max_dic_process, and pd_max_users operands. Alternatively, reduce the number of connections. |
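For quick triage, the applicability columns of Table 8-12 can be encoded as a lookup that lists the candidate causes for a given monitored server. This is a sketch, not part of HiRDB: the short cause descriptions are paraphrases of the table rows, and the names `CAUSES` and `candidate_causes` are hypothetical.

```python
from typing import List

ALL_SERVERS = {"FES", "BES", "DS", "SDS"}

# (paraphrased cause, monitored servers marked Y in Table 8-12)
CAUSES = [
    ("High CPU load", ALL_SERVERS),
    ("I/O delayed by an I/O error", ALL_SERVERS),
    ("Concurrent connection requests exceed pd_max_users", {"FES", "SDS"}),
    ("Fewer active back-end or dictionary servers than "
     "active front-end or utility servers", {"BES", "DS"}),
]


def candidate_causes(server_type: str) -> List[str]:
    """List the table's causes that apply to the given monitored server."""
    key = server_type.upper()
    if key not in ALL_SERVERS:
        raise ValueError(f"unknown server type: {server_type!r}")
    return [cause for cause, servers in CAUSES if key in servers]


for cause in candidate_causes("BES"):
    print(cause)
```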
You can use the pdls -d scd command to determine the time at which the last message was read from the message queue.
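If the check is run from a script, the command can be wrapped as below. This is a minimal sketch assuming pdls is on the PATH and the HiRDB environment is already set up for the calling user. The output format is not described here, so the sketch returns the text as-is without parsing it; `show_queue_status` is a hypothetical helper name.

```python
import shutil
import subprocess

# The command named in the text, with no additional options assumed.
PDLS_SCD = ["pdls", "-d", "scd"]


def show_queue_status() -> str:
    """Run pdls -d scd and return its raw output, or a notice if pdls is missing."""
    if shutil.which(PDLS_SCD[0]) is None:
        return "pdls not found on PATH"
    result = subprocess.run(PDLS_SCD, capture_output=True, text=True, check=False)
    # Return whichever stream carries text; interpretation is left to the operator.
    return result.stdout or result.stderr


if __name__ == "__main__":
    print(show_queue_status())
```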
All Rights Reserved. Copyright (C) 2011, 2015, Hitachi, Ltd.