2.7.5 Remote host monitoring with the health check function

The health check function is meant to detect problems in JP1/Base, but this is not possible if a hangup or other error occurs in the function itself. Also, in a system that uses JP1/IM - Manager, if an error occurs in the event service, JP1 events cannot be issued or forwarded, so the higher-level host cannot be notified even if an error is detected.

To cover cases in which a process error on the local host can be neither detected nor reported, you can monitor the JP1/Base health check function and the event service from a remote host. A maximum of 2,500 remote hosts can be monitored from one host.

The following describes how to monitor a remote host in a system that uses JP1/IM - Manager, and how the system operates when monitoring a remote host.

Organization of this subsection

(1) Remote host monitoring in a system that uses JP1/IM - Manager
(2) Operation with a large number of monitored hosts
(3) Checking the time spent on each monitoring operation
(4) Operation when errors occur in a hierarchical configuration
(5) Operation when a monitoring error occurs due to a temporary failure
(6) Reviewing the monitoring interval
(7) Reviewing the communication timeout value
(8) Operation when a monitored host stops

(1) Remote host monitoring in a system that uses JP1/IM - Manager

You can monitor whether the JP1/Base health check function and event service are operating normally on the remote hosts.

The following describes remote host monitoring in a system that uses JP1/IM - Manager, based on the following configuration example.

Figure 2‒36: Example of remote host monitoring in a system that uses JP1/IM - Manager

The hosts in this example have the following settings.

Host	Purpose	Setting for remote host monitoring
hostA	Manager host	Monitor hostB and hostX.
hostB	Submanager host	Monitor hostA, hostY, and hostZ.
hostX	Agent host	None
hostY	Agent host	None
hostZ	Agent host	None

The following processing is performed if an error occurs in the health check function or event service at agent hostY or manager hostA.

Error in the health check function at hostY: The health check function at hostB detects the error and issues a JP1 event. The JP1 event is forwarded to hostA. At hostA, a message about the problem at hostY appears in JP1/IM - View.
Error in the event service at hostY: The health check function on hostY detects an error, but cannot issue a JP1 event. Therefore, the health check function at hostB detects the error and issues a JP1 event. The JP1 event issued by hostB is forwarded to hostA. At hostA, a message about the problem at hostY appears in JP1/IM - View.
Error in the health check function at hostA: The health check function at hostB detects the error and issues a JP1 event. The JP1 event is forwarded to hostA. At hostA, a message about the problem on the local host appears in JP1/IM - View.
Error in the event service at hostA: If the health check function is enabled at JP1/IM - Manager on hostA, the health check function at JP1/IM - Manager detects the error in the event service on the local host and a message appears in JP1/IM - View.
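
The monitoring relationships in the table above can be sketched as a small lookup. This is only an illustration of the example configuration, not part of JP1/Base; the names MONITORS and monitored_by are hypothetical.

```python
# Monitoring relationships from the example table:
# monitoring host -> hosts that it monitors remotely.
MONITORS = {
    "hostA": ["hostB", "hostX"],
    "hostB": ["hostA", "hostY", "hostZ"],
}


def monitored_by(target):
    """Return the hosts that remotely monitor the given host."""
    return sorted(m for m, targets in MONITORS.items() if target in targets)


print(monitored_by("hostY"))  # ['hostB']
print(monitored_by("hostA"))  # ['hostB']
```

In this configuration every host has exactly one remote monitor, which is why an error at hostY or hostA is detected by the health check function at hostB.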

(2) Operation with a large number of monitored hosts

When two or more remote hosts are monitored from a single host, the health check function checks the status of the JP1/Base processes at each host in turn. It takes about 3 seconds at each host. This can take a long time if there are a large number of hosts to monitor.

For example, for one host to check 200 hosts might take about 600 seconds from start to finish. You can reduce the monitoring time by splitting the target hosts into groups, and setting a dummy manager host for each group.

Figure 2‒37: Example of monitoring 200 hosts

In this example, the target hosts are split into groups of 20 hosts each. Manager hostA monitors the dummy manager hosts (host1, host21, and so on). Because monitoring is performed by group rather than host by host from a single manager, the monitoring time can be cut to about 60 seconds (20 hosts at about 3 seconds each).
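
The arithmetic behind these estimates can be sketched as follows. The figure of about 3 seconds per host comes from this subsection; the function name is hypothetical, and the sketch assumes that hosts are checked strictly one after another.

```python
SECONDS_PER_HOST = 3  # approximate check time per host, as stated above


def sequential_monitoring_time(host_count):
    """Approximate time in seconds to check the given number of hosts in turn."""
    return host_count * SECONDS_PER_HOST


print(sequential_monitoring_time(200))  # 600
print(sequential_monitoring_time(20))   # 60
```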

(3) Checking the time spent on each monitoring operation

To check the time spent on each monitoring (polling) operation, see the polling completion message (KAVA7239-I), which is output each time monitoring finishes:

KAVA7239-I Monitoring by the health check function is complete. (host name = manager-host-name, monitoring time = monitoring-time-in-seconds, monitoring interval = monitoring-interval-in-seconds)

The polling completion message (KAVA7239-I) is output if YES is specified for the POLLENDMSG parameter in the health check definition file. When JP1/Base is installed, the default setting is that the polling completion message is not output. For details about the health check definition file, see Health check definition file in 16. Definition Files.
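
If you collect these messages from a log, the monitoring time and interval can be extracted with a regular expression. The following is a minimal sketch that assumes the message appears on one line in the format shown above and that both values are plain integers; the sample values (hostA, 57, 300) are made up for illustration.

```python
import re

# Pattern built from the KAVA7239-I message format shown above.
PATTERN = re.compile(
    r"KAVA7239-I .*?\(host name = (?P<host>[^,]+), "
    r"monitoring time = (?P<time>\d+), "
    r"monitoring interval = (?P<interval>\d+)\)"
)


def parse_polling_message(line):
    """Return (host, monitoring time, monitoring interval), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group("host"), int(match.group("time")), int(match.group("interval"))


# Hypothetical sample line; the values are illustrative only.
sample = (
    "KAVA7239-I Monitoring by the health check function is complete. "
    "(host name = hostA, monitoring time = 57, monitoring interval = 300)"
)
print(parse_polling_message(sample))  # ('hostA', 57, 300)
```

Comparing the extracted monitoring time with the monitoring interval shows how much of each interval a polling pass consumes.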

(4) Operation when errors occur in a hierarchical configuration

The following describes error handling when the target hosts are arranged in a hierarchy, as in the figure below.

Figure 2‒38: Example of error handling in a hierarchical configuration

If an error occurs in the health check function or event service at hostB, errors at hostD and hostE being monitored by hostB cannot be detected or reported.

If hostB is restored quickly, any JP1 event issued because of an error at hostD or hostE while hostB was stopped will be forwarded when hostB retries the send operation at recovery. If recovery of hostB takes a long time, you must change the settings in the health check definition file (jbshc.conf) so that hostD and hostE are monitored directly by hostA until hostB is restored.

As illustrated in this example, in a hierarchical configuration, it is a good idea to prepare a health check definition file (jbshc.conf), specifying that the agent hosts are to be monitored directly from the manager host in the event of an error on the submanager host.

(5) Operation when a monitoring error occurs due to a temporary failure

If the health check function cannot monitor a remote host because of, for example, a system overload or a network problem, the function reports an error to the monitoring host (manager).

However, if the cause of the error is temporary, the error might be corrected spontaneously after a while. Therefore, you might not want all errors to be reported to the monitoring host. In such a case, you can set a threshold (monitoring threshold) to distinguish real errors from temporary errors. Specify how many times monitoring can fail in succession as the monitoring threshold. When the specified threshold is reached, the function judges that an error has occurred on the monitored host and reports an error to the monitoring host.

The following figure compares the behavior in cases where no monitoring threshold is set and where the monitoring threshold is set to 3.

Figure 2‒39: Behavior of the health check function in cases where a monitoring threshold is set and not set

As shown in the figure above, if no monitoring threshold is set, an error message is output at the first failure in monitoring. On the other hand, if the monitoring threshold is set to 3, an error message is output at the third successive failure in monitoring. Note that setting the monitoring threshold to 1 is equivalent to setting no monitoring threshold.
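
The threshold behavior described above can be modeled with a counter of successive failures. This is a simplified sketch of the described behavior, not the JP1/Base implementation; the function name and the list-of-booleans input are assumptions made for illustration.

```python
def first_report_index(results, threshold=1):
    """Return the 1-based polling number at which an error is reported.

    results is a sequence of booleans (True = monitoring succeeded).
    Returns None if the threshold of successive failures is never reached.
    """
    consecutive_failures = 0
    for index, succeeded in enumerate(results, start=1):
        # A success resets the count; a failure extends the current run.
        consecutive_failures = 0 if succeeded else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return index
    return None


print(first_report_index([False, False, False], threshold=1))        # 1
print(first_report_index([False, False, False], threshold=3))        # 3
print(first_report_index([False, False, True, False], threshold=3))  # None
```

The last call shows the purpose of the threshold: two failures followed by a success are treated as temporary and are not reported.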

You can specify the monitoring threshold in the health check definition file (jbshc.conf). Note that setting the monitoring threshold causes a delay in detection of an error on the monitored host. The delay (in seconds) can be approximated as follows: monitoring-interval x (monitoring-threshold - 1). In normal operation, do not change the monitoring threshold from the initial value of 1. If you change the monitoring threshold from the initial value, determine the new value through careful consideration of the conditions in which failures and errors occur.
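
As a worked example of the delay approximation, the following sketch evaluates monitoring-interval x (monitoring-threshold - 1). The interval of 300 seconds is a made-up value for illustration, not a documented default.

```python
def detection_delay(monitoring_interval, monitoring_threshold):
    """Approximate delay in seconds before an error is reported."""
    return monitoring_interval * (monitoring_threshold - 1)


print(detection_delay(300, 1))  # 0
print(detection_delay(300, 3))  # 600
```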

Use the messages that are output to determine whether the cause of a monitoring failure is temporary. The guidelines are as follows.

If monitoring fails, message KAVA7223-E or KAVA7229-W is output.
If recovery of connection is confirmed by a subsequent connectivity check, message KAVA7224-I is output.

If messages are output in the above order, this indicates that a temporary failure occurred and was then corrected in a short time. In this case, monitoring failed because, for example, connection to the monitored host failed, a session with the monitored host was closed, or connection (or communication) timed out. For the cause of failure, see the detailed information that is output in message KAVA7223-E or KAVA7229-W. Note that detailed information is output to messages if the ERROR_DETAIL parameter is enabled (ON) in the health check definition file (jbshc.conf). If the cause of failure in monitoring is a connection or communication timeout, you also have to review the timeout value.
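
The message order described above can be checked mechanically. The following sketch counts failures that were followed by a recovery message; it assumes that the message IDs for a single monitored host have already been extracted from the log in chronological order, and the function name is hypothetical.

```python
FAILURE_IDS = {"KAVA7223-E", "KAVA7229-W"}
RECOVERY_ID = "KAVA7224-I"


def count_temporary_failures(message_ids):
    """Count failures that were later followed by a recovery message."""
    count = 0
    failure_pending = False
    for message_id in message_ids:
        if message_id in FAILURE_IDS:
            failure_pending = True
        elif message_id == RECOVERY_ID and failure_pending:
            count += 1
            failure_pending = False
    return count


# One failure that recovered, then one failure that is still unresolved.
print(count_temporary_failures(["KAVA7223-E", "KAVA7224-I", "KAVA7229-W"]))  # 1
```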

(6) Reviewing the monitoring interval

In the health check definition file (jbshc.conf), you can specify an interval for monitoring remote hosts. Perform a trial run before you start operations, and check whether the specified monitoring interval is appropriate. If message KAVA7219-W is output to the integrated trace log, the monitoring interval might be too short. Change the interval, referring to the estimate equation given in Health check definition file in 16. Definition Files.

(7) Reviewing the communication timeout value

If the system load on a monitored host is high, a delay in response from the monitored host to the monitoring host (manager) might cause a failure in monitoring. In such a case, the possibility of failure due to a delay in response can be reduced by adjusting the communication timeout value on the manager side.

You can specify the communication timeout value in the health check definition file (jbshc.conf). When you specify the communication timeout value, do not set a time longer than the monitoring interval. If the communication timeout value is longer than the monitoring interval, monitoring might not finish within a monitoring interval. In normal operation, do not change the communication timeout value from the initial value of 60 (seconds). If a processing delay occurs in a condition such as shown below, review the communication timeout value according to the operational condition:

The system load increases temporarily or periodically.
The system is operating at nearly full performance.
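
The constraint that the communication timeout must not exceed the monitoring interval can be expressed as a simple check before you edit the definition file. This is an illustrative sketch; the function name is hypothetical, both values are in seconds, and the interval of 300 is a made-up value.

```python
DEFAULT_TIMEOUT = 60  # initial communication timeout value (seconds), as stated above


def timeout_is_acceptable(timeout, monitoring_interval):
    """Return True if the timeout is not longer than the monitoring interval."""
    return timeout <= monitoring_interval


print(timeout_is_acceptable(DEFAULT_TIMEOUT, 300))  # True
print(timeout_is_acceptable(600, 300))              # False
```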

To determine whether a communication timeout occurred, check the detailed information in message KAVA7223-E or KAVA7229-W, which is output when monitoring fails. A communication timeout occurred if the detailed information gives a connection or communication timeout as the cause. Detailed information can be output to messages if the ERROR_DETAIL parameter is enabled (ON) in the health check definition file (jbshc.conf).

(8) Operation when a monitored host stops

If version 10-00 or later of JP1/Base is installed on both the manager host and the monitored host, you can choose whether to monitor the starting and stopping of the monitored host. If you choose to monitor this activity, you can prevent an error from being reported when a host shuts down normally as scheduled.

The following figure shows how the system behaves when JP1/Base monitors the startup and shutdown of monitored hosts, and when it does not.

Figure 2‒40: Operation when monitoring and not monitoring monitored host startup and shutdown

JP1/Base issues JP1 events when it starts and when it stops. If JP1/Base is configured to monitor when monitored hosts start and stop, and receives a JP1 event reporting that a monitored host has stopped, it outputs the message KAVA7228-I. Although JP1/Base will continue to check the connection to the monitored host at the specified monitoring interval, it will not declare an error if a connection cannot be established.

Note that, while JP1/Base on the manager host is stopped, JP1/Base cannot receive a JP1 event reporting that a monitored host has stopped. Therefore, after JP1/Base on the manager host has started, if monitoring fails for a monitored host that has not been monitored before, JP1/Base issues the message KAVA7229-W.

KAVA7229-W Monitoring cannot be performed because a connection cannot be established with Host B, which is not receiving stop notifications.

If monitoring fails for a monitored host that has been monitored before, or for one that sent a JP1 event reporting that it had started, JP1/Base issues the message KAVA7223-E.

KAVA7223-E Monitoring cannot be performed because a connection with Host B cannot be established.

When a monitored host for which the message KAVA7229-W or KAVA7223-E was output becomes available for monitoring again, either because a connectivity check confirms the connection or because a JP1 event reporting that the host has started is received, JP1/Base issues the message KAVA7224-I and resumes monitoring the host.

Note: If you want to monitor the starting and stopping of monitored hosts, you must configure the monitored hosts to send JP1 events reporting that the host has started and stopped to the manager host.

In contrast, if JP1/Base is configured to not monitor the starting and stopping of monitored hosts, it does not output a message when it receives a JP1 event reporting that an agent host has stopped. In this scenario, JP1/Base continues to monitor the host in the normal way, even after it receives a JP1 event reporting that the host has stopped. If JP1/Base cannot connect with that host, it issues the message KAVA7223-E.
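
The rules in this subsection for which message is issued when a connection cannot be established can be summarized as a decision function. This is a sketch of the behavior described in the text, not JP1/Base code; the function and parameter names are hypothetical.

```python
def message_on_connection_failure(monitor_start_stop, stop_event_received, known_running):
    """Return the message ID issued when a monitored host cannot be reached.

    monitor_start_stop:  True if starting and stopping of hosts is monitored.
    stop_event_received: True if a JP1 event reported that the host stopped.
    known_running:       True if the host was monitored before or sent a
                         JP1 event reporting that it had started.
    Returns None when no error is declared.
    """
    if not monitor_start_stop:
        return "KAVA7223-E"  # hosts are always monitored in the normal way
    if stop_event_received:
        return None          # KAVA7228-I was output when the stop event arrived
    if known_running:
        return "KAVA7223-E"
    return "KAVA7229-W"      # no stop notification could have been received


print(message_on_connection_failure(True, True, True))    # None
print(message_on_connection_failure(True, False, False))  # KAVA7229-W
print(message_on_connection_failure(False, True, True))   # KAVA7223-E
```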

The message output when the health check function (for remote host monitoring) starts indicates whether JP1/Base is configured to monitor the starting and stopping of monitored hosts. The following table shows which setting is indicated by each message ID.

Setting	Message ID
Monitor starting and stopping of monitored hosts	KAVA7231-I
Do not monitor starting and stopping of monitored hosts	KAVA7230-I

You can specify whether JP1/Base monitors the starting and stopping of monitored hosts by the setting of the STOP_CHECK parameter in the health check definition file (jbshc.conf).
