2.5.5 Health check
The health check function checks the operating status of APM on the monitoring server and checks for discrepancies in monitoring conditions between SSO and APM.
When it detects such discrepancies, the health check function synchronizes the monitoring conditions between SSO and APM.
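The discrepancy check and synchronization described above can be pictured with a minimal Python sketch. It is an illustration only, not the product's implementation: the function names, the dictionary representation of monitoring conditions, and the condition values are all hypothetical, and the sketch assumes that the SSO copy is treated as the reference (the text does not state which side takes precedence).

```python
def find_discrepancies(sso_conditions, apm_conditions):
    """Return {target: (sso_value, apm_value)} for every target whose conditions differ.

    A value of None means the target is not defined on that side.
    """
    diffs = {}
    for target in sso_conditions.keys() | apm_conditions.keys():
        sso_value = sso_conditions.get(target)
        apm_value = apm_conditions.get(target)
        if sso_value != apm_value:
            diffs[target] = (sso_value, apm_value)
    return diffs


def synchronize(sso_conditions, apm_conditions):
    """Make the APM copy match the SSO copy (assumed direction); return what differed."""
    diffs = find_discrepancies(sso_conditions, apm_conditions)
    for target, (sso_value, _apm_value) in diffs.items():
        if sso_value is None:
            # Target is unknown to SSO, so drop it from the APM copy.
            apm_conditions.pop(target, None)
        else:
            apm_conditions[target] = sso_value
    return diffs
```

In this sketch, calling synchronize() with two dictionaries updates the second one in place and returns the targets that were out of step, which a caller could log.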
The function can conduct three types of checks: a system health check, a regular health check, and an on-demand health check. The following describes these types of checks.
(1) System health check
A system health check is a health check that SSO always performs automatically.
The following are the execution triggers for system health checks:
- JP1/Cm2/SSO is started.
- A monitoring server is added.
- The startup event is received from APM.
- Monitoring conditions are changed, or discrepancies in monitoring conditions are corrected, for a monitoring server that is in a status in which monitoring is impossible.
- An event other than an event indicating that monitoring is in progress is received from APM.
- An error occurs during TCP communication with APM.
The following describes settings related to the system health check.
(a) System health check at SSO startup
The monitoring manager might be in a high load state when SSO starts. In such a high load state, if a health check is conducted on all the monitoring servers, a communication overload might occur when, for example, the monitoring manager receives monitoring start requests, causing the health check to fail. To prevent such health check failures, you can specify settings that delay the start of health checks or that conduct health checks sequentially for a certain group of monitoring servers. The relevant settings are the following keys in the ssoapmon action definition file (ssoapmon.def):
- sso-start-hcheck-delay
- sso-start-hcheck-interval
- sso-start-hcheck-unit
The figure below shows an overview of the system health checks conducted at SSO startup, including the operations controlled by the keys listed above. Note that the figure shows an example in which sso-start-hcheck-unit is set to 1 and a health check is performed for each APM.
The system health checks conducted at SSO startup notify SSO of the states of all the processes and services monitored on the monitoring servers as events.
By setting an interval time for system health checks at SSO startup, you can distribute monitoring start requests to the monitoring servers, making it possible to distribute the load of receiving status change events from the monitoring servers. You can also distribute the NNMi processing load caused by issuance of incidents.
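The effect of the three keys can be illustrated with a small Python sketch that computes when each group of monitoring servers would be checked. This is a simplified model under stated assumptions, not the actual SSO scheduler: it assumes that sso-start-hcheck-delay is the wait before the first health check, sso-start-hcheck-interval is the wait between successive groups, and sso-start-hcheck-unit is the number of monitoring servers per group. The function name, the host names, and the use of seconds are hypothetical.

```python
def startup_health_check_schedule(servers, delay, interval, unit):
    """Return a list of (start_offset, [servers]) groups for health checks at startup.

    delay    -- assumed meaning of sso-start-hcheck-delay (wait before the first group)
    interval -- assumed meaning of sso-start-hcheck-interval (wait between groups)
    unit     -- assumed meaning of sso-start-hcheck-unit (servers per group)
    """
    if unit < 1:
        raise ValueError("unit must be at least 1")
    schedule = []
    for group_index, start in enumerate(range(0, len(servers), unit)):
        offset = delay + group_index * interval
        schedule.append((offset, servers[start:start + unit]))
    return schedule


if __name__ == "__main__":
    # Hypothetical values: three monitoring servers, one APM per group.
    for offset, group in startup_health_check_schedule(
            ["apm1", "apm2", "apm3"], delay=60, interval=10, unit=1):
        print(offset, group)
```

With a unit of 1, each APM is checked on its own and the checks are spread out by the interval, which mirrors the load distribution described above.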
(b) System health check at APM startup event reception
When APM starts, it issues a startup event to SSO. After receiving the event, SSO waits for the health check delay time at APM startup to elapse and then runs a health check on APM. The delay time is specified in the ssoapmon action definition file. The following figure shows an overview of a system health check when an APM startup event is received.
After receiving an APM startup event, SSO conducts a health check on the target monitoring server. When the health check finishes, the statuses of all monitored processes and services are reported to SSO as events.
APM or the whole machine might be in a high load state when the monitoring server starts. In this case, the response to the health check might be delayed, causing the health check to fail. You can avoid such failures by setting a delay time before the health check starts at APM startup.
(2) Regular health check
A regular health check is a health check that SSO executes periodically.
To enable regular health checks, set a health check interval either from the Set Health Check Interval window or in the hcheck key in the monitoring server definition file.
(a) Health check retry function
SSO periodically executes a regular health check. At such times, if the monitoring server is in a high load state, response from APM might be delayed. If this delay leads to a timeout, the health check will fail. To prevent the health check from failing in such a situation, use the health check retry function. You can set this function to retry execution of the health check when the health check temporarily fails.
If a regular health check fails, the health check is retried up to the number of times set for hcheck-retry-count in the ssoapmon action definition file. Retries are executed at the time interval set for hcheck-retry-interval in the same file. If the health check still fails after the set number of retries, the status of APM processes and services becomes Unknown. The following figure shows an overview of retrying a health check.
#1
  For details on the monitoring status, see 2.5.2(1) Monitored status management.
#2
  This is the health check interval that can be set in the Set Health Check Interval window.
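The retry behavior described in this subsection can be sketched in Python as follows. This is a minimal model, not the product's code: the function name and the check callable are hypothetical, and the sketch assumes that hcheck-retry-count counts retries in addition to the initial attempt and that hcheck-retry-interval is a wait in seconds between attempts.

```python
import time


def run_regular_health_check(check, retry_count, retry_interval, sleep=time.sleep):
    """Run check() once, then retry up to retry_count times on failure.

    check          -- hypothetical callable returning True when APM responds in time
    retry_count    -- assumed meaning of hcheck-retry-count (retries after the first try)
    retry_interval -- assumed meaning of hcheck-retry-interval (wait between attempts)

    Returns True on success. Returns False when every attempt failed; a caller
    would then set the status of the APM processes and services to Unknown.
    """
    attempts = 1 + retry_count
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            # Wait before the next retry; no wait after the final attempt.
            sleep(retry_interval)
    return False
```

The sleep parameter exists only so that the wait can be replaced in tests. A temporary timeout that clears within the retry window therefore does not change the monitored status.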
(b) Communication protocols of a regular health check
By default, regular health checks communicate via the SNMP (UDP) protocol. However, by enabling the TCP health check function, you can perform health checks with the highly reliable TCP protocol. The communication method for regular health checks can be selected for each monitoring server. The following figure shows an example of a system that uses both of these protocols for health checks.
As shown in the above figure, some definition files are required for enabling the TCP health check function. Make sure that the necessary definition files are created and set up on both the monitoring manager (SSO side) and the monitoring server (APM side) as shown below.
- On the monitoring manager (SSO side)
  Set up the TCP agent definition file (ssotcpagent.conf). In addition, if necessary, adjust the value of the connect-retry-interval key in the ssoapmon action definition file (ssoapmon.def).
- On the monitoring server (APM side)
  Set up the TCP service definition file (apmtcpserv.conf). Setting up this file is unnecessary on monitoring servers that are not subject to TCP health checks.
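The per-server choice of protocol can be pictured with the short Python sketch below. It models only the decision and does not read or reproduce the format of ssotcpagent.conf or apmtcpserv.conf. The function name, the host names, and the set of TCP-enabled servers are hypothetical.

```python
def plan_health_check_protocols(servers, tcp_enabled_servers):
    """Map each monitoring server to the protocol used for its regular health check.

    Servers listed in tcp_enabled_servers (hypothetical stand-in for the servers
    configured for TCP health checks) use TCP; all others use the default, SNMP (UDP).
    """
    return {
        server: "TCP" if server in tcp_enabled_servers else "SNMP (UDP)"
        for server in servers
    }
```

For example, with servers ["apm1", "apm2"] and only "apm2" enabled for TCP, the sketch assigns SNMP (UDP) to apm1 and TCP to apm2, which corresponds to a mixed system like the one described above.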