Hitachi

JP1 Version 12 JP1/Performance Management User's Guide


16.2.4 Notes on the health check function

This subsection contains notes on the health check function.

Legend:

↓ ... ↓: The digits after the decimal point are cut off.

(1) Polling-related settings of the health check function

The health check function polls each agent host to check whether it is active. The following table describes how the number of hosts that can be polled simultaneously, the polling interval for each agent host, and the polling timeout time are determined for polling by the health check function.

Table 16‒16: Polling-related settings of the health check function

The settings are described for the following two cases:

  • Parallel check mode enabled: PFM - Manager is 10-00 or later and the parallel check mode is enabled.

  • Parallel check mode disabled: PFM - Manager is 09-00 or earlier, or PFM - Manager is 10-00 or later and the parallel check mode is disabled.

Number of hosts that can be polled simultaneously

  • Parallel check mode enabled: Usually, the setting does not need to be changed. To change the setting, use the Parallel Confirmation Count property. Default: 10 (hosts)

  • Parallel check mode disabled: 1 (cannot be changed)

Polling interval

  • Parallel check mode enabled: Usually, the setting does not need to be changed. To change the setting, use the Minimum Period per Host property. Default: 0 (seconds)

  • Parallel check mode disabled: The polling interval is automatically calculated based on the value of the Collection Interval property and the number of hosts in the system (the minimum is 2 seconds). To change the polling interval, change the value of the Collection Interval property.

Polling timeout time

  • Parallel check mode enabled: Usually, the setting does not need to be changed. To change the setting, use the Timeout Period per Host property. Default: 30 (seconds)

  • Parallel check mode disabled: The polling timeout time is automatically calculated based on the polling interval (2 to 10 seconds). To change the polling timeout time, change the value of the Collection Interval property.

When the parallel check mode is enabled, a sufficient polling timeout time is allocated regardless of the number of monitored hosts in the system. When the parallel check mode is disabled, a sufficient value must be set for the Collection Interval property for the number of monitored hosts in the system because the polling timeout time is automatically calculated based on the value of the Collection Interval property and the number of monitored hosts in the system.

If an inappropriate polling interval is set, PFM - Agent or PFM - RM might frequently be judged to be inactive, polling might not complete for all agent hosts within the set interval#, or other problems might occur. This subsection describes how to check the currently set polling interval and how to correct it if it is inadequate.

#

For details on the behavior of the health check function if polling is not completed within the set interval for all agent hosts, see 16.2.4(2) Behavior of the health check function if polling is not completed for all agent hosts within the specified interval and 16.2.4(3)(b) Occasions for storing data and evaluating alarms if polling is not completed for all agent hosts within the specified interval.

(a) When the parallel check mode is enabled

When the parallel check mode is enabled, you can configure the properties described in the table below. Usually, you do not need to change the values of the properties. However, you can tune the operation of the health check function by changing the values of the properties if incorrect operating statuses are frequently detected or polling does not complete within the interval specified for the Collection Interval property.

Table 16‒17: Properties to be configured when the parallel check mode is enabled

Parallel Confirmation Count

  • Description: Number of hosts that can be polled simultaneously when the parallel check mode is enabled

  • Purpose for using this property: Set a large value when many monitored hosts exist in the system and you want to reduce the length of time required for checking all the hosts in the system. Because the health check function checks as many hosts as specified for this property at one time, setting a large value for this property reduces the length of time required for checking all the hosts in the system. Set a small value if you want to reduce the load of communication for checking. When you set a small value, the amount of communication performed for parallel checking is reduced, decreasing the amount of communication per unit time.

Minimum Period per Host

  • Description: Minimum polling interval (seconds) per host when the parallel check mode is enabled

  • Purpose for using this property: Specify the length of time to wait between the beginning of polling for a host and the beginning of polling for the next host if you want to avoid congested communication for checking at the same time. When you specify 0, the next host is immediately polled after polling for the previous host ends. Note that polling is performed simultaneously for the number of hosts specified for the Parallel Confirmation Count property even if the Minimum Period per Host property is set. Use the Minimum Period per Host property to increase the interval between the successive polling for hosts.

Timeout Period per Host

  • Description: Polling timeout time (seconds) per host when the parallel check mode is enabled

  • Purpose for using this property: If polling a monitored host takes too long and times out due to heavy load on the network or the monitored host, the health check function might incorrectly detect the operating status of the host (such as Unconfirmed). If timeouts occur frequently, set a large value for this property. If you set a large value, it takes a long time to check all the hosts in the system. You need to increase the polling interval as well.

The following describes the purpose of each property.

■ Purpose of the number of hosts that can be polled simultaneously (Parallel Confirmation Count property)

This property enables parallel polling for the specified number of hosts. See the following figure for an example.

[Figure]

■ Purpose of the minimum polling interval (Minimum Period per Host property)

When 0 is set for the Minimum Period per Host property, the health check function immediately polls the next host when polling for the previous host ends. See the following figure for an example.

[Figure]

If a value greater than 0 is set for the Minimum Period per Host property, polling for the next host waits until the length of time set for the Minimum Period per Host property elapses from the beginning of polling for the previous host. This applies even if polling for the previous host is completed before the length of time set for the Minimum Period per Host property expires. If polling for a host ends after the length of time set for the Minimum Period per Host property ends, the health check function immediately starts to poll the next host. See the following figure for an example.

[Figure]

■ Purpose of the polling timeout time (Timeout Period per Host property)

If polling a host takes too long and exceeds the timeout time set for this property, the health check function determines the status of the host as Unconfirmed, cancels polling, and starts to poll the next host. See the following figure for an example.

[Figure]
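The interaction of the three properties can be explored with a small simulation. The following Python sketch is illustrative only: the function name, the "lane" model, and the per-host response times are assumptions introduced for this example and do not describe the internal implementation of the health check function. The sketch assumes that each of the parallel lanes polls one host at a time, that a lane starts its next poll when both the previous poll has ended and the Minimum Period per Host has elapsed since the previous poll began, and that a poll is cut off at the Timeout Period per Host. The default values are the property defaults listed above.

```python
import heapq


def simulate_parallel_check(response_times, parallel_count=10, min_period=0, timeout=30):
    """Estimate how long one polling round takes under a simplified lane model.

    response_times: hypothetical response time (seconds) of each host, in polling order.
    Returns (total_seconds, timed_out), where timed_out[i] is True if host i
    exceeded the timeout (and would be reported as Unconfirmed).
    """
    if parallel_count < 1:
        raise ValueError("parallel_count must be at least 1")

    # Each heap entry is the earliest time at which a lane may start its next poll.
    lanes = [0] * parallel_count
    heapq.heapify(lanes)

    total = 0
    timed_out = []
    for response in response_times:
        start = heapq.heappop(lanes)          # the lane that becomes available first
        end = start + min(response, timeout)  # polling is canceled at the timeout
        timed_out.append(response > timeout)
        total = max(total, end)
        # Next poll on this lane: after this poll ends and after min_period
        # has elapsed from the beginning of this poll, whichever is later.
        heapq.heappush(lanes, max(end, start + min_period))
    return total, timed_out


if __name__ == "__main__":
    # Four hosts that each answer in 1 second, polled two at a time.
    print(simulate_parallel_check([1, 1, 1, 1], parallel_count=2))
    # The same hosts with a 5-second Minimum Period per Host.
    print(simulate_parallel_check([1, 1, 1, 1], parallel_count=2, min_period=5))
```

Under this model, raising the parallel count shortens the round, raising the minimum period spreads the polls out, and a slow host delays its lane by at most the timeout value.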

(b) If the parallel check mode is disabled

To set an appropriate polling interval for all agent hosts, you need to carefully determine the polling interval and the polling timeout time for each agent host. The following describes how to check the polling interval and the polling timeout time for each agent host.

■ Polling interval for each agent host

The health check agent automatically calculates the polling interval for each agent host based on the number of agent hosts connected to PFM - Manager. If multiple instances of PFM - Agent or PFM - RM are running on the same host, the health check agent treats them as a single polling target, so one polling interval applies to the host.

The following formula describes how to calculate the polling interval for each host. Use the formula if you want to check the polling interval for each agent host in the system you are using:

Polling interval for each host (in seconds) =

↓ (0.7 × polling interval value#1) ÷ total number of hosts#2 ↓

#1

This value is displayed for the Polling Interval property under the Health Check Configurations folder of the health check agent.

#2

Number of hosts on which PFM - Agent or PFM - RM connected to PFM - Manager is running. A host is counted as 1 even if multiple PFM - Agent or PFM - RM products are installed on it or multiple instance environments exist on it.

The minimum polling interval for each host is 2 seconds. If the result of the above formula is less than 2 seconds, polling is performed every 2 seconds.
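The formula and the 2-second minimum can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and integer arithmetic (7 ÷ 10 instead of 0.7) is used only to avoid floating-point rounding when the digits after the decimal point are cut off.

```python
MIN_INTERVAL_SECONDS = 2  # documented lower bound for the per-host polling interval


def per_host_polling_interval(polling_interval: int, host_count: int) -> int:
    """Per-host polling interval (seconds) when the parallel check mode is disabled.

    polling_interval: value of the Polling Interval property (seconds).
    host_count: number of hosts running PFM - Agent or PFM - RM.
    """
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    # (0.7 x polling interval) / hosts, with the digits after the decimal point cut off.
    calculated = (7 * polling_interval) // (10 * host_count)
    return max(calculated, MIN_INTERVAL_SECONDS)


if __name__ == "__main__":
    print(per_host_polling_interval(300, 50))   # 4
    print(per_host_polling_interval(60, 100))   # 2 (the minimum applies)
```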

Note

The value set for the monitoring level of the health check function (value specified for the Monitoring Level property under the Health Check Configurations folder of the health check agent) does not affect polling intervals. For this reason, you do not need to take the value set for the monitoring level into consideration when you check the polling interval for each agent host.

■ Polling timeout time

When the health check function performs polling, it communicates with the Status Server service on the host whose operating status it is checking. The health check function determines the status of an agent based on the response from the Status Server service. If the health check function receives no response from the Status Server service within the timeout time, a timeout error occurs. The timeout time is determined based on the polling interval for each host. When you check the timeout time in the system that is currently running, confirm that it matches one of the following:

  • If the polling interval is 10 seconds or longer, the timeout time must be 10 seconds.

  • If the polling interval is shorter than 10 seconds, the timeout time must be the same as the value of the polling interval (in seconds).

The minimum timeout time is 2 seconds.
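The rules above amount to limiting the per-host polling interval to the range from 2 to 10 seconds. The following Python sketch expresses that rule. The function name is hypothetical, and the lower bound is applied defensively even though the per-host polling interval is itself never less than 2 seconds.

```python
def polling_timeout(per_host_interval: int) -> int:
    """Polling timeout (seconds) derived from the per-host polling interval (seconds)."""
    # 10 seconds or longer -> 10; shorter -> same as the interval; never below 2.
    return max(2, min(per_host_interval, 10))


if __name__ == "__main__":
    for interval in (2, 4, 10, 25):
        print(interval, "->", polling_timeout(interval))
```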

■ Criteria for determining whether the polling interval for all agent hosts is adequate

To determine whether the polling interval specified for all agent hosts is adequate, check the following points, based on the polling interval for each host (see Polling interval for each agent host) and the polling timeout time (see Polling timeout time):

  • Whether the polling interval for each agent host is 10 seconds or longer

  • Whether a polling timeout time suitable for the operating environment is set

Generally, as the number of agent hosts increases, the polling interval for each host and the length of time before occurrence of a timeout error become shorter. When a short timeout time is set, the health check function is more likely to determine that PFM - Agent or PFM - RM has stopped. When the specified polling interval for all agent hosts is too short, polling for all agent hosts does not complete within the specified interval. To prevent this problem, specify a polling interval for all agent hosts that allows at least 10 seconds as the polling interval for each agent host.

The following example describes how to check the polling interval and the polling timeout time for each host and how to determine whether the settings are adequate.

Prerequisites
  • Number of hosts running PFM - Agent or PFM - RM: 50

  • Value for the Polling Interval property: 300

Calculating the polling interval for each host

Polling interval for each host

= ↓ 0.7 × 300 ÷ 50 ↓

= ↓ 4.2 ↓

= 4 (seconds)

Polling timeout time

The polling timeout time is 4 seconds because the polling interval for each host is 4 seconds.

Determining whether the settings are adequate
  • The polling interval for each host is 4 seconds: The interval is less than 10 seconds, which is too short.

  • The polling timeout time is 4 seconds: Depends on the operating environment.

As a result, you can see that the polling interval for all agent hosts needs to be changed to a more adequate value.

■ Procedure for calculating an adequate polling interval for all agent hosts

Perform the following procedure to calculate an adequate polling interval for all agent hosts. When the interval is determined, set the value for the Collection Interval property for the Health Check Detail (PD_HC) record of the health check agent, which is correlated with the value of the Polling Interval property.

To calculate an adequate polling interval for all agent hosts:

  1. Calculate the value of the Polling Interval property from the polling interval for each agent host.

  2. The polling interval for all agent hosts must be a multiple of 60 (seconds). Round the calculated value of the Polling Interval property to the nearest multiple of 60 (seconds).

The following example describes how to estimate an adequate polling interval for all agent hosts by using the formula described in Polling interval for each agent host.

Prerequisites
  • Number of hosts running PFM - Agent or PFM - RM: 50

  • Polling interval for each host: 10 seconds

Calculating the polling interval for all agent hosts

10 = (0.7 × polling interval) ÷ 50

Polling interval

= 50 × 10 ÷ 0.7

= 714.2

≈ 720 (rounded to the nearest multiple of 60)

As a result, 720 is the value to be specified for the Collection Interval property for the Health Check Detail (PD_HC) record as the polling interval for all agent hosts.
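The same calculation can be scripted for other host counts. The following Python sketch inverts the per-host formula and rounds to the nearest multiple of 60, as in the procedure above. The function name and the use of Fraction (to avoid floating-point error) are choices made for this example only.

```python
import math
from fractions import Fraction


def required_collection_interval(host_count: int, per_host_seconds: int = 10) -> int:
    """Collection Interval (seconds) that targets the given per-host polling interval."""
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    # per_host = (0.7 x interval) / hosts  =>  interval = hosts x per_host / 0.7
    raw = Fraction(10 * host_count * per_host_seconds, 7)
    # Round to the nearest multiple of 60 seconds (halves round up).
    multiples = math.floor(raw / 60 + Fraction(1, 2))
    return max(1, multiples) * 60


if __name__ == "__main__":
    print(required_collection_interval(50))  # 720
```

Note that when the nearest multiple of 60 is smaller than the calculated value, the resulting polling interval for each host can be slightly below the target. For example, 44 hosts give 628.5..., which rounds to 600 and yields 9 seconds per host. In such a case, choosing the next larger multiple of 60 is the more conservative option.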

(2) Behavior of the health check function if polling is not completed for all agent hosts within the specified interval

If the polling interval specified for all agent hosts is too short, polling for all agent hosts might not complete within the specified interval. If that occurs, polling continues until it is completed for all agents. Any polling that is scheduled to start while the previous polling is still in progress is skipped.

The figure below describes this case. In the example, the specified polling interval for all agent hosts is 300 seconds, but polling actually takes 390 seconds.

Figure 16‒6: Behavior of the health check function if polling is not completed for all agent hosts within the specified interval

[Figure]

Because the polling that starts at scheduled polling start time (1) does not finish when scheduled polling start time (2) is reached, the scheduled second polling is skipped. The next polling starts at scheduled polling start time (3) after the first polling is completed.
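The skipping behavior can be reproduced with a short Python sketch. The function name and the list of polling durations are hypothetical. The sketch also assumes that a polling that finishes exactly at a scheduled start time does not cause that start to be skipped, which is not stated in this manual.

```python
def actual_polling_starts(interval: int, durations: list[int]) -> list[int]:
    """Return the actual start time of each polling, in seconds from the first start.

    interval: polling interval for all agent hosts (seconds).
    durations: hypothetical time (seconds) that each successive polling takes.
    """
    starts = []
    start = 0
    for duration in durations:
        starts.append(start)
        end = start + duration
        # First scheduled start time at or after the end of this polling.
        next_slot = -(-end // interval) * interval
        # Scheduled starts that fall inside the running polling are skipped.
        start = max(next_slot, start + interval)
    return starts


if __name__ == "__main__":
    # 300-second interval; the first polling takes 390 seconds, so the start
    # scheduled at 300 seconds is skipped and the next polling starts at 600.
    print(actual_polling_starts(300, [390, 200, 100]))  # [0, 600, 900]
```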

(3) Storing health check results and evaluating alarms

As with the usual functions of PFM - Agent and PFM - RM, you can store history data of health check results and evaluate alarms for the health check function. The following are notes on storing history data and evaluating alarms for the health check agent.

(a) Occasions for storing data and evaluating alarms

To store history data of health check results, enable the record collection setting of the health check agent. To evaluate alarms, define alarms for the health check agent. History data is stored and alarms are evaluated when the next polling starts after polling for all agent hosts (the targets of health checking) has completed. This means that history data is not stored and alarms are not evaluated at the moment polling completes.

The figure below describes this case. In the example, the Host Availability (PI_HAVL) record is collected. The polling interval for all agent hosts is 300 seconds, but polling actually takes 180 seconds.

Figure 16‒7: Occasions for storing data and evaluating alarms

[Figure]

The first polling is completed at 10:03:00 and the second polling is completed at 10:08:00. At 10:03:00 and 10:08:00, the next polling is not started yet. For this reason, history data is not stored and alarms are not evaluated at these points. History data is stored and alarms are evaluated at 10:05:00 and 10:10:00 when the next polling starts.

(b) Occasions for storing data and evaluating alarms if polling is not completed for all agent hosts within the specified interval

If polling for all agent hosts does not complete within the polling interval, history data of records is not stored and alarms are not evaluated at the scheduled start time of the next polling. Instead, history data is stored and alarms are evaluated when the next polling actually starts after polling for all agent hosts has completed.

The figure below describes this case. In the example, the Host Availability (PI_HAVL) record is collected. The polling interval for all agent hosts is 300 seconds, but polling actually takes 390 seconds.

Figure 16‒8: Occasions for storing data and evaluating alarms if polling is not completed for all agent hosts within the specified interval

[Figure]

At 10:05:00, which is the start time for the second polling, the first polling is not completed. As a result, history data is not stored and alarms are not evaluated. History data is stored and alarms are evaluated at 10:10:00 when the second polling starts after the first polling is completed.
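The timing in both examples, Figure 16‒7 and Figure 16‒8, follows the same rule: data is stored and alarms are evaluated at the first scheduled polling start time after the current polling has completed. The following Python sketch illustrates that rule. The function name and the date used in the usage example are arbitrary choices for this illustration.

```python
from datetime import datetime, timedelta


def storage_time(polling_start: datetime, interval_seconds: int, duration_seconds: int) -> datetime:
    """Time at which history data is stored and alarms are evaluated for one polling."""
    # Number of whole polling intervals needed to cover the polling, at least one.
    intervals_needed = max(1, -(-duration_seconds // interval_seconds))
    return polling_start + timedelta(seconds=intervals_needed * interval_seconds)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 10, 0, 0)  # arbitrary date; polling starts at 10:00:00
    print(storage_time(start, 300, 180).time())  # 10:05:00 (polling ends at 10:03:00)
    print(storage_time(start, 300, 390).time())  # 10:10:00 (polling ends at 10:06:30)
```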