Hitachi

Job Management Partner 1 Version 10 Job Management Partner 1/IT Service Level Management Description, User's Guide, Reference and Operator's Guide


3.1.4 Using threshold value monitoring for detection of threshold overages

Threshold value monitoring monitors each monitoring item. For details about the monitoring items, see 3.1.1 ITSLM's monitoring methods and types of monitored targets.

This subsection explains threshold value monitoring.


(1) About threshold value monitoring

Threshold value monitoring detects an overage of the threshold set for the performance of a monitored service.

If an SLO has been defined, you can detect an overage of the SLO value by specifying the SLO value as the threshold. If no SLOs have been defined, you can still detect an overage of an assumed service performance criterion by specifying a value that represents that criterion as the threshold.

The following shows an example in which an overage of a threshold is detected by threshold value monitoring.

Figure 3‒18: Example in which an overage of a threshold is detected

[Figure]

This example monitors average response time. As time passed, the service performance value increased until an overage of the threshold was detected.

When an overage of a threshold is detected, an error is displayed in the window.

The following shows an example in which an error is displayed in the window.

Figure 3‒19: Example in which an error is displayed in the window (threshold value monitoring)

[Figure]

The information displayed in the window includes an error icon, the detection date and time, the name of the service group in which the error occurred, and the service name. If service performance keeps exceeding the threshold, an error is displayed only the first time the overage of the threshold is detected. You can view the service performance leading up to and following the displayed error in a graph.

The following shows an example of a graph that is displayed.

Figure 3‒20: Example of a graph that is displayed (threshold value monitoring)

[Figure]

In the graph, an error icon indicates the time the threshold was exceeded and a colored bar indicates the time period during which the event resulting in the overage of the threshold is assumed to have occurred.

To run threshold value monitoring, you must specify the following item in the Settings window:

Threshold

Specifies the reference threshold that is to be used to determine the status of the monitored service.

When linking with Performance Management

If you link ITSLM with Performance Management, you can also run threshold value monitoring for system performance. In threshold value monitoring for system performance, there are two types of monitoring items:

  • Monitoring item to be reported when it exceeds the threshold

  • Monitoring item to be reported when it drops below the threshold

You can determine which type applies to a monitoring item by checking the Monitor settings area in the Settings window. If the icon in the Threshold column is [Figure], the monitoring item is reported when it exceeds the threshold. If the icon in the Threshold column is [Figure], the monitoring item is reported when it drops below the threshold.

(2) Detection criteria

To avoid reporting a transient overage, threshold value monitoring detects an overage of the threshold only if the overage persists. This subsection explains the criteria for detecting an overage of the threshold when service performance is monitored and when system performance is monitored.

When service performance is monitored

The detection criteria depend on the service performance measurement count per 60 seconds and the sloThresholdRate property value specified in ITSLM - Manager's system definition file (jp1itslm.properties). The sloThresholdRate property value is applied to threshold value monitoring for all monitored services.

For details about editing the system definition file, see 5.6.1 Editing the system definition files.

The following table describes the relationship between the sloThresholdRate property value and the behavior of threshold value monitoring.

Table 3‒5: Relationship between sloThresholdRate property value and the behavior of threshold value monitoring

| No. | sloThresholdRate property value (n) | Behavior of threshold value monitoring |
|-----|-------------------------------------|----------------------------------------|
| 1 | 1 | An overage of the threshold is detected when service performance exceeds the threshold even once. When service performance no longer exceeds the threshold, it is determined to have returned to normal. |
| 2 | 2 to 98 | An overage of the threshold is detected when service performance exceeds the threshold S × n ÷ 100 times (rounded up) in 60 seconds. When the number of times service performance exceeds the threshold is less than S × n ÷ 100 times (rounded up) in 60 seconds, service performance is determined to have returned to normal. |
| 3 | 99 to 100 | An overage of the threshold is detected when service performance continues to exceed the threshold for 60 seconds. When service performance falls below the threshold even once, it is determined to have returned to normal. |

Legend:

S: Number of times service performance is measured in 60 seconds

n: sloThresholdRate property value specified in ITSLM - Manager's system definition file (jp1itslm.properties)
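
The rules in Table 3‒5 can be sketched in Python as follows. This is an illustration only, not ITSLM's implementation: the function names required_overages and is_overage_detected are hypothetical, the valid range of 1 to 100 is taken from the rows of the table, and the measurements are assumed to be supplied as a list covering one 60-second period.

```python
def required_overages(s: int, n: int) -> int:
    """Return how many of the s measurements taken in 60 seconds must
    exceed the threshold before an overage is detected.

    s: number of measurements in 60 seconds (S in the legend)
    n: sloThresholdRate property value (n in the legend)
    """
    if not 1 <= n <= 100:
        raise ValueError("n is assumed to be between 1 and 100")
    if n == 1:
        return 1                      # row 1: a single overage is enough
    if n >= 99:
        return s                      # row 3: every measurement must exceed
    return (s * n + 99) // 100        # row 2: S x n / 100, rounded up


def is_overage_detected(values, threshold, n):
    """Apply the Table 3-5 rule to the measurements of one 60-second period."""
    if not values:
        return False                  # nothing measured, nothing to detect
    over_count = sum(1 for v in values if v > threshold)
    return over_count >= required_overages(len(values), n)
```

For example, with 60 measurements and an sloThresholdRate value of 10, required_overages(60, 10) evaluates to 6, so six or more measurements above the threshold in the period result in detection.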

When system performance is monitored

The detection criterion is the number of times the most recent measurements exceeded the threshold. The number of times exceeded and the number of times measured are specified in Occurrence frequency under SLO monitor settings in the Monitor settings area of the Settings window. The following table describes the correspondence between the settings and the criterion for detecting an overage of the threshold.

Table 3‒6: Criterion for detecting an overage of the threshold

| No. | Occurrence frequency settings | Criterion for detecting an overage of the threshold |
|-----|-------------------------------|-----------------------------------------------------|
| 1 | 1 is specified for both times exceeded and times measured | An overage of the threshold is detected when performance data for the current time exceeds the threshold. |
| 2 | A value other than 1 is specified for either times exceeded or times measured or both | An overage of the threshold is detected when the following conditions are both satisfied: • Performance data for the current time exceeds the threshold. • Among the specified number of most recent measurements, the threshold was exceeded more times than specified. If fewer measurements have been acquired than the specified measurement count, this condition is satisfied when the measurements acquired so far have already exceeded the threshold more times than specified. |

The following notes apply to evaluating detection of overages of the threshold:

  • Once a notification is sent, no more notifications are sent until the status returns to normal even if the conditions are satisfied again.

  • When monitoring is stopped, the measurement acquisition count and the number of times an overage of the threshold occurred are initialized to 0. When monitoring is restarted, no previous measurement values obtained before monitoring was stopped are used for new detection.

  • If no measurement value was obtained at a given time due to an error, that time is ignored and as many available most recent measurement values as needed for the specified detection are used.
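
The detection criterion and the notes above can be sketched in Python as follows. This is an illustration under stated assumptions, not ITSLM's implementation. The class name OccurrenceFrequencyDetector and its methods are hypothetical. The sketch assumes that "more times than specified" means "the specified number of times or more", which makes detection the complement of the recovery criterion in Table 3‒7. It also assumes a monitoring item that is reported when it exceeds the threshold, and represents a failed acquisition as None.

```python
from collections import deque
from typing import Optional


class OccurrenceFrequencyDetector:
    """Hypothetical sketch of occurrence-frequency detection."""

    def __init__(self, threshold: float, times_exceeded: int, times_measured: int):
        self.threshold = threshold
        self.times_exceeded = times_exceeded
        # Holds True/False for the most recent successful measurements only.
        self.window = deque(maxlen=times_measured)
        self.notified = False

    def reset(self) -> None:
        """Monitoring stopped: earlier measurements are not reused."""
        self.window.clear()
        self.notified = False  # assumption: the notification state is cleared too

    def observe(self, value: Optional[float]) -> bool:
        """Feed one measurement. Return True only when a new notification
        would be sent."""
        if value is None:
            return False  # acquisition failed: this time is ignored
        self.window.append(value > self.threshold)
        over_count = sum(self.window)
        # Both conditions: the current value exceeds the threshold, and the
        # recent overage count reaches the setting (assumed "or more").
        detected = self.window[-1] and over_count >= self.times_exceeded
        if detected and not self.notified:
            self.notified = True
            return True
        if over_count < self.times_exceeded:
            self.notified = False  # status returned to normal
        return False
```

With a threshold of 80, times exceeded set to 2, and times measured set to 3, feeding the values 90, 70, None, 85, 95, 60, 50, 90, 91 in order produces a notification at 85 and again at 91. No notification is produced at 95, because the status has not yet returned to normal.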

(3) Criteria for determining that performance has returned to normal

This subsection explains the criteria for determining that performance has returned to normal after it exceeded the threshold. The criteria differ for service performance and for system performance.

When service performance is monitored

Performance is determined to have returned to normal when the number of times service performance exceeded the threshold in the past 60 seconds is less than S × n ÷ 100 (rounded up).

S indicates the number of times service performance was measured in 60 seconds. n indicates the sloThresholdRate property value specified in ITSLM - Manager's system definition file (jp1itslm.properties).

For details about editing the system definition file, see 5.6.1 Editing the system definition files.

For example, if S is 60 and n is 10, performance is determined to have returned to normal when the number of times service performance exceeded the threshold is fewer than 6. A transient value below the threshold does not by itself cause recovery to be detected.
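
The interplay of detection and recovery for service performance can be sketched in Python as follows. This is an illustration only, not ITSLM's implementation. The function name track_status is hypothetical, and the input is assumed to be the number of overages counted in each successive 60-second evaluation. The sketch applies the general S × n ÷ 100 rule from this subsection and does not model the special cases for n values of 1 and 99 to 100.

```python
def track_status(overage_counts, s, n):
    """Return one event per evaluation: "error", "recovered", or None.

    overage_counts: overages counted in each successive 60-second evaluation
    s: number of measurements in 60 seconds
    n: sloThresholdRate property value
    """
    limit = (s * n + 99) // 100       # S x n / 100, rounded up
    in_error = False
    events = []
    for count in overage_counts:
        if not in_error and count >= limit:
            in_error = True
            events.append("error")        # reported only the first time
        elif in_error and count < limit:
            in_error = False
            events.append("recovered")    # fewer overages than the limit
        else:
            events.append(None)           # no change in status
    return events
```

With S set to 60 and n set to 10, the counts 2, 6, 7, 6, 5, 3, 8 produce an error at the first 6, a recovery at 5, and a new error at 8. The evaluations in between produce no event.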

When system performance is monitored

The criterion for determining that performance has returned to normal after exceeding the threshold depends on the value specified in Occurrence frequency for the monitored service under SLO monitor settings in the Monitor settings area of the Settings window.

For details about setting Occurrence frequency under SLO monitor settings, see 3.2.8 Setting up the monitoring items for system performance (working with Performance Management).

The following table explains the correspondence between the specifiable values and the criterion for determining that performance has recovered.

Table 3‒7: Criterion for determining that performance has returned to normal after it exceeded the threshold

| No. | Occurrence frequency values | Criterion for determining recovery |
|-----|-----------------------------|------------------------------------|
| 1 | 1 is specified for both M and N | Performance is determined to have returned to normal when the performance data for the current time does not exceed the threshold. |
| 2 | A value other than 1 is specified for either M or N or for both | Performance is determined to have returned to normal when the number of times the most recent M measurement values exceeded the threshold is less than N. When the measurement acquisition count is less than M, performance is determined to have returned to normal if the number of times all the measurement values obtained so far exceeded the threshold is less than N. |

Legend:

M: Number of measurements taken as specified for Occurrence frequency (measured) under SLO monitor settings

N: Number of times a measured value is allowed to exceed the limit as specified for Occurrence frequency (Times exceeded) under SLO monitor settings

Recovery is determined when the values of the reported monitoring items are updated. Therefore, recovery is determined at the interval at which information about the corresponding monitoring item is acquired.
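
The recovery criterion in Table 3‒7 can be sketched in Python as follows. This is an illustration only, not ITSLM's implementation. The function name has_recovered is hypothetical, and the measurement history is assumed to be a list of True/False values, oldest first, where True marks a measurement that exceeded the threshold.

```python
def has_recovered(over_flags, m, n):
    """Return True if performance is determined to have returned to normal.

    over_flags: True where a measurement exceeded the threshold, oldest first
    m: number of measurements considered (M in the legend)
    n: allowed number of overages (N in the legend)
    """
    if m < 1 or n < 1:
        raise ValueError("m and n are assumed to be 1 or greater")
    # If fewer than m values exist, the slice simply returns all of them,
    # which matches the rule for a short measurement history.
    recent = over_flags[-m:]
    return sum(recent) < n
```

For example, with M set to 3 and N set to 2, the history True, True, False, False is judged on its last three values, which contain one overage, so performance is determined to have returned to normal.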

You can check the recovery status in the Home or Real-time Monitor window. For details about how to check the Home window, see 4.3.1 Checking the status of the monitored services of all service groups. For details about how to check the Real-time Monitor window, see 4.3.2 Checking the status of the monitored services in a specific service group.

(4) Supplementary information

(5) Related topics