3.1.4 Using threshold value monitoring for detection of threshold overages
Threshold value monitoring monitors each monitoring item. For details about the monitoring items, see 3.1.1 ITSLM's monitoring methods and types of monitored targets.
This subsection explains threshold value monitoring.
(1) About threshold value monitoring
Threshold value monitoring detects an overage of the threshold set for the performance of a monitored service.
If an SLO has been defined, you can detect an overage of the SLO value by specifying the SLO value as the threshold. If no SLOs have been defined, you can instead specify as the threshold a value that represents your own criterion for service performance, and detect an overage of that criterion.
The following shows an example in which an overage of a threshold is detected by threshold value monitoring.
This example monitors average response time. As time passed, the service performance value increased until an overage of the threshold was detected.
When an overage of a threshold is detected, an error is displayed in the window.
The following shows an example in which an error is displayed in the window.
The information displayed in the window includes an error icon, the detection date and time, the name of the service group in which the error occurred, and the service name. If service performance keeps exceeding the threshold, an error is displayed only the first time an overage of the threshold is detected. You can view the service performance leading up to and following the displayed error in a graph.
The following shows an example of a graph that is displayed.
In the graph, an error icon indicates the time the threshold was exceeded and a colored bar indicates the time period during which the event resulting in the overage of the threshold is assumed to have occurred.
To run threshold value monitoring, you must specify the following item in the Settings window:
- Threshold: Specifies the reference threshold that is used to determine the status of the monitored service.
- When linking with Performance Management
If you link ITSLM with Performance Management, you can also run threshold value monitoring for system performance. In threshold value monitoring for system performance, there are two types of monitoring items:
  - Monitoring item to be reported when it exceeds the threshold
  - Monitoring item to be reported when it drops below the threshold
You can determine which type applies to a monitoring item by checking the icon in the Threshold column of the Monitor settings area in the Settings window. The icon indicates whether the monitoring item is reported when it exceeds the threshold or when it drops below the threshold.
(2) Detection criteria
In threshold value monitoring, an overage of the threshold is detected only if the overage persists, so that transient overages are not reported. This subsection explains the criteria for detecting an overage of the threshold when service performance is monitored and when system performance is monitored.
- When service performance is monitored
The detection criteria depend on the service performance measurement count per 60 seconds and the sloThresholdRate property value specified in ITSLM - Manager's system definition file (jp1itslm.properties). The sloThresholdRate property value is applied to threshold value monitoring for all monitored services.
For details about editing the system definition file, see 5.6.1 Editing the system definition files.
The following table describes the relationship between the sloThresholdRate property value and the behavior of threshold value monitoring. In the table, S indicates the number of times service performance is measured in 60 seconds.

Table 3‒5: Relationship between sloThresholdRate property value and the behavior of threshold value monitoring

| No. | sloThresholdRate property value (n) | Behavior of threshold value monitoring |
|---|---|---|
| 1 | 1 | An overage of the threshold is detected when service performance exceeds the threshold even once. When service performance no longer exceeds the threshold, it is determined to have returned to normal. |
| 2 | 2 to 98 | An overage of the threshold is detected when service performance exceeds the threshold S × n / 100 times (rounded up) in 60 seconds. When the number of times service performance exceeds the threshold is less than S × n / 100 times (rounded up) in 60 seconds, service performance is determined to have returned to normal. |
| 3 | 99 to 100 | An overage of the threshold is detected when service performance continues to exceed the threshold for 60 seconds. When service performance falls below the threshold even once, it is determined to have returned to normal. |
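The rows of Table 3‒5 can be read as a rule for how many over-threshold measurements within 60 seconds trigger detection. The following is a minimal sketch of that reading, not ITSLM's implementation. The function names `required_overages` and `overage_detected` are hypothetical, and the sketch assumes that "continues to exceed the threshold for 60 seconds" means that every one of the S measurements in the window exceeded the threshold.

```python
def required_overages(s: int, n: int) -> int:
    """Return how many of the s measurements taken in 60 seconds must exceed
    the threshold before an overage is detected, given sloThresholdRate n."""
    if s < 1:
        raise ValueError("s must be at least 1")
    if not 1 <= n <= 100:
        # Assumption: Table 3-5 only describes values from 1 to 100.
        raise ValueError("n must be between 1 and 100")
    if n == 1:
        return 1                 # No. 1: a single overage is enough
    if n >= 99:
        return s                 # No. 3: assumed to mean all s measurements
    return -(-s * n // 100)      # No. 2: S x n / 100, rounded up (integer ceiling)


def overage_detected(exceed_count: int, s: int, n: int) -> bool:
    """True when the count of over-threshold measurements reaches the limit."""
    return exceed_count >= required_overages(s, n)
```

Under these assumptions, `required_overages(60, 10)` evaluates to 6, which agrees with the S = 60, n = 10 example in (3) Criteria for determining that performance has returned to normal.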
- When system performance is monitored
The detection criterion is the number of times that the most recent measurements exceeded the threshold. The number of times exceeded and the number of times measured are specified in Occurrence frequency under SLO monitor settings in the Monitor settings area of the Settings window. The following table describes the correspondence between these settings and the criterion for detecting an overage of the threshold.
Table 3‒6: Criterion for detecting an overage of the threshold

| No. | Occurrence frequency settings | Criterion for detecting an overage of the threshold |
|---|---|---|
| 1 | 1 is specified for both times exceeded and times measured | An overage of the threshold is detected when performance data for the current time exceeds the threshold. |
| 2 | A value other than 1 is specified for either times exceeded or times measured or both | An overage of the threshold is detected when both of the following conditions are satisfied: (a) Performance data for the current time exceeds the threshold. (b) Among the specified number of most recent measurements, the threshold was exceeded at least the specified number of times. If fewer measurements have been acquired than the specified number of times measured, this condition is satisfied when the measurements acquired so far have already exceeded the threshold at least the specified number of times. |
The following notes apply to evaluating detection of overages of the threshold:
- Once a notification is sent, no more notifications are sent until the status returns to normal, even if the conditions are satisfied again.
- When monitoring is stopped, the measurement acquisition count and the number of times an overage of the threshold occurred are initialized to 0. When monitoring is restarted, measurement values obtained before monitoring was stopped are not used for new detection.
- If no measurement value was obtained at a given time due to an error, that time is ignored, and as many of the most recent available measurement values as are needed for the specified detection are used.
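The detection criterion and the notes above can be sketched as a small state machine. This is an illustrative sketch only, not the product's implementation. The class name `OccurrenceMonitor` and its members are hypothetical. It assumes a monitoring item that is reported when it exceeds the threshold, it represents a missing measurement as `None`, it uses the recovery criterion described in (3) to decide when the status returns to normal, and it assumes that stopping monitoring also clears the overage status.

```python
from collections import deque
from typing import Optional


class OccurrenceMonitor:
    """Hypothetical model of "times exceeded out of times measured" detection."""

    def __init__(self, threshold: float, times_exceeded: int, times_measured: int):
        self.threshold = threshold
        self.times_exceeded = times_exceeded
        # Keeps only the most recent measurements; True marks an overage.
        self.recent = deque(maxlen=times_measured)
        self.in_overage = False

    def reset(self) -> None:
        """Monitoring stopped: earlier measurements are not reused.
        Assumption: the overage status is cleared as well."""
        self.recent.clear()
        self.in_overage = False

    def add(self, value: Optional[float]) -> bool:
        """Record one measurement. Return True only when a new notification is due."""
        if value is None:
            return False  # no value was obtained at this time, so it is ignored
        exceeded = value > self.threshold
        self.recent.append(exceeded)
        count = sum(self.recent)
        if self.in_overage:
            # No further notifications until the status returns to normal.
            if count < self.times_exceeded:
                self.in_overage = False
            return False
        if exceeded and count >= self.times_exceeded:
            self.in_overage = True
            return True
        return False
```

A caller would create one monitor per monitoring item, call `add` each time performance data is collected, and call `reset` when monitoring is stopped.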
(3) Criteria for determining that performance has returned to normal
This subsection explains the criteria for determining that performance has returned to normal after it exceeded the threshold, for service performance and for system performance.
- When service performance is monitored
Performance is determined to have returned to normal when the number of times service performance exceeded the threshold in the past 60 seconds is less than S × n / 100 (rounded up).
S indicates the number of times service performance was measured in 60 seconds. n indicates the sloThresholdRate property value specified in ITSLM - Manager's system definition file (jp1itslm.properties).
For details about editing the system definition file, see 5.6.1 Editing the system definition files.
For example, if S is 60 and n is 10, S × n / 100 is 6, so performance is determined to have returned to normal when the number of times service performance exceeded the threshold in the past 60 seconds is less than 6. A transient value below the threshold is not, by itself, treated as recovery.
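The example above can be checked with a short sketch. The function name `recovered` and the sample values are hypothetical. The sketch follows the general formula in this subsection and does not handle the special cases for n = 1 and n = 99 to 100 from Table 3‒5 separately.

```python
def recovered(window: list, threshold: float, n: int) -> bool:
    """window holds the service performance values measured in the past 60 seconds.
    Returns True when the number of overages is below S x n / 100, rounded up."""
    s = len(window)                       # S: measurements in the past 60 seconds
    limit = -(-s * n // 100)              # integer ceiling of S x n / 100
    overages = sum(1 for value in window if value > threshold)
    return overages < limit


# S = 60 and n = 10 give a limit of 6 overages.
five_overages = [5.0] * 5 + [1.0] * 55    # 5 values above a threshold of 3.0
six_overages = [5.0] * 6 + [1.0] * 54     # 6 values above the threshold
single_dip = [5.0] * 59 + [1.0]           # one transient value below the threshold

print(recovered(five_overages, 3.0, 10))  # True
print(recovered(six_overages, 3.0, 10))   # False
print(recovered(single_dip, 3.0, 10))     # False
```

The last case shows why a single value below the threshold does not count as recovery: 59 of the 60 measurements in the window still exceed the threshold.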
- When system performance is monitored
The criterion for determining that performance has returned to normal after exceeding the threshold depends on the value specified in Occurrence frequency for the monitored service under SLO monitor settings in the Monitor settings area of the Settings window.
For details about setting Occurrence frequency under SLO monitor settings, see 3.2.8 Setting up the monitoring items for system performance (working with Performance Management).
The following table explains the correspondence between the specifiable values and the criterion for determining that performance has recovered.
Table 3‒7: Criterion for determining that performance has returned to normal after it exceeded the threshold

In this table, N indicates the number of times exceeded and M indicates the number of times measured, as specified in Occurrence frequency.

| No. | Occurrence frequency values | Criterion for determining recovery |
|---|---|---|
| 1 | 1 is specified for both M and N | Performance is determined to have returned to normal when the performance data for the current time does not exceed the threshold. |
| 2 | A value other than 1 is specified for either M or N or for both | Performance is determined to have returned to normal when the number of times the most recent M measurement values exceeded the threshold is less than N. When the measurement acquisition count is less than M, performance is determined to have returned to normal if the number of times all the measurement values obtained so far exceeded the threshold is less than N. |
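The recovery criterion in Table 3‒7 can be expressed as a single check over the most recent measurements. The following is a minimal sketch under stated assumptions, not the product's implementation: the function name `returned_to_normal` is hypothetical, `values` is assumed to hold the measurements in the order they were acquired, and the monitoring item is assumed to be one that is reported when it exceeds the threshold.

```python
from typing import Sequence


def returned_to_normal(values: Sequence[float], threshold: float, m: int, n: int) -> bool:
    """True when fewer than n of the most recent m values exceed the threshold."""
    # If fewer than m values have been acquired, the slice returns all of them,
    # which corresponds to the second sentence of No. 2 in the table.
    recent = values[-m:]
    overages = sum(1 for value in recent if value > threshold)
    return overages < n
```

With M and N both set to 1, the check reduces to whether the current value exceeds the threshold, which corresponds to No. 1 in the table.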
Recovery is determined when the values of the reported monitoring items are updated. Therefore, recovery is determined at the interval at which information about the corresponding monitoring item is acquired.
You can check the recovery status in the Home or Real-time Monitor window. For details about how to check the Home window, see 4.3.1 Checking the status of the monitored services of all service groups. For details about how to check the Real-time Monitor window, see 4.3.2 Checking the status of the monitored services in a specific service group.
(4) Supplementary information
- Threshold value monitoring for service performance begins 60 seconds after ITSLM - UR collects the first service performance data after monitoring starts.
- Threshold value monitoring for system performance begins immediately after PFM - Agent or PFM - RM collects the first system performance data after monitoring starts.