Hitachi

Job Management Partner 1 Version 10 Job Management Partner 1/IT Service Level Management Description, User's Guide, Reference and Operator's Guide


3.1.2 Using out-of-range value detection for detection of unusual status in monitored services

Out-of-range value detection is performed for each monitoring item. You can also combine multiple monitoring items and monitor them as a set. For details about the monitoring items, see 3.1.1 ITSLM's monitoring methods and types of monitored targets.

This subsection provides an overview of out-of-range value detection and explains how to obtain the baseline, upper-limit value, and lower-limit value.

Organization of this subsection

(1) About out-of-range value detection
(2) How to obtain the baseline and upper-limit and lower-limit values
(3) Detection criteria
(4) Criteria for determining that performance has returned to normal
(5) Supplementary information
(6) Related topics

(1) About out-of-range value detection

When the performance of a monitored service becomes noticeably poor, this monitoring method treats the change as an early warning sign of a potential service performance error. The method obtains an average value from accumulated past service performance data and detects as an out-of-range value any value that differs significantly from this average. The average value obtained from accumulated past service performance data is called the baseline.

In out-of-range value detection, some upper margin from the baseline and some lower margin from the baseline are used as upper-limit and lower-limit values. This detection method checks whether the current service performance is veering significantly away from the baseline (that is, differs significantly from the usual service performance) and determines the current service performance to constitute an out-of-range value when it falls beyond the upper-limit or lower-limit value. The baseline and the upper-limit and lower-limit values are updated every 60 seconds.

Out-of-range value detection is based on statistics that use the standard deviation. The baseline is the average of the service performance data collected in the past, and the upper-limit and lower-limit values are calculated from that average and the standard deviation.
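As a rough illustration of this calculation, the following Python sketch derives a baseline and upper-limit and lower-limit values from past samples. The multiplier k stands in for the sensitivity setting and is an assumption for illustration, not a documented value.

```python
import statistics

def baseline_and_limits(samples, k=2.0):
    """Derive a baseline (mean of past samples) and upper-limit and
    lower-limit values placed k standard deviations away from it.
    k stands in for the sensitivity setting and is an assumption."""
    baseline = statistics.mean(samples)
    sd = statistics.pstdev(samples)
    return baseline, baseline + k * sd, baseline - k * sd

# Past average response times (ms) observed for the same time of day
past = [120, 130, 125, 135, 128]
base, upper, lower = baseline_and_limits(past)
```

A current measurement outside the (lower, upper) band would then be treated as an out-of-range value.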

The following figure shows an example in which unusual service performance is detected by out-of-range value detection.

Figure 3‒5: Example in which unusual service performance is detected by out-of-range value detection

[Figure]

This example monitors the average response time. The service performance value increased as time went by and an out-of-range value was detected when it exceeded the upper-limit value.

The upper-limit and lower-limit values are determined by setting a sensitivity, which specifies the distance from the baseline beyond which the performance of a monitored service is detected as an out-of-range value.

In out-of-range value detection, you can combine multiple monitoring items together as a set.

By combining multiple monitoring items, you can improve the precision of predictive error detection in service performance by taking into account a correlation among monitoring items. The two monitoring items that can be combined are average response time and throughput.

When these two monitoring items are correlated, one of them might seem abnormal, but it might not appear to be abnormal when the correlation is taken into account. For example, if the average response time is increasing but this is the result of an increase in throughput due to an increase in the number of users using the monitored service, this increase in average response time might be treated as a normal change in service performance due to the increased system load. In out-of-range value detection using a combination of multiple monitoring items, you can improve detection precision by treating a change in service performance caused by such a correlation as normal and not detecting it.

The following provides an example in which unusual service performance is detected by out-of-range value detection with a combination of multiple monitoring items.

Figure 3‒6: Example in which unusual service performance is detected by out-of-range value detection with a combination of multiple monitoring items

[Figure]

In A in the figure, an unusual increase either in average response time or in throughput in the same period would be detected as a warning sign of a service performance error. However, in B in the figure, the increases in both response time and throughput in the same period are treated as being normal due to their correlation and they are not detected.

In out-of-range value detection with a combination of multiple monitoring items, the correlation of the two service performance items is taken into account in determining the baseline. When service performance falls beyond the upper-limit or lower-limit value that has been determined based on this baseline, the correlation is treated as not being the cause and a warning sign is detected.
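The manual does not document the exact correlation model, but the idea can be sketched as follows: fit a simple linear relationship between throughput and average response time from past data, and flag only response times that deviate from what the correlation predicts. The function name, the linear model, and the threshold k are all illustrative assumptions.

```python
import statistics

def residual_outlier(history, current_tp, current_rt, k=2.0):
    """Flag a response time that cannot be explained by throughput.
    history: list of (throughput, response_time) pairs from past data.
    A least-squares line models the usual relationship; the current
    value is an outlier if its residual exceeds k residual std devs."""
    tps = [tp for tp, _ in history]
    rts = [rt for _, rt in history]
    mtp, mrt = statistics.mean(tps), statistics.mean(rts)
    cov = sum((tp - mtp) * (rt - mrt) for tp, rt in history)
    var = sum((tp - mtp) ** 2 for tp in tps)
    slope = cov / var
    intercept = mrt - slope * mtp
    residuals = [rt - (slope * tp + intercept) for tp, rt in history]
    sd = statistics.pstdev(residuals)
    return abs(current_rt - (slope * current_tp + intercept)) > k * sd

# Response time rises roughly in proportion to throughput here, so a
# proportional rise is normal and only an unexplained rise is flagged.
history = [(100, 52), (200, 78), (300, 112), (400, 138)]
```

With this model, a response time of 300 at throughput 400 is flagged, while 155 at throughput 450 is treated as a normal consequence of the higher load.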

In out-of-range value detection with a combination of multiple monitoring items, the baseline and the upper-limit and lower-limit values are updated every hour.

A detected out-of-range value is displayed in the window as a warning.

The following shows an example of a window in which a warning is displayed.

Figure 3‒7: Example of window with a warning displayed (out-of-range value detection)

[Figure]

The information displayed in the window for a warning includes a warning icon, the detection date and time, the name of the service group detected for the warning, and the service name. If service performance continues to exceed the upper-limit value or continues to be lower than the lower-limit value, only the first warning detected is displayed. You can view the service performance leading up to and following the point of the warning as a graph.

The following shows an example of a graph.

Figure 3‒8: Example of a graph (out-of-range value detection)

[Figure]

In the graph, a warning icon indicates the time at which the service performance exceeded the upper-limit value or dropped below the lower-limit value, and a colored belt indicates the time period during which the event resulting in the out-of-range value is suspected to have occurred.

To perform out-of-range value detection, you must specify the following items in the Settings window:

Days till start

Specifies the number of days for which service performance data is to be accumulated before out-of-range value detection is to be started. Out-of-range value detection requires that service performance data be accumulated from the monitored service running in the actual operating environment before a baseline can be calculated. If service performance data is accumulated for at least one day, out-of-range value detection can be performed. However, if the number of days specified for accumulation of service performance data is less than the number of days to be used in the baseline calculation, the obtained baseline might be unrealistic because there is not enough data to calculate it. For Days till start, we recommend that you specify a value that is at least equal to the number of days to be used in the baseline calculation.

Days in baseline calculation

Specifies the number of days' worth of accumulated past service performance data that are to be used for calculation of the baseline.

Sensitivity

Specifies a sensitivity setting for out-of-range value detection that is to be used to determine the distance from the baseline to the upper-limit and lower-limit values. You can select High, Middle, or Low for the sensitivity setting. High sensitivity reduces the distance from the baseline to the upper-limit and lower-limit values, making service performance anomalies more likely to be detected. Low sensitivity increases the distance from the baseline to the upper-limit and lower-limit values, making service performance anomalies less likely to be detected. For High, the distance is half of the distance for Middle; for Low, the distance is 1.5 times the distance for Middle.

The following shows examples in which the distance from the baseline to the upper-limit and lower-limit values is narrowed or widened depending on the sensitivity.

Figure 3‒9: Examples in which the distance from the baseline to the upper-limit and lower-limit values is narrowed or widened

[Figure]

This example monitors the average response time. The service performance is the same in both graphs. However, when the distance from the baseline to the upper-limit and lower-limit values is narrow, as in the graph on the left, more out-of-range values in service performance are detected than when the distance from the baseline to the upper-limit and lower-limit values is wide, as in the graph on the right.
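The documented scaling (for High the distance is half of the Middle distance, and for Low it is 1.5 times the Middle distance) can be expressed as a small sketch. The function name and the single "Middle distance" input are assumptions for illustration.

```python
def limits_for_sensitivity(baseline, middle_distance, sensitivity):
    """Distance from the baseline to the limits for each sensitivity:
    High halves the Middle distance, Low multiplies it by 1.5."""
    factor = {"High": 0.5, "Middle": 1.0, "Low": 1.5}[sensitivity]
    d = middle_distance * factor
    return baseline - d, baseline + d   # (lower-limit, upper-limit)
```

For a baseline of 100 and a Middle distance of 20, High gives limits of (90, 110) and Low gives (70, 130), so High detects anomalies sooner.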

Setting the upper-limit and lower-limit values for out-of-range value detection

Use the serviceBaselineExclusion property in ITSLM - Manager's system definition file (jp1itslm.properties) to set upper-limit and lower-limit values for out-of-range value detection.

When this property is set to true: ITSLM detects only values that exceed the upper-limit value from the baseline.

Figure 3‒10: Example of detecting only values that exceed the upper-limit value

[Figure]

When this property is set to false: ITSLM detects any value exceeding the upper-limit value or dropping below the lower-limit value from the baseline.

Figure 3‒11: Example of detecting any value exceeding the upper-limit value or dropping below the lower-limit value

[Figure]
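For example, a minimal excerpt of jp1itslm.properties that enables upper-limit-only detection might look like the following. The property name is documented above; the rest of the file's contents are omitted here.

```properties
# jp1itslm.properties (excerpt)
# true:  detect only values that exceed the upper-limit value
# false: detect values beyond either the upper-limit or lower-limit value
serviceBaselineExclusion=true
```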

For details about how to edit system definition files, see 5.6.1 Editing the system definition files.

When linking with Performance Management

If you link ITSLM with Performance Management, you can also perform out-of-range value detection in system performance. However, when system performance is monitored, out-of-range value detection using a combination of multiple monitoring items is not supported.

Use the systemBaselineExclusion property in ITSLM - Manager's system definition file (jp1itslm.properties) to set upper-limit and lower-limit values for out-of-range value detection for system performance. For details about how to edit system definition files, see 5.6.1 Editing the system definition files.

(2) How to obtain the baseline and upper-limit and lower-limit values

The baseline used as the criterion for determining out-of-range values is obtained as follows:

For a service monitoring configuration
  1. The average throughput (average processing count) of the monitored service over the past hour is determined.

  2. From the accumulated service performance data (covering a maximum of 60 days), the days whose average processing counts for the same period are the closest to the past hour's average processing count are selected.

  3. From the service performance for the selected days, an average value (the baseline) covering up to one hour ahead of the current time is calculated every minute for each monitoring item.

For a system monitoring configuration
  1. The average of the system values measured during the past hour for the selected monitored target is calculated.

  2. From the accumulated system performance data (over a maximum of 60 days), the dates whose average system performance for the same period is the closest to the current average value are selected.

  3. For each monitoring item, based on the system performance data for the selected dates, the average of the values from the present time to an hour later (the baseline) is calculated every minute.

For example, if service performance for the same time period differs greatly depending on the day of the week, such as a monitored service whose processing counts for regular business days differ considerably from the processing counts for weekends and holidays, a realistic baseline can be obtained by calculating it based on past service performance that takes into account regular business days and weekends and holidays.

The following example selects the past service performance for baseline calculation from the service performance data accumulated over the past 60 days, using the days whose average processing counts are closest to the past hour's average processing count.

Figure 3‒12: Example of selecting the past service performance for baseline calculation from the service performance data accumulated for the past 60 days and using the days whose average processing counts are closest to the past hour's average processing count

[Figure]

The service performance for the days whose average processing counts are the closest to the average processing count for the past hour is selected from the accumulated past service performance data. In this example, the past hour's average processing count is 300. Therefore, the service performance from yesterday and from 60 days ago, which have the closest average processing counts, is selected from all the service performance over the past 60 days. The service performance from two days ago is not selected because its average processing count is quite different from the past hour's average processing count. As many days' worth of service performance are selected from the past data as the number of days specified for Days in baseline calculation in the Settings window.
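The selection step described above can be sketched as follows. The data structure (a mapping from day label to average processing count) is an assumption for illustration.

```python
def select_days(day_stats, current_avg, days_in_calc):
    """From accumulated days (day label -> average processing count),
    pick the days whose averages are closest to the past hour's
    average, up to the configured Days in baseline calculation."""
    ranked = sorted(day_stats.items(), key=lambda kv: abs(kv[1] - current_avg))
    return [day for day, _ in ranked[:days_in_calc]]

# Mirrors Figure 3-12: the past hour's average processing count is 300,
# so yesterday and 60 days ago are chosen and 2 days ago is not.
stats = {"yesterday": 295, "2 days ago": 120, "60 days ago": 310}
```

Calling select_days(stats, 300, 2) picks yesterday and 60 days ago; the day whose average differs most from 300 is left out.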

The following rules apply to selection of past service performance:

According to these rules, past service performance over as many days as specified for Days in baseline calculation under Error Predict. settings is used (or past service performance over the number of days for which data has been accumulated is used). This selection of past service performance occurs once an hour on the hour.

The baseline is calculated after the days to be used have been selected, with priority given in order from the oldest data.

You can start out-of-range value detection if one day's worth of service performance data has been accumulated. However, the baseline might not be realistic until as many days' worth of service performance data as needed for baseline calculation has accumulated. To obtain a realistic baseline, we recommend that you do not start out-of-range value detection until enough service performance data for the baseline calculation has accumulated.

For Days till start in the Settings window, specify the number of days service performance data is to be accumulated before out-of-range value detection is started.

The following figure shows the relationship between the number of days used for baseline calculation (Days in baseline calculation) and the number of days before out-of-range value detection is started (Days till start).

Figure 3‒13: Relationship between the number of days used for baseline calculation (Days in baseline calculation) and the number of days before out-of-range value detection is started (Days till start)

[Figure]

This example specifies 15 days for Days in baseline calculation and 5 days for Days till start.

In this example, out-of-range value detection starts five days after ITSLM operation began. The baseline is calculated from the past five days' worth of service performance. Because this value is less than the number of days set for baseline calculation, the resulting baseline might not be realistic. The most realistic baseline is obtained on the 15th day of ITSLM operation, because as many days' worth of service performance data as needed for baseline calculation has been accumulated.

When multiple monitoring items are combined, there are some differences in the baseline calculation method.

When out-of-range value detection is performed with multiple monitoring items combined, past service performance is selected based on average throughput (average processing count), the average correlation between average response time and throughput is then obtained for the selected days, and finally the baseline is calculated.

Because the baseline used for detection is different, an out-of-range value in out-of-range value detection with multiple monitoring items combined can differ from the out-of-range value in normal out-of-range value detection.

Therefore, an out-of-range value exceeding the upper-limit value in normal out-of-range value detection might be less than the lower-limit value in out-of-range value detection with multiple monitoring items combined. Conversely, an out-of-range value that is less than the lower-limit value in normal out-of-range value detection might exceed the upper-limit value in out-of-range value detection with multiple monitoring items combined.

The following figure shows an example in which out-of-range values are detected in out-of-range value detection with multiple monitoring items combined.

Figure 3‒14: Example in which out-of-range values are detected in out-of-range value detection with multiple monitoring items combined

[Figure]

This example monitors average response time and throughput. The baseline is calculated on a graph containing both monitoring items. Any value exceeding the upper-limit value or falling below the lower-limit value is detected as an out-of-range value, because such a change cannot be attributed to the correlation between the monitoring items.

The upper-limit and lower-limit values, both for normal out-of-range value detection and for out-of-range value detection with multiple monitoring items combined, are determined by the variability in the past service performance selected for the baseline calculation and by the sensitivity setting used to tune detectability.

When linking with Performance Management

The baseline used for monitoring system performance is calculated from system performance data that is selected using the same criteria as for monitoring service performance. If a day has the highest priority in one monitoring item but no system performance was collected, that day is not selected; the day with the next highest priority is selected. Therefore, the days used for baseline calculation might depend on the monitoring items.

When you monitor system performance, you can specify separately for each monitoring item the number of days to be used for baseline calculation.

(3) Detection criteria

In out-of-range value detection, out-of-range values are detected only when they occur consecutively so as to avoid detecting transient out-of-range values.

This subsection explains the criteria for detecting out-of-range values when service performance is monitored and when system performance is monitored.

When service performance is monitored

The detection criteria depend on the service performance measurement count per 60 seconds and the outlierRate property value specified in ITSLM - Manager's system definition file (jp1itslm.properties). The outlierRate property value is applied to out-of-range value detection for all monitored services.

For details about editing the system definition file, see 5.6.1 Editing the system definition files.

The following table describes the relationship between the outlierRate property value and the behavior of out-of-range value detection.

Table 3‒2: Relationship between outlierRate property value and the behavior of out-of-range value detection

No. 1: outlierRate property value (n) is 1
  • An out-of-range value is detected when service performance exceeds the upper-limit value or drops below the lower-limit value even once.
  • When service performance does not exceed the upper-limit value or drop below the lower-limit value over the next 60 seconds, it is determined to have returned to normal.

No. 2: outlierRate property value (n) is 2 to 98
  • An out-of-range value is detected when service performance exceeds the upper-limit value or drops below the lower-limit value S × n ÷ 100 times (rounded up) in 60 seconds.
  • When the number of times service performance exceeds the upper-limit value or drops below the lower-limit value in 60 seconds is less than S × n ÷ 100 (rounded up), the service performance is determined to have returned to normal.

No. 3: outlierRate property value (n) is 99 to 100
  • An out-of-range value is detected when service performance continues to exceed the upper-limit value or remain below the lower-limit value for 60 seconds.
  • When service performance falls within the upper-limit and lower-limit values even once, it is determined to have returned to normal.

Legend:

S: Number of times service performance is measured in 60 seconds

n: outlierRate property value specified in ITSLM - Manager's system definition file (jp1itslm.properties)
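The detection threshold in the table can be sketched as one function of S and n. Treating rows 1 and 3 as fixed thresholds of 1 and S is an interpretation of the table's wording, not a documented formula.

```python
import math

def detection_threshold(s, n):
    """Number of limit violations within 60 seconds that triggers
    out-of-range value detection. s is the number of measurements per
    60 seconds; n is the outlierRate property value (1 to 100)."""
    if n == 1:
        return 1                   # row 1: a single violation suffices
    if n >= 99:
        return s                   # row 3: every measurement must violate
    return math.ceil(s * n / 100)  # row 2: proportional threshold, rounded up
```

For example, with 60 measurements per minute and outlierRate set to 10, six violations within 60 seconds trigger detection.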

Note that consecutive out-of-range values exceeding the upper-limit value are counted separately from those dropping below the lower-limit value. This means that an out-of-range value exceeding the upper-limit value followed immediately by one dropping below the lower-limit value is not detected as consecutive out-of-range values.

The following service performance values are not processed as out-of-range values even if they exceed an upper-limit value or drop below a lower-limit value:

  • Average response time when average response time and throughput are both 0 (throughput is detected as an out-of-range value)

  • Error rate when error rate and throughput are both 0 (throughput is detected as an out-of-range value)

This is because a throughput value of 0 indicates that there is no data. For throughput itself, however, a value of 0 indicates that the service performance has a processing count of 0. Therefore, throughput is still detected as an out-of-range value if its value of 0 is above the upper-limit value or below the lower-limit value.

These average response time and error rate values are still used as past service performance for baseline calculation, because they can represent a normal status even though they were not processed as out-of-range values.

When system performance is monitored

Detection is based on how many of the most recent measurements fell beyond a limit; when this number reaches a specified value, an event is generated. The number of times exceeded and the number of times measured are specified in Occurrence frequency under Error Predict. settings in the Monitor settings area of the Settings window. The following table describes the correspondence between the settings and the criterion for detecting an out-of-range value.

Table 3‒3: Criterion for detecting an out-of-range value

No. 1: 1 is specified for both times exceeded and times measured
  An out-of-range value is detected if performance data for the current time falls beyond the lower or upper limit for predictive error detection.

No. 2: A value other than 1 is specified for either times exceeded or times measured, or for both
  An out-of-range value is detected if the following conditions are both satisfied:
  • Performance data for the current time falls beyond the lower or upper limit for predictive error detection.
  • Among the most recent measurements, the number that fell beyond the lower or upper limit for predictive error detection has reached the specified number of times exceeded. If the measurement acquisition count is less than the specified measurement count, the condition is satisfied when the performance data acquired so far has already fallen beyond the lower or upper limit at least the specified number of times.

The following notes apply to evaluating out-of-range value detection:

  • Once a notification is sent, no more notifications are sent until the status returns to normal even if the conditions are satisfied again.

  • When monitoring is stopped, the measurement acquisition count and the number of times an excess beyond the upper-limit and lower-limit values occurred are initialized to 0. When monitoring is restarted, no previous measurement values obtained before monitoring was stopped are used for new detection.

  • If no measurement value was obtained at a given time due to an error, that time is ignored and as many available most recent measurement values as needed for the specified detection are used.

  • If no past data for baseline calculation is available at the time detection is checked, that time is treated as normal (neither the upper nor the lower limit has been exceeded).
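The combination of the table's conditions and these notes can be sketched as follows. Whether the count must reach or strictly exceed the specified value is ambiguous in the source, so this sketch uses "at least"; all names are assumptions for illustration.

```python
def system_outlier(recent, upper, lower, n_exceed, m_measured):
    """Sketch of the Occurrence frequency rule for system performance.
    recent is ordered oldest to newest; None marks a measurement that
    could not be obtained and is skipped, as the notes describe."""
    valid = [v for v in recent if v is not None]
    window = valid[-m_measured:]          # at most the last M usable values
    if not window:
        return False                      # no data: treated as normal
    current = window[-1]
    if lower <= current <= upper:
        return False                      # current value is within limits
    beyond = sum(1 for v in window if v > upper or v < lower)
    return beyond >= n_exceed             # N-of-M condition
```

With limits (0, 10), N=2, and M=3, the history [5, 12, 4, 13, 14] is detected (the current value 14 is beyond a limit, and two of the last three values were), while [5, 12, 4, 13, 8] is not, because the current value is within limits.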

(4) Criteria for determining that performance has returned to normal

This subsection explains for service performance and for system performance the criteria for determining that performance has returned to normal since it exceeded the upper-limit value or dropped below the lower-limit value.

When service performance is monitored

Performance is determined to have returned to normal when the number of times service performance exceeded the upper-limit value or dropped below the lower-limit value during the past 60 seconds is less than S × n ÷ 100 (rounded up).

S indicates the number of times service performance was measured in 60 seconds. n indicates the outlierRate property value specified in ITSLM - Manager's system definition file (jp1itslm.properties).

For details about editing the system definition file, see 5.6.1 Editing the system definition files.

For example, if S is 60 and n is 10, performance is determined to have returned to normal when the number of times service performance exceeded the upper-limit value or dropped below the lower-limit value is less than 6. The upper-limit value and lower-limit value are checked separately. When both return to normal, performance is determined to have returned to normal. Recovery of performance is not detected from transient values that approach the baseline.
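This recovery rule can be sketched as follows; counting the two directions separately matches the statement that the upper-limit and lower-limit values are checked independently. The function name is an assumption for illustration.

```python
import math

def returned_to_normal(last_minute, upper, lower, s, n):
    """Sketch of the recovery rule: over the past 60 seconds, both the
    count of measurements above the upper limit and the count below
    the lower limit must be less than ceil(s * n / 100)."""
    threshold = math.ceil(s * n / 100)
    above = sum(1 for v in last_minute if v > upper)
    below = sum(1 for v in last_minute if v < lower)
    return above < threshold and below < threshold
```

With S = 60 and n = 10 the threshold is 6, so five violations in the past minute count as recovered while six do not.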

When system performance is monitored

The criterion for determining that performance has returned to normal depends on the value specified in Occurrence frequency for the monitored service under Error Predict. settings in the Monitor settings area of the Settings window.

For details about setting Occurrence frequency under Error Predict. settings, see 3.2.8 Setting up the monitoring items for system performance (working with Performance Management).

The following table explains the correspondence between the specifiable values and the criterion for determining that performance has recovered.

Table 3‒4: Criterion for determining that performance has returned to normal since it exceeded the upper-limit value or dropped below the lower-limit value

No. 1: 1 is specified for both M and N
  Performance is determined to have returned to normal when the performance data for the current time is not above the upper-limit value or below the lower-limit value.

No. 2: A value other than 1 is specified for either M or N, or for both
  Performance is determined to have returned to normal when, among the most recent M measurement values, the number that exceeded the upper-limit value for predictive error detection is less than N and the number that dropped below the lower-limit value is less than N.
  When the measurement acquisition count is less than M, all the measurement values obtained so far are evaluated in the same way instead of the most recent M values.

Legend:

M: Number of measurements taken as specified for Occurrence frequency (measured) under Error Predict. settings

N: Number of times a measured value is allowed to exceed the limit as specified for Occurrence frequency (Times exceeded) under Error Predict. settings

Recovery is determined when the values of the reported monitoring items are updated. Therefore, recovery is determined at the interval at which information about the corresponding monitoring items is acquired.

You can check the recovery status in the Home or Real-time Monitor window. For details about how to check the Home window, see 4.3.1 Checking the status of the monitored services of all service groups. For details about how to check the Real-time Monitor window, see 4.3.2 Checking the status of the monitored services in a specific service group.

(5) Supplementary information

(6) Related topics