Hitachi

JP1 Version 11 Performance Management: Getting Started (Service Level Management)


4.5 Detection of an unusual status of a monitored service by out-of-range value detection

When the performance of a monitored service differs markedly from its usual level, out-of-range value detection treats this as a sign of an abnormality in service performance. Out-of-range value detection is performed for each monitoring item. You can also combine multiple monitoring items and monitor them as a set.

The method obtains an average value from accumulated past service performance data and detects any value that differs significantly from this average as constituting an out-of-range value. The average value obtained from accumulated past service performance data is called the baseline.

In out-of-range value detection, some upper margin from the baseline and some lower margin from the baseline are used as upper-limit and lower-limit values. This detection method checks whether the current service performance is veering significantly away from the baseline (that is, differs significantly from the usual service performance) and determines the current service performance to constitute an out-of-range value when it falls beyond the upper-limit or lower-limit value. The baseline and the upper-limit and lower-limit values are updated every 60 seconds.

Out-of-range value detection is a statistical method based on standard deviation. The baseline is the average of the service performance data collected in the past, and the upper-limit and lower-limit values are calculated from that average and the standard deviation.
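The relationship between the baseline, the standard deviation, and the two limit values can be illustrated with a minimal sketch. The manual does not state the exact formula, so the multiplier k, the function names, and the sample data below are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def compute_limits(history, k=3.0):
    """Return (baseline, lower_limit, upper_limit) for past performance values.

    k is the number of standard deviations between the baseline and each
    limit. It is an assumed parameter, not a documented product value.
    """
    baseline = mean(history)
    spread = pstdev(history)  # population standard deviation of the history
    return baseline, baseline - k * spread, baseline + k * spread


def is_out_of_range(value, lower_limit, upper_limit):
    """A value is out of range when it falls beyond either limit."""
    return value < lower_limit or value > upper_limit


# Hypothetical average response times (ms) accumulated in the past.
history = [100, 102, 98, 101, 99]
baseline, lower_limit, upper_limit = compute_limits(history)
```

With this sample data the baseline is 100, so a current value of 101 stays within the limits while a value of 110 would be reported as an out-of-range value.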

The following figure shows an example in which unusual service performance is detected by out-of-range value detection.

Figure 4‒9: Example in which unusual service performance is detected by out-of-range value detection

[Figure]

This example monitors the average response time. The service performance value increased as time went by and an out-of-range value was detected when it exceeded the upper-limit value.

The upper-limit and lower-limit values are determined by the sensitivity setting, which specifies how far from the baseline the performance of a monitored service must move before it is detected as an out-of-range value.

In out-of-range value detection, you can combine multiple monitoring items together as a set.

By combining multiple monitoring items, you can improve the precision of predictive error detection in service performance by taking into account a correlation among monitoring items. The two monitoring items that can be combined are average response time and throughput.

When these two monitoring items are correlated, a value that seems abnormal when one item is viewed alone might turn out to be normal when the correlation is taken into account. For example, if the average response time is increasing because throughput has increased as more users use the monitored service, this increase in average response time might be treated as a normal change in service performance due to the increased system load. In out-of-range value detection using a combination of multiple monitoring items, you can improve detection precision by treating a change in service performance caused by such a correlation as normal and not detecting it.

The following provides an example in which unusual service performance is detected by out-of-range value detection with a combination of multiple monitoring items.

Figure 4‒10: Example in which unusual service performance is detected by out-of-range value detection with a combination of multiple monitoring items

[Figure]

In A in the figure, an unusual increase either in average response time or in throughput in the same period would be detected as a warning sign of a service performance error. However, in B in the figure, the increases in both average response time and throughput in the same period are treated as normal because of their correlation, and they are not detected.

In out-of-range value detection with a combination of multiple monitoring items, the correlation of the two service performance items is taken into account in determining the baseline. When service performance falls beyond the upper-limit or lower-limit value that has been determined based on this baseline, the change is judged not to be explained by the correlation, and a warning sign is detected.
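One way to picture correlation-aware detection is to model the expected average response time as a function of throughput and to flag only values that stray far from that expectation. The sketch below uses a least-squares line and a residual-based margin. This is a swapped-in technique chosen for illustration, not the product's documented algorithm, and the multiplier k, the function names, and the data are all assumptions.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Least-squares fit of ys against xs. Assumes xs are not all equal."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def make_detector(throughputs, response_times, k=3.0):
    """Build a detector from past (throughput, response time) pairs.

    k is an assumed multiplier for the residual standard deviation.
    """
    slope, intercept = fit_line(throughputs, response_times)
    residuals = [
        rt - (slope * tp + intercept)
        for tp, rt in zip(throughputs, response_times)
    ]
    margin = k * pstdev(residuals)

    def is_out_of_range(throughput, response_time):
        # Compare against the response time expected at this throughput.
        expected = slope * throughput + intercept
        return abs(response_time - expected) > margin

    return is_out_of_range


# Hypothetical history: response time (ms) rises with throughput (requests/s).
detect = make_detector([10, 20, 30, 40, 50], [100, 121, 149, 171, 199])
```

In this sketch, `detect(60, 222)` is false because the higher response time is explained by the higher throughput, which corresponds to B in the figure. `detect(30, 200)` is true because response time rose without a matching rise in throughput, which corresponds to A.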

In out-of-range value detection with a combination of multiple monitoring items, the baseline and the upper-limit and lower-limit values are updated every hour.

A detected out-of-range value is displayed in the window as a warning.

The following shows an example of a window in which a warning is displayed.

Figure 4‒11: Example of window with a warning displayed (out-of-range value detection)

[Figure]

The information displayed in the window for a warning includes a warning icon, the detection date and time, the name of the service group detected for the warning, and the service name. If service performance continues to exceed the upper-limit value or continues to be lower than the lower-limit value, only the first warning detected is displayed. You can view the service performance leading up to and following the point of the warning as a graph.
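The rule that only the first warning is displayed while service performance stays beyond a limit can be sketched as a simple state check. The function names and sample values are hypothetical. The sketch also assumes that moving directly from above the upper-limit value to below the lower-limit value counts as a new warning, which the manual does not specify.

```python
def classify(value, lower_limit, upper_limit):
    """Return 'high', 'low', or 'in' for a single performance value."""
    if value > upper_limit:
        return "high"
    if value < lower_limit:
        return "low"
    return "in"


def first_warnings(samples, lower_limit, upper_limit):
    """Return (index, state) pairs for the first sample of each excursion."""
    warnings = []
    previous = "in"
    for index, value in enumerate(samples):
        state = classify(value, lower_limit, upper_limit)
        # Warn only when the state changes to an out-of-range state.
        if state != "in" and state != previous:
            warnings.append((index, state))
        previous = state
    return warnings


# Hypothetical average response times (ms) with limits of 95 and 105.
samples = [100, 106, 107, 108, 101, 94, 93, 100, 107]
result = first_warnings(samples, 95, 105)
```

Here the values at indexes 1 to 3 all exceed the upper-limit value, but only index 1 produces a warning. The same applies to the run below the lower-limit value that starts at index 5.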

The following shows an example of a graph.

Figure 4‒12: Example of a graph (out-of-range value detection)

[Figure]

In the graph, a warning icon indicates the time at which the service performance exceeded the upper-limit value or dropped below the lower-limit value. A colored belt indicates the time period during which the event that resulted in the out-of-range value is suspected to have occurred.

To perform out-of-range value detection, you must specify the following items in the Settings window:

Days till start

Specifies the number of days for which service performance data is to be accumulated before out-of-range value detection is to be started. Out-of-range value detection requires that service performance data be accumulated from the monitored service running in the actual operating environment before a baseline can be calculated. If service performance data is accumulated for at least one day, out-of-range value detection can be performed. However, if the number of days specified for accumulation of service performance data is less than the number of days to be used in the baseline calculation, the obtained baseline might be unrealistic because there is not enough data to calculate it. For Days till start, we recommend that you specify a value that is at least equal to the number of days to be used in the baseline calculation.

Days in baseline calculation

Specifies the number of days' worth of accumulated past service performance data that are to be used for calculation of the baseline.

Sensitivity

Specifies a sensitivity setting for out-of-range value detection that is to be used to determine the distance from the baseline to the upper-limit and lower-limit values. You can select High, Middle, or Low for the sensitivity setting. High sensitivity reduces the distance from the baseline to the upper-limit and lower-limit values, making service performance anomalies more likely to be detected. Low sensitivity increases the distance from the baseline to the upper-limit and lower-limit values, making service performance anomalies less likely to be detected. For High, the distance is half of the distance for Middle; for Low, the distance is 1.5 times the distance for Middle.

The following shows examples in which the distance from the baseline to the upper-limit and lower-limit values is narrowed or widened depending on the sensitivity.

Figure 4‒13: Examples in which the distance from the baseline to the upper-limit and lower-limit values is narrowed or widened

[Figure]

This example monitors the average response time. The service performance is the same in both graphs. However, when the distance from the baseline to the upper-limit and lower-limit values is narrow, as in the graph on the left, more out-of-range values in service performance are detected than when the distance from the baseline to the upper-limit and lower-limit values is wide, as in the graph on the right.
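The effect of the three settings can be sketched as follows. The factors 0.5 and 1.5 relative to Middle come from the Sensitivity description above. The names `limits_for`, `middle_distance`, and `meets_recommendation`, as well as the sample values, are hypothetical and chosen only for illustration.

```python
# Distance from the baseline, relative to the Middle setting.
SENSITIVITY_FACTOR = {"High": 0.5, "Middle": 1.0, "Low": 1.5}


def limits_for(baseline, middle_distance, sensitivity):
    """Return (lower_limit, upper_limit) for the given sensitivity."""
    distance = middle_distance * SENSITIVITY_FACTOR[sensitivity]
    return baseline - distance, baseline + distance


def count_out_of_range(samples, lower_limit, upper_limit):
    """Count samples that fall beyond either limit."""
    return sum(1 for v in samples if v < lower_limit or v > upper_limit)


def meets_recommendation(days_till_start, days_in_baseline):
    """Check the recommendation that Days till start be at least equal
    to Days in baseline calculation."""
    return days_till_start >= days_in_baseline


# Hypothetical values: baseline 100 ms, Middle distance 10 ms.
samples = [100, 106, 94, 112, 88, 120]
counts = {
    name: count_out_of_range(samples, *limits_for(100, 10, name))
    for name in SENSITIVITY_FACTOR
}
```

For the same samples, High yields limits of 95 and 105 and flags five values, Middle yields 90 and 110 and flags three, and Low yields 85 and 115 and flags one. This mirrors the narrow and wide graphs in the figure.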

For details about monitoring items, see 4.2 Monitoring methods and monitor items of SLM.

For details about out-of-range value detection, see the description of out-of-range value detection in the manual JP1/Service Level Management.