3.2.5 Anomaly detection
(1) Overview
Anomaly detection is a technique for detecting unusual behavior or events that deviate from normal patterns. JP1/IM calculates the mean and standard deviation from historical data, determines that an anomaly has occurred when the current value falls outside the range derived from them, and alerts you. The Z score is used to analyze and judge anomalies.
A Z score, also known as a standard score, expresses how far a value is from the mean of a group of values. It is calculated by dividing the difference between the value and the mean by the standard deviation of the group.
You can use anomaly detection to detect outliers. It is also useful for detecting errors when little operational data has been accumulated and threshold values are therefore difficult to set.
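As a minimal sketch of the calculation described above, the following Python snippet computes a Z score from historical samples. The function name z_score and the sample values are illustrative assumptions and are not part of JP1/IM. The population standard deviation (statistics.pstdev) is used on the assumption that it corresponds to what the Prometheus stddev_over_time() function returns.

```python
from statistics import fmean, pstdev

def z_score(value, history):
    """Return (value - mean) / standard deviation for the given history."""
    mean = fmean(history)
    std = pstdev(history)  # population standard deviation
    return (value - mean) / std

# Hypothetical free-memory samples (GB) collected during normal operation.
history = [4.0, 6.0, 5.0, 5.0, 4.0, 6.0]

print(z_score(5.5, history))  # close to the mean, so the Z score is small
print(z_score(9.0, history))  # far from the mean, so the Z score is large
```

With these sample values, only the second value would exceed a threshold of 3.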
In the following cases, detect errors by setting a threshold value instead:
-
When you need to detect errors in a monitoring item for which a threshold value is easy to set (an item whose acceptable value for operation is fixed)
-
When anomaly detection would detect an error because the value deviates from the mean by more than a certain amount, even though the range allowed for operation is not exceeded
(2) Prerequisites
- Alert definition function prerequisites for anomaly detection
When using the anomaly detection function, make sure that the following conditions are satisfied:
-
The metric for which you want to configure anomaly detection alerting must be supported by JP1/IM - Agent.
-
The performance data retention period specified in the prometheus command arguments must be longer than the alert measurement period defined for anomaly detection. For details about the retention period of performance data, see prometheus command options in Service definition file (jpc_program-name_service.xml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.
-
Anomaly detection in JP1/IM calculates the Z score from whatever data can be collected at that time. Therefore, if the performance data stored in the Intelligent Integrated Management Database covers a period shorter than the alert measurement period, you might receive unwanted alerts. We recommend that you define anomaly detection alerts after performance data covering at least the measurement period has been stored.
(3) Combinations of available versions
The anomaly detection function is available with the following combination of versions:
-
JP1/IM - Manager 13-00 or later
-
JP1/IM - Agent 13-11 or later
- Important
-
Do not use a metric whose value is always constant for anomaly detection. For such a metric, the Z score cannot be calculated correctly.
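The following Python sketch illustrates why a constant metric is unsuitable: when every sample is identical, the standard deviation is 0 and the division in the Z score formula is undefined. The function name safe_z_score and the sample data are hypothetical, and returning None is only one possible way to signal the problem in this sketch.

```python
from statistics import fmean, pstdev

def safe_z_score(value, history):
    """Return the Z score, or None when the history has no variation."""
    std = pstdev(history)
    if std == 0:
        # Every sample is identical, so (value - mean) / std is undefined.
        return None
    return (value - fmean(history)) / std

constant_history = [100.0] * 10  # hypothetical metric that never changes

print(safe_z_score(100.0, constant_history))  # no score can be calculated
print(safe_z_score(101.0, constant_history))  # even a real change cannot be scored
```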
(4) Alert definition function for anomaly detection
Review the metric, the target range, and the threshold (Z score), and then define the anomaly detection alert. The anomaly detection alert is defined in the alert conditional expression (PromQL expression) in the "expr" field of the alert configuration file (jpc_alerting_rules.yml) of JP1/IM - Agent. The conditional expression specifies the criterion for judging whether the Z score calculated by anomaly detection exceeds the score set as the threshold value. If it does, the value is judged to be an outlier and an error is detected.
Here is the formula for the Z score:
-
Formula for the Z score
Z score = (data point - data mean) / standard deviation of the data
Use avg_over_time() to calculate the data point and the data mean.
Use stddev_over_time() to calculate the standard deviation of the data.
avg_over_time() is a function that returns the mean of the values in the measurement range (range vector) of the specified metric.
stddev_over_time() is a function that returns the standard deviation of the values in the measurement range (range vector) of the specified metric.
Here is a sample alert definition for anomaly detection:
- Alert definition example
-
Calculate the Z score from the mean and standard deviation of one week of operating data, and issue an alert when the Z score (calculation result) exceeds the threshold.
groups:
  - name: node_exporter
    rules:
    - alert: MemAvailable(Node exporter)
      expr: 3 < abs((avg_over_time(node_memory_MemAvailable_bytes[5m])-avg_over_time(node_memory_MemAvailable_bytes[7d]))/stddev_over_time(node_memory_MemAvailable_bytes[7d]))
        :
      (Omitted)
        :
-
Calculate, for each instance, the Z score from the mean and standard deviation of the CPU unused ratio (label mode="idle" of the windows_cpu_time_total metric) over one week of operating data, and issue a notification when the Z score (calculation result) exceeds the threshold (using by and rate()).
groups:
  - name: windows_exporter
    rules:
    - alert: cpu_unsed
      expr: 3 < abs((avg_over_time((avg by (instance)(rate(windows_cpu_time_total{mode="idle"}[2m])))[5m:])*100) - (avg_over_time((avg by (instance)(rate(windows_cpu_time_total{mode="idle"}[2m])))[7d:])*100)) / (stddev_over_time((avg by (instance)(rate(windows_cpu_time_total{mode="idle"}[2m])))[7d:])*100)
        :
      (Omitted)
        :
When avg_over_time() and stddev_over_time() are used together with rate(), as shown above, you must append ":" to the end of the duration specified for avg_over_time() and stddev_over_time() (for example, [5m:] or [7d:]).
For details about the alert configuration file, see Alert configuration file (jpc_alerting_rules.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.
For details about the avg_over_time, stddev_over_time, and rate functions, see 2.7.4(7)(f) Available functions.
(a) Example of a reference value to be set in a calculation formula for Z-score
This section describes reference values for the calculation formula set in an anomaly detection alert definition.
When you use anomaly detection, it is assumed that you change the data storage period specified in the prometheus command arguments to a period appropriate for your operation. The storage period that you set must be longer than the measurement period of the alerts defined for anomaly detection.
■ Setting the measurement period
The following are reference values for the measurement period that you specify in avg_over_time() and stddev_over_time() to calculate the mean and standard deviation. Change the settings as appropriate for the system to be operated.
-
Standard value (for a system that has been operating stably for some time since it went into operation)
The reference value for the measurement period is 7d.
Set the data storage period of prometheus specified by the command argument to "7d" or more, and set [7d] for each measurement period of avg_over_time() and stddev_over_time().
-
For a system that has just started operation
The reference value for the measurement period is 3d.
Set the data storage period of prometheus specified by the command argument to "3d" or more, and set a value between [2d] and [3d] for each measurement period of avg_over_time() and stddev_over_time().
The reference values for the measurement period shown above are only guidelines. By extending the prometheus storage period, you can also set a measurement period of "7d" or longer.
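The relationship between the storage period and the measurement period can be checked with a small helper such as the following Python sketch. The function names are hypothetical, only single-unit durations such as "7d" or "5m" are handled, and the comparison uses "greater than or equal to" on the assumption that it follows the reference values above (a storage period of "7d" or more combined with [7d]).

```python
import re

# Seconds per unit for the simple duration notation used in this sketch.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def to_seconds(duration):
    """Convert a single-unit duration such as '7d' or '5m' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]

def retention_covers(retention, measurement_period):
    """Return True if the storage period is at least the measurement period."""
    return to_seconds(retention) >= to_seconds(measurement_period)

print(retention_covers("7d", "7d"))  # storage period matches the standard value
print(retention_covers("3d", "7d"))  # storage period is too short for [7d]
```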
■ Setting the Z-score (threshold)
This section gives reference values for the threshold that you set in an anomaly detection alert definition.
In anomaly detection, if the Z score (calculated result) exceeds the score set as the threshold value, the value is regarded as an outlier and an error is detected.
Note that a Z score of 2 or less is generally considered unproblematic, so set the threshold to a value greater than 2. The following are reference values for the threshold:
-
Standard value
The threshold reference value is 3.
Set the threshold value to "3" to detect anomalies according to the general Z-score method.
-
For stricter detection than the standard
The threshold reference value is 2.5.
Subtract 0.5 from the standard value and set the threshold value to "2.5".
-
For looser detection than the standard
The threshold reference value is 3.5.
Add 0.5 to the standard value and set the threshold value to "3.5".
The threshold values shown above are for reference only. If you set a small value, such as 2, even a value that is only slightly away from the mean is detected. The smaller the value, the more readily anomalies are detected. If you set a large value, an error is detected only for values far from the mean.
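The effect of the three reference thresholds can be sketched as follows. The function is_anomaly and the sample Z score are illustrative assumptions; the condition mirrors the form "threshold < abs(Z score)" used in the alert definition examples.

```python
def is_anomaly(z, threshold=3.0):
    """Mirror the alert condition `threshold < abs(z)`."""
    return threshold < abs(z)

z = 2.8  # hypothetical Z score of the current value

for threshold in (2.5, 3.0, 3.5):
    # Only the strictest threshold (2.5) treats this value as an anomaly.
    print(threshold, is_anomaly(z, threshold))
```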
■ Target metric
The metric used in an anomaly detection alert definition must be one that is collected by JP1/IM - Agent.
(5) Use Case
- Monitoring Network Transmission Volume
The following shows an example of monitoring when the network transmission volume is set as the monitoring target.
<Assumed configuration>
JP1/IM - Manager and JP1/IM - Agent are installed on the monitored servers and OS monitoring is in progress. In addition, JP1/IM - Manager, which integrates the agents, is installed on a single server machine.
The system shows a constant usage rate and is operating stably.
<Construction>
-
Defining Alerts for anomaly detection
The following shows how to configure anomaly detection for the network transmission volume in the alert configuration file (jpc_alerting_rules.yml) of the JP1/IM - Agent that performs OS monitoring.
In this example, the mean and standard deviation are calculated from 7 days of network transmission data, and an anomaly is detected when the actual value is outside the normal range.
groups:
  - name: windows_exporter
    rules:
    - alert: windows_net_packets_sent_total (Windows exporter)
      expr: 3 < abs((avg_over_time(rate(windows_net_packets_sent_total[2m])[5m:])-avg_over_time(rate(windows_net_packets_sent_total[2m])[7d:]))/stddev_over_time(rate(windows_net_packets_sent_total[2m])[7d:]))
      for: 3m
      labels:
        jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
        jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
        jp1_pc_severity: "Warning"
        jp1_pc_eventid: "0401"
        jp1_pc_metricname: "windows_net_packets_sent_total"
      annotations:
        jp1_pc_firing_description: "Network transmit volume is different from normal."
        jp1_pc_resolved_description: "Network transmit volume has returned to normal."
-
Defining Dashboards for Resource Comparison
Define a dashboard for checking the network transmission volume of each server in the AP server group.
<Operation>
-
Operator action
-
Alert notification from anomaly detection
If the Z score of the monitored network transmission volume exceeds the threshold, the alert defined in <Construction> issues a JP1 event. When the operator notices the anomaly in the system through an email notification sent by an automatic action or through alert information on the dashboard screen, the operator escalates it to the system administrator.
-
System administrator actions
-
Investigating the operational information of related resources
The system administrator reviews the operational information for each resource in the dashboard and investigates the cause.
For example, the system administrator confirms that the network transmission volume on AP server 3 has increased compared to normal hours. If the initial investigation shows that the increase is due to a rise in the number of users, the system administrator asks users to limit their use of the system as a temporary measure and checks whether the value decreases.
-
Checking temporary measures
After the temporary measure is complete, the system administrator checks the values on the dashboard again to confirm that the value that had been rising has stabilized.
-
Investigating the root cause and examining permanent countermeasures
The system administrator investigates other operational information and refers to the application logs to identify the root cause and consider permanent countermeasures.
- Example of Monitoring the Amount of Free Memory
The following shows an example of monitoring when the amount of free memory is set as the monitoring target.
<Assumed configuration>
This is equivalent to <Assumed configuration> in "Monitoring Network Transmission Volume".
<Construction>
-
Defining Alerts for anomaly detection
The following shows how to configure anomaly detection for the amount of free memory in the alert configuration file (jpc_alerting_rules.yml) of the JP1/IM - Agent that performs OS monitoring.
In this example, the mean and standard deviation are calculated from 7 days of free memory data, and an anomaly is detected when the actual value is outside the normal range.
groups:
  - name: windows_exporter
    rules:
    - alert: windows_memory_available_bytes (Windows exporter)
      expr: 3 < abs((avg_over_time(windows_memory_available_bytes[5m])-avg_over_time(windows_memory_available_bytes[7d]))/stddev_over_time(windows_memory_available_bytes[7d]))
      for: 3m
      labels:
        jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
        jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
        jp1_pc_severity: "Alert"
        jp1_pc_eventid: "0402"
        jp1_pc_metricname: "windows_memory_available_bytes"
      annotations:
        jp1_pc_firing_description: "Free memory is different from normal."
        jp1_pc_resolved_description: "The available memory has returned to normal."
-
Defining Dashboards for Resource Comparison
Define a dashboard for checking the amount of free memory of each server in the AP server group.
<Operation>
-
Operator action
-
Alert notification from anomaly detection
If the Z score of the monitored amount of free memory exceeds the threshold value, the alert defined in <Construction> issues a JP1 event. When the operator notices the anomaly in the system through an email notification sent by an automatic action or through alert information on the dashboard screen, the operator escalates it to the system administrator.
-
System administrator actions
-
Investigating the operational information of related resources
The system administrator reviews the operational information for each resource in the dashboard and investigates the cause.
For example, suppose that the amount of free memory on AP server 1 has decreased compared to normal hours, and the initial investigation shows that access is more concentrated than during normal hours. In this case, as a temporary measure, the system administrator distributes the accesses to AP server 1 across the other AP servers to reduce memory consumption.
-
Checking temporary measures
After the temporary measure is complete, the system administrator checks the values on the dashboard again to confirm that the value has stabilized.
-
Investigating the root cause and examining permanent countermeasures
The system administrator investigates other operational information and refers to the application logs to identify the root cause and consider permanent countermeasures.
- When it is desirable to perform threshold monitoring
This section describes cases in which it is preferable to detect errors by setting a threshold value instead of using anomaly detection.
As an example, suppose that on a server with 1 TB of disk capacity, work outside normal operations is performed and a large backup is temporarily created, so that disk usage changes as shown in the following figure. Suppose also that there is no operational problem until free disk space falls below 20%.
Anomaly detection detects an error when the value rises by a certain amount relative to its normal trend. Therefore, in the above example, an error is detected when the value rises because of unscheduled work, even if the increase is acceptable for operation.
If such temporary increases occur repeatedly, an error is detected each time, and the number of detected errors grows even though operation remains within expectations.
For monitoring items for which a threshold value is easy to set (items whose permissible values for operation are fixed), not only disk space, it is preferable to define alerts for threshold monitoring.
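The difference between the two approaches in the disk example can be sketched in Python as follows. All sample values are hypothetical, and the fixed condition assumes that operation is acceptable while 20% or more of the disk is free; neither is taken from an actual JP1/IM definition.

```python
from statistics import fmean, pstdev

CAPACITY_GB = 1000     # 1 TB disk, as in the example above
MIN_FREE_RATIO = 0.20  # assumed: no problem while 20% or more is free

# Hypothetical used space (GB) during normal operation.
normal_usage = [300, 305, 295, 302, 298, 300, 304, 296]
# Hypothetical used space while a temporary backup is stored on the disk.
during_backup = 500

mean, std = fmean(normal_usage), pstdev(normal_usage)
z = (during_backup - mean) / std
free_ratio = (CAPACITY_GB - during_backup) / CAPACITY_GB

anomaly_alert = 3 < abs(z)                      # anomaly detection condition
threshold_alert = free_ratio < MIN_FREE_RATIO   # fixed-threshold condition

print(anomaly_alert, threshold_alert)
```

In this sketch the temporary backup triggers the anomaly detection condition although half of the disk is still free, whereas the fixed-threshold condition stays quiet.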
(6) Dashboard Verification
This section describes how to check, on the dashboard, the baseline, actual values, mean, and normal range used in anomaly detection.
- Predefined
Before you can display the values on the dashboard, you must define the metrics in the metric definition file.
The following is an example definition for displaying free memory. For the range vector selector in promql (the value specified in square brackets [ ]), you can also specify $stepTime{minSeconds="minimum-seconds"}, which is calculated dynamically so that all data is included according to the range of trend data displayed in the dashboard. For details about specifying the range vector selector $stepTime{minSeconds="minimum-seconds"}, see Consolidation display of trend data with dynamic range vectors in 3.15.6(4)(c) About Performance Data to Retrieve.
- Important
-
When a dashboard displays the baseline, actual-value, mean, and normal-range metrics used by anomaly detection, it might take longer to display them than other metrics, depending on the metric and the measurement range from which the values are calculated.
{
"name":"memory_unused_baseline",#1
"default":false,
"promql":"avg_over_time(windows_memory_available_bytes[5m]) / (1024*1024*1024)",#2
"resource_en": {
"category":"platform_windows",
"label":"Memory unused_base",
"description":"Available size in the physical memory area.",
"unit":"GB"
},
"resource_ja":{
"category":"platform_windows",
"label":"空きメモリ量_ベース",
"description":"物理メモリ領域の未使用サイズ",
"unit":"ギガバイト"
}
},
{
"name":"memory_unused_average",#3
"default":false,
"promql":"avg_over_time(windows_memory_available_bytes[7d]) / (1024*1024*1024)",#4
"resource_en":{
"category":"platform_windows",
"label":"Memory unused_average",
"description":"Available size in the physical memory area.",
"unit":"GB"
},
"resource_ja": {
"category":"platform_windows",
"label":"空きメモリ量_平均",
"description":"物理メモリ領域の未使用サイズ",
"unit":"ギガバイト"
}
},
{
"name":"memory_unused_std_up",#5
"default":false,
"promql":"(avg_over_time(windows_memory_available_bytes[7d])+ stddev_over_time(windows_memory_available_bytes[7d])) / (1024*1024*1024)",#6
"resource_en":{
"category":"platform_windows",
"label":"Memory unused_std_up",
"description":"Available size in the physical memory area.",
"unit":"GB"
},
"resource_ja":{
"category":"platform_windows",
"label":"空きメモリ量_正常範囲+",
"description":"物理メモリ領域の未使用サイズ",
"unit":"ギガバイト"
}
},
{
"name":"memory_unused_std_down",#7
"default":false,
"promql":"(avg_over_time(windows_memory_available_bytes[7d])- stddev_over_time(windows_memory_available_bytes[7d])) / (1024*1024*1024)",#8
"resource_en":{
"category":"platform_windows",
"label":"Memory unused_std_down",
"description":"Available size in the physical memory area.",
"unit":"GB"
},
"resource_ja":{
"category":"platform_windows",
"label":"空きメモリ量_正常範囲-",
"description":"物理メモリ領域の未使用サイズ",
"unit":"ギガバイト"
}
}
- #1
-
Indicates the name of the baseline metric.
- #2
-
Indicates a PromQL expression that calculates the 5-minute mean as the baseline.
- #3
-
Indicates the name of the mean metric.
- #4
-
Indicates a PromQL expression that calculates the mean from seven days of data.
- #5
-
Indicates the name of the metric for the mean + standard deviation.
- #6
-
Indicates a PromQL expression that calculates the mean and the standard deviation from seven days of data and adds them together.
- #7
-
Indicates the name of the metric for the mean - standard deviation.
- #8
-
Indicates a PromQL expression that calculates the mean and the standard deviation from seven days of data and subtracts the standard deviation from the mean.
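To make the relationship between the four definitions concrete, the following Python sketch computes the same four values from sample data. The function dashboard_values and the sample data are hypothetical; the dictionary keys reuse the metric names from the definition example, and the population standard deviation is assumed to correspond to stddev_over_time().

```python
from statistics import fmean, pstdev

GIB = 1024 * 1024 * 1024  # divisor used in the promql definitions

def dashboard_values(recent_bytes, week_bytes):
    """Return the four values (in GB) that the example metrics would show."""
    mean = fmean(week_bytes)
    std = pstdev(week_bytes)  # population standard deviation
    return {
        "memory_unused_baseline": fmean(recent_bytes) / GIB,  # 5-minute mean
        "memory_unused_average": mean / GIB,                  # 7-day mean
        "memory_unused_std_up": (mean + std) / GIB,           # mean + std
        "memory_unused_std_down": (mean - std) / GIB,         # mean - std
    }

# Hypothetical samples of free memory in bytes.
week = [4 * GIB, 6 * GIB, 5 * GIB, 5 * GIB]
recent = [5 * GIB, 5 * GIB]

values = dashboard_values(recent, week)
for name, value in values.items():
    print(name, round(value, 2))
```

As long as the baseline stays between memory_unused_std_down and memory_unused_std_up, the current value is within one standard deviation of the mean.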
- Working with the Dashboard
Using the metrics defined in "Predefined", add the following panels in the Dashboard List dialog box, and arrange the created panels side by side.
- Panel A
-
y-axis: Amount of free memory_base
-
y2 axis: Amount of free memory_average
-
Minimum value: Minimum value according to the acquisition result
-
Maximum value: Maximum value according to the acquisition result
- Panel B
-
y-axis: Amount of free memory_Normal range+
-
y2 axis: Amount of free memory_Normal range-
-
Minimum value: Minimum value according to the acquisition result
-
Maximum value: Maximum value according to the acquisition result
When creating the above panels, set the minimum and maximum values for the y-axis of each panel. If you do not set them, a variable range is applied to each axis.