Hitachi

JP1 Version 13 JP1/Integrated Management 3 - Manager Overview and System Design Guide


3.2.5 Anomaly detection


(1) Overview

Anomaly detection is a technique for detecting unusual behavior or events that deviate from normal patterns. JP1/IM calculates the mean and standard deviation from historical data, determines that an anomaly has occurred when the current value falls outside the range derived from them, and alerts you. The Z-score is used to analyze and judge anomalies.

A Z-score, also known as a standard score, expresses how far a value lies from the mean of a group of values. It is calculated by dividing the difference between the value and the mean by the standard deviation of the group.

You can use anomaly detection to detect outliers. It is also applicable for detecting errors in cases where there is limited operational data, making it difficult to set threshold values.
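The Z-score described above can be illustrated with a minimal Python sketch. This is not how JP1/IM implements the calculation; the function name z_score and the sample values are hypothetical, and the population standard deviation is assumed.

```python
from statistics import mean, pstdev


def z_score(current: float, history: list[float]) -> float:
    """Return (current - mean) / standard deviation for the given history."""
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation (assumption)
    return (current - mu) / sigma


# Hypothetical history of a metric and two candidate current values.
history = [10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 12.0]
normal = z_score(11.5, history)  # close to the mean, small Z-score
spike = z_score(20.0, history)   # far from the mean, large Z-score
```

In this sketch the value close to the mean yields a Z-score well below 2, while the spike yields a Z-score well above 3.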

Figure 3‒8: Outline of anomaly detection

[Figure]

In the following cases, perform error detection by setting the threshold value.

(2) Prerequisites

- Alert definition function prerequisites for anomaly detection

When using the anomaly detection function, make sure that the following conditions are satisfied:

(3) Combinations of available versions

The following are the combinations of versions in which the anomaly detection function is available:

Important

Do not use a metric whose value is always constant for anomaly detection. For such a metric the standard deviation is 0, so the Z-score cannot be calculated correctly.
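The following minimal Python sketch shows why a constant metric is unsuitable: its standard deviation is 0, which leaves the Z-score without a usable divisor. The function name and the sample data are hypothetical.

```python
from statistics import pstdev


def can_compute_z_score(history: list[float]) -> bool:
    """A Z-score needs a non-zero standard deviation as its divisor."""
    return pstdev(history) > 0.0


constant_metric = [1.0] * 8             # hypothetical metric that never changes
varying_metric = [1.0, 2.0, 1.5, 1.0]   # hypothetical metric with some variation

constant_ok = can_compute_z_score(constant_metric)
varying_ok = can_compute_z_score(varying_metric)
```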

(4) Alert definition function for anomaly detection

Review the metric, the measurement range, and the threshold (Z-score), and then define the anomaly detection alert. You specify the anomaly detection alert definition as an alert conditional expression (PromQL expression) in the "expr" field of the alert configuration file (jpc_alerting_rules.yml) of JP1/IM - Agent. The conditional expression specifies the criterion for judging whether the Z-score calculated by anomaly detection exceeds the score set as the threshold value. If it does, the value is judged to be an outlier and an error is detected.

The Z-score is calculated with the following formula:

Z-score = (data point - mean of the data) / (standard deviation of the data)

Use avg_over_time() to calculate the data point and the mean of the data.

Use stddev_over_time() to calculate the standard deviation of the data.

avg_over_time() is a function that returns the mean of the values in the measurement range (range vector) of the specified metric.

stddev_over_time() is a function that returns the standard deviation of the values in the measurement range (range vector) of the specified metric.

Here is a sample alert definition for anomaly detection:

- Alert definition example

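When avg_over_time() and stddev_over_time() are used in conjunction with rate(), as shown above, a colon (:) must be appended to the end of the duration specified for avg_over_time() and stddev_over_time(). This is because the result of rate() is not a range vector, so the range must be written in PromQL subquery form, for example [7d:].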
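As a rough illustration of the colon rule, the following Python sketch assembles a Z-score condition as a PromQL string, with and without rate(). The helper z_score_expr is hypothetical, windows_net_bytes_sent_total is an assumed metric name, and the resulting strings are sketches rather than the alert definitions shipped with the product.

```python
def z_score_expr(metric: str, window: str, threshold: float,
                 rate_window: str = "") -> str:
    """Build a PromQL Z-score condition as a string (illustrative only)."""
    if rate_window:
        # rate() returns an instant vector, so the outer range must be a
        # subquery: note the ":" after the duration, for example [7d:].
        value = f"rate({metric}[{rate_window}])"
        ranged = f"{value}[{window}:]"
    else:
        # A plain metric can take an ordinary range selector such as [7d].
        value = metric
        ranged = f"{metric}[{window}]"
    return (
        f"({value} - avg_over_time({ranged})) "
        f"/ stddev_over_time({ranged}) > {threshold}"
    )


plain_expr = z_score_expr("windows_memory_available_bytes", "7d", 3)
rate_expr = z_score_expr("windows_net_bytes_sent_total", "7d", 3,
                         rate_window="5m")
```

Printing rate_expr shows the duration written as [7d:] in both avg_over_time() and stddev_over_time(), whereas plain_expr uses [7d] without the colon.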

For details about the alert configuration file, see Alert configuration file (jpc_alerting_rules.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

For details about the avg_over_time function, the stddev_over_time function, and the rate function, see 2.7.4(7)(f) Available functions.

(a) Example of a reference value to be set in a calculation formula for Z-score

This section describes reference values for the calculation formula set in an anomaly detection alert definition.

When anomaly detection is used, it is assumed that the data storage period specified in the prometheus command argument has been changed to a period appropriate for the operation. The storage period must be equal to or longer than the measurement period of the alerts defined for anomaly detection.

■ Setting the measurement period

The following are example reference values for the measurement period that you specify in avg_over_time() and stddev_over_time() to calculate the mean and standard deviation. Change the settings as appropriate for the system you operate.

  • Standard value (for a system that has been running stably for some time since operation started)

    The reference value for the measurement period is 7d.

    Set the data storage period of prometheus specified by the command argument to "7d" or more, and set [7d] for each measurement period of avg_over_time() and stddev_over_time().

  • For systems that have just started operation

    The reference value for the measurement period is 3d.

    Set the data storage period of prometheus specified by the command argument to a value greater than or equal to "3d", and set a value between [2d] and [3d] for each measurement period of avg_over_time() and stddev_over_time().

The reference values for the measurement periods shown above are only guidelines. By extending the prometheus storage period, you can set a measurement period of "7d" or longer.
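The relationship between the storage period and the measurement period can be checked with a small Python sketch like the following. The helper names are hypothetical, and only simple single-unit durations such as "7d" or "12h" are handled.

```python
import re

# Seconds per unit for simple single-unit durations (assumption: s, m, h, d, w).
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def duration_seconds(text: str) -> int:
    """Convert a simple duration such as "7d" or "12h" to seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def retention_covers(retention: str, measurement_period: str) -> bool:
    """True if the storage period is at least as long as the measurement period."""
    return duration_seconds(retention) >= duration_seconds(measurement_period)


standard_ok = retention_covers("7d", "7d")   # storage 7d, measurement [7d]
too_short = retention_covers("3d", "7d")     # storage 3d cannot cover [7d]
```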

■ Setting the Z-score (threshold)

The following are example reference values for the threshold that you set in an anomaly detection alert definition.

In anomaly detection, if the calculated Z-score exceeds the score set as the threshold value, the value is considered an outlier and an error is detected.

Note that a Z-score of 2 or less is generally considered unproblematic, so set the threshold to a value greater than 2.

  • Standard value

    The threshold reference value is 3.

    Set the threshold value to "3" when detecting anomalies by analysis and judgment based on the general Z-score method.

  • For stricter detection than the standard

    The threshold reference value is 2.5.

    Subtract 0.5 from the standard value and set the threshold value to "2.5".

  • For looser detection than the standard

    The threshold reference value is 3.5.

    Add 0.5 to the standard value and set the threshold value to "3.5".

The threshold values shown above are for reference only. If you set a small value, such as 2, even values only slightly away from the mean are detected. The smaller the value, the more readily anomalies are detected. If you set a large value, an error is detected only for values far from the mean.
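To see what the reference thresholds mean in terms of actual values, the following Python sketch computes the value at which the Z-score reaches each threshold. The mean of 100 and standard deviation of 10 are hypothetical figures chosen only for illustration.

```python
def detection_limit(mean: float, stddev: float, threshold: float) -> float:
    """Smallest value above the mean whose Z-score reaches the threshold."""
    return mean + threshold * stddev


# Hypothetical metric with mean 100 and standard deviation 10.
limits = {t: detection_limit(100.0, 10.0, t) for t in (2.5, 3.0, 3.5)}
# limits maps each threshold to the value beyond which an error is detected.
```

With these figures, the stricter threshold of 2.5 starts detecting at 125, the standard threshold of 3 at 130, and the looser threshold of 3.5 at 135.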

■ Target metric

The metric used to define anomaly detection alerts must be one that is collected by JP1/IM - Agent.

(5) Use Case

- Monitoring Network Transmission Volume

The following shows an example of monitoring when the network transmission volume is set as the monitoring target.

<Assumed configuration>

JP1/IM - Manager and JP1/IM - Agent are installed on the monitored servers and OS monitoring is in progress. In addition, JP1/IM - Manager, which integrates the agents, is installed on a single server machine.

The system shows a steady usage rate and is operating stably.

<Construction>

<Operation>

  1. Alert notification from anomaly detection

    If the monitored network transmission volume exceeds the threshold, the alert defined in <Construction> issues a JP1 event. When the operator notices the outlier through an email notification sent by an automatic action or through the alert information on the dashboard screen, the operator escalates it to the system administrator.

  2. Investigating the operational information of related resources

    The system administrator reviews the operational information for each resource in the dashboard and investigates the cause.

    For example, on AP server 3, check whether the amount of network transmission has increased compared with normal hours. If the initial investigation shows that the increase is due to a rise in the number of users, ask users to limit their use of the system as a temporary measure, and confirm whether the value decreases.

    [Figure]

  3. Checking temporary measures

    After the temporary measure is complete, the system administrator checks the values on the dashboard again to confirm that the values that had been rising have stabilized.

  4. Investigating the root cause and examining permanent countermeasures

    Investigate other operational information and refer to the application logs to identify the root cause and to consider permanent countermeasures.

- Example of Monitoring the Amount of Free Memory

The following shows an example of monitoring when the amount of free memory is set as the monitoring target.

<Assumed configuration>

This is the same as <Assumed configuration> in "Monitoring Network Transmission Volume".

<Construction>

<Operation>

  1. Alert notification from anomaly detection

    If the monitored amount of free memory exceeds the threshold value, the alert defined in <Construction> issues a JP1 event. When the operator notices the outlier through an email notification sent by an automatic action or through the alert information on the dashboard screen, the operator escalates it to the system administrator.

  2. Investigating the operational information of related resources

    The system administrator reviews the operational information for each resource in the dashboard and investigates the cause.

    For example, if AP server 1 shows that the amount of free memory has decreased compared with normal hours, and the initial investigation shows that access is heavier than in normal hours, distribute the accesses to AP server 1 across other AP servers as a temporary measure to reduce memory consumption.

    [Figure]

  3. Checking temporary measures

    After the temporary measure is complete, the system administrator checks the values on the dashboard again to confirm that the values have stabilized.

  4. Investigating the root cause and examining permanent countermeasures

    Investigate other operational information and refer to the application logs to identify the root cause and to consider permanent countermeasures.

- When it is desirable to perform threshold monitoring

This section describes cases in which it is desirable to perform error detection by setting a threshold value instead of using anomaly detection.

As an example, suppose that an operation outside normal operations was performed on a server with 1 TB of disk capacity, temporarily creating a large amount of backup data. As a result, disk usage changed as shown in the following figure. Assume also that there is no operational problem as long as the free disk space does not fall below 20%.

[Figure]

Anomaly detection detects an error when the value rises by a certain amount relative to its trend under normal conditions. Therefore, in the above example, an error is detected when the value increases because of unplanned work, even if the increase is within the range permitted for operation.

If such temporary increases occur repeatedly, an error is detected each time, so the number of detected errors grows even though the values stay within operational assumptions.

For monitoring items whose thresholds are easy to set (items for which the permissible values for operation are fixed), which include but are not limited to disk space, it is desirable to set threshold monitoring alert definitions.
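The difference between the two approaches in the disk example can be sketched in Python as follows. The usage figures are hypothetical. Only the 1 TB capacity and the 20% limit come from the example above, and the 20% limit is interpreted here as free space.

```python
from statistics import mean, pstdev

DISK_GB = 1000.0        # 1 TB disk, as in the example above
MIN_FREE_RATIO = 0.20   # assumed: no operational problem while free space >= 20%

# Hypothetical disk usage (GB) under normal operation, then during the backup.
normal_usage_gb = [300.0, 305.0, 298.0, 302.0, 301.0, 299.0, 303.0]
during_backup_gb = 600.0

# Anomaly detection: compare the Z-score of the new value with threshold 3.
mu, sigma = mean(normal_usage_gb), pstdev(normal_usage_gb)
z = (during_backup_gb - mu) / sigma
anomaly_alert = z > 3

# Threshold monitoring: alert only when free space drops below 20%.
free_ratio = (DISK_GB - during_backup_gb) / DISK_GB
threshold_alert = free_ratio < MIN_FREE_RATIO
```

In this sketch the temporary backup is flagged by the Z-score check even though 40% of the disk is still free, while the fixed threshold stays silent.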

(6) Dashboard Verification

This section describes how to check, on the dashboard, the baseline, actual values, mean, and normal range used in anomaly detection.

- Predefined

You must define the metrics in the metric definition file before you can see the values on the dashboard.

The following is an example definition for displaying free memory. For the range vector selector in promql (the value specified in square brackets [ ]), you can also specify $stepTime{minSeconds="minimum-seconds"}, which is calculated dynamically so that all data is included according to the range of trend data displayed on the dashboard. For details about specifying the range vector selector $stepTime{minSeconds="minimum-seconds"}, see Consolidation display of trend data with dynamic range vectors in 3.15.6(4)(c) About Performance Data to Retrieve.

Important

When a dashboard displays the baseline, actual-value, mean, and normal-range metrics used by anomaly detection, it might take longer to display them than other metrics, depending on the metric or on the measurement range from which the values are calculated.

{
    "name":"memory_unused_baseline",#1
    "default":false,
    "promql":"avg_over_time(windows_memory_available_bytes[5m]) / (1024*1024*1024)",#2
    "resource_en": {
      "category":"platform_windows",
      "label":"Memory unused_base",
      "description":"Available size in the physical memory area.",
      "unit":"GB"
    },
    "resource_ja":{
      "category":"platform_windows",
      "label":"空きメモリ量_ベース",
      "description":"物理メモリ領域の未使用サイズ",
      "unit":"ギガバイト"
    }
  },
  {
    "name":"memory_unused_average",#3
    "default":false,
    "promql":"avg_over_time(windows_memory_available_bytes[7d]) / (1024*1024*1024)",#4
    "resource_en":{
      "category":"platform_windows",
      "label":"Memory unused_average",
      "description":"Available size in the physical memory area.",
      "unit":"GB"
    },
    "resource_ja": {
      "category":"platform_windows",
      "label":"空きメモリ量_平均",
      "description":"物理メモリ領域の未使用サイズ",
      "unit":"ギガバイト"
    }
  },
  {
    "name":"memory_unused_std_up",#5
    "default":false,
    "promql":"(avg_over_time(windows_memory_available_bytes[7d])+ stddev_over_time(windows_memory_available_bytes[7d])) / (1024*1024*1024)",#6
    "resource_en":{
      "category":"platform_windows",
      "label":"Memory unused_std_up",
      "description":"Available size in the physical memory area.",
      "unit":"GB"
    },
    "resource_ja":{
      "category":"platform_windows",
      "label":"空きメモリ量_正常範囲+",
      "description":"物理メモリ領域の未使用サイズ",
      "unit":"ギガバイト"
    }
  },
  {
    "name":"memory_unused_std_down",#7
    "default":false,
    "promql":"(avg_over_time(windows_memory_available_bytes[7d])- stddev_over_time(windows_memory_available_bytes[7d])) / (1024*1024*1024)",#8
    "resource_en":{
      "category":"platform_windows",
      "label":"Memory unused_std_down",
      "description":"Available size in the physical memory area.",
      "unit":"GB"
    },
    "resource_ja":{
      "category":"platform_windows",
      "label":"空きメモリ量_正常範囲-",
      "description":"物理メモリ領域の未使用サイズ",
      "unit":"ギガバイト"
    }
  }
#1

Indicates the metric for the baseline.

#2

Indicates a PromQL expression that calculates the 5-minute mean as the baseline.

#3

Indicates the metric for the mean.

#4

Indicates a PromQL expression that calculates the mean based on seven days of data.

#5

Indicates the metric for mean + standard deviation.

#6

Indicates a PromQL expression that calculates the mean and the standard deviation based on seven days of data and adds them together.

#7

Indicates the metric for mean - standard deviation.

#8

Indicates a PromQL expression that calculates the mean and the standard deviation based on seven days of data and subtracts the standard deviation from the mean.
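The arithmetic behind the mean and normal-range metrics (#4, #6, and #8) can be reproduced with a small Python sketch. The function name and the sample values are hypothetical, and the population standard deviation is assumed.

```python
from statistics import mean, pstdev

GIB = 1024 * 1024 * 1024  # same divisor as in the promql definitions


def normal_range_gb(samples_bytes: list[float]) -> tuple[float, float, float]:
    """Return (mean, mean + stddev, mean - stddev) converted to GB."""
    mu = mean(samples_bytes)
    sigma = pstdev(samples_bytes)  # population standard deviation (assumption)
    return mu / GIB, (mu + sigma) / GIB, (mu - sigma) / GIB


# Hypothetical free-memory samples in bytes, alternating between 4 GB and 6 GB.
samples = [4.0 * GIB, 6.0 * GIB, 4.0 * GIB, 6.0 * GIB]
avg_gb, upper_gb, lower_gb = normal_range_gb(samples)
```

For these samples the mean is 5 GB and the standard deviation is 1 GB, so the normal range plotted on the dashboard would run from 4 GB to 6 GB.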

- Working with the Dashboard

Based on the metrics defined in "Predefined", add the following panels in the Dashboard List dialog box, and arrange the created panels side by side.

Panel A
  • y-axis: Memory unused_base

  • y2 axis: Memory unused_average

  • Minimum value: Minimum value according to the acquisition result

  • Maximum value: Maximum value according to the acquisition result

Panel B
  • y-axis: Memory unused_std_up

  • y2 axis: Memory unused_std_down

  • Minimum value: Minimum value according to the acquisition result

  • Maximum value: Maximum value according to the acquisition result

When creating the above panels, set the minimum and maximum values for the y-axis of each panel. If they are not set, a variable range is used for each axis.

Figure 3‒9: Graph display image

[Figure]