Alert configuration file (jpc_alerting_rules.yml)
Format
Write in YAML format.

```yaml
groups:
  - name: group-name
    rules:
      - alert: alert-name
        expr: conditional-expression
        for: period
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: JP1-event-severity
          jp1_pc_eventid: event-ID-of-the-JP1-event
          jp1_pc_metricname: metric-name
        annotations:
          jp1_pc_firing_description: message-when-the-firing-condition-is-met
          jp1_pc_resolved_description: message-when-the-firing-condition-is-no-longer-met
```
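For example, a minimal file with one group and one rule, assembled from the cpu_used_rate example shown under Definition example below, might look like the following sketch (the expression, threshold, and messages are illustrative):

```yaml
groups:
  - name: node_exporter                  # group name, unique within the host
    rules:
      - alert: cpu_used_rate(Node exporter)
        # Fires when user-mode CPU usage stays above 80% for 3 minutes.
        expr: 80 < avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="user"}[2m])) * 100
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0301"
          jp1_pc_metricname: "node_cpu_seconds_total"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```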
File
jpc_alerting_rules.yml
jpc_alerting_rules.yml.model (model file)
Storage directory
■Integrated agent host
- In Windows:
  - For a physical host: Agent-path\conf\
  - For a logical host: shared-folder\jp1ima\conf\
- In Linux:
  - For a physical host: /opt/jp1ima/conf/
  - For a logical host: shared-directory/jp1ima/conf/
Description
This file defines the alerting rules that the Prometheus server evaluates.
Character code
UTF-8 (without BOM)
Line feed code
In Windows: CR+LF
In Linux: LF
When the definitions are applied
The definitions are applied when the Prometheus server is restarted or when you instruct the Prometheus server to reload its configuration.
Information that is specified
For definitions of common placeholders used in the table below, see About definition of common placeholders for descriptive items in yml file.
| Item | Description | Changeability | What you set up in your JP1/IM - Agent | JP1/IM - Agent default value |
|---|---|---|---|---|
| groups: | -- | N | -- | "groups:" |
| name: <string> | Specify the alert group name within 255 bytes. The group name must be unique within a monitoring agent host; you cannot define more than one group with the same name. Different monitoring agent hosts, however, can each use the same group name. | Y | Specify a group name of your choice. | Not specified |
| rules: | Configures the alert rules. Up to 100 rules can be specified.# | N | -- | Not specified |
| alert: <string> | Specify a name for the alert. | Y | Specify the name of the alert that you create. | Not specified |
| expr: <string> | Specify the alert conditional expression, as a PromQL statement, within 255 bytes. | Y | Specify the PromQL statement to evaluate.# For notes on PromQL statements, see Note on PromQL expression. | Not specified |
| for: <duration> | Specify the duration for an alert to become firing, in the range from 0 seconds to 24 hours. Specify the value as a number and a unit; the units that can be specified are s (seconds) and m (minutes). Even if the alert conditional expression is met, the alert is not treated as firing if the condition stops being met within the period specified for for. | Y | Specify the amount of time it takes for an alert to reach the firing state. | Not specified |
| labels: | Sets the labels to add to, or override in, each alert. | N | -- | Not specified |
| jp1_pc_product_name: <string> | Specify the value to be set as the product name of the JP1 event. | Y | Specify "/HITACHI/JP1/JPCCS2" or "/HITACHI/JP1/JPCCS2/xxxx" (where xxxx is any string). | Not specified |
| jp1_pc_component: <string> | Specify the value to be set as the component name of the JP1 event. | Y | Specify the following value, depending on the product plug-in that handles the JP1 event: for jp1pccs_azure.js, "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"; for jp1pccs_kubernetes.js, "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"; for jp1pccs.js, "/HITACHI/JP1/JPCCS/CONFINFO". | Not specified |
| jp1_pc_severity: <string> | Specify the value to be set as the severity of the JP1 event. | Y | Specify one of the JP1 event severity levels (for example, "Error"). | Not specified |
| jp1_pc_eventid: <string> | Specify the value to be set as the event ID of the JP1 event. | Y | Specify any value in the range 0 to 1FFF or 7FFF8000 to 7FFFFFFF that can be specified as the event ID of a JP1 event. | If omitted, "00007600" is set as the value of the event ID attribute of the JP1 event. |
| jp1_pc_metricname: <string> | Specify the value to be set as the metric name of the JP1 event. In the case of Yet another cloudwatch exporter, the JP1 event is associated with the IM management node in the AWS namespace corresponding to the metric name (or the first metric name, if multiple comma-separated values are specified). | Y | Specify the metric names, separated by commas. | Not specified |
| annotations: | Sets the annotations to add to each alert. | N | -- | Not specified |
| jp1_pc_firing_description: <string> | Specify the value to be set as the message of the JP1 event when the firing condition of the alert is met. If the value is 1,024 bytes or longer, only the string from the beginning to the 1,023rd byte is set. | Y | Specify any message. | If omitted, the message of the JP1 event is "The alert is firing. (alert = alert-name)". |
| jp1_pc_resolved_description: <string> | Specify the value to be set as the message of the JP1 event when the firing condition of the alert is no longer met. If the value is 1,024 bytes or longer, only the string from the beginning to the 1,023rd byte is set. | Y | Specify any message. | If omitted, the message of the JP1 event is "The alert is resolved. (alert = alert-name)". |
Legend:
Y: Changeable, N: Not changeable, --: Not applicable

#
Because the following labels are set as attributes of the JP1 event, do not remove them by using an aggregation operator:

- instance
- job
- jp1_pc_nodelabel
- jp1_pc_exporter
- jp1_pc_remote_monitor_instance
- account
- region
- dimension_any-string

Note that the labels account, region, and dimension_any-string apply only when monitoring Yet another cloudwatch exporter metrics.
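For example, when an expression aggregates a metric, keep these labels by listing them in a by clause; a bare aggregation drops them. A minimal sketch (the metric and threshold are illustrative):

```yaml
# OK: the "by" clause preserves the labels that become JP1 event attributes.
expr: 80 < avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="user"}[2m])) * 100
# NG: a bare avg() removes instance, job, jp1_pc_nodelabel, and jp1_pc_exporter.
# expr: 80 < avg(rate(node_cpu_seconds_total{mode="user"}[2m])) * 100
```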
Definition example
The following shows an example alert definition for each metric described in the model file of the corresponding metric definition file.
■Alert definition example for metrics in Node exporter metric definition file

- cpu_used_rate#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: cpu_used_rate(Node exporter)
        expr: 80 < (avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="system"}[2m])) + avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="user"}[2m]))) * 100
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0301"
          jp1_pc_metricname: "node_cpu_seconds_total"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```

- memory_unused#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: memory_unused(Node exporter)
        expr: 1024 > node_memory_MemAvailable_bytes/1024/1024
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0302"
          jp1_pc_metricname: "node_memory_MemAvailable_bytes"
        annotations:
          jp1_pc_firing_description: "The amount of free memory has fallen below the threshold (1024 megabytes). value={{ $value }} megabytes"
          jp1_pc_resolved_description: "The amount of free memory has exceeded the threshold (1024 megabytes)."
```

- memory_unused_rate#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: memory_unused_rate(Node exporter)
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0303"
          jp1_pc_metricname: "node_memory_MemAvailable_bytes,node_memory_MemTotal_bytes"
        annotations:
          jp1_pc_firing_description: "Free-memory ratio has fallen below the threshold (10%). value={{ $value }}%"
          jp1_pc_resolved_description: "Free-memory ratio has exceeded the threshold (10%)."
```

- disk_unused#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: disk_unused(Node exporter)
        expr: 10 > node_filesystem_free_bytes/(1024*1024*1024)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0304"
          jp1_pc_metricname: "node_filesystem_free_bytes"
        annotations:
          jp1_pc_firing_description: "Free disk space has fallen below the threshold (10 gigabytes). value={{ $value }} gigabytes, mountpoint={{ $labels.mountpoint }}"
          jp1_pc_resolved_description: "Free disk space has exceeded the threshold (10 gigabytes). mountpoint={{ $labels.mountpoint }}"
```

Note: If you want to monitor both Node exporter and Node exporter for AIX on a single Prometheus, specify the job label in expr of the disk_unused alert definition for Node exporter to distinguish the Node exporter metric from the Node exporter for AIX metric, as in the following expression:

```yaml
expr: 10 > node_filesystem_free_bytes{job="jpc_node"}/(1024*1024*1024)
```

- disk_unused_rate#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: disk_unused_rate(Node exporter)
        expr: node_filesystem_free_bytes / node_filesystem_size_bytes * 100 < 10
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0305"
          jp1_pc_metricname: "node_filesystem_free_bytes,node_filesystem_size_bytes"
        annotations:
          jp1_pc_firing_description: "Free disk percentage has fallen below the threshold (10%). value={{ $value }}%, mountpoint={{ $labels.mountpoint }}"
          jp1_pc_resolved_description: "Free disk percentage has exceeded the threshold (10%). mountpoint={{ $labels.mountpoint }}"
```

Note: If you want to monitor both Node exporter and Node exporter for AIX on a single Prometheus, specify the job label in expr of the disk_unused_rate alert definition for Node exporter to distinguish the Node exporter metrics from the Node exporter for AIX metrics, as in the following expression:

```yaml
expr: node_filesystem_free_bytes{job="jpc_node"} / node_filesystem_size_bytes{job="jpc_node"} * 100 < 10
```

- disk_busy_rate#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: disk_busy_rate(Node exporter)
        expr: 70 < rate(node_disk_io_time_seconds_total[2m])*100
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0306"
          jp1_pc_metricname: "node_disk_io_time_seconds_total"
        annotations:
          jp1_pc_firing_description: "Disk busy rate has exceeded the threshold (70%). value={{ $value }}%, device={{ $labels.device }}"
          jp1_pc_resolved_description: "Disk busy rate has fallen below the threshold (70%). device={{ $labels.device }}"
```

- disk_read_latency#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: disk_read_latency(Node exporter)
        expr: rate(node_disk_read_time_seconds_total[2m]) / rate(node_disk_reads_completed_total[2m]) > 0.1 and rate(node_disk_reads_completed_total[2m]) > 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0307"
          jp1_pc_metricname: "node_disk_read_time_seconds_total,node_disk_reads_completed_total"
        annotations:
          jp1_pc_firing_description: "Disk read latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, device={{ $labels.device }}"
          jp1_pc_resolved_description: "Disk read latency has fallen below the threshold (0.1 seconds). device={{ $labels.device }}"
```

- disk_write_latency#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: disk_write_latency(Node exporter)
        expr: rate(node_disk_write_time_seconds_total[2m]) / rate(node_disk_writes_completed_total[2m]) > 0.1 and rate(node_disk_writes_completed_total[2m]) > 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0308"
          jp1_pc_metricname: "node_disk_write_time_seconds_total,node_disk_writes_completed_total"
        annotations:
          jp1_pc_firing_description: "Disk write latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, device={{ $labels.device }}"
          jp1_pc_resolved_description: "Disk write latency has fallen below the threshold (0.1 seconds). device={{ $labels.device }}"
```

- disk_io_latency#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: disk_io_latency(Node exporter)
        expr: (rate(node_disk_read_time_seconds_total[2m]) + rate(node_disk_write_time_seconds_total[2m])) / (rate(node_disk_reads_completed_total[2m]) + rate(node_disk_writes_completed_total[2m])) > 0.1 and (rate(node_disk_writes_completed_total[2m]) > 0 or rate(node_disk_reads_completed_total[2m]) > 0)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0309"
          jp1_pc_metricname: "node_disk_write_time_seconds_total,node_disk_writes_completed_total,node_disk_read_time_seconds_total,node_disk_reads_completed_total"
        annotations:
          jp1_pc_firing_description: "Disk IO latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, device={{ $labels.device }}"
          jp1_pc_resolved_description: "Disk IO latency has fallen below the threshold (0.1 seconds). device={{ $labels.device }}"
```

- network_sent#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: network_sent(Node exporter)
        expr: 100 < rate(node_network_transmit_packets_total[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0310"
          jp1_pc_metricname: "node_network_transmit_packets_total"
        annotations:
          jp1_pc_firing_description: "The network transmission speed has exceeded the threshold (100 packets per second). value={{ $value }} packets per second, device={{ $labels.device }}"
          jp1_pc_resolved_description: "The network transmission speed has fallen below the threshold (100 packets per second). device={{ $labels.device }}"
```

- network_received#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: network_received(Node exporter)
        expr: 100 < rate(node_network_receive_packets_total[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0311"
          jp1_pc_metricname: "node_network_receive_packets_total"
        annotations:
          jp1_pc_firing_description: "The network receive speed has exceeded the threshold (100 packets per second). value={{ $value }} packets per second, device={{ $labels.device }}"
          jp1_pc_resolved_description: "The network receive speed has fallen below the threshold (100 packets per second). device={{ $labels.device }}"
```
#
If you define more than one alert on the same monitoring agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
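For example, to define two of the alerts above on the same host, list both rules under a single "groups:" entry with one group name instead of repeating "groups:". A minimal sketch (labels and annotations are omitted here for brevity; in an actual file, specify them as in the examples above):

```yaml
groups:
  - name: node_exporter                  # the group name appears only once
    rules:                               # every alert of the group goes here
      - alert: cpu_used_rate(Node exporter)
        expr: 80 < (avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="system"}[2m])) + avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="user"}[2m]))) * 100
        for: 3m
      - alert: memory_unused(Node exporter)
        expr: 1024 > node_memory_MemAvailable_bytes/1024/1024
        for: 3m
```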
■Alert definition example for metrics in Process exporter metric definition file

- process_pgm_process_count#

```yaml
groups:
  - name: process_exporter
    rules:
      - alert: process_pgm_process_count(Process exporter)
        expr: 1 > sum by (program, instance, job, jp1_pc_nodelabel, jp1_pc_exporter) (namedprocess_namegroup_num_procs)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1308"
          jp1_pc_metricname: "namedprocess_namegroup_num_procs"
        annotations:
          jp1_pc_firing_description: "The number of processes has fallen below the threshold (1 process)."
          jp1_pc_resolved_description: "The number of processes has exceeded the threshold (1 process)."
```
#
The threshold value of 1 is an example. Change this value based on the number of monitoring targets.
If you define more than one alert on the same integrated agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
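For example, to apply a different threshold to a particular program, one approach is to add a rule whose expression has a label matcher on program. A minimal sketch (the program name sample_proc and the threshold of 2 are illustrative assumptions):

```yaml
# Fires when fewer than 2 processes whose program label is "sample_proc" are running.
expr: 2 > sum by (program, instance, job, jp1_pc_nodelabel, jp1_pc_exporter) (namedprocess_namegroup_num_procs{program="sample_proc"})
```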
■Alert definition example for metrics in Node exporter (service monitoring) metric definition file

- service_state#

When the auto-start setting of the monitored unit is enabled (systemctl enable has been executed):

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: service_state(Node exporter)
        expr: node_systemd_unit_state{state="active"} == 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0320"
          jp1_pc_metricname: "node_systemd_unit_state"
        annotations:
          jp1_pc_firing_description: "The status of the service is not running."
          jp1_pc_resolved_description: "The service status is now running."
```

When the auto-start setting of the monitored unit is disabled:

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: service_state_service-name(Node exporter)
        expr: absent(node_systemd_unit_state{instance="integrated-agent-host-name:port-number-of-the-Node-exporter", job="jpc_node", jp1_pc_exporter="JPC Node exporter", jp1_pc_nodelabel="service-name", state="active"}) == 1
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0320"
          jp1_pc_metricname: "node_systemd_unit_state"
        annotations:
          jp1_pc_firing_description: "The status of the service is not running."
          jp1_pc_resolved_description: "The service status is now running."
```
#
If you define more than one alert on the same monitoring agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Windows exporter metric definition file

- cpu_used_rate#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: cpu_used_rate(Windows exporter)
        expr: 80 < 100 - (avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0401"
          jp1_pc_metricname: "windows_cpu_time_total"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```

- memory_unused#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: memory_unused(Windows exporter)
        expr: 1 > windows_memory_available_bytes/1024/1024/1024
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0402"
          jp1_pc_metricname: "windows_memory_available_bytes"
        annotations:
          jp1_pc_firing_description: "The amount of free memory has fallen below the threshold (1 gigabyte). value={{ $value }} gigabytes"
          jp1_pc_resolved_description: "The amount of free memory has exceeded the threshold (1 gigabyte)."
```

- memory_unused_rate#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: memory_unused_rate(Windows exporter)
        expr: windows_memory_available_bytes / windows_cs_physical_memory_bytes * 100 < 10
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0403"
          jp1_pc_metricname: "windows_memory_available_bytes,windows_cs_physical_memory_bytes"
        annotations:
          jp1_pc_firing_description: "Free-memory ratio has fallen below the threshold (10%). value={{ $value }}%"
          jp1_pc_resolved_description: "Free-memory ratio has exceeded the threshold (10%)."
```

- disk_unused#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: disk_unused(Windows exporter)
        expr: 10 > windows_logical_disk_free_bytes{volume!~"HarddiskVolume.*"} / (1024*1024*1024)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0404"
          jp1_pc_metricname: "windows_logical_disk_free_bytes"
        annotations:
          jp1_pc_firing_description: "Free disk space has fallen below the threshold (10 gigabytes). value={{ $value }} gigabytes, volume={{ $labels.volume }}"
          jp1_pc_resolved_description: "Free disk space has exceeded the threshold (10 gigabytes). volume={{ $labels.volume }}"
```

- disk_unused_rate#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: disk_unused_rate(Windows exporter)
        expr: windows_logical_disk_free_bytes{volume!~"HarddiskVolume.*"} / windows_logical_disk_size_bytes * 100 < 10
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0405"
          jp1_pc_metricname: "windows_logical_disk_free_bytes,windows_logical_disk_size_bytes"
        annotations:
          jp1_pc_firing_description: "Free disk percentage has fallen below the threshold (10%). value={{ $value }}%, volume={{ $labels.volume }}"
          jp1_pc_resolved_description: "Free disk percentage has exceeded the threshold (10%). volume={{ $labels.volume }}"
```

- disk_busy_rate#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: disk_busy_rate(Windows exporter)
        expr: 70 < 100 - rate(windows_logical_disk_idle_seconds_total{volume!~"HarddiskVolume.*"}[2m]) / (rate(windows_logical_disk_write_seconds_total[2m]) + rate(windows_logical_disk_read_seconds_total[2m]) + rate(windows_logical_disk_idle_seconds_total[2m])) * 100
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0406"
          jp1_pc_metricname: "windows_logical_disk_idle_seconds_total,windows_logical_disk_write_seconds_total,windows_logical_disk_read_seconds_total"
        annotations:
          jp1_pc_firing_description: "Disk busy rate has exceeded the threshold (70%). value={{ $value }}%, volume={{ $labels.volume }}"
          jp1_pc_resolved_description: "Disk busy rate has fallen below the threshold (70%). volume={{ $labels.volume }}"
```

- disk_read_latency#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: disk_read_latency(Windows exporter)
        expr: rate(windows_logical_disk_read_seconds_total[2m]) / rate(windows_logical_disk_reads_total[2m]) > 0.1 and rate(windows_logical_disk_reads_total[2m]) > 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0407"
          jp1_pc_metricname: "windows_logical_disk_read_seconds_total,windows_logical_disk_reads_total"
        annotations:
          jp1_pc_firing_description: "Disk read latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, volume={{ $labels.volume }}"
          jp1_pc_resolved_description: "Disk read latency has fallen below the threshold (0.1 seconds). volume={{ $labels.volume }}"
```

- disk_write_latency#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: disk_write_latency(Windows exporter)
        expr: rate(windows_logical_disk_write_seconds_total[2m]) / rate(windows_logical_disk_writes_total[2m]) > 0.1 and rate(windows_logical_disk_writes_total[2m]) > 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0408"
          jp1_pc_metricname: "windows_logical_disk_write_seconds_total,windows_logical_disk_writes_total"
        annotations:
          jp1_pc_firing_description: "Disk write latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, volume={{ $labels.volume }}"
          jp1_pc_resolved_description: "Disk write latency has fallen below the threshold (0.1 seconds). volume={{ $labels.volume }}"
```

- disk_io_latency#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: disk_io_latency(Windows exporter)
        expr: (rate(windows_logical_disk_read_seconds_total[2m]) + rate(windows_logical_disk_write_seconds_total[2m])) / (rate(windows_logical_disk_reads_total[2m]) + rate(windows_logical_disk_writes_total[2m])) > 0.1 and (rate(windows_logical_disk_writes_total[2m]) > 0 or rate(windows_logical_disk_reads_total[2m]) > 0)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0409"
          jp1_pc_metricname: "windows_logical_disk_write_seconds_total,windows_logical_disk_writes_total,windows_logical_disk_read_seconds_total,windows_logical_disk_reads_total"
        annotations:
          jp1_pc_firing_description: "Disk IO latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, volume={{ $labels.volume }}"
          jp1_pc_resolved_description: "Disk IO latency has fallen below the threshold (0.1 seconds). volume={{ $labels.volume }}"
```

- network_sent#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: network_sent(Windows exporter)
        expr: 100 < rate(windows_net_packets_sent_total[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0410"
          jp1_pc_metricname: "windows_net_packets_sent_total"
        annotations:
          jp1_pc_firing_description: "The network transmission speed has exceeded the threshold (100 packets per second). value={{ $value }} packets per second, nic={{ $labels.nic }}"
          jp1_pc_resolved_description: "The network transmission speed has fallen below the threshold (100 packets per second). nic={{ $labels.nic }}"
```

- network_received#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: network_received(Windows exporter)
        expr: 100 < rate(windows_net_packets_received_total[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0411"
          jp1_pc_metricname: "windows_net_packets_received_total"
        annotations:
          jp1_pc_firing_description: "The network receive speed has exceeded the threshold (100 packets per second). value={{ $value }} packets per second, nic={{ $labels.nic }}"
          jp1_pc_resolved_description: "The network receive speed has fallen below the threshold (100 packets per second). nic={{ $labels.nic }}"
```
#
If you define more than one alert on the same monitoring agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Windows exporter (process monitoring) metric definition file

- process_pgm_process_count#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: process_pgm_process_count(Windows exporter)
        expr: absent(windows_process_start_time{instance="integrated-agent-host-name:Windows-exporter-port-number", job="jpc_windows", jp1_pc_exporter="JPC Windows exporter", jp1_pc_nodelabel="monitored-process-name", process="monitored-process-name"}) == 1
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0414"
          jp1_pc_metricname: "windows_process_start_time"
        annotations:
          jp1_pc_firing_description: "The number of processes has fallen below the threshold (1 process)."
          jp1_pc_resolved_description: "The number of processes has exceeded the threshold (1 process)."
```
#
The threshold value of 1 is an example. Change this value based on the number of monitoring targets.
If you define more than one alert on the same integrated agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Windows exporter (service monitoring) metric definition file

- service_state#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: service_state(Windows exporter)
        expr: windows_service_state{state="running"} == 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0420"
          jp1_pc_metricname: "windows_service_state"
        annotations:
          jp1_pc_firing_description: "The status of the service is not running. service={{ $labels.name }}"
          jp1_pc_resolved_description: "The service status is now running. service={{ $labels.name }}"
```
#
If you define more than one alert on the same monitoring agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Node exporter for AIX metric definition file

- cpu_used_rate#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: cpu_used_rate(Node exporter for AIX)
        expr: 80 < (avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu{mode="sys"}[2m])) + avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu{mode="user"}[2m]))) * 100
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1801"
          jp1_pc_metricname: "node_cpu"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```

- memory_unused#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: memory_unused(Node exporter for AIX)
        expr: 1 > aix_memory_real_avail/1024/1024/1024*4096
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1802"
          jp1_pc_metricname: "aix_memory_real_avail"
        annotations:
          jp1_pc_firing_description: "The amount of free memory has fallen below the threshold (1 gigabyte). value={{ $value }} gigabytes"
          jp1_pc_resolved_description: "The amount of free memory has exceeded the threshold (1 gigabyte)."
```

- memory_unused_rate#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: memory_unused_rate(Node exporter for AIX)
        expr: aix_memory_real_avail / aix_memory_real_total * 100 < 10
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1803"
          jp1_pc_metricname: "aix_memory_real_avail,aix_memory_real_total"
        annotations:
          jp1_pc_firing_description: "Free-memory percentage has fallen below the threshold (10%). value={{ $value }}%"
          jp1_pc_resolved_description: "Free-memory percentage has exceeded the threshold (10%)."
```

- disk_unused#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: disk_unused(Node exporter for AIX)
        expr: 10 > node_filesystem_free_bytes{job="jpc_node_aix"}/(1024*1024*1024)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1804"
          jp1_pc_metricname: "node_filesystem_free_bytes"
        annotations:
          jp1_pc_firing_description: "Free disk space has fallen below the threshold (10 gigabytes). value={{ $value }} gigabytes, mountpoint={{ $labels.mountpoint }}"
          jp1_pc_resolved_description: "Free disk space has exceeded the threshold (10 gigabytes). mountpoint={{ $labels.mountpoint }}"
```

Note: If you want to monitor both Node exporter and Node exporter for AIX on a single Prometheus, specify the job label in expr of disk_unused in the Node exporter alert definition to distinguish the Node exporter metric from the Node exporter for AIX metric. For details, see disk_unused in Alert definition example for metrics in Node exporter metric definition file above.

- disk_unused_rate#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: disk_unused_rate(Node exporter for AIX)
        expr: node_filesystem_free_bytes{job="jpc_node_aix"} / node_filesystem_size_bytes{job="jpc_node_aix"} * 100 < 10
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1805"
          jp1_pc_metricname: "node_filesystem_free_bytes,node_filesystem_size_bytes"
        annotations:
          jp1_pc_firing_description: "Free disk percentage has fallen below the threshold (10%). value={{ $value }}%, mountpoint={{ $labels.mountpoint }}"
          jp1_pc_resolved_description: "Free disk percentage has exceeded the threshold (10%). mountpoint={{ $labels.mountpoint }}"
```

Note: If you want to monitor both Node exporter and Node exporter for AIX on a single Prometheus, specify the job label in expr of disk_unused_rate in the Node exporter alert definition to distinguish the metrics. For details, see disk_unused_rate in Alert definition example for metrics in Node exporter metric definition file above.

- disk_busy_rate#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: disk_busy_rate(Node exporter for AIX)
        expr: 70 < rate(aix_disk_time[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1806"
          jp1_pc_metricname: "aix_disk_time"
        annotations:
          jp1_pc_firing_description: "Disk busy rate has exceeded the threshold (70%). value={{ $value }}%, disk={{ $labels.disk }}"
          jp1_pc_resolved_description: "Disk busy rate has fallen below the threshold (70%). disk={{ $labels.disk }}"
```

- disk_read_latency#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: disk_read_latency(Node exporter for AIX)
        expr: rate(aix_disk_rserv[2m]) / rate(aix_disk_xrate[2m])/1000/1000/1000 > 0.1 and rate(aix_disk_xrate[2m]) > 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1807"
          jp1_pc_metricname: "aix_disk_rserv,aix_disk_xrate"
        annotations:
          jp1_pc_firing_description: "Disk read latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, disk={{ $labels.disk }}"
          jp1_pc_resolved_description: "Disk read latency has fallen below the threshold (0.1 seconds). disk={{ $labels.disk }}"
```

- disk_write_latency#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: disk_write_latency(Node exporter for AIX)
        expr: rate(aix_disk_wserv[2m]) / (rate(aix_disk_xfers[2m]) - rate(aix_disk_xrate[2m]))/1000/1000/1000 > 0.1 and (rate(aix_disk_xfers[2m]) - rate(aix_disk_xrate[2m])) > 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1808"
          jp1_pc_metricname: "aix_disk_wserv,aix_disk_xfers,aix_disk_xrate"
        annotations:
          jp1_pc_firing_description: "Disk write latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, disk={{ $labels.disk }}"
          jp1_pc_resolved_description: "Disk write latency has fallen below the threshold (0.1 seconds). disk={{ $labels.disk }}"
```

- disk_io_latency#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: disk_io_latency(Node exporter for AIX)
        expr: (rate(aix_disk_rserv[2m]) + rate(aix_disk_wserv[2m])) / rate(aix_disk_xfers[2m])/1000/1000/1000 > 0.1 and (rate(aix_disk_xfers[2m]) > 0)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1809"
          jp1_pc_metricname: "aix_disk_wserv,aix_disk_rserv,aix_disk_xfers"
        annotations:
          jp1_pc_firing_description: "Disk IO latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, disk={{ $labels.disk }}"
          jp1_pc_resolved_description: "Disk IO latency has fallen below the threshold (0.1 seconds). disk={{ $labels.disk }}"
```

- network_sent#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: network_sent(Node exporter for AIX)
        expr: 100 < rate(aix_netinterface_opackets[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1810"
          jp1_pc_metricname: "aix_netinterface_opackets"
        annotations:
          jp1_pc_firing_description: "The network transmission speed has exceeded the threshold (100 packets per second). value={{ $value }} packets per second, netinterface={{ $labels.netinterface }}"
          jp1_pc_resolved_description: "The network transmission speed has fallen below the threshold (100 packets per second). netinterface={{ $labels.netinterface }}"
```

- network_received#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: network_received(Node exporter for AIX)
        expr: 100 < rate(aix_netinterface_ipackets[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1811"
          jp1_pc_metricname: "aix_netinterface_ipackets"
        annotations:
          jp1_pc_firing_description: "The network receive speed has exceeded the threshold (100 packets per second). value={{ $value }} packets per second, netinterface={{ $labels.netinterface }}"
          jp1_pc_resolved_description: "The network receive speed has fallen below the threshold (100 packets per second). netinterface={{ $labels.netinterface }}"
```
#
If you define more than one alert on the same monitoring agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Blackbox exporter metric definition file

- probe_success#

```yaml
groups:
  - name: blackbox_exporter
    rules:
      - alert: probe_success(Blackbox exporter)
        expr: 0 == probe_success
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0501"
          jp1_pc_metricname: "probe_success"
        annotations:
          jp1_pc_firing_description: "Communication failed. value={{ $value }}"
          jp1_pc_resolved_description: "Communication was successful."
```

- probe_duration_seconds#

```yaml
groups:
  - name: blackbox_exporter
    rules:
      - alert: probe_duration_seconds(Blackbox exporter)
        expr: 5 < probe_duration_seconds
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0502"
          jp1_pc_metricname: "probe_duration_seconds"
        annotations:
          jp1_pc_firing_description: "The probe period has exceeded the threshold (5 seconds). value={{ $value }} seconds"
          jp1_pc_resolved_description: "The probe period has fallen below the threshold (5 seconds)."
```

- probe_icmp_duration_seconds#

```yaml
groups:
  - name: blackbox_exporter
    rules:
      - alert: probe_icmp_duration_seconds(Blackbox exporter)
        expr: 3 < probe_icmp_duration_seconds
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0503"
          jp1_pc_metricname: "probe_icmp_duration_seconds"
        annotations:
          jp1_pc_firing_description: "The ICMP period has exceeded the threshold (3 seconds). value={{ $value }} seconds, phase={{ $labels.phase }}"
          jp1_pc_resolved_description: "The ICMP period has fallen below the threshold (3 seconds)."
```

- probe_http_duration_seconds#

```yaml
groups:
  - name: blackbox_exporter
    rules:
      - alert: probe_http_duration_seconds(Blackbox exporter)
        expr: 3 < probe_http_duration_seconds
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0504"
          jp1_pc_metricname: "probe_http_duration_seconds"
        annotations:
          jp1_pc_firing_description: "The HTTP request period has exceeded the threshold (3 seconds). value={{ $value }} seconds, phase={{ $labels.phase }}"
          jp1_pc_resolved_description: "The HTTP request period has fallen below the threshold (3 seconds)."
```

- probe_http_status_code#

```yaml
groups:
  - name: blackbox_exporter
    rules:
      - alert: probe_http_status_code(Blackbox exporter)
        expr: 200 != probe_http_status_code
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0505"
          jp1_pc_metricname: "probe_http_status_code"
        annotations:
          jp1_pc_firing_description: "The HTTP status is not 200. value={{ $value }}"
          jp1_pc_resolved_description: "The HTTP status is now 200."
```
#
If you define more than one alert on the same monitoring agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Yet another cloudwatch exporter metric definition file

- aws_ec2_cpuutilization_average#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_ec2_cpuutilization_average(Yet another cloudwatch exporter)
        expr: 80 < aws_ec2_cpuutilization_average
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0601"
          jp1_pc_metricname: "aws_ec2_cpuutilization_average"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```

- aws_ec2_disk_read_bytes_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_ec2_disk_read_bytes_sum(Yet another cloudwatch exporter)
        expr: 10240 < aws_ec2_disk_read_bytes_sum / 1024
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0602"
          jp1_pc_metricname: "aws_ec2_disk_read_bytes_sum"
        annotations:
          jp1_pc_firing_description: "The amount of data read in KB from instance store volumes has exceeded the threshold (10,240 KB). value={{ $value }} KB"
          jp1_pc_resolved_description: "The amount of data read in KB from instance store volumes has fallen below the threshold (10,240 KB)."
```

- aws_ec2_disk_write_bytes_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_ec2_disk_write_bytes_sum(Yet another cloudwatch exporter)
        expr: 10240 < aws_ec2_disk_write_bytes_sum / 1024
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0603"
          jp1_pc_metricname: "aws_ec2_disk_write_bytes_sum"
        annotations:
          jp1_pc_firing_description: "The amount of data written in KB to instance store volumes has exceeded the threshold (10,240 KB). value={{ $value }} KB"
          jp1_pc_resolved_description: "The amount of data written in KB to instance store volumes has fallen below the threshold (10,240 KB)."
```

- aws_lambda_errors_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_lambda_errors_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_lambda_errors_sum{dimension_Resource=""}
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0604"
          jp1_pc_metricname: "aws_lambda_errors_sum"
        annotations:
          jp1_pc_firing_description: "The number of invocations that resulted in a function error has exceeded the threshold (0 invocations). value={{ $value }} invocations"
          jp1_pc_resolved_description: "The number of invocations that resulted in a function error has fallen below the threshold (0 invocations)."
```

- aws_lambda_duration_average#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_lambda_duration_average(Yet another cloudwatch exporter)
        expr: 5000 < aws_lambda_duration_average{dimension_Resource=""}
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0605"
          jp1_pc_metricname: "aws_lambda_duration_average"
        annotations:
          jp1_pc_firing_description: "The amount of time that the function code spends processing an event has exceeded the threshold (5000 milliseconds). value={{ $value }} milliseconds"
          jp1_pc_resolved_description: "The amount of time that the function code spends processing an event has fallen below the threshold (5000 milliseconds)."
```

- aws_s3_bucket_size_bytes_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_s3_bucket_size_bytes_sum(Yet another cloudwatch exporter)
        expr: 1024 < aws_s3_bucket_size_bytes_sum / (1024*1024*1024)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0606"
          jp1_pc_metricname: "aws_s3_bucket_size_bytes_sum"
        annotations:
          jp1_pc_firing_description: "The amount of data stored in the bucket has exceeded the threshold (1024 gigabytes). value={{ $value }} gigabytes"
          jp1_pc_resolved_description: "The amount of data stored in the bucket has fallen below the threshold (1024 gigabytes)."
```

- aws_s3_5xx_errors_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_s3_5xx_errors_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_s3_5xx_errors_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0607"
          jp1_pc_metricname: "aws_s3_5xx_errors_sum"
        annotations:
          jp1_pc_firing_description: "The number of HTTP 5xx server error status codes returned for requests to the bucket has exceeded the threshold (0 errors). value={{ $value }} errors"
          jp1_pc_resolved_description: "The number of HTTP 5xx server error status codes returned for requests to the bucket has fallen below the threshold (0 errors)."
```

- aws_dynamodb_consumed_read_capacity_units_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_dynamodb_consumed_read_capacity_units_sum(Yet another cloudwatch exporter)
        expr: 600 < aws_dynamodb_consumed_read_capacity_units_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0608"
          jp1_pc_metricname: "aws_dynamodb_consumed_read_capacity_units_sum"
        annotations:
          jp1_pc_firing_description: "The total number of consumed read capacity units has exceeded the threshold (600 units). value={{ $value }} units"
          jp1_pc_resolved_description: "The total number of consumed read capacity units has fallen below the threshold (600 units)."
```

- aws_dynamodb_consumed_write_capacity_units_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_dynamodb_consumed_write_capacity_units_sum(Yet another cloudwatch exporter)
        expr: 600 < aws_dynamodb_consumed_write_capacity_units_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0609"
          jp1_pc_metricname: "aws_dynamodb_consumed_write_capacity_units_sum"
        annotations:
          jp1_pc_firing_description: "The total number of consumed write capacity units has exceeded the threshold (600 units). value={{ $value }} units"
          jp1_pc_resolved_description: "The total number of consumed write capacity units has fallen below the threshold (600 units)."
```

- aws_states_execution_time_average#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_states_execution_time_average(Yet another cloudwatch exporter)
        expr: 5000 < aws_states_execution_time_average
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0610"
          jp1_pc_metricname: "aws_states_execution_time_average"
        annotations:
          jp1_pc_firing_description: "The Step Functions execution time has exceeded the threshold (5000 milliseconds). value={{ $value }} milliseconds"
          jp1_pc_resolved_description: "The Step Functions execution time has fallen below the threshold (5000 milliseconds)."
```

- aws_states_executions_failed_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_states_executions_failed_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_states_executions_failed_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0611"
          jp1_pc_metricname: "aws_states_executions_failed_sum"
        annotations:
          jp1_pc_firing_description: "The number of failed Step Functions executions has exceeded the threshold (0 executions). value={{ $value }} executions"
          jp1_pc_resolved_description: "The number of failed Step Functions executions has fallen below the threshold (0 executions)."
```

- aws_sqs_approximate_number_of_messages_delayed_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_sqs_approximate_number_of_messages_delayed_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_sqs_approximate_number_of_messages_delayed_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0612"
          jp1_pc_metricname: "aws_sqs_approximate_number_of_messages_delayed_sum"
        annotations:
          jp1_pc_firing_description: "The number of delayed queue messages has exceeded the threshold (0 messages). value={{ $value }} messages"
          jp1_pc_resolved_description: "The number of delayed queue messages has fallen below the threshold (0 messages)."
```

- aws_sqs_number_of_messages_deleted_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_sqs_number_of_messages_deleted_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_sqs_number_of_messages_deleted_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0613"
          jp1_pc_metricname: "aws_sqs_number_of_messages_deleted_sum"
        annotations:
          jp1_pc_firing_description: "The number of deleted queue messages has exceeded the threshold (0 messages). value={{ $value }} messages"
          jp1_pc_resolved_description: "The number of deleted queue messages has fallen below the threshold (0 messages)."
```

- aws_ecs_cpuutilization_average#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_ecs_cpuutilization_average(Yet another cloudwatch exporter)
        expr: 80 < aws_ecs_cpuutilization_average
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0614"
          jp1_pc_metricname: "aws_ecs_cpuutilization_average"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```

- aws_ecs_memory_utilization_average#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_ecs_memory_utilization_average(Yet another cloudwatch exporter)
        expr: 80 < aws_ecs_memory_utilization_average
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0615"
          jp1_pc_metricname: "aws_ecs_memory_utilization_average"
        annotations:
          jp1_pc_firing_description: "Memory usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "Memory usage has fallen below the threshold (80%)."
```

- aws_rds_cpuutilization_average#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_rds_cpuutilization_average(Yet another cloudwatch exporter)
        expr: 80 < aws_rds_cpuutilization_average
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0616"
          jp1_pc_metricname: "aws_rds_cpuutilization_average"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```

- aws_sns_number_of_notifications_failed_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_sns_number_of_notifications_failed_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_sns_number_of_notifications_failed_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0617"
          jp1_pc_metricname: "aws_sns_number_of_notifications_failed_sum"
        annotations:
          jp1_pc_firing_description: "The number of failed messages has exceeded the threshold (0 messages). value={{ $value }} messages"
          jp1_pc_resolved_description: "The number of failed messages has fallen below the threshold (0 messages)."
```

- aws_sns_number_of_notifications_filtered_out_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_sns_number_of_notifications_filtered_out_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_sns_number_of_notifications_filtered_out_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0618"
          jp1_pc_metricname: "aws_sns_number_of_notifications_filtered_out_sum"
        annotations:
          jp1_pc_firing_description: "The number of filtered-out messages has exceeded the threshold (0 messages). value={{ $value }} messages"
          jp1_pc_resolved_description: "The number of filtered-out messages has fallen below the threshold (0 messages)."
```
#
If you define more than one alert on the same integrated agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Promitor metric definition file
-
azure_virtual_machine_disk_read_bytes_total#
groups: - name: promitor rules: - alert: azure_virtual_machine_disk_read_bytes_total(Promitor) expr: 10485760 < azure_virtual_machine_disk_read_bytes_total for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0901" jp1_pc_metricname: "azure_virtual_machine_disk_read_bytes_total" annotations: jp1_pc_firing_description: "The number of disk read bytes has exceeded the threshold (10485760 bytes). value={{ $value }} bytes" jp1_pc_resolved_description: "The number of disk read bytes has fallen below the threshold (10485760 bytes)." -
azure_virtual_machine_disk_write_bytes_total#
groups: - name: promitor rules: - alert: azure_virtual_machine_disk_write_bytes_total(Promitor) expr: 10485760 < azure_virtual_machine_disk_write_bytes_total for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0902" jp1_pc_metricname: "azure_virtual_machine_disk_write_bytes_total" annotations: jp1_pc_firing_description: "The number of disk write bytes has exceeded the threshold (10485760 bytes). value={{ $value }} bytes" jp1_pc_resolved_description: "The number of disk write bytes has fallen below the threshold (10485760 bytes)." -
azure_virtual_machine_percentage_cpu_average#
groups: - name: promitor rules: - alert: azure_virtual_machine_percentage_cpu_average(Promitor) expr: 80 < azure_virtual_machine_percentage_cpu_average for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0903" jp1_pc_metricname: "azure_virtual_machine_percentage_cpu_average" annotations: jp1_pc_firing_description: "The percentage of allocated compute units has exceeded the threshold (80%). value={{ $value }}%" jp1_pc_resolved_description: "The percentage of allocated compute units has fallen below the threshold (80%)." -
azure_blob_storage_availability_average#
groups: - name: promitor rules: - alert: azure_blob_storage_availability_average(Promitor) expr: 100 > azure_blob_storage_availability_average for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0904" jp1_pc_metricname: "azure_blob_storage_availability_average" annotations: jp1_pc_firing_description: "The percentage of availability has fallen below the threshold (100%). value={{ $value }}%" jp1_pc_resolved_description: "The percentage of availability has exceeded the threshold (100%)." -
azure_blob_storage_blob_capacity_average#
groups: - name: promitor rules: - alert: azure_blob_storage_blob_capacity_average(Promitor) expr: 1099511627776 < azure_blob_storage_blob_capacity_average for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0905" jp1_pc_metricname: "azure_blob_storage_blob_capacity_average" annotations: jp1_pc_firing_description: "The storage capacity has exceeded the threshold (1099511627776 bytes). value={{ $value }} bytes" jp1_pc_resolved_description: "The storage capacity has fallen below the threshold (1099511627776 bytes)." -
azure_function_app_http5xx_total#
groups: - name: promitor rules: - alert: azure_function_app_http5xx_total(Promitor) expr: 0 < azure_function_app_http5xx_total for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0906" jp1_pc_metricname: "azure_function_app_http5xx_total" annotations: jp1_pc_firing_description: "The number of 5xx server errors has exceeded the threshold (0 errors). value={{ $value }} errors" jp1_pc_resolved_description: "The number of 5xx server errors has fallen below the threshold (0 errors)." -
azure_function_app_http_response_time_average#
groups: - name: promitor rules: - alert: azure_function_app_http_response_time_average(Promitor) expr: 5 < azure_function_app_http_response_time_average for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0907" jp1_pc_metricname: "azure_function_app_http_response_time_average" annotations: jp1_pc_firing_description: "The response time has exceeded the threshold (5 seconds). value={{ $value }} seconds" jp1_pc_resolved_description: "The response time has fallen below the threshold (5 seconds)." -
azure_cosmos_db_total_request_units_total#
groups: - name: promitor rules: - alert: azure_cosmos_db_total_request_units_total(Promitor) expr: 600 < azure_cosmos_db_total_request_units_total for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0908" jp1_pc_metricname: "azure_cosmos_db_total_request_units_total" annotations: jp1_pc_firing_description: "The number of consumed request units has exceeded the threshold (600 units). value={{ $value }} units, collectionname={{ $labels.collectionname }}" jp1_pc_resolved_description: "The number of consumed request units has fallen below the threshold (600 units). collectionname={{ $labels.collectionname }}" -
azure_logic_app_runs_failed_total#
groups: - name: promitor rules: - alert: azure_logic_app_runs_failed_total(Promitor) expr: 0 < azure_logic_app_runs_failed_total for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0910" jp1_pc_metricname: "azure_logic_app_runs_failed_total" annotations: jp1_pc_firing_description: "The number of workflow errors has exceeded the threshold (0 errors). value={{ $value }} errors" jp1_pc_resolved_description: "The number of workflow errors has fallen below the threshold (0 errors)." -
azure_container_instance_cpu_usage_average#
groups: - name: promitor rules: - alert: azure_container_instance_cpu_usage_average(Promitor) expr: 800 < azure_container_instance_cpu_usage_average for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0911" jp1_pc_firing_description: "CPU usage (millicores) has exceeded the threshold (800 millicores). value={{ $value }} millicores" jp1_pc_resolved_description: "CPU usage (millicores) has fallen below the threshold (800 millicores)." -
azure_kubernetes_service_kube_pod_status_phase_average_failed#
groups: - name: promitor rules: - alert: azure_kubernetes_service_kube_pod_status_phase_average_failed(Promitor) expr: 0 < azure_kubernetes_service_kube_pod_status_phase_average_failed for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0912" jp1_pc_metricname: "azure_kubernetes_service_kube_pod_status_phase_average_failed" annotations: jp1_pc_firing_description: "The number of failed pods has exceeded the threshold (0 pods). value={{ $value }} pods" jp1_pc_resolved_description: "The number of failed pods has fallen below the threshold (0 pods)." -
azure_kubernetes_service_kube_pod_status_phase_average_pending#
groups:
- name: promitor
  rules:
  - alert: azure_kubernetes_service_kube_pod_status_phase_average_pending(Promitor)
    expr: 0 < azure_kubernetes_service_kube_pod_status_phase_average_pending
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0913"
      jp1_pc_metricname: "azure_kubernetes_service_kube_pod_status_phase_average_pending"
    annotations:
      jp1_pc_firing_description: "The number of pending pods has exceeded the threshold (0 pods). value={{ $value }} pods"
      jp1_pc_resolved_description: "The number of pending pods has fallen below the threshold (0 pods)."
-
azure_kubernetes_service_kube_pod_status_phase_average_unknown#
groups:
- name: promitor
  rules:
  - alert: azure_kubernetes_service_kube_pod_status_phase_average_unknown(Promitor)
    expr: 0 < azure_kubernetes_service_kube_pod_status_phase_average_unknown
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0914"
      jp1_pc_metricname: "azure_kubernetes_service_kube_pod_status_phase_average_unknown"
    annotations:
      jp1_pc_firing_description: "The number of unknown pods has exceeded the threshold (0 pods). value={{ $value }} pods"
      jp1_pc_resolved_description: "The number of unknown pods has fallen below the threshold (0 pods)."
-
azure_file_storage_availability_average#
groups:
- name: promitor
  rules:
  - alert: azure_file_storage_availability_average(Promitor)
    expr: 100 > azure_file_storage_availability_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0915"
      jp1_pc_metricname: "azure_file_storage_availability_average"
    annotations:
      jp1_pc_firing_description: "The percentage of availability has fallen below the threshold (100%). value={{ $value }}%, fileshare={{ $labels.fileshare }}"
      jp1_pc_resolved_description: "The percentage of availability has exceeded the threshold (100%). fileshare={{ $labels.fileshare }}"
-
azure_service_bus_namespace_deadlettered_messages_average#
groups:
- name: promitor
  rules:
  - alert: azure_service_bus_namespace_deadlettered_messages_average(Promitor)
    expr: 0 < azure_service_bus_namespace_deadlettered_messages_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0916"
      jp1_pc_metricname: "azure_service_bus_namespace_deadlettered_messages_average"
    annotations:
      jp1_pc_firing_description: "The number of dead-lettered messages has exceeded the threshold (0 messages). value={{ $value }} messages, entity_name={{ $labels.entity_name }}"
      jp1_pc_resolved_description: "The number of dead-lettered messages has fallen below the threshold (0 messages). entity_name={{ $labels.entity_name }}"
-
azure_sql_database_cpu_percent_average#
groups:
- name: promitor
  rules:
  - alert: azure_sql_database_cpu_percent_average(Promitor)
    expr: 80 < azure_sql_database_cpu_percent_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0917"
      jp1_pc_metricname: "azure_sql_database_cpu_percent_average"
    annotations:
      jp1_pc_firing_description: "CPU percentage has exceeded the threshold (80%). value={{ $value }}%, server={{ $labels.server }}"
      jp1_pc_resolved_description: "CPU percentage has fallen below the threshold (80%). server={{ $labels.server }}"
-
azure_sql_elastic_pool_cpu_percent_average#
groups:
- name: promitor
  rules:
  - alert: azure_sql_elastic_pool_cpu_percent_average(Promitor)
    expr: 80 < azure_sql_elastic_pool_cpu_percent_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0920"
      jp1_pc_metricname: "azure_sql_elastic_pool_cpu_percent_average"
    annotations:
      jp1_pc_firing_description: "CPU percentage has exceeded the threshold (80%). value={{ $value }}%, server={{ $labels.server }}"
      jp1_pc_resolved_description: "CPU percentage has fallen below the threshold (80%). server={{ $labels.server }}"
-
azure_sql_managed_instance_avg_cpu_percent_average#
groups:
- name: promitor
  rules:
  - alert: azure_sql_managed_instance_avg_cpu_percent_average(Promitor)
    expr: 80 < azure_sql_managed_instance_avg_cpu_percent_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0922"
      jp1_pc_metricname: "azure_sql_managed_instance_avg_cpu_percent_average"
    annotations:
      jp1_pc_firing_description: "Average CPU percentage has exceeded the threshold (80%). value={{ $value }}%"
      jp1_pc_resolved_description: "Average CPU percentage has fallen below the threshold (80%)."
-
azure_sql_managed_instance_io_bytes_read_average#
groups:
- name: promitor
  rules:
  - alert: azure_sql_managed_instance_io_bytes_read_average(Promitor)
    expr: 10485760 < azure_sql_managed_instance_io_bytes_read_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0923"
      jp1_pc_metricname: "azure_sql_managed_instance_io_bytes_read_average"
    annotations:
      jp1_pc_firing_description: "The number of IO bytes read has exceeded the threshold (10485760 bytes). value={{ $value }} bytes"
      jp1_pc_resolved_description: "The number of IO bytes read has fallen below the threshold (10485760 bytes)."
-
azure_sql_managed_instance_io_bytes_written_average#
groups:
- name: promitor
  rules:
  - alert: azure_sql_managed_instance_io_bytes_written_average(Promitor)
    expr: 10485760 < azure_sql_managed_instance_io_bytes_written_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0924"
      jp1_pc_metricname: "azure_sql_managed_instance_io_bytes_written_average"
    annotations:
      jp1_pc_firing_description: "The number of IO bytes written has exceeded the threshold (10485760 bytes). value={{ $value }} bytes"
      jp1_pc_resolved_description: "The number of IO bytes written has fallen below the threshold (10485760 bytes)."
- #
-
When you define multiple alerts on the same integrated agent host, do not specify "groups:" more than once, and do not specify the same group name in more than one "name:" entry.
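For example, instead of repeating "groups:" for each alert, you can collect multiple rules under a single group. The following is a minimal sketch that combines two of the Promitor examples above into one group:
groups:
- name: promitor
  rules:
  - alert: azure_function_app_http5xx_total(Promitor)
    expr: 0 < azure_function_app_http5xx_total
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0906"
      jp1_pc_metricname: "azure_function_app_http5xx_total"
    annotations:
      jp1_pc_firing_description: "The number of 5xx server errors has exceeded the threshold (0 errors). value={{ $value }} errors"
      jp1_pc_resolved_description: "The number of 5xx server errors has fallen below the threshold (0 errors)."
  - alert: azure_logic_app_runs_failed_total(Promitor)
    expr: 0 < azure_logic_app_runs_failed_total
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0910"
      jp1_pc_metricname: "azure_logic_app_runs_failed_total"
    annotations:
      jp1_pc_firing_description: "The number of workflow errors has exceeded the threshold (0 errors). value={{ $value }} errors"
      jp1_pc_resolved_description: "The number of workflow errors has fallen below the threshold (0 errors)."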
■Alert definition example for metrics in Script exporter metric definition file
-
script_success#1
groups:
- name: script_exporter
  rules:
  - alert: script_success(Script exporter)
    expr: 0 == script_success
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1401"
      jp1_pc_metricname: "script_success"
    annotations:
      jp1_pc_firing_description: "Failed to execute script. value={{ $value }}"
      jp1_pc_resolved_description: "Script successfully executed."
-
script_duration_seconds#1, #2
groups:
- name: script_exporter
  rules:
  - alert: script_duration_seconds(Script exporter)
    expr: 60 < script_duration_seconds
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1402"
      jp1_pc_metricname: "script_duration_seconds"
    annotations:
      jp1_pc_firing_description: "The script execution time has exceeded the threshold (60 seconds). value={{ $value }} seconds"
      jp1_pc_resolved_description: "The script execution time has fallen below the threshold (60 seconds)."
-
script_exit_code#1
groups:
- name: script_exporter
  rules:
  - alert: script_exit_code(Script exporter)
    expr: 0 != script_exit_code
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1403"
      jp1_pc_metricname: "script_exit_code"
    annotations:
      jp1_pc_firing_description: "Failed to execute script. value={{ $value }}"
      jp1_pc_resolved_description: "Script successfully executed."
- #1
-
When you define multiple alerts on the same integrated agent host, do not specify "groups:" more than once, and do not specify the same group name in more than one "name:" entry.
- #2
-
This uses a threshold value of 60 as an example. Change this value based on the number of monitoring targets.
■Alert definition example for metrics in OracleDB exporter metric definition file
-
oracledb_up#
groups:
- name: oracledb_exporter
  rules:
  - alert: oracledb_down(OracleDB exporter)
    expr: oracledb_up != 1
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0801"
      jp1_pc_metricname: "oracledb_up"
    annotations:
      jp1_pc_firing_description: "OracleDB stopped. instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "OracleDB started. instance={{ $labels.instance }}"
-
cache_hit_ratio_percent#
groups:
- name: oracledb_exporter
  rules:
  - alert: cache_hit_ratio_percentage_under_60(OracleDB exporter)
    expr: (1 - (rate(oracledb_activity_physical_reads_cache[2m]) / (rate(oracledb_activity_consistent_gets_from_cache[2m]) + rate(oracledb_activity_db_block_gets_from_cache[2m])))) * 100 < 60
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0802"
      jp1_pc_metricname: "oracledb_activity_physical_reads_cache,oracledb_activity_consistent_gets_from_cache,oracledb_activity_db_block_gets_from_cache"
    annotations:
      jp1_pc_firing_description: "Cache hit rate for OracleDB dropped below 60%. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "OracleDB cache hit rate is now over 60%. instance={{ $labels.instance }}"
-
tablespace_used_percent#
groups:
- name: oracledb_exporter
  rules:
  - alert: oracledb_tablespace_used_percent_over_90(OracleDB exporter)
    expr: oracledb_tablespace_used_percent > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0803"
      jp1_pc_metricname: "oracledb_tablespace_used_percent"
    annotations:
      jp1_pc_firing_description: "Tablespace usage for OracleDB exceeded 90%. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "Tablespace usage for OracleDB is 90% or less. instance={{ $labels.instance }}"
-
execute_count#
groups:
- name: oracledb_exporter
  rules:
  - alert: oracledb_activity_execute_count_over_1000(OracleDB exporter)
    expr: rate(oracledb_activity_execute_count[2m])*60 > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0804"
      jp1_pc_metricname: "oracledb_activity_execute_count"
    annotations:
      jp1_pc_firing_description: "SQL statements were executed more than 1000 times per minute. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The number of SQL statement executions per minute is 1000 or less. instance={{ $labels.instance }}"
-
parse_count#
Create this alert by referring to the execute_count example; a sketch is shown below.
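The following is a minimal sketch of such a rule, adapted from the execute_count example above. The metric name oracledb_activity_parse_count and the event ID "0805" are illustrative assumptions only; replace them with the metric name and event ID defined for your environment.
groups:
- name: oracledb_exporter
  rules:
  - alert: oracledb_activity_parse_count_over_1000(OracleDB exporter)
    # Assumed metric name; use the name from your OracleDB exporter metric definition file.
    expr: rate(oracledb_activity_parse_count[2m])*60 > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0805"
      jp1_pc_metricname: "oracledb_activity_parse_count"
    annotations:
      jp1_pc_firing_description: "SQL statements were parsed more than 1000 times per minute. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The number of SQL statement parses per minute is 1000 or less. instance={{ $labels.instance }}"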
-
user_commit_count#
Create this alert by referring to the execute_count example.
-
user_rollback_count#
Create this alert by referring to the execute_count example.
-
resource_used#
Create this alert by referring to the tablespace_used_percent example.
-
session_count#
Create this alert by referring to the tablespace_used_percent example; a sketch is shown below.
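The following is a minimal sketch following the tablespace_used_percent pattern (a simple gauge compared against a threshold). The metric name oracledb_sessions_value, the threshold of 100 sessions, and the event ID "0806" are illustrative assumptions only; substitute the values defined in your OracleDB exporter metric definition file.
groups:
- name: oracledb_exporter
  rules:
  - alert: oracledb_session_count_over_100(OracleDB exporter)
    # Assumed metric name and threshold; use the values from your OracleDB exporter metric definition file.
    expr: oracledb_sessions_value > 100
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0806"
      jp1_pc_metricname: "oracledb_sessions_value"
    annotations:
      jp1_pc_firing_description: "The number of sessions exceeded 100. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The number of sessions is 100 or less. instance={{ $labels.instance }}"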
- #
-
If you define multiple alerts on the same monitoring agent host, be careful not to specify "groups:" more than once or to use the same group name in more than one "name:" entry.
■Alert definition example for metrics in Web exporter metric definition file
-
probe_webscena_success#
groups:
- name: web_exporter
  rules:
  - alert: probe_webscena_success(Web exporter)
    expr: 0 == probe_webscena_success
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1901"
      jp1_pc_metricname: "probe_webscena_success"
    annotations:
      jp1_pc_firing_description: "Communication failed. value={{ $value }}"
      jp1_pc_resolved_description: "Communication was successful."
-
probe_webscena_duration_seconds#
groups:
- name: web_exporter
  rules:
  - alert: probe_webscena_duration_seconds(Web exporter)
    expr: 150 < probe_webscena_duration_seconds
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1902"
      jp1_pc_metricname: "probe_webscena_duration_seconds"
    annotations:
      jp1_pc_firing_description: "The period (in seconds) taken by the web scenario probe has exceeded the threshold (150 seconds). value={{ $value }} seconds"
      jp1_pc_resolved_description: "The period (in seconds) taken by the web scenario probe has fallen below the threshold (150 seconds)."
- #
-
If you define multiple alerts on the same monitoring agent host, be careful not to specify "groups:" more than once or to use the same group name in more than one "name:" entry.
■Alert definition example for metrics in VMware exporter metric definition file for host
-
vmware_host_size#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_size(VMware exporter)
    expr: 900 > sum(vmware_datastore_capacity_size) without(ds_name) / 1024 / 1024 / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1101"
      jp1_pc_metricname: "vmware_datastore_capacity_size"
    annotations:
      jp1_pc_firing_description: "The physical disk size has fallen below the threshold (900 gigabytes). value={{ $value }} gigabytes, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The physical disk size has exceeded the threshold (900 gigabytes). instance={{ $labels.instance }}"
-
vmware_host_used#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_used(VMware exporter)
    expr: 800 < ((sum(vmware_datastore_capacity_size) without(ds_name)) / 1024 / 1024 / 1024) - ((sum(vmware_datastore_freespace_size) without(ds_name)) / 1024 / 1024 / 1024)
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1112"
      jp1_pc_metricname: "vmware_datastore_capacity_size,vmware_datastore_freespace_size"
    annotations:
      jp1_pc_firing_description: "The physical disk usage size has exceeded the threshold (800 gigabytes). value={{ $value }} gigabytes, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The physical disk usage size has fallen below the threshold (800 gigabytes). instance={{ $labels.instance }}"
-
vmware_host_free#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_free(VMware exporter)
    expr: 10 > (sum(vmware_datastore_freespace_size) without(ds_name)) / 1024 / 1024 / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1102"
      jp1_pc_metricname: "vmware_datastore_freespace_size"
    annotations:
      jp1_pc_firing_description: "The size of free physical disk space has fallen below the threshold (10 gigabytes). value={{ $value }} gigabytes, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The size of free physical disk space has exceeded the threshold (10 gigabytes). instance={{ $labels.instance }}"
-
vmware_datastore_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_datastore_used_percent(VMware exporter)
    expr: (((sum(vmware_datastore_capacity_size) without(ds_name)) / 1024 / 1024) - ((sum(vmware_datastore_freespace_size) without(ds_name)) / 1024 / 1024)) / ((sum(vmware_datastore_capacity_size) without(ds_name)) / 1024 / 1024) * 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1103"
      jp1_pc_metricname: "vmware_datastore_capacity_size,vmware_datastore_freespace_size"
    annotations:
      jp1_pc_firing_description: "The physical disk usage rate has exceeded 90%. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The physical disk usage rate is 90% or less. instance={{ $labels.instance }}"
-
vmware_host_memory_max#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_memory_max(VMware exporter)
    expr: 16 > vmware_host_memory_max / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1104"
      jp1_pc_metricname: "vmware_host_memory_max"
    annotations:
      jp1_pc_firing_description: "The total size of physical memory has fallen below the threshold (16 gigabytes). value={{ $value }} gigabytes"
      jp1_pc_resolved_description: "The total size of physical memory has exceeded the threshold (16 gigabytes)."
-
vmware_host_memory_used#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_memory_used(VMware exporter)
    expr: 15 < vmware_host_memory_usage / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1117"
      jp1_pc_metricname: "vmware_host_memory_usage"
    annotations:
      jp1_pc_firing_description: "The amount of physical memory used has exceeded the threshold (15 gigabytes). value={{ $value }} gigabytes"
      jp1_pc_resolved_description: "The amount of physical memory used has fallen below the threshold (15 gigabytes)."
-
vmware_host_memory_unused#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_memory_unused(VMware exporter)
    expr: 1 > (vmware_host_memory_max / 1024) - (vmware_host_memory_usage / 1024)
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1105"
      jp1_pc_metricname: "vmware_host_memory_max,vmware_host_memory_usage"
    annotations:
      jp1_pc_firing_description: "The amount of unused physical memory has fallen below the threshold (1 gigabyte). value={{ $value }} gigabytes"
      jp1_pc_resolved_description: "The amount of unused physical memory has exceeded the threshold (1 gigabyte)."
-
vmware_host_mem_vmmemctl_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_mem_vmmemctl_average(VMware exporter)
    expr: 15 < vmware_host_mem_vmmemctl_average / 1024 / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1118"
      jp1_pc_metricname: "vmware_host_mem_vmmemctl_average"
    annotations:
      jp1_pc_firing_description: "The amount of internal swaps used has exceeded the threshold (15 gigabytes). value={{ $value }} gigabytes"
      jp1_pc_resolved_description: "The amount of internal swaps used has fallen below the threshold (15 gigabytes)."
-
vmware_host_memory_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_memory_used_percent(VMware exporter)
    expr: (vmware_host_memory_usage / vmware_host_memory_max) * 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1106"
      jp1_pc_metricname: "vmware_host_memory_usage,vmware_host_memory_max"
    annotations:
      jp1_pc_firing_description: "The physical memory usage rate has exceeded 90%. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The physical memory usage rate is 90% or less. instance={{ $labels.instance }}"
-
vmware_host_swap_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_swap_used_percent(VMware exporter)
    expr: ((vmware_host_mem_vmmemctl_average / 1024) / vmware_host_memory_max) * 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1116"
      jp1_pc_metricname: "vmware_host_mem_vmmemctl_average,vmware_host_memory_max"
    annotations:
      jp1_pc_firing_description: "The internal swap rate has exceeded 90%. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The internal swap rate is 90% or less. instance={{ $labels.instance }}"
-
vmware_vm_net_rate#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_vm_net_rate(VMware exporter)
    expr: 10 > vmware_host_net_bytesTx_average + vmware_host_net_bytesRx_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1107"
      jp1_pc_metricname: "vmware_host_net_bytesTx_average,vmware_host_net_bytesRx_average"
    annotations:
      jp1_pc_firing_description: "The network transmission/reception speed has fallen below the threshold (10 KB per second). value={{ $value }} KB per second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The network transmission/reception speed has exceeded the threshold (10 KB per second). instance={{ $labels.instance }}"
-
vmware_host_net_bytesTx_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_net_bytesTx_average(VMware exporter)
    expr: 10 > vmware_host_net_bytesTx_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1108"
      jp1_pc_metricname: "vmware_host_net_bytesTx_average"
    annotations:
      jp1_pc_firing_description: "The network transmission speed has fallen below the threshold (10 KB per second). value={{ $value }} KB per second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The network transmission speed has exceeded the threshold (10 KB per second). instance={{ $labels.instance }}"
-
vmware_host_net_bytesRx_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_net_bytesRx_average(VMware exporter)
    expr: 10 > vmware_host_net_bytesRx_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1109"
      jp1_pc_metricname: "vmware_host_net_bytesRx_average"
    annotations:
      jp1_pc_firing_description: "The network reception speed has fallen below the threshold (10 KB per second). value={{ $value }} KB per second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The network reception speed has exceeded the threshold (10 KB per second). instance={{ $labels.instance }}"
-
vmware_host_num_cpu#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_num_cpu(VMware exporter)
    expr: 2 > vmware_host_num_cpu
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1113"
      jp1_pc_metricname: "vmware_host_num_cpu"
    annotations:
      jp1_pc_firing_description: "The number of CPU cores has fallen below the threshold value (2)."
      jp1_pc_resolved_description: "The number of CPU cores has exceeded the threshold value (2)."
-
vmware_host_cpu_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_cpu_used_percent(VMware exporter)
    expr: vmware_host_cpu_usage_average / 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1110"
      jp1_pc_metricname: "vmware_host_cpu_usage_average"
    annotations:
      jp1_pc_firing_description: "The CPU usage rate has exceeded the threshold value (90%). instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The CPU usage rate has fallen below the threshold value (90%). instance={{ $labels.instance }}"
-
vmware_host_disk_write_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_disk_write_average(VMware exporter)
    expr: 10 > vmware_host_disk_write_average / 8
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1119"
      jp1_pc_metricname: "vmware_host_disk_write_average"
    annotations:
      jp1_pc_firing_description: "The write data transfer speed has fallen below the threshold value (10 KB per second). value={{ $value }} KB per second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The write data transfer speed has exceeded the threshold value (10 KB per second). instance={{ $labels.instance }}"
-
vmware_host_disk_read_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_disk_read_average(VMware exporter)
    expr: 10 > vmware_host_disk_read_average / 8
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1120"
      jp1_pc_metricname: "vmware_host_disk_read_average"
    annotations:
      jp1_pc_firing_description: "The read data transfer speed has fallen below the threshold value (10 KB per second). value={{ $value }} KB per second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The read data transfer speed has exceeded the threshold value (10 KB per second). instance={{ $labels.instance }}"
- #
-
If you define multiple alerts on the same monitoring agent host, be careful not to specify "groups:" more than once or to use the same group name in more than one "name:" entry.
■Alert definition example for metrics in VMware exporter metric definition file for VM
-
vmware_vm_cpu_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_vm_cpu_used_percent(VMware exporter)
    expr: vmware_vm_cpu_usage_average / (20 * 1000) * 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1121"
      jp1_pc_metricname: "vmware_vm_cpu_usage_average"
    annotations:
      jp1_pc_firing_description: "The virtual CPU usage rate has exceeded 90%. vm_name={{ $labels.vm_name }}, value={{ $value }}"
      jp1_pc_resolved_description: "The virtual CPU usage rate is 90% or less. vm_name={{ $labels.vm_name }}"
-
vmware_vm_mem_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_vm_mem_used_percent(VMware exporter)
    expr: (((vmware_vm_mem_consumed_average / 1024) + (vmware_vm_mem_vmmemctl_average / 1024) + (vmware_vm_mem_swapped_average / 1024)) / vmware_vm_memory_max) * 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1111"
      jp1_pc_metricname: "vmware_vm_mem_consumed_average,vmware_vm_mem_vmmemctl_average,vmware_vm_mem_swapped_average,vmware_vm_memory_max"
    annotations:
      jp1_pc_firing_description: "The virtual memory usage rate has exceeded 90%. vm_name={{ $labels.vm_name }}, value={{ $value }}"
      jp1_pc_resolved_description: "The virtual memory usage rate is 90% or less. vm_name={{ $labels.vm_name }}"
-
vmware_vm_disk_write_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_vm_disk_write_average(VMware exporter)
    expr: 10 > vmware_vm_disk_write_average / 8
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1122"
      jp1_pc_metricname: "vmware_vm_disk_write_average"
    annotations:
      jp1_pc_firing_description: "The write data transfer speed of the virtual machine has fallen below the threshold (10 KB per second). value={{ $value }} KB per second, vm_name={{ $labels.vm_name }}"
      jp1_pc_resolved_description: "The write data transfer speed of the virtual machine has exceeded the threshold (10 KB per second). vm_name={{ $labels.vm_name }}"
-
vmware_vm_disk_read_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_vm_disk_read_average(VMware exporter)
    expr: 10 > vmware_vm_disk_read_average / 8
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1123"
      jp1_pc_metricname: "vmware_vm_disk_read_average"
    annotations:
      jp1_pc_firing_description: "The read data transfer speed of the virtual machine has fallen below the threshold (10 KB per second). value={{ $value }} KB per second, vm_name={{ $labels.vm_name }}"
      jp1_pc_resolved_description: "The read data transfer speed of the virtual machine has exceeded the threshold (10 KB per second). vm_name={{ $labels.vm_name }}"
-
vmware_vm_disk_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_vm_disk_used_percent(VMware exporter)
    expr: (((sum(vmware_vm_guest_disk_capacity) without(partition) / (1024 * 1024)) - ((sum(vmware_vm_guest_disk_free) without(partition)) / (1024 * 1024))) / ((sum(vmware_vm_guest_disk_capacity) without(partition)) / (1024 * 1024))) * 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1124"
      jp1_pc_metricname: "vmware_vm_guest_disk_capacity,vmware_vm_guest_disk_free"
    annotations:
      jp1_pc_firing_description: "The disk usage rate of the virtual machine has exceeded 90%. vm_name={{ $labels.vm_name }}, value={{ $value }}"
      jp1_pc_resolved_description: "The disk usage rate of the virtual machine is 90% or less. vm_name={{ $labels.vm_name }}"
- #
-
When you define multiple alerts on the same integrated agent host, do not specify "groups:" more than once, and do not specify the same group name in more than one "name:" entry.
■Alert definition example for metrics in Windows exporter (Hyper-V monitoring) metric definition file for host
-
hyperv_vm_cpu_resources_used_percent#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_vm_cpu_resources_used_percent(Windows exporter)
    expr: 90 < (sum by(instance,job,jp1_pc_exporter,jp1_pc_nodelabel,jp1_pc_prome_hostname,vm)(rate(windows_hyperv_vm_cpu_hypervisor_run_time[2m]))) / ignoring(instance,job,jp1_pc_exporter,jp1_pc_nodelabel,jp1_pc_prome_hostname,vm) group_left max (windows_cs_logical_processors) / 100000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1601"
      jp1_pc_metricname: "windows_hyperv_vm_cpu_hypervisor_run_time,windows_cs_logical_processors"
    annotations:
      jp1_pc_firing_description: "VM used more than 90% of the physical CPU. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "VM now uses 90% or less of the physical CPU. instance={{ $labels.instance }}"
-
hyperv_host_cpu_used_percent#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_host_cpu_used_percent(Windows exporter)
    expr: 90 < (sum by (instance,job,jp1_pc_category,jp1_pc_exporter,jp1_pc_nodelabel,jp1_pc_trendname,jp1_pc_prome_hostname)(rate(windows_hyperv_host_cpu_total_run_time[2m]))) / sum by (instance,job,jp1_pc_category,jp1_pc_exporter,jp1_pc_nodelabel,jp1_pc_trendname,jp1_pc_prome_hostname)(windows_cs_logical_processors) / 100000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1604"
      jp1_pc_metricname: "windows_hyperv_host_cpu_total_run_time,windows_cs_logical_processors"
    annotations:
      jp1_pc_firing_description: "CPU usage of the physical server exceeded 90%. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "CPU usage of the physical server is 90% or less. instance={{ $labels.instance }}"
-
hyperv_vswitch_sent_received#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_vswitch_sent_received(Windows exporter)
    expr: 10 > ((rate(windows_hyperv_vswitch_bytes_received_total[2m])) + (rate(windows_hyperv_vswitch_bytes_sent_total[2m]))) / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1605"
      jp1_pc_metricname: "windows_hyperv_vswitch_bytes_received_total,windows_hyperv_vswitch_bytes_sent_total"
    annotations:
      jp1_pc_firing_description: "The network transmission/reception rate dropped below the threshold (10 KB/second). value={{ $value }} KB/second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The network transmission/reception rate exceeded the threshold (10 KB/second). instance={{ $labels.instance }}"
-
hyperv_vswitch_received#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_vswitch_received(Windows exporter)
    expr: 10 > (rate(windows_hyperv_vswitch_bytes_received_total[2m])) / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1606"
      jp1_pc_metricname: "windows_hyperv_vswitch_bytes_received_total"
    annotations:
      jp1_pc_firing_description: "The network reception rate dropped below the threshold (10 KB/second). value={{ $value }} KB/second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The network reception rate exceeded the threshold (10 KB/second). instance={{ $labels.instance }}"
-
hyperv_vswitch_sent#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_vswitch_sent(Windows exporter)
    expr: 10 > (rate(windows_hyperv_vswitch_bytes_sent_total[2m])) / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1607"
      jp1_pc_metricname: "windows_hyperv_vswitch_bytes_sent_total"
    annotations:
      jp1_pc_firing_description: "The network transmission rate dropped below the threshold (10 KB/second). value={{ $value }} KB/second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The network transmission rate exceeded the threshold (10 KB/second). instance={{ $labels.instance }}"
- #
-
When you define multiple alerts on the same integrated agent host, do not specify "groups:" more than once, and do not specify the same group name in more than one "name:" entry.
■Alert definition example for metrics in Windows exporter (Hyper-V monitoring) metric definition file for VM
-
hyperv_vm_device_written#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_vm_device_written(Windows exporter)
    expr: 10 > (rate(windows_hyperv_vm_device_bytes_written[2m])) / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1602"
      jp1_pc_metricname: "windows_hyperv_vm_device_bytes_written"
    annotations:
      jp1_pc_firing_description: "The write data rate dropped below the threshold (10 KB/second). value={{ $value }} KB/second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The write data rate exceeded the threshold (10 KB/second). instance={{ $labels.instance }}"
-
hyperv_vm_device_read#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_vm_device_read(Windows exporter)
    expr: 10 > (rate(windows_hyperv_vm_device_bytes_read[2m])) / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1603"
      jp1_pc_metricname: "windows_hyperv_vm_device_bytes_read"
    annotations:
      jp1_pc_firing_description: "The read data rate dropped below the threshold (10 KB/second). value={{ $value }} KB/second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The read data rate exceeded the threshold (10 KB/second). instance={{ $labels.instance }}"
- #
-
When you define multiple alerts on the same integrated agent host, do not specify "groups:" more than once, and do not specify the same group name in more than one "name:" entry.
■Alert definition example for metrics in SQL exporter metric definition file
-
connections#
groups:
- name: sql_exporter
  rules:
  - alert: sql_connections(SQL exporter)
    expr: mssql_connections > 50
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1701"
      jp1_pc_metricname: "mssql_connections"
    annotations:
      jp1_pc_firing_description: "Number of connections exceeded the threshold (50 connections). value={{ $value }}"
      jp1_pc_resolved_description: "Number of connections dropped below the threshold (50 connections). value={{ $value }}"
-
deadlocks#
groups:
- name: sql_exporter
  rules:
  - alert: sql_deadlocks(SQL exporter)
    expr: rate(mssql_deadlocks[2m]) > 0
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1702"
      jp1_pc_metricname: "mssql_deadlocks"
    annotations:
      jp1_pc_firing_description: "Deadlock occurred. value={{ $value }}"
      jp1_pc_resolved_description: "No deadlock detected. value={{ $value }}"
-
user_errors#
groups:
- name: sql_exporter
  rules:
  - alert: sql_user_errors(SQL exporter)
    expr: rate(mssql_user_errors[2m]) > 0
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1703"
      jp1_pc_metricname: "mssql_user_errors"
    annotations:
      jp1_pc_firing_description: "User error occurred. value={{ $value }}"
      jp1_pc_resolved_description: "No user error detected. value={{ $value }}"
-
kill_connection_errors#
groups:
- name: sql_exporter
  rules:
  - alert: sql_kill_connection_errors(SQL exporter)
    expr: rate(mssql_kill_connection_errors[2m]) > 0
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1704"
      jp1_pc_metricname: "mssql_kill_connection_errors"
    annotations:
      jp1_pc_firing_description: "Critical failure. value={{ $value }}"
      jp1_pc_resolved_description: "No critical failure detected. value={{ $value }}"
-
page_life_expectancy_seconds#
groups:
- name: sql_exporter
  rules:
  - alert: sql_life_expectancy_seconds(SQL exporter)
    expr: rate(mssql_page_life_expectancy_seconds[2m]) > 1
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1705"
      jp1_pc_metricname: "mssql_page_life_expectancy_seconds"
    annotations:
      jp1_pc_firing_description: "The buffer pool page life expectancy exceeded the threshold (1 second). value={{ $value }}"
      jp1_pc_resolved_description: "The buffer pool page life expectancy dropped below the threshold (1 second). value={{ $value }}"
-
batch_requests#
groups:
- name: sql_exporter
  rules:
  - alert: sql_command_batch(SQL exporter)
    expr: rate(mssql_batch_requests[2m]) > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1706"
      jp1_pc_metricname: "mssql_batch_requests"
    annotations:
      jp1_pc_firing_description: "The number of command batches received exceeds the threshold (1000). value={{ $value }}"
      jp1_pc_resolved_description: "The number of command batches received dropped below the threshold (1000). value={{ $value }}"
-
log_growths#
groups:
- name: sql_exporter
  rules:
  - alert: sql_log_growths(SQL exporter)
    expr: mssql_log_growths > 1
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1707"
      jp1_pc_metricname: "mssql_log_growths"
    annotations:
      jp1_pc_firing_description: "The transaction-log growth count exceeded the threshold (1 time). value={{ $value }}"
      jp1_pc_resolved_description: "The transaction-log growth count dropped below the threshold (1 time). value={{ $value }}"
-
checkpoint_pages_sec#
groups:
- name: sql_exporter
  rules:
  - alert: sql_checkpoint_pages_sec(SQL exporter)
    expr: mssql_checkpoint_pages_sec > 0
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1709"
      jp1_pc_metricname: "mssql_checkpoint_pages_sec"
    annotations:
      jp1_pc_firing_description: "Checkpoint pages per second exceeded the threshold (0). value={{ $value }}"
      jp1_pc_resolved_description: "Checkpoint pages per second dropped below the threshold (0). value={{ $value }}"
-
io_stall_seconds#
groups:
- name: sql_exporter
  rules:
  - alert: sql_io_stall_seconds(SQL exporter)
    expr: avg(mssql_io_stall_seconds) without(operation) > 1
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1710"
      jp1_pc_metricname: "mssql_io_stall_seconds"
    annotations:
      jp1_pc_firing_description: "The average stall time per operation exceeded the threshold (1 second). value={{ $value }}"
      jp1_pc_resolved_description: "The average stall time per operation dropped below the threshold (1 second). value={{ $value }}"
-
io_stall_read_seconds#
groups:
- name: sql_exporter
  rules:
  - alert: sql_io_stall_read_seconds(SQL exporter)
    expr: mssql_io_stall_seconds{operation="read"} > 1
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1711"
      jp1_pc_metricname: "mssql_io_stall_seconds"
    annotations:
      jp1_pc_firing_description: "The stall time for read operations exceeded the threshold (1 second). value={{ $value }}"
      jp1_pc_resolved_description: "The stall time for read operations dropped below the threshold (1 second). value={{ $value }}"
-
io_stall_write_seconds#
groups:
- name: sql_exporter
  rules:
  - alert: sql_io_stall_write_seconds(SQL exporter)
    expr: mssql_io_stall_seconds{operation="write"} > 1
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1712"
      jp1_pc_metricname: "mssql_io_stall_seconds"
    annotations:
      jp1_pc_firing_description: "The stall time for write operations exceeded the threshold (1 second). value={{ $value }}"
      jp1_pc_resolved_description: "The stall time for write operations dropped below the threshold (1 second). value={{ $value }}"
-
io_stall_total_seconds#
groups:
- name: sql_exporter
  rules:
  - alert: sql_io_stall_total_seconds(SQL exporter)
    expr: mssql_io_stall_total_seconds > 10
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1713"
      jp1_pc_metricname: "mssql_io_stall_total_seconds"
    annotations:
      jp1_pc_firing_description: "The total stall time per database exceeded the threshold (10 seconds). value={{ $value }}"
      jp1_pc_resolved_description: "The total stall time per database dropped below the threshold (10 seconds). value={{ $value }}"
-
resident_memory_mbytes#
groups:
- name: sql_exporter
  rules:
  - alert: sql_resident_memory_bytes(SQL exporter)
    expr: mssql_resident_memory_bytes / 1024 / 1024 > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1714"
      jp1_pc_metricname: "mssql_resident_memory_bytes"
    annotations:
      jp1_pc_firing_description: "Resident memory size exceeded the threshold (1000MB). value={{ $value }}"
      jp1_pc_resolved_description: "Resident memory size dropped below the threshold (1000MB). value={{ $value }}"
-
virtual_memory_mbytes#
groups:
- name: sql_exporter
  rules:
  - alert: sql_virtual_memory_bytes(SQL exporter)
    expr: mssql_virtual_memory_bytes / 1024 / 1024 > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1715"
      jp1_pc_metricname: "mssql_virtual_memory_bytes"
    annotations:
      jp1_pc_firing_description: "Commit size for virtual memory exceeded the threshold (1000MB). value={{ $value }}"
      jp1_pc_resolved_description: "Commit size for virtual memory dropped below the threshold (1000MB). value={{ $value }}"
-
memory_utilization_percentage#
groups:
- name: sql_exporter
  rules:
  - alert: sql_memory_utilization_percentage(SQL exporter)
    expr: mssql_memory_utilization_percentage > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1716"
      jp1_pc_metricname: "mssql_memory_utilization_percentage"
    annotations:
      jp1_pc_firing_description: "The percentage of committed memory in the working set exceeded the threshold (90%). value={{ $value }}"
      jp1_pc_resolved_description: "The percentage of committed memory in the working set is below the threshold (90%). value={{ $value }}"
-
page_fault_count#
groups:
- name: sql_exporter
  rules:
  - alert: sql_page_fault(SQL exporter)
    expr: rate(mssql_page_fault_count[2m]) > 10
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1717"
      jp1_pc_metricname: "mssql_page_fault_count"
    annotations:
      jp1_pc_firing_description: "The number of page faults exceeded the threshold (10 times). value={{ $value }}"
      jp1_pc_resolved_description: "The number of page faults is less than the threshold (10 times). value={{ $value }}"
-
os_memory_mbytes#
groups:
- name: sql_exporter
  rules:
  - alert: sql_os_memory_mbytes(SQL exporter)
    expr: sum(mssql_os_memory) without(state) / 1024 / 1024 < 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1718"
      jp1_pc_metricname: "mssql_os_memory"
    annotations:
      jp1_pc_firing_description: "The OS physical memory (overall) has fallen below the threshold (1000MB). value={{ $value }}"
      jp1_pc_resolved_description: "OS physical memory (overall) exceeded the threshold (1000MB). value={{ $value }}"
-
os_memory_available_mbytes#
groups:
- name: sql_exporter
  rules:
  - alert: sql_os_memory_available_mbytes(SQL exporter)
    expr: mssql_os_memory{state="available"} / 1024 / 1024 < 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1719"
      jp1_pc_metricname: "mssql_os_memory"
    annotations:
      jp1_pc_firing_description: "The OS physical memory (available) has fallen below the threshold (1000MB). value={{ $value }}"
      jp1_pc_resolved_description: "OS physical memory (available) exceeded the threshold (1000MB). value={{ $value }}"
-
os_memory_used_mbytes#
groups:
- name: sql_exporter
  rules:
  - alert: sql_os_memory_used_mbytes(SQL exporter)
    expr: mssql_os_memory{state="used"} / 1024 / 1024 > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1720"
      jp1_pc_metricname: "mssql_os_memory"
    annotations:
      jp1_pc_firing_description: "OS physical memory (used) exceeded the threshold (1000MB). value={{ $value }}"
      jp1_pc_resolved_description: "The OS physical memory (used) has fallen below the threshold (1000MB). value={{ $value }}"
-
os_page_file#
groups:
- name: sql_exporter
  rules:
  - alert: sql_os_page_file(SQL exporter)
    expr: sum(mssql_os_page_file) without(state) < 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1721"
      jp1_pc_metricname: "mssql_os_page_file"
    annotations:
      jp1_pc_firing_description: "The total number of OS pagefiles has fallen below the threshold (1000 files). value={{ $value }}"
      jp1_pc_resolved_description: "Total number of OS pagefiles exceeded the threshold (1000 files). value={{ $value }}"
-
os_page_file_available#
groups:
- name: sql_exporter
  rules:
  - alert: sql_os_page_file_available(SQL exporter)
    expr: mssql_os_page_file{state="available"} < 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1722"
      jp1_pc_metricname: "mssql_os_page_file"
    annotations:
      jp1_pc_firing_description: "Number of OS pagefiles (available) dropped below the threshold (1000). value={{ $value }}"
      jp1_pc_resolved_description: "Number of OS pagefiles (available) exceeds the threshold (1000). value={{ $value }}"
-
os_page_file_used#
groups:
- name: sql_exporter
  rules:
  - alert: sql_os_page_file_used(SQL exporter)
    expr: mssql_os_page_file{state="used"} > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1723"
      jp1_pc_metricname: "mssql_os_page_file"
    annotations:
      jp1_pc_firing_description: "Number of OS pagefiles (used) exceeded the threshold (1000). value={{ $value }}"
      jp1_pc_resolved_description: "Number of OS pagefiles (used) dropped below the threshold (1000). value={{ $value }}"
-
process_count#
groups:
- name: sql_exporter
  rules:
  - alert: sql_process_count(SQL exporter)
    expr: mssql_database_detail_process_count > 50
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1724"
      jp1_pc_metricname: "mssql_database_detail_process_count"
    annotations:
      jp1_pc_firing_description: "The total number of processes exceeded the threshold (50 processes). value={{ $value }}"
      jp1_pc_resolved_description: "The total number of processes dropped below the threshold (50 processes). value={{ $value }}"
-
perc_busy_percentage#
groups:
- name: sql_exporter
  rules:
  - alert: sql_perc_busy(SQL exporter)
    expr: mssql_global_server_summary_perc_busy > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1725"
      jp1_pc_metricname: "mssql_global_server_summary_perc_busy"
    annotations:
      jp1_pc_firing_description: "The percentage of CPU busy time exceeded the threshold (90%). value={{ $value }}"
      jp1_pc_resolved_description: "The percentage of CPU busy time has fallen below the threshold (90%). value={{ $value }}"
-
server_summary_packet_errors#
groups:
- name: sql_exporter
  rules:
  - alert: sql_global_server_summary_packet_errors(SQL exporter)
    expr: mssql_global_server_summary_packet_errors > 2
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1726"
      jp1_pc_metricname: "mssql_global_server_summary_packet_errors"
    annotations:
      jp1_pc_firing_description: "Number of packet errors exceeded the threshold (2). value={{ $value }}"
      jp1_pc_resolved_description: "Number of packet errors dropped below the threshold (2). value={{ $value }}"
-
blocked_processes#
groups:
- name: sql_exporter
  rules:
  - alert: sql_blocked_processes(SQL exporter)
    expr: mssql_server_detail_blocked_processes > 2
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1727"
      jp1_pc_metricname: "mssql_server_detail_blocked_processes"
    annotations:
      jp1_pc_firing_description: "The number of waiting processes exceeded the threshold (2 processes). value={{ $value }}"
      jp1_pc_resolved_description: "The number of waiting processes is less than the threshold (2 processes). value={{ $value }}"
-
server_overview_cache_hit#
groups:
- name: sql_exporter
  rules:
  - alert: sql_server_overview_cache_hit(SQL exporter)
    expr: mssql_server_overview_cache_hit < 85
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1728"
      jp1_pc_metricname: "mssql_server_overview_cache_hit"
    annotations:
      jp1_pc_firing_description: "The percentage of data pages found in the data cache was below the threshold (85%). value={{ $value }}"
      jp1_pc_resolved_description: "The percentage of data pages found in the data cache exceeded the threshold (85%). value={{ $value }}"
- #
-
When you define multiple alerts on the same integrated agent host, do not specify "groups:" more than once, and do not specify the same group name in more than one "name:" entry.
■Alert definition example for metrics in Container monitoring metric definition file
-
kube_job_status_failed#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_job_status_failed(Kube state metrics)
    expr: 0 < kube_job_status_failed * on(job_name, namespace) group_left() kube_job_owner{owner_kind="<none>", owner_name="<none>"}
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1201"
      jp1_pc_metricname: "kube_job_status_failed, kube_job_owner"
      jp1_pc_nodelabel: "{{ $labels.namespace }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of failed pods has exceeded the threshold (0 pods). value={{ $value }} pods, job_name={{ $labels.job_name }}"
      jp1_pc_resolved_description: "The number of failed pods has fallen below the threshold (0 pods). job_name={{ $labels.job_name }}"
-
kube_pod_status_pending#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_pod_status_pending(Kube state metrics)
    expr: 0 < sum by (pod, namespace, instance, job) (kube_pod_status_phase{phase="Pending"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1202"
      jp1_pc_metricname: "kube_pod_status_phase"
      jp1_pc_nodelabel: "{{ $labels.namespace }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of pending pods has exceeded the threshold (0 pods). value={{ $value }} pods, pod={{ $labels.pod }}"
      jp1_pc_resolved_description: "The number of pending pods has fallen below the threshold (0 pods). pod={{ $labels.pod }}"
-
kube_pod_status_failed#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_pod_status_failed(Kube state metrics)
    expr: 0 < sum by (pod, namespace, instance, job) (kube_pod_status_phase{phase="Failed"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1203"
      jp1_pc_metricname: "kube_pod_status_phase"
      jp1_pc_nodelabel: "{{ $labels.namespace }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of failed pods has exceeded the threshold (0 pods). value={{ $value }} pods, pod={{ $labels.pod }}"
      jp1_pc_resolved_description: "The number of failed pods has fallen below the threshold (0 pods). pod={{ $labels.pod }}"
-
kube_pod_status_unknown#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_pod_status_unknown(Kube state metrics)
    expr: 0 < sum by (pod, namespace, instance) (kube_pod_status_phase{phase="Unknown"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1204"
      jp1_pc_metricname: "kube_pod_status_phase"
      jp1_pc_nodelabel: "{{ $labels.namespace }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of unknown pods has exceeded the threshold (0 pods). value={{ $value }} pods, pod={{ $labels.pod }}"
      jp1_pc_resolved_description: "The number of unknown pods has fallen below the threshold (0 pods). pod={{ $labels.pod }}"
-
kube_daemonset_failed_number_scheduled#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_daemonset_failed_number_scheduled(Kube state metrics)
    expr: 0 < kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1205"
      jp1_pc_metricname: "kube_daemonset_status_desired_number_scheduled, kube_daemonset_status_current_number_scheduled"
      jp1_pc_nodelabel: "{{ $labels.daemonset }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of nodes that failed to execute has exceeded the threshold (0 nodes). value={{ $value }} nodes"
      jp1_pc_resolved_description: "The number of nodes that failed to execute has fallen below the threshold (0 nodes)."
-
kube_deployment_failed_replicas#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_deployment_failed_replicas(Kube state metrics)
    expr: 0 < kube_deployment_spec_replicas - kube_deployment_status_replicas_available
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1206"
      jp1_pc_metricname: "kube_deployment_spec_replicas, kube_deployment_status_replicas_available"
      jp1_pc_nodelabel: "{{ $labels.deployment }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of pods that failed to execute on each deployment has exceeded the threshold (0 pods). value={{ $value }} pods"
      jp1_pc_resolved_description: "The number of pods that failed to execute on each deployment has fallen below the threshold (0 pods)."
-
kube_replicaset_failed_replicas#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_replicaset_failed_replicas(Kube state metrics)
    expr: 0 < kube_replicaset_spec_replicas - kube_replicaset_status_ready_replicas
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1207"
      jp1_pc_metricname: "kube_replicaset_spec_replicas, kube_replicaset_status_ready_replicas"
      jp1_pc_nodelabel: "{{ $labels.replicaset }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of pods that failed to execute on each ReplicaSet has exceeded the threshold (0 pods). value={{ $value }} pods"
      jp1_pc_resolved_description: "The number of pods that failed to execute on each ReplicaSet has fallen below the threshold (0 pods)."
-
kube_statefulset_failed_replicas#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_statefulset_failed_replicas(Kube state metrics)
    expr: 0 < kube_statefulset_replicas - kube_statefulset_status_replicas_ready
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1208"
      jp1_pc_metricname: "kube_statefulset_replicas, kube_statefulset_status_replicas_ready"
      jp1_pc_nodelabel: "{{ $labels.statefulset }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of pods that failed to execute on each StatefulSet has exceeded the threshold (0 pods). value={{ $value }} pods"
      jp1_pc_resolved_description: "The number of pods that failed to execute on each StatefulSet has fallen below the threshold (0 pods)."
kube_cron_job_status_failed#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_cron_job_status_failed(Kube state metrics)
    expr: 0 < kube_job_status_failed * on(job_name, namespace) group_left(owner_name) kube_job_owner{owner_kind="CronJob", owner_name!="<none>"}
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1209"
      jp1_pc_metricname: "kube_job_status_failed, kube_job_owner"
      jp1_pc_nodelabel: "{{ $labels.owner_name }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of pods that failed to execute within a CronJob has exceeded the threshold (0 pods). value={{ $value }} pods"
      jp1_pc_resolved_description: "The number of pods that failed to execute within a CronJob has fallen below the threshold (0 pods)."
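As a note on the expression above: the "* on(job_name, namespace) group_left(owner_name)" vector match joins each kube_job_status_failed series with the kube_job_owner series that has the same job_name and namespace, and copies owner_name onto the result. A minimal sketch with hypothetical series values:

# Hypothetical input series:
#   kube_job_status_failed{job_name="backup-28730", namespace="ops"} = 1
#   kube_job_owner{job_name="backup-28730", namespace="ops",
#                  owner_kind="CronJob", owner_name="backup"} = 1
# The join multiplies the matched pair (1 * 1 = 1) and attaches
# owner_name="backup" to the result, which is why jp1_pc_nodelabel can
# reference {{ $labels.owner_name }}.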
kube_node_status_condition_not_ready#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_node_status_condition_not_ready(Kube state metrics)
    expr: 1 == sum by (node, instance) (kube_node_status_condition{condition="Ready",status=~"false|unknown"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1210"
      jp1_pc_metricname: "kube_node_status_condition"
      jp1_pc_nodelabel: "{{ $labels.node }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The node is in an error state. value={{ $value }} node"
      jp1_pc_resolved_description: "The node has recovered from its error state."
kube_node_status_condition_memory_pressure#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_node_status_condition_memory_pressure(Kube state metrics)
    expr: 1 == sum by (node, instance) (kube_node_status_condition{condition="MemoryPressure",status=~"true|unknown"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1211"
      jp1_pc_metricname: "kube_node_status_condition"
      jp1_pc_nodelabel: "{{ $labels.node }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The node is in a memory-constrained state. value={{ $value }} node"
      jp1_pc_resolved_description: "The node has recovered from its memory-constrained state."
kube_node_status_condition_disk_pressure#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_node_status_condition_disk_pressure(Kube state metrics)
    expr: 1 == sum by (node, instance) (kube_node_status_condition{condition="DiskPressure",status=~"true|unknown"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1212"
      jp1_pc_metricname: "kube_node_status_condition"
      jp1_pc_nodelabel: "{{ $labels.node }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The node is in a disk-constrained state. value={{ $value }} node"
      jp1_pc_resolved_description: "The node has recovered from its disk-constrained state."
kube_node_status_condition_pid_pressure#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_node_status_condition_pid_pressure(Kube state metrics)
    expr: 1 == sum by (node, instance) (kube_node_status_condition{condition="PIDPressure",status=~"true|unknown"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1213"
      jp1_pc_metricname: "kube_node_status_condition"
      jp1_pc_nodelabel: "{{ $labels.node }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The node is in a PID assignment-constrained state. value={{ $value }} node"
      jp1_pc_resolved_description: "The node has recovered from its PID assignment-constrained state."
kube_namespace_cpu_percent_used#
groups:
- name: kubelet
  rules:
  - alert: kube_namespace_cpu_percent_used(Kubelet)
    expr: 80 < sum by (namespace, job) (rate(container_cpu_usage_seconds_total{name!=""}[2m])) * 100
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1222"
      jp1_pc_metricname: "container_cpu_usage_seconds_total"
      jp1_pc_nodelabel: "{{ $externalLabels.jp1_pc_prome_clustername }}"
      jp1_pc_exporter: "JPC Kubelet"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kubelet"
      instance: "{{ $externalLabels.jp1_pc_prome_hostname }}"
    annotations:
      jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%, namespace={{ $labels.namespace }}"
      jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%). namespace={{ $labels.namespace }}"
kube_namespace_memory_percent_used#
groups:
- name: kubelet
  rules:
  - alert: kube_namespace_memory_percent_used(Kubelet)
    expr: 80 < sum by (namespace, job) (container_memory_working_set_bytes and (container_spec_memory_limit_bytes{name!=""} > 0)) / sum by (namespace, job) ((container_spec_memory_limit_bytes{name!=""} > 0) and container_memory_working_set_bytes) * 100
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1223"
      jp1_pc_metricname: "container_memory_working_set_bytes, container_spec_memory_limit_bytes"
      jp1_pc_nodelabel: "{{ $externalLabels.jp1_pc_prome_clustername }}"
      jp1_pc_exporter: "JPC Kubelet"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kubelet"
      instance: "{{ $externalLabels.jp1_pc_prome_hostname }}"
    annotations:
      jp1_pc_firing_description: "Memory usage has exceeded the threshold (80%). value={{ $value }}%, namespace={{ $labels.namespace }}"
      jp1_pc_resolved_description: "Memory usage has fallen below the threshold (80%). namespace={{ $labels.namespace }}"
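In the expression above, the numerator sums the working-set bytes of containers that have a memory limit set, and the denominator sums those limits, so the result is the namespace's memory usage as a percentage of its limits. A worked illustration with hypothetical numbers:

# Hypothetical values for one namespace:
#   sum of container_memory_working_set_bytes (containers with a limit) = 1.7 GiB
#   sum of container_spec_memory_limit_bytes (same containers)          = 2.0 GiB
#   1.7 / 2.0 * 100 = 85, which exceeds the threshold of 80, so the
#   condition is met; the alert fires only if the value stays above 80%
#   for the full "for:" period (3m).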
kube_pod_cpu_percent_used_pod#
groups:
- name: kubelet
  rules:
  - alert: kube_pod_cpu_percent_used_pod(Kubelet)
    expr: 80 < sum by (pod, namespace, instance, job) (rate(container_cpu_usage_seconds_total{name!=""}[2m])) * 100
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1220"
      jp1_pc_metricname: "container_cpu_usage_seconds_total"
      jp1_pc_nodelabel: "{{ $labels.pod }}"
      jp1_pc_exporter: "JPC Kubelet"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kubelet"
    annotations:
      jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
      jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
kube_pod_memory_percent_used_pod#
groups:
- name: kubelet
  rules:
  - alert: kube_pod_memory_percent_used_pod(Kubelet)
    expr: 80 < sum by (pod, namespace, instance, job) (container_memory_working_set_bytes and (container_spec_memory_limit_bytes{name!=""} > 0)) / sum by (pod, namespace, instance, job) ((container_spec_memory_limit_bytes{name!=""} > 0) and container_memory_working_set_bytes) * 100
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1221"
      jp1_pc_metricname: "container_memory_working_set_bytes, container_spec_memory_limit_bytes"
      jp1_pc_nodelabel: "{{ $labels.pod }}"
      jp1_pc_exporter: "JPC Kubelet"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kubelet"
    annotations:
      jp1_pc_firing_description: "Memory usage has exceeded the threshold (80%). value={{ $value }}%"
      jp1_pc_resolved_description: "Memory usage has fallen below the threshold (80%)."
#
When defining multiple alerts on the same integrated agent host, do not specify the groups: element more than once, and do not specify the same group name in more than one name: element. In alert:, specify the alert name according to the following naming rule. If you do not follow this rule, the JP1 event is not created.
alert: metric-definition-name(exporter-name)any-value
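For example, the following alert name satisfies the naming rule. This is a hypothetical sketch based on the kube_pod_status_unknown sample above; the trailing suffix custom1 is an arbitrary value chosen by the user:

alert: kube_pod_status_unknown(Kube state metrics)custom1

Here, kube_pod_status_unknown is the metric definition name, Kube state metrics is the exporter name, and custom1 corresponds to the any-value part of the rule.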