Alert configuration file (jpc_alerting_rules.yml)
Format
Write in YAML format.

```yaml
groups:
  - name: group-name
    rules:
      - alert: alert-name
        expr: conditional-expression
        for: period
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: JP1-event-severity
          jp1_pc_eventid: event-ID-of-the-JP1-event
          jp1_pc_metricname: metric-name
        annotations:
          jp1_pc_firing_description: message-when-the-firing-condition-is-met
          jp1_pc_resolved_description: message-when-the-firing-condition-is-no-longer-met
```
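For example, a minimal file with one group and one rule, assembled from the cpu_used_rate example shown under Definition example below, might look like the following sketch (the expression, threshold, and messages are illustrative):

```yaml
groups:
  - name: node_exporter                  # group name, unique within the host
    rules:
      - alert: cpu_used_rate(Node exporter)
        # Fires when user-mode CPU usage stays above 80% for 3 minutes.
        expr: 80 < avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="user"}[2m])) * 100
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0301"
          jp1_pc_metricname: "node_cpu_seconds_total"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```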
File
jpc_alerting_rules.yml
jpc_alerting_rules.yml.model (model file)
Storage directory
■Integrated agent host
- In Windows:
  - For a physical host: Agent-path\conf\
  - For a logical host: shared-folder\jp1ima\conf\
- In Linux:
  - For a physical host: /opt/jp1ima/conf/
  - For a logical host: shared-directory/jp1ima/conf/
Description
This file defines the alerting rules that the Prometheus server evaluates.
Character code
UTF-8 (without BOM)
Line feed code
In Windows: CR+LF
In Linux: LF
When the definitions are applied
The definitions are applied when the Prometheus server is restarted or when you instruct the Prometheus server to reload its configuration.
Information that is specified
For definitions of common placeholders used in the table below, see About definition of common placeholders for descriptive items in yml file.
| Item | Description | Changeability | What you set up in your JP1/IM - Agent | JP1/IM - Agent default value |
|---|---|---|---|---|
| groups: | -- | N | -- | "groups:" |
| name: <string> | Specify the alert group name within 255 bytes. The group name must be unique within a monitoring agent host; you cannot define more than one group with the same name. Different monitoring agent hosts, however, can each use the same group name. | Y | Specify a group name of your choice. | Not specified |
| rules: | Configures the alert rules. Up to 100 rules can be specified.# | N | -- | Not specified |
| alert: <string> | Specify a name for the alert. | Y | Specify the name of the alert that you create. | Not specified |
| expr: <string> | Specify the alert conditional expression, as a PromQL statement, within 255 bytes. | Y | Specify the PromQL statement to evaluate.# For notes on PromQL statements, see Note on PromQL expression. | Not specified |
| for: <duration> | Specify the duration for an alert to become firing, in the range from 0 seconds to 24 hours. Specify the value as a number and a unit; the units that can be specified are s (seconds) and m (minutes). Even if the alert conditional expression is met, the alert is not treated as firing if the condition stops being met within the period specified for for. | Y | Specify the amount of time it takes for an alert to reach the firing state. | Not specified |
| labels: | Sets the labels to add to, or override in, each alert. | N | -- | Not specified |
| jp1_pc_product_name: <string> | Specify the value to be set as the product name of the JP1 event. | Y | Specify "/HITACHI/JP1/JPCCS2" or "/HITACHI/JP1/JPCCS2/xxxx" (where xxxx is any string). | Not specified |
| jp1_pc_component: <string> | Specify the value to be set as the component name of the JP1 event. | Y | Specify the following value, depending on the product plug-in that handles the JP1 event: for jp1pccs_azure.js, "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"; for jp1pccs_kubernetes.js, "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"; for jp1pccs.js, "/HITACHI/JP1/JPCCS/CONFINFO". | Not specified |
| jp1_pc_severity: <string> | Specify the value to be set as the severity of the JP1 event. | Y | Specify one of the JP1 event severity levels (for example, "Error"). | Not specified |
| jp1_pc_eventid: <string> | Specify the value to be set as the event ID of the JP1 event. | Y | Specify any value in the range 0 to 1FFF or 7FFF8000 to 7FFFFFFF that can be specified as the event ID of a JP1 event. | If omitted, "00007600" is set as the value of the event ID attribute of the JP1 event. |
| jp1_pc_metricname: <string> | Specify the value to be set as the metric name of the JP1 event. In the case of Yet another cloudwatch exporter, the JP1 event is associated with the IM management node in the AWS namespace corresponding to the metric name (or the first metric name, if multiple comma-separated values are specified). | Y | Specify the metric names, separated by commas. | Not specified |
| annotations: | Sets the annotations to add to each alert. | N | -- | Not specified |
| jp1_pc_firing_description: <string> | Specify the value to be set as the message of the JP1 event when the firing condition of the alert is met. If the value is 1,024 bytes or longer, only the string from the beginning to the 1,023rd byte is set. | Y | Specify any message. | If omitted, the message of the JP1 event is "The alert is firing. (alert = alert-name)". |
| jp1_pc_resolved_description: <string> | Specify the value to be set as the message of the JP1 event when the firing condition of the alert is no longer met. If the value is 1,024 bytes or longer, only the string from the beginning to the 1,023rd byte is set. | Y | Specify any message. | If omitted, the message of the JP1 event is "The alert is resolved. (alert = alert-name)". |
Legend:
Y: Changeable, N: Not changeable, --: Not applicable

#
Because the following labels are set as attributes of the JP1 event, do not remove them by using an aggregation operator:

- instance
- job
- jp1_pc_nodelabel
- jp1_pc_exporter
- jp1_pc_remote_monitor_instance
- account
- region
- dimension_any-string

Note that the labels account, region, and dimension_any-string apply only when monitoring Yet another cloudwatch exporter metrics.
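For example, when an expression aggregates a metric, keep these labels by listing them in a by clause; a bare aggregation drops them. A minimal sketch (the metric and threshold are illustrative):

```yaml
# OK: the "by" clause preserves the labels that become JP1 event attributes.
expr: 80 < avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="user"}[2m])) * 100
# NG: a bare avg() removes instance, job, jp1_pc_nodelabel, and jp1_pc_exporter.
# expr: 80 < avg(rate(node_cpu_seconds_total{mode="user"}[2m])) * 100
```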
Definition example
The following shows an example alert definition for each metric described in the model file of the corresponding metric definition file.
■Alert definition example for metrics in Node exporter metric definition file

- cpu_used_rate#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: cpu_used_rate(Node exporter)
        expr: 80 < (avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="system"}[2m])) + avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="user"}[2m]))) * 100
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0301"
          jp1_pc_metricname: "node_cpu_seconds_total"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```

- memory_unused#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: memory_unused(Node exporter)
        expr: 1024 > node_memory_MemAvailable_bytes/1024/1024
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0302"
          jp1_pc_metricname: "node_memory_MemAvailable_bytes"
        annotations:
          jp1_pc_firing_description: "The amount of free memory has fallen below the threshold (1024 megabytes). value={{ $value }} megabytes"
          jp1_pc_resolved_description: "The amount of free memory has exceeded the threshold (1024 megabytes)."
```

- memory_unused_rate#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: memory_unused_rate(Node exporter)
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0303"
          jp1_pc_metricname: "node_memory_MemAvailable_bytes,node_memory_MemTotal_bytes"
        annotations:
          jp1_pc_firing_description: "Free-memory ratio has fallen below the threshold (10%). value={{ $value }}%"
          jp1_pc_resolved_description: "Free-memory ratio has exceeded the threshold (10%)."
```

- disk_unused#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: disk_unused(Node exporter)
        expr: 10 > node_filesystem_free_bytes/(1024*1024*1024)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0304"
          jp1_pc_metricname: "node_filesystem_free_bytes"
        annotations:
          jp1_pc_firing_description: "Free disk space has fallen below the threshold (10 gigabytes). value={{ $value }} gigabytes, mountpoint={{ $labels.mountpoint }}"
          jp1_pc_resolved_description: "Free disk space has exceeded the threshold (10 gigabytes). mountpoint={{ $labels.mountpoint }}"
```

Note: If you want to monitor both Node exporter and Node exporter for AIX on a single Prometheus, specify the job label in expr of the disk_unused alert definition for Node exporter to distinguish the Node exporter metric from the Node exporter for AIX metric, as in the following expression:

```yaml
expr: 10 > node_filesystem_free_bytes{job="jpc_node"}/(1024*1024*1024)
```

- disk_unused_rate#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: disk_unused_rate(Node exporter)
        expr: node_filesystem_free_bytes / node_filesystem_size_bytes * 100 < 10
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0305"
          jp1_pc_metricname: "node_filesystem_free_bytes,node_filesystem_size_bytes"
        annotations:
          jp1_pc_firing_description: "Free disk percentage has fallen below the threshold (10%). value={{ $value }}%, mountpoint={{ $labels.mountpoint }}"
          jp1_pc_resolved_description: "Free disk percentage has exceeded the threshold (10%). mountpoint={{ $labels.mountpoint }}"
```

Note: If you want to monitor both Node exporter and Node exporter for AIX on a single Prometheus, specify the job label in expr of the disk_unused_rate alert definition for Node exporter to distinguish the Node exporter metrics from the Node exporter for AIX metrics, as in the following expression:

```yaml
expr: node_filesystem_free_bytes{job="jpc_node"} / node_filesystem_size_bytes{job="jpc_node"} * 100 < 10
```

- disk_busy_rate#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: disk_busy_rate(Node exporter)
        expr: 70 < rate(node_disk_io_time_seconds_total[2m])*100
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0306"
          jp1_pc_metricname: "node_disk_io_time_seconds_total"
        annotations:
          jp1_pc_firing_description: "Disk busy rate has exceeded the threshold (70%). value={{ $value }}%, device={{ $labels.device }}"
          jp1_pc_resolved_description: "Disk busy rate has fallen below the threshold (70%). device={{ $labels.device }}"
```

- disk_read_latency#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: disk_read_latency(Node exporter)
        expr: rate(node_disk_read_time_seconds_total[2m]) / rate(node_disk_reads_completed_total[2m]) > 0.1 and rate(node_disk_reads_completed_total[2m]) > 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0307"
          jp1_pc_metricname: "node_disk_read_time_seconds_total,node_disk_reads_completed_total"
        annotations:
          jp1_pc_firing_description: "Disk read latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, device={{ $labels.device }}"
          jp1_pc_resolved_description: "Disk read latency has fallen below the threshold (0.1 seconds). device={{ $labels.device }}"
```

- disk_write_latency#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: disk_write_latency(Node exporter)
        expr: rate(node_disk_write_time_seconds_total[2m]) / rate(node_disk_writes_completed_total[2m]) > 0.1 and rate(node_disk_writes_completed_total[2m]) > 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0308"
          jp1_pc_metricname: "node_disk_write_time_seconds_total,node_disk_writes_completed_total"
        annotations:
          jp1_pc_firing_description: "Disk write latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, device={{ $labels.device }}"
          jp1_pc_resolved_description: "Disk write latency has fallen below the threshold (0.1 seconds). device={{ $labels.device }}"
```

- disk_io_latency#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: disk_io_latency(Node exporter)
        expr: (rate(node_disk_read_time_seconds_total[2m]) + rate(node_disk_write_time_seconds_total[2m])) / (rate(node_disk_reads_completed_total[2m]) + rate(node_disk_writes_completed_total[2m])) > 0.1 and (rate(node_disk_writes_completed_total[2m]) > 0 or rate(node_disk_reads_completed_total[2m]) > 0)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0309"
          jp1_pc_metricname: "node_disk_write_time_seconds_total,node_disk_writes_completed_total,node_disk_read_time_seconds_total,node_disk_reads_completed_total"
        annotations:
          jp1_pc_firing_description: "Disk IO latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, device={{ $labels.device }}"
          jp1_pc_resolved_description: "Disk IO latency has fallen below the threshold (0.1 seconds). device={{ $labels.device }}"
```

- network_sent#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: network_sent(Node exporter)
        expr: 100 < rate(node_network_transmit_packets_total[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0310"
          jp1_pc_metricname: "node_network_transmit_packets_total"
        annotations:
          jp1_pc_firing_description: "The network transmission speed has exceeded the threshold (100 packets per second). value={{ $value }} packets per second, device={{ $labels.device }}"
          jp1_pc_resolved_description: "The network transmission speed has fallen below the threshold (100 packets per second). device={{ $labels.device }}"
```

- network_received#

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: network_received(Node exporter)
        expr: 100 < rate(node_network_receive_packets_total[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0311"
          jp1_pc_metricname: "node_network_receive_packets_total"
        annotations:
          jp1_pc_firing_description: "The network receive speed has exceeded the threshold (100 packets per second). value={{ $value }} packets per second, device={{ $labels.device }}"
          jp1_pc_resolved_description: "The network receive speed has fallen below the threshold (100 packets per second). device={{ $labels.device }}"
```
#
If you define more than one alert on the same monitoring agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
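For example, to define two of the alerts above on the same host, list both rules under a single "groups:" entry with one group name instead of repeating "groups:". A minimal sketch (labels and annotations are omitted here for brevity; in an actual file, specify them as in the examples above):

```yaml
groups:
  - name: node_exporter                  # the group name appears only once
    rules:                               # every alert of the group goes here
      - alert: cpu_used_rate(Node exporter)
        expr: 80 < (avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="system"}[2m])) + avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="user"}[2m]))) * 100
        for: 3m
      - alert: memory_unused(Node exporter)
        expr: 1024 > node_memory_MemAvailable_bytes/1024/1024
        for: 3m
```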
■Alert definition example for metrics in Process exporter metric definition file

- process_pgm_process_count#

```yaml
groups:
  - name: process_exporter
    rules:
      - alert: process_pgm_process_count(Process exporter)
        expr: 1 > sum by (program, instance, job, jp1_pc_nodelabel, jp1_pc_exporter) (namedprocess_namegroup_num_procs)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1308"
          jp1_pc_metricname: "namedprocess_namegroup_num_procs"
        annotations:
          jp1_pc_firing_description: "The number of processes has fallen below the threshold (1 process)."
          jp1_pc_resolved_description: "The number of processes has exceeded the threshold (1 process)."
```
#
The threshold value of 1 is an example. Change this value based on the number of monitoring targets.
If you define more than one alert on the same integrated agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
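For example, to apply a different threshold to a particular program, one approach is to add a rule whose expression has a label matcher on program. A minimal sketch (the program name sample_proc and the threshold of 2 are illustrative assumptions):

```yaml
# Fires when fewer than 2 processes whose program label is "sample_proc" are running.
expr: 2 > sum by (program, instance, job, jp1_pc_nodelabel, jp1_pc_exporter) (namedprocess_namegroup_num_procs{program="sample_proc"})
```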
■Alert definition example for metrics in Node exporter (service monitoring) metric definition file

- service_state#

When the auto-start setting of the monitored unit is enabled (systemctl enable has been executed):

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: service_state(Node exporter)
        expr: node_systemd_unit_state{state="active"} == 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0320"
          jp1_pc_metricname: "node_systemd_unit_state"
        annotations:
          jp1_pc_firing_description: "The status of the service is not running."
          jp1_pc_resolved_description: "The service status is now running."
```

When the auto-start setting of the monitored unit is disabled:

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: service_state_service-name(Node exporter)
        expr: absent(node_systemd_unit_state{instance="integrated-agent-host-name:port-number-of-the-Node-exporter", job="jpc_node", jp1_pc_exporter="JPC Node exporter", jp1_pc_nodelabel="service-name", state="active"}) == 1
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0320"
          jp1_pc_metricname: "node_systemd_unit_state"
        annotations:
          jp1_pc_firing_description: "The status of the service is not running."
          jp1_pc_resolved_description: "The service status is now running."
```
#
If you define more than one alert on the same monitoring agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Windows exporter metric definition file

- cpu_used_rate#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: cpu_used_rate(Windows exporter)
        expr: 80 < 100 - (avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0401"
          jp1_pc_metricname: "windows_cpu_time_total"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```

- memory_unused#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: memory_unused(Windows exporter)
        expr: 1 > windows_memory_available_bytes/1024/1024/1024
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0402"
          jp1_pc_metricname: "windows_memory_available_bytes"
        annotations:
          jp1_pc_firing_description: "The amount of free memory has fallen below the threshold (1 gigabyte). value={{ $value }} gigabytes"
          jp1_pc_resolved_description: "The amount of free memory has exceeded the threshold (1 gigabyte)."
```

- memory_unused_rate#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: memory_unused_rate(Windows exporter)
        expr: windows_memory_available_bytes / windows_cs_physical_memory_bytes * 100 < 10
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0403"
          jp1_pc_metricname: "windows_memory_available_bytes,windows_cs_physical_memory_bytes"
        annotations:
          jp1_pc_firing_description: "Free-memory ratio has fallen below the threshold (10%). value={{ $value }}%"
          jp1_pc_resolved_description: "Free-memory ratio has exceeded the threshold (10%)."
```

- disk_unused#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: disk_unused(Windows exporter)
        expr: 10 > windows_logical_disk_free_bytes{volume!~"HarddiskVolume.*"} / (1024*1024*1024)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0404"
          jp1_pc_metricname: "windows_logical_disk_free_bytes"
        annotations:
          jp1_pc_firing_description: "Free disk space has fallen below the threshold (10 gigabytes). value={{ $value }} gigabytes, volume={{ $labels.volume }}"
          jp1_pc_resolved_description: "Free disk space has exceeded the threshold (10 gigabytes). volume={{ $labels.volume }}"
```

- disk_unused_rate#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: disk_unused_rate(Windows exporter)
        expr: windows_logical_disk_free_bytes{volume!~"HarddiskVolume.*"} / windows_logical_disk_size_bytes * 100 < 10
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0405"
          jp1_pc_metricname: "windows_logical_disk_free_bytes,windows_logical_disk_size_bytes"
        annotations:
          jp1_pc_firing_description: "Free disk percentage has fallen below the threshold (10%). value={{ $value }}%, volume={{ $labels.volume }}"
          jp1_pc_resolved_description: "Free disk percentage has exceeded the threshold (10%). volume={{ $labels.volume }}"
```

- disk_busy_rate#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: disk_busy_rate(Windows exporter)
        expr: 70 < 100 - rate(windows_logical_disk_idle_seconds_total{volume!~"HarddiskVolume.*"}[2m]) / (rate(windows_logical_disk_write_seconds_total[2m]) + rate(windows_logical_disk_read_seconds_total[2m]) + rate(windows_logical_disk_idle_seconds_total[2m])) * 100
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0406"
          jp1_pc_metricname: "windows_logical_disk_idle_seconds_total,windows_logical_disk_write_seconds_total,windows_logical_disk_read_seconds_total"
        annotations:
          jp1_pc_firing_description: "Disk busy rate has exceeded the threshold (70%). value={{ $value }}%, volume={{ $labels.volume }}"
          jp1_pc_resolved_description: "Disk busy rate has fallen below the threshold (70%). volume={{ $labels.volume }}"
```

- disk_read_latency#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: disk_read_latency(Windows exporter)
        expr: rate(windows_logical_disk_read_seconds_total[2m]) / rate(windows_logical_disk_reads_total[2m]) > 0.1 and rate(windows_logical_disk_reads_total[2m]) > 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0407"
          jp1_pc_metricname: "windows_logical_disk_read_seconds_total,windows_logical_disk_reads_total"
        annotations:
          jp1_pc_firing_description: "Disk read latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, volume={{ $labels.volume }}"
          jp1_pc_resolved_description: "Disk read latency has fallen below the threshold (0.1 seconds). volume={{ $labels.volume }}"
```

- disk_write_latency#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: disk_write_latency(Windows exporter)
        expr: rate(windows_logical_disk_write_seconds_total[2m]) / rate(windows_logical_disk_writes_total[2m]) > 0.1 and rate(windows_logical_disk_writes_total[2m]) > 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0408"
          jp1_pc_metricname: "windows_logical_disk_write_seconds_total,windows_logical_disk_writes_total"
        annotations:
          jp1_pc_firing_description: "Disk write latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, volume={{ $labels.volume }}"
          jp1_pc_resolved_description: "Disk write latency has fallen below the threshold (0.1 seconds). volume={{ $labels.volume }}"
```

- disk_io_latency#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: disk_io_latency(Windows exporter)
        expr: (rate(windows_logical_disk_read_seconds_total[2m]) + rate(windows_logical_disk_write_seconds_total[2m])) / (rate(windows_logical_disk_reads_total[2m]) + rate(windows_logical_disk_writes_total[2m])) > 0.1 and (rate(windows_logical_disk_writes_total[2m]) > 0 or rate(windows_logical_disk_reads_total[2m]) > 0)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0409"
          jp1_pc_metricname: "windows_logical_disk_write_seconds_total,windows_logical_disk_writes_total,windows_logical_disk_read_seconds_total,windows_logical_disk_reads_total"
        annotations:
          jp1_pc_firing_description: "Disk IO latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, volume={{ $labels.volume }}"
          jp1_pc_resolved_description: "Disk IO latency has fallen below the threshold (0.1 seconds). volume={{ $labels.volume }}"
```

- network_sent#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: network_sent(Windows exporter)
        expr: 100 < rate(windows_net_packets_sent_total[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0410"
          jp1_pc_metricname: "windows_net_packets_sent_total"
        annotations:
          jp1_pc_firing_description: "The network transmission speed has exceeded the threshold (100 packets per second). value={{ $value }} packets per second, nic={{ $labels.nic }}"
          jp1_pc_resolved_description: "The network transmission speed has fallen below the threshold (100 packets per second). nic={{ $labels.nic }}"
```

- network_received#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: network_received(Windows exporter)
        expr: 100 < rate(windows_net_packets_received_total[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0411"
          jp1_pc_metricname: "windows_net_packets_received_total"
        annotations:
          jp1_pc_firing_description: "The network receive speed has exceeded the threshold (100 packets per second). value={{ $value }} packets per second, nic={{ $labels.nic }}"
          jp1_pc_resolved_description: "The network receive speed has fallen below the threshold (100 packets per second). nic={{ $labels.nic }}"
```
#
If you define more than one alert on the same monitoring agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Windows exporter (process monitoring) metric definition file

- process_pgm_process_count#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: process_pgm_process_count(Windows exporter)
        expr: absent(windows_process_start_time{instance="integrated-agent-host-name:Windows-exporter-port-number", job="jpc_windows", jp1_pc_exporter="JPC Windows exporter", jp1_pc_nodelabel="monitored-process-name", process="monitored-process-name"}) == 1
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0414"
          jp1_pc_metricname: "windows_process_start_time"
        annotations:
          jp1_pc_firing_description: "The number of processes has fallen below the threshold (1 process)."
          jp1_pc_resolved_description: "The number of processes has exceeded the threshold (1 process)."
```
#
The threshold value of 1 is an example. Change this value based on the number of monitoring targets.
If you define more than one alert on the same integrated agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Windows exporter (service monitoring) metric definition file

- service_state#

```yaml
groups:
  - name: windows_exporter
    rules:
      - alert: service_state(Windows exporter)
        expr: windows_service_state{state="running"} == 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0420"
          jp1_pc_metricname: "windows_service_state"
        annotations:
          jp1_pc_firing_description: "The status of the service is not running. service={{ $labels.name }}"
          jp1_pc_resolved_description: "The service status is now running. service={{ $labels.name }}"
```
#
If you define more than one alert on the same monitoring agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Node exporter for AIX metric definition file

- cpu_used_rate#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: cpu_used_rate(Node exporter for AIX)
        expr: 80 < (avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu{mode="sys"}[2m])) + avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu{mode="user"}[2m]))) * 100
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1801"
          jp1_pc_metricname: "node_cpu"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```

- memory_unused#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: memory_unused(Node exporter for AIX)
        expr: 1 > aix_memory_real_avail/1024/1024/1024*4096
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1802"
          jp1_pc_metricname: "aix_memory_real_avail"
        annotations:
          jp1_pc_firing_description: "The amount of free memory has fallen below the threshold (1 gigabyte). value={{ $value }} gigabytes"
          jp1_pc_resolved_description: "The amount of free memory has exceeded the threshold (1 gigabyte)."
```

- memory_unused_rate#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: memory_unused_rate(Node exporter for AIX)
        expr: aix_memory_real_avail / aix_memory_real_total * 100 < 10
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1803"
          jp1_pc_metricname: "aix_memory_real_avail,aix_memory_real_total"
        annotations:
          jp1_pc_firing_description: "Free-memory percentage has fallen below the threshold (10%). value={{ $value }}%"
          jp1_pc_resolved_description: "Free-memory percentage has exceeded the threshold (10%)."
```

- disk_unused#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: disk_unused(Node exporter for AIX)
        expr: 10 > node_filesystem_free_bytes{job="jpc_node_aix"}/(1024*1024*1024)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1804"
          jp1_pc_metricname: "node_filesystem_free_bytes"
        annotations:
          jp1_pc_firing_description: "Free disk space has fallen below the threshold (10 gigabytes). value={{ $value }} gigabytes, mountpoint={{ $labels.mountpoint }}"
          jp1_pc_resolved_description: "Free disk space has exceeded the threshold (10 gigabytes). mountpoint={{ $labels.mountpoint }}"
```

Note: If you want to monitor both Node exporter and Node exporter for AIX on a single Prometheus, specify the job label in expr of disk_unused in the Node exporter alert definition to distinguish the Node exporter metric from the Node exporter for AIX metric. For details, see disk_unused in Alert definition example for metrics in Node exporter metric definition file above.

- disk_unused_rate#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: disk_unused_rate(Node exporter for AIX)
        expr: node_filesystem_free_bytes{job="jpc_node_aix"} / node_filesystem_size_bytes{job="jpc_node_aix"} * 100 < 10
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1805"
          jp1_pc_metricname: "node_filesystem_free_bytes,node_filesystem_size_bytes"
        annotations:
          jp1_pc_firing_description: "Free disk percentage has fallen below the threshold (10%). value={{ $value }}%, mountpoint={{ $labels.mountpoint }}"
          jp1_pc_resolved_description: "Free disk percentage has exceeded the threshold (10%). mountpoint={{ $labels.mountpoint }}"
```

Note: If you want to monitor both Node exporter and Node exporter for AIX on a single Prometheus, specify the job label in expr of disk_unused_rate in the Node exporter alert definition to distinguish the metrics. For details, see disk_unused_rate in Alert definition example for metrics in Node exporter metric definition file above.

- disk_busy_rate#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: disk_busy_rate(Node exporter for AIX)
        expr: 70 < rate(aix_disk_time[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1806"
          jp1_pc_metricname: "aix_disk_time"
        annotations:
          jp1_pc_firing_description: "Disk busy rate has exceeded the threshold (70%). value={{ $value }}%, disk={{ $labels.disk }}"
          jp1_pc_resolved_description: "Disk busy rate has fallen below the threshold (70%). disk={{ $labels.disk }}"
```

- disk_read_latency#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: disk_read_latency(Node exporter for AIX)
        expr: rate(aix_disk_rserv[2m]) / rate(aix_disk_xrate[2m])/1000/1000/1000 > 0.1 and rate(aix_disk_xrate[2m]) > 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1807"
          jp1_pc_metricname: "aix_disk_rserv,aix_disk_xrate"
        annotations:
          jp1_pc_firing_description: "Disk read latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, disk={{ $labels.disk }}"
          jp1_pc_resolved_description: "Disk read latency has fallen below the threshold (0.1 seconds). disk={{ $labels.disk }}"
```

- disk_write_latency#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: disk_write_latency(Node exporter for AIX)
        expr: rate(aix_disk_wserv[2m]) / (rate(aix_disk_xfers[2m]) - rate(aix_disk_xrate[2m]))/1000/1000/1000 > 0.1 and (rate(aix_disk_xfers[2m]) - rate(aix_disk_xrate[2m])) > 0
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1808"
          jp1_pc_metricname: "aix_disk_wserv,aix_disk_xfers,aix_disk_xrate"
        annotations:
          jp1_pc_firing_description: "Disk write latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, disk={{ $labels.disk }}"
          jp1_pc_resolved_description: "Disk write latency has fallen below the threshold (0.1 seconds). disk={{ $labels.disk }}"
```

- disk_io_latency#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: disk_io_latency(Node exporter for AIX)
        expr: (rate(aix_disk_rserv[2m]) + rate(aix_disk_wserv[2m])) / rate(aix_disk_xfers[2m])/1000/1000/1000 > 0.1 and (rate(aix_disk_xfers[2m]) > 0)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1809"
          jp1_pc_metricname: "aix_disk_wserv,aix_disk_rserv,aix_disk_xfers"
        annotations:
          jp1_pc_firing_description: "Disk IO latency has exceeded the threshold (0.1 seconds). value={{ $value }} seconds, disk={{ $labels.disk }}"
          jp1_pc_resolved_description: "Disk IO latency has fallen below the threshold (0.1 seconds). disk={{ $labels.disk }}"
```

- network_sent#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: network_sent(Node exporter for AIX)
        expr: 100 < rate(aix_netinterface_opackets[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1810"
          jp1_pc_metricname: "aix_netinterface_opackets"
        annotations:
          jp1_pc_firing_description: "The network transmission speed has exceeded the threshold (100 packets per second). value={{ $value }} packets per second, netinterface={{ $labels.netinterface }}"
          jp1_pc_resolved_description: "The network transmission speed has fallen below the threshold (100 packets per second). netinterface={{ $labels.netinterface }}"
```

- network_received#

```yaml
groups:
  - name: node_exporter_AIX
    rules:
      - alert: network_received(Node exporter for AIX)
        expr: 100 < rate(aix_netinterface_ipackets[2m])
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "1811"
          jp1_pc_metricname: "aix_netinterface_ipackets"
        annotations:
          jp1_pc_firing_description: "The network receive speed has exceeded the threshold (100 packets per second). value={{ $value }} packets per second, netinterface={{ $labels.netinterface }}"
          jp1_pc_resolved_description: "The network receive speed has fallen below the threshold (100 packets per second). netinterface={{ $labels.netinterface }}"
```
#
If you define more than one alert on the same monitoring agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Blackbox exporter metric definition file

- probe_success#

```yaml
groups:
  - name: blackbox_exporter
    rules:
      - alert: probe_success(Blackbox exporter)
        expr: 0 == probe_success
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0501"
          jp1_pc_metricname: "probe_success"
        annotations:
          jp1_pc_firing_description: "Communication failed. value={{ $value }}"
          jp1_pc_resolved_description: "Communication was successful."
```

- probe_duration_seconds#

```yaml
groups:
  - name: blackbox_exporter
    rules:
      - alert: probe_duration_seconds(Blackbox exporter)
        expr: 5 < probe_duration_seconds
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0502"
          jp1_pc_metricname: "probe_duration_seconds"
        annotations:
          jp1_pc_firing_description: "The probe period has exceeded the threshold (5 seconds). value={{ $value }} seconds"
          jp1_pc_resolved_description: "The probe period has fallen below the threshold (5 seconds)."
```

- probe_icmp_duration_seconds#

```yaml
groups:
  - name: blackbox_exporter
    rules:
      - alert: probe_icmp_duration_seconds(Blackbox exporter)
        expr: 3 < probe_icmp_duration_seconds
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0503"
          jp1_pc_metricname: "probe_icmp_duration_seconds"
        annotations:
          jp1_pc_firing_description: "The ICMP period has exceeded the threshold (3 seconds). value={{ $value }} seconds, phase={{ $labels.phase }}"
          jp1_pc_resolved_description: "The ICMP period has fallen below the threshold (3 seconds)."
```

- probe_http_duration_seconds#

```yaml
groups:
  - name: blackbox_exporter
    rules:
      - alert: probe_http_duration_seconds(Blackbox exporter)
        expr: 3 < probe_http_duration_seconds
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0504"
          jp1_pc_metricname: "probe_http_duration_seconds"
        annotations:
          jp1_pc_firing_description: "The HTTP request period has exceeded the threshold (3 seconds). value={{ $value }} seconds, phase={{ $labels.phase }}"
          jp1_pc_resolved_description: "The HTTP request period has fallen below the threshold (3 seconds)."
```

- probe_http_status_code#

```yaml
groups:
  - name: blackbox_exporter
    rules:
      - alert: probe_http_status_code(Blackbox exporter)
        expr: 200 != probe_http_status_code
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0505"
          jp1_pc_metricname: "probe_http_status_code"
        annotations:
          jp1_pc_firing_description: "The HTTP status is not 200. value={{ $value }}"
          jp1_pc_resolved_description: "The HTTP status is now 200."
```
#
If you define more than one alert on the same monitoring agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Yet another cloudwatch exporter metric definition file

- aws_ec2_cpuutilization_average#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_ec2_cpuutilization_average(Yet another cloudwatch exporter)
        expr: 80 < aws_ec2_cpuutilization_average
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0601"
          jp1_pc_metricname: "aws_ec2_cpuutilization_average"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```

- aws_ec2_disk_read_bytes_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_ec2_disk_read_bytes_sum(Yet another cloudwatch exporter)
        expr: 10240 < aws_ec2_disk_read_bytes_sum / 1024
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0602"
          jp1_pc_metricname: "aws_ec2_disk_read_bytes_sum"
        annotations:
          jp1_pc_firing_description: "The amount of data read in KB from instance store volumes has exceeded the threshold (10,240 KB). value={{ $value }} KB"
          jp1_pc_resolved_description: "The amount of data read in KB from instance store volumes has fallen below the threshold (10,240 KB)."
```

- aws_ec2_disk_write_bytes_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_ec2_disk_write_bytes_sum(Yet another cloudwatch exporter)
        expr: 10240 < aws_ec2_disk_write_bytes_sum / 1024
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0603"
          jp1_pc_metricname: "aws_ec2_disk_write_bytes_sum"
        annotations:
          jp1_pc_firing_description: "The amount of data written in KB to instance store volumes has exceeded the threshold (10,240 KB). value={{ $value }} KB"
          jp1_pc_resolved_description: "The amount of data written in KB to instance store volumes has fallen below the threshold (10,240 KB)."
```

- aws_lambda_errors_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_lambda_errors_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_lambda_errors_sum{dimension_Resource=""}
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0604"
          jp1_pc_metricname: "aws_lambda_errors_sum"
        annotations:
          jp1_pc_firing_description: "The number of invocations that resulted in a function error has exceeded the threshold (0 invocations). value={{ $value }} invocations"
          jp1_pc_resolved_description: "The number of invocations that resulted in a function error has fallen below the threshold (0 invocations)."
```

- aws_lambda_duration_average#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_lambda_duration_average(Yet another cloudwatch exporter)
        expr: 5000 < aws_lambda_duration_average{dimension_Resource=""}
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0605"
          jp1_pc_metricname: "aws_lambda_duration_average"
        annotations:
          jp1_pc_firing_description: "The amount of time that the function code spends processing an event has exceeded the threshold (5000 milliseconds). value={{ $value }} milliseconds"
          jp1_pc_resolved_description: "The amount of time that the function code spends processing an event has fallen below the threshold (5000 milliseconds)."
```

- aws_s3_bucket_size_bytes_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_s3_bucket_size_bytes_sum(Yet another cloudwatch exporter)
        expr: 1024 < aws_s3_bucket_size_bytes_sum / (1024*1024*1024)
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0606"
          jp1_pc_metricname: "aws_s3_bucket_size_bytes_sum"
        annotations:
          jp1_pc_firing_description: "The amount of data stored in the bucket has exceeded the threshold (1024 gigabytes). value={{ $value }} gigabytes"
          jp1_pc_resolved_description: "The amount of data stored in the bucket has fallen below the threshold (1024 gigabytes)."
```

- aws_s3_5xx_errors_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_s3_5xx_errors_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_s3_5xx_errors_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0607"
          jp1_pc_metricname: "aws_s3_5xx_errors_sum"
        annotations:
          jp1_pc_firing_description: "The number of HTTP 5xx server error status codes returned for requests to the bucket has exceeded the threshold (0 errors). value={{ $value }} errors"
          jp1_pc_resolved_description: "The number of HTTP 5xx server error status codes returned for requests to the bucket has fallen below the threshold (0 errors)."
```

- aws_dynamodb_consumed_read_capacity_units_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_dynamodb_consumed_read_capacity_units_sum(Yet another cloudwatch exporter)
        expr: 600 < aws_dynamodb_consumed_read_capacity_units_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0608"
          jp1_pc_metricname: "aws_dynamodb_consumed_read_capacity_units_sum"
        annotations:
          jp1_pc_firing_description: "The total number of consumed read capacity units has exceeded the threshold (600 units). value={{ $value }} units"
          jp1_pc_resolved_description: "The total number of consumed read capacity units has fallen below the threshold (600 units)."
```

- aws_dynamodb_consumed_write_capacity_units_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_dynamodb_consumed_write_capacity_units_sum(Yet another cloudwatch exporter)
        expr: 600 < aws_dynamodb_consumed_write_capacity_units_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0609"
          jp1_pc_metricname: "aws_dynamodb_consumed_write_capacity_units_sum"
        annotations:
          jp1_pc_firing_description: "The total number of consumed write capacity units has exceeded the threshold (600 units). value={{ $value }} units"
          jp1_pc_resolved_description: "The total number of consumed write capacity units has fallen below the threshold (600 units)."
```

- aws_states_execution_time_average#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_states_execution_time_average(Yet another cloudwatch exporter)
        expr: 5000 < aws_states_execution_time_average
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0610"
          jp1_pc_metricname: "aws_states_execution_time_average"
        annotations:
          jp1_pc_firing_description: "The Step Functions execution time has exceeded the threshold (5000 milliseconds). value={{ $value }} milliseconds"
          jp1_pc_resolved_description: "The Step Functions execution time has fallen below the threshold (5000 milliseconds)."
```

- aws_states_executions_failed_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_states_executions_failed_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_states_executions_failed_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0611"
          jp1_pc_metricname: "aws_states_executions_failed_sum"
        annotations:
          jp1_pc_firing_description: "The number of failed Step Functions executions has exceeded the threshold (0 executions). value={{ $value }} executions"
          jp1_pc_resolved_description: "The number of failed Step Functions executions has fallen below the threshold (0 executions)."
```

- aws_sqs_approximate_number_of_messages_delayed_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_sqs_approximate_number_of_messages_delayed_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_sqs_approximate_number_of_messages_delayed_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0612"
          jp1_pc_metricname: "aws_sqs_approximate_number_of_messages_delayed_sum"
        annotations:
          jp1_pc_firing_description: "The number of delayed queue messages has exceeded the threshold (0 messages). value={{ $value }} messages"
          jp1_pc_resolved_description: "The number of delayed queue messages has fallen below the threshold (0 messages)."
```

- aws_sqs_number_of_messages_deleted_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_sqs_number_of_messages_deleted_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_sqs_number_of_messages_deleted_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0613"
          jp1_pc_metricname: "aws_sqs_number_of_messages_deleted_sum"
        annotations:
          jp1_pc_firing_description: "The number of deleted queue messages has exceeded the threshold (0 messages). value={{ $value }} messages"
          jp1_pc_resolved_description: "The number of deleted queue messages has fallen below the threshold (0 messages)."
```

- aws_ecs_cpuutilization_average#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_ecs_cpuutilization_average(Yet another cloudwatch exporter)
        expr: 80 < aws_ecs_cpuutilization_average
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0614"
          jp1_pc_metricname: "aws_ecs_cpuutilization_average"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```

- aws_ecs_memory_utilization_average#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_ecs_memory_utilization_average(Yet another cloudwatch exporter)
        expr: 80 < aws_ecs_memory_utilization_average
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0615"
          jp1_pc_metricname: "aws_ecs_memory_utilization_average"
        annotations:
          jp1_pc_firing_description: "Memory usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "Memory usage has fallen below the threshold (80%)."
```

- aws_rds_cpuutilization_average#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_rds_cpuutilization_average(Yet another cloudwatch exporter)
        expr: 80 < aws_rds_cpuutilization_average
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0616"
          jp1_pc_metricname: "aws_rds_cpuutilization_average"
        annotations:
          jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
```

- aws_sns_number_of_notifications_failed_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_sns_number_of_notifications_failed_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_sns_number_of_notifications_failed_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0617"
          jp1_pc_metricname: "aws_sns_number_of_notifications_failed_sum"
        annotations:
          jp1_pc_firing_description: "The number of failed messages has exceeded the threshold (0 messages). value={{ $value }} messages"
          jp1_pc_resolved_description: "The number of failed messages has fallen below the threshold (0 messages)."
```

- aws_sns_number_of_notifications_filtered_out_sum#

```yaml
groups:
  - name: yet_another_cloudwatch_exporter
    rules:
      - alert: aws_sns_number_of_notifications_filtered_out_sum(Yet another cloudwatch exporter)
        expr: 0 < aws_sns_number_of_notifications_filtered_out_sum
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0618"
          jp1_pc_metricname: "aws_sns_number_of_notifications_filtered_out_sum"
        annotations:
          jp1_pc_firing_description: "The number of filtered-out messages has exceeded the threshold (0 messages). value={{ $value }} messages"
          jp1_pc_resolved_description: "The number of filtered-out messages has fallen below the threshold (0 messages)."
```
#
If you define more than one alert on the same integrated agent host, be careful not to specify a duplicate "groups:" line or more than one group with the same group name.
■Alert definition example for metrics in Promitor metric definition file
-
azure_virtual_machine_disk_read_bytes_total#
groups: - name: promitor rules: - alert: azure_virtual_machine_disk_read_bytes_total(Promitor) expr: 10485760 < azure_virtual_machine_disk_read_bytes_total for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0901" jp1_pc_metricname: "azure_virtual_machine_disk_read_bytes_total" annotations: jp1_pc_firing_description: "The number of disk read bytes has exceeded the threshold (10485760 bytes). value={{ $value }} bytes" jp1_pc_resolved_description: "The number of disk read bytes has fallen below the threshold (10485760 bytes)." -
azure_virtual_machine_disk_write_bytes_total#
groups: - name: promitor rules: - alert: azure_virtual_machine_disk_write_bytes_total(Promitor) expr: 10485760 < azure_virtual_machine_disk_write_bytes_total for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0902" jp1_pc_metricname: "azure_virtual_machine_disk_write_bytes_total" annotations: jp1_pc_firing_description: "The number of disk write bytes has exceeded the threshold (10485760 bytes). value={{ $value }} bytes" jp1_pc_resolved_description: "The number of disk write bytes has fallen below the threshold (10485760 bytes)." -
azure_virtual_machine_percentage_cpu_average#
groups: - name: promitor rules: - alert: azure_virtual_machine_percentage_cpu_average(Promitor) expr: 80 < azure_virtual_machine_percentage_cpu_average for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0903" jp1_pc_metricname: "azure_virtual_machine_percentage_cpu_average" annotations: jp1_pc_firing_description: "The percentage of allocated compute units has exceeded the threshold (80%). value={{ $value }}%" jp1_pc_resolved_description: "The percentage of allocated compute units has fallen below the threshold (80%)." -
azure_blob_storage_availability_average#
groups: - name: promitor rules: - alert: azure_blob_storage_availability_average(Promitor) expr: 100 > azure_blob_storage_availability_average for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0904" jp1_pc_metricname: "azure_blob_storage_availability_average" annotations: jp1_pc_firing_description: "The percentage of availability has fallen below the threshold (100%). value={{ $value }}%" jp1_pc_resolved_description: "The percentage of availability has exceeded the threshold (100%)." -
azure_blob_storage_blob_capacity_average#
groups: - name: promitor rules: - alert: azure_blob_storage_blob_capacity_average(Promitor) expr: 1099511627776 < azure_blob_storage_blob_capacity_average for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0905" jp1_pc_metricname: "azure_blob_storage_blob_capacity_average" annotations: jp1_pc_firing_description: "The storage capacity has exceeded the threshold (1099511627776 bytes). value={{ $value }} bytes" jp1_pc_resolved_description: "The storage capacity has fallen below the threshold (1099511627776 bytes)." -
azure_function_app_http5xx_total#
groups: - name: promitor rules: - alert: azure_function_app_http5xx_total(Promitor) expr: 0 < azure_function_app_http5xx_total for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0906" jp1_pc_metricname: "azure_function_app_http5xx_total" annotations: jp1_pc_firing_description: "The number of 5xx server errors has exceeded the threshold (0 errors). value={{ $value }} errors" jp1_pc_resolved_description: "The number of 5xx server errors has fallen below the threshold (0 errors)." -
azure_function_app_http_response_time_average#
groups: - name: promitor rules: - alert: azure_function_app_http_response_time_average(Promitor) expr: 5 < azure_function_app_http_response_time_average for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0907" jp1_pc_metricname: "azure_function_app_http_response_time_average" annotations: jp1_pc_firing_description: "The response time has exceeded the threshold (5 seconds). value={{ $value }} seconds" jp1_pc_resolved_description: "The response time has fallen below the threshold (5 seconds)." -
azure_cosmos_db_total_request_units_total#
groups: - name: promitor rules: - alert: azure_cosmos_db_total_request_units_total(Promitor) expr: 600 < azure_cosmos_db_total_request_units_total for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0908" jp1_pc_metricname: "azure_cosmos_db_total_request_units_total" annotations: jp1_pc_firing_description: "The number of consumed request units has exceeded the threshold (600 units). value={{ $value }} units, collectionname={{ $labels.collectionname }}" jp1_pc_resolved_description: "The number of consumed request units has fallen below the threshold (600 units). collectionname={{ $labels.collectionname }}" -
azure_logic_app_runs_failed_total#
groups: - name: promitor rules: - alert: azure_logic_app_runs_failed_total(Promitor) expr: 0 < azure_logic_app_runs_failed_total for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0910" jp1_pc_metricname: "azure_logic_app_runs_failed_total" annotations: jp1_pc_firing_description: "The number of workflow errors has exceeded the threshold (0 errors). value={{ $value }} errors" jp1_pc_resolved_description: "The number of workflow errors has fallen below the threshold (0 errors)." -
azure_container_instance_cpu_usage_average#
groups: - name: promitor rules: - alert: azure_container_instance_cpu_usage_average(Promitor) expr: 800 < azure_container_instance_cpu_usage_average for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0911" jp1_pc_firing_description: "CPU usage (millicores) has exceeded the threshold (800 millicores). value={{ $value }} millicores" jp1_pc_resolved_description: "CPU usage (millicores) has fallen below the threshold (800 millicores)." -
azure_kubernetes_service_kube_pod_status_phase_average_failed#
groups: - name: promitor rules: - alert: azure_kubernetes_service_kube_pod_status_phase_average_failed(Promitor) expr: 0 < azure_kubernetes_service_kube_pod_status_phase_average_failed for: 3m labels: jp1_pc_product_name: "/HITACHI/JP1/JPCCS2" jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO" jp1_pc_severity: "Error" jp1_pc_eventid: "0912" jp1_pc_metricname: "azure_kubernetes_service_kube_pod_status_phase_average_failed" annotations: jp1_pc_firing_description: "The number of failed pods has exceeded the threshold (0 pods). value={{ $value }} pods" jp1_pc_resolved_description: "The number of failed pods has fallen below the threshold (0 pods)." -
azure_kubernetes_service_kube_pod_status_phase_average_pending#
groups:
- name: promitor
  rules:
  - alert: azure_kubernetes_service_kube_pod_status_phase_average_pending(Promitor)
    expr: 0 < azure_kubernetes_service_kube_pod_status_phase_average_pending
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0913"
      jp1_pc_metricname: "azure_kubernetes_service_kube_pod_status_phase_average_pending"
    annotations:
      jp1_pc_firing_description: "The number of pending pods has exceeded the threshold (0 pods). value={{ $value }} pods"
      jp1_pc_resolved_description: "The number of pending pods has fallen below the threshold (0 pods)."
-
azure_kubernetes_service_kube_pod_status_phase_average_unknown#
groups:
- name: promitor
  rules:
  - alert: azure_kubernetes_service_kube_pod_status_phase_average_unknown(Promitor)
    expr: 0 < azure_kubernetes_service_kube_pod_status_phase_average_unknown
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0914"
      jp1_pc_metricname: "azure_kubernetes_service_kube_pod_status_phase_average_unknown"
    annotations:
      jp1_pc_firing_description: "The number of unknown pods has exceeded the threshold (0 pods). value={{ $value }} pods"
      jp1_pc_resolved_description: "The number of unknown pods has fallen below the threshold (0 pods)."
-
azure_file_storage_availability_average#
groups:
- name: promitor
  rules:
  - alert: azure_file_storage_availability_average(Promitor)
    expr: 100 > azure_file_storage_availability_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0915"
      jp1_pc_metricname: "azure_file_storage_availability_average"
    annotations:
      jp1_pc_firing_description: "The percentage of availability has fallen below the threshold (100%). value={{ $value }}%, fileshare={{ $labels.fileshare }}"
      jp1_pc_resolved_description: "The percentage of availability has exceeded the threshold (100%). fileshare={{ $labels.fileshare }}"
-
azure_service_bus_namespace_deadlettered_messages_average#
groups:
- name: promitor
  rules:
  - alert: azure_service_bus_namespace_deadlettered_messages_average(Promitor)
    expr: 0 < azure_service_bus_namespace_deadlettered_messages_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0916"
      jp1_pc_metricname: "azure_service_bus_namespace_deadlettered_messages_average"
    annotations:
      jp1_pc_firing_description: "The number of dead-lettered messages has exceeded the threshold (0 messages). value={{ $value }} messages, entity_name={{ $labels.entity_name }}"
      jp1_pc_resolved_description: "The number of dead-lettered messages has fallen below the threshold (0 messages). entity_name={{ $labels.entity_name }}"
-
azure_sql_database_cpu_percent_average#
groups:
- name: promitor
  rules:
  - alert: azure_sql_database_cpu_percent_average(Promitor)
    expr: 80 < azure_sql_database_cpu_percent_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0917"
      jp1_pc_metricname: "azure_sql_database_cpu_percent_average"
    annotations:
      jp1_pc_firing_description: "CPU percentage has exceeded the threshold (80%). value={{ $value }}%, server={{ $labels.server }}"
      jp1_pc_resolved_description: "CPU percentage has fallen below the threshold (80%). server={{ $labels.server }}"
-
azure_sql_elastic_pool_cpu_percent_average#
groups:
- name: promitor
  rules:
  - alert: azure_sql_elastic_pool_cpu_percent_average(Promitor)
    expr: 80 < azure_sql_elastic_pool_cpu_percent_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0920"
      jp1_pc_metricname: "azure_sql_elastic_pool_cpu_percent_average"
    annotations:
      jp1_pc_firing_description: "CPU percentage has exceeded the threshold (80%). value={{ $value }}%, server={{ $labels.server }}"
      jp1_pc_resolved_description: "CPU percentage has fallen below the threshold (80%). server={{ $labels.server }}"
-
azure_sql_managed_instance_avg_cpu_percent_average#
groups:
- name: promitor
  rules:
  - alert: azure_sql_managed_instance_avg_cpu_percent_average(Promitor)
    expr: 80 < azure_sql_managed_instance_avg_cpu_percent_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0922"
      jp1_pc_metricname: "azure_sql_managed_instance_avg_cpu_percent_average"
    annotations:
      jp1_pc_firing_description: "Average CPU percentage has exceeded the threshold (80%). value={{ $value }}%"
      jp1_pc_resolved_description: "Average CPU percentage has fallen below the threshold (80%)."
-
azure_sql_managed_instance_io_bytes_read_average#
groups:
- name: promitor
  rules:
  - alert: azure_sql_managed_instance_io_bytes_read_average(Promitor)
    expr: 10485760 < azure_sql_managed_instance_io_bytes_read_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0923"
      jp1_pc_metricname: "azure_sql_managed_instance_io_bytes_read_average"
    annotations:
      jp1_pc_firing_description: "The number of IO bytes read has exceeded the threshold (10485760 bytes). value={{ $value }} bytes"
      jp1_pc_resolved_description: "The number of IO bytes read has fallen below the threshold (10485760 bytes)."
-
azure_sql_managed_instance_io_bytes_written_average#
groups:
- name: promitor
  rules:
  - alert: azure_sql_managed_instance_io_bytes_written_average(Promitor)
    expr: 10485760 < azure_sql_managed_instance_io_bytes_written_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0924"
      jp1_pc_metricname: "azure_sql_managed_instance_io_bytes_written_average"
    annotations:
      jp1_pc_firing_description: "The number of IO bytes written has exceeded the threshold (10485760 bytes). value={{ $value }} bytes"
      jp1_pc_resolved_description: "The number of IO bytes written has fallen below the threshold (10485760 bytes)."
- #
-
When you define multiple alerts on the same integrated agent host, do not specify "groups:" more than once, and do not specify the same group name in more than one "name:" entry.
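For example, instead of repeating "groups:" for each alert, you can collect multiple rules under a single group. The following is a minimal sketch that combines two of the Promitor examples above into one group:
groups:
- name: promitor
  rules:
  - alert: azure_function_app_http5xx_total(Promitor)
    expr: 0 < azure_function_app_http5xx_total
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0906"
      jp1_pc_metricname: "azure_function_app_http5xx_total"
    annotations:
      jp1_pc_firing_description: "The number of 5xx server errors has exceeded the threshold (0 errors). value={{ $value }} errors"
      jp1_pc_resolved_description: "The number of 5xx server errors has fallen below the threshold (0 errors)."
  - alert: azure_logic_app_runs_failed_total(Promitor)
    expr: 0 < azure_logic_app_runs_failed_total
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/AZURE/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0910"
      jp1_pc_metricname: "azure_logic_app_runs_failed_total"
    annotations:
      jp1_pc_firing_description: "The number of workflow errors has exceeded the threshold (0 errors). value={{ $value }} errors"
      jp1_pc_resolved_description: "The number of workflow errors has fallen below the threshold (0 errors)."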
■Alert definition example for metrics in Script exporter metric definition file
-
script_success#1
groups:
- name: script_exporter
  rules:
  - alert: script_success(Script exporter)
    expr: 0 == script_success
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1401"
      jp1_pc_metricname: "script_success"
    annotations:
      jp1_pc_firing_description: "Failed to execute script. value={{ $value }}"
      jp1_pc_resolved_description: "Script successfully executed."
-
script_duration_seconds#1, #2
groups:
- name: script_exporter
  rules:
  - alert: script_duration_seconds(Script exporter)
    expr: 60 < script_duration_seconds
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1402"
      jp1_pc_metricname: "script_duration_seconds"
    annotations:
      jp1_pc_firing_description: "The script execution time has exceeded the threshold (60 seconds). value={{ $value }} seconds"
      jp1_pc_resolved_description: "The script execution time has fallen below the threshold (60 seconds)."
-
script_exit_code#1
groups:
- name: script_exporter
  rules:
  - alert: script_exit_code(Script exporter)
    expr: 0 != script_exit_code
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1403"
      jp1_pc_metricname: "script_exit_code"
    annotations:
      jp1_pc_firing_description: "Failed to execute script. value={{ $value }}"
      jp1_pc_resolved_description: "Script successfully executed."
- #1
-
When you define multiple alerts on the same integrated agent host, do not specify "groups:" more than once, and do not specify the same group name in more than one "name:" entry.
- #2
-
This uses a threshold value of 60 as an example. Change this value based on the number of monitoring targets.
■Alert definition example for metrics in OracleDB exporter metric definition file
-
oracledb_up#
groups:
- name: oracledb_exporter
  rules:
  - alert: oracledb_down(OracleDB exporter)
    expr: oracledb_up != 1
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0801"
      jp1_pc_metricname: "oracledb_up"
    annotations:
      jp1_pc_firing_description: "OracleDB stopped. instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "OracleDB started. instance={{ $labels.instance }}"
-
cache_hit_ratio_percent#
groups:
- name: oracledb_exporter
  rules:
  - alert: cache_hit_ratio_percentage_under_60(OracleDB exporter)
    expr: (1 - (rate(oracledb_activity_physical_reads_cache[2m]) / (rate(oracledb_activity_consistent_gets_from_cache[2m]) + rate(oracledb_activity_db_block_gets_from_cache[2m])))) * 100 < 60
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0802"
      jp1_pc_metricname: "oracledb_activity_physical_reads_cache,oracledb_activity_consistent_gets_from_cache,oracledb_activity_db_block_gets_from_cache"
    annotations:
      jp1_pc_firing_description: "Cache hit rate for OracleDB dropped below 60%. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "OracleDB cache hit rate is now over 60%. instance={{ $labels.instance }}"
-
tablespace_used_percent#
groups:
- name: oracledb_exporter
  rules:
  - alert: oracledb_tablespace_used_percent_over_90(OracleDB exporter)
    expr: oracledb_tablespace_used_percent > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0803"
      jp1_pc_metricname: "oracledb_tablespace_used_percent"
    annotations:
      jp1_pc_firing_description: "Tablespace usage for OracleDB exceeded 90%. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "Tablespace usage for OracleDB is 90% or less. instance={{ $labels.instance }}"
-
execute_count#
groups:
- name: oracledb_exporter
  rules:
  - alert: oracledb_activity_execute_count_over_1000(OracleDB exporter)
    expr: rate(oracledb_activity_execute_count[2m])*60 > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0804"
      jp1_pc_metricname: "oracledb_activity_execute_count"
    annotations:
      jp1_pc_firing_description: "SQL statements were executed more than 1000 times per minute. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The number of SQL statement executions per minute is 1000 or less. instance={{ $labels.instance }}"
-
parse_count#
Create this alert by referring to the execute_count example; a sketch is shown below.
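The following is a minimal sketch of such a rule, adapted from the execute_count example above. The metric name oracledb_activity_parse_count and the event ID "0805" are illustrative assumptions only; replace them with the metric name and event ID defined for your environment.
groups:
- name: oracledb_exporter
  rules:
  - alert: oracledb_activity_parse_count_over_1000(OracleDB exporter)
    # Assumed metric name; use the name from your OracleDB exporter metric definition file.
    expr: rate(oracledb_activity_parse_count[2m])*60 > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0805"
      jp1_pc_metricname: "oracledb_activity_parse_count"
    annotations:
      jp1_pc_firing_description: "SQL statements were parsed more than 1000 times per minute. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The number of SQL statement parses per minute is 1000 or less. instance={{ $labels.instance }}"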
-
user_commit_count#
Create this alert by referring to the execute_count example.
-
user_rollback_count#
Create this alert by referring to the execute_count example.
-
resource_used#
Create this alert by referring to the tablespace_used_percent example.
-
session_count#
Create this alert by referring to the tablespace_used_percent example; a sketch is shown below.
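The following is a minimal sketch following the tablespace_used_percent pattern (a simple gauge compared against a threshold). The metric name oracledb_sessions_value, the threshold of 100 sessions, and the event ID "0806" are illustrative assumptions only; substitute the values defined in your OracleDB exporter metric definition file.
groups:
- name: oracledb_exporter
  rules:
  - alert: oracledb_session_count_over_100(OracleDB exporter)
    # Assumed metric name and threshold; use the values from your OracleDB exporter metric definition file.
    expr: oracledb_sessions_value > 100
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "0806"
      jp1_pc_metricname: "oracledb_sessions_value"
    annotations:
      jp1_pc_firing_description: "The number of sessions exceeded 100. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The number of sessions is 100 or less. instance={{ $labels.instance }}"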
- #
-
If you define multiple alerts on the same monitoring agent host, be careful not to specify "groups:" more than once or to use the same group name in more than one "name:" entry.
■Alert definition example for metrics in Web exporter metric definition file
-
probe_webscena_success#
groups:
- name: web_exporter
  rules:
  - alert: probe_webscena_success(Web exporter)
    expr: 0 == probe_webscena_success
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1901"
      jp1_pc_metricname: "probe_webscena_success"
    annotations:
      jp1_pc_firing_description: "Communication failed. value={{ $value }}"
      jp1_pc_resolved_description: "Communication was successful."
-
probe_webscena_duration_seconds#
groups:
- name: web_exporter
  rules:
  - alert: probe_webscena_duration_seconds(Web exporter)
    expr: 150 < probe_webscena_duration_seconds
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1902"
      jp1_pc_metricname: "probe_webscena_duration_seconds"
    annotations:
      jp1_pc_firing_description: "The period (in seconds) taken by the web scenario probe has exceeded the threshold (150 seconds). value={{ $value }} seconds"
      jp1_pc_resolved_description: "The period (in seconds) taken by the web scenario probe has fallen below the threshold (150 seconds)."
- #
-
If you define multiple alerts on the same monitoring agent host, be careful not to specify "groups:" more than once or to use the same group name in more than one "name:" entry.
■Alert definition example for metrics in VMware exporter metric definition file for host
-
vmware_host_size#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_size(VMware exporter)
    expr: 900 > sum(vmware_datastore_capacity_size) without(ds_name) / 1024 / 1024 / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1101"
      jp1_pc_metricname: "vmware_datastore_capacity_size"
    annotations:
      jp1_pc_firing_description: "The physical disk size has fallen below the threshold (900 gigabytes). value={{ $value }} gigabytes, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The physical disk size has exceeded the threshold (900 gigabytes). instance={{ $labels.instance }}"
-
vmware_host_used#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_used(VMware exporter)
    expr: 800 < ((sum(vmware_datastore_capacity_size) without(ds_name)) / 1024 / 1024 / 1024) - ((sum(vmware_datastore_freespace_size) without(ds_name)) / 1024 / 1024 / 1024)
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1112"
      jp1_pc_metricname: "vmware_datastore_capacity_size,vmware_datastore_freespace_size"
    annotations:
      jp1_pc_firing_description: "The physical disk usage size has exceeded the threshold (800 gigabytes). value={{ $value }} gigabytes, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The physical disk usage size has fallen below the threshold (800 gigabytes). instance={{ $labels.instance }}"
-
vmware_host_free#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_free(VMware exporter)
    expr: 10 > (sum(vmware_datastore_freespace_size) without(ds_name)) / 1024 / 1024 / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1102"
      jp1_pc_metricname: "vmware_datastore_freespace_size"
    annotations:
      jp1_pc_firing_description: "The size of free physical disk space has fallen below the threshold (10 gigabytes). value={{ $value }} gigabytes, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The size of free physical disk space has exceeded the threshold (10 gigabytes). instance={{ $labels.instance }}"
-
vmware_datastore_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_datastore_used_percent(VMware exporter)
    expr: (((sum(vmware_datastore_capacity_size) without(ds_name)) / 1024 / 1024) - ((sum(vmware_datastore_freespace_size) without(ds_name)) / 1024 / 1024)) / ((sum(vmware_datastore_capacity_size) without(ds_name)) / 1024 / 1024) * 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1103"
      jp1_pc_metricname: "vmware_datastore_capacity_size,vmware_datastore_freespace_size"
    annotations:
      jp1_pc_firing_description: "The physical disk usage rate has exceeded 90%. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The physical disk usage rate is 90% or less. instance={{ $labels.instance }}"
-
vmware_host_memory_max#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_memory_max(VMware exporter)
    expr: 16 > vmware_host_memory_max / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1104"
      jp1_pc_metricname: "vmware_host_memory_max"
    annotations:
      jp1_pc_firing_description: "The total size of physical memory has fallen below the threshold (16 gigabytes). value={{ $value }} gigabytes"
      jp1_pc_resolved_description: "The total size of physical memory has exceeded the threshold (16 gigabytes)."
-
vmware_host_memory_used#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_memory_used(VMware exporter)
    expr: 15 < vmware_host_memory_usage / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1117"
      jp1_pc_metricname: "vmware_host_memory_usage"
    annotations:
      jp1_pc_firing_description: "The amount of physical memory used has exceeded the threshold (15 gigabytes). value={{ $value }} gigabytes"
      jp1_pc_resolved_description: "The amount of physical memory used has fallen below the threshold (15 gigabytes)."
-
vmware_host_memory_unused#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_memory_unused(VMware exporter)
    expr: 1 > (vmware_host_memory_max / 1024) - (vmware_host_memory_usage / 1024)
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1105"
      jp1_pc_metricname: "vmware_host_memory_max,vmware_host_memory_usage"
    annotations:
      jp1_pc_firing_description: "The amount of unused physical memory has fallen below the threshold (1 gigabyte). value={{ $value }} gigabytes"
      jp1_pc_resolved_description: "The amount of unused physical memory has exceeded the threshold (1 gigabyte)."
-
vmware_host_mem_vmmemctl_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_mem_vmmemctl_average(VMware exporter)
    expr: 15 < vmware_host_mem_vmmemctl_average / 1024 / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1118"
      jp1_pc_metricname: "vmware_host_mem_vmmemctl_average"
    annotations:
      jp1_pc_firing_description: "The amount of internal swaps used has exceeded the threshold (15 gigabytes). value={{ $value }} gigabytes"
      jp1_pc_resolved_description: "The amount of internal swaps used has fallen below the threshold (15 gigabytes)."
-
vmware_host_memory_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_memory_used_percent(VMware exporter)
    expr: (vmware_host_memory_usage / vmware_host_memory_max) * 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1106"
      jp1_pc_metricname: "vmware_host_memory_usage,vmware_host_memory_max"
    annotations:
      jp1_pc_firing_description: "The physical memory usage rate has exceeded 90%. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The physical memory usage rate is 90% or less. instance={{ $labels.instance }}"
-
vmware_host_swap_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_swap_used_percent(VMware exporter)
    expr: ((vmware_host_mem_vmmemctl_average / 1024) / vmware_host_memory_max) * 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1116"
      jp1_pc_metricname: "vmware_host_mem_vmmemctl_average,vmware_host_memory_max"
    annotations:
      jp1_pc_firing_description: "The internal swap rate has exceeded 90%. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The internal swap rate is 90% or less. instance={{ $labels.instance }}"
-
vmware_vm_net_rate#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_vm_net_rate(VMware exporter)
    expr: 10 > vmware_host_net_bytesTx_average + vmware_host_net_bytesRx_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1107"
      jp1_pc_metricname: "vmware_host_net_bytesTx_average,vmware_host_net_bytesRx_average"
    annotations:
      jp1_pc_firing_description: "The network transmission/reception speed has fallen below the threshold (10 KB per second). value={{ $value }} KB per second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The network transmission/reception speed has exceeded the threshold (10 KB per second). instance={{ $labels.instance }}"
-
vmware_host_net_bytesTx_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_net_bytesTx_average(VMware exporter)
    expr: 10 > vmware_host_net_bytesTx_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1108"
      jp1_pc_metricname: "vmware_host_net_bytesTx_average"
    annotations:
      jp1_pc_firing_description: "The network transmission speed has fallen below the threshold (10 KB per second). value={{ $value }} KB per second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The network transmission speed has exceeded the threshold (10 KB per second). instance={{ $labels.instance }}"
-
vmware_host_net_bytesRx_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_net_bytesRx_average(VMware exporter)
    expr: 10 > vmware_host_net_bytesRx_average
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1109"
      jp1_pc_metricname: "vmware_host_net_bytesRx_average"
    annotations:
      jp1_pc_firing_description: "The network reception speed has fallen below the threshold (10 KB per second). value={{ $value }} KB per second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The network reception speed has exceeded the threshold (10 KB per second). instance={{ $labels.instance }}"
-
vmware_host_num_cpu#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_num_cpu(VMware exporter)
    expr: 2 > vmware_host_num_cpu
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1113"
      jp1_pc_metricname: "vmware_host_num_cpu"
    annotations:
      jp1_pc_firing_description: "The number of CPU cores has fallen below the threshold value (2)."
      jp1_pc_resolved_description: "The number of CPU cores has exceeded the threshold value (2)."
-
vmware_host_cpu_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_cpu_used_percent(VMware exporter)
    expr: vmware_host_cpu_usage_average / 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1110"
      jp1_pc_metricname: "vmware_host_cpu_usage_average"
    annotations:
      jp1_pc_firing_description: "The CPU usage rate has exceeded the threshold value (90%). instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "The CPU usage rate has fallen below the threshold value (90%). instance={{ $labels.instance }}"
-
vmware_host_disk_write_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_disk_write_average(VMware exporter)
    expr: 10 > vmware_host_disk_write_average / 8
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1119"
      jp1_pc_metricname: "vmware_host_disk_write_average"
    annotations:
      jp1_pc_firing_description: "The write data transfer speed has fallen below the threshold value (10 KB per second). value={{ $value }} KB per second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The write data transfer speed has exceeded the threshold value (10 KB per second). instance={{ $labels.instance }}"
-
vmware_host_disk_read_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_host_disk_read_average(VMware exporter)
    expr: 10 > vmware_host_disk_read_average / 8
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1120"
      jp1_pc_metricname: "vmware_host_disk_read_average"
    annotations:
      jp1_pc_firing_description: "The read data transfer speed has fallen below the threshold value (10 KB per second). value={{ $value }} KB per second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The read data transfer speed has exceeded the threshold value (10 KB per second). instance={{ $labels.instance }}"
- #
-
If you define multiple alerts on the same monitoring agent host, be careful not to specify "groups:" more than once or to use the same group name in more than one "name:" entry.
■Alert definition example for metrics in VMware exporter metric definition file for VM
-
vmware_vm_cpu_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_vm_cpu_used_percent(VMware exporter)
    expr: vmware_vm_cpu_usage_average / (20 * 1000) * 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1121"
      jp1_pc_metricname: "vmware_vm_cpu_usage_average"
    annotations:
      jp1_pc_firing_description: "The virtual CPU usage rate has exceeded 90%. vm_name={{ $labels.vm_name }}, value={{ $value }}"
      jp1_pc_resolved_description: "The virtual CPU usage rate is 90% or less. vm_name={{ $labels.vm_name }}"
-
vmware_vm_mem_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_vm_mem_used_percent(VMware exporter)
    expr: (((vmware_vm_mem_consumed_average / 1024) + (vmware_vm_mem_vmmemctl_average / 1024) + (vmware_vm_mem_swapped_average / 1024)) / vmware_vm_memory_max) * 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1111"
      jp1_pc_metricname: "vmware_vm_mem_consumed_average,vmware_vm_mem_vmmemctl_average,vmware_vm_mem_swapped_average,vmware_vm_memory_max"
    annotations:
      jp1_pc_firing_description: "The virtual memory usage rate has exceeded 90%. vm_name={{ $labels.vm_name }}, value={{ $value }}"
      jp1_pc_resolved_description: "The virtual memory usage rate is 90% or less. vm_name={{ $labels.vm_name }}"
-
vmware_vm_disk_write_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_vm_disk_write_average(VMware exporter)
    expr: 10 > vmware_vm_disk_write_average / 8
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1122"
      jp1_pc_metricname: "vmware_vm_disk_write_average"
    annotations:
      jp1_pc_firing_description: "The write data transfer speed of the virtual machine has fallen below the threshold (10 KB per second). value={{ $value }} KB per second, vm_name={{ $labels.vm_name }}"
      jp1_pc_resolved_description: "The write data transfer speed of the virtual machine has exceeded the threshold (10 KB per second). vm_name={{ $labels.vm_name }}"
-
vmware_vm_disk_read_average#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_vm_disk_read_average(VMware exporter)
    expr: 10 > vmware_vm_disk_read_average / 8
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1123"
      jp1_pc_metricname: "vmware_vm_disk_read_average"
    annotations:
      jp1_pc_firing_description: "The read data transfer speed of the virtual machine has fallen below the threshold (10 KB per second). value={{ $value }} KB per second, vm_name={{ $labels.vm_name }}"
      jp1_pc_resolved_description: "The read data transfer speed of the virtual machine has exceeded the threshold (10 KB per second). vm_name={{ $labels.vm_name }}"
-
vmware_vm_disk_used_percent#
groups:
- name: vmware_exporter
  rules:
  - alert: vmware_vm_disk_used_percent(VMware exporter)
    expr: (((sum(vmware_vm_guest_disk_capacity) without(partition) / (1024 * 1024)) - ((sum(vmware_vm_guest_disk_free) without(partition)) / (1024 * 1024))) / ((sum(vmware_vm_guest_disk_capacity) without(partition)) / (1024 * 1024))) * 100 > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1124"
      jp1_pc_metricname: "vmware_vm_guest_disk_capacity,vmware_vm_guest_disk_free"
    annotations:
      jp1_pc_firing_description: "The disk usage rate of the virtual machine has exceeded 90%. vm_name={{ $labels.vm_name }}, value={{ $value }}"
      jp1_pc_resolved_description: "The disk usage rate of the virtual machine is 90% or less. vm_name={{ $labels.vm_name }}"
- #
-
When you define multiple alerts on the same integrated agent host, do not specify "groups:" more than once, and do not specify the same group name in more than one "name:" entry.
■Alert definition example for metrics in Windows exporter (Hyper-V monitoring) metric definition file for host
-
hyperv_vm_cpu_resources_used_percent#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_vm_cpu_resources_used_percent(Windows exporter)
    expr: 90 < (sum by(instance,job,jp1_pc_exporter,jp1_pc_nodelabel,jp1_pc_prome_hostname,vm)(rate(windows_hyperv_vm_cpu_hypervisor_run_time[2m]))) / ignoring(instance,job,jp1_pc_exporter,jp1_pc_nodelabel,jp1_pc_prome_hostname,vm) group_left max (windows_cs_logical_processors) / 100000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1601"
      jp1_pc_metricname: "windows_hyperv_vm_cpu_hypervisor_run_time,windows_cs_logical_processors"
    annotations:
      jp1_pc_firing_description: "VM used more than 90% of the physical CPU. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "VM now uses 90% or less of the physical CPU. instance={{ $labels.instance }}"
-
hyperv_host_cpu_used_percent#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_host_cpu_used_percent(Windows exporter)
    expr: 90 < (sum by (instance,job,jp1_pc_category,jp1_pc_exporter,jp1_pc_nodelabel,jp1_pc_trendname,jp1_pc_prome_hostname)(rate(windows_hyperv_host_cpu_total_run_time[2m]))) / sum by (instance,job,jp1_pc_category,jp1_pc_exporter,jp1_pc_nodelabel,jp1_pc_trendname,jp1_pc_prome_hostname)(windows_cs_logical_processors) / 100000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1604"
      jp1_pc_metricname: "windows_hyperv_host_cpu_total_run_time,windows_cs_logical_processors"
    annotations:
      jp1_pc_firing_description: "CPU usage of the physical server exceeded 90%. instance={{ $labels.instance }}, value={{ $value }}"
      jp1_pc_resolved_description: "CPU usage of the physical server is 90% or less. instance={{ $labels.instance }}"
-
hyperv_vswitch_sent_received#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_vswitch_sent_received(Windows exporter)
    expr: 10 > ((rate(windows_hyperv_vswitch_bytes_received_total[2m])) + (rate(windows_hyperv_vswitch_bytes_sent_total[2m]))) / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1605"
      jp1_pc_metricname: "windows_hyperv_vswitch_bytes_received_total,windows_hyperv_vswitch_bytes_sent_total"
    annotations:
      jp1_pc_firing_description: "The network transmission/reception rate dropped below the threshold (10 KB/second). value={{ $value }} KB/second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The network transmission/reception rate exceeded the threshold (10 KB/second). instance={{ $labels.instance }}"
-
hyperv_vswitch_received#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_vswitch_received(Windows exporter)
    expr: 10 > (rate(windows_hyperv_vswitch_bytes_received_total[2m])) / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1606"
      jp1_pc_metricname: "windows_hyperv_vswitch_bytes_received_total"
    annotations:
      jp1_pc_firing_description: "The network reception rate dropped below the threshold (10 KB/second). value={{ $value }} KB/second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The network reception rate exceeded the threshold (10 KB/second). instance={{ $labels.instance }}"
-
hyperv_vswitch_sent#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_vswitch_sent(Windows exporter)
    expr: 10 > (rate(windows_hyperv_vswitch_bytes_sent_total[2m])) / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1607"
      jp1_pc_metricname: "windows_hyperv_vswitch_bytes_sent_total"
    annotations:
      jp1_pc_firing_description: "The network transmission rate dropped below the threshold (10 KB/second). value={{ $value }} KB/second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The network transmission rate exceeded the threshold (10 KB/second). instance={{ $labels.instance }}"
- #
-
When you define multiple alerts on the same integrated agent host, do not specify "groups:" more than once, and do not specify the same group name in more than one "name:" entry.
■Alert definition example for metrics in Windows exporter (Hyper-V monitoring) metric definition file for VM
-
hyperv_vm_device_written#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_vm_device_written(Windows exporter)
    expr: 10 > (rate(windows_hyperv_vm_device_bytes_written[2m])) / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1602"
      jp1_pc_metricname: "windows_hyperv_vm_device_bytes_written"
    annotations:
      jp1_pc_firing_description: "The write data rate dropped below the threshold (10 KB/second). value={{ $value }} KB/second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The write data rate exceeded the threshold (10 KB/second). instance={{ $labels.instance }}"
-
hyperv_vm_device_read#
groups:
- name: windows_exporter_hyperv
  rules:
  - alert: hyperv_vm_device_read(Windows exporter)
    expr: 10 > (rate(windows_hyperv_vm_device_bytes_read[2m])) / 1024
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1603"
      jp1_pc_metricname: "windows_hyperv_vm_device_bytes_read"
    annotations:
      jp1_pc_firing_description: "The read data rate dropped below the threshold (10 KB/second). value={{ $value }} KB/second, instance={{ $labels.instance }}"
      jp1_pc_resolved_description: "The read data rate exceeded the threshold (10 KB/second). instance={{ $labels.instance }}"
- #
-
When you define multiple alerts on the same integrated agent host, do not specify "groups:" more than once, and do not specify the same group name in more than one "name:" entry.
■Alert definition example for metrics in SQL exporter metric definition file
-
connections#
groups:
- name: sql_exporter
  rules:
  - alert: sql_connections(SQL exporter)
    expr: mssql_connections > 50
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1701"
      jp1_pc_metricname: "mssql_connections"
    annotations:
      jp1_pc_firing_description: "Number of connections exceeded the threshold (50 connections). value={{ $value }}"
      jp1_pc_resolved_description: "Number of connections dropped below the threshold (50 connections). value={{ $value }}"
-
deadlocks#
groups:
- name: sql_exporter
  rules:
  - alert: sql_deadlocks(SQL exporter)
    expr: rate(mssql_deadlocks[2m]) > 0
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1702"
      jp1_pc_metricname: "mssql_deadlocks"
    annotations:
      jp1_pc_firing_description: "Deadlock occurred. value={{ $value }}"
      jp1_pc_resolved_description: "No deadlock detected. value={{ $value }}"
-
user_errors#
groups:
- name: sql_exporter
  rules:
  - alert: sql_user_errors(SQL exporter)
    expr: rate(mssql_user_errors[2m]) > 0
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1703"
      jp1_pc_metricname: "mssql_user_errors"
    annotations:
      jp1_pc_firing_description: "User error occurred. value={{ $value }}"
      jp1_pc_resolved_description: "No user error detected. value={{ $value }}"
-
kill_connection_errors#
groups:
- name: sql_exporter
  rules:
  - alert: sql_kill_connection_errors(SQL exporter)
    expr: rate(mssql_kill_connection_errors[2m]) > 0
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1704"
      jp1_pc_metricname: "mssql_kill_connection_errors"
    annotations:
      jp1_pc_firing_description: "Critical failure. value={{ $value }}"
      jp1_pc_resolved_description: "No critical failure detected. value={{ $value }}"
-
page_life_expectancy_seconds#
groups:
- name: sql_exporter
  rules:
  - alert: sql_life_expectancy_seconds(SQL exporter)
    expr: rate(mssql_page_life_expectancy_seconds[2m]) > 1
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1705"
      jp1_pc_metricname: "mssql_page_life_expectancy_seconds"
    annotations:
      jp1_pc_firing_description: "The buffer pool page life expectancy exceeded the threshold (1 second). value={{ $value }}"
      jp1_pc_resolved_description: "The buffer pool page life expectancy dropped below the threshold (1 second). value={{ $value }}"
-
batch_requests#
groups:
- name: sql_exporter
  rules:
  - alert: sql_command_batch(SQL exporter)
    expr: rate(mssql_batch_requests[2m]) > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1706"
      jp1_pc_metricname: "mssql_batch_requests"
    annotations:
      jp1_pc_firing_description: "The number of command batches received exceeds the threshold (1000). value={{ $value }}"
      jp1_pc_resolved_description: "The number of command batches received dropped below the threshold (1000). value={{ $value }}"
-
log_growths#
groups:
- name: sql_exporter
  rules:
  - alert: sql_log_growths(SQL exporter)
    expr: mssql_log_growths > 1
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1707"
      jp1_pc_metricname: "mssql_log_growths"
    annotations:
      jp1_pc_firing_description: "The transaction-log growth count exceeded the threshold (1 time). value={{ $value }}"
      jp1_pc_resolved_description: "The transaction-log growth count dropped below the threshold (1 time). value={{ $value }}"
-
checkpoint_pages_sec#
groups:
- name: sql_exporter
  rules:
  - alert: sql_checkpoint_pages_sec(SQL exporter)
    expr: mssql_checkpoint_pages_sec > 0
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1709"
      jp1_pc_metricname: "mssql_checkpoint_pages_sec"
    annotations:
      jp1_pc_firing_description: "Checkpoint pages per second exceeded the threshold (0). value={{ $value }}"
      jp1_pc_resolved_description: "Checkpoint pages per second dropped below the threshold (0). value={{ $value }}"
-
io_stall_seconds#
groups:
- name: sql_exporter
  rules:
  - alert: sql_io_stall_seconds(SQL exporter)
    expr: avg(mssql_io_stall_seconds) without(operation) > 1
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1710"
      jp1_pc_metricname: "mssql_io_stall_seconds"
    annotations:
      jp1_pc_firing_description: "The average stall time per operation exceeded the threshold (1 second). value={{ $value }}"
      jp1_pc_resolved_description: "The average stall time per operation dropped below the threshold (1 second). value={{ $value }}"
-
io_stall_read_seconds#
groups:
- name: sql_exporter
  rules:
  - alert: sql_io_stall_read_seconds(SQL exporter)
    expr: mssql_io_stall_seconds{operation="read"} > 1
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1711"
      jp1_pc_metricname: "mssql_io_stall_seconds"
    annotations:
      jp1_pc_firing_description: "The stall time for read operations exceeded the threshold (1 second). value={{ $value }}"
      jp1_pc_resolved_description: "The stall time for read operations dropped below the threshold (1 second). value={{ $value }}"
-
io_stall_write_seconds#
groups:
- name: sql_exporter
  rules:
  - alert: sql_io_stall_write_seconds(SQL exporter)
    expr: mssql_io_stall_seconds{operation="write"} > 1
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1712"
      jp1_pc_metricname: "mssql_io_stall_seconds"
    annotations:
      jp1_pc_firing_description: "The stall time for write operations exceeded the threshold (1 second). value={{ $value }}"
      jp1_pc_resolved_description: "The stall time for write operations dropped below the threshold (1 second). value={{ $value }}"
-
io_stall_total_seconds#
groups:
- name: sql_exporter
  rules:
  - alert: sql_io_stall_total_seconds(SQL exporter)
    expr: mssql_io_stall_total_seconds > 10
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1713"
      jp1_pc_metricname: "mssql_io_stall_total_seconds"
    annotations:
      jp1_pc_firing_description: "The total stall time per database exceeded the threshold (10 seconds). value={{ $value }}"
      jp1_pc_resolved_description: "The total stall time per database dropped below the threshold (10 seconds). value={{ $value }}"
-
resident_memory_mbytes#
groups:
- name: sql_exporter
  rules:
  - alert: sql_resident_memory_bytes(SQL exporter)
    expr: mssql_resident_memory_bytes / 1024 / 1024 > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1714"
      jp1_pc_metricname: "mssql_resident_memory_bytes"
    annotations:
      jp1_pc_firing_description: "Resident memory size exceeded the threshold (1000MB). value={{ $value }}"
      jp1_pc_resolved_description: "Resident memory size dropped below the threshold (1000MB). value={{ $value }}"
-
virtual_memory_mbytes#
groups:
- name: sql_exporter
  rules:
  - alert: sql_virtual_memory_bytes(SQL exporter)
    expr: mssql_virtual_memory_bytes / 1024 / 1024 > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1715"
      jp1_pc_metricname: "mssql_virtual_memory_bytes"
    annotations:
      jp1_pc_firing_description: "Commit size for virtual memory exceeded the threshold (1000MB). value={{ $value }}"
      jp1_pc_resolved_description: "Commit size for virtual memory dropped below the threshold (1000MB). value={{ $value }}"
-
memory_utilization_percentage#
groups:
- name: sql_exporter
  rules:
  - alert: sql_memory_utilization_percentage(SQL exporter)
    expr: mssql_memory_utilization_percentage > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1716"
      jp1_pc_metricname: "mssql_memory_utilization_percentage"
    annotations:
      jp1_pc_firing_description: "The percentage of committed memory in the working set exceeded the threshold (90%). value={{ $value }}"
      jp1_pc_resolved_description: "The percentage of committed memory in the working set is below the threshold (90%). value={{ $value }}"
-
page_fault_count#
groups:
- name: sql_exporter
  rules:
  - alert: sql_page_fault(SQL exporter)
    expr: rate(mssql_page_fault_count[2m]) > 10
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1717"
      jp1_pc_metricname: "mssql_page_fault_count"
    annotations:
      jp1_pc_firing_description: "The number of page faults exceeded the threshold (10 times). value={{ $value }}"
      jp1_pc_resolved_description: "The number of page faults is less than the threshold (10 times). value={{ $value }}"
-
os_memory_mbytes#
groups:
- name: sql_exporter
  rules:
  - alert: sql_os_memory_mbytes(SQL exporter)
    expr: sum(mssql_os_memory) without(state) / 1024 / 1024 < 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1718"
      jp1_pc_metricname: "mssql_os_memory"
    annotations:
      jp1_pc_firing_description: "The OS physical memory (overall) has fallen below the threshold (1000MB). value={{ $value }}"
      jp1_pc_resolved_description: "OS physical memory (overall) exceeded the threshold (1000MB). value={{ $value }}"
-
os_memory_available_mbytes#
groups:
- name: sql_exporter
  rules:
  - alert: sql_os_memory_available_mbytes(SQL exporter)
    expr: mssql_os_memory{state="available"} / 1024 / 1024 < 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1719"
      jp1_pc_metricname: "mssql_os_memory"
    annotations:
      jp1_pc_firing_description: "The OS physical memory (available) has fallen below the threshold (1000MB). value={{ $value }}"
      jp1_pc_resolved_description: "OS physical memory (available) exceeded the threshold (1000MB). value={{ $value }}"
-
os_memory_used_mbytes#
groups:
- name: sql_exporter
  rules:
  - alert: sql_os_memory_used_mbytes(SQL exporter)
    expr: mssql_os_memory{state="used"} / 1024 / 1024 > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1720"
      jp1_pc_metricname: "mssql_os_memory"
    annotations:
      jp1_pc_firing_description: "OS physical memory (used) exceeded the threshold (1000MB). value={{ $value }}"
      jp1_pc_resolved_description: "The OS physical memory (used) has fallen below the threshold (1000MB). value={{ $value }}"
-
os_page_file#
groups:
- name: sql_exporter
  rules:
  - alert: sql_os_page_file(SQL exporter)
    expr: sum(mssql_os_page_file) without(state) < 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1721"
      jp1_pc_metricname: "mssql_os_page_file"
    annotations:
      jp1_pc_firing_description: "The total number of OS pagefiles has fallen below the threshold (1000 files). value={{ $value }}"
      jp1_pc_resolved_description: "Total number of OS pagefiles exceeded the threshold (1000 files). value={{ $value }}"
-
os_page_file_available#
groups:
- name: sql_exporter
  rules:
  - alert: sql_os_page_file_available(SQL exporter)
    expr: mssql_os_page_file{state="available"} < 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1722"
      jp1_pc_metricname: "mssql_os_page_file"
    annotations:
      jp1_pc_firing_description: "Number of OS pagefiles (available) dropped below the threshold (1000). value={{ $value }}"
      jp1_pc_resolved_description: "Number of OS pagefiles (available) exceeds the threshold (1000). value={{ $value }}"
-
os_page_file_used#
groups:
- name: sql_exporter
  rules:
  - alert: sql_os_page_file_used(SQL exporter)
    expr: mssql_os_page_file{state="used"} > 1000
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1723"
      jp1_pc_metricname: "mssql_os_page_file"
    annotations:
      jp1_pc_firing_description: "Number of OS pagefiles (used) exceeded the threshold (1000). value={{ $value }}"
      jp1_pc_resolved_description: "Number of OS pagefiles (used) dropped below the threshold (1000). value={{ $value }}"
-
process_count#
groups:
- name: sql_exporter
  rules:
  - alert: sql_process_count(SQL exporter)
    expr: mssql_database_detail_process_count > 50
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1724"
      jp1_pc_metricname: "mssql_database_detail_process_count"
    annotations:
      jp1_pc_firing_description: "The total number of processes exceeded the threshold (50 processes). value={{ $value }}"
      jp1_pc_resolved_description: "The total number of processes dropped below the threshold (50 processes). value={{ $value }}"
-
perc_busy_percentage#
groups:
- name: sql_exporter
  rules:
  - alert: sql_perc_busy(SQL exporter)
    expr: mssql_global_server_summary_perc_busy > 90
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1725"
      jp1_pc_metricname: "mssql_global_server_summary_perc_busy"
    annotations:
      jp1_pc_firing_description: "The percentage of CPU busy time exceeded the threshold (90%). value={{ $value }}"
      jp1_pc_resolved_description: "The percentage of CPU busy time has fallen below the threshold (90%). value={{ $value }}"
-
server_summary_packet_errors#
groups:
- name: sql_exporter
  rules:
  - alert: sql_global_server_summary_packet_errors(SQL exporter)
    expr: mssql_global_server_summary_packet_errors > 2
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1726"
      jp1_pc_metricname: "mssql_global_server_summary_packet_errors"
    annotations:
      jp1_pc_firing_description: "Number of packet errors exceeded the threshold (2). value={{ $value }}"
      jp1_pc_resolved_description: "Number of packet errors dropped below the threshold (2). value={{ $value }}"
-
blocked_processes#
groups:
- name: sql_exporter
  rules:
  - alert: sql_blocked_processes(SQL exporter)
    expr: mssql_server_detail_blocked_processes > 2
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1727"
      jp1_pc_metricname: "mssql_server_detail_blocked_processes"
    annotations:
      jp1_pc_firing_description: "The number of waiting processes exceeded the threshold (2 processes). value={{ $value }}"
      jp1_pc_resolved_description: "The number of waiting processes is less than the threshold (2 processes). value={{ $value }}"
-
server_overview_cache_hit#
groups:
- name: sql_exporter
  rules:
  - alert: sql_server_overview_cache_hit(SQL exporter)
    expr: mssql_server_overview_cache_hit < 85
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1728"
      jp1_pc_metricname: "mssql_server_overview_cache_hit"
    annotations:
      jp1_pc_firing_description: "The percentage of data pages found in the data cache was below the threshold (85%). value={{ $value }}"
      jp1_pc_resolved_description: "The percentage of data pages found in the data cache exceeded the threshold (85%). value={{ $value }}"
- #
-
When you define multiple alerts on the same integrated agent host, do not specify "groups:" more than once, and do not specify the same group name in more than one "name:" entry.
■Alert definition example for metrics in Container monitoring metric definition file
-
kube_job_status_failed#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_job_status_failed(Kube state metrics)
    expr: 0 < kube_job_status_failed * on(job_name, namespace) group_left() kube_job_owner{owner_kind="<none>", owner_name="<none>"}
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1201"
      jp1_pc_metricname: "kube_job_status_failed, kube_job_owner"
      jp1_pc_nodelabel: "{{ $labels.namespace }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of failed pods has exceeded the threshold (0 pods). value={{ $value }} pods, job_name={{ $labels.job_name }}"
      jp1_pc_resolved_description: "The number of failed pods has fallen below the threshold (0 pods). job_name={{ $labels.job_name }}"
-
kube_pod_status_pending#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_pod_status_pending(Kube state metrics)
    expr: 0 < sum by (pod, namespace, instance, job) (kube_pod_status_phase{phase="Pending"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1202"
      jp1_pc_metricname: "kube_pod_status_phase"
      jp1_pc_nodelabel: "{{ $labels.namespace }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of pending pods has exceeded the threshold (0 pods). value={{ $value }} pods, pod={{ $labels.pod }}"
      jp1_pc_resolved_description: "The number of pending pods has fallen below the threshold (0 pods). pod={{ $labels.pod }}"
-
kube_pod_status_failed#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_pod_status_failed(Kube state metrics)
    expr: 0 < sum by (pod, namespace, instance, job) (kube_pod_status_phase{phase="Failed"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1203"
      jp1_pc_metricname: "kube_pod_status_phase"
      jp1_pc_nodelabel: "{{ $labels.namespace }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of failed pods has exceeded the threshold (0 pods). value={{ $value }} pods, pod={{ $labels.pod }}"
      jp1_pc_resolved_description: "The number of failed pods has fallen below the threshold (0 pods). pod={{ $labels.pod }}"
-
kube_pod_status_unknown#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_pod_status_unknown(Kube state metrics)
    expr: 0 < sum by (pod, namespace, instance) (kube_pod_status_phase{phase="Unknown"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1204"
      jp1_pc_metricname: "kube_pod_status_phase"
      jp1_pc_nodelabel: "{{ $labels.namespace }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of unknown pods has exceeded the threshold (0 pods). value={{ $value }} pods, pod={{ $labels.pod }}"
      jp1_pc_resolved_description: "The number of unknown pods has fallen below the threshold (0 pods). pod={{ $labels.pod }}"
-
kube_daemonset_failed_number_scheduled#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_daemonset_failed_number_scheduled(Kube state metrics)
    expr: 0 < kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1205"
      jp1_pc_metricname: "kube_daemonset_status_desired_number_scheduled, kube_daemonset_status_current_number_scheduled"
      jp1_pc_nodelabel: "{{ $labels.daemonset }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of nodes that failed to execute has exceeded the threshold (0 nodes). value={{ $value }} nodes"
      jp1_pc_resolved_description: "The number of nodes that failed to execute has fallen below the threshold (0 nodes)."
-
kube_deployment_failed_replicas#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_deployment_failed_replicas(Kube state metrics)
    expr: 0 < kube_deployment_spec_replicas - kube_deployment_status_replicas_available
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1206"
      jp1_pc_metricname: "kube_deployment_spec_replicas, kube_deployment_status_replicas_available"
      jp1_pc_nodelabel: "{{ $labels.deployment }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of pods that failed to execute on each deployment has exceeded the threshold (0 pods). value={{ $value }} pods"
      jp1_pc_resolved_description: "The number of pods that failed to execute on each deployment has fallen below the threshold (0 pods)."
-
kube_replicaset_failed_replicas#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_replicaset_failed_replicas(Kube state metrics)
    expr: 0 < kube_replicaset_spec_replicas - kube_replicaset_status_ready_replicas
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1207"
      jp1_pc_metricname: "kube_replicaset_spec_replicas, kube_replicaset_status_ready_replicas"
      jp1_pc_nodelabel: "{{ $labels.replicaset }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of pods that failed to execute on each ReplicaSet has exceeded the threshold (0 pods). value={{ $value }} pods"
      jp1_pc_resolved_description: "The number of pods that failed to execute on each ReplicaSet has fallen below the threshold (0 pods)."
-
kube_statefulset_failed_replicas#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_statefulset_failed_replicas(Kube state metrics)
    expr: 0 < kube_statefulset_replicas - kube_statefulset_status_replicas_ready
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1208"
      jp1_pc_metricname: "kube_statefulset_replicas, kube_statefulset_status_replicas_ready"
      jp1_pc_nodelabel: "{{ $labels.statefulset }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of pods that failed to execute on each StatefulSet has exceeded the threshold (0 pods). value={{ $value }} pods"
      jp1_pc_resolved_description: "The number of pods that failed to execute on each StatefulSet has fallen below the threshold (0 pods)."
kube_cron_job_status_failed#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_cron_job_status_failed(Kube state metrics)
    expr: 0 < kube_job_status_failed * on(job_name, namespace) group_left(owner_name) kube_job_owner{owner_kind="CronJob", owner_name!="<none>"}
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1209"
      jp1_pc_metricname: "kube_job_status_failed, kube_job_owner"
      jp1_pc_nodelabel: "{{ $labels.owner_name }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The number of pods that failed to execute within a CronJob has exceeded the threshold (0 pods). value={{ $value }} pods"
      jp1_pc_resolved_description: "The number of pods that failed to execute within a CronJob has fallen below the threshold (0 pods)."
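As a note on the expression above: the "* on(job_name, namespace) group_left(owner_name)" vector match joins each kube_job_status_failed series with the kube_job_owner series that has the same job_name and namespace, and copies owner_name onto the result. A minimal sketch with hypothetical series values:

# Hypothetical input series:
#   kube_job_status_failed{job_name="backup-28730", namespace="ops"} = 1
#   kube_job_owner{job_name="backup-28730", namespace="ops",
#                  owner_kind="CronJob", owner_name="backup"} = 1
# The join multiplies the matched pair (1 * 1 = 1) and attaches
# owner_name="backup" to the result, which is why jp1_pc_nodelabel can
# reference {{ $labels.owner_name }}.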
kube_node_status_condition_not_ready#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_node_status_condition_not_ready(Kube state metrics)
    expr: 1 == sum by (node, instance) (kube_node_status_condition{condition="Ready",status=~"false|unknown"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1210"
      jp1_pc_metricname: "kube_node_status_condition"
      jp1_pc_nodelabel: "{{ $labels.node }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The node is in an error state. value={{ $value }} node"
      jp1_pc_resolved_description: "The node has recovered from its error state."
kube_node_status_condition_memory_pressure#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_node_status_condition_memory_pressure(Kube state metrics)
    expr: 1 == sum by (node, instance) (kube_node_status_condition{condition="MemoryPressure",status=~"true|unknown"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1211"
      jp1_pc_metricname: "kube_node_status_condition"
      jp1_pc_nodelabel: "{{ $labels.node }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The node is in a memory-constrained state. value={{ $value }} node"
      jp1_pc_resolved_description: "The node has recovered from its memory-constrained state."
kube_node_status_condition_disk_pressure#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_node_status_condition_disk_pressure(Kube state metrics)
    expr: 1 == sum by (node, instance) (kube_node_status_condition{condition="DiskPressure",status=~"true|unknown"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1212"
      jp1_pc_metricname: "kube_node_status_condition"
      jp1_pc_nodelabel: "{{ $labels.node }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The node is in a disk-constrained state. value={{ $value }} node"
      jp1_pc_resolved_description: "The node has recovered from its disk-constrained state."
kube_node_status_condition_pid_pressure#
groups:
- name: kube_state_metrics
  rules:
  - alert: kube_node_status_condition_pid_pressure(Kube state metrics)
    expr: 1 == sum by (node, instance) (kube_node_status_condition{condition="PIDPressure",status=~"true|unknown"})
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1213"
      jp1_pc_metricname: "kube_node_status_condition"
      jp1_pc_nodelabel: "{{ $labels.node }}"
      jp1_pc_exporter: "JPC Kube state metrics"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kube_state"
    annotations:
      jp1_pc_firing_description: "The node is in a PID assignment-constrained state. value={{ $value }} node"
      jp1_pc_resolved_description: "The node has recovered from its PID assignment-constrained state."
kube_namespace_cpu_percent_used#
groups:
- name: kubelet
  rules:
  - alert: kube_namespace_cpu_percent_used(Kubelet)
    expr: 80 < sum by (namespace, job) (rate(container_cpu_usage_seconds_total{name!=""}[2m])) * 100
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1222"
      jp1_pc_metricname: "container_cpu_usage_seconds_total"
      jp1_pc_nodelabel: "{{ $externalLabels.jp1_pc_prome_clustername }}"
      jp1_pc_exporter: "JPC Kubelet"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kubelet"
      instance: "{{ $externalLabels.jp1_pc_prome_hostname }}"
    annotations:
      jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%, namespace={{ $labels.namespace }}"
      jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%). namespace={{ $labels.namespace }}"
kube_namespace_memory_percent_used#
groups:
- name: kubelet
  rules:
  - alert: kube_namespace_memory_percent_used(Kubelet)
    expr: 80 < sum by (namespace, job) (container_memory_working_set_bytes and (container_spec_memory_limit_bytes{name!=""} > 0)) / sum by (namespace, job) ((container_spec_memory_limit_bytes{name!=""} > 0) and container_memory_working_set_bytes) * 100
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1223"
      jp1_pc_metricname: "container_memory_working_set_bytes, container_spec_memory_limit_bytes"
      jp1_pc_nodelabel: "{{ $externalLabels.jp1_pc_prome_clustername }}"
      jp1_pc_exporter: "JPC Kubelet"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kubelet"
      instance: "{{ $externalLabels.jp1_pc_prome_hostname }}"
    annotations:
      jp1_pc_firing_description: "Memory usage has exceeded the threshold (80%). value={{ $value }}%, namespace={{ $labels.namespace }}"
      jp1_pc_resolved_description: "Memory usage has fallen below the threshold (80%). namespace={{ $labels.namespace }}"
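In the expression above, the numerator sums the working-set bytes of containers that have a memory limit set, and the denominator sums those limits, so the result is the namespace's memory usage as a percentage of its limits. A worked illustration with hypothetical numbers:

# Hypothetical values for one namespace:
#   sum of container_memory_working_set_bytes (containers with a limit) = 1.7 GiB
#   sum of container_spec_memory_limit_bytes (same containers)          = 2.0 GiB
#   1.7 / 2.0 * 100 = 85, which exceeds the threshold of 80, so the
#   condition is met; the alert fires only if the value stays above 80%
#   for the full "for:" period (3m).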
kube_pod_cpu_percent_used_pod#
groups:
- name: kubelet
  rules:
  - alert: kube_pod_cpu_percent_used_pod(Kubelet)
    expr: 80 < sum by (pod, namespace, instance, job) (rate(container_cpu_usage_seconds_total{name!=""}[2m])) * 100
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1220"
      jp1_pc_metricname: "container_cpu_usage_seconds_total"
      jp1_pc_nodelabel: "{{ $labels.pod }}"
      jp1_pc_exporter: "JPC Kubelet"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kubelet"
    annotations:
      jp1_pc_firing_description: "CPU usage has exceeded the threshold (80%). value={{ $value }}%"
      jp1_pc_resolved_description: "CPU usage has fallen below the threshold (80%)."
kube_pod_memory_percent_used_pod#
groups:
- name: kubelet
  rules:
  - alert: kube_pod_memory_percent_used_pod(Kubelet)
    expr: 80 < sum by (pod, namespace, instance, job) (container_memory_working_set_bytes and (container_spec_memory_limit_bytes{name!=""} > 0)) / sum by (pod, namespace, instance, job) ((container_spec_memory_limit_bytes{name!=""} > 0) and container_memory_working_set_bytes) * 100
    for: 3m
    labels:
      jp1_pc_product_name: "/HITACHI/JP1/JPCCS2"
      jp1_pc_component: "/HITACHI/JP1/JPCCS/KUBERNETES/CONFINFO"
      jp1_pc_severity: "Error"
      jp1_pc_eventid: "1221"
      jp1_pc_metricname: "container_memory_working_set_bytes, container_spec_memory_limit_bytes"
      jp1_pc_nodelabel: "{{ $labels.pod }}"
      jp1_pc_exporter: "JPC Kubelet"
      jp1_pc_trendname: "kubernetes"
      job: "jpc_kubelet"
    annotations:
      jp1_pc_firing_description: "Memory usage has exceeded the threshold (80%). value={{ $value }}%"
      jp1_pc_resolved_description: "Memory usage has fallen below the threshold (80%)."
#
When defining multiple alerts on the same integrated agent host, do not specify the groups: element more than once, and do not specify the same group name in more than one name: element. In alert:, specify the alert name according to the following naming rule. If you do not follow this rule, the JP1 event is not created.
alert: metric-definition-name(exporter-name)any-value
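For example, the following alert name satisfies the naming rule. This is a hypothetical sketch based on the kube_pod_status_unknown sample above; the trailing suffix custom1 is an arbitrary value chosen by the user:

alert: kube_pod_status_unknown(Kube state metrics)custom1

Here, kube_pod_status_unknown is the metric definition name, Kube state metrics is the exporter name, and custom1 corresponds to the any-value part of the rule.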