Hitachi

JP1 Version 13 JP1/Integrated Management 3 - Manager Overview and System Design Guide


9.5.3 Performance monitoring function

The performance monitoring function consists of the Prometheus, Alertmanager, and Exporter add-on programs, and provides the following two functions:

Performance data and alerts sent to the integrated manager host can be viewed in the integrated operation viewer.

Organization of this subsection

(1) Communication function

(a) Communication protocols and authentication methods

The following shows the communication protocols and authentication methods used by the integrated agent.

Connection source → Connect to: Protocol (Authentication method)

  • Prometheus server → JP1/IM agent control base: HTTP (no authentication)

  • Alertmanager → JP1/IM agent control base: HTTP (no authentication)

  • Prometheus server → Alertmanager: HTTP (no authentication)

  • Prometheus server → Exporter: HTTP (no authentication)

  • Blackbox exporter → monitored target:

    HTTP/HTTPS: Basic Authentication, or no authentication

    HTTPS: server authentication (with or without client authentication)

    ICMP#: no authentication

  • Yet another cloudwatch exporter → Amazon CloudWatch: HTTPS (AWS IAM authentication)

#

ICMPv6 is not available.

(b) Network configuration

The integrated agent can be used in a network configuration with an IPv4-only environment, or in a network configuration that mixes IPv4 and IPv6 environments. In a mixed IPv4/IPv6 network configuration, only IPv4 communication is supported.

You can use the integrated agent in the following configurations:

Connection source → Connect to: Connection type

  • Prometheus server → JP1/IM agent control base: no proxy server

  • Alertmanager → JP1/IM agent control base: no proxy server

  • Prometheus server → Alertmanager: no proxy server

  • Prometheus server → Exporter: no proxy server

  • Blackbox exporter → monitoring targets (ICMP monitoring): no proxy server

  • Blackbox exporter → monitoring targets (HTTP monitoring): no proxy server; through a proxy server without authentication; through a proxy server with authentication

  • Yet another cloudwatch exporter → Amazon CloudWatch: no proxy server; through a proxy server without authentication; through a proxy server with authentication

The integrated agent transmits the following data:

Connection source → Connect to: Transmitted data

  • Prometheus server → JP1/IM agent control base: performance data in Protobuf format

  • Alertmanager → JP1/IM agent control base: alert information in JSON format#1

  • Prometheus server → Exporter: performance data in the Prometheus text-based format#2

  • Blackbox exporter → monitored target: response for each protocol

  • Yet another cloudwatch exporter → Amazon CloudWatch: CloudWatch data

#1

For more information, see the description of the message body for the request in "5.6.5 JP1 Event Translation" in the manual JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

#2

For more information about the Prometheus text-based format, see "5.23 API for scrape of Exporter Used by JP1/IM - Agent" in the manual JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

(2) Performance data collection function

The Prometheus server collects performance data from monitored targets. It provides the following two functions:

(a) Scrape function

The scrape function acquires performance data from monitoring targets via an Exporter.

When the Prometheus server accesses a specific URL of the Exporter, the Exporter retrieves the monitored performance data and returns it to the Prometheus server. This process is called scrape.

Scrapes are defined and executed in units of scrape jobs, each of which groups multiple scrapes that serve the same purpose. In JP1/IM - Agent, scrape definitions with the following scrape job names are set by default, according to the type of exporter.

Scrape job name: Scrape definition

prometheus: Scrape definition for Prometheus server

jpc_node: Scrape definition for Node exporter

jpc_windows: Scrape definition for Windows exporter

jpc_blackbox_http: Scrape definition for HTTP/HTTPS monitoring in Blackbox exporter

jpc_blackbox_icmp: Scrape definition for ICMP monitoring in Blackbox exporter

jpc_cloudwatch: Scrape definition for Yet another cloudwatch exporter

If you want to scrape your own exporter, you must add a scrape definition for each target exporter.
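As a sketch of what such an addition might look like (the job name, host, and port here are hypothetical; the keys shown are standard Prometheus scrape_configs keys, and the actual file to edit is the Prometheus configuration file jpc_prometheus_server.yml):

```yaml
# Hypothetical example: a scrape definition added to the
# scrape_configs section for a user-provided exporter.
scrape_configs:
  - job_name: user_custom_exporter      # set in the metric label as job="..."
    scrape_interval: 60s                # optional; overrides the common interval
    metrics_path: /metrics              # URL path the exporter exposes
    static_configs:
      - targets:
          - hostA.example.com:9256      # host:port of the exporter (hypothetical)
```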

The metrics obtained from an exporter by Prometheus server scraping depend on the type of exporter. For more information, see the explanation of the metric definition file for each exporter in the manual JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference (2. Definition File).

In addition to the metrics obtained from the exporter, the Prometheus server generates the following metrics each time scraping is performed.

up
This metric indicates "1" for a successful scrape and "0" for a failure, and can be used to monitor the operation of the exporter. A scrape failure may be caused by the host being stopped, the exporter being stopped, the exporter returning a status other than 200, or a communication error.

scrape_duration_seconds
A metric that indicates how long the scrape took. It is not used in normal operation; it is used for investigation when a scrape does not finish within the expected time.

scrape_samples_post_metric_relabeling
A metric that indicates the number of samples remaining after metric relabeling. It is not used in normal operation; it is used to check the amount of data when building the environment.

scrape_samples_scraped
A metric that indicates the number of samples returned by the scraped exporter. It is not used in normal operation; it is used to check the amount of data when building the environment.

scrape_series_added
A metric that shows the approximate number of newly created series. It is not used in normal operation.
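For illustration only, a Prometheus alerting rule of the following shape could watch the up metric (the group name, rule name, and five-minute threshold are assumptions, not JP1/IM - Agent defaults):

```yaml
# Hypothetical alerting-rule sketch: fire when a scrape target has been
# unreachable (up == 0) for five consecutive minutes.
groups:
  - name: exporter-health            # arbitrary group name
    rules:
      - alert: ExporterDown
        expr: up == 0                # 0 = scrape failed, 1 = scrape succeeded
        for: 5m                      # require the condition to persist
        labels:
          severity: warning
        annotations:
          summary: "Scrape target {{ $labels.instance }} is down"
```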

For information about how scrape is performed, see "5.23 API for scrape of Exporter Used by JP1/IM - Agent" in the manual JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference. An exporter that you want to scrape must be able to operate as described there.

The scrape definition method is as follows:

  • Scrape definitions are done in units of scrape jobs.

  • The scrape definition is described in the Prometheus configuration file (jpc_prometheus_server.yml).

  • To edit a scrape definition, download the Prometheus configuration file from the integrated operation viewer, edit it, and then upload it.

The following are the settings related to scrape definitions supported by JP1/IM - Agent.

Table 9‒21: Settings for scrape definitions supported by JP1/IM - Agent

Scrape job name (required)
Sets the name of the scrape job that Prometheus scrapes. You can specify multiple scrape job names. The specified scrape job name is set in the metric label as job="scrape job name".

Scrape destination (required)
Sets the specific URL of the exporter to be scraped. Only exporters on hosts where JP1/IM - Agent resides can be specified as scrape destinations. The server to be scraped is specified in the URL by host name; "localhost" cannot be used. The total number of scrape destinations specified across all scrape jobs is limited to 100.

Scrape parameters (optional)
Sets parameters to pass to the exporter when scraping. The contents that can be set differ depending on the type of exporter.

Scrape interval (optional)
Sets the scrape interval. You can set a scrape interval common to all scrape jobs and a scrape interval for each scrape job; if both are set, the scrape interval for each scrape job takes precedence. You can specify the following units: years, weeks, days, hours, minutes, seconds, or milliseconds.

Scrape timeout (optional)
Sets a timeout period for when scraping takes a long time. You can set a timeout period common to all scrape jobs and a timeout period for each scrape job; if both are set, the timeout period for each scrape job takes precedence.

Relabeling (optional)
Deletes unnecessary metrics and customizes labels. By using this feature to drop metrics that are not needed, you can reduce the amount of data sent to JP1/IM - Manager.
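A hedged sketch of how these settings map onto the Prometheus configuration file (the per-job values and the drop regex are hypothetical; scrape_interval, scrape_timeout, and metric_relabel_configs are standard Prometheus configuration keys):

```yaml
global:
  scrape_interval: 60s          # interval common to all scrape jobs
  scrape_timeout: 10s           # timeout common to all scrape jobs
scrape_configs:
  - job_name: jpc_node
    scrape_interval: 30s        # per-job value; takes precedence over the common one
    scrape_timeout: 5s          # per-job timeout; also takes precedence
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "node_network_.*"   # hypothetical: metrics not being monitored
        action: drop               # reduces data sent to JP1/IM - Manager
```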

An exporter that is scraped by the Prometheus server returns the scrape results in the Prometheus text-based format. The Prometheus text-based format is described below.

Text-based format basics

Start time: April 2014

Supported versions: Prometheus version 0.4.0 or later

Transmission format: HTTP

Character code: UTF-8 (the line feed code is \n)

Content-Type: text/plain; version=0.0.4 (if there is no version value, the exposition is treated as the latest text format version)

Content-Encoding: gzip

Advantages:

  • Human readable

  • Easy to assemble, especially for minimal cases (no nesting required)

  • Readable on a line-by-line basis (except for type hints and docstrings)

Constraints:

  • Redundancy

  • Since the type and docstring are not part of the syntax, there is little validation of the metric contract

  • Cost of parsing

Supported metric types:

  • Counter

  • Gauge

  • Histogram

  • Summary

  • Untyped

More information about Text-based format

The Prometheus text-based format is line-oriented.

Lines are separated by a line feed character (\n). The sequence \r\n is considered invalid.

The last line must end with a line feed character.

Blank lines are ignored.

Row Format

Within a line, tokens can be separated by any number of spaces or tabs; they must be separated by at least one if they would otherwise merge with the previous token.

Leading and trailing whitespace is ignored.

Comments, help text, and information

Lines with a # as the first non-whitespace character are comments.

Such a line is ignored unless the first token after # is HELP or TYPE.

These lines are treated as follows:

If the token is HELP, at least one more token (the metric name) is expected. All remaining tokens are considered the docstring for that metric name.

A HELP line can contain any UTF-8 string after the metric name. However, the backslash must be escaped as \\ and the line feed character as \n. For any given metric name, there can be only one HELP line.

If the token is TYPE, exactly two more tokens are expected. The first is the metric name; the second, one of counter, gauge, histogram, summary, or untyped, defines the type of the metric. There can be only one TYPE line for a given metric name, and the TYPE line must appear before the first sample of that metric.

If no TYPE line exists for a metric name, the type is set to untyped.

Write a sample (one per line) using the following EBNF:

 metric_name [
    "{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
 ] value [ timestamp ]
Sample Syntax
  • metric_name and label_name are subject to the usual Prometheus expression-language restrictions.

  • label_value can be any UTF-8 string. However, the backslash (\), double quote ("), and line feed must be escaped as \\, \", and \n, respectively.

  • value is a floating-point number as required by the Go ParseFloat() function. In addition to standard numerical values, NaN, +Inf, and -Inf are valid values: NaN means not a number, +Inf is positive infinity, and -Inf is negative infinity.

  • timestamp is an int64 (milliseconds since the epoch, 1970-01-01 00:00:00 UTC, excluding leap seconds) as required by the Go ParseInt() function, and is optional.

Grouping and Sorting

All lines for a given metric must be provided as one single group, with the optional HELP and TYPE lines first (in no particular order).

Beyond that, reproducible sorting in repeated expositions is recommended, although not required.

Each line must have a unique combination of metric name and labels. Otherwise, the ingestion behavior is undefined.

Histograms and Summaries

Because histograms and summary types are difficult to express in text format, the following rules apply:

  • The sample sum for a summary or histogram named x is given as a separate sample named x_sum.

  • The sample count for a summary or histogram named x is given as a separate sample named x_count.

  • Each quantile in the summary named x appears as another sample line with the same name x and labeled {quantile="y"}.

  • Each bucket count in the histogram named x appears as another sample line named x_bucket and labeled {le="y"} ( y is the bucket limit).

  • The histogram must have a bucket of {le="+Inf"}. Its value must be the same as the value of x_count.

  • For le or quantile labels, the histogram bucket and summary quantiles must appear in ascending order of the values for the labels.

Sample Text-based format

Here is a sample Prometheus metric exposition that contains comments, HELP and TYPE representations, histograms, summaries, and character escaping.

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000
 
# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9
 
# Minimalistic line:
metric_without_timestamp_and_labels 12.47
 
# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045
 
# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320
 
# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693

(b) Ability to obtain monitored operational information

This function acquires operation information (performance data) from the monitoring target. The process of collecting operational information is performed by a program called "Exporter".

In response to scrape requests sent from the Prometheus server to the Exporter, the Exporter collects operational information from the monitored target and returns the results to Prometheus.

The exporters shipped with JP1/IM - Agent must be scraped only by the Prometheus server of the JP1/IM - Agent on the same host. Do not scrape them from a Prometheus running on another host or one provided by the user.

This section describes the functions of each exporter included with JP1/IM - Agent.

(c) Blackbox exporter

Blackbox exporter is an exporter that sends simulated requests to monitored Internet services on the network and derives operation information from the responses. The supported communication protocols are HTTP, HTTPS, and ICMP.

When the Blackbox exporter receives a scrape request from the Prometheus server, it issues a service request such as an HTTP request to the monitored target and obtains the response time and response. It then summarizes the execution results as metrics and returns them to the Prometheus server.
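As a rough sketch of how such a probe is defined (the module name and values below are hypothetical; the modules/prober/timeout/http keys are standard Blackbox exporter configuration keys in jpc_blackbox_exporter.yml):

```yaml
# Hypothetical module definition in jpc_blackbox_exporter.yml:
# an HTTP probe that accepts only 2xx responses.
modules:
  http_2xx_check:                  # module name (hypothetical)
    prober: http                   # issue an HTTP request to the target
    timeout: 5s                    # per-probe timeout
    http:
      valid_status_codes: []       # empty = accept any 2xx status
      method: GET
```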

■ Main items to be acquired

The main retrieval items of Blackbox exporter are defined in Blackbox exporter metric definition file (default). For more information, see "Blackbox exporter metric definition file (metrics_blackbox_exporter.conf)" (2. Definition File) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.

You can add retrieved items to the metric definition file. The following are the metrics that can be specified in the PromQL statement described in the definition file.

All of the following metrics carry the labels instance (instance identification string) and job (job name); additional labels are noted for each metric. The prober for each metric is shown in parentheses.

probe_http_duration_seconds (prober: http)
The number of seconds taken per phase of the HTTP request. All redirects are added up.
Additional label: phase, which contains one of "resolve", "connect", "tls", "processing", or "transfer".

probe_http_content_length (prober: http)
Length of the HTTP content response.

probe_http_uncompressed_body_length (prober: http)
Length of the uncompressed response body.

probe_http_redirects (prober: http)
Number of redirects.

probe_http_ssl (prober: http)
Whether SSL was used for the final redirect (0: TLS/SSL was not used; 1: TLS/SSL was used).

probe_http_status_code (prober: http)
Value of the HTTP response status code. If redirection occurs, the final status code is the value of the metric; if no redirection is performed, the first status code received is the value of the metric.

probe_ssl_earliest_cert_expiry (prober: http)
UNIX time of the earliest-expiring SSL certificate.

probe_ssl_last_chain_expiry_timestamp_seconds (prober: http)
Expiration timestamp of the last certificate in the SSL chain. To monitor this metric, you must specify false for the insecure_skip_verify parameter in the tls_config settings of the Blackbox exporter configuration file (jpc_blackbox_exporter.yml), place the certificate, and specify the path of the certificate file in the appropriate parameter.

probe_ssl_last_chain_info (prober: http)
SSL leaf certificate information. This is the SHA256 hash value of the server certificate to be monitored; the hash value is set in the label fingerprint_sha256.
Additional label: fingerprint_sha256 (SHA256 fingerprint of the certificate).

probe_tls_version_info (prober: http)
TLS version used. The TLS version, such as "TLS 1.2", is set in the label version.
Additional label: version (TLS version).

probe_http_version (prober: http)
HTTP version of the probe response.

probe_failed_due_to_regex (prober: http)
Whether the probe failed due to a regular expression check on the response body or response headers (0: success; 1: failure).

probe_http_last_modified_timestamp_seconds (prober: http)
UNIX time of the Last-Modified HTTP response header.

probe_icmp_duration_seconds (prober: icmp)
The number of seconds taken per phase of the ICMP request.
Additional label: phase, which contains one of "resolve" (name resolution time), "setup" (time from completion of resolve to ICMP packet transmission), or "rtt" (time to obtain a response after setup).

probe_icmp_reply_hop_limit (prober: icmp)
Hop limit (TTL for IPv4) value.

probe_success (prober: --)
Whether the probe was successful (0: failure; 1: success).

probe_duration_seconds (prober: --)
The number of seconds it took for the probe to complete.

■ IP communication with monitored objects

Only IPv4 communication is supported.

■ Encrypted communication with monitored objects

HTTP monitoring enables encrypted communication using TLS. In this case, the Blackbox exporter acts as a TLS client to the monitored object (TLS server).

To use TLS-encrypted communication, configure the "tls_config" item in the Blackbox exporter configuration file (jpc_blackbox_exporter.yml). In addition, the following certificate and key files must be prepared.

File: Format

  • CA certificate file, client certificate file: a file encoding an X509 public key certificate in pkcs7 format in PEM format

  • Client certificate key file: a file encoding a private key in pkcs1 or pkcs8 format in PEM format#

#

Password-protected files cannot be used.
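Assuming those files are in place, a tls_config fragment might look like the following (the module name and file paths are placeholders; ca_file, cert_file, key_file, and insecure_skip_verify are standard Blackbox exporter tls_config keys):

```yaml
# Hypothetical tls_config fragment for an HTTPS module in
# jpc_blackbox_exporter.yml (paths are placeholders).
modules:
  https_check:
    prober: http
    http:
      tls_config:
        ca_file: /jp1ima/conf/cert/ca.pem        # CA certificate (PEM)
        cert_file: /jp1ima/conf/cert/client.pem  # client certificate (PEM)
        key_file: /jp1ima/conf/cert/client.key   # unencrypted private key (PEM)
        insecure_skip_verify: false              # keep server verification enabled
```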

The supported TLS versions and cipher suites are as follows.

TLS version: 1.0 to 1.3#

#

1.0 and 1.1 are not recommended. It is recommended that you review the settings on the monitored (TLS server) side.

Cipher suites:

  • "TLS_RSA_WITH_3DES_EDE_CBC_SHA" (up to TLS 1.2)

  • "TLS_RSA_WITH_AES_128_CBC_SHA" (up to TLS 1.2)

  • "TLS_RSA_WITH_AES_256_CBC_SHA" (up to TLS 1.2)

  • "TLS_RSA_WITH_AES_128_GCM_SHA256" (TLS 1.2 only)

  • "TLS_RSA_WITH_AES_256_GCM_SHA384" (TLS 1.2 only)

  • "TLS_AES_128_GCM_SHA256" (TLS 1.3 only)

  • "TLS_AES_256_GCM_SHA384" (TLS 1.3 only)

  • "TLS_CHACHA20_POLY1305_SHA256" (TLS 1.3 only)

  • "TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA" (up to TLS 1.2)

  • "TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA" (up to TLS 1.2)

  • "TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA" (up to TLS 1.2)

  • "TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA" (up to TLS 1.2)

  • "TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA" (up to TLS 1.2)

  • "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256" (TLS 1.2 only)

  • "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384" (TLS 1.2 only)

  • "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256" (TLS 1.2 only)

  • "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384" (TLS 1.2 only)

  • "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256" (TLS 1.2 only)

  • "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256" (TLS 1.2 only)

■ Timeout for collecting operation information

In a network environment where responses are slow even under normal conditions, you can still collect operation information by adjusting the timeout period.

On the Prometheus server, you can specify the scrape request timeout period in the "scrape_timeout" entry of the Prometheus configuration file (jpc_prometheus_server.yml). For more information, see the "scrape_timeout" entry in "Prometheus configuration file (jpc_prometheus_server.yml)" (2. Definition File) in the manual JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference.

In addition, the timeout period for connections from the Blackbox exporter to the monitoring target is 0.5 seconds shorter than the value specified in "scrape_timeout" above.

■ Certificate expiration

When collecting operation information by HTTPS monitoring, the exporter receives a certificate list (the server certificate and the chain of certificates certifying it) from the monitoring target.

The Blackbox exporter collects the expiration time (UNIX time) of the earliest-expiring certificate as the probe_ssl_earliest_cert_expiry metric.

Because the number of seconds remaining until expiration can be calculated as the probe_ssl_earliest_cert_expiry metric value minus the value of PromQL's time() function, you can use the features in 9.5.3(4) Performance data monitoring notification function to monitor certificates that are close to expiring.
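For illustration, that calculation could be expressed as a Prometheus alerting rule like the following (the group name, rule name, and 30-day threshold are assumptions, not product defaults):

```yaml
# Hypothetical alerting-rule sketch: warn when the earliest-expiring
# certificate of a probed target is less than 30 days from expiry.
groups:
  - name: certificate-expiry
    rules:
      - alert: CertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 30 * 24 * 60 * 60
        labels:
          severity: warning
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires in under 30 days"
```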

■ User-Agent value in HTTP request header when monitoring HTTP

The default value of User-Agent included in HTTP request header during HTTP monitoring is "Go-http-client/1.1". You can change the value of User-Agent in the setting of item "headers" in the Blackbox exporter configuration file (jpc_blackbox_exporter.yml).

The following is an example of changing the value of User-Agent to "My-Http-Client".

modules:
  http:
    prober: http
    http:
      headers:
        User-Agent: "My-Http-Client"

For more information, see the "headers" entry in "Blackbox exporter configuration file (jpc_blackbox_exporter.yml)" (2. Definition File) in the manual JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference.

■ About HTTP 1.1 Name-Based Virtual Host Support

The Blackbox exporter supports HTTP 1.1 name-based virtual hosts and TLS Server Name Indication (SNI). You can monitor virtual hosts, in which one HTTP/HTTPS server behaves as multiple HTTP/HTTPS servers.

■ About TLS Server Authentication and Client Authentication

In Blackbox exporter's HTTPS monitoring, server authentication is performed using the CA certificate described in item "ca_file" of the Blackbox exporter configuration file (jpc_blackbox_exporter.yml) and the server certificate sent by the server when HTTPS communication with the server starts (TLS handshake).

If the certificate sent is invalid (for example, the server name is incorrect, the certificate has expired, or a self-signed certificate is used), HTTPS communication cannot be started and monitoring fails.

In addition, when a request is made to send a certificate from the monitored server at the start of HTTPS communication (TLS handshake), the client certificate described in item "cert_file" of the Blackbox exporter configuration file (jpc_blackbox_exporter.yml) is sent to the monitored server.

If the server validates the sent certificate, recognizes it as invalid, and returns an error to the Blackbox exporter via the TLS protocol (or if communication cannot be continued due to a loss of communication, etc.), the monitoring fails.

For details on the verification contents related to the client certificate and the operation in the event of an error on the monitored server, check the specifications of the monitored server (or relay device such as a load balancer).

If you specify "true" for the "insecure_skip_verify" item in the Blackbox exporter configuration file (jpc_blackbox_exporter.yml), HTTPS communication can be started without errors even for an invalid certificate. However, in that case, the verification performed during server authentication is disabled, so fraudulent certificates cannot be detected.

For more information, see the "insecure_skip_verify" entry in "Blackbox exporter configuration file (jpc_blackbox_exporter.yml)" (2. Definition File) in the manual JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference.

■ About cookie information

The Blackbox exporter does not use cookie information sent from the monitored target in the next HTTP communication request.

■ About external resources referenced from content included in the response body of HTTP communication

In Blackbox exporter, external resources (subframes, images, etc.) referenced from the content included in the response body of HTTP communication are not included in the monitoring range.

■ About Monitoring of Content Included in HTTP Communication Response Body

Since the Blackbox exporter does not parse the content, the execution result and execution time based on the syntax (HTML, javascript, etc.) in the content included in the response body of HTTP communication are not reflected in the monitoring result.

■ Precautions when the monitoring destination of HTTP monitoring redirects with Basic authentication

If the Blackbox exporter's HTTP monitoring destination redirects with Basic authentication, the Blackbox exporter sends the same Basic authentication user name and password to both the redirect source and the redirect destination. Therefore, when Basic authentication is performed at both the redirect source and the redirect destination, the same user name and password must be set on both.
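A sketch of where that single credential pair is configured (the module name and values are placeholders; basic_auth under the http prober is a standard Blackbox exporter configuration key):

```yaml
# Hypothetical module with Basic authentication: the same credentials
# are presented to the redirect source and the redirect destination.
modules:
  http_basic_auth:
    prober: http
    http:
      basic_auth:
        username: monitor-user       # placeholder user name
        password: "secret-value"     # placeholder password
```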

(d) Node exporter

Node exporter is an exporter that can be embedded in a monitored Linux host to obtain operating information of a Linux host.

The Node exporter is installed on the same host as the Prometheus server, and upon a scrape request from the Prometheus server, it collects operational information from the Linux OS of the host and returns it to the Prometheus server.

It can collect, from inside the host, operational information such as memory and disk information that cannot be collected by monitoring from outside the host (external monitoring by URL or CloudWatch).
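As an illustrative sketch of how such in-host data is typically used (the group and record names are hypothetical; node_memory_MemTotal_bytes and node_memory_MemAvailable_bytes are standard Node exporter metrics):

```yaml
# Hypothetical recording-rule sketch: derive memory usage (%) from
# Node exporter metrics scraped by the Prometheus server.
groups:
  - name: node-derived
    rules:
      - record: node:memory_usage:percent
        expr: >
          100 * (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes
```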

■ Main items to be acquired

The main retrieval items of Node exporter are defined in Node exporter metric definition file (default). For more information, see "Node exporter metric definition file (metrics_node_exporter.conf)" (2. Definition File) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.

You can add retrieved items to the metric definition file. The following are the metrics that can be specified in the PromQL statement described in the definition file. For details of "Collector" in the table, refer to the description of "Collector" at the bottom of the table.

All of the following metrics carry the labels instance (instance identification string) and job (job name); additional labels are noted for each metric. The collector for each metric is shown in parentheses.

node_boot_time_seconds (collector: stat)
Last boot time, shown in UNIX time including microseconds.

node_context_switches_total (collector: stat)
Number of context switches (cumulative).

node_cpu_seconds_total (collector: cpu)
CPU seconds spent in each mode (cumulative).
Additional labels: cpu (CPU ID) and mode, which contains one of user, nice, system, idle, iowait, irq, soft, or steal.

node_disk_io_now (collector: diskstats)
Number of disk I/Os currently in progress.
Additional label: device (device name).

node_disk_io_time_seconds_total (collector: diskstats)
Seconds spent on disk I/O (cumulative).
Additional label: device (device name).

node_disk_read_bytes_total (collector: diskstats)
Number of bytes successfully read from disk (cumulative).
Additional label: device (device name).

node_disk_read_time_seconds_total (collector: diskstats)
Seconds taken to read from disk (cumulative).
Additional label: device (device name).

node_disk_reads_completed_total (collector: diskstats)
Number of successfully completed reads from disk (cumulative).
Additional label: device (device name).

node_disk_write_time_seconds_total (collector: diskstats)
Seconds taken to write to disk (cumulative).
Additional label: device (device name).

node_disk_writes_completed_total (collector: diskstats)
Number of successfully completed disk writes (cumulative).
Additional label: device (device name).

node_disk_written_bytes_total (collector: diskstats)
Number of bytes successfully written to disk (cumulative).
Additional label: device (device name).

node_filesystem_avail_bytes (collector: filesystem)
Number of file system bytes available to non-root users.
Additional labels: fstype (file system type), mountpoint (mount point).

node_filesystem_files (collector: filesystem)
Number of file nodes in the file system.
Additional labels: fstype (file system type), mountpoint (mount point).

node_filesystem_files_free (collector: filesystem)
Number of free file nodes in the file system.
Additional labels: fstype (file system type), mountpoint (mount point).

node_filesystem_free_bytes (collector: filesystem)
Number of bytes of free file system space.
Additional labels: fstype (file system type), mountpoint (mount point).

node_filesystem_size_bytes (collector: filesystem)
Number of bytes of file system capacity.
Additional labels: fstype (file system type), mountpoint (mount point).

node_intr_total (collector: stat)
Number of interrupts handled (cumulative).

node_load1 (collector: loadavg)
One-minute average of the number of jobs in the run queue.

node_load5 (collector: loadavg)
5-minute average of the number of jobs in the run queue.

node_load15 (collector: loadavg)
15-minute average of the number of jobs in the run queue.

node_memory_Active_file_bytes (collector: meminfo)
Number of bytes of recently used file cache memory. This is the Active(file) value in /proc/meminfo converted to bytes.

node_memory_Buffers_bytes (collector: meminfo)
Number of bytes in the file buffer. This is the Buffers value in /proc/meminfo converted to bytes.

node_memory_Cached_bytes (collector: meminfo)
Number of bytes of file read cache memory. This is the Cached value in /proc/meminfo converted to bytes.

node_memory_Inactive_file_bytes (collector: meminfo)
Number of bytes of file cache memory that have not been used recently. This is the Inactive(file) value in /proc/meminfo converted to bytes.

node_memory_MemAvailable_bytes (collector: meminfo)
Number of bytes of memory available to start a new application without swapping. This is the MemAvailable value in /proc/meminfo converted to bytes.

node_memory_MemFree_bytes (collector: meminfo)
Number of bytes of free memory. This is the MemFree value in /proc/meminfo converted to bytes.

node_memory_MemTotal_bytes (collector: meminfo)
Total number of bytes of memory. This is the MemTotal value in /proc/meminfo converted to bytes.

node_memory_SReclaimable_bytes (collector: meminfo)
Number of reclaimable bytes in the slab cache. This is the SReclaimable value in /proc/meminfo converted to bytes.

node_memory_SwapFree_bytes

meminfo

Number of bytes of free swap memory space

Note:

The value of SwapFree in /proc/meminfo converted to bytes.

instance: Instance identification string

job: Job name

node_memory_SwapTotal_bytes

meminfo

Bytes of total swap memory

Note:

This is the value of SwapTotal converted to bytes in /proc/meminfo.

instance: Instance identification string

job: Job name

node_netstat_Icmp6_InMsgs

netstat

Number of ICMPv6 messages received (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Icmp_InMsgs

netstat

Number of ICMPv4 messages received (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Icmp6_OutMsgs

netstat

Number of ICMPv6 messages sent (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Icmp_OutMsgs

netstat

Number of ICMPv4 messages sent (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Tcp_InSegs

netstat

Number of TCP packets received (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Tcp_OutSegs

netstat

Number of TCP packets sent (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Udp_InDatagrams

netstat

Number of UDP packets received (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Udp_OutDatagrams

netstat

Number of UDP packets sent (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_flags

netclass

A numeric value indicating the state of the interface

Note:

The value of /sys/class/net/[iface]/flags, expressed as a decimal number.

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_iface_link

netclass

Interface link index number

Note:

The value of /sys/class/net/[iface]/iflink.

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_mtu_bytes

netclass

Interface MTU value

Note:

The value of /sys/class/net/[iface]/mtu.

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_receive_bytes_total

netdev

Number of bytes received by the network device (cumulative value)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_receive_errs_total

netdev

Number of network device receive errors (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_receive_packets_total

netdev

Number of packets received by network devices (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_transmit_bytes_total

netdev

Number of bytes sent by the network device (cumulative value)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_transmit_colls_total

netdev

Number of transmit collisions for network devices (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_transmit_errs_total

netdev

Number of transmission errors for network devices (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_transmit_packets_total

netdev

Number of packets sent by network devices (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_time_seconds

time

Seconds of system time since the epoch (1970)

instance: Instance identification string

job: Job name

node_uname_info

uname

System information obtained by the uname system call

instance: Instance identification string

job: Job name

domainname: NIS and YP domain names

machine: Hardware Identifiers

nodename: Machine name on some implementation-defined network

release: Operating system release number (e.g. "2.6.28")

sysname: The name of the OS (e.g. "Linux")

version: Operating system version

node_vmstat_pswpin

vmstat

Number of page swap-ins (cumulative)

Note:

The value of the pswpin in /proc/vmstat.

instance: Instance identification string

job: Job name

node_vmstat_pswpout

vmstat

Number of page swap-outs (cumulative)

Note:

The value of pswpout in /proc/vmstat.

instance: Instance identification string

job: Job name

■ Collector

The Node exporter has a built-in collection process called a "collector" for each monitored resource such as CPU and memory.

If you want to add the metrics listed in the table above as acquisition fields, you must enable the collector corresponding to the metric you want to use. You can also disable collectors of metrics that you do not want to collect to suppress unnecessary collection.

Per-collector enable/disable can be specified in the Node exporter command line options. Specify the collector to enable with the "--collector.collector-name" option and the collector to disable with the "--no-collector.collector-name" option.

For Node exporter command-line options, see the manual "node_exporter command options" in "unit definition file (jpc_program-name.service)" (2. Definition file) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference".
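For example (a hypothetical invocation for illustration only; in a JP1/IM - Agent installation the options are set in the unit definition file, and the collector names shown are assumptions to be checked against the Node exporter documentation), enabling one collector and disabling another from the command line might look like this:

```shell
# Hypothetical example: enable the "systemd" collector and
# disable the "netclass" collector
node_exporter --collector.systemd --no-collector.netclass
```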

(e) Windows exporter

Windows exporter is an exporter that can be embedded in a monitored Windows host to obtain the operating information of that host.

Windows exporter is installed on the same host as the Prometheus server, and upon a scrape request from the Prometheus server, it collects operational information from the Windows OS of the host and returns it to the Prometheus server.

Operational information related to memory and disk, which cannot be collected by monitoring from outside the host (external monitoring by URL or CloudWatch), can be collected from inside the host.

■ Main items to be acquired

The main retrieval items of Windows exporter are defined in Windows exporter metric definition file (default). For more information, see "Windows exporter metric definition file (metrics_windows_exporter.conf)" (2.Definition File) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.

You can add retrieved items to the metric definition file. The following are the metrics that can be specified in the PromQL statement described in the definition file. For details of "Collector" in the table, refer to the description of "Collector" at the bottom of the table.

Metric Name

Collector

What to Get

Label

windows_cache_copy_read_hits_total

cache

Number of copy read requests that hit the cache (cumulative)

instance: Instance identification string

job: Job name

windows_cache_copy_reads_total

cache

Number of reads from the file system cache page (cumulative)

instance: Instance identification string

job: Job name

windows_cpu_time_total

cpu

Number of seconds of processor time spent per mode (cumulative)

instance: Instance identification string

job: Job name

core: Core ID

mode: Mode#

#

Contains one of the following:

  • "dpc"

  • "idle"

  • "interrupt"

  • "privileged"

  • "user"

windows_cs_physical_memory_bytes

cs

Number of bytes of the physical memory capacity

instance: Instance identification string

job: Job name

windows_logical_disk_idle_seconds_total

logical_disk

Number of seconds that the disk was idle (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_free_bytes

logical_disk

Number of bytes of unused disk space

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_read_bytes_total

logical_disk

Number of bytes transferred from disk during the read operation (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_read_seconds_total

logical_disk

Number of seconds that the disk was busy for read operations (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_reads_total

logical_disk

Number of read operations to disk (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_requests_queued

logical_disk

Number of requests queued on disk

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_size_bytes

logical_disk

Disk space bytes

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_write_bytes_total

logical_disk

Number of bytes transferred to disk during the write operation (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_write_seconds_total

logical_disk

Number of seconds that the disk was busy for write operations (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_writes_total

logical_disk

Number of disk write operations (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_memory_available_bytes

memory

Number of bytes of unused space in physical memory

Note:

The total of zero, free, and standby (cached) areas allocated to a process or immediately available to the system.

instance: Instance identification string

job: Job name

windows_memory_cache_bytes

memory

Number of bytes of physical memory used for file system caching

instance: Instance identification string

job: Job name

windows_memory_cache_faults_total

memory

Number of page faults in the file system cache (cumulative)

instance: Instance identification string

job: Job name

windows_memory_page_faults_total

memory

Number of times a page fault occurred (cumulative)

instance: Instance identification string

job: Job name

windows_memory_pool_nonpaged_allocs_total

memory

Number of times a nonpageable physical memory region was allocated

instance: Instance identification string

job: Job name

windows_memory_pool_paged_allocs_total

memory

Number of times a pageable physical memory region was allocated

instance: Instance identification string

job: Job name

windows_memory_swap_page_operations_total

memory

Number of pages read from or written to disk to resolve hard page faults (cumulative)

instance: Instance identification string

job: Job name

windows_memory_swap_pages_read_total

memory

Number of pages read from disk to resolve hard page faults (cumulative)

instance: Instance identification string

job: Job name

windows_memory_swap_pages_written_total

memory

Number of pages written to disk to resolve hard page faults (cumulative)

instance: Instance identification string

job: Job name

windows_memory_system_cache_resident_bytes

memory

Number of active system file cache bytes in physical memory

instance: Instance identification string

job: Job name

windows_memory_transition_faults_total

memory

The number of page faults resolved by recovering pages that were in use by other processes sharing the page, pages that were on the modified pages list or standby list, or pages that were written to disk (cumulative)

instance: Instance identification string

job: Job name

windows_net_bytes_received_total

net

Number of bytes received by the interface (cumulative)

Note:

If the NIC name contains characters other than half-width alphanumeric characters, these characters are converted to underscores and set in the NIC label.

instance: Instance identification string

job: Job name

device: Network Device Name

windows_net_bytes_sent_total

net

Number of bytes sent from the interface (cumulative)

Note:

If the NIC name contains characters other than half-width alphanumeric characters, these characters are converted to underscores and set in the NIC label.

instance: Instance identification string

job: Job name

device: Network Device Name

windows_net_bytes_total

net

Number of bytes received and transmitted by the interface (cumulative)

Note:

If the NIC name contains characters other than half-width alphanumeric characters, these characters are converted to underscores and set in the NIC label.

instance: Instance identification string

job: Job name

device: Network Device Name

windows_net_packets_sent_total

net

Number of packets sent by the interface (cumulative)

Note:

If the NIC name contains characters other than half-width alphanumeric characters, these characters are converted to underscores and set in the NIC label.

instance: Instance identification string

job: Job name

device: Network Device Name

windows_net_packets_received_total

net

Number of packets received by the interface (cumulative)

Note:

If the NIC name contains characters other than half-width alphanumeric characters, these characters are converted to underscores and set in the NIC label.

instance: Instance identification string

job: Job name

device: Network Device Name

windows_system_context_switches_total

system

Number of context switches (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

windows_system_processor_queue_length

system

Number of threads in the processor queue

instance: Instance identification string

job: Job name

device: Network Device Name

windows_system_system_calls_total

system

Number of times the process called the OS service routine (cumulative)

instance: Instance identification string

job: Job name

■ Collector

Windows exporter has a built-in collection process called a "collector" for each monitored resource such as CPU and memory.

If you want to add the metrics listed in the table above as acquisition fields, you must enable the collector corresponding to the metric you want to use. You can also disable collectors of metrics that you do not want to collect to suppress unnecessary collection.

Enable/disable for each collector can be specified with the "--collectors.enabled" option on the Windows exporter command line or in the item "collectors.enabled" in the Windows exporter configuration file (jpc_windows_exporter.yml).

For information about Windows exporter command-line options, see the manual "windows_exporter Command Options" in "service definition file (jpc_program-name_service.xml)" (2. Definition file) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference".

For information about Windows exporter configuration file entry "collectors.enabled", see "Windows exporter configuration file (jpc_windows_exporter.yml)" (2. Definition file) in the manual "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" and the entry "collectors".
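As an illustration (a hypothetical command line; the collector names shown are assumptions to be verified against the Windows exporter documentation and the jpc_windows_exporter.yml file), enabling only specific collectors might look like this:

```shell
# Hypothetical example: collect only with the cpu, memory,
# and logical_disk collectors
windows_exporter.exe --collectors.enabled "cpu,memory,logical_disk"
```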

(f) Yet another cloudwatch exporter

Yet another cloudwatch exporter is an exporter included in the integrated agent that uses Amazon CloudWatch to collect operating information for AWS services in the cloud.

Yet another cloudwatch exporter is installed on the same host as the Prometheus server. Upon a scrape request from the Prometheus server, it collects CloudWatch metrics obtained via the SDK provided by AWS (AWS SDK)# and returns them to the Prometheus server.

#

SDK provided by Amazon Web Services (AWS). Yet another cloudwatch exporter uses the AWS SDK for Go (V1). CloudWatch monitoring requires that Amazon CloudWatch support the AWS SDK for Go (V1).

This makes it possible to monitor services on which Node exporter or Windows exporter cannot be installed.

■ Main items to be acquired

The main retrieval items of Yet another cloudwatch exporter are defined in Yet another cloudwatch exporter metric definition file (default). For more information, see "Yet another cloudwatch exporter metric definition file (metrics_ya_cloudwatch_exporter.conf)" (2. Definition File) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.

■ CloudWatch metrics you can collect

You can collect metrics in the AWS namespaces that the Yet another cloudwatch exporter of JP1/IM - Agent supports for monitoring, as listed in "9.5.1(1)(k) Creating an IM Management Node for Yet another cloudwatch exporter".

Specify the metrics to collect by describing the AWS service name and CloudWatch metric name in the Yet another Cloudwatch Exporter configuration file (jpc_ya_cloudwatch_exporter.yml).

The following is an example of the description of the Yet another cloudwatch exporter configuration file when collecting CPUUtilization and DiskReadBytes for CloudWatch metrics for AWS/EC2 services.

discovery:
  exportedTagsOnMetrics:
    ec2:
      - jp1_pc_nodelabel
  jobs:
    - type: ec2
      regions:
        - ap-northeast-1
      period: 60
      length: 300
      delay: 60
      nilToZero: true
      searchTags:
        - key: jp1_pc_nodelabel
          value: .*
      metrics:
        - name: CPUUtilization
          statistics:
            - Maximum
        - name: DiskReadBytes
          statistics:
            - Maximum

For information about the contents of the Yet another cloudwatch exporter configuration file, see "Yet another cloudwatch exporter configuration file (jpc_ya_cloudwatch_exporter.yml)" (2. Definition File) in the "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.

You can also add new metrics to the Yet another cloudwatch exporter metrics definition file using the metrics you set in the Yet another cloudwatch exporter configuration file.

The metrics and labels specified in the PromQL statement described in the definition file conform to the following naming conventions:

- Naming conventions for Exporter metrics

Yet another cloudwatch exporter automatically converts CloudWatch metric names into exporter metric names according to the following rules. Metrics specified in PromQL statements are described using the exporter's metric name.

"aws_"#1+Namespace#2+"_"+CloudWatch_Metric #2+"_"+Statistic_Type#2

#1

Appended if the namespace does not begin with "aws_".

#2

Indicates the name you set in the Yet another cloudwatch exporter configuration file (jpc_ya_cloudwatch_exporter.yml). It is converted by the following rules:

  • It is converted from camel case notation to snake case notation.

    CamelCase is a notation that capitalizes word breaks, such as "CamelCase" or "camelCase."

    Snake case is a notation that separates words with "_", such as "snake_case".

  • The following symbols are converted to "_".

    whitespace, comma, tab, /, \, half-width period, -, :, =, full-width left double quotation mark, @, <, >

  • "%" is converted to "_percent".

- Exporter label naming conventions

Yet another cloudwatch exporter automatically converts CloudWatch dimension and tag names into exporter label names according to the following rules. Labels specified in PromQL statements are described using the exporter's label name.

  • For dimensions

    "dimension"+"_"+dimensions_name#

  • For tags

    "tag"+"_"+tag_name#

  • For custom tags

    "custom_tag_"+"_"+custom tag_name#

#

Indicates the name you set in the Yet another cloudwatch exporter configuration file (jpc_ya_cloudwatch_exporter.yml).
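As an informal sketch of the metric-name conversion rules above (illustrative code only, not part of the product; the function name to_exporter_metric and the exact regular expressions are assumptions), the conversion can be modeled as follows:

```python
import re

def to_exporter_metric(namespace: str, metric: str, statistic: str) -> str:
    """Sketch of the documented conversion:
    "aws_" + namespace + "_" + CloudWatch metric + "_" + statistic type,
    with camel case converted to snake case and symbols converted to "_"."""
    def snake(name: str) -> str:
        # Split camel case at word boundaries (CPUUtilization -> CPU_Utilization)
        name = re.sub(r'(.)([A-Z][a-z]+)', r'\1_\2', name)
        name = re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', name)
        # Convert (an ASCII subset of) the documented symbols to "_"
        name = re.sub(r'[ ,\t/\\.\-:=@<>]', '_', name)
        # "%" becomes "_percent"
        name = name.replace('%', '_percent')
        return name.lower()
    ns = snake(namespace)
    if not ns.startswith('aws_'):
        # Prepended when the namespace does not begin with "aws_"
        ns = 'aws_' + ns
    return ns + '_' + snake(metric) + '_' + snake(statistic)

# Example: AWS/EC2 CPUUtilization with the Maximum statistic
print(to_exporter_metric("AWS/EC2", "CPUUtilization", "Maximum"))
# → aws_ec2_cpu_utilization_maximum
```

The names actually exported should always be verified against the Yet another cloudwatch exporter metric definition file rather than derived from this sketch.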

■ About policies for IAM users in your AWS account

To connect to AWS CloudWatch, you must create a policy with the following permissions and assign it to an IAM user.

"tag:GetResources",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics",
"cloudwatch:ListMetrics"

For details on how to set JSON format, see "2.19.2(7)(b) Changing the settings for connecting to CloudWatch (Linux) (optional)" in "JP1/Integrated Management 3 - Manager Configuration Guide".
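A minimal sketch of such a policy in JSON format (the "Resource": "*" scope is an illustrative assumption; see the configuration guide section cited above for the supported settings):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "tag:GetResources",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*"
    }
  ]
}
```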

■ Environment-variable HTTPS_PROXY

This environment variable is specified when connecting from Yet another cloudwatch exporter to CloudWatch through a proxy. Only http URLs can be set in the HTTPS_PROXY environment variable. Note that Basic authentication is the only supported authentication method.

You can set the environment-variable HTTPS_PROXY to connect to AWS CloudWatch through proxies. The following shows an example configuration.

HTTPS_PROXY=http://username:password@proxy.example.com:5678

■ How to handle monitoring targets JP1/IM-Agent does not support

If a product or metric cannot be monitored by JP1/IM - Agent, you must collect it by another means, such as a user-defined Exporter.

(3) Centralized management of performance data

This function allows Prometheus server to store performance data collected from monitoring targets in the intelligent integrated management database of JP1/IM - Manager. It has the following features:

(a) Remote write function

This is a function in which the Prometheus server sends performance data collected from monitoring targets to an external database suitable for long-term storage. JP1/IM - Agent uses this function to send performance data to JP1/IM - Manager.

The following shows how to define a remote write.

  • Remote write definitions are described in the Prometheus server configuration file (jpc_prometheus_server.yml).

  • Download the Prometheus server configuration file from the integrated operation viewer, edit the remote write definition in a text editor, and then upload the file.

JP1/IM - Agent supports the following settings for defining remote write. For more information about the settings, see "Prometheus configuration file (jpc_prometheus_server.yml)" (2. definition file) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.

Table 9‒22: Settings for remote write definition supported by JP1/IM - Agent

Setting items

Description

Remote write destination

(required)

Set the endpoint URL for JP1/IM agent control base.

Remote write timeout period

(Optional)

You can set the timeout period if the remote write takes a long time.

Change it if the default value is not suitable.

Relabeling

(Optional)

You can remove unwanted metrics and customize labels.
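As an illustration only (the endpoint URL is a placeholder, and the key names url, remote_timeout, and write_relabel_configs are standard Prometheus remote write settings; confirm the supported keys in the Prometheus configuration file reference cited above), the three settings in Table 9-22 correspond to a remote_write section such as:

```yaml
remote_write:
  # Remote write destination: endpoint URL of the JP1/IM agent
  # control base (placeholder host and path)
  - url: http://<manager-host>:<port>/<write-endpoint>
    # Remote write timeout period (optional)
    remote_timeout: 30s
    # Relabeling (optional): keep only the metrics you need
    write_relabel_configs:
      - source_labels: [__name__]
        regex: node_cpu_seconds_total|node_memory_MemAvailable_bytes
        action: keep
```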

(4) Performance data monitoring notification function

This function allows Prometheus server to monitor performance data collected from monitoring targets at a threshold value and notify JP1/IM - Manager. It has three functions:

(a) Alert evaluation function

This function monitors performance data collected from monitoring targets at a threshold value.

Define alert rules to evaluate alerts, monitor performance data at thresholds, and notify alerts.

Alerts can be evaluated by comparing the time series data directly with the thresholds, or by comparing the thresholds with the results of formulas using PromQL#.

#

For more information about PromQL, see 2.7.4(4) About PromQL.
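For example (hypothetical expressions for illustration, not rules shipped with the product), a threshold can be applied either directly to a time series or to the result of a PromQL formula:

```promql
# Direct comparison of a time series with a threshold (less than 10 GiB free)
node_filesystem_avail_bytes < 10737418240

# Comparison of a formula result with a threshold (memory usage above 90%)
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 90
```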

For each time series, or for each data item produced as the result of a PromQL expression, an alert status is managed according to the evaluation, and notification actions are executed according to that status.

There are three alert states: "pending", "firing", and "resolved". When the conditions of an alert rule are first met, the alert enters the "pending" state. If the conditions continue to be met (without recovering) for the period of time set in the alert rule definition, the alert changes to the "firing" state. If the conditions are no longer met (recovered), or if the time series disappears, the alert changes to the "resolved" state.

The relationship between alert status and notification behavior is shown below.

Alert status

Description

Notification behavior

pending

A certain period of time has not passed since the alert rule was triggered.

Alerts are not notified.

firing

A certain period of time has passed since the alert rule was triggered.

An alert is notified.

resolved

It is no longer applicable to the alert rule.

  • When the alert recovers from the "firing" state, a recovery notification is sent.

  • When the alert recovers from the "pending" state, no recovery notification is sent.

The following shows how to define an alert rule.

  • Alert rule definitions are described in the alert configuration file (jpc_alerting_rules.yml) (definitions in any YAML format can also be described).

  • Before applying the created definition file to the environment, check its format and test the alert rules with the promtool command.

  • Download alert configuration file from integrated operation viewer, edit it in a text editor, change the definition of the alert rule, and then upload it.
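For example (the file paths are placeholders, and the promtool check rules and promtool test rules subcommands are standard promtool features; the unit-test file shown is an assumed user-created file), the format check and rule test can be run as:

```shell
# Check the syntax of the alert configuration file
promtool check rules jpc_alerting_rules.yml

# Run unit tests for the alert rules (test file is user-created)
promtool test rules alert_rules_test.yml
```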

The following settings apply to the alert rule definitions supported by JP1/IM - Agent. For more information about the settings, see "alert configuration file (jpc_alerting_rules.yml)" (2. definition file) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual. There is no default alert rule definition.

Table 9‒23: Settings for alert rule definitions supported by JP1/IM - Agent

Setting Item

Description

Alert Name (required)

Set the alert name.

Conditional expression (required)

Set the alert condition expression (threshold).

It can be configured using PromQL.

Waiting time (required)

Set the amount of time to wait after entering the "pending" state before changing to the "firing" state.

Change it if the default value is not suitable.

Label (required)

Set labels to add to alerts and recovery notifications.

In JP1/IM - Agent, a specific label must be set.

Annotation (required)

Set to store additional information such as alert description and URL link.

In JP1/IM - Agent, certain annotations must be set.

Labels and annotations can use the following variables:

Variable#

Description

$labels

A variable that holds the label key-value pairs for the alert instance. The label key can be one of the following labels:

  • When time series data is specified in the conditional expression for alert evaluation

    You can specify a label that the data retains.

  • When a PromQL expression is specified as the conditional expression for alert evaluation

    You can specify a label that is set as a result of the PromQL expression.

$value

A variable that holds the evaluation value of the alert instance.

In an abnormality notification, it expands to the value at the time the abnormality was detected.

In a recovery notification, it expands to the value at the time of the abnormality immediately before recovery (not the value at the time of recovery).

$externalLabels

This variable holds the label and value set in "external_labels" of item "global" in the Prometheus configuration file (jpc_prometheus_server.yml).

#

Variables are expanded by enclosing them in "{{" and "}}". The following is an example of how to use variables:

description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

■ Alert rule definition for converting to JP1 events

In order to convert the alert to be notified into a JP1 event on the JP1/IM - Manager side, the following information must be set in the alert rule definition.

Setting item

Value to set

Uses

name

Set any alert group definition name that is unique within the integrated agent.

Alert group definition name

alert

Set any alert definition name that is unique within the integrated agent.

Alert Definition Name

expr

Set the PromQL statement.

It is recommended to set the PromQL statement described in the metric definition file. This way, when the JP1 event occurs, you can display trend information in the Integrated Operation Viewer.

Abnormal conditions#

#

If the conditions are met, it is abnormal, and if the conditions are not met, it is normal.

labels.jp1_pc_product_name

Set "/HITACHI/JP1/JPCCS" as fixed.

Set to the product name of the JP1 event.

labels.jp1_pc_severity

Set one of the following:

  • Emergency

  • Alert

  • Critical

  • Error

  • Warning

  • Notice

  • Information

  • Debug

Set to JP1 event severity#.

#

This value is set as the severity of the abnormality JP1 event. The severity of the recovery JP1 event is set to Information.

labels.jp1_pc_eventid

Set any value in the range of 0~1FFF,7FFF8000~7FFFFFFF.

Set to the event ID of the JP1 event.

labels.jp1_pc_metricname

Set the metric name.

For Yet another cloudwatch exporter, be sure to specify it. The JP1 event is associated with the IM management node of the AWS namespace corresponding to the metric name (the first metric name, if multiple metric names are specified separated by commas).

Set to the metric name of the JP1 event.

For Yet another cloudwatch exporter, it is also used to correlate JP1 events.

annotations.jp1_pc_firing_description

Specify the value to be set for the message of the JP1 event when the abnormal condition of the alert is satisfied.

If the value is 1,024 bytes or longer, only the first 1,023 bytes are set.

If the specification is omitted, the message content of the JP1 event is "The alert is firing. (alert = alert name)".

You can also specify variables to embed job names and evaluation values. If a variable is used, the first 1,024 bytes of the expanded message are valid.

It is set to the message of the JP1 event.

annotations.jp1_pc_resolved_description

Specify the value to be set for the JP1 event message when the abnormal condition of the alert is no longer satisfied.

If the value is 1,024 bytes or longer, only the first 1,023 bytes are set.

If the specification is omitted, the content of the message in the JP1 event is "The alert is resolved. (alert = alert name)".

You can also specify variables to embed job names and evaluation values. If a variable is used, the first 1,024 bytes of the expanded message are valid.

It is set to the message of the JP1 event.

For examples of alert definitions, see the chapter describing the metric definition file for each Exporter in the manual JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference (2. Definition File), and the "Alert Definition Example" explanations in the tables describing the initial settings of each metric.

For information about the properties of the corresponding JP1 events, see "3.2.3 List of JP1 events issued by JP1/IM - Agent" in the manual JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference.

■ How to operate in combination with trending-related functions

By combining the PromQL statement described in the metric definition file with the PromQL statement evaluated by the alert evaluation function, and by describing the metric name of the corresponding trend data in annotations.jp1_pc_firing_description and annotations.jp1_pc_resolved_description of the alert definition in the alert configuration file, you can check the past changes and current value of the performance value evaluated by the alert on the [Trends] tab of the integrated operation viewer when the JP1 event for the alert is issued.

For details about the PromQL statements defined for the trend-display-related functions, see 9.5.1(4) Return of trend data.

For example, to have Node exporter monitor CPU usage and notify you when CPU usage exceeds 80%, create an alert configuration file (alert definition) and a metric definition file as shown in the following examples.

  • Example of description of alert configuration file (alert definition)

    groups:
      - name: node_exporter
        rules:
        - alert: cpu_used_rate(Node exporter)
          expr: 80 < (avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="system"}[2m])) + avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="user"}[2m]))) * 100
          for: 3m
          labels:
            jp1_pc_product_name: "/HITACHI/JP1/JPCCS"
            jp1_pc_severity: "Error"
            jp1_pc_eventid: "0301"
            jp1_pc_metricname: "node_cpu_seconds_total"
          annotations:
            jp1_pc_firing_description: "CPU utilization exceeded threshold (80%).value={{ $value }}%"
            jp1_pc_resolved_description: "CPU usage has dropped below the threshold (80%)."
  • Example of description of metric definition

    [
      {
        "name":"cpu_used_rate",
        "default":true,
        "promql":"(avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode=\"system\"}[2m]) and $jp1im_TrendData_labels) + avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode=\"user\"}[2m]) and $jp1im_TrendData_labels)) * 100",
        "resource_en":{
          "category":"platform_unix",
          "label":"CPU used rate",
          "description":"CPU usage.It also indicates the average value per processor. [Units: %]",
          "unit":"%"
        },
        "resource_ja":{
          "category":"platform_unix",
          "label":"CPU Usage",
          "description":"CPU Utilization (%). It is also an average percentage of each processor.",
          "unit":"%"
        }
      }
    ]

    When the condition of the PromQL statement specified in expr of the alert definition is satisfied and the JP1 event for the alert is issued, the message "CPU utilization exceeded threshold (80%).value=performance-value%" is set as the message of the JP1 event. From this message, users can open the "CPU used rate" trend information and check the past changes and current value of CPU usage.

■ Behavior when the service is stopped

If the Prometheus server or Alertmanager service is stopped, the JP1 event for the alert is not issued. In addition, if the Prometheus server and Alertmanager services are running but the Exporter for which an alert is in the abnormal state stops due to a failure, the alert returns to the normal state and a normal JP1 event is issued.

■ Behavior when the service is restarted

If the Prometheus server, Alertmanager, or an Exporter service is restarted while an alert is in the abnormal or normal state, and the alert status after the restart is the same as before the restart, no JP1 event is issued.

■ Considerations for spikes in performance data

Performance data can momentarily jump to abnormally large, small, or negative values. Such sudden changes in performance data are commonly called "spikes." In many cases, even if a spike momentarily produces an abnormal value, the value immediately returns to normal and does not need to be treated as an abnormality. A spike can also occur momentarily when the performance data is reset, such as when the OS is restarted.

When monitoring metrics of such performance data, consider suppressing detection of these momentary anomalies by specifying "for" (the grace period before an alert is treated as abnormal) in the alert rule definition.
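
As a sketch of this consideration (the alert name, metric, threshold, and 5-minute grace period below are illustrative, not values prescribed by JP1/IM - Agent), an alert rule using "for" looks like this:

```yaml
groups:
  - name: node_exporter
    rules:
      - alert: memory_used_rate(Node exporter)
        # Fires only if the condition has held continuously for 5 minutes,
        # so a momentary spike does not raise an abnormal JP1 event.
        expr: 90 < (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
        for: 5m
        labels:
          jp1_pc_severity: "Warning"
```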

(b) Alert forwarder

This function sends a notification when the alert status becomes "firing" or "resolved" after the Prometheus server evaluates the alert.

The Prometheus server sends alerts one at a time, and the sent alerts are forwarded to JP1/IM - Manager (Intelligent Integrated Management Base) via Alertmanager. Alerts are also sent one at a time when they are retried.

Alerts sent to JP1/IM - Manager are basically sent in the order in which they occurred, but the order may change when multiple alert rules meet their conditions at the same time, or when a transmission error occurs and alerts are resent. However, because the alert information includes the time of occurrence, you can still determine the order in which the alerts occurred.

In addition, if the abnormal condition continues for 7 days, an alert will be re-notified.

The following shows how to define the notification destination of the alert.

  • Alert destinations are described in both the Prometheus configuration file (jpc_prometheus_server.yml) and the Alertmanager configuration file (jpc_alertmanager.yml).

    In the Prometheus configuration file, specify the coexisting Alertmanager as the notification destination of the Prometheus server. In the Alertmanager configuration file, specify the JP1/IM agent control base as the notification destination of Alertmanager.

  • Download each configuration file from the integrated operation viewer, edit it in a text editor to change the alert notification destination definitions, and then upload it.

The following table shows the settings related to the definition of Prometheus server notification destinations supported by JP1/IM - Agent. For details about the settings, see "Prometheus configuration file (jpc_prometheus_server.yml)" (2. Definition File) in the manual JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference.

Table 9‒24: Settings for defining notification destinations for Prometheus server supported by JP1/IM - Agent

Setting items

Description

Notification destination (required)

Configure the notification destination Alertmanager.

If a host name or IP address is specified for --web.listen-address in the Alertmanager command-line options, change localhost to the host name or IP address specified in --web.listen-address.

  • For physical host environments

    Specify the coexisting Alertmanager.

  • For clustered environment

    Specifies the Alertmanager that runs on the logical host.

Label setting (optional)

You can add labels. Configure as needed.
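
For illustration, the notification-destination part of the Prometheus configuration file takes the standard Prometheus alerting form sketched below (the port is written as a placeholder; use the port on which Alertmanager actually listens in your environment):

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Coexisting Alertmanager; replace the placeholder
            # with the actual Alertmanager listen port.
            - localhost:<port-number>
```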

The following table shows the settings for the Alertmanager notification destinations that JP1/IM - Agent supports. For details about the settings, see "Alertmanager configuration file (jpc_alertmanager.yml)" (2. Definition File) in the manual JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference.

Table 9‒25:  Settings for defining Alertmanager notification destinations supported by JP1/IM - Agent

Setting items

Description

Webhook settings (required)

Set the endpoint URL for JP1/IM agent control base.
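
As a sketch (the receiver name and endpoint URL below are placeholders, not values defined by this manual), a webhook setting in the Alertmanager configuration file takes the standard Alertmanager receiver form:

```yaml
route:
  receiver: 'jp1_webhook'
receivers:
  - name: 'jp1_webhook'
    webhook_configs:
      # Placeholder for the endpoint URL of the JP1/IM agent control base.
      - url: 'http://localhost:<port-number>/<endpoint-path>'
```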

(c) Notification suppression function

This function suppresses the notifications described in 9.5.3(4)(b) Alert forwarder. It includes:

  • Silence function

    Use this if you temporarily do not want to be notified of certain alerts.

■ Silence function

This function temporarily suppresses specific notifications. You can configure it so that alerts occurring during temporary maintenance are not reported. Unlike when a common exclusion condition of JP1/IM - Manager is used, the notification suppression function does not notify JP1/IM - Manager at all.

While silence is enabled, you are not notified when the alert status changes. When silence is disabled, if the alert status has changed compared with its status before silence was enabled, a notification is sent.

The following two examples show when a notification is or is not sent:

Figure 9‒22: Cases where the state is different before and after disabling silence

[Figure]

The above figure shows an example in which the alert status is "abnormal" when silence is enabled, the alert status changes to "normal" while silence is enabled, and then silence is disabled.

When the alert changes to "normal", no notification is sent because silence is enabled. When silence is disabled, the alert status has changed from "abnormal" (before silence was enabled) to "normal", so a "normal" notification is sent.

Figure 9‒23: Cases where the state is the same before and after disabling silence

[Figure]

The above figure shows an example in which, while silence was enabled, the alert status changed to "normal" once and then back to "abnormal", after which silence was disabled.

When silence is disabled, no notification is sent because the alert status is "abnormal", the same as before silence was enabled.

If an alert transmission fails and is being retried, and silence is enabled so that the alert is suppressed, the alert is not retried.

- How to Configure silence

Configuring silence (enabling or disabling it) and retrieving the current silence settings are performed via the REST API (no GUI is provided).

In addition, when configuring silence, the machine from which you operate must be able to communicate with the Alertmanager port on the integrated agent host.

For details about the REST APIs used to configure silence and to obtain the current silence settings, see "5.21.3 Obtain list of Alertmanager silence", "5.21.4 Create Alertmanager silence", and "5.21.5 Expiration of Alertmanager silence" in the manual JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference.
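
For orientation, the standard Alertmanager silence API accepts a JSON body of the shape sketched below. The matcher value, times, and user shown here are hypothetical, and the exact request format of the JP1/IM - Manager APIs listed above should be taken from that manual:

```json
{
  "matchers": [
    { "name": "alertname", "value": "cpu_used_rate(Node exporter)", "isRegex": false }
  ],
  "startsAt": "2025-01-01T00:00:00Z",
  "endsAt": "2025-01-01T06:00:00Z",
  "createdBy": "jp1admin",
  "comment": "Suppress CPU alerts during planned maintenance"
}
```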