9.5.3 Performance monitoring function
The performance monitoring function consists of the add-on programs Prometheus, Alertmanager, and Exporter, and provides the following two functions:
- A function that retrieves performance data through the Exporter and sends it to the Integrated Manager host
- A function that monitors thresholds on the acquired performance data and, when a condition is met, issues an alert to JP1/IM - Manager
Performance data and alerts sent to the Integrated Manager host can be viewed in the integrated operation viewer.
- Organization of this subsection
(1) Communication function
(a) Communication protocols and authentication methods
The following shows the communication protocols and authentication methods used by the integrated agent.

Connection source | Connect to | Protocol | Authentication method
---|---|---|---
Prometheus server | JP1/IM agent control base | HTTP | No authentication
Alertmanager | JP1/IM agent control base | HTTP | No authentication
Prometheus server | Alertmanager | HTTP | No authentication
Prometheus server | Exporter | HTTP | No authentication
Blackbox exporter | Monitored target | HTTP | Basic authentication / No authentication
Blackbox exporter | Monitored target | HTTPS | Basic authentication or no authentication, with server authentication (with or without client authentication)
Blackbox exporter | Monitored target | ICMP# | No authentication
Yet another cloudwatch exporter | Amazon CloudWatch | HTTPS | AWS IAM authentication

#: ICMPv6 is not available.
(b) Network configuration
The integrated agent can be used in a network configuration with only an IPv4 environment, or in a network configuration where IPv4 and IPv6 environments are mixed. In a mixed IPv4/IPv6 configuration, only IPv4 communication is supported.
You can use the integrated agent in the following configurations:

Connection source | Connect to | Connection type
---|---|---
Prometheus server | JP1/IM agent control base | No proxy server
Alertmanager | JP1/IM agent control base | No proxy server
Prometheus server | Alertmanager | No proxy server
Prometheus server | Exporter | No proxy server
Blackbox exporter | Monitoring targets (ICMP monitoring) | No proxy server
Blackbox exporter | Monitoring targets (HTTP monitoring) | No proxy server / Through a proxy server without authentication / Through a proxy server with authentication
Yet another cloudwatch exporter | Amazon CloudWatch | No proxy server / Through a proxy server without authentication / Through a proxy server with authentication
The integrated agent transmits the following data:

Connection source | Connect to | Transmitted data
---|---|---
Prometheus server | JP1/IM agent control base | Performance data in Protobuf format
Alertmanager | JP1/IM agent control base | Alert information in JSON format#1
Prometheus server | Exporter | Prometheus text-based performance data#2
Blackbox exporter | Monitored target | Response for each protocol
Yet another cloudwatch exporter | Amazon CloudWatch | CloudWatch data
#1: For details, see the description of the message body of the request in "5.6.5 JP1 Event Translation" in the manual "JP1/Integrated Management 3 - Manager Command, Definition File and API Reference".

#2: For details on the Prometheus text-based format, see "5.23 API for scrape of Exporter Used by JP1/IM - Agent" in the manual "JP1/Integrated Management 3 - Manager Command, Definition File and API Reference".
(2) Performance data collection function
The performance data collection function collects performance data from monitoring targets. It has the following two functions:
- Scrape function (Prometheus server)
- Function to acquire operation information of monitoring targets (Exporter)
(a) Scrape function
The scrape function is a function by which the Prometheus server acquires the performance data of monitoring targets via the Exporter.
When the Prometheus server accesses a specific URL of the Exporter, the Exporter retrieves the performance data of the monitoring target and returns it to the Prometheus server. This process is called a scrape.
Scrapes are executed and defined in units of scrape jobs, each of which groups scrapes that serve the same purpose. In JP1/IM - Agent, the following scrape definitions are set by default, with scrape job names that depend on the type of exporter.
Scrape job name | Scrape definition
---|---
prometheus | Scrape definition for Prometheus server
jpc_node | Scrape definition for Node exporter
jpc_windows | Scrape definition for Windows exporter
jpc_blackbox_http | Scrape definition for HTTP/HTTPS monitoring in Blackbox exporter
jpc_blackbox_icmp | Scrape definition for ICMP monitoring in Blackbox exporter
jpc_cloudwatch | Scrape definition for Yet another cloudwatch exporter
If you want to scrape your own exporter, you must add a scrape definition for each target exporter.
The metrics obtained from the Exporter by a Prometheus server scrape depend on the type of Exporter. For details, see the description of the metric definition file for each Exporter in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" (2. Definition File).
In addition to the metrics obtained from the exporter, the Prometheus server generates the following metrics each time a scrape is performed.
Metric name | Description
---|---
up | Indicates "1" when the scrape succeeded and "0" when it failed, and can therefore be used to monitor the operation of the exporter. A scrape can fail because the host is stopped, the exporter is stopped, the exporter returns a status other than 200, or a communication error occurs.
scrape_duration_seconds | Indicates how long the scrape took. Not used in normal operation; used to investigate cases where a scrape does not finish within the expected time.
scrape_samples_post_metric_relabeling | Indicates the number of samples remaining after metric relabeling. Not used in normal operation; used to check the amount of data when building the environment.
scrape_samples_scraped | Indicates the number of samples returned by the scraped exporter. Not used in normal operation; used to check the amount of data when building the environment.
scrape_series_added | Indicates the approximate number of newly created series. Not used in normal operation.
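For example, the up metric can be used in a PromQL condition to detect a stopped exporter. The following expression is illustrative, not a default JP1/IM - Agent definition:

```
up == 0
```

This expression returns a result for each scrape target only while scraping of that target is failing, so it is a simple basis for alerting on exporter outages.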
For the requirements that an exporter must satisfy to be scraped, see "5.23 API for scrape of Exporter Used by JP1/IM - Agent" in the manual "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference". An exporter to be scraped must operate as described there.
The scrape definition method is as follows:
- Scrape definitions are written in units of scrape jobs.
- Scrape definitions are described in the Prometheus configuration file (jpc_prometheus_server.yml).
- To edit a scrape definition, download the Prometheus configuration file from the integrated operation viewer, edit it, and then upload it.
The following are the settings related to scrape definitions supported by JP1/IM - Agent.
Setting item | Description
---|---
Scrape job name (required) | Sets the name of the scrape job that Prometheus scrapes. You can specify multiple scrape job names. The specified scrape job name is set in the metric label as job="scrape job name".
Scrape destination (required) | Sets the specific URL of the exporter to be scraped. Only exporters on hosts where JP1/IM - Agent resides can be specified as scrape destinations. The server to be scraped is specified in the URL by host name; "localhost" cannot be used. The total number of scrape destinations specified across all scrape jobs is limited to 100.
Scrape parameters (optional) | Sets parameters to pass to the Exporter when scraping. The settable contents differ depending on the type of exporter.
Scrape interval (optional) | Sets the scrape interval. You can set an interval common to all scrape jobs and an interval for each scrape job. If both are set, the interval for each scrape job takes precedence. The following units can be specified: years, weeks, days, hours, minutes, seconds, or milliseconds.
Scrape timeout (optional) | Sets a timeout period for when scraping takes a long time. You can set a timeout period common to all scrape jobs and a timeout period for each scrape job. If both are set, the timeout period for each scrape job takes precedence.
Relabeling (optional) | Deletes unnecessary metrics and customizes labels. By using this function to delete unnecessary metrics, you can reduce the amount of data sent to JP1/IM - Manager.
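As an illustration of how these settings map onto the Prometheus configuration file, the following is a minimal sketch using Prometheus's standard scrape_configs syntax. The job name my_exporter, the host name hostA, the port number, and the dropped-metric pattern are placeholders of ours, not JP1/IM - Agent defaults:

```yaml
scrape_configs:
  - job_name: my_exporter          # scrape job name (required)
    scrape_interval: 60s           # scrape interval for this job (optional)
    scrape_timeout: 10s            # scrape timeout for this job (optional)
    params:
      module: [http_2xx]           # scrape parameters passed to the exporter (optional)
    static_configs:
      - targets:
          - hostA:20715            # scrape destination; a host name, not "localhost"
    metric_relabel_configs:        # relabeling (optional): drop unneeded metrics
      - source_labels: [__name__]
        regex: unwanted_metric_.*
        action: drop
```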
The result of a scrape is returned by the Exporter in the Prometheus text-based format. The Prometheus text-based format is described below.
- Text-based format basics

Item | Description
---|---
Start time | April 2014
Supported versions | Prometheus version 0.4.0 or later
Transmission format | HTTP
Character code | UTF-8. The line feed code is \n.
Content-Type | text/plain; version=0.0.4. If there is no version value, it is treated as the latest text format version.
Content-Encoding | gzip
Advantages | Human readable; easy to assemble, especially for minimal cases (no nesting required); readable on a line-by-line basis (except for type hints and docstrings)
Constraints | Redundancy; since the type and docstring are not part of the syntax, there is little validation of the metric contract; cost of parsing
Supported metric types | Counter, Gauge, Histogram, Summary, Untyped
- More information about the text-based format

The Prometheus text-based format is line oriented. Lines are separated by a line feed character (\n); a carriage return plus line feed (\r\n) is invalid. The last line must end with a line feed character. Blank lines are ignored.
- Line format

Within a line, tokens can be separated by any number of spaces or tabs; a token must be separated from the previous token by at least one of them. Leading and trailing whitespace is ignored.
- Comments, help text, and type information

A line whose first non-whitespace character is # is a comment. It is ignored unless the first token after # is HELP or TYPE. Such lines are handled as follows:
If the token is HELP, at least one more token is expected, which is the metric name. All remaining tokens are considered the docstring for that metric name. A HELP line may contain any sequence of UTF-8 characters after the metric name, but the backslash and the line feed characters must be escaped as \\ and \n, respectively. Only one HELP line may exist for any given metric name.
If the token is TYPE, exactly two more tokens are expected. The first is the metric name, and the second is either counter, gauge, histogram, summary, or untyped, and defines the type of that metric. Only one TYPE line may exist for a given metric name, and it must appear before the first sample for that metric. If there is no TYPE line for a metric name, the type is set to untyped.
Write a sample (one per line) using the following EBNF:
```
metric_name [ "{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}" ] value [ timestamp ]
```
- Sample syntax

- metric_name and label_name carry the usual Prometheus expression language restrictions.
- label_value can be any sequence of UTF-8 characters, but the backslash (\), double quote ("), and line feed characters must be escaped as \\, \", and \n, respectively.
- value is a floating-point number as required by Go's ParseFloat() function. In addition to standard numerical values, NaN, +Inf, and -Inf are valid values: NaN is not a number, +Inf is positive infinity, and -Inf is negative infinity.
- timestamp is an int64 (milliseconds since the epoch, that is, 1970-01-01 00:00:00 UTC, excluding leap seconds), as required by Go's ParseInt() function, and is optional.
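To make the EBNF above concrete, here is a small illustrative parser sketch in Python. It is not part of JP1/IM - Agent or Prometheus; the function name parse_sample_line and the regular expression are ours, and the sketch ignores some edge cases (such as escaped quotes inside label values being unescaped):

```python
import re

# Regex approximating the sample-line EBNF above:
#   metric_name [ "{" label_name "=" '"' label_value '"' ... "}" ] value [ timestamp ]
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric_name
    r'(?:\{(?P<labels>.*)\})?'               # optional {label="value",...} part
    r'\s+(?P<value>\S+)'                     # value (may be NaN, +Inf, -Inf)
    r'(?:\s+(?P<ts>-?\d+))?\s*$'             # optional int64 millisecond timestamp
)

def parse_sample_line(line):
    """Split one text-format sample line into (name, labels, value, timestamp)."""
    m = _SAMPLE_RE.match(line)
    if m is None:
        raise ValueError("not a valid sample line: %r" % line)
    labels = {}
    if m.group("labels"):
        for pair in re.finditer(r'(\w+)="((?:\\.|[^"\\])*)"', m.group("labels")):
            labels[pair.group(1)] = pair.group(2)
    value = float(m.group("value"))          # float() accepts NaN, +Inf, -Inf
    ts = int(m.group("ts")) if m.group("ts") else None
    return m.group("name"), labels, value, ts
```

For example, parsing the line `http_requests_total{method="post",code="200"} 1027 1395066363000` yields the metric name, the two labels, the value 1027.0, and the timestamp.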
- Grouping and sorting

All lines for a given metric must be provided as one single group, with the optional HELP and TYPE lines first (in no particular order). Reproducible sorting of repeated expositions is recommended, but not required. Each line must have a unique combination of metric name and labels; otherwise, the ingestion behavior is undefined.
- Histograms and summaries

Because the histogram and summary types are difficult to represent in the text format, the following conventions apply:
- The sample sum for a summary or histogram named x is given as a separate sample named x_sum.
- The sample count for a summary or histogram named x is given as a separate sample named x_count.
- Each quantile of a summary named x is given as a separate sample line with the same name x and a label {quantile="y"}.
- Each bucket count of a histogram named x is given as a separate sample line with the name x_bucket and a label {le="y"}, where y is the upper bound of the bucket.
- A histogram must have a bucket with {le="+Inf"}, whose value must be identical to the value of x_count.
- The buckets of a histogram and the quantiles of a summary must appear in ascending order of their le or quantile label values.
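As an illustration of the rules above, the following Python sketch (our code, not part of JP1/IM - Agent) renders a histogram named x into the text format, emitting x_bucket lines in ascending le order, the mandatory {le="+Inf"} bucket, and the matching x_sum and x_count lines:

```python
def render_histogram(name, buckets, total_sum, total_count):
    """Render a histogram in the Prometheus text-based format.

    buckets maps each finite bucket upper bound (le) to its cumulative count;
    the mandatory {le="+Inf"} bucket is emitted with total_count, which must
    equal the value of the x_count sample.
    """
    lines = []
    for le, count in sorted(buckets.items()):      # le values in ascending order
        lines.append('%s_bucket{le="%g"} %d' % (name, le, count))
    lines.append('%s_bucket{le="+Inf"} %d' % (name, total_count))
    lines.append('%s_sum %g' % (name, total_sum))
    lines.append('%s_count %d' % (name, total_count))
    return "\n".join(lines) + "\n"                 # exposition must end with \n
```

Rendering the http_request_duration_seconds data from the sample exposition below reproduces its bucket, sum, and count lines.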
- Sample text-based format

Here is a sample Prometheus metric exposition that contains comments, HELP and TYPE lines, histograms, summaries, and character escaping.
```
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000

# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693
```
(b) Ability to obtain monitored operational information
This function acquires operation information (performance data) from the monitoring target. The process of collecting operational information is performed by a program called "Exporter".
In response to scrape requests sent from the Prometheus server to the Exporter, the Exporter collects operational information from the monitored target and returns the results to Prometheus.
Exporters shipped with JP1/IM - Agent are scraped only by the Prometheus server of the JP1/IM - Agent on the same host. Do not scrape them from a Prometheus provided on another host or by the user.
This section describes the functions of each exporter included with JP1/IM - Agent.
(c) Blackbox exporter
Blackbox exporter is an exporter that sends simulated requests to monitored Internet services on the network and obtains operation information from the responses. The supported communication protocols are HTTP, HTTPS, and ICMP.
When the Blackbox exporter receives a scrape request from the Prometheus server, it issues a service request (such as an HTTP request) to the monitoring target and obtains the response and the response time. The execution results are then summarized as metrics and returned to the Prometheus server.
■ Main items to be acquired
The main retrieval items of Blackbox exporter are defined in Blackbox exporter metric definition file (default). For more information, see "Blackbox exporter metric definition file (metrics_blackbox_exporter.conf)" (2. Definition File) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.
You can add retrieved items to the metric definition file. The following are the metrics that can be specified in the PromQL statement described in the definition file.
Metric name | Prober | What to get | Label
---|---|---|---
probe_http_duration_seconds | http | Number of seconds taken per phase of the HTTP request | instance: Instance identification string, job: Job name, phase: Phase#
probe_http_content_length | http | HTTP content response length | instance: Instance identification string, job: Job name
probe_http_uncompressed_body_length | http | Uncompressed response body length | instance: Instance identification string, job: Job name
probe_http_redirects | http | Number of redirects | instance: Instance identification string, job: Job name
probe_http_ssl | http | Whether SSL was used for the final redirect | instance: Instance identification string, job: Job name
probe_http_status_code | http | HTTP response status code value | instance: Instance identification string, job: Job name
probe_ssl_earliest_cert_expiry | http | UNIX time of the earliest expiring SSL certificate | instance: Instance identification string, job: Job name
probe_ssl_last_chain_expiry_timestamp_seconds | http | Expiration timestamp of the last certificate in the SSL chain | instance: Instance identification string, job: Job name
probe_ssl_last_chain_info | http | SSL leaf certificate information | instance: Instance identification string, job: Job name, fingerprint_sha256: SHA256 fingerprint of the certificate
probe_tls_version_info | http | TLS version used | instance: Instance identification string, job: Job name, version: TLS version
probe_http_version | http | HTTP version of the probe response | instance: Instance identification string, job: Job name
probe_failed_due_to_regex | http | Whether the probe failed due to a regular expression check on the response body or response headers | instance: Instance identification string, job: Job name
probe_http_last_modified_timestamp_seconds | http | UNIX time of the Last-Modified HTTP response header | instance: Instance identification string, job: Job name
probe_icmp_duration_seconds | icmp | Number of seconds taken per phase of an ICMP request | instance: Instance identification string, job: Job name, phase: Phase#
probe_icmp_reply_hop_limit | icmp | Hop limit (TTL for IPv4) value | instance: Instance identification string, job: Job name
probe_success | -- | Whether the probe was successful | instance: Instance identification string, job: Job name
probe_duration_seconds | -- | Number of seconds it took for the probe to complete | instance: Instance identification string, job: Job name
■ IP communication with monitored objects
Only IPv4 communication is supported.
■ Encrypted communication with monitored objects
HTTP monitoring enables encrypted communication using TLS. In this case, the Blackbox exporter acts as a TLS client to the monitored object (TLS server).
When using encrypted communication using TLS, specify it in item "tls_config" in the Blackbox exporter configuration file (jpc_blackbox_exporter.yml). In addition, the following certificate and key files must be prepared.
File | Format
---|---
CA certificate file | A file in which an X509 public key certificate in pkcs7 format is encoded in PEM format
Client certificate file | A file in which an X509 public key certificate in pkcs7 format is encoded in PEM format
Client certificate key file | A file in which a private key in pkcs1 or pkcs8 format is encoded in PEM format#
The supported TLS versions and cipher suites are as follows.

Item | Scope of support
---|---
TLS version | 1.0 to 1.3#
Cipher suites |
■ Timeout for collecting operation information
In a network environment where responses are slow even under normal conditions, operation information can still be collected by adjusting the timeout period.
On the Prometheus server, you can specify the scrape request timeout period in the "scrape_timeout" entry of the Prometheus configuration file (jpc_prometheus_server.yml). For more information, see the "scrape_timeout" entry under "Prometheus configuration file (jpc_prometheus_server.yml)" (2. Definition File) in the manual "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference".
In addition, the timeout period for connections from the Blackbox exporter to the monitoring target is 0.5 seconds shorter than the value specified in "scrape_timeout" above.
■ Certificate expiration
When operation information is collected by HTTPS monitoring, the exporter receives a certificate list (the server certificate and the chain of certificates that certify it) from the monitoring target.
The Blackbox exporter collects the expiration time (UNIX time) of the earliest expiring certificate as the probe_ssl_earliest_cert_expiry metric.
You can also use the performance data monitoring notification function described in 9.5.3(4) to monitor certificates that are nearing expiration, because the number of seconds remaining until expiration can be calculated as the probe_ssl_earliest_cert_expiry metric value minus the value of PromQL's time() function.
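For example, the remaining certificate lifetime check described above can be written as a PromQL expression such as the following, where the 30-day threshold is illustrative, not a default:

```
probe_ssl_earliest_cert_expiry - time() < 30 * 24 * 60 * 60
```

This expression returns a result only for targets whose earliest-expiring certificate has fewer than 30 days of validity remaining.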
■ User-Agent value in HTTP request header when monitoring HTTP
The default value of User-Agent included in HTTP request header during HTTP monitoring is "Go-http-client/1.1". You can change the value of User-Agent in the setting of item "headers" in the Blackbox exporter configuration file (jpc_blackbox_exporter.yml).
The following is an example of changing the value of User-Agent to "My-Http-Client".
```yaml
modules:
  http:
    prober: http
    http:
      headers:
        User-Agent: "My-Http-Client"
```
For more information, see the "headers" entry under "Blackbox exporter configuration file (jpc_blackbox_exporter.yml)" (2. Definition File) in the manual "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference".
■ About HTTP 1.1 Name-Based Virtual Host Support
The Blackbox exporter supports HTTP 1.1 name-based virtual hosts and TLS Server Name Indication (SNI). You can monitor virtual hosts, in which a single HTTP/HTTPS server behaves as multiple HTTP/HTTPS servers.
■ About TLS Server Authentication and Client Authentication
In Blackbox exporter's HTTPS monitoring, server authentication is performed using the CA certificate described in item "ca_file" of the Blackbox exporter configuration file (jpc_blackbox_exporter.yml) and the server certificate sent by the server when HTTPS communication with the server starts (TLS handshake).
If the sent certificate is invalid (for example, the server name is wrong, the certificate has expired, or a self-signed certificate is used), HTTPS communication cannot be started and monitoring fails.
In addition, when a request is made to send a certificate from the monitored server at the start of HTTPS communication (TLS handshake), the client certificate described in item "cert_file" of the Blackbox exporter configuration file (jpc_blackbox_exporter.yml) is sent to the monitored server.
If the server validates the sent certificate, determines that it is invalid, and returns an error to the Blackbox exporter via the TLS protocol (or if communication cannot be continued, for example because the connection is lost), the monitoring fails.
For details on the verification contents related to the client certificate and the operation in the event of an error on the monitored server, check the specifications of the monitored server (or relay device such as a load balancer).
If you specify "true" in item "insecure_skip_verify" in the Blackbox exporter configuration file (jpc_blackbox_exporter.yml), HTTPS communication can be started without an error even if an invalid certificate would otherwise be detected during server authentication. However, in that case, the verification operation performed for server authentication is disabled.
For more information, see the "insecure_skip_verify" entry under "Blackbox exporter configuration file (jpc_blackbox_exporter.yml)" (2. Definition File) in the manual "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference".
■ About cookie information
The Blackbox exporter does not use cookie information sent from the monitored target in the next HTTP communication request.
■ About external resources referenced from content included in the response body of HTTP communication
In Blackbox exporter, external resources (subframes, images, etc.) referenced from the content included in the response body of HTTP communication are not included in the monitoring range.
■ About Monitoring of Content Included in HTTP Communication Response Body
Because the Blackbox exporter does not parse content, execution results and execution times based on the syntax (HTML, JavaScript, and so on) of content included in the response body of HTTP communication are not reflected in the monitoring results.
■ Precautions when the monitoring destination of HTTP monitoring redirects with Basic authentication
If the Blackbox exporter's HTTP monitoring destination redirects with Basic authentication, the Blackbox exporter sends the same Basic authentication username and password to the redirect source and destination. Therefore, when performing Basic authentication on both the redirect source and the redirect destination, the same user name and password must be set on the redirect source and the redirect destination.
(d) Node exporter
Node exporter is an exporter that is installed on a monitored Linux host to obtain operation information of that Linux host.
The Node exporter is installed on the same host as the Prometheus server, and upon a scrape request from the Prometheus server, it collects operational information from the Linux OS of the host and returns it to the Prometheus server.
From inside the host, it can collect operation information related to memory and disks that cannot be collected by monitoring from outside the host (external monitoring through URLs or CloudWatch).
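For example, once Node exporter metrics are being scraped, host CPU usage can be derived with a standard PromQL query such as the following. This is an illustrative expression, not a JP1/IM - Agent default definition:

```
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```

This computes, per host, the percentage of CPU time spent in non-idle modes over the last five minutes, using the cumulative node_cpu_seconds_total counter listed below.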
■ Main items to be acquired
The main retrieval items of Node exporter are defined in Node exporter metric definition file (default). For more information, see "Node exporter metric definition file (metrics_node_exporter.conf)" (2. Definition File) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.
You can add retrieved items to the metric definition file. The following are the metrics that can be specified in the PromQL statement described in the definition file. For details of "Collector" in the table, refer to the description of "Collector" at the bottom of the table.
Metric Name |
Collector |
What to Get |
Label |
---|---|---|---|
node_boot_time_seconds |
stat |
Last boot time
|
instance: Instance identification string job: Job name |
node_context_switches_total |
stat |
Number of times a context switch has been made (cumulative) |
instance: Instance identification string job: Job name |
node_cpu_seconds_total |
cpu |
CPU seconds spent in each mode (cumulative) |
instance: Instance identification string job: Job name cpu: cpuid mode: Mode#
|
node_disk_io_now |
diskstats |
Number of disk I/Os currently in progress |
instance: Instance identification string job: Job name device: Device name |
node_disk_io_time_seconds_total |
diskstats |
Seconds spent on disk I/O (cumulative) |
instance: Instance identification string job: Job name device: Device name |
node_disk_read_bytes_total |
diskstats |
Number of bytes successfully read from disk (cumulative) |
instance: Instance identification string job: Job name device: Device name |
node_disk_read_time_seconds_total |
diskstats |
Seconds took to read from disk (cumulative value) |
instance: Instance identification string job: Job name device: Device name |
node_disk_reads_completed_total |
diskstats |
Number of successfully completed reads from disk (cumulative) |
instance: Instance identification string job: Job name device: Device name |
node_disk_write_time_seconds_total |
diskstats |
Seconds took to write to disk (cumulative value) |
instance: Instance identification string job: Job name device: Device name |
node_disk_writes_completed_total |
diskstats |
Number of successfully completed disk writes (cumulative) |
instance: Instance identification string job: Job name device: Device name |
node_disk_written_bytes_total |
diskstats |
Number of bytes successfully written to disk (cumulative) |
instance: Instance identification string job: Job name device: Device name |
node_filesystem_avail_bytes |
filesystem |
Number of file system bytes available to non-root users |
instance: Instance identification string job: Job name fstype: File System Type mountpoint: Mount Point |
node_filesystem_files |
filesystem |
Number of file nodes in the file system |
instance: Instance identification string job: Job name fstype: File System Type mountpoint: Mount Point |
node_filesystem_files_free |
filesystem |
Number of free file nodes in the file system |
instance: Instance identification string job: Job name fstype: File System Type mountpoint: Mount Point |
node_filesystem_free_bytes |
filesystem |
Number of bytes of free file system space |
instance: Instance identification string job: Job name fstype: File System Type mountpoint: Mount Point |
node_filesystem_size_bytes |
filesystem |
Number of bytes in file system capacity |
instance: Instance identification string job: Job name fstype: File System Type mountpoint: Mount Point |
node_intr_total |
stat |
Number of interrupts handled (cumulative) |
instance: Instance identification string job: Job name |
node_load1 |
loadavg |
One-minute average of the number of jobs in the run queue |
instance: Instance identification string job: Job name |
node_load15 |
loadavg |
15-minute average of the number of jobs in the run queue |
instance: Instance identification string job: Job name |
node_load5 |
loadavg |
5-minute average of the number of jobs in the run queue |
instance: Instance identification string job: Job name |
node_memory_Active_file_bytes |
meminfo |
Bytes of recently used file cache memory
|
instance: Instance identification string job: Job name |
node_memory_Buffers_bytes |
meminfo |
Number of bytes in the file buffer
|
instance: Instance identification string job: Job name |
node_memory_Cached_bytes |
meminfo |
Number of bytes in file read cache memory
|
instance: Instance identification string job: Job name |
node_memory_Inactive_file_bytes |
meminfo |
Number of bytes of file cache memory that have not been used recently
|
instance: Instance identification string job: Job name |
node_memory_MemAvailable_bytes |
meminfo |
The number of bytes of memory available to start a new application without swapping
|
instance: Instance identification string job: Job name |
node_memory_MemFree_bytes |
meminfo |
Number of bytes of free memory
|
instance: Instance identification string job: Job name |
node_memory_MemTotal_bytes |
meminfo |
Total amount of bytes of memory
|
instance: Instance identification string job: Job name |
node_memory_SReclaimable_bytes |
meminfo |
Number of bytes in the Slab cache that can be reclaimed
|
instance: Instance identification string job: Job name |
node_memory_SwapFree_bytes |
meminfo |
Number of bytes of free swap memory space
|
instance: Instance identification string job: Job name |
node_memory_SwapTotal_bytes |
meminfo |
Bytes of total swap memory
|
instance: Instance identification string job: Job name |
node_netstat_Icmp6_InMsgs |
netstat |
Number of ICMPv6 messages received (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_netstat_Icmp_InMsgs |
netstat |
Number of ICMPv4 messages received (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_netstat_Icmp6_OutMsgs |
netstat |
Number of ICMPv6 messages sent (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_netstat_Icmp_OutMsgs |
netstat |
Number of ICMPv4 messages sent (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_netstat_Tcp_InSegs |
netstat |
Number of TCP packets received (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_netstat_Tcp_OutSegs |
netstat |
Number of TCP packets sent (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_netstat_Udp_InDatagrams |
netstat |
Number of UDP packets received (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_netstat_Udp_OutDatagrams |
netstat |
Number of UDP packets sent (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_network_flags |
netclass |
A numeric value indicating the state of the interface |
instance: Instance identification string job: Job name device: Network Device Name |
node_network_iface_link |
netclass |
Interface serial number |
instance: Instance identification string job: Job name device: Network Device Name |
node_network_mtu_bytes |
netclass |
Interface MTU value |
instance: Instance identification string job: Job name device: Network Device Name |
node_network_receive_bytes_total |
netdev |
Number of bytes received by the network device (cumulative value) |
instance: Instance identification string job: Job name device: Network Device Name |
node_network_receive_errs_total |
netdev |
Number of network device receive errors (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_network_receive_packets_total |
netdev |
Number of packets received by network devices (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_network_transmit_bytes_total |
netdev |
Number of bytes sent by the network device (cumulative value) |
instance: Instance identification string job: Job name device: Network Device Name |
node_network_transmit_colls_total |
netdev |
Number of transmit collisions for network devices (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_network_transmit_errs_total |
netdev |
Number of transmission errors for network devices (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_network_transmit_packets_total |
netdev |
Number of packets sent by network devices (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
node_time_seconds |
time |
Seconds of system time since the epoch (1970) |
instance: Instance identification string job: Job name |
node_uname_info |
uname |
System information obtained by the uname system call |
instance: Instance identification string job: Job name domainname: NIS and YP domain names machine: Hardware Identifiers nodename: Machine name in some network defined at implementation time release: Operating system release number (e.g. "2.6.28") sysname: The name of the OS (e.g. "Linux") version: Operating system version |
node_vmstat_pswpin |
vmstat |
Number of page swap-ins (cumulative) |
instance: Instance identification string job: Job name |
node_vmstat_pswpout |
vmstat |
Number of page swap-outs (cumulative) |
instance: Instance identification string job: Job name |
■ Collector
The Node exporter has a built-in collection process called a "collector" for each monitored resource such as CPU and memory.
To collect any of the metrics listed in the table above, you must enable the collector corresponding to that metric. You can also disable collectors for metrics that you do not want to collect, to suppress unnecessary collection.
Each collector can be enabled or disabled through the Node exporter command-line options. Specify a collector to enable with the "--collector.collector-name" option and a collector to disable with the "--no-collector.collector-name" option.
For Node exporter command-line options, see the manual "node_exporter command options" in "unit definition file (jpc_program-name.service)" (2. Definition file) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference".
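As an illustration of the options described above, the following hedged example enables one collector and disables another. The invocation and the collector names shown are examples only; the actual start options are defined in the unit definition file.

```shell
# Illustrative only: enable the "systemd" collector and disable the
# "netstat" collector when starting Node exporter. The collector names
# here are examples; use the names of the collectors you actually need.
node_exporter --collector.systemd --no-collector.netstat
```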
(e) Windows exporter
Windows exporter is an exporter that can be embedded in the monitored Windows host and obtain the operating information of the Windows host.
Windows exporter is installed on the same host as the Prometheus server, and upon a scrape request from the Prometheus server, it collects operational information from the Windows OS of the host and returns it to the Prometheus server.
From inside the host, it can collect operational information such as memory and disk usage that cannot be obtained by monitoring from outside the host (external monitoring by URL or CloudWatch).
■ Main items to be acquired
The main retrieval items of Windows exporter are defined in Windows exporter metric definition file (default). For more information, see "Windows exporter metric definition file (metrics_windows_exporter.conf)" (2.Definition File) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.
You can add retrieved items to the metric definition file. The following are the metrics that can be specified in the PromQL statement described in the definition file. For details of "Collector" in the table, refer to the description of "Collector" at the bottom of the table.
Metric Name |
Collector |
What to Get |
Label |
---|---|---|---|
windows_cache_copy_read_hits_total |
cache |
Number of copy read requests that hit the cache (cumulative) |
instance: Instance identification string job: Job name |
windows_cache_copy_reads_total |
cache |
Number of reads from the file system cache page (cumulative) |
instance: Instance identification string job: Job name |
windows_cpu_time_total |
cpu |
Number of seconds of processor time spent per mode (cumulative) |
instance: Instance identification string job: Job name core: coreid mode: Mode# |
windows_cs_physical_memory_bytes |
cs |
Number of bytes of the physical memory capacity |
instance: Instance identification string job: Job name |
windows_logical_disk_idle_seconds_total |
logical_disk |
Number of seconds that the disk was idle (cumulative) |
instance: Instance identification string job: Job name volume: Volume name |
windows_logical_disk_free_bytes |
logical_disk |
Number of bytes of unused disk space |
instance: Instance identification string job: Job name volume: Volume name |
windows_logical_disk_read_bytes_total |
logical_disk |
Number of bytes transferred from disk during the read operation (cumulative) |
instance: Instance identification string job: Job name volume: Volume name |
windows_logical_disk_read_seconds_total |
logical_disk |
Number of seconds that the disk was busy for read operations (cumulative) |
instance: Instance identification string job: Job name volume: Volume name |
windows_logical_disk_reads_total |
logical_disk |
Number of read operations to disk (cumulative) |
instance: Instance identification string job: Job name volume: Volume name |
windows_logical_disk_requests_queued |
logical_disk |
Number of requests queued on disk |
instance: Instance identification string job: Job name volume: Volume name |
windows_logical_disk_size_bytes |
logical_disk |
Disk space bytes |
instance: Instance identification string job: Job name volume: Volume name |
windows_logical_disk_write_bytes_total |
logical_disk |
Number of bytes transferred to disk during the write operation (cumulative) |
instance: Instance identification string job: Job name volume: Volume name |
windows_logical_disk_write_seconds_total |
logical_disk |
Number of seconds that the disk was busy for write operations (cumulative) |
instance: Instance identification string job: Job name volume: Volume name |
windows_logical_disk_writes_total |
logical_disk |
Number of disk write operations (cumulative) |
instance: Instance identification string job: Job name volume: Volume name |
windows_memory_available_bytes |
memory |
Number of bytes of unused space in physical memory |
instance: Instance identification string job: Job name |
windows_memory_cache_bytes |
memory |
Number of bytes of physical memory used for file system caching |
instance: Instance identification string job: Job name |
windows_memory_cache_faults_total |
memory |
Number of page faults in the file system cache (cumulative) |
instance: Instance identification string job: Job name |
windows_memory_page_faults_total |
memory |
Number of times a page fault occurred (cumulative) |
instance: Instance identification string job: Job name |
windows_memory_pool_nonpaged_allocs_total |
memory |
Number of times a nonpageable physical memory region was allocated |
instance: Instance identification string job: Job name |
windows_memory_pool_paged_allocs_total |
memory |
Number of times a pageable physical memory region was allocated |
instance: Instance identification string job: Job name |
windows_memory_swap_page_operations_total |
memory |
Number of pages read from or written to disk to resolve hard page faults (cumulative) |
instance: Instance identification string job: Job name |
windows_memory_swap_pages_read_total |
memory |
Number of pages read from disk to resolve hard page faults (cumulative) |
instance: Instance identification string job: Job name |
windows_memory_swap_pages_written_total |
memory |
Number of pages written to disk to resolve hard page faults (cumulative) |
instance: Instance identification string job: Job name |
windows_memory_system_cache_resident_bytes |
memory |
Number of active system file cache bytes in physical memory |
instance: Instance identification string job: Job name |
windows_memory_transition_faults_total |
memory |
The number of page faults resolved by recovering pages that were in use by other processes sharing the page, pages that were on the modified pages list or standby list, or pages that were written to disk (cumulative) |
instance: Instance identification string job: Job name |
windows_net_bytes_received_total |
net |
Number of bytes received by the interface (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
windows_net_bytes_sent_total |
net |
Number of bytes sent from the interface (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
windows_net_bytes_total |
net |
Number of bytes received and transmitted by the interface (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
windows_net_packets_sent_total |
net |
Number of packets sent by the interface (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
windows_net_packets_received_total |
net |
Number of packets received by the interface (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
windows_system_context_switches_total |
system |
Number of context switches (cumulative) |
instance: Instance identification string job: Job name device: Network Device Name |
windows_system_processor_queue_length |
system |
Number of threads in the processor queue |
instance: Instance identification string job: Job name device: Network Device Name |
windows_system_system_calls_total |
system |
Number of times the process called the OS service routine (cumulative) |
instance: Instance identification string job: Job name |
■ Collector
Windows exporter has a built-in collection process called a "collector" for each monitored resource such as CPU and memory.
If you want to add the metrics listed in the table above as acquisition fields, you must enable the collector corresponding to the metric you want to use. You can also disable collectors of metrics that you do not want to collect to suppress unnecessary collection.
Enable/disable for each collector can be specified with the "--collectors.enabled" option on the Windows exporter command line or in the item "collectors.enabled" in the Windows exporter configuration file (jpc_windows_exporter.yml).
For information about Windows exporter command-line options, see the manual "windows_exporter Command Options" in "service definition file (jpc_program-name_service.xml)" (2. Definition file) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference".
For information about the Windows exporter configuration file entry "collectors.enabled", see the entry "collectors" in "Windows exporter configuration file (jpc_windows_exporter.yml)" (2. Definition file) in the "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.
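As a sketch of the setting described above, an excerpt from the Windows exporter configuration file might look as follows. The collector list is an example, assuming only the collectors that provide the metrics in the table above are wanted.

```yaml
# Hypothetical excerpt from jpc_windows_exporter.yml: collect only the
# collectors that provide the metrics listed in the table above.
collectors:
  enabled: cache,cpu,cs,logical_disk,memory,net,system
```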
(f) Yet another cloudwatch exporter
Yet another cloudwatch exporter is an exporter, included in the integrated agent, that uses Amazon CloudWatch to collect operating information for AWS services in the cloud.
Yet another cloudwatch exporter is installed on the same host as the Prometheus server. Upon a scrape request from the Prometheus server, it collects CloudWatch metrics obtained via the SDK provided by AWS (AWS SDK)# and returns them to the Prometheus server.
- #
-
SDK provided by Amazon Web Services (AWS). Yet another cloudwatch exporter uses the AWS SDK for Go (V1). CloudWatch monitoring requires that Amazon CloudWatch support the AWS SDK for Go (V1).
You can monitor services on which Node exporter or Windows exporter cannot be installed.
■ Main items to be acquired
The main retrieval items of Yet another cloudwatch exporter are defined in Yet another cloudwatch exporter metric definition file (default). For more information, see "Yet another cloudwatch exporter metric definition file (metrics_ya_cloudwatch_exporter.conf)" (2. Definition File) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.
■ CloudWatch metrics you can collect
You can collect metrics for the AWS namespaces supported for monitoring by Yet another cloudwatch exporter of JP1/IM - Agent, as listed in "9.5.1(1)(k) Creating an IM Management Node for Yet another cloudwatch exporter".
Specify the metrics to collect by describing the AWS service name and CloudWatch metric name in the Yet another Cloudwatch Exporter configuration file (jpc_ya_cloudwatch_exporter.yml).
The following is an example of the description of the Yet another cloudwatch exporter configuration file when collecting CPUUtilization and DiskReadBytes for CloudWatch metrics for AWS/EC2 services.
discovery:
  exportedTagsOnMetrics:
    ec2:
      - jp1_pc_nodelabel
  jobs:
    - type: ec2
      regions:
        - ap-northeast-1
      period: 60
      length: 300
      delay: 60
      nilToZero: true
      searchTags:
        - key: jp1_pc_nodelabel
          value: .*
      metrics:
        - name: CPUUtilization
          statistics:
            - Maximum
        - name: DiskReadBytes
          statistics:
            - Maximum
For information about what the Yet another cloudwatch exporter configuration file describes, see "Yet another cloudwatch exporter configuration file (jpc_ya_cloudwatch_exporter.yml)" (2. Definition File) in the "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.
You can also add new metrics to the Yet another cloudwatch exporter metrics definition file using the metrics you set in the Yet another cloudwatch exporter configuration file.
The metrics and labels specified in the PromQL statement described in the definition file conform to the following naming conventions:
- - Naming conventions for Exporter metrics
-
Yet another cloudwatch exporter automatically converts CloudWatch metric names into its own metric names according to the following rules, and these converted names are used as the exporter's metric names. Metrics specified in a PromQL statement are therefore described using the exporter's metric name.
"aws_"#1 + Namespace#2 + "_" + CloudWatch_Metric#2 + "_" + Statistic_Type#2
- #1
-
Appended if the namespace does not begin with "aws_".
- #2
-
Indicates the name you set in the Yet another cloudwatch exporter configuration file (jpc_ya_cloudwatch_exporter.yml). It is converted by the following rules:
-
It is converted from camel case notation to snake case notation.
CamelCase is a notation that capitalizes word breaks, such as "CamelCase" or "camelCase."
Snakecase is a notation that separates words with "_", such as "snake_case".
-
The following symbols are converted to "_".
whitespace,comma,tab, /, \, half-width period, -, :, =, full-width left double quote, @, <, >
-
"%" is converted to "_percent".
-
- - Exporter label naming conventions
-
Yet another cloudwatch exporter automatically converts CloudWatch dimension and tag names into its own label names according to the following rules. Labels specified in a PromQL statement are described using the exporter's label name.
-
For dimensions
"dimension"+"_"+dimensions_name#
-
For tags
"tag"+"_"+tag_name#
-
For custom tags
"custom_tag"+"_"+custom-tag-name#
- #
-
Indicates the name you set in the Yet another cloudwatch exporter configuration file (jpc_ya_cloudwatch_exporter.yml).
-
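The conversion rules above can be sketched in code. The following is a minimal, hypothetical Python illustration of the stated rules (camel case to snake case, symbol replacement, "%" to "_percent", and the "aws_"/"dimension_" prefixes); it is not the exporter's actual implementation and may differ from it in edge cases such as runs of capital letters.

```python
import re

# Symbols that the rules above say are converted to "_"
# (whitespace, comma, tab, /, \, period, -, :, =, full-width left
# double quote, @, <, >).
SYMBOLS = ' ,\t/\\.-:=“@<>'

def to_snake(name: str) -> str:
    """Camel case to snake case, symbols to "_", "%" to "_percent"."""
    s = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()
    s = ''.join('_' if c in SYMBOLS else c for c in s)
    return s.replace('%', '_percent')

def metric_name(namespace: str, metric: str, statistic: str) -> str:
    """Exporter metric name per the stated formula; "aws_" is prepended
    only when the converted namespace does not already begin with it."""
    ns = to_snake(namespace)
    if not ns.startswith('aws_'):
        ns = 'aws_' + ns
    return f"{ns}_{to_snake(metric)}_{to_snake(statistic)}"

def dimension_label(name: str) -> str:
    """Exporter label name for a CloudWatch dimension."""
    return 'dimension_' + name

print(metric_name('AWS/EC2', 'CPUUtilization', 'Maximum'))
# aws_ec2_cpuutilization_maximum
```

For example, the CloudWatch metric DiskReadBytes in the AWS/EC2 namespace with the Maximum statistic would become aws_ec2_disk_read_bytes_maximum under these rules.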
■ About policies for IAM users in your AWS account
To connect to AWS CloudWatch, you must create a policy with the following permissions and assign it to an IAM user.
"tag:GetResources", "cloudwatch:GetMetricData", "cloudwatch:GetMetricStatistics", "cloudwatch:ListMetrics"
For details on how to set JSON format, see "2.19.2(7)(b) Changing the settings for connecting to CloudWatch (Linux) (optional)" in "JP1/Integrated Management 3 - Manager Configuration Guide".
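As a sketch, an IAM policy document granting the four permissions listed above could look like the following. The "Resource" value and statement layout are illustrative; restrict them according to your own security policy.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "tag:GetResources",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*"
    }
  ]
}
```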
■ Environment-variable HTTPS_PROXY
An environment variable that you specify when connecting to CloudWatch from Yet another cloudwatch exporter through a proxy. Only an http URL can be set in the HTTPS_PROXY environment variable. Note that Basic authentication is the only supported authentication method.
You can set the environment-variable HTTPS_PROXY to connect to AWS CloudWatch through proxies. The following shows an example configuration.
HTTPS_PROXY=http://username:password@proxy.example.com:5678
■ How to handle monitoring targets JP1/IM-Agent does not support
If a product or metric cannot be monitored by JP1/IM - Agent, you must collect it by some other means, such as a user-defined Exporter.
(3) Centralized management of performance data
This function allows Prometheus server to store performance data collected from monitoring targets in the intelligent integrated management database of JP1/IM - Manager. It has the following features:
-
Remote write function
(a) Remote write function
This is a function in which the Prometheus server sends performance data collected from monitoring targets to an external database suitable for long-term storage. JP1/IM - Agent uses this function to send performance data to JP1/IM - Manager.
The following shows how to define remote write.
-
Remote write definitions are described in the Prometheus server configuration file (jpc_prometheus_server.yml).
-
Download the Prometheus server configuration file from the integrated operation viewer, edit it in a text editor to modify the Remote Write definition, and then upload it.
The following settings are supported by JP1/IM - Agent for defining Remote Write. For more information about the settings, see "Prometheus configuration file (jpc_prometheus_server.yml)" (2. definition file) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.
Setting items |
Description |
---|---|
Remote write destination (required) |
Set the endpoint URL for JP1/IM agent control base. |
Remote write timeout period (Optional) |
You can set the timeout period to use if remote write takes a long time. Change it if you are not satisfied with the default value. |
Relabeling (Optional) |
You can remove unwanted metrics and customize labels. |
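The three settings in the table above map onto a remote_write entry in the Prometheus configuration file. The following hedged excerpt is illustrative only; the actual endpoint URL of the JP1/IM agent control base, the timeout, and any relabeling rules depend on your environment.

```yaml
remote_write:
  - url: http://localhost:<port>/<agent-control-base-endpoint>  # destination (required)
    remote_timeout: 30s                  # timeout period (optional)
    write_relabel_configs:               # relabeling (optional)
      - source_labels: [__name__]
        regex: unwanted_metric_name
        action: drop
```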
(4) Performance data monitoring notification function
This function allows Prometheus server to monitor performance data collected from monitoring targets at a threshold value and notify JP1/IM - Manager. It has three functions:
-
Alert evaluation function
-
Alert notification function
-
Notification suppression function
(a) Alert evaluation function
This function monitors performance data collected from monitoring targets at a threshold value.
Define alert rules to evaluate alerts, monitor performance data at thresholds, and notify alerts.
Alerts can be evaluated by comparing the time series data directly with the thresholds, or by comparing the thresholds with the results of formulas using PromQL#.
- #
-
For more information about PromQL, see 2.7.4(4) About PromQL.
For each time series of data, or for each data point generated as the calculation result of a PromQL expression, an alert status is managed according to the evaluation, and notification actions are executed according to that alert state.
There are three alert states: pending, firing, and resolved. When the conditions of an alert rule are first met, the alert enters the "pending" state. If the conditions continue to be met (the alert does not recover) for the period of time set in the alert rule definition, the alert enters the "firing" state. If the conditions are no longer met (the alert recovers), or the time series disappears, the alert enters the "resolved" state.
The relationship between alert status and notification behavior is shown below.
Alert status |
Description |
Notification behavior |
---|---|---|
pending |
A certain period of time has not passed since the alert rule was triggered. |
Do not notify alerts. |
firing |
A certain period of time has passed since the alert rule was triggered. |
Notifies you of alerts. |
resolved |
The alert rule conditions are no longer met. |
|
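The state transitions in the table above can be sketched as a small state machine. This is a hypothetical illustration of the pending/firing/resolved behavior described here, not the actual Prometheus implementation (the class and method names are invented).

```python
# Minimal sketch of the alert state transitions described above:
# pending -> firing after the wait time set in the alert rule definition,
# resolved when the condition recovers.
class AlertState:
    def __init__(self, for_seconds):
        self.for_seconds = for_seconds  # wait time before "firing"
        self.state = 'inactive'
        self.since = None               # time the condition was first met

    def evaluate(self, now, condition_met):
        if condition_met:
            if self.state in ('inactive', 'resolved'):
                self.state, self.since = 'pending', now
            elif self.state == 'pending' and now - self.since >= self.for_seconds:
                self.state = 'firing'   # alert is notified in this state
        else:
            if self.state in ('pending', 'firing'):
                self.state = 'resolved' # recovery
            else:
                self.state = 'inactive'
        return self.state
```

For example, with a 180-second wait time, an alert whose condition holds at 0s and 60s stays pending, becomes firing at 180s, and becomes resolved once the condition recovers.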
The following shows how to define an alert rule.
-
Alert rule definitions are described in the alert configuration file (jpc_alerting_rules.yml) (definitions in any YAML format can also be described).
-
Before applying the created definition file to the environment, check its format and test the alert rules with the promtool command.
-
Download the alert configuration file from the integrated operation viewer, edit it in a text editor to change the definition of the alert rule, and then upload it.
The following settings apply to the alert rule definitions supported by JP1/IM - Agent. For more information about the settings, see "alert configuration file (jpc_alerting_rules.yml)" (2. definition file) in "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual. There is no default alert rule definition.
Setting Item |
Description |
---|---|
Alert Name (required) |
Set the alert name. |
Conditional expression (required) |
Set the alert condition expression (threshold). It can be configured using PromQL. |
Waiting time (required) |
Set the amount of time to wait after entering the "pending" state before changing to the "firing" state. Change it if you are not satisfied with the default value. |
Label (required) |
Set labels to add to alerts and recovery notifications. In JP1/IM - Agent, a specific label must be set. |
Annotation (required) |
Set to store additional information such as alert description and URL link. In JP1/IM - Agent, certain annotations must be set. |
Labels and annotations can use the following variables:
Variable# |
Description |
---|---|
$labels |
A variable that holds the label key-value pairs of the alert instance. When time series data is specified in the conditional expression of the alert evaluation, you can specify as the label key any label that the data retains. |
$values |
A variable that holds the evaluation value of the alert instance. In an abnormality notification, it expands to the value at the time the abnormality was detected. In a recovery notification, it expands to the value of the last abnormality before recovery (not the value at the time of recovery). |
$externalLabels |
This variable holds the label and value set in "external_labels" of item "global" in the Prometheus configuration file (jpc_prometheus_server.yml). |
- #
-
Variables are expanded by enclosing them in "{{" and "}}". The following is an example of how to use variables:
description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
■ Alert rule definition for converting to JP1 events
In order to convert the alert to be notified into a JP1 event on the JP1/IM - Manager side, the following information must be set in the alert rule definition.
Setting item |
Value to set |
Uses |
---|---|---|
name |
Set any alert group definition name that is unique within the integrated agent. |
Alert group definition name |
alert |
Set any alert definition name that is unique within the integrated agent. |
Alert Definition Name |
expr |
Set the PromQL statement. It is recommended to set the PromQL statement described in the metric definition file. This way, when the JP1 event occurs, you can display trend information in the Integrated Operation Viewer. |
Abnormal conditions# |
labels.jp1_pc_product_name |
Set "/HITACHI/JP1/JPCCS" as fixed. |
Set to the product name of the JP1 event. |
labels.jp1_pc_severity |
Set one of the following: |
Set to JP1 event severity#. |
labels.jp1_pc_eventid |
Set any value in the range 0 to 1FFF or 7FFF8000 to 7FFFFFFF. |
Set to the event ID of the JP1 event. |
labels.jp1_pc_metricname |
Set the metric name. For Yet another cloudwatch exporter, be sure to specify it; the JP1 event is associated with the IM management node in the AWS namespace corresponding to the metric name (or, if multiple metric names are specified separated by commas, the first metric name). |
Set to the metric name of the JP1 event. For Yet another cloudwatch exporter, it is also used to correlate JP1 events. |
annotations.jp1_pc_firing_description |
Specify the value to be set for the message of the JP1 event when the abnormal condition of the alert is satisfied. If the length of the value is 1,024 bytes or more, set the string from the beginning to the 1,023rd byte. If the specification is omitted, the message content of the JP1 event is "The alert is firing. (alert = alert name)". You can also specify variables to embed job names and evaluation values. If a variable is used, the first 1,024 bytes of the expanded message are valid. |
It is set to the message of the JP1 event. |
annotations.jp1_pc_resolved_description |
Specify the value to be set for the JP1 event message when the abnormal condition of the alert is no longer satisfied. If the length of the value is 1,024 bytes or more, set the string from the beginning to the 1,023rd byte. If the specification is omitted, the content of the message in the JP1 event is "The alert is resolved. (alert = alert name)". You can also specify variables to embed job names and evaluation values. If a variable is used, the first 1,024 bytes of the expanded message are valid. |
It is set to the message of the JP1 event. |
For an example of setting an alert definition, refer to the chapter describing metric definition file for each Exporter in the manual "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" (2. Definition File) and the explanation of "Alert Definition Example" in the table describing the setting details (initial status) of each metric.
For information about the properties of the corresponding JP1 event, see "3.2.3 List of JP1 events issued by JP1/IM - Agent" in the "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.
■ How to operate in combination with trending-related functions
By combining the PromQL statement described in the metric definition file with the PromQL statement evaluated by the alert evaluation function, and describing the metric name of the corresponding trend data in annotations.jp1_pc_firing_description and annotations.jp1_pc_resolved_description of the alert definition in the alert configuration file, you can check the past changes and current value of the performance value evaluated by the alert on the [Trend] tab of the integrated operation viewer when the alert's JP1 event is issued.
For details about the PromQL expressions defined for the trend display functions, see 9.5.1(4) Return of trend data.
For example, if you want the Node exporter to monitor CPU usage and notify you when the CPU usage exceeds 80%, create an alert configuration file (alert definition) and a metric definition file as shown in the following example.
-
Example of description of alert configuration file (alert definition)
groups:
  - name: node_exporter
    rules:
      - alert: cpu_used_rate(Node exporter)
        expr: 80 < (avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="system"}[2m])) + avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="user"}[2m]))) * 100
        for: 3m
        labels:
          jp1_pc_product_name: "/HITACHI/JP1/JPCCS"
          jp1_pc_severity: "Error"
          jp1_pc_eventid: "0301"
          jp1_pc_metricname: "node_cpu_seconds_total"
        annotations:
          jp1_pc_firing_description: "CPU utilization exceeded threshold (80%).value={{ $value }}%"
          jp1_pc_resolved_description: "CPU usage has dropped below the threshold (80%)."
-
Example of description of metric definition
[
  {
    "name":"cpu_used_rate",
    "default":true,
    "promql":"(avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode=\"system\"}[2m]) and $jp1im_TrendData_labels) + avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode=\"user\"}[2m]) and $jp1im_TrendData_labels)) * 100",
    "resource_en":{
      "category":"platform_unix",
      "label":"CPU used rate",
      "description":"CPU usage. It also indicates the average value per processor. [Units: %]",
      "unit":"%"
    },
    "resource_ja":{
      "category":"platform_unix",
      "label":"CPU Usage",
      "description":"CPU utilization (%). It is also an average percentage of each processor.",
      "unit":"%"
    }
  }
]
When the conditions of the PromQL statement specified in expr of the alert definition are satisfied and the JP1 event of the alert is issued, the message "CPU usage exceeded threshold (80%). value = performance value%" is set in the message of the JP1 event. Users can view this message to view "CPU Usage" trend information and see past changes and current values of CPU usage.
■ Behavior when the service is stopped
If the Prometheus server or Alertmanager service is stopped, no JP1 event is issued for alerts. In addition, if the Prometheus server and Alertmanager services are running but the Exporter for which an alert is abnormal stops because of a failure, the alert returns to normal and a normal (recovery) JP1 event is issued.
■ About behavior when the service is restarted
If the Prometheus server, Alertmanager, or Exporter service is restarted while an alert is abnormal or normal, and the alert status after the restart is the same as before the restart, no JP1 event is issued.
■ About Considering Performance Data Spikes
Performance data can momentarily jump (to unusually large, small, or negative values). Such sudden changes in performance data are commonly referred to as "spikes." In many cases, even if a spike momentarily produces an abnormal value, the data immediately returns to normal and does not need to be treated as an abnormality. A spike may also occur instantaneously when performance data is reset, such as when the OS is restarted.
When monitoring such performance data metrics, it is necessary to consider suppressing sudden anomaly detection by specifying "for" (grace period before treating alerts as anomalies) in the alert rule definition.
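As a sketch of this consideration, the following hypothetical alert rule fragment uses "for" so that a momentary spike above the threshold does not immediately fire the alert; the metric, threshold, and 3-minute grace period are examples only.

```yaml
- alert: available_memory_low
  expr: node_memory_MemAvailable_bytes < 104857600   # below about 100 MB
  for: 3m   # the condition must persist for 3 minutes before the alert fires
```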
(b) Alert notification function
This function notifies you when the alert status becomes "firing" or "resolved" after the Prometheus server evaluates the alert.
The Prometheus server sends alerts one by one, and the sent alerts are notified to JP1/IM - Manager (Intelligent Integrated Management Base) via Alertmanager. You will also be notified one by one when you retry.
Alerts sent to JP1/IM - Manager are generally sent in the order in which they occurred, but the order may change when multiple alert rules meet their conditions at the same time or when alerts are resent after a transmission error. However, because each alert includes its time of occurrence, the order in which the alerts occurred can still be determined.
In addition, if the abnormal condition continues for 7 days, an alert will be re-notified.
The following shows how to define the notification destination of the alert.
-
Alert destinations are described in both the Prometheus configuration file (jpc_prometheus_server.yml) and the Alertmanager configuration file (jpc_alertmanager.yml).
In the Prometheus configuration file, specify the coexisting Alertmanager as the notification destination of the Prometheus server. In the Alertmanager configuration file, specify JP1/IM agent control base as the notification destination of Alertmanager.
-
Download the individual configuration files from the integrated operation viewer, edit them in a text editor to change the alert notification destination definitions, and then upload them.
The following settings define the Prometheus server notification destinations supported by JP1/IM - Agent. For details about the settings, see "Prometheus configuration file (jpc_prometheus_server.yml)" (2. definition file) in the "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.
| Setting items | Description |
|---|---|
| Notification destination (required) | Configure the Alertmanager notification destination. If a host name or IP address is specified for --web.listen-address in the Alertmanager command-line options, change localhost to the host name or IP address specified in --web.listen-address. |
| Label setting (optional) | You can add labels. Configure as needed. |
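As an illustration, the alerting section of jpc_prometheus_server.yml might look like the following sketch. The port number is a placeholder, not a documented default; use the value that matches the coexisting Alertmanager's --web.listen-address:

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Replace localhost:20714 (placeholder) with the host name or
            # IP address and port specified in --web.listen-address.
            - localhost:20714
```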
The following setting defines the Alertmanager notification destination supported by JP1/IM - Agent. For details about the settings, see "Alertmanager configuration file (jpc_alertmanager.yml)" (2. definition file) in the "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.
| Setting items | Description |
|---|---|
| Webhook settings (required) | Set the endpoint URL of JP1/IM agent control base. |
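As an illustration, the webhook setting in jpc_alertmanager.yml might look like the following sketch. The receiver name is hypothetical, and the URL is a placeholder: substitute the actual endpoint URL of JP1/IM agent control base for your environment:

```yaml
route:
  receiver: 'jp1_webhook'            # hypothetical receiver name
receivers:
  - name: 'jp1_webhook'
    webhook_configs:
      # Placeholder: set the actual endpoint URL of JP1/IM agent control base.
      - url: 'http://localhost:<port>/<endpoint-of-agent-control-base>'
```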
(c) Notification suppression function
This function suppresses the notifications described in 9.5.3(4)(b) Alert forwarder. It includes:
-
Silence function
Use this if you do not want to be temporarily notified of certain alerts.
■ Silence function
This feature temporarily suppresses specific notifications. For example, you can configure it so that alerts occurring during temporary maintenance are not notified. Unlike the common exclusion conditions of JP1/IM - Manager, the notification suppression function does not send the notification to JP1/IM - Manager at all.
While silence is enabled, you will not be notified when the alert status changes. When silence is disabled, if the state has changed compared to the state of the alert before silence was enabled, notification is given.
The following shows two examples of when notification occurs:

(Figure: notification example 1)

The first example shows a case where the alert status is "abnormal" when silence is enabled, the status changes to "normal" while silence is enabled, and then silence is disabled.

The change to "normal" is not notified while silence is enabled. When silence is disabled, the alert status has changed from "abnormal" (the status before silence was enabled) to "normal", so a "normal" notification is issued.

(Figure: notification example 2)

The second example shows a case where, while silence was enabled, the alert status changed to "normal" once, changed back to "abnormal", and then silence was disabled.

When silence is disabled, no notification is issued because the alert status is "abnormal", the same as before silence was enabled.
If an alert failed to be sent and is being retried, and silence is then enabled to suppress that alert, the alert is no longer retried.
■ How to configure silence

Silence settings (enabling or disabling) and retrieval of the current silence settings are performed via the REST API (the GUI is not supported).
In addition, when configuring silence settings, the machine you are operating from must be able to communicate with the Alertmanager port on the integrated agent host.
For details about the REST APIs used to configure silence settings and to obtain the current silence settings, see "5.21.3 Obtain list of Alertmanager silence", "5.21.4 Create Alertmanager silence", and "5.21.5 Expiration of Alertmanager silence" in the "JP1/Integrated Management 3 - Manager Command, Definition File, and API Reference" manual.
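As a minimal sketch of what such a REST API request body could contain, the following Python code builds a silence payload in the shape of Alertmanager's standard v2 silences API (POST /api/v2/silences). The host, port, alert name, and field values are assumptions for illustration; the JP1/IM - Manager manual sections cited above define the authoritative API:

```python
import json
from datetime import datetime, timedelta, timezone

def build_silence_payload(alert_name, duration_hours, comment, created_by):
    """Build a JSON body shaped like Alertmanager's POST /api/v2/silences API."""
    now = datetime.now(timezone.utc)
    return {
        # Suppress only alerts whose alertname label matches exactly.
        "matchers": [
            {"name": "alertname", "value": alert_name, "isRegex": False}
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=duration_hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }

# Hypothetical alert name and values for illustration.
payload = build_silence_payload("cpu_used_rate_high", 2,
                                "temporary maintenance", "jp1admin")
body = json.dumps(payload)
# To create the silence, POST `body` with Content-Type: application/json to
# http://<integrated-agent-host>:<Alertmanager-port>/api/v2/silences
# (host and port are site-specific; see the manual sections above).
```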