Hitachi

JP1 Version 13 JP1/Integrated Management 3 - Manager Overview and System Design Guide


3.15.1 Performance monitoring function by JP1/IM - Agent

The performance monitoring function consists of the add-on programs Prometheus, Alertmanager, and Exporters, and provides the following two functions:

Performance data and alerts sent to the integrated manager host can be viewed in the integrated operation viewer.

Organization of this subsection

(1) Performance data collection function

The Prometheus server collects performance data from monitored targets. It provides the following two functions:

(a) Scrape function

The scrape function of the Prometheus server acquires the performance data of monitoring targets via the Exporter.

When the Prometheus server accesses a specific URL of the Exporter, the Exporter retrieves the monitored performance data and returns it to the Prometheus server. This process is called scrape.

A scrape is executed in units of scrape jobs, each of which groups multiple scrapes that serve the same purpose.

When a discovery configuration file is used for monitoring through UAP monitoring, the corresponding scrape jobs must be defined. Additional settings are also required for the scrape definitions of the log metrics function.

For details on the scraping description of the log metrics feature, see 1.21.2(10) Setting up scraping definitions in the JP1/Integrated Management 3 - Manager Configuration Guide.

Scrapes are defined in units of scrape jobs. In JP1/IM - Agent, scrape definitions with the following scrape job names are set by default, according to the type of exporter.

prometheus: Scrape definition for the Prometheus server

jpc_node: Scrape definition for Node exporter

jpc_windows: Scrape definition for Windows exporter

jpc_blackbox_http: Scrape definition for HTTP/HTTPS monitoring in Blackbox exporter

jpc_blackbox_icmp: Scrape definition for ICMP monitoring in Blackbox exporter

jpc_cloudwatch: Scrape definition for Yet another cloudwatch exporter

jpc_process: Scrape definition for Process exporter

jpc_promitor: Scrape definition for Promitor

jpc_script: Scrape definition for Script exporter

If you want to scrape your own exporter, you must add a scrape definition for each target exporter.

The metrics obtained from an Exporter by Prometheus server scraping depend on the type of Exporter. For details, see the description of the metric definition file for each Exporter in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

In addition to the metrics obtained from the exporter, the Prometheus server generates the following metrics when a scrape is performed.

up: Indicates "1" if the scrape succeeded and "0" if it failed. It can be used to monitor whether the exporter is operating. A scrape can fail because the host is stopped, the exporter is stopped, the exporter returns a status other than 200, or a communication error occurs.

scrape_duration_seconds: Indicates how long the scrape took. Not used in normal operation; used to investigate scrapes that do not finish within the expected time.

scrape_samples_post_metric_relabeling: Indicates the number of samples remaining after metric relabeling. Not used in normal operation; used to check the amount of data when building the environment.

scrape_samples_scraped: Indicates the number of samples returned by the scraped exporter. Not used in normal operation; used to check the amount of data when building the environment.

scrape_series_added: Indicates the approximate number of newly created series. Not used in normal operation.
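A common use of the up metric is to detect a stopped exporter. The following is a minimal sketch of a Prometheus-style alert rule that fires when a target of the jpc_node scrape job has been unreachable for five minutes; the group name, alert name, duration, and labels are hypothetical examples, not shipped JP1/IM - Agent definitions.

```yaml
# Hypothetical alert rule: fire when any target of the jpc_node scrape job
# has failed its scrapes (up == 0) for 5 consecutive minutes.
groups:
  - name: exporter-monitoring        # rule group name (example)
    rules:
      - alert: ExporterDown          # alert name (example)
        expr: up{job="jpc_node"} == 0
        for: 5m                      # condition must hold for 5 minutes
        labels:
          severity: error            # example label
        annotations:
          summary: "Scrape of {{ $labels.instance }} is failing"
```

The actual alert definitions used by JP1/IM - Agent should be written in the alert rule files described in the Configuration Guide.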

For details about how scrapes are executed, see 5.23 API for scrape of Exporter used by JP1/IM - Agent in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference. An exporter to be scraped must be able to operate as described there.

Scrape definitions are specified as follows:

  • Scrape definitions are done in units of scrape jobs.

  • The scrape definition is described in the Prometheus configuration file (jpc_prometheus_server.yml).

  • To edit a scrape definition, download the Prometheus configuration file from the integrated operation viewer, edit it, and then upload it.
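For orientation, a scrape job in jpc_prometheus_server.yml follows the standard Prometheus scrape_configs structure. The sketch below shows only the general shape; the job name, host name, port, and metrics path are hypothetical placeholders, and the actual keys set by JP1/IM - Agent should be taken from the shipped configuration file.

```yaml
# Sketch of one scrape job in a Prometheus configuration file.
# "jpc_my_exporter", "hostA", and the port are placeholders.
scrape_configs:
  - job_name: jpc_my_exporter       # appears on metrics as job="jpc_my_exporter"
    scrape_interval: 60s            # per-job scrape interval (optional)
    metrics_path: /metrics          # URL path exposed by the exporter
    static_configs:
      - targets:
          - hostA:20716             # host:port of the exporter ("localhost" cannot be used)
```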

The following are the settings related to scrape definitions supported by JP1/IM - Agent.

Table 3‒15: Settings for scrape definitions supported by JP1/IM - Agent

Scrape job name (required)

Sets the name of the scrape job that Prometheus scrapes. Multiple scrape job names can be specified. The specified scrape job name is set in the metric label as job="scrape-job-name".

Scrape destination (required)

Sets the URL of the exporter to be scraped. Only exporters on hosts where JP1/IM - Agent resides can be specified as scrape destinations. The server in the URL is specified by host name; "localhost" cannot be used. The total number of scrape destinations specified across all scrape jobs is limited to 100.

Scrape parameters (optional)

Sets parameters to pass to the Exporter when scraping. What can be set depends on the type of exporter.

Scrape interval (optional)

Sets the scrape interval. A scrape interval common to all scrape jobs and a scrape interval for each scrape job can both be set; if both are set, the per-job scrape interval takes precedence. The following units can be specified: years, weeks, days, hours, minutes, seconds, or milliseconds.

Scrape timeout (optional)

Sets a timeout period for scrapes that take a long time. A timeout period common to all scrape jobs and a timeout period for each scrape job can both be set; if both are set, the per-job timeout period takes precedence.

Relabeling (optional)

Deletes unnecessary metrics and customizes labels. By using this feature to drop metrics that are not needed, you can reduce the amount of data sent to JP1/IM - Manager.
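Combining these items, a per-job interval, timeout, and a relabeling rule that drops unneeded metrics might look as follows. This is a generic Prometheus-style sketch, not the shipped JP1/IM - Agent defaults; the target, interval, and metric-name pattern are placeholders.

```yaml
scrape_configs:
  - job_name: jpc_node              # example scrape job name
    scrape_interval: 60s            # per-job interval (overrides the common one)
    scrape_timeout: 10s             # per-job timeout (overrides the common one)
    static_configs:
      - targets: ["hostA:20716"]    # placeholder scrape destination
    metric_relabel_configs:
      # Drop metrics that are not needed, reducing the amount of data
      # sent to JP1/IM - Manager.
      - source_labels: [__name__]
        regex: "node_softnet_.*"    # placeholder metric-name pattern
        action: drop
```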

The result of a scrape is returned by the scraped Exporter in the Prometheus text-based format. The Prometheus text-based format is described below:

Text-based format basics

Start time: April 2014

Supported versions: Prometheus version 0.4.0 or later

Transmission format: HTTP

Character code: UTF-8 (the line feed code is \n)

Content-Type: text/plain; version=0.0.4 (if the version value is missing, the latest text format version is assumed)

Content-Encoding: gzip

Advantages:

  • Human-readable

  • Easy to assemble, especially for minimal cases (no nesting required)

  • Readable line by line (except for type hints and docstrings)

Limitations:

  • Verbose

  • Types and docstrings are not part of the syntax, so there is little validation of the metric contract

  • Parsing cost

Supported metric types:

  • Counter

  • Gauge

  • Histogram

  • Summary

  • Untyped

More information about Text-based format

The Prometheus text-based format is line-oriented.

Lines are separated by a line feed character (\n); the sequence \r\n is considered invalid.

The last line must end with a line feed character. Blank lines are ignored.

Line format

Within a line, tokens can be separated by any number of spaces or tabs, and must be separated by at least one of them when they would otherwise merge with the preceding token.

Leading and trailing whitespace is ignored.

Comments, help text, and type information

Lines whose first non-whitespace character is # are comments.

Such a line is ignored unless the first token after # is HELP or TYPE.

These lines are treated as follows:

If the token is HELP, at least one more token (the metric name) is expected. All remaining tokens are considered the docstring for that metric name.

A HELP line may contain any UTF-8 string after the metric name, but backslashes must be escaped as \\ and line feeds as \n. Only one HELP line may exist for any given metric name.

If the token is TYPE, exactly two more tokens are expected. The first is the metric name, and the second is either counter, gauge, histogram, summary, or untyped, defining the type of that metric. Only one TYPE line may exist for a given metric name, and it must appear before the first sample of that metric.

If there is no TYPE line for a metric name, the type is set to untyped.

Write a sample (one per line) using the following EBNF:

 metric_name [
    "{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
 ] value [ timestamp ]
Sample Syntax
  • metric_name and label_name carry the usual Prometheus expression language restrictions.

  • label_value can be any UTF-8 string, but the backslash (\), double quote ("), and line feed characters must be escaped as \\, \", and \n, respectively.

  • value is a floating-point number as parsed by the Go ParseFloat() function. In addition to standard numbers, NaN, +Inf, and -Inf are valid values: NaN means not a number, +Inf is positive infinity, and -Inf is negative infinity.

  • The timestamp is an int64 (milliseconds since the epoch, 1970-01-01 00:00:00 UTC, excluding leap seconds), as parsed by the Go ParseInt() function, and is optional.

Grouping and Sorting

All lines for a given metric must be provided as one single group, with the optional HELP and TYPE lines first (in no particular order).

Beyond that, it is recommended, but not required, that lines are sorted reproducibly across repeated expositions.

Each line must have a unique combination of metric name and labels; otherwise, the ingestion behavior is undefined.

Histograms and Summaries

Because histograms and summaries are difficult to represent in the text format, the following conventions apply:

  • The sample sum for a summary or histogram named x is given as a separate sample named x_sum.

  • The sample count for a summary or histogram named x is given as a separate sample named x_count.

  • Each quantile of a summary named x is given as a separate sample line with the same name x and a label {quantile="y"}.

  • Each bucket count of a histogram named x is given as a separate sample line with the name x_bucket and a label {le="y"} (where y is the upper bound of the bucket).

  • A histogram must have a bucket with {le="+Inf"}. Its value must be identical to the value of x_count.

  • The buckets of a histogram and the quantiles of a summary must appear in increasing numerical order of their label values (for the le and quantile labels, respectively).

Sample Text-based format

Here is a sample Prometheus metric exposition that contains comments, HELP and TYPE representations, histograms, summaries, and character escaping.

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000
 
# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9
 
# Minimalistic line:
metric_without_timestamp_and_labels 12.47
 
# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045
 
# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320
 
# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693

(b) Function for acquiring operational information of monitoring targets

This function acquires operational information (performance data) from monitoring targets. The collection of operational information is performed by a program called an "Exporter".

In response to scrape requests sent from the Prometheus server, the Exporter collects operational information from the monitored target and returns the results to the Prometheus server.

Exporters shipped with JP1/IM - Agent must be scraped only by the Prometheus server of the JP1/IM - Agent on the same host. Do not scrape them from a Prometheus server on another host or from one provided by the user.

This section describes the functions of each exporter included with JP1/IM - Agent.

(c) Windows exporter (Windows performance data collection capability)

Windows exporter is an exporter that is installed on a monitored Windows host to obtain operational information from that host.

Windows exporter is installed on the same host as the Prometheus server, and upon a scrape request from the Prometheus server, it collects operational information from the Windows OS of the host and returns it to the Prometheus server.

It can collect operational information about memory, disks, and other resources from inside the host, which cannot be collected by monitoring from outside the host (external monitoring by URL or CloudWatch).

In addition, with JP1/IM - Manager and JP1/IM - Agent version 13-01 or later, you can monitor the operational status of services (programs registered in Windows services) on the integrated agent host (Windows) (service monitoring function#).

Note that the service monitoring function cannot be used when JP1/IM - Agent runs inside a container.

#

If you use the service monitoring function in an environment where the version is upgraded from 13-00 to 13-01 or later, you need to configure the settings to perform service monitoring. The following are the JP1/IM - Manager and JP1/IM - Agent setup instructions:

Where to find instructions for setting up JP1/IM - Manager

See Editing category name definition file for IM management nodes (imdd_category_name.conf) (Optional) in 1.19.3(1)(d) Settings of product plugin (for Windows) in the JP1/Integrated Management 3 - Manager Configuration Guide.

Where to find instructions for setting up JP1/IM - Agent

See the instructions for configuring service monitoring in 1.21.2(3)(f) Configuring service monitoring (for Windows) (optional) and 1.21.2(5)(b) Modify metric to Collect (Optional) in the JP1/Integrated Management 3 - Manager Configuration Guide.

This feature creates an IM management node for each service that you want to monitor. For details on displaying the tree, see 3.15.6(1)(i) Tree Format. If you configure an alert, a JP1 event is issued when a service stops and is registered with the IM management node corresponding to the stopped service. You can check the past operational status of a service from the service trend display.

■ Main items to be acquired

The main retrieval items of Windows exporter are defined in Windows exporter metric definition file (default) and Windows exporter (Service monitoring) metric definition file (default). For details, see Windows exporter metric definition file (metrics_windows_exporter.conf) in Chapter 2. Definition Files and Windows exporter (Service monitoring) metric definition file (metrics_windows_exporter_service.conf) in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

You can add retrieved items to the metric definition file. The following are the metrics that can be specified in the PromQL statement described in the definition file. For details of "Collector" in the table, refer to the description of "Collector" at the bottom of the table.

Metric Name

Collector

What to Get

Label

windows_cache_copy_read_hits_total

cache

Number of copy read requests that hit the cache (cumulative)

instance: Instance identification string

job: Job name

windows_cache_copy_reads_total

cache

Number of reads from the file system cache page (cumulative)

instance: Instance identification string

job: Job name

windows_cpu_time_total

cpu

Number of seconds of processor time spent per mode (cumulative)

instance: Instance identification string

job: Job name

core: coreid

mode: Mode#

#

Contains one of the following:

  • "dpc"

  • "idle"

  • "interrupt"

  • "privileged"

  • "user"

windows_cs_physical_memory_bytes

cs

Number of bytes of the physical memory capacity

instance: Instance identification string

job: Job name

windows_logical_disk_idle_seconds_total

logical_disk

Number of seconds that the disk was idle (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_free_bytes

logical_disk

Number of bytes of unused disk space

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_read_bytes_total

logical_disk

Number of bytes transferred from disk during the read operation (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_read_seconds_total

logical_disk

Number of seconds that the disk was busy for read operations (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_reads_total

logical_disk

Number of read operations to disk (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_requests_queued

logical_disk

Number of requests queued on disk

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_size_bytes

logical_disk

Disk space bytes

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_write_bytes_total

logical_disk

Number of bytes transferred to disk during the write operation (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_write_seconds_total

logical_disk

Number of seconds that the disk was busy for write operations (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_logical_disk_writes_total

logical_disk

Number of disk write operations (cumulative)

instance: Instance identification string

job: Job name

volume: Volume name

windows_memory_available_bytes

memory

Number of bytes of unused space in physical memory

Note:

The total of zero, free, and standby (cached) areas allocated to a process or immediately available to the system.

instance: Instance identification string

job: Job name

windows_memory_cache_bytes

memory

Number of bytes of physical memory used for file system caching

instance: Instance identification string

job: Job name

windows_memory_cache_faults_total

memory

Number of page faults in the file system cache (cumulative)

instance: Instance identification string

job: Job name

windows_memory_page_faults_total

memory

Number of times a page fault occurred (cumulative)

instance: Instance identification string

job: Job name

windows_memory_pool_nonpaged_allocs_total

memory

Number of times a nonpageable physical memory region was allocated

instance: Instance identification string

job: Job name

windows_memory_pool_paged_allocs_total

memory

Number of times you allocated a pageable physical memory region

instance: Instance identification string

job: Job name

windows_memory_swap_page_operations_total

memory

Number of pages read from or written to disk to resolve hard page faults (cumulative)

instance: Instance identification string

job: Job name

windows_memory_swap_pages_read_total

memory

Number of pages read from disk to resolve hard page faults (cumulative)

instance: Instance identification string

job: Job name

windows_memory_swap_pages_written_total

memory

Number of pages written to disk to resolve hard page faults (cumulative)

instance: Instance identification string

job: Job name

windows_memory_system_cache_resident_bytes

memory

Number of active system file cache bytes in physical memory

instance: Instance identification string

job: Job name

windows_memory_transition_faults_total

memory

The number of page faults resolved by recovering pages that were in use by other processes sharing the page, pages that were on the modified pages list or standby list, or pages that were written to disk (cumulative)

instance: Instance identification string

job: Job name

windows_net_bytes_received_total

net

Number of bytes received by the interface (cumulative)

Note:

If the NIC name contains characters other than half-width alphanumeric characters, these characters are converted to underscores and set in the NIC label.

instance: Instance identification string

job: Job name

device: Network Device Name

windows_net_bytes_sent_total

net

Number of bytes sent from the interface (cumulative)

Note:

If the NIC name contains characters other than half-width alphanumeric characters, these characters are converted to underscores and set in the NIC label.

instance: Instance identification string

job: Job name

device: Network Device Name

windows_net_bytes_total

net

Number of bytes received and transmitted by the interface (cumulative)

Note:

If the NIC name contains characters other than half-width alphanumeric characters, these characters are converted to underscores and set in the NIC label.

instance: Instance identification string

job: Job name

device: Network Device Name

windows_net_packets_sent_total

net

Number of packets sent by the interface (cumulative)

Note:

If the NIC name contains characters other than half-width alphanumeric characters, these characters are converted to underscores and set in the NIC label.

instance: Instance identification string

job: Job name

device: Network Device Name

windows_net_packets_received_total

net

Number of packets received by the interface (cumulative)

Note:

If the NIC name contains characters other than half-width alphanumeric characters, these characters are converted to underscores and set in the NIC label.

instance: Instance identification string

job: Job name

device: Network Device Name

windows_system_context_switches_total

system

Number of context switches (cumulative)

instance: Instance identification string

job: Job name

windows_system_processor_queue_length

system

Number of threads in the processor queue

instance: Instance identification string

job: Job name

windows_system_system_calls_total

system

Number of times the process called the OS service routine (cumulative)

instance: Instance identification string

job: Job name

windows_process_start_time

process

Time of process start

instance: instance-identifier-string

job: job-name

process: process-name

process_id: process-ID

creating_process_id: creator-process-ID

windows_process_cpu_time_total

process

Returns elapsed time that all of the threads of this process used the processor to execute instructions by mode (privileged, user). An instruction is the basic unit of execution in a computer, a thread is the object that executes instructions, and a process is the object created when a program is run. Code executed to handle some hardware interrupts and trap conditions is included in this count.

instance: instance-identifier-string

job: job-name

process: process-name

process_id: process-ID

creating_process_id: creator-process-ID

mode: mode (privileged or user)

windows_process_io_bytes_total

process

Bytes issued to I/O operations in different modes (read, write, other). This property counts all I/O activity generated by the process to include file, network, and device I/Os. Read and write mode includes data operations; other mode includes those that do not involve data, such as control operations.

instance: instance-identifier-string

job: job-name

process: process-name

process_id: process-ID

creating_process_id: creator-process-ID

mode: mode (read, write, or other)

windows_process_io_operations_total

process

I/O operations issued in different modes (read, write, other). This property counts all I/O activity generated by the process to include file, network, and device I/Os. Read and write mode includes data operations; other mode includes those that do not involve data, such as control operations.

instance: instance-identifier-string

job: job-name

process: process-name

process_id: process-ID

creating_process_id: creator-process-ID

mode: mode (read, write, or other)

windows_process_page_faults_total

process

Page faults by the threads executing in this process. A page fault occurs when a thread refers to a virtual memory page that is not in its working set in main memory. This might not cause the page to be fetched from disk if it is on the standby list and hence already in main memory, or if it is in use by another process with which the page is shared.

instance: instance-identifier-string

job: job-name

process: process-name

process_id: process-ID

creating_process_id: creator-process-ID

windows_process_page_file_bytes

process

Current number of bytes this process has used in the paging file(s). Paging files are used to store pages of memory used by the process that are not contained in other files. Paging files are shared by all processes, and lack of space in paging files can prevent other processes from allocating memory.

instance: instance-identifier-string

job: job-name

process: process-name

process_id: process-ID

creating_process_id: creator-process-ID

windows_process_pool_bytes

process

Pool Bytes is the last observed number of bytes in the paged or nonpaged pool. The nonpaged pool is an area of system memory (physical memory used by the operating system) for objects that cannot be written to disk, but must remain in physical memory as long as they are allocated. The paged pool is an area of system memory (physical memory used by the operating system) for objects that can be written to disk when they are not being used. Nonpaged pool bytes is calculated differently than paged pool bytes, so it might not equal the total of paged pool bytes.

instance: instance-identifier-string

job: job-name

process: process-name

process_id: process-ID

creating_process_id: creator-process-ID

pool: paged or nonpaged

windows_process_priority_base

process

Current base priority of this process. Threads within a process can raise and lower their own base priority relative to the process's base priority.

instance: instance-identifier-string

job: job-name

process: process-name

process_id: process-ID

creating_process_id: creator-process-ID

windows_process_private_bytes

process

Current number of bytes this process has allocated that cannot be shared with other processes.

instance: instance-identifier-string

job: job-name

process: process-name

process_id: process-ID

creating_process_id: creator-process-ID

windows_process_virtual_bytes

process

Current size, in bytes, of the virtual address space that the process is using. Use of virtual address space does not necessarily imply corresponding use of either disk or main memory pages. Virtual space is finite and, by using too much, the process can limit its ability to load libraries.

instance: instance-identifier-string

job: job-name

process: process-name

process_id: process-ID

creating_process_id: creator-process-ID

windows_service_state

service

The state of the service (State)

instance: instance-identifier-string

job: job-name

name: service-name#1

state: service-status#2

#1

Uppercase letters are converted to lowercase.

#2

Contains one of the following:

  • continue pending (pending continuation)

  • pause pending (pending pause)

  • paused (paused)

  • running (running)

  • start pending (pending startup)

  • stop pending (pending stop)

  • stopped (stopped)

  • unknown (unknown)

■ Collector

Windows exporter has a built-in collection process called a "collector" for each monitored resource such as CPU and memory.

If you want to add the metrics listed in the table above as acquisition fields, you must enable the collector corresponding to the metric you want to use. You can also disable collectors of metrics that you do not want to collect to suppress unnecessary collection.

Each collector can be enabled or disabled with the "--collectors.enabled" option on the Windows exporter command line, or with the "collectors.enabled" item in the Windows exporter configuration file (jpc_windows_exporter.yml).

For details about Windows exporter command-line options, see the description of windows_exporter command options in Service definition file (jpc_program-name.service.xml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

For details about Windows exporter configuration file entry "collectors.enabled", see the description of item collectors in Windows exporter configuration file (jpc_windows_exporter.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.
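As a sketch only (the exact layout of jpc_windows_exporter.yml must be taken from the Definition Files chapter referenced above), enabling a set of collectors might look like this; the structure and the collector list shown here are assumptions for illustration.

```yaml
# Hypothetical fragment of the Windows exporter configuration file.
# Only the collectors listed here would run; all others are disabled.
collectors:
  enabled: cpu,memory,logical_disk,net,system,service
```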

■ Specifying Monitored Services

When using the service monitoring function of Windows exporter, the service to be monitored is specified in the "services-where" field of Windows exporter configuration file (jpc_windows_exporter.yml).

For information about Windows exporter configuration file entry "services-where", see the entry "services-where" in Windows exporter configuration file (jpc_windows_exporter.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

The value of the name label of metrics output by the service collector of Windows exporter is set to the service name. Half-width uppercase characters in the monitored service name are converted to half-width lowercase characters; full-width uppercase characters are converted to full-width lowercase characters.
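In the underlying Windows exporter, the services-where value is a WMI WHERE clause that selects the Win32_Service rows to monitor. A minimal sketch follows; the surrounding key structure and the service names are assumptions for illustration, and the actual format of jpc_windows_exporter.yml must be taken from the Definition Files chapter.

```yaml
# Hypothetical fragment: monitor only the two named services.
collector:
  service:
    services-where: "Name='W32Time' OR Name='Spooler'"
```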

- About Monitoring JP1/IM - Agent Services

For the service name of JP1/IM - Agent service, see 10.1 Service of JP1/IM - Agent in the JP1/Integrated Management 3 - Manager Administration Guide. For information about the service name in a logical host environment, see 7.3.6 Newly installing JP1/IM - Agent with integrated agent host (for Windows) in the JP1/Integrated Management 3 - Manager Configuration Guide.

Note that you cannot use the service monitoring function to monitor Prometheus server and Windows exporter services.

(d) Node exporter (Linux performance data collection capability)

Node exporter is an exporter that is installed on a monitored Linux host to obtain operational information from that host.

The Node exporter is installed on the same host as the Prometheus server, and upon a scrape request from the Prometheus server, it collects operational information from the Linux OS of the host and returns it to the Prometheus server.

It can collect operational information about memory, disks, and other resources from inside the host, which cannot be collected by monitoring from outside the host (external monitoring by URL or CloudWatch).

In addition, with JP1/IM - Manager and JP1/IM - Agent version 13-01 or later, you can monitor the operational status of services (programs registered in systemd) on the integrated agent host (Linux) (service monitoring function#).

Note that the service monitoring function cannot be used when JP1/IM - Agent runs inside a container.

#

If you use the service monitoring function in an environment where the version is upgraded from 13-00 to 13-01 or later, you need to configure the settings to perform service monitoring.

The following are the JP1/IM - Manager and JP1/IM - Agent setup instructions:

Where to find instructions for setting up JP1/IM - Manager

See Editing category name definition file for IM management nodes (imdd_category_name.conf) (Optional) in 1.19.3(1)(d) Settings of product plugin (for Windows) in the JP1/Integrated Management 3 - Manager Configuration Guide.

Where to find instructions for setting up JP1/IM - Agent

Refer to the instructions for configuring service monitoring in 2.19.2(3)(f) Configuring service monitor settings (for Linux) (Optional) and 2.19.2(5)(b) Change metric to collect (optional) in the JP1/Integrated Management 3 - Manager Configuration Guide.

This feature creates an IM management node for each service that you want to monitor. For details on displaying the tree, see 3.15.6(1)(i) Tree Format. If you configure an alert, a JP1 event is issued when a service stops and is registered with the IM management node corresponding to the stopped service. You can check the past operational status of a service from the trend display.
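The service state that drives such alerts is exposed by the node_systemd_unit_state metric (described in the table below), which publishes one time series per (unit, state) pair, with the value 1 for the unit's current state. A hedged Python sketch (illustrative only, not JP1/IM code; the sample data shape is assumed) of deriving stopped services from such samples:

```python
# Each sample mimics one node_systemd_unit_state time series:
# value is 1 for the unit's current state, 0 otherwise (assumed shape).
samples = [
    {"name": "sshd.service", "state": "active",   "value": 1},
    {"name": "sshd.service", "state": "inactive", "value": 0},
    {"name": "cron.service", "state": "active",   "value": 0},
    {"name": "cron.service", "state": "inactive", "value": 1},
]

def current_states(samples):
    """Return the current state of each unit (the state whose value is 1)."""
    return {s["name"]: s["state"] for s in samples if s["value"] == 1}

def stopped_units(samples):
    """Units whose current state indicates they are not running."""
    states = current_states(samples)
    return sorted(u for u, st in states.items()
                  if st in ("inactive", "failed", "deactivating"))

print(stopped_units(samples))  # ['cron.service']
```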

■ Main items to be acquired

The main retrieval items of Node exporter are defined in the Node exporter metric definition file (default) and the Node exporter (service monitoring) metric definition file (default). For details, see Node exporter metric definition file (metrics_node_exporter.conf) and Node exporter (service monitoring) metric definition file (metrics_node_exporter_service.conf) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

You can add retrieval items to the metric definition file. The following are the metrics that can be specified in the PromQL statements described in the definition file. For details of "Collector" in the table, refer to the description under ■ Collector following the table.

Metric Name

Collector

What to Get

Label

node_boot_time_seconds

stat

Last boot time

Note:

Shown in UNIX time, including microseconds.

instance: Instance identification string

job: Job name

node_context_switches_total

stat

Number of times a context switch has been made (cumulative)

instance: Instance identification string

job: Job name

node_cpu_seconds_total

cpu

CPU seconds spent in each mode (cumulative)

instance: Instance identification string

job: Job name

cpu: cpuid

mode: Mode#

#

Contains one of the following:

  • user

  • nice

  • system

  • idle

  • iowait

  • irq

  • softirq

  • steal

node_disk_io_now

diskstats

Number of disk I/Os currently in progress

instance: Instance identification string

job: Job name

device: Device name

node_disk_io_time_seconds_total

diskstats

Seconds spent on disk I/O (cumulative)

instance: Instance identification string

job: Job name

device: Device name

node_disk_read_bytes_total

diskstats

Number of bytes successfully read from disk (cumulative)

instance: Instance identification string

job: Job name

device: Device name

node_disk_read_time_seconds_total

diskstats

Seconds taken to read from disk (cumulative)

instance: Instance identification string

job: Job name

device: Device name

node_disk_reads_completed_total

diskstats

Number of successfully completed reads from disk (cumulative)

instance: Instance identification string

job: Job name

device: Device name

node_disk_write_time_seconds_total

diskstats

Seconds taken to write to disk (cumulative)

instance: Instance identification string

job: Job name

device: Device name

node_disk_writes_completed_total

diskstats

Number of successfully completed disk writes (cumulative)

instance: Instance identification string

job: Job name

device: Device name

node_disk_written_bytes_total

diskstats

Number of bytes successfully written to disk (cumulative)

instance: Instance identification string

job: Job name

device: Device name

node_filesystem_avail_bytes

filesystem

Number of file system bytes available to non-root users

instance: Instance identification string

job: Job name

fstype: File System Type

mountpoint: Mount Point

node_filesystem_files

filesystem

Number of file nodes in the file system

instance: Instance identification string

job: Job name

fstype: File System Type

mountpoint: Mount Point

node_filesystem_files_free

filesystem

Number of free file nodes in the file system

instance: Instance identification string

job: Job name

fstype: File System Type

mountpoint: Mount Point

node_filesystem_free_bytes

filesystem

Number of bytes of free file system space

instance: Instance identification string

job: Job name

fstype: File System Type

mountpoint: Mount Point

node_filesystem_size_bytes

filesystem

File system capacity in bytes

instance: Instance identification string

job: Job name

fstype: File System Type

mountpoint: Mount Point

node_intr_total

stat

Number of interrupts handled (cumulative)

instance: Instance identification string

job: Job name

node_load1

loadavg

One-minute average of the number of jobs in the run queue

instance: Instance identification string

job: Job name

node_load15

loadavg

15-minute average of the number of jobs in the run queue

instance: Instance identification string

job: Job name

node_load5

loadavg

5-minute average of the number of jobs in the run queue

instance: Instance identification string

job: Job name

node_memory_Active_file_bytes

meminfo

Bytes of recently used file cache memory

Note:

The value obtained by converting the Active(file) of /proc/meminfo to bytes.

instance: Instance identification string

job: Job name

node_memory_Buffers_bytes

meminfo

Number of bytes in the file buffer

Note:

The value of Buffers converted to bytes in /proc/meminfo.

instance: Instance identification string

job: Job name

node_memory_Cached_bytes

meminfo

Number of bytes in file read cache memory

Note:

This is the value of Cached converted to bytes in /proc/meminfo.

instance: Instance identification string

job: Job name

node_memory_Inactive_file_bytes

meminfo

Number of bytes of file cache memory that have not been used recently

Note:

The value of the Inactive(file) of /proc/meminfo converted to bytes.

instance: Instance identification string

job: Job name

node_memory_MemAvailable_bytes

meminfo

The number of bytes of memory available to start a new application without swapping

Note:

The value of MemAvailable in /proc/meminfo converted to bytes.

instance: Instance identification string

job: Job name

node_memory_MemFree_bytes

meminfo

Number of bytes of free memory

Note:

The value of MemFree in /proc/meminfo converted to bytes.

instance: Instance identification string

job: Job name

node_memory_MemTotal_bytes

meminfo

Total memory in bytes

Note:

The value of MemTotal converted to bytes in /proc/meminfo.

instance: Instance identification string

job: Job name

node_memory_SReclaimable_bytes

meminfo

Number of bytes in the Slab cache that can be reclaimed

Note:

SReclaimable in /proc/meminfo converted to bytes.

instance: Instance identification string

job: Job name

node_memory_SwapFree_bytes

meminfo

Number of bytes of free swap memory space

Note:

The value of SwapFree in /proc/meminfo converted to bytes.

instance: Instance identification string

job: Job name

node_memory_SwapTotal_bytes

meminfo

Total swap space in bytes

Note:

This is the value of SwapTotal converted to bytes in /proc/meminfo.

instance: Instance identification string

job: Job name

node_netstat_Icmp6_InMsgs

netstat

Number of ICMPv6 messages received (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Icmp_InMsgs

netstat

Number of ICMPv4 messages received (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Icmp6_OutMsgs

netstat

Number of ICMPv6 messages sent (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Icmp_OutMsgs

netstat

Number of ICMPv4 messages sent (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Tcp_InSegs

netstat

Number of TCP packets received (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Tcp_OutSegs

netstat

Number of TCP packets sent (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Udp_InDatagrams

netstat

Number of UDP packets received (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_netstat_Udp_OutDatagrams

netstat

Number of UDP packets sent (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_flags

netclass

A numeric value indicating the state of the interface

Note:

The value of /sys/class/net/[iface]/flags, expressed in decimal.

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_iface_link

netclass

Interface serial number

Note:

The value of /sys/class/net/[iface]/iflink.

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_mtu_bytes

netclass

Interface MTU value

Note:

The value of /sys/class/net/[iface]/mtu.

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_receive_bytes_total

netdev

Number of bytes received by the network device (cumulative value)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_receive_errs_total

netdev

Number of network device receive errors (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_receive_packets_total

netdev

Number of packets received by network devices (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_transmit_bytes_total

netdev

Number of bytes sent by the network device (cumulative value)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_transmit_colls_total

netdev

Number of transmit collisions for network devices (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_transmit_errs_total

netdev

Number of transmission errors for network devices (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_network_transmit_packets_total

netdev

Number of packets sent by network devices (cumulative)

instance: Instance identification string

job: Job name

device: Network Device Name

node_time_seconds

time

System time in seconds since the epoch (1970-01-01 UTC)

instance: Instance identification string

job: Job name

node_uname_info

uname

System information obtained by the uname system call

instance: Instance identification string

job: Job name

domainname: NIS and YP domain names

machine: Hardware Identifiers

nodename: Network node host name

release: Operating system release number (e.g. "2.6.28")

sysname: The name of the OS (e.g. "Linux")

version: Operating system version

node_vmstat_pswpin

vmstat

Number of page swap-ins (cumulative)

Note:

The value of the pswpin in /proc/vmstat.

instance: Instance identification string

job: Job name

node_vmstat_pswpout

vmstat

Number of page swap-outs (cumulative)

Note:

The value of pswpout in /proc/vmstat.

instance: Instance identification string

job: Job name

node_systemd_unit_state

systemd

The state of the systemd unit.

instance: instance-identifier-string

job: job-name

name: unit-file-name

state: service-status#1

type: How to launch a process#2

#1

Contains one of the following:

  • activating (during startup)

  • active (running)

  • deactivating (stopped)

  • failed (failed to execute)

  • inactive (stopped)

#2

Contains the Type value of the unit file.
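Many of the metrics above, such as node_cpu_seconds_total, are cumulative counters, so a usable utilization figure comes from the rate of change between two scrapes. A Python sketch of the arithmetic (illustrative only; it mirrors what a PromQL rate() over the idle mode would compute, with assumed sample values):

```python
def cpu_busy_percent(idle_prev, idle_curr, interval_s, num_cpus):
    """Percent CPU busy over an interval, derived from two scrapes of the
    cumulative idle-mode counter node_cpu_seconds_total (all CPUs summed)."""
    idle_rate = (idle_curr - idle_prev) / interval_s  # idle CPU-seconds per second
    return 100.0 * (1.0 - idle_rate / num_cpus)

# Two scrapes 60 s apart on a 4-CPU host: the idle counter grew by 180 s,
# i.e. 3 of 4 CPU-seconds per second were idle -> 25% busy.
print(cpu_busy_percent(10_000.0, 10_180.0, 60.0, 4))  # 25.0
```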

■ Collector

The Node exporter has a built-in collection process called a "collector" for each monitored resource such as CPU and memory.

If you want to add the metrics listed in the table above as retrieval items, you must enable the collector corresponding to each metric you want to use. You can also disable the collectors of metrics that you do not want to collect to suppress unnecessary collection.

Per-collector enable/disable can be specified in the Node exporter command line options. Specify the collector to enable with the "--collector.collector-name" option and the collector to disable with the "--no-collector.collector-name" option.

For details about Node exporter command-line options, see the description of node_exporter command options in Unit definition file (jpc_program-name.service) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.
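What a scrape of an enabled collector returns is plain text in the Prometheus exposition format. A minimal Python parser sketch (a simplified illustration of the format, not its full grammar and not JP1/IM code):

```python
import re

# Matches a simple sample line: metric_name{label="v",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')

def parse_exposition(text):
    """Parse metric sample lines, skipping # HELP / # TYPE comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = SAMPLE_RE.match(line)
        if not m:
            continue
        labels = {}
        if m.group("labels"):
            for key, val in re.findall(r'(\w+)="([^"]*)"', m.group("labels")):
                labels[key] = val
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples

text = '''# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
'''
print(parse_exposition(text))
```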

■ Specifying monitored services

When using the service monitoring function of Node exporter, the services to be monitored are specified with the "--collector.systemd.unit-include" option in the Node exporter unit definition file (jpc_node_exporter.service). Performance data is collected for the services specified in this file that meet one of the following conditions:

  • Automatic start of monitored services is enabled (running systemctl enable)

  • Automatic startup of monitored services is disabled, but the status is active

Performance data for a service whose automatic start is disabled is not collected while the service is stopped. Therefore, if you want to monitor a stopped service whose automatic start is disabled, start the service and collect performance data before creating the IM management node tree.

For details on the unit definition file, see the description of the "--collector.systemd.unit-include" item in node_exporter command options in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

- About monitoring JP1/IM - Agent services

For the unit definition file names of JP1/IM - Agent services, see 10.1 Service of JP1/IM - Agent in the JP1/Integrated Management 3 - Manager Administration Guide. For the unit definition file name in a logical host environment, see 8.3.6 Newly installing JP1/IM - Agent with integrated agent host (for UNIX) in the JP1/Integrated Management 3 - Manager Configuration Guide.

Note that you cannot use the service monitoring function to monitor Prometheus server and Node exporter services.

(e) Process exporter (Linux process data collection capability)

Process exporter, built into a monitored Linux host, collects operating information of processes running on that host.

Installed in the same host as Prometheus server, Process exporter collects operating information of the processes from the Linux OS on the host when triggered by scraping requests from Prometheus server, and returns it to the server.

Process exporter allows you to collect process-related operating information, which cannot be obtained through monitoring from outside the host (such as synthetic monitoring with URLs or CloudWatch), from within the host.

■ Key metric items

The key Process exporter metric items are defined in the Process exporter metric definition file (initial status). For details, see Process exporter metric definition file (metrics_process_exporter.conf) in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

You can add more metric items to the metric definition file. The following table shows the metrics you can specify with PromQL statements used within the definition file.

Metric name

Data to be obtained

Label

namedprocess_namegroup_num_procs

Number of processes in this group.

instance: instance-identifier-string

job: job-name

groupname: group-name

namedprocess_namegroup_cpu_seconds_total

CPU usage based on /proc/[pid]/stat fields utime(14) and stime(15) i.e. user and system time.

instance: instance-identifier-string

job: job-name

groupname: group-name

mode: user or system

namedprocess_namegroup_read_bytes_total

Bytes read based on /proc/[pid]/io field read_bytes. Because /proc/[pid]/io is readable only by the process's user, run process-exporter either as that user or as root to obtain these values; otherwise they cannot be read and the metric stays constant at 0.

instance: instance-identifier-string

job: job-name

groupname: group-name

namedprocess_namegroup_write_bytes_total

Bytes written based on /proc/[pid]/io field write_bytes.

instance: instance-identifier-string

job: job-name

groupname: group-name

namedprocess_namegroup_major_page_faults_total

Number of major page faults based on /proc/[pid]/stat field majflt(12).

instance: instance-identifier-string

job: job-name

groupname: group-name

namedprocess_namegroup_minor_page_faults_total

Number of minor page faults based on /proc/[pid]/stat field minflt(10).

instance: instance-identifier-string

job: job-name

groupname: group-name

namedprocess_namegroup_context_switches_total

Number of context switches based on /proc/[pid]/status fields voluntary_ctxt_switches and nonvoluntary_ctxt_switches. The extra label ctxswitchtype can have two values: voluntary and nonvoluntary.

instance: instance-identifier-string

job: job-name

groupname: group-name

ctxswitchtype: voluntary or nonvoluntary

namedprocess_namegroup_memory_bytes

Number of bytes of memory used. The extra label memtype can have three values:

  • resident: Field rss(24) from /proc/[pid]/stat. This is just the pages which count toward text, data, or stack space. This does not include pages which have not been demand-loaded in, or which are swapped out.

  • virtual: Field vsize(23) from /proc/[pid]/stat, virtual memory size.

  • swapped: Field VmSwap from /proc/[pid]/status, translated from KB to bytes.

If gathering smaps file is enabled, two additional values for memtype are added:

  • proportionalResident: Sum of Pss fields from /proc/[pid]/smaps

  • proportionalSwapped: Sum of SwapPss fields from /proc/[pid]/smaps

instance: instance-identifier-string

job: job-name

groupname: group-name

memtype: resident, virtual, swapped, proportionalResident, or proportionalSwapped

namedprocess_namegroup_open_filedesc

Number of file descriptors, based on counting how many entries are in the directory /proc/[pid]/fd.

instance: instance-identifier-string

job: job-name

groupname: group-name

namedprocess_namegroup_worst_fd_ratio

Worst ratio of open filedescs to filedesc limit, amongst all the procs in the group. The limit is the fd soft limit based on /proc/[pid]/limits.

instance: instance-identifier-string

job: job-name

groupname: group-name

namedprocess_namegroup_oldest_start_time_seconds

Epoch time (seconds since 1970/1/1) at which the oldest process in the group started. This is derived from field starttime(22) from /proc/[pid]/stat, added to boot time to make it relative to epoch.

instance: instance-identifier-string

job: job-name

groupname: group-name

namedprocess_namegroup_num_threads

Sum of number of threads of all process in the group. Based on field num_threads(20) from /proc/[pid]/stat.

instance: instance-identifier-string

job: job-name

groupname: group-name

namedprocess_namegroup_states

Number of threads in the group in each of various states, based on the field state(3) from /proc/[pid]/stat.

The extra label state can have these values: Running, Sleeping, Waiting, Zombie, Other.

instance: instance-identifier-string

job: job-name

groupname: group-name

state: Running, Sleeping, Waiting, Zombie, or Other

namedprocess_namegroup_thread_count

Number of threads in this thread subgroup.

instance: instance-identifier-string

job: job-name

groupname: group-name

threadname: thread-name

namedprocess_namegroup_thread_cpu_seconds_total

Same as cpu_user_seconds_total and cpu_system_seconds_total, but broken down per-thread subgroup.

instance: instance-identifier-string

job: job-name

groupname: group-name

threadname: thread-name

mode: user or system

namedprocess_namegroup_thread_io_bytes_total

Same as read_bytes_total and write_bytes_total, but broken down per-thread subgroup. Unlike read_bytes_total/write_bytes_total, the label iomode is used to distinguish between read and write bytes.

instance: instance-identifier-string

job: job-name

groupname: group-name

threadname: thread-name

iomode: read or write

namedprocess_namegroup_thread_major_page_faults_total

Same as major_page_faults_total, but broken down per-thread subgroup.

instance: instance-identifier-string

job: job-name

groupname: group-name

namedprocess_namegroup_thread_minor_page_faults_total

Same as minor_page_faults_total, but broken down per-thread subgroup.

instance: instance-identifier-string

job: job-name

groupname: group-name

namedprocess_namegroup_thread_context_switches_total

Same as context_switches_total, but broken down per-thread subgroup.

instance: instance-identifier-string

job: job-name

groupname: group-name

Important
  • Processes whose name contains multi-byte characters cannot be monitored.

  • Process exporter continues to output information for processes that it has collected once, even after those processes stop running. Therefore, if Process exporter is configured to collect information based on PIDs, new time-series data is added every time a process is restarted and its PID changes, resulting in large amounts of unnecessary data.

    Furthermore, using PIDs is not recommended in the open source software (OSS), and version 13-00 of this product is therefore configured not to collect PID information by default (groupname). If you want to manage processes that share the same command line separately, we recommend operational measures such as changing the order of arguments or using PIDs (in the latter case, periodic restarts are needed to prevent collected information from accumulating indefinitely).

    Note that the information collected by Windows exporter differs from what Process exporter collects, because Windows exporter collects PID information.
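Several Process exporter metrics in the table above come from fixed fields of /proc/[pid]/stat (utime is field 14, stime field 15, num_threads field 20). A small Python sketch of extracting them (illustrative only, using a hardcoded sample line with assumed values):

```python
def parse_proc_stat(stat_line):
    """Extract utime(14), stime(15), and num_threads(20) from a
    /proc/[pid]/stat line. Field numbering is 1-based, as in proc(5)."""
    # comm (field 2) is parenthesized and may contain spaces,
    # so split on the last ')'.
    lpar = stat_line.index("(")
    rpar = stat_line.rindex(")")
    pid = int(stat_line[:lpar].strip())
    comm = stat_line[lpar + 1:rpar]
    rest = stat_line[rpar + 1:].split()  # rest[0] is field 3 (state)
    return {
        "pid": pid,
        "comm": comm,
        "utime": int(rest[11]),        # field 14
        "stime": int(rest[12]),        # field 15
        "num_threads": int(rest[17]),  # field 20
    }

# A shortened sample in the /proc/[pid]/stat layout (assumed values).
sample = "1234 (my proc) S 1 1234 1234 0 -1 4194560 500 0 0 0 250 125 0 0 20 0 4 0 100000"
print(parse_proc_stat(sample))
```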

(f) Node exporter for AIX (AIX performance data collection capability)

Node exporter for AIX is an exporter that is embedded in a monitored AIX host to obtain operating information of that host.

Node exporter for AIX is installed on a host other than the one running the Prometheus server. Upon a scrape request from the Prometheus server, it collects operational data from the AIX OS of its host and returns the data to the Prometheus server.

It is possible to collect operational information related to memory and disk, which cannot be collected by monitoring from outside the host (external monitoring by URL or CloudWatch), from inside the host.

■ Prerequisites

It is a prerequisite that the ports used by Node exporter for AIX are protected by firewalls, network configurations, and so on, so that they cannot be accessed by anything other than the Prometheus server of JP1/IM - Agent.

For the ports used by Node exporter for AIX, see the explanation of node_exporter_aix command options in 10.4.2(1) Enabling registering services in the JP1/Integrated Management 3 - Manager Administration Guide.

■ Conditions to be monitored

For the OS versions supported on the host on which you install Node exporter for AIX, see the Release Notes.

WPAR is not supported.

Running multiple instances of Node exporter for AIX on the same host is not supported, even when they are started on both the physical host and a logical host.

The logical host configuration of the monitored AIX hosts is supported only if the following conditions are met:

  • The host name of the monitored AIX host can be uniquely resolved from the Prometheus server.

Note: If more than one IP address is assigned to the monitored AIX host, Node exporter for AIX can be accessed via all of those IP addresses.

For the upper limit of the number of Node exporter for AIX instances that can be monitored by one Prometheus server, see the list of limits for JP1/IM - Agent in Appendix D.1 Limits when using the Intelligent Integrated Management Base.

■ Main items to be acquired

The main retrieval items for the Node exporter for AIX that JP1/IM - Agent ships with are defined in the Node exporter for AIX metric definition file (default). For details, see Node exporter for AIX metric definition file (metrics_node_exporter_aix.conf) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

You can add retrieval items to the metric definition file. The following table lists the metrics that can be specified in PromQL expressions in the definition file:

Metric Name

Command-line option for retrieval

Contents to be acquired

Label

Data Source

node_context_switches

-C

Total number of context switches.

(cumulative value)

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_cpu_total func

pswitch of perfstat_cpu_t structure

node_cpu

-C

Seconds the cpus spent in each mode.

(cumulative value)

instance: instance-identity-string

job: job-name

cpu: cpuid

mode: mode (idle, sys, user, or wait)

Get by perfstat_cpu func

perfstat_cpu_t structure

aix_diskpath_wblks

-D

Blocks written via the path

cpupool_id=physical-processor-shared-pooling-ID

diskpath=disk-path-name

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_diskpath func

wblks of perfstat_diskpath_t structure

aix_diskpath_rblks

-D

Blocks read via the path

cpupool_id=physical-processor-shared-pooling-ID

diskpath=disk-path-name

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_diskpath func

rblks of perfstat_diskpath_t structure

aix_disk_rserv

-d

Read or receive service time

cpupool_id=physical-processor-shared-pooling-ID

disk=disk-name

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

vgname=volume-group-name

Get by perfstat_disk func

rserv of perfstat_disk_t structure

aix_disk_rblks

-d

Number of blocks read from disk

cpupool_id=physical-processor-shared-pooling-ID

disk=disk-name

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

vgname=volume-group-name

Get by perfstat_disk func

rblks of perfstat_disk_t structures

aix_disk_wserv

-d

Write or send service time

cpupool_id=physical-processor-shared-pooling-ID

disk=disk-name

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

vgname=volume-group-name

Get by perfstat_disk func

wserv of perfstat_disk_t structure

aix_disk_wblks

-d

Number of blocks written to disk

cpupool_id=physical-processor-shared-pooling-ID

disk=disk-name

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

vgname=volume-group-name

Get by perfstat_disk func

wblks of perfstat_disk_t structure

aix_disk_time

-d

Amount of time disk is active

cpupool_id=physical-processor-shared-pooling-ID

disk=disk-name

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

vgname=volume-group-name

Get by perfstat_disk func

time of perfstat_disk_t structure

aix_disk_xrate

-d

Number of transfers from disk

cpupool_id=physical-processor-shared-pooling-ID

disk=disk-name

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

vgname=volume-group-name

Get by perfstat_disk func

xrate of perfstat_disk_t structure

aix_disk_xfers

-d

Number of transfers to/from disk

cpupool_id=physical-processor-shared-pooling-ID

disk=disk-name

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

vgname=volume-group-name

Get by perfstat_disk func

xfers of perfstat_disk_t structure

node_filesystem_avail_bytes

-f

Filesystem space available to non-root users in bytes.

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

device=device-name

fstype=file-system-type

mountpoint=mount-point

Get by stat_filesystems func

avail_bytes of filesystem structure

node_filesystem_files

-f

Filesystem total file nodes.

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

device=device-name

fstype=file-system-type

mountpoint=mount-point

Get by stat_filesystems func

files of filesystem structure

node_filesystem_files_free

-f

Filesystem total free file nodes.

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

device=device-name

fstype=file-system-type

mountpoint=mount-point

Get by stat_filesystems func

files_free of filesystem structure

node_filesystem_free_bytes

-f

Filesystem free space in bytes.

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

device=device-name

fstype=file-system-type

mountpoint=mount-point

Get by stat_filesystems func

free_bytes of filesystem structure

node_filesystem_size_bytes

-f

Filesystem size in bytes.

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

device=device-name

fstype=file-system-type

mountpoint=mount-point

Get by stat_filesystems func

size_bytes of filesystem structure

node_intr

-C

Total number of interrupts serviced.

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_cpu_total func

decrintrs of perfstat_cpu_total_t structure

mpcsintrs of perfstat_cpu_total_t structure

devintrs of perfstat_cpu_total_t structure

softintrs of perfstat_cpu_total_t structure

node_load1

-C

1m load average.

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_cpu_total func

loadavg[0] of perfstat_cpu_total_t structure

node_load5

-C

5m load average.

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_cpu_total func

loadavg[1] of perfstat_cpu_total_t structure

node_load15

-C

15m load average.

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_cpu_total func

loadavg[2] of perfstat_cpu_total_t structure

aix_memory_real_avail

-m

Number of pages (in 4KB pages) of memory available without paging out working segments

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_memory_total func

real_avail of perfstat_memory_total_t structure

aix_memory_real_free

-m

Free real memory (in 4 KB pages).

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_memory_total func

real_free of perfstat_memory_total_t structure

aix_memory_real_inuse

-m

Real memory which is in use (in 4KB pages)

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_memory_total func

real_inuse of perfstat_memory_total_t structure

aix_memory_real_total

-m

Total real memory (in 4 KB pages).

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_memory_total func

real_total of perfstat_memory_total_t structure

aix_netinterface_mtu

-i

Network frame size

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

netinterface=network-interface-name

Get by perfstat_netinterface func

mtu of perfstat_netinterface_t structure

aix_netinterface_ibytes

-i

Number of bytes received on interface

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

netinterface=network-interface-name

Get by perfstat_netinterface func

ibytes of perfstat_netinterface_t structure

aix_netinterface_ierrors

-i

Number of input errors on interface

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

netinterface=network-interface-name

Get by perfstat_netinterface func

ierrors of perfstat_netinterface_t structure

aix_netinterface_ipackets

-i

Number of packets received on interface

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

netinterface=network-interface-name

Get by perfstat_netinterface func

ipackets of perfstat_netinterface_t structure

aix_netinterface_obytes

-i

Number of bytes sent on interface

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

netinterface=network-interface-name

Get by perfstat_netinterface func

obytes of perfstat_netinterface_t structure

aix_netinterface_collisions

-i

Number of collisions on csma interface

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

netinterface=network-interface-name

Get by perfstat_netinterface func

collisions of perfstat_netinterface_t structure

aix_netinterface_oerrors

-i

Number of output errors on interface

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

netinterface=network-interface-name

Get by perfstat_netinterface func

oerrors of perfstat_netinterface_t structure

aix_netinterface_opackets

-i

Number of packets sent on interface

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

netinterface=network-interface-name

Get by perfstat_netinterface func

opackets of perfstat_netinterface_t structure

aix_memory_pgspins

-m

Number of page ins from paging space

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_memory_total func

pgspins of perfstat_memory_total_t structure

aix_memory_pgspouts

-m

Number of pages paged out from paging space

cpupool_id=physical-processor-shared-pooling-ID

group_id=group-ID

instance: instance-identity-string

job: job-name

lpar=partition-name

machine_serial=machine-ID

Get by perfstat_memory_total func

pgspouts of perfstat_memory_total_t structure

Node exporter for AIX collects performance data for each monitored resource, such as CPU and memory. You can enable or disable collection for each resource that you want to monitor by using the Node exporter for AIX command-line options.

For Node exporter for AIX command-line options, see the description of node_exporter_aix command options in 10.4.2(1) Enabling registering services in the JP1/Integrated Management 3 - Manager Administration Guide.

Use Script exporter to collect information about processes. For details on how to configure the settings, see 1.23.2(4)(e) Monitoring processes on monitored hosts (AIX) (optional) in the JP1/Integrated Management 3 - Manager Configuration Guide.

Use the JP1/Base log file trap feature to monitor the log files of the monitored AIX hosts.

■ Notes on logging Node exporter for AIX

The Node exporter for AIX log is output to the OS system log, so its destination depends on the OS system log settings. For details on changing the output destination of the system log to which Node exporter for AIX logs, see 1.23.2(4)(f) Changing Node exporter for AIX log output destination (optional) in the JP1/Integrated Management 3 - Manager Configuration Guide.

■ Notes on using SMT or Micro-Partitioning

In an SMT (simultaneous multithreading) or Micro-Partitioning environment, the CPU utilization (cpu_used_rate) metric of Node exporter for AIX is calculated without taking physical CPU quotas into account, whereas the CPU utilization displayed by the sar command is calculated with physical CPU quotas taken into account.

Therefore, the CPU utilization (cpu_used_rate) metric of Node exporter for AIX might show a lower value than the sar command output.

(g) Yet another cloudwatch exporter (Amazon CloudWatch performance data collection capability)

Yet another cloudwatch exporter is an exporter included in the integrated agent that uses Amazon CloudWatch to collect operating information for AWS services in the cloud.

Yet another cloudwatch exporter is installed on the same host as the Prometheus server. Upon a scrape request from the Prometheus server, it collects CloudWatch metrics via the SDK provided by AWS (AWS SDK)# and returns them to the Prometheus server.

#

The SDK provided by Amazon Web Services (AWS). Yet another cloudwatch exporter uses the AWS SDK for Go (V1). CloudWatch monitoring requires that Amazon CloudWatch support the AWS SDK for Go (V1).

You can use it to monitor services on which Node exporter or Windows exporter cannot be installed.

■ Main items to be acquired

The main retrieval items of Yet another cloudwatch exporter are defined in Yet another cloudwatch exporter metric definition file (default). For details, see Yet another cloudwatch exporter metric definition file (metrics_ya_cloudwatch_exporter.conf) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

■ CloudWatch metrics you can collect

You can collect metrics for the AWS namespaces that are supported for monitoring by Yet another cloudwatch exporter of JP1/IM - Agent, which are listed in 3.15.6(1)(k) Creating an IM Management Node for Yet another cloudwatch exporter.

Specify the metrics to collect by describing the AWS service name and CloudWatch metric name in the Yet another Cloudwatch Exporter configuration file (jpc_ya_cloudwatch_exporter.yml).

The following is an example of the Yet another cloudwatch exporter configuration file for collecting the CloudWatch metrics CPUUtilization and DiskReadBytes of the AWS/EC2 service.

discovery:
  exportedTagsOnMetrics:
    ec2:
      - jp1_pc_nodelabel
  jobs:
    - type: ec2
      regions:
        - ap-northeast-1
      period: 60
      length: 300
      delay: 60
      nilToZero: true
      searchTags:
        - key: jp1_pc_nodelabel
          value: .*
      metrics:
        - name: CPUUtilization
          statistics:
            - Maximum
        - name: DiskReadBytes
          statistics:
            - Maximum

For details about the contents of the Yet another cloudwatch exporter configuration file, see Yet another cloudwatch exporter configuration file (jpc_ya_cloudwatch_exporter.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

You can also add new metrics to the Yet another cloudwatch exporter metrics definition file using the metrics you set in the Yet another cloudwatch exporter configuration file.

The metrics and labels specified in the PromQL statement described in the definition file conform to the following naming conventions:

- Naming conventions for Exporter metrics

Yet another cloudwatch exporter automatically converts CloudWatch metric names into exporter metric names according to the following rules. Metrics specified in PromQL statements must be written using the exporter metric names.

"aws_"#1+Namespace#2+"_"+CloudWatch_Metric#2+"_"+Statistic_Type#2

#1

Appended if the namespace does not begin with "aws_".

#2

Indicates the name you set in the Yet another cloudwatch exporter configuration file (jpc_ya_cloudwatch_exporter.yml). It is converted by the following rules:

  • It is converted from camel case notation to snake case notation.

    CamelCase is a notation that capitalizes word breaks, such as "CamelCase" or "camelCase."

    Snake case is a notation that separates words with "_", such as "snake_case".

  • The following symbols are converted to "_".

    whitespace, comma, tab, /, \, half-width period, -, :, =, full-width left double quote, @, <, >

  • "%" is converted to "_percent".

- Exporter label naming conventions

Yet another cloudwatch exporter automatically converts CloudWatch dimension names and tag names into exporter label names according to the following rules. Labels specified in PromQL statements must be written using the exporter label names.

  • For dimensions

    "dimension"+"_"+dimensions_name#

  • For tags

    "tag"+"_"+tag_name#

  • For custom tags

    "custom_tag"+"_"+custom_tag_name#

#

Indicates the name you set in the Yet another cloudwatch exporter configuration file (jpc_ya_cloudwatch_exporter.yml).
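As an illustration, the metric-name conversion rules above can be sketched in Python. This is a hypothetical helper, not part of the product, and it covers only the ASCII symbols from the list of converted characters:

```python
import re

# Symbols that the rules above convert to "_" (ASCII subset only).
SYMBOLS = re.compile(r'[ ,\t/\\.\-:=@<>]')
# Boundary between a lowercase letter or digit and an uppercase letter.
CAMEL_BOUNDARY = re.compile(r'(?<=[a-z0-9])(?=[A-Z])')

def to_exporter_name(part: str) -> str:
    part = part.replace('%', '_percent')   # "%" -> "_percent"
    part = SYMBOLS.sub('_', part)          # listed symbols -> "_"
    part = CAMEL_BOUNDARY.sub('_', part)   # camel case -> snake case
    return part.lower()

def exporter_metric_name(namespace: str, metric: str, statistic: str) -> str:
    # "aws_" + namespace + "_" + CloudWatch metric + "_" + statistic type,
    # with "aws_" prepended only when the converted namespace lacks it.
    ns = to_exporter_name(namespace)
    prefix = '' if ns.startswith('aws_') else 'aws_'
    return prefix + ns + '_' + to_exporter_name(metric) + '_' + to_exporter_name(statistic)
```

For example, CPUUtilization in the AWS/EC2 namespace with statistic type Maximum becomes aws_ec2_cpuutilization_maximum.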

■ About policies for IAM users in your AWS account

To connect to AWS CloudWatch, you must create a policy with the following permissions and assign it to an IAM user.

"tag:GetResources",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics",
"cloudwatch:ListMetrics"

For details on how to set JSON format, see 2.19.2(8)(b) Modify Setup to connect to CloudWatch (for Linux) (optional) in the JP1/Integrated Management 3 - Manager Configuration Guide.
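Assembled into a policy document, the permissions above take roughly the following JSON shape. The Sid and the Resource value of "*" are illustrative assumptions; see the referenced section for the exact format:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "JP1IMAgentCloudWatch",
      "Effect": "Allow",
      "Action": [
        "tag:GetResources",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*"
    }
  ]
}
```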

■ Environment-variable HTTPS_PROXY

This environment variable is specified when Yet another cloudwatch exporter connects to CloudWatch through a proxy. Only an http URL can be set in the environment variable HTTPS_PROXY. Note that Basic authentication is the only supported authentication method.

You can set the environment-variable HTTPS_PROXY to connect to AWS CloudWatch through proxies. The following shows an example configuration.

HTTPS_PROXY=http://username:password@proxy.example.com:5678

■ How to handle monitoring targets JP1/IM - Agent does not support

If a product or metric cannot be monitored by JP1/IM - Agent, you must collect it by some other means, for example, by using a user-defined Exporter.

(h) Promitor (Azure Monitor performance data collection capability)

Promitor, included in the integrated agent, collects operating information of Azure services on the cloud environment through Azure Monitor and Azure Resource Graph.

Promitor consists of Promitor Scraper and Promitor Resource Discovery. Promitor Scraper collects metrics on resources from Azure Monitor according to schedule settings and returns them.

Metrics can be collected from target resources in two ways: one method is to specify the target resources separately in a configuration file and the other is to detect the resources automatically. If you choose to detect them automatically, Promitor Resource Discovery detects resources in a tenant through Azure Resource Graph, and based on the results, Promitor Scraper collects metric information.

In addition, both Promitor Scraper and Promitor Resource Discovery require two configuration files for each of them. One configuration file is to define runtime settings, such as authentication information, and the other is to define metric information to be collected.

■ Key metric items

The key Promitor metric items are defined in the Promitor metric definition file (initial status). For details, see the description under Promitor metric definition file (metrics_promitor.conf) in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

■ Metrics you can collect

Promitor can collect metrics for the following services to monitor:

You specify metrics you want to collect in the Promitor Scraper configuration file (metrics-declaration.yaml).
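As a rough sketch, a metric declaration in metrics-declaration.yaml has the following shape. The field names follow Promitor's public declaration format, and the tenant, subscription, resource names, and metric shown are placeholders; see the referenced sections for the exact format used by JP1/IM - Agent:

```yaml
version: v1
azureMetadata:
  tenantId: <tenant-id>
  subscriptionId: <subscription-id>
  resourceGroupName: <resource-group>
metrics:
  - name: azure_vm_percentage_cpu
    description: "Average CPU utilization of a virtual machine"
    resourceType: VirtualMachine
    azureMetricConfiguration:
      metricName: Percentage CPU
      aggregation:
        type: Average
    resources:
      - virtualMachineName: <vm-name>
```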

If you want to change the metrics specified in the Promitor Scraper configuration file, see Change monitoring metrics (optional) in 1.21.2(8)(d) Configuring scraping targets (required) under Set up of Promitor in the JP1/Integrated Management 3 - Manager Configuration Guide.

You can also add new metrics to the Promitor metric definition file, based on the metrics specified in the Promitor Scraper configuration file. Metrics defined in the Promitor Scraper configuration file can be specified in the PromQL statements written in the definition file.

Table 3‒16: Services supported as monitoring targets by Promitor

Promitor resourceType name

Azure Monitor namespace

Automatic discovery support

VirtualMachine

Microsoft.Compute/virtualMachines

Y

FunctionApp

Microsoft.Web/sites

Y

ContainerInstance

Microsoft.ContainerInstance/containerGroups

--

KubernetesService

Microsoft.ContainerService/managedClusters

Y

FileStorage

Microsoft.Storage/storageAccounts/fileServices

--

BlobStorage

Microsoft.Storage/storageAccounts/blobServices

--

ServiceBusNamespace

Microsoft.ServiceBus/namespaces

Y

CosmosDb

Microsoft.DocumentDB/databaseAccounts

Y

SqlDatabase

Microsoft.Sql/servers/databases

Y

SqlServer

Microsoft.Sql/servers/databases

Microsoft.Sql/servers/elasticPools

--

SqlManagedInstance

Microsoft.Sql/managedInstances

Y

SqlElasticPool

Microsoft.Sql/servers/elasticPools

Y

LogicApp

Microsoft.Logic/workflows

Y

Legend:

Y: Automatic discovery is supported.

--: Automatic discovery is not supported.

■ Checking how Azure SDKs used by Promitor are supported

Promitor employs the Azure SDK for .NET. The end of support for an Azure SDK version is announced 12 months in advance. For details on the Azure SDK lifecycle, see the Lifecycle FAQ at the following website:

https://learn.microsoft.com/ja-jp/lifecycle/faq/azure#azure-sdk-----------

The lifecycles of individual Azure SDK library versions are listed at the following website:

https://azure.github.io/azure-sdk/releases/latest/all/dotnet.html

■ Credentials required for account information

Promitor can connect to Azure through either the service principal method or the managed identity method. For details on the credentials assigned to the service principal or managed identity, see (a) Configuring the settings for establishing a connection to Azure (required) under 1.21.2(8) Set up of Promitor in the JP1/Integrated Management 3 - Manager Configuration Guide.

(i) Blackbox exporter (Synthetic metric collector)

Blackbox exporter is an exporter that sends simulated requests to monitored Internet services on the network and obtains operating information from the responses. The supported communication protocols are HTTP, HTTPS, and ICMP.

When the Blackbox exporter receives a scrape request from the Prometheus server, it sends a service request such as an HTTP request to the monitored target and obtains the response and the response time. It then summarizes the execution results as metrics and returns them to the Prometheus server.

■ Main items to be acquired

The main retrieval items of Blackbox exporter are defined in Blackbox exporter metric definition file (default). For details, see Blackbox exporter metric definition file (metrics_blackbox_exporter.conf) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

You can add retrieved items to the metric definition file. The following are the metrics that can be specified in the PromQL statement described in the definition file.

Metric Name

Prober

What to get

Label

probe_http_duration_seconds

http

The number of seconds taken per phase of the HTTP request

Note:

The durations of all redirects are added up.

instance: Instance identification string

job: Job name

phase: Phase#

#

Contains one of the following:

  • "resolve"

  • "connect"

  • "tls"

  • "processing"

  • "transfer"

probe_http_content_length

http

HTTP content response length

instance: Instance identification string

job: Job name

probe_http_uncompressed_body_length

http

Uncompressed response body length

instance: Instance identification string

job: Job name

probe_http_redirects

http

Number of redirects

instance: Instance identification string

job: Job name

probe_http_ssl

http

Whether SSL was used for the final redirect

  • 0: TLS/SSL was not used

  • 1: TLS/SSL was used

instance: Instance identification string

job: Job name

probe_http_status_code

http

HTTP response status code value

Note:

If you are redirecting, the final status code is the value of the metric.

If no redirection is performed, the first status code received is the value of the metric.

instance: Instance identification string

job: Job name

probe_ssl_earliest_cert_expiry

http

Earliest expiring SSL certificate UNIX time

instance: Instance identification string

job: Job name

probe_ssl_last_chain_expiry_timestamp_seconds

http

Expiration timestamp of the last certificate in the SSL chain

Note:

If you want to monitor this metric, you must specify false for the insecure_skip_verify parameter in the tls_config settings of the Blackbox exporter configuration file (jpc_blackbox_exporter.yml), place the certificate, and specify the path of the certificate file in the appropriate parameter.

instance: Instance identification string

job: Job name

probe_ssl_last_chain_info

http

SSL leaf certificate information

Note:

This is the SHA256 hash value of the server certificate to be monitored. The hash value is set to the label "fingerprint_sha256".

instance: Instance identification string

job: Job name

fingerprint_sha256: SHA256 fingerprint on certificate

probe_tls_version_info

http

TLS version used

Note:

The TLS version, such as "TLS 1.2", is set to the label "version".

instance: Instance identification string

job: Job name

version: TLS version

probe_http_version

http

HTTP version of the probe response

instance: Instance identification string

job: Job name

probe_failed_due_to_regex

http

Whether the probe failed due to a regular expression check on the response body or response headers

  • 0: Success

  • 1: Failed

instance: Instance identification string

job: Job name

probe_http_last_modified_timestamp_seconds

http

UNIX time of the Last-Modified HTTP response header

instance: Instance identification string

job: Job name

probe_icmp_duration_seconds

icmp

Seconds taken per phase of an ICMP request

instance: Instance identification string

job: Job name

phase: Phase#

#

Contains one of the following:

  • resolve

    Name resolution time

  • setup

    Time from resolve completion to ICMP packet transmission

  • rtt

    Time to get a response after setup

probe_icmp_reply_hop_limit

icmp

Hop limit (TTL for IPv4) value

instance: Instance identification string

job: Job name

probe_success

--

Whether the probe was successful

  • 0: Failed

  • 1: Success

instance: Instance identification string

job: Job name

probe_duration_seconds

--

The number of seconds it took for the probe to complete

instance: Instance identification string

job: Job name

■ IP communication with monitored objects

Only IPv4 communication is supported.

■ Encrypted communication with monitored objects

HTTP monitoring enables encrypted communication using TLS. In this case, the Blackbox exporter acts as a TLS client to the monitored object (TLS server).

To use encrypted communication with TLS, specify the settings in item "tls_config" in the Blackbox exporter configuration file (jpc_blackbox_exporter.yml). In addition, the following certificate and key files must be prepared.

File

Format

CA certificate file

A file in which an X509 public key certificate in pkcs7 format is encoded in PEM format

Client certificate file

Client certificate key file

A file in which the private key in pkcs1 or pkcs8 format is encoded in PEM format#

#

You cannot use password-protected files.
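A minimal sketch of the "tls_config" settings is shown below. The module name and file paths are placeholders, and the key_file parameter name is assumed from the standard Prometheus-style TLS configuration; confirm the exact parameter names in the definition-file reference:

```yaml
modules:
  http_tls:
    prober: http
    http:
      tls_config:
        ca_file: /path/to/ca.pem          # CA certificate file
        cert_file: /path/to/client.pem    # client certificate file
        key_file: /path/to/client.key     # client certificate key file (not password-protected)
        insecure_skip_verify: false       # keep server certificate verification enabled
```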

The supported TLS versions and cipher suites are as follows.

Item

Scope of support

TLS Version

1.2 to 1.3

Cipher suites

  • "TLS_RSA_WITH_AES_128_CBC_SHA" (up to TLS 1.2)

  • "TLS_RSA_WITH_AES_256_CBC_SHA" (up to TLS 1.2)

  • "TLS_RSA_WITH_AES_128_GCM_SHA256" (TLS 1.2 only)

  • "TLS_RSA_WITH_AES_256_GCM_SHA384" (TLS 1.2 only)

  • "TLS_AES_128_GCM_SHA256" (TLS 1.3 only)

  • "TLS_AES_256_GCM_SHA384" (TLS 1.3 only)

  • "TLS_CHACHA20_POLY1305_SHA256" (TLS 1.3 only)

  • "TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA" (up to TLS 1.2)

  • "TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA" (up to TLS 1.2)

  • "TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA" (up to TLS 1.2)

  • "TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA" (up to TLS 1.2)

  • "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256" (TLS 1.2 only)

  • "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384" (TLS 1.2 only)

  • "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256" (TLS 1.2 only)

  • "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384" (TLS 1.2 only)

  • "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256" (TLS 1.2 only)

  • "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256" (TLS 1.2 only)

■ Timeout for collecting health information

In a network environment where responses are slow even under normal conditions, operating information can still be collected by adjusting the timeout period.

On the Prometheus server, you can specify the scrape request timeout period in the entry "scrape_timeout" of the Prometheus configuration file (jpc_prometheus_server.yml). For details, see the description of item scrape_timeout in Prometheus configuration file (jpc_prometheus_server.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

In addition, the timeout period for connections from the Blackbox exporter to the monitored target is 0.5 seconds shorter than the value specified in "scrape_timeout" above.
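For example, assuming the scrape-job layout shown earlier, the timeout could be extended for HTTP/HTTPS monitoring as follows. The 30s value is illustrative, and the surrounding entries of jpc_prometheus_server.yml are abbreviated:

```yaml
scrape_configs:
  - job_name: jpc_blackbox_http
    scrape_timeout: 30s    # Blackbox exporter then connects with a 29.5-second timeout
```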

■ Certificate expiration

When collecting operation information through HTTPS monitoring, the exporter receives a certificate list (the server certificate and the certificates that certify it) from the monitored target.

The Blackbox exporter collects the expiration time (UNIX time) of the earliest-expiring certificate as the probe_ssl_earliest_cert_expiry metric.

You can also monitor certificates that are approaching expiration by using the function described in 3.15.1(3) Performance data monitoring notification function, because the number of seconds remaining until expiration can be calculated as the probe_ssl_earliest_cert_expiry metric value minus the value of PromQL's time() function.
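For example, the following PromQL expression (the 30-day threshold is illustrative) selects probes whose earliest-expiring certificate has fewer than 30 days of validity remaining:

```
probe_ssl_earliest_cert_expiry - time() < 86400 * 30
```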

■ User-Agent value in HTTP request header when monitoring HTTP

The default value of User-Agent included in HTTP request header during HTTP monitoring is as shown below:

  • For version 13-00 or earlier

    "Go-http-client/1.1"

  • For version 13-00-01 or later

    "Blackbox Exporter/0.24.0"

You can change the value of User-Agent in the setting of item "headers" in the Blackbox exporter configuration file (jpc_blackbox_exporter.yml).

The following is an example of changing the value of User-Agent to "My-Http-Client".

modules:
  http:
    prober: http
    http:
      headers:
        User-Agent: "My-Http-Client"

For details, see the description of item headers in Blackbox exporter configuration file (jpc_blackbox_exporter.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

■ About HTTP 1.1 Name-Based Virtual Host Support

The Blackbox exporter supports HTTP 1.1 name-based virtual hosts and TLS Server Name Indication (SNI). You can monitor virtual hosts that make a single HTTP/HTTPS server behave as multiple HTTP/HTTPS servers.

■ About TLS Server Authentication and Client Authentication

In Blackbox exporter's HTTPS monitoring, server authentication is performed by using the CA certificate specified in item "ca_file" of the Blackbox exporter configuration file (jpc_blackbox_exporter.yml) together with the server certificate sent by the server at the start of HTTPS communication (TLS handshake).

If the sent certificate is invalid (for example, the server name is incorrect, the certificate has expired, or a self-signed certificate is used), HTTPS communication cannot start and monitoring fails.

In addition, when a request is made to send a certificate from the monitored server at the start of HTTPS communication (TLS handshake), the client certificate described in item "cert_file" of the Blackbox exporter configuration file (jpc_blackbox_exporter.yml) is sent to the monitored server.

If the server validates the sent certificate, recognizes it as invalid, and returns an error to the Blackbox exporter via the TLS protocol (or if communication cannot continue, for example, because the connection is lost), monitoring fails.

For details on the verification contents related to the client certificate and the operation in the event of an error on the monitored server, check the specifications of the monitored server (or relay device such as a load balancer).

Even if an invalid certificate is detected during server authentication, specifying "true" for item "insecure_skip_verify" in the Blackbox exporter configuration file (jpc_blackbox_exporter.yml) allows HTTPS communication to start without an error. Note, however, that in that case the certificate verification performed for server authentication is disabled.

For details, see the description of item insecure_skip_verify in Blackbox exporter configuration file (jpc_blackbox_exporter.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

Server authentication cannot be performed with certificates in which the host name is not listed in the Subject Alternative Name field.

■ About cookie information

The Blackbox exporter does not use cookie information sent from the monitored target in subsequent HTTP requests.

■ About external resources referenced from content included in the response body of HTTP communication

In Blackbox exporter, external resources (such as subframes and images) referenced from the content in the response body of HTTP communication are not included in the monitoring scope.

■ About Monitoring of Content Included in HTTP Communication Response Body

Because the Blackbox exporter does not parse content, execution results and execution times based on the syntax (such as HTML or JavaScript) of the content in the response body of HTTP communication are not reflected in the monitoring results.

■ Precautions when the monitoring destination of HTTP monitoring redirects with Basic authentication

If the destination monitored by the Blackbox exporter's HTTP monitoring redirects with Basic authentication, the Blackbox exporter sends the same Basic authentication user name and password to both the redirect source and the redirect destination. Therefore, when Basic authentication is performed at both the redirect source and the redirect destination, the same user name and password must be set at both.

(j) Script exporter (UAP monitoring capability)

Script exporter runs scripts on a host and gets results.

Installed on the same host as the Prometheus server, Script exporter runs a script on the host when triggered by a scrape request from the Prometheus server, and returns the result to the server.

By developing a script that obtains UAP information and converts it to metrics, and adding the script to Script exporter, you can monitor applications that are not supported by any bundled Exporter.
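One possible shape for such a script is sketched below in Python, assuming the script prints metrics in the Prometheus text exposition format when Script exporter runs it. The metric name and the idea of counting files in an application work directory are illustrative:

```python
#!/usr/bin/env python3
# Hypothetical Script exporter script: report the number of files in a
# UAP work directory as a gauge in the Prometheus text exposition format.
import os
import sys

def render_metric(work_dir: str) -> str:
    # Count the entries in the work directory and format them as a metric.
    backlog = len(os.listdir(work_dir))
    return ("# TYPE uap_backlog_files gauge\n"
            f"uap_backlog_files {backlog}\n")

if __name__ == "__main__":
    work_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    sys.stdout.write(render_metric(work_dir))
```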

■ Key metric items

The key Script exporter metric items are defined in the Script exporter metric definition file (initial status). For details, see Script exporter metric definition file (metrics_script_exporter.conf) in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

You can add more metric items to the metric definition file. The following table shows the metrics you can specify with PromQL statements used within the definition file.

Metric name

Data to be obtained

Label

script_success

Script exit status (0 = error, 1 = success)

instance: instance-identifier-string

job: job-name

script: script-name

script_duration_seconds

Script execution time, in seconds.

instance: instance-identifier-string

job: job-name

script: script-name

script_exit_code

The exit code of the script.

instance: instance-identifier-string

job: job-name

script: script-name

(k) OracleDB exporter (Oracle Database monitoring function)

OracleDB exporter is an Exporter for Prometheus that retrieves performance data from Oracle Database.

- About the number of sessions

When OracleDB exporter monitors Oracle Database, it connects at each scrape and disconnects when data collection completes. Each connection uses one session.

■ Conditions to be monitored

The following Oracle Database configurations are supported as monitoring targets of JP1/IM - Agent:

  • For non-clusters

    Non-CDB and CDB configurations

  • For Oracle RAC

    CDB configuration

Because a single OracleDB exporter process connects to only one service, multiple OracleDB exporter instances are launched when there is more than one monitoring target.

Note
  • Oracle RAC One Node and Oracle Database Cloud Service are not supported.

  • HA clustering configuration on Oracle Database is not supported.

■ Acquisition items

The metrics that can be retrieved by the OracleDB exporter shipped with JP1/IM - Agent are the metrics defined by default in OracleDB exporter, plus cache_hit_ratio.

OracleDB exporter retrieval items are defined in metric definition-file (default) of OracleDB exporter. For details, see OracleDB exporter metric definition file (metrics_oracledb_exporter.conf) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

The following table lists the metrics that can be specified in PromQL expressions in the definition file. The value of each metric is obtained by executing the SQL statement shown in the table against Oracle Database. For details about a metric, contact Oracle and provide the SQL statement of its data source.

Metric name

Contents to be acquired

Label

Data source (SQL statement)

oracledb_sessions_value

Count of sessions

status: status

type: session type

SELECT status, type, COUNT(*) as value FROM v$session GROUP BY status, type

oracledb_resource_current_utilization

Resource usage#1

resource_name: resource_name

SELECT resource_name,current_utilization,CASE WHEN TRIM(limit_value) LIKE 'UNLIMITED' THEN '-1' ELSE TRIM(limit_value) END as limit_value FROM v$resource_limit

oracledb_resource_limit_value

Resource usage limit#1 (UNLIMITED: -1)

resource_name: resource_name

oracledb_asm_diskgroup_total

Bytes of total size of ASM disk group

name: disk group name

SELECT name,total_mb*1024*1024 as total,free_mb*1024*1024 as free FROM v$asm_diskgroup_stat where exists (select 1 from v$datafile where name like '+%')

oracledb_asm_diskgroup_free

Bytes of free space available on ASM disk group

name: disk group name

oracledb_activity_execute_count

Total number of calls (user calls and recursive calls) executing SQL statements (cumulative value)

none

SELECT name, value FROM v$sysstat WHERE name IN ('parse count (total)', 'execute count', 'user commits', 'user rollbacks', 'db block gets from cache', 'consistent gets from cache', 'physical reads cache')

oracledb_activity_parse_count_total

Total number of parse calls (hard, soft and describe) (cumulative value)

none

oracledb_activity_user_commits

Total number of user commits (cumulative value)

none

oracledb_activity_user_rollbacks

The number of times a user manually issued a ROLLBACK statement, or the total number of times an error occurred during a user's transaction (cumulative value)

none

oracledb_activity_physical_reads_cache

Total number of data blocks read from disk to the buffer cache (cumulative value)

none

oracledb_activity_consistent_gets_from_cache

Number of times a consistent read was requested for a block in the buffer cache (cumulative value)

none

oracledb_activity_db_block_gets_from_cache

Number of times a CURRENT block was requested from the buffer cache (cumulative value)

none

oracledb_process_count

Count of active Oracle Database processes

none

SELECT COUNT(*) as count FROM v$process

oracledb_wait_time_administrative

Time spent waiting in the Administrative wait class (in hundredths of a second)#2

none

SELECT

n.wait_class as WAIT_CLASS,

round(m.time_waited/m.INTSIZE_CSEC,3) as VALUE

FROM

v$waitclassmetric m, v$system_wait_class n

WHERE

m.wait_class_id=n.wait_class_id AND n.wait_class != 'Idle'

oracledb_wait_time_application

Time spent waiting in the Application wait class (in hundredths of a second)#2

none

oracledb_wait_time_commit

Time spent waiting in the Commit wait class (in hundredths of a second)#2

none

oracledb_wait_time_concurrency

Time spent waiting in the Concurrency wait class (in hundredths of a second)#2

none

oracledb_wait_time_configuration

Time spent waiting in the Configuration wait class (in hundredths of a second)#2

none

oracledb_wait_time_network

Time spent waiting in the Network wait class (in hundredths of a second)#2

none

oracledb_wait_time_other

Time spent waiting in the Other wait class (in hundredths of a second)#2

none

oracledb_wait_time_scheduler

Time spent waiting in the Scheduler wait class (in hundredths of a second)#2

none

oracledb_wait_time_system_io

Time spent waiting in the System I/O wait class (in hundredths of a second)#2

none

oracledb_wait_time_user_io

Time spent waiting in the User I/O wait class (in hundredths of a second)#2

none

oracledb_tablespace_bytes

Total bytes consumed by tablespaces

tablespace: name of the tablespace

type: tablespace contents

SELECT

dt.tablespace_name as tablespace,

dt.contents as type,

dt.block_size * dtum.used_space as bytes,

dt.block_size * dtum.tablespace_size as max_bytes,

dt.block_size * (dtum.tablespace_size - dtum.used_space) as free,

dtum.used_percent

FROM dba_tablespace_usage_metrics dtum, dba_tablespaces dt

WHERE dtum.tablespace_name = dt.tablespace_name

ORDER by tablespace

oracledb_tablespace_max_bytes

Maximum number of bytes in a tablespace

tablespace: name of the tablespace

type: tablespace contents

oracledb_tablespace_free

Number of free bytes in the tablespace

tablespace: name of the tablespace

type: tablespace contents

oracledb_tablespace_used_percent

Tablespace utilization

If auto extension is ON, it is calculated with auto extension taken into account.

tablespace: name of the tablespace

type: tablespace contents

oracledb_exporter_last_scrape_duration_seconds

The number of seconds taken by the last scrape

none

-

oracledb_exporter_last_scrape_error

Whether the last scrape resulted in an error

0: Success

1: Error

none

-

oracledb_exporter_scrapes_total

Total number of times Oracle Database was scraped for metrics

none

-

oracledb_up

Whether the Oracle Database Server is up

0: Not running

1: Running

none

-

#1

In a PDB, the source view v$resource_limit is empty, so the value cannot be retrieved.

#2

In a PDB, the source view v$waitclassmetric is empty, so the value cannot be retrieved.

Important
  • Prior to using OracleDB exporter, make sure that the SQL statements that serve as data sources can be executed, for example, with the SQL*Plus command, and that the required information is displayed. Perform this check with the same user that OracleDB exporter uses to connect to Oracle Database.

  • The OracleDB exporter provided by JP1/IM - Agent does not support collecting user-defined metrics (custom metrics).

■ Requirements for monitoring Oracle Database

When you monitor Oracle Database with OracleDB exporter, you must configure the following settings on Oracle Database.

You do not need to install Oracle Client or similar software on the JP1/IM - Agent host.

  • Oracle listener

    • Configure Oracle listener and servicename so that they can connect to the target.

    • Configure Oracle listener to accept unencrypted connection requests.

  • Oracle Database

    Set the Oracle Database database character set to one of the following:

    • AL32UTF8 (Unicode UTF-8)

    • JA16SJIS (Japanese-language SJIS)

    • ZHS16GBK (Simplified Chinese GBK)

  • Users used to access Oracle Database

    • Grant the permissions below to the users you want to use to connect to Oracle Database

      - Login permissions

      - SELECT permissions to the following tables

      dba_tablespace_usage_metrics

      dba_tablespaces

      v$system_wait_class

      v$asm_diskgroup_stat

      v$datafile

      v$sysstat

      v$process

      v$waitclassmetric

      v$session

      v$resource_limit

    • User used to connect to Oracle Database

      For details about the character types and maximum lengths that can be specified for user names, see Environment variables.

    • Password of the user used to connect to Oracle Database

      The following character types can be used for passwords:

      - Uppercase letters, lowercase letters, numbers, @, +, ', !, $, :, ., (, ), ~, -, _

      - The password can be from 1 to 30 bytes in length.
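The character and length constraints above can be expressed as a small validation sketch. The function name is illustrative and not part of the product:

```python
import re

# Allowed characters per the rules above: upper/lowercase letters, digits,
# and @ + ' ! $ : . ( ) ~ - _ ; the length must be 1 to 30 bytes.
ALLOWED = re.compile(r"^[A-Za-z0-9@+'!$:.()~_-]+$")

def is_valid_exporter_password(password: str) -> bool:
    """Return True if the password satisfies the documented constraints."""
    raw = password.encode("utf-8")
    if not (1 <= len(raw) <= 30):
        return False
    return ALLOWED.match(password) is not None
```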

■ Obfuscation of Oracle Database passwords

The OracleDB exporter shipped with JP1/IM - Agent uses the secret obfuscation function to manage the password used to access Oracle Database. For details, see 3.15.10 Secret obfuscation function.

■ Notes on Oracle Database log files

Monitoring Oracle Database with OracleDB exporter can generate a large number of log files. Therefore, the Oracle Database administrator should consider deleting log files periodically.

Directory where log files are generated

(including subdirectories)

Extensions of the log files that increase

$ORACLE_BASE/diag/rdbms

.trc, .trm

Below is a sample command line for deleting ".trc" or ".trm" files whose modification date is older than 14 days. If necessary, consider running such commands periodically to delete unnecessary logs.

OS

Command line example for deleting logs

Windows

forfiles /P "%ORACLE_BASE%\diag\rdbms" /M *.trm /S /C "cmd /C del /Q @path" /D -14

forfiles /P "%ORACLE_BASE%\diag\rdbms" /M *.trc /S /C "cmd /C del /Q @path" /D -14

Linux

find $ORACLE_BASE/diag/rdbms -name '*.tr[cm]' -mtime +14 -delete

Set the $ORACLE_BASE and %ORACLE_BASE% environment variables as needed.
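As a cross-platform alternative to the forfiles/find examples above, the same cleanup can be sketched in Python. The function name and the 14-day default are illustrative:

```python
import time
from pathlib import Path

def delete_old_trace_files(base_dir: str, days: int = 14) -> list:
    """Delete *.trc / *.trm files older than `days`; return deleted paths."""
    cutoff = time.time() - days * 24 * 60 * 60
    deleted = []
    for pattern in ("*.trc", "*.trm"):
        # rglob also descends into subdirectories, like the forfiles /S
        # and find examples above.
        for f in Path(base_dir).rglob(pattern):
            if f.is_file() and f.stat().st_mtime < cutoff:
                f.unlink()
                deleted.append(str(f))
    return deleted
```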

■ Environment variables

The following environment variables are required when using OracleDB exporter.

- Environment-variable "DATA_SOURCE_NAME" (mandatory)

Specify the connection destination of OracleDB exporter in the following format. There is no default value.

  • For Windows

oracle://user-name@host-name:port/service-name?connection timeout=10[&amp;instance name=instance-name]
  • For Linux

oracle://user-name@host-name:port/service-name?connection timeout=10[&instance name=instance-name]
user-name
  • Specifies the username to connect to Oracle listener. Up to 30 characters can be specified.

  • You can use uppercase letters, numbers, underscores, dollar signs, pound signs, periods, and at signs. Note that lowercase letters are not allowed.

  • For Linux, replace the pound sign (#) with "%%23" when you include the user name in the unit definition file. For example, for a CDB common user, specify "C##USER" as "C%%23%%23USER".

  • For Windows, replace the pound sign (#) with "%23" when you include the user name in the service definition file. For example, for a CDB common user, specify "C##USER" as "C%23%23USER".

host-name
  • Specifies the host name of Oracle Database host to monitor. Up to 253 characters can be specified.

  • You can use uppercase letters, lowercase letters, numbers, hyphens, and periods.

port
  • Specifies the port number for connecting to Oracle listener.

service-name
  • Specifies the service name of Oracle listener. Up to 64 characters can be specified.

  • You can use uppercase letters, lowercase letters, numbers, underscores, hyphens, and periods.

Option

You can specify the following options. If you specify more than one, connect them with &amp; in Windows and & in Linux.

  • connection timeout=number

    Specifies the connection timeout in seconds. This option must be specified.

    Be sure to specify 10. If you specify a value other than 10 or omit this option, the scrape by Prometheus server may time out, and the up metric may be 0 even if OracleDB exporter is running.

  • instance name=instance-name

    Specifies the instance to connect to. This option is optional.

(Example of specification)

  • When no instance name is specified

oracle://orauser@orahost:1521/orasrv?connection timeout=10
  • For Windows (with an instance name)

oracle://orauser@orahost:1521/orasrv?connection timeout=10&amp;instance name=orcl1
  • For Linux (with an instance name)

oracle://orauser@orahost:1521/orasrv?connection timeout=10&instance name=orcl1
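The DSN format and the #-escaping rules above can be sketched as follows. build_dsn and escape_for_unit_file are hypothetical helper names; the doubling of % for Linux unit definition files follows the rule stated above:

```python
from urllib.parse import quote

def build_dsn(user, host, port, service, instance=None):
    """Return a DATA_SOURCE_NAME string; '#' in the user name becomes %23."""
    dsn = (f"oracle://{quote(user, safe='')}@{host}:{port}/{service}"
           f"?connection timeout=10")
    if instance is not None:
        dsn += f"&instance name={instance}"
    return dsn

def escape_for_unit_file(dsn):
    """Double '%' for a Linux unit definition file (%23 -> %%23)."""
    return dsn.replace("%", "%%")
```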
- Environment variable DATA_SOURCE_NAME (mandatory)

Specify the full path of jp1ima directory under JP1/IM - Agent installation directory.

For a logical host, specify the full path of jp1ima directory under JP1/IM - Agent shared directory.

(Example of specification)

  • For Windows

C:\Program files\Hitachi\jp1ima
  • For Linux

/opt/jp1ima

■ Notes

  • If you stop the monitored Oracle Database instance or container before stopping OracleDB exporter, a NORMAL shutdown of Oracle may not complete. Stop OracleDB exporter in advance, or stop Oracle Database with an IMMEDIATE shutdown.

  • Stop OracleDB exporter before changing the configuration of, or performing maintenance on, the Oracle Database instance or container.

(l) Fluentd (Log metrics)

This capability can generate and measure log metrics from log files created by monitoring targets. For details on the function, see 3.15.2 Log metrics.

■ Key metric items

You define which values you need from the log files created by your monitoring targets in the log metrics definition file (fluentd_any-name_logmetrics.conf). These definitions allow you to obtain quantified data (log metrics) as metric items.

For details on the log metrics definition file, see Log metrics definition file (fluentd_any-name_logmetrics.conf) in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

■ Sample files

The following provides descriptions of sample files for when you use the log metrics feature. If you copy the sample files, be careful of the linefeed codes. For details, see the description of each file of 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference. These sample files are based on the assumptions in Assumptions of the sample files. Copy each file and change the settings according to your monitoring targets.

- Assumptions of the sample files

The sample files described here assume that HostA, a monitored host (integrated agent host), exists and JP1/IM - Agent is installed in it, and that WebAppA, an application running on HostA, creates the following log file.

- ControllerLog.log

As shown in target log message 1, a log message is created, saying that an HTTP endpoint in WebAppA is used, at the start of processing of the request for that endpoint. The log message also indicates the number of records handled upon request processing.

Target log message 1:

...
2022-10-19 10:00:00 [INFO] c.b.springbootlogging.LoggingController : endpoint "/register" started. Target record: 5.
...

In the sample files, a regular expression to match target log message 1 is used, and the number of the log messages that match the expression is counted. The number is then displayed in the Trends tab of the JP1/IM integrated operation viewer as log metric 1, Requests to the register Endpoint.

The definition for log metric 1 uses counter as its log metric type.

In addition, the regular expression used in the above also extracts the number indicated as Target record from target log message 1, and then the extracted numbers are summed up. The total is then displayed in the Trends tab of the JP1/IM integrated operation viewer as log metric 2, Number of Registered Records.

The definition for log metric 2 uses counter as its log metric type.
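What the two log metrics compute can be illustrated with a short Python sketch: count the lines matching target log message 1 (log metric 1) and sum the extracted Target record values (log metric 2). The regular expression mirrors the one in the sample log metrics definition file, slightly simplified; count_and_sum is a hypothetical helper:

```python
import re

# Mirrors the regexp parse expression in the sample definition file,
# with the record number simplified to \d+.
PATTERN = re.compile(
    r'^(?P<logtime>[^\[]*) \[(?P<loglevel>[^\]]*)\] (?P<cls>[^\[]*) : '
    r'endpoint "/register" started\. Target record: (?P<record_num>\d+)\.$'
)

def count_and_sum(lines):
    """Return (request count, total registered records) over the lines."""
    count, total = 0, 0
    for line in lines:
        m = PATTERN.match(line)
        if m:
            count += 1                          # log metric 1 (counter)
            total += int(m.group("record_num")) # log metric 2 (counter, key)
    return count, total
```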

As many Fluentd workers (multi-process workers feature) as there are log files to be monitored are required. For details on the worker settings related to the log metrics feature, see the log metrics definition file (fluentd_any-name_logmetrics.conf). Here, it is assumed that 11 Fluentd workers are running and that ControllerLog.log is monitored by the worker whose worker ID is 10.

These sample files also assume the tree structure consisting of the following IM management nodes:

All Systems
 + Host A
    + Application Server
       + WebAppA
- Target files in this example

The target files used in this example are as follows:

  • Integrated manager host

    - User-specific metric definition file

  • Integrated agent host

    - Prometheus configuration file

    - User-specific discovery configuration file

    - Log metrics definition file

    - Fluentd log monitoring target definition file

- Sample user-specific metric definition file

- File name: metrics_logmetrics1.conf

- Written code

[
  {
    "name":"logmetrics_request_endpoint_register",
    "default":true,
    "promql":"logmetrics_request_endpoint_register and $jp1im_TrendData_labels",
    "resource_en":{
      "category":"HTTP",
      "label":"request_num_of_endpoint_register",
      "description":"The request number of endpoint register",
      "unit":"request"
    },
    "resource_ja":{
      "category":"HTTP",
      "label":"Requests to the register Endpoint",
      "description":"The request number of endpoint register",
      "unit":"request"
    }
  },
  {
    "name":"logmetrics_num_of_registeredrecord",
    "default":true,
    "promql":"logmetrics_num_of_registeredrecord and $jp1im_TrendData_labels",
    "resource_en":{
      "category":"DB",
      "label":"logmetrics_num_of_registeredrecord",
      "description":"The number of registered record",
      "unit":"record"
    },
    "resource_ja":{
      "category":"DB",
      "label":"Number of Registered Records",
      "description":"The number of registered record",
      "unit":"record"
    }
  }
]
Note

The storage directory, written code, and file name follow the format of the user-specific metric definition file (metrics_any-Prometheus-trend-name.conf).

- Sample Prometheus configuration file

- File name: jpc_prometheus_server.yml

- Written code

global:
  ...
(omitted)
  ...
scrape_configs:
  - job_name: 'LogMetrics'
    
    file_sd_configs:
      - files:
        - 'user/user_file_sd_config_logmetrics.yml'
    
    relabel_configs:
      - target_label: jp1_pc_nodelabel
        replacement: Log trapper(Fluentd)
    
    metric_relabel_configs:
      - target_label: jp1_pc_nodelabel
        replacement: ControllerLog
      - source_labels: ['__name__']
        regex: 'logmetrics_request_endpoint_register|logmetrics_num_of_registeredrecord'
        action: 'keep'
      - regex: (jp1_pc_multiple_node|jp1_pc_agent_create_flag)
        action: labeldrop
 
  ...
(omitted)
  ...
Note

The storage directory and written code follow the format of the Prometheus configuration file (jpc_prometheus_server.yml). You do not have to create a new file. Instead, you add the scrape_configs section for the log metrics feature to the Prometheus configuration file (jpc_prometheus_server.yml) created during installation.
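The two metric_relabel_configs actions used in the sample above can be modeled with a simplified sketch: 'keep' drops any series whose source label does not fully match the regex (Prometheus anchors relabeling regexes), and 'labeldrop' removes matching label names. The helper names are illustrative:

```python
import re

def apply_keep(series, source_label, regex):
    """Keep only series whose source label fully matches the regex."""
    pat = re.compile(regex)
    return [s for s in series if pat.fullmatch(s.get(source_label, ""))]

def apply_labeldrop(series, regex):
    """Remove labels whose names fully match the regex."""
    pat = re.compile(regex)
    return [{k: v for k, v in s.items() if not pat.fullmatch(k)}
            for s in series]
```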

- Sample user-specific discovery configuration file

- File name: user_file_sd_config_logmetrics.yml

- Written code

- targets:
  - HostA:24830
  labels:
    jp1_pc_exporter: logmetrics
    jp1_pc_category: WebAppA
    jp1_pc_trendname: logmetrics1
    jp1_pc_multiple_node: "{__name__=~'logmetrics_.*'}"
    jp1_pc_agent_create_flag: false
Note

The storage directory and written code follow the format of the user-specific discovery configuration file (file_sd_config_any-name.yml).

ControllerLog.log is monitored by the worker whose Fluentd worker ID is 10. Thus, when 24820 is set for port in the Sample log metrics definition file, the port number of the worker monitoring ControllerLog.log is 24820 + 10 = 24830.

- Sample log metrics definition file

- File name: fluentd_WebAppA_logmetrics.conf

- Written code

## Input
<worker 10>
  <source>
    @type prometheus
    bind '0.0.0.0'
    port 24820
    metrics_path /metrics
  </source>
</worker>
## Extract target log message 1
<worker 10>
  <source>
    @type tail
    @id logmetrics_counter
    path /usr/lib/WebAppA/ControllerLog/ControllerLog.log
    tag WebAppA.ControllerLog
    pos_file ../data/fluentd/tail/ControllerLog.pos
    read_from_head true
    <parse>
      @type regexp
      expression /^(?<logtime>[^\[]*) \[(?<loglevel>[^\]]*)\] (?<class>[^\[]*) : endpoint "\/register" started. Target record: (?<record_num>\d[^\[]*).$/
      time_key logtime
      time_format %Y-%m-%d %H:%M:%S
      types record_num:integer
    </parse>
  </source>
 
## Output
## Define log metrics 1 and 2
  <match WebAppA.ControllerLog>
    @type prometheus
    <metric>
      name logmetrics_request_endpoint_register
      type counter
      desc The request number of endpoint register
    </metric>
    <metric>
      name logmetrics_num_of_registeredrecord
      type counter
      desc The number of registered record
      key record_num
      <labels>
      loggroup ${tag_parts[0]}
      log ${tag_parts[1]}
      </labels>
    </metric>
  </match>
</worker>
Note

The storage directory and written code follow the format of the log metrics definition file (fluentd_any-name_logmetrics.conf).

- Sample Fluentd log monitoring target definition file

- File name: jpc_fluentd_common_list.conf

- Written code

## [Target Settings]
  ...
(omitted)
  ...
@include user/fluentd_WebAppA_logmetrics.conf
Note

The storage directory and written code follow the format of the Fluentd log monitoring target definition file (jpc_fluentd_common_list.conf) in JP1/IM - Agent definition files. You do not have to create a new file. Instead, you add the include section for the log metrics feature to the Fluentd log monitoring target definition file (jpc_fluentd_common_list.conf) created during installation.

(m) Support for same-host and different-host configurations of Prometheus and Exporter

The following table shows whether Prometheus and Exporter are supported in a same-host configuration and in a different-host configuration.

Table 3‒17:  Support for Prometheus and Exporter host configurations

Exporter type

Configuring Prometheus and Exporter hosts

Same host

Another host

Exporter provided by JP1/IM - Agent

Node exporter for AIX

N

Y

Exporter other than the above

Y

N

User-defined Exporter

Y

Y

Legend

Y: Supported

N: Not supported

The following configurations are not supported:

  • Configuring scrape from more than one Prometheus to the same Exporter

  • Exporter# on a remote agent (where the host running Exporter and the monitored host are separate hosts)

#

An Exporter of a remote agent is an Exporter whose discovery configuration file contains the description "jp1_pc_remote_monitor_instance".

Also, if Prometheus and Exporter are configured on different hosts, it is assumed that the ports used by Exporter are protected by firewalls, network configuration, and so on, so that they are accessed only by the Prometheus server of JP1/IM - Agent (for example, by placing the integrated agent host and the Exporter hosts in the same network so that they cannot be accessed externally).

(2) Centralized management of performance data

This function allows Prometheus server to store performance data collected from monitoring targets in the intelligent integrated management database of JP1/IM - Manager. It has the following features:

(a) Remote write function

This is a function in which the Prometheus server sends performance data collected from monitoring targets to an external database suitable for long-term storage. JP1/IM - Agent uses this function to send performance data to JP1/IM - Manager.

The following shows how to define remote write.

  • Remote write definitions are described in the Prometheus server configuration file (jpc_prometheus_server.yml).

  • Download the Prometheus server configuration file from the integrated operation viewer, edit it in a text editor to modify the remote write definition, and then upload it.

The following settings are supported by JP1/IM - Agent for defining Remote Write. For details about the settings, see Prometheus configuration file (jpc_prometheus_server.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

Table 3‒18: Settings for remote write definition supported by JP1/IM - Agent

Setting items

Description

Remote write destination

(required)

Set the endpoint URL for JP1/IM agent control base.

Remote write timeout period

(Optional)

You can set the timeout period if remote write takes a long time.

Change it if you are not satisfied with the default value.

Relabeling

(Optional)

You can remove unwanted metrics and customize labels.

(3) Performance data monitoring notification function

This function allows Prometheus server to monitor performance data collected from monitoring targets against threshold values and to notify JP1/IM - Manager. It consists of three functions:

If you add a monitored service in an environment where an alert definition for service monitoring is set, the added service is also monitored. If you exclude from monitoring a service for which an alert has fired, you will receive a notification that the fired alert has been resolved.

For an example of defining an alert, see Metric alert definition example in Node exporter metric definition file and Metric alert definition example in Windows exporter metric definition file in Alert configuration file (jpc_alerting_rules.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference. For Linux, alerts are defined differently depending on whether automatic startup of the monitored service is enabled (by running systemctl enable). If you want to monitor a service for which automatic startup is disabled, you must create and configure an alert definition for each target.

(a) Alert evaluation function

This function monitors performance data collected from monitoring targets against threshold values.

Alert rules are defined to evaluate alerts: performance data is monitored against thresholds, and alerts are notified.

Alerts can be evaluated by comparing time series data directly with thresholds, or by comparing the results of PromQL# expressions with thresholds.

#

For details about PromQL, see 2.7.4(4) About PromQL.

For each time series, or for each data point produced by evaluating the PromQL expression, an alert status is managed according to the evaluation, and notification actions are executed according to the alert state.

There are three alert states: pending, firing, and resolved. When the alert rule condition is first met, the alert enters the "pending" state. If the condition continues to be met (is not resolved) for the duration of the "for" clause defined in the alert rule, the alert enters the "firing" state.

When the condition is no longer met (resolved), or when the time series disappears, the alert enters the "resolved" state.

The relationship between alert status and notification behavior is as below.

Alert status

Description

Notification behavior

pending

The pending state. The threshold is exceeded, but the time specified in the "for" clause of the alert rule definition has not yet elapsed.

No alert is notified.

firing

The firing state. The threshold is exceeded and the time specified in the "for" clause of the alert rule definition has elapsed, or the threshold is exceeded and no "for" clause is specified for the alert.

An alert is notified.

resolved

The resolved state. The alert rule condition is no longer met.

  • When the alert recovers from the "firing" state, a resolved notification is given.

  • When the alert recovers from the "pending" state, no resolved notification is given.
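The state transitions in the table above can be sketched as a single evaluation step. The "inactive" name for the no-alert state and the function shape are assumptions of this sketch, not product terms:

```python
def next_state(state, condition_met, pending_elapsed, for_duration):
    """Return the next alert state for one evaluation step."""
    if not condition_met:
        # Recovery from "firing" is notified as resolved; recovery from
        # "pending" produces no notification (see the table above).
        return "resolved" if state == "firing" else "inactive"
    if state == "firing":
        return "firing"
    if pending_elapsed >= for_duration:
        return "firing"
    return "pending"
```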

The following shows how to define an alert rule.

  • Alert rule definitions are described in the alert configuration file (jpc_alerting_rules.yml) (definitions in any YAML format can also be described).

  • Before applying the created definition file to the environment, check its format and test the alert rules with the promtool command.

  • Download alert configuration file from integrated operation viewer, edit it in a text editor, change the definition of the alert rule, and then upload it.

The following settings apply to the alert rule definitions supported by JP1/IM - Agent. For details about the settings, see Alert configuration file (jpc_alerting_rules.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference. There is no default alert rule definition.

Table 3‒19: Settings for alert rule definitions supported by JP1/IM - Agent

Setting Item

Description

Alert Name (required)

Set the alert name.

Conditional expression (required)

Set the alert condition expression (threshold).

It can be configured using PromQL.

Waiting time (required)

Set the amount of time to wait after entering the "pending" state before changing to the "firing" state.

Change it if you are not satisfied with the default value.

Label (required)

Set labels to add to alerts and recovery notifications.

In JP1/IM - Agent, a specific label must be set.

Annotation (required)

Set to store additional information such as alert description and URL link.

In JP1/IM - Agent, certain annotations must be set.

Labels and annotations can use the following variables:

Variable#

Description

$labels

A variable that holds the label key-value pairs of the alert instance. The label key can be one of the following:

  • When time series data is specified in the conditional expression for alert evaluation

    You can specify a label that the data retains.

  • When a PromQL expression is specified as the conditional expression for alert evaluation

    You can specify a label that is set in the result of the PromQL expression.

    The labels that the data retains depend on the metric.

    For details about the labels, see the descriptions of the metrics that can be specified in PromQL statements, in 3.15.1(1) Performance data collection function.

$value

A variable that holds the evaluation value of the alert instance.

When a firing is notified, it is expanded to the value at the time the firing was detected.

When a resolved notification is made, it is expanded to the value as of the firing immediately before the resolution (not the value as of the resolution).

$externalLabels

This variable holds the label and value set in "external_labels" of item "global" in the Prometheus configuration file (jpc_prometheus_server.yml).

#

Variables are expanded by enclosing them in "{{" and "}}". The following is an example of how to use variables:

description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
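A minimal sketch of how such templates expand, supporting the $labels.<key> and $value forms used in the example above (the expand helper is illustrative, not the actual Go templating used by Prometheus):

```python
import re

def expand(template, labels, value):
    """Expand {{ $labels.<key> }} and {{ $value }} in a template string."""
    def repl(m):
        expr = m.group(1).strip()
        if expr == "$value":
            return str(value)
        if expr.startswith("$labels."):
            return str(labels.get(expr[len("$labels."):], ""))
        return m.group(0)  # leave unknown expressions untouched
    return re.sub(r"\{\{(.*?)\}\}", repl, template)
```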

■ Alert rule definition for converting to JP1 events

In order to convert the alert to be notified into a JP1 event on the JP1/IM - Manager side, the following information must be set in the alert rule definition.

Setting item

Value to set

Uses

name

Set an alert group definition name that is unique within the integrated agent.

Alert group definition name

alert

Set an alert definition name that is unique within the integrated agent.

Alert Definition Name

expr

Set the PromQL statement.

It is recommended to set the PromQL statement described in the metric definition file. This way, when the JP1 event occurs, you can display trend information in the Integrated Operation Viewer.

Firing conditions#

#

If the condition is met, the alert is firing; if the condition is not met, the alert is resolved.

labels.jp1_pc_product_name

Set "/HITACHI/JP1/JPCCS" as fixed.

Set to the product name of the JP1 event.

labels.jp1_pc_severity

Set one of the following:

  • Emergency

  • Alert

  • Critical

  • Error

  • Warning

  • Notice

  • Information

  • Debug

Set to JP1 event severity#.

#

This value is set as the severity of the JP1 event that reports the abnormal state. The severity of the JP1 event that reports recovery is set to Information.

labels.jp1_pc_eventid

Set any value in the range 0 to 1FFF or 7FFF8000 to 7FFFFFFF.

Set to the event ID of the JP1 event.

labels.jp1_pc_metricname

Set the metric name.

For Yet another cloudwatch exporter, be sure to specify it. The JP1 event is associated with the IM management node in the AWS namespace corresponding to the metric name (or to the first metric name if multiple metric names are specified, separated by commas).

Set to the metric name of the JP1 event.

For Yet another cloudwatch exporter, it is also used to associate JP1 events.

annotations.jp1_pc_firing_description

Specify the value to be set for the message of the JP1 event when the firing condition of the alert is satisfied.

If the length of the value is 1,024 bytes or more, the string from the beginning to the 1,023rd byte is set.

If the specification is omitted, the message content of the JP1 event is "The alert is firing. (alert = alert name)".

You can also specify variables to embed job names and evaluation values. If a variable is used, the first 1,024 bytes of the expanded message are valid.

It is set to the message of the JP1 event.

annotations.jp1_pc_resolved_description

Specify the value to be set for the message of the JP1 event when the firing condition of the alert is not satisfied.

If the length of the value is 1,024 bytes or more, the string from the beginning to the 1,023rd byte is set.

If the specification is omitted, the content of the message in the JP1 event is "The alert is resolved. (alert = alert name)".

You can also specify variables to embed job names and evaluation values. If a variable is used, the first 1,024 bytes of the expanded message are valid.

It is set to the message of the JP1 event.
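The 1,023-byte truncation rule above can be sketched as follows; whether the product avoids splitting a multibyte character at the cut point is not stated, so that behavior is an assumption of this sketch:

```python
def truncate_message(message: str, limit: int = 1023) -> str:
    """Cut a message to at most `limit` bytes of its UTF-8 encoding."""
    raw = message.encode("utf-8")
    if len(raw) <= limit:
        return message
    # Drop any partial multibyte sequence left at the cut point.
    return raw[:limit].decode("utf-8", errors="ignore")
```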

For an example of setting an alert definition, see Definition example in alert configuration file (jpc_alerting_rules.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

For details about the properties of the corresponding JP1 event, see 3.2.3 Lists of JP1 events output by JP1/IM - Agent in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

■ How to operate in combination with trending-related functions

By combining the PromQL statement described in the metric definition file with the PromQL statement evaluated by the alert evaluation function, and by describing the metric name of the corresponding trend data in annotations.jp1_pc_firing_description and annotations.jp1_pc_resolved_description of the alert definition in the alert configuration file, you can check the past changes and the current value of the performance value evaluated by the alert on the Trends tab of the integrated operation viewer when the JP1 event for the alert is issued.

For details about the PromQL statements defined for the trend-display related functions, see 3.15.6(4) Return of trend data.

For example, to have the Node exporter monitor CPU usage and notify you when the CPU usage exceeds 80%, create an alert configuration file (alert definition) and a metric definition file as shown in the following examples.

  • Example of description of alert configuration file (alert definition)

    groups:
      - name: node_exporter
        rules:
        - alert: cpu_used_rate(Node exporter)
          expr: 80 < (avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="system"}[2m])) + avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode="user"}[2m]))) * 100
          for: 3m
          labels:
            jp1_pc_product_name: "/HITACHI/JP1/JPCCS"
            jp1_pc_severity: "Error"
            jp1_pc_eventid: "0301"
            jp1_pc_metricname: "node_cpu_seconds_total"
          annotations:
            jp1_pc_firing_description: "CPU utilization exceeded threshold (80%).value={{ $value }}%"
            jp1_pc_resolved_description: "CPU usage has dropped below the threshold (80%)."
  • Example of description of metric definition file

    [
      {
        "name":"cpu_used_rate",
        "default":true,
        "promql":"(avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode=\"system\"}[2m]) and $jp1im_TrendData_labels) + avg by (instance,job,jp1_pc_nodelabel,jp1_pc_exporter) (rate(node_cpu_seconds_total{mode=\"user\"}[2m]) and $jp1im_TrendData_labels)) * 100",
        "resource_en":{
          "category":"platform_unix",
          "label":"CPU used rate",
        "description":"CPU usage. It also indicates the average value per processor. [Units: %]",
          "unit":"%"
        },
        "resource_ja":{
          "category":"platform_unix",
          "label":"CPU Usage",
          "description":"CPU Utilization (%). It is also an average percentage of each processor.",
          "unit":"%"
        }
      }
    ]

    When the conditions of the PromQL statement specified in expr of the alert definition are satisfied and the JP1 event of the alert is issued, the message "CPU usage exceeded threshold (80%). value = performance value%" is set in the JP1 event. By referring to this message and displaying the "CPU Usage" trend information, users can see the past changes and the current value of CPU usage.

■ Behavior when the service is stopped

If the Alertmanager service is stopped, the JP1 event for the alert is not issued. In addition, if the Prometheus server and Alertmanager services are running and an exporter whose alert is firing stops because of a failure, the alert becomes resolved and a normal JP1 event is issued.

When an alert is firing and the Prometheus server service is stopped while the Alertmanager is running, a normal JP1 event notifying that the alert is resolved may be issued.

For details, see About behavior when the Prometheus server is restarted or stopped while the Alertmanager is running.

■ About behavior when the service is restarted

Even if an alert is firing or resolved and the Prometheus server, Alertmanager, or Exporter service is restarted, no JP1 event is issued when the current alert status is the same as the status before the restart.

When an alert is firing and the Prometheus server service is restarted while the Alertmanager is running, a normal JP1 event notifying that the alert is resolved may be issued.

For details, see About behavior when the Prometheus server is restarted or stopped while the Alertmanager is running.

■ Considerations for spikes in performance data

Performance data can momentarily jump to unexpected values (abnormally large, small, or negative values). Such sudden changes in performance data are commonly called "spikes." In many cases, even if a spike momentarily produces an abnormal value, the data immediately returns to normal and does not need to be treated as an abnormality. A spike can also occur instantaneously when performance data is reset, for example, when the OS is restarted.

When monitoring metrics whose performance data behaves this way, consider suppressing detection of such momentary anomalies by specifying "for" (the grace period before an alert is treated as an anomaly) in the alert rule definition.
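As a sketch, a "for" clause requires the firing condition to hold continuously for the stated period before the alert actually fires, so a one-scrape spike that recovers immediately raises no JP1 event. The metric name and threshold below are illustrative:

```yaml
groups:
  - name: spike_suppression_example
    rules:
      - alert: disk_busy_rate_example
        expr: 90 < example_disk_busy_percent
        # The condition must stay true for 3 consecutive minutes before the
        # alert fires; a momentary spike (including one caused by a counter
        # reset at OS restart) is ignored.
        for: 3m
```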

■ About behavior when the Prometheus server is restarted or stopped while the Alertmanager is running

When an alert is firing and the Prometheus server service is restarted or stopped while the Alertmanager is running, a normal JP1 event notifying that the alert is resolved may be issued.

A normal JP1 event is issued when the following condition is met:

  • The sum of the duration of the "for" clause# defined in the alert definition of the firing alert and the time during which the Prometheus server service is not running (because it is stopped or reloading) is greater than the value of "evaluation_interval" defined in the Prometheus configuration file.

#: If the "for" clause of the alert is not specified, the duration is treated as 0.

■ About behavior when the service is reloaded

Even if an alert is firing or resolved, executing the API that reloads the Prometheus server, Alertmanager, or Exporter service does not cause a JP1 event to be issued.

(b) Alert forwarder

This function notifies you when the alert status becomes "firing" or "resolved" after the Prometheus server evaluates the alert.

If the state of an alert changes while JP1/IM - Manager (Intelligent Integrated Management Base) is stopped, the firing or resolved notification might not be performed.

The Prometheus server sends alerts one by one, and each sent alert is forwarded to JP1/IM - Manager (Intelligent Integrated Management Base) via the Alertmanager. Retried alerts are also sent one by one.

Alerts are basically sent to JP1/IM - Manager in the order in which they occurred, but the order can change when multiple alert rules meet their conditions at the same time, or when a transmission error causes alerts to be resent. However, because the alert information includes the time of occurrence, you can determine the order in which the alerts occurred.

In addition, if the abnormal condition continues for 7 days, an alert will be re-notified.

The following shows how to define the notification destination of the alert.

  • Alert destinations are described in both the Prometheus configuration file (jpc_prometheus_server.yml) and the Alertmanager configuration file (jpc_alertmanager.yml).

    In the Prometheus configuration file, specify the coexisting Alertmanager as the notification destination of the Prometheus server. In the Alertmanager configuration file, specify the JP1/IM agent control base as the notification destination of the Alertmanager.

  • Download the individual configuration files from the integrated operation viewer, edit them in a text editor to change the alert notification destination definitions, and then upload them.

The following table lists the settings for defining Prometheus server notification destinations supported by JP1/IM - Agent. For details about the settings, see Prometheus configuration file (jpc_prometheus_server.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

Table 3‒20: Settings for defining notification destinations for Prometheus server supported by JP1/IM - Agent

Setting items

Description

Notification destination (required)

Configure the notification destination Alertmanager.

If a host name or IP address is specified for --web.listen-address in the Alertmanager command line options, change localhost to the host name or IP address specified in --web.listen-address.

  • For physical host environments

    Specify the Alertmanager that coexists on the same host.

  • For clustered environments

    Specify the Alertmanager that runs on the logical host.

Label setting (optional)

You can add labels. Configure as needed.
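The notification-destination setting above corresponds to the alerting section of the Prometheus configuration file (jpc_prometheus_server.yml). The following minimal sketch assumes the Alertmanager listens on localhost; the port number is a placeholder, so use the port your coexisting Alertmanager actually listens on (or the logical host name in a cluster):

```yaml
# Sketch of the notification-destination definition. The port number is a
# placeholder; replace localhost if --web.listen-address specifies a host
# name or IP address.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
```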

The following table lists the settings for defining Alertmanager notification destinations supported by JP1/IM - Agent. For details about the settings, see Alertmanager configuration file (jpc_alertmanager.yml) in Chapter 2. Definition Files in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

Table 3‒21:  Settings for defining Alertmanager notification destinations supported by JP1/IM - Agent

Setting items

Description

Webhook settings (required)

Set the endpoint URL for JP1/IM agent control base.
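The webhook setting corresponds to a webhook_configs receiver in the Alertmanager configuration file (jpc_alertmanager.yml). The receiver name and endpoint URL below are placeholders for the JP1/IM agent control base URL, which should be taken from the shipped configuration file:

```yaml
# Sketch of the webhook definition. The receiver name and URL are placeholders,
# not the actual JP1/IM agent control base endpoint.
route:
  receiver: jp1_im_agent_control_base
receivers:
  - name: jp1_im_agent_control_base
    webhook_configs:
      - url: http://localhost:20000/placeholder-endpoint
```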

(c) Notification suppression function

This function suppresses the notifications described in 3.15.1(3)(b) Alert forwarder. It includes the following:

  • Silence function

    Use this function when you temporarily do not want to be notified of specific alerts.

■ Silence function

This function temporarily suppresses specific notifications. You can configure it so that alerts that occur during temporary maintenance are not notified. Unlike the common exclusion conditions of JP1/IM - Manager, the notification suppression function does not send the notification to JP1/IM - Manager in the first place.

While silence is enabled, you are not notified when the alert status changes. When silence is disabled, notification is given if the alert status has changed compared with its status before silence was enabled.

The following are two examples that show whether notification is performed:

Figure 3‒34: Cases where the state is different before and after disabling silence

[Figure]

The above figure shows an example in which the alert status is "abnormal" when silence is enabled, the status changes to "normal" while silence is enabled, and then silence is disabled.

When the alert changes to "normal", no notification is given because silence is enabled. When silence is disabled, a "normal" notification is given because the alert status has changed from the "abnormal" status it had before silence was enabled.

Figure 3‒35: Cases where the state is the same before and after disabling silence

[Figure]

The above figure shows an example in which, while silence is enabled, the alert status changes to "normal" once and then back to "abnormal", and then silence is disabled.

When silence is disabled, no notification is performed because the alert status is the same "abnormal" as before silence was enabled.

If an alert transmission has failed and is being retried, and silence is enabled to suppress that alert, the transmission is not retried.

- How to Configure silence

Silence settings (enabling or disabling silence) and retrieval of the current silence settings are performed via the REST API (a GUI is not supported).

In addition, to configure silence settings, the machine from which you operate must be able to communicate with the Alertmanager port number on the integrated agent host.

For details about silence settings and REST API used to obtain current silence settings, see 5.21.3 Get silence list of Alertmanager, 5.21.4 Silence creation of Alertmanager, and 5.21.5 Silence Revocation of Alertmanager in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.
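For reference, the Alertmanager silence-creation API accepts a JSON request body along the following lines. The matcher, times, and user are examples; see the sections referenced above for the exact request format supported by JP1/IM - Agent.

```json
{
  "matchers": [
    { "name": "alertname", "value": "cpu_used_rate(Node exporter)", "isRegex": false }
  ],
  "startsAt": "2025-01-01T00:00:00Z",
  "endsAt": "2025-01-01T02:00:00Z",
  "createdBy": "jp1admin",
  "comment": "Suppress notifications during temporary maintenance"
}
```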

(4) Communication function

(a) Communication protocols and authentication methods

The following shows the communication protocols and authentication methods used by integrated agent.

Connection source

Connect to

Protocol

Authentication method

Prometheus server

JP1/IM agent control base

HTTP

No authentication

Alertmanager

Prometheus server

Alertmanager

HTTP

No authentication

Exporter

Blackbox exporter

Monitoring target

HTTP/HTTPS

Basic Authentication

Basic Authentication

No authentication

HTTPS

Server Authentication

With client authentication

No client authentication

ICMP#

No authentication

Yet another cloudwatch exporter

Amazon CloudWatch

HTTPS

AWS IAM Authentication

Promitor Scraper

Azure Monitor

HTTPS

No client authentication

Promitor Resource Discovery

Azure Resource Graph

HTTPS

No client authentication

Promitor Scraper

Promitor Resource Discovery

HTTP

No authentication

Prometheus

Fluentd

HTTP

No authentication

OracleDB exporter

Oracle listener

Oracle listener-specific (no encryption)

Authentication by username/password

#

ICMPv6 is not available.

(b) Network configuration

Integrated agent can be used in an IPv4-only network configuration or in a network configuration in which IPv4 and IPv6 environments are mixed. In a mixed IPv4/IPv6 configuration, only IPv4 communication is supported.

You can use integrated agent in the following configurations, with or without a proxy server:

Connection source

Connect to

Connection type

Prometheus server

JP1/IM agent control base

No proxy server

Alertmanager

Prometheus server

Alertmanager

Exporter

Blackbox exporter

Monitoring targets (ICMP monitoring)

Monitoring targets (HTTP monitoring)

  • No proxy server

  • Through a proxy server without authentication

  • Through a proxy server with authentication

Yet another cloudwatch exporter

Amazon CloudWatch

  • No proxy server

  • Through a proxy server without authentication

  • Through a proxy server with authentication

Promitor Scraper

Azure Monitor

  • No proxy server

  • Through a proxy server without authentication

  • Through a proxy server with authentication

Promitor Resource Discovery

Azure Resource Graph

OracleDB exporter

Oracle listener

No proxy server

Integrated agent transmits the following:

Connection source

Connect to

Transmitted data

Authentication method

Prometheus server

JP1/IM agent control base

Performance data in Protobuf format

Alertmanager

Alert information in JSON format#1

Prometheus server

Exporter

Performance data in Prometheus text format#2

Blackbox exporter

Monitoring target

Response for each protocol

Yet another cloudwatch exporter

Amazon CloudWatch

CloudWatch data

Promitor Scraper

Azure Monitor

Azure Monitor data (metrics information)

  • Service principal

  • Managed ID

Promitor Resource Discovery

Azure Resource Graph

Azure Resource Graph data (resources exploration results)

OracleDB exporter

Oracle listener

Proprietary Oracle listener data

#1

For details, see the description of the message body for the request in 5.6.5 JP1 Event converter in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.

#2

For details, see the description of Prometheus text formatting in 5.23 API for scrape of Exporter used by JP1/IM - Agent in the JP1/Integrated Management 3 - Manager Command, Definition File and API Reference.