Job Management Partner 1/Performance Management - Agent Option for Platform Description, User's Guide and Reference

1.3.2 How to monitor performance

The following explains how to monitor performance for each system resource, and provides examples of performance data collection.

Organization of this subsection
(1) Processor
(2) Memory
(3) Disks
(4) Network
(5) Processes
(6) Example of collecting information about used ports
(7) Example of collecting performance data from multiple hosts on which PFM products are not installed

(1) Processor

This subsection explains how to monitor processor performance.

(a) Overview

By monitoring processor performance, you can check the performance trends for the entire system. In UNIX, as illustrated in the following figure, the processor operations include processor operations for the kernel and for user processes.

Figure 1-1 Relationship between the kernel and processes

[Figure]

The processor usage status can be checked by monitoring the CPU usage, which is the typical way, or by monitoring the number of queued jobs.

Jobs, such as processes, are executed by the CPU according to the schedule determined by the OS. The number of queued jobs is the number of jobs that are waiting to be executed by the CPU. When overall system load is high, the number of queued jobs tends to increase.
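
As a rough illustration of this idea outside of PFM - Agent for Platform, the following sketch evaluates the 1-, 5-, and 15-minute load averages that Linux reports in /proc/loadavg. It assumes that these load averages approximate the number of queued jobs. The function name run_queue_status and the threshold of one queued job per processor are hypothetical and are not taken from the product.

```shell
#!/bin/sh
# run_queue_status LOADAVG_LINE NCPU   (hypothetical helper, not a PFM command)
#   LOADAVG_LINE: the content of /proc/loadavg, e.g. "0.52 0.58 0.59 1/389 12345"
#   NCPU:         the number of processors
# Assumed rule: the queue is "congested" when both the 1-minute and the
# 5-minute averages are at or above one job per processor.
run_queue_status() {
    echo "$1" | awk -v ncpu="$2" '{
        # $1, $2, and $3 are the 1-, 5-, and 15-minute averages
        status = ($1 / ncpu >= 1 && $2 / ncpu >= 1) ? "congested" : "normal"
        printf "%s 1min=%s 5min=%s 15min=%s\n", status, $1, $2, $3
    }'
}
```

On a Linux host, the function could be called as run_queue_status "$(cat /proc/loadavg)" "$(getconf _NPROCESSORS_ONLN)".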

The monitoring template provides such functionality as CPU Usage alarms and CPU Status (Multi-Agent) reports.

To monitor processor performance in more detail than the monitoring template allows, you can also monitor the processor usage per processor, the processor usage per process, the processor queue count, and processor interrupts from hardware.

The following table lists and describes the principal records and fields related to processor monitoring.

Table 1-1 Principal records and fields related to processor monitoring

Record Field Description (example)
PI 1-Minute Run Queue Avg The average number of threads that were queued for execution. If the value of this field is large, there might be a problem with the efficiency of processor usage.
5-Minute Run Queue Avg
15-Minute Run Queue Avg
CPU % The CPU usage (the percentage of time that the CPU was being used). If the value of this field is large, the CPU load is high.
Idle % The percentage of time that the CPU was not being used. If the value of this field is large, the CPU load is low.
PI_CPUP CPU % The CPU usage for a processor.
System % The CPU usage for a processor executed in kernel mode.
If the value of this field is large, there might be a problem with the OS or operation.
User CPU % The CPU usage for a processor executed in user mode.
If the value of this field is large, there might be a problem with a specific application.
PD_PDI CPU % The CPU usage for a process.
If the value of this field is large, there might be a problem with the OS, operation, or a specific application.
System CPU The CPU usage for a processor executed in kernel mode.
If the value of this field is large, there might be a problem with the OS.
User CPU The CPU usage for a processor executed in user mode.
If the value of this field is large, there might be a problem with a specific application.

(b) Monitoring methods

• Monitoring kernel CPU usage

You can use the Kernel CPU alarm provided by the monitoring template to monitor system-wide kernel CPU usage.

For details, see 1.3.3(1)(a) Monitoring template.

• Monitoring user CPU usage

You can use the User CPU alarm provided by the monitoring template to monitor system-wide user CPU usage.

For details, see 1.3.3(1)(a) Monitoring template.

• Monitoring the CPU usage for each processor

By monitoring the CPU usage for each processor, you can check for problems with OS operation, such as an unbalanced CPU load. You can use the monitoring result of the CPU usage as a guideline for taking corrective measures.

Monitoring the kernel CPU usage, user CPU usage, and processor congestion is an effective way to monitor the CPU usage for each processor.

If the user CPU usage for a processor (the User CPU % field of the PI_CPUP record) is at or above the threshold, you might need to look for processes that excessively use the CPU and to take action.

If the kernel CPU usage for a processor (the System % field of the PI_CPUP record) is at or above the threshold, the system environment is inadequate for the processing. In this case, you might need to upgrade the processor or add processors.

For definition examples, see 1.3.3(1)(b) Definition examples other than for monitoring templates.
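
To make the distinction between user CPU usage and kernel CPU usage for a processor concrete, the following sketch derives both percentages from two samples of one cpuN line in the Linux /proc/stat file. This is only an approximation for illustration: the function name cpu_usage is hypothetical, and the grouping of counters (user plus nice as user mode, system plus irq plus softirq as kernel mode) is an assumption that might differ from how PFM - Agent for Platform calculates the User CPU % and System % fields.

```shell
#!/bin/sh
# cpu_usage PREV_LINE CUR_LINE   (hypothetical helper, not a PFM command)
# Each argument is one "cpuN ..." line from /proc/stat, sampled at two times.
# Columns after the label: user nice system idle iowait irq softirq ...
# Only these first seven counters are used.
cpu_usage() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 8; i++) prev[i] = $i; next }
        {
            total = 0
            for (i = 2; i <= 8; i++) { d[i] = $i - prev[i]; total += d[i] }
            if (total <= 0) { print $1, "user=0.0 system=0.0"; exit }
            # Assumption: user mode = user + nice, kernel mode = system + irq + softirq
            printf "%s user=%.1f system=%.1f\n", $1,
                100 * (d[2] + d[3]) / total, 100 * (d[4] + d[7] + d[8]) / total
        }'
}
```

For example, you could read the cpu0 line with grep '^cpu0 ' /proc/stat, wait for one collection interval, read the line again, and then pass both lines to cpu_usage.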

• Monitoring processor congestion

You can use the Run Queue alarm provided by the monitoring template to monitor processor congestion.

In addition to processor usage, you can monitor processor congestion (the number of queued requests) to monitor the processor load status. Monitoring both processor congestion and processor usage is an effective way to monitor the processor load status.

For details, see 1.3.3(1)(a) Monitoring template.

• Checking processes whose processor usage is high

By monitoring kernel CPU usage, user CPU usage, CPU usage for each processor, and processor congestion, you can determine if a bottleneck exists. If one does exist, you can use a real-time report (the CPU % field of the PD_PDI record) to find processes that are monopolizing the processor.

If no such processes exist, the system environment is inadequate for the processing. In this case, you might need to upgrade the processor or add processors.

For definition examples, see 1.3.3(1)(b) Definition examples other than for monitoring templates.
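
Outside of a real-time report, a similar check can be sketched with standard commands. The following example ranks processes by CPU usage. The function name top_cpu is hypothetical, and the input is assumed to be lines of the form "PID %CPU COMMAND" without a header line.

```shell
#!/bin/sh
# top_cpu N   (hypothetical helper, not a PFM command)
# Reads "PID %CPU COMMAND" lines (no header) from standard input and
# prints the N lines with the highest CPU usage, highest first.
top_cpu() {
    sort -k2,2nr | head -n "$1"
}
```

On a host where the ps command accepts the POSIX -o option, the input could be produced with ps -e -o pid= -o pcpu= -o comm= | top_cpu 5.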

(2) Memory

This subsection explains how to monitor memory performance.

(a) Overview

You can monitor memory performance to detect physical memory shortages and incorrect process operations.

Memory consists of physical memory and a swap file, as illustrated below. However, because the causes of bottlenecks are not limited to a small amount of physical memory or a small swap file, the paging status, page faults, and other items related to efficient memory usage must be monitored as well.

The following figure illustrates the configuration of the memory space.

Figure 1-2 Conceptual diagram of the memory space

[Figure]

Insufficient physical memory degrades overall system performance.

Memory areas not accessed by programs for a long time are saved to the swap file, and are loaded into physical memory on demand. Physical memory is efficiently used in this manner. Note, however, that swap file access is markedly slower than physical memory access. Therefore, frequent swapping or frequent page faults will considerably delay system processing.

Because paging often occurs even in normal processing, measure performance when the system is operating stably before attempting to determine proper thresholds.

The Available Memory alarm is provided by the monitoring template. If you want to perform more detailed monitoring, see the following table, which lists and describes the principal records and fields related to memory monitoring.

Table 1-2 Principal fields related to memory monitoring

Record Field Description (example)
PI Total Physical Mem Mbytes The amount of physical memory.
Free Mem Mbytes The amount of available physical memory.
Alloc Mem Mbytes The amount of physical memory in use.
Alloc Mem % Physical memory usage.
Total Swap Mbytes The amount of virtual memory.
Free Swap Mbytes The amount of available virtual memory.
Alloc Swap Mbytes The amount of virtual memory in use. If the value of this field continues to be at or above the threshold (the Total Physical Mem Mbytes field value of the PI record), a larger amount of memory might be required.
Alloc Swap % Virtual memory usage. If the value of this field continues to be at or above the threshold (determined by the system load status), the swap area might need to be expanded.
Page Scans/sec The number of page scans that occurred per second. If the value of this field continues to be at or above the threshold (150), memory might be a bottleneck.
Swapped-Out Pages/sec The number of swapped-out pages per second. If the value of this field continues to be at or above the threshold (200), memory might be a bottleneck.
Buffers Mem %# The percentage of physical memory allocated to the file buffer.
Buffers Mem Mbytes# The amount of physical memory allocated to the file buffer (MB).
Cache Mem %# The percentage of physical memory allocated as cache memory.
Cache Mem Mbytes# The amount of physical memory allocated as cache memory (MB).
Effective Free Mem %# The percentage of physical memory available to applications.
Effective Free Mem Mbytes# The amount of physical memory available to applications (MB).

#
This field can be used only when PFM - Agent for Platform 08-11 or later is installed in Linux.

The cause of a system memory shortage is not always physical memory itself. A problem with a program can also cause a shortage. By monitoring memory usage for each process, you can identify the cause of a shortage. If there is a process improperly occupying memory or if the amount of memory used by a process continues to increase steadily, the program running the process is likely to be defective.

The following table lists and describes the principal records and fields related to monitoring the memory usage of a specific process.

Table 1-3 Principal fields related to the memory monitoring for each process

Record Field Description (example)
PD_PDI Real Mem Kbytes The amount of physical memory used by a process. If the value of this field is large, a specific process might be using a large amount of physical memory.
Virtual Mem Kbytes The amount of virtual memory used by a process. If the value of this field is large, a specific process might be using a large amount of virtual memory.
Swaps The number of times swapping occurred for a process. If the value of this field is large, you might need to look for a process that is causing a bottleneck due to frequent swapping.

(b) Monitoring methods

• Monitoring the memory usage status

You can use the usage status of virtual memory as a guideline for determining whether to increase physical memory.

Even when memory usage is temporarily high, if the high load status does not persist, performance degradation might be permissible. Therefore, monitoring the number of page scans and the number of swapped-out pages in addition to memory usage is recommended.

If the amount of used virtual memory (the Alloc Swap Mbytes field of the PI record) is larger than the total amount of physical memory (the Total Physical Mem Mbytes field of the PI record), more memory might be required.

For definition examples, see 1.3.3(2)(b) Definition example other than for monitoring templates.
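
The comparison described above can be sketched as follows for a Linux host. This example assumes that the amount of virtual memory in use corresponds to SwapTotal minus SwapFree in /proc/meminfo, and that the total amount of physical memory corresponds to MemTotal. The function name swap_vs_physical is hypothetical.

```shell
#!/bin/sh
# swap_vs_physical   (hypothetical helper, not a PFM command)
# Reads /proc/meminfo-style text from standard input (values in kB) and
# compares the swap space in use with the total physical memory.
swap_vs_physical() {
    awk '
        $1 == "MemTotal:"  { mem = $2 }
        $1 == "SwapTotal:" { swap_total = $2 }
        $1 == "SwapFree:"  { swap_free = $2 }
        END {
            used = swap_total - swap_free
            verdict = (used > mem) ? "more memory might be required" : "ok"
            printf "swap_used_mb=%d physical_total_mb=%d %s\n", used / 1024, mem / 1024, verdict
        }'
}
```

On Linux, the function could be called as swap_vs_physical < /proc/meminfo.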

• Monitoring the number of page scans

You can use the Pagescans alarm provided by the monitoring template to monitor the number of page scans (the Page Scans/sec field of the PI record).

Monitoring the swap-out status and memory usage in addition to the number of page scans is recommended.

If you find processes that are performing many page scans, you can take appropriate action. If no such processes exist, the system environment is inadequate for the processing. In this case, you might need to increase the amount of physical memory.

For details, see 1.3.3(2)(a) Monitoring template.

• Monitoring the swap-out status

You can use the Swap Outs alarm provided by the monitoring template to monitor swap-out activity (the Swapped-Out Pages/sec field of the PI record).

Monitoring the number of page scans and the memory usage status in addition to the swap-out status is recommended.

If you find processes that are excessively swapping pages, take appropriate action. If no such processes exist, the system environment is inadequate for the processing. In this case, you might need to increase the amount of physical memory.

For details, see 1.3.3(2)(a) Monitoring template.

• Monitoring memory usage by process

If you think there is a problem with a process after monitoring the memory usage status, the number of page scans, and the swap-out status, you must identify the process causing the problem.

If server activities have not increased, you can use a real-time report to monitor the memory usage of each process (by using the Real Mem Kbytes field of the PD_PDI record, for example) for a number of minutes. In the results displayed as a line graph, you can then check for a process whose memory usage is increasing steadily.

If you identify a process that is causing a memory leak or is excessively swapping pages, contact the vendor or take other appropriate action.

For definition examples, see 1.3.3(2)(b) Definition example other than for monitoring templates.
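
The check for a process whose memory usage is increasing steadily can also be sketched in a script. The following example assumes that you have recorded the memory usage of one process (in kilobytes) at regular intervals, one value per line. The function name is_steadily_increasing and the rule of requiring at least three strictly rising samples are hypothetical choices for illustration.

```shell
#!/bin/sh
# is_steadily_increasing   (hypothetical helper, not a PFM command)
# Reads one memory sample (KB) per line from standard input, oldest first.
# Prints "increasing" only if there are at least three samples and every
# sample is larger than the one before it.
is_steadily_increasing() {
    awk '
        NR == 1 { rising = 1 }
        NR > 1 && $1 + 0 <= prev { rising = 0 }
        { prev = $1 + 0 }
        END {
            result = (NR >= 3 && rising) ? "increasing" : "not increasing"
            print result
        }'
}
```

A process that reports "increasing" over a long period while server activity is unchanged is a candidate for closer investigation.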

• Checking the amount of memory that the system can actually use (Linux only)

In Linux, once information is stored in memory, Linux retains that information in memory as long as possible. For this reason, when the Free Mem Mbytes field of the PI record is used to monitor the amount of available memory, the value of the field gradually approaches 0. However, the stored information can be released from memory anytime and does not prevent applications from using memory. In PFM - Agent for Platform 08-11 and later versions, the amount of memory that can be freed up anytime can be monitored by using the Buffers Mem Mbytes and Cache Mem Mbytes fields of the PI record. PFM - Agent for Platform calculates the amount of memory that is available in reality from these fields and outputs the result to the Effective Free Mem Mbytes field of the PI record. You can use this field to monitor the amount of memory that is actually available to the system.
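
The following sketch illustrates the idea with the Linux /proc/meminfo file. It assumes that the effectively available memory is the sum of the free, buffer, and cache amounts (the MemFree, Buffers, and Cached entries). This formula is an assumption made for illustration; the exact calculation used for the Effective Free Mem Mbytes field is not described here. The function name effective_free_mb is hypothetical.

```shell
#!/bin/sh
# effective_free_mb   (hypothetical helper, not a PFM command)
# Reads /proc/meminfo-style text from standard input (values in kB) and
# prints MemFree + Buffers + Cached in megabytes.
effective_free_mb() {
    awk '
        $1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
        END { printf "%d\n", sum / 1024 }'
}
```

On Linux, the function could be called as effective_free_mb < /proc/meminfo.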

(3) Disks

This subsection explains how to monitor disk performance.

(a) Overview

You can monitor disk performance to detect disk resource shortages and bottlenecks caused by a disk. Continuous monitoring of disk performance allows you to check for trends in increased disk space usage so that you can determine an appropriate configuration for the system or determine when the system configuration should be expanded.

A disk stores programs, the data used by the programs, and other data. If the amount of free disk space becomes insufficient, data might be lost or system response might slow down.

If a program that is performing a disk I/O operation must pause (that is, wait for the response), the disk is becoming a bottleneck.

A disk bottleneck can cause any of several types of performance degradation, such as slow process response. For this reason, it is important to check that disk performance is not degrading.

When you monitor the number of disk I/O operations, note the following.

The I/O information for a disk that PFM - Agent for Platform acquires is the I/O information that the OS has acquired from the disk device. This information does not necessarily match the I/O operations performed on the actual disk. The following figure shows I/O processing that occurs between an application and a disk.

Figure 1-3 Conceptual figure for I/O processing

[Figure]

The fields related to disk I/O load are Avg Service Time and Busy %.

The Avg Service Time field indicates the average time required for one I/O operation. If a very large amount of information is input or output, or if an I/O operation is delayed, the value of this field becomes large.

The Busy % field indicates the percentage of time that the disk device was operating during a collection interval. The value of this field becomes large if I/O operations are concentrated in a short period of time.

As described above, the Avg Service Time and Busy % fields are related to the disk device load. You can therefore use these fields in the way that best suits your monitoring requirements.
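
The difference between the two fields can be illustrated with simple arithmetic. The following sketch assumes that the busy percentage is the device busy time divided by the collection interval, and that the average service time is the device busy time divided by the number of I/O operations. These definitions and the function name disk_load are assumptions made for illustration and might differ from the calculation used by PFM - Agent for Platform.

```shell
#!/bin/sh
# disk_load BUSY_MS OPS INTERVAL_SEC   (hypothetical helper, not a PFM command)
#   BUSY_MS:      time the device was busy during the interval (milliseconds)
#   OPS:          number of I/O operations completed during the interval
#   INTERVAL_SEC: length of the collection interval (seconds)
disk_load() {
    awk -v busy="$1" -v ops="$2" -v secs="$3" 'BEGIN {
        busy_pct = 100 * busy / (secs * 1000)
        avg_ms = (ops > 0) ? busy / ops : 0
        printf "busy=%.1f%% avg_service=%.2fms\n", busy_pct, avg_ms
    }'
}
```

For example, disk_load 3000 600 60 and disk_load 3000 2 60 both yield a busy rate of 5.0%, but the average service time is 5.00 ms in the first case and 1500.00 ms in the second. The first device handled many short operations, and the second handled a few slow ones.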

The following figure explains what the Avg Service Time and Busy % field values indicate.

Figure 1-4 Values of the Avg Service Time and Busy % fields

[Figure]

The monitoring template provides the Disk Space alarm. If you want to perform more detailed monitoring, see the following table, which lists and describes the principal records and fields related to the monitoring of disk performance.

Table 1-4 Principal fields related to disk monitoring

Record Field Description (example)
PI_DEVD Avg Service Time The average time for I/O operations. If the value of this field is large, the amount of information being input or output might be very large.
Avg Wait Time The average wait time for I/O operations. If the value of this field is large, the amount of information being input or output might be very large.
Busy % The disk busy rate. If the value of this field is large, I/O operations might be concentrated on a specific disk.
I/O Mbytes The total amount of information input or output. If the value of this field is large, the amount of information being input or output might be very large.
Total I/O Ops The number of I/O operations that occurred. If the value of this field is large, I/O operations might be concentrated on a specific disk.
Queue Length The queue length. If the value of this field continues to be at or above the threshold, the device is congested.
PD_FSL Mbytes Free The amount of available disk space. If the value of this field is small, the amount of free disk space is insufficient.
Mbytes Free %
PD_FSR Mbytes Free
Mbytes Free %

(b) Monitoring methods

• Monitoring the disk free space

You can use the File System Free(L) alarm or File System Free(R) alarm provided by the monitoring template to monitor the amount of free disk space.

You can use an alarm to effectively monitor the percentage of free logical-disk space.

If the amount of free logical-disk space (the Mbytes Free or Mbytes Free % field of the PD_FSL or PD_FSR record) falls below the threshold, you might need to take appropriate action, such as deleting unnecessary files or adding a disk.

For details, see 1.3.3(3)(a) Monitoring template.
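
As an illustration of a free-space threshold check outside of PFM - Agent for Platform, the following sketch filters the output of the df -P command. The function name low_space is hypothetical, and the free percentage is assumed to be the Available value divided by the sum of the Used and Available values. Mount points that contain spaces are not handled.

```shell
#!/bin/sh
# low_space THRESHOLD_PCT   (hypothetical helper, not a PFM command)
# Reads `df -P` output from standard input and prints each mount point
# whose free-space percentage is below THRESHOLD_PCT.
# df -P columns: Filesystem 1024-blocks Used Available Capacity Mounted-on
low_space() {
    awk -v limit="$1" '
        NR == 1 { next }                      # skip the header line
        ($3 + $4) > 0 {
            free_pct = 100 * $4 / ($3 + $4)   # Available / (Used + Available)
            if (free_pct < limit)
                printf "%s free=%.1f%%\n", $6, free_pct
        }'
}
```

The function could be called as df -P | low_space 10 to list file systems that have less than 10% free space.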

• Monitoring the disk I/O delay status

You can use the I/O Wait Time alarm provided by the monitoring template to monitor the disk I/O delay.

The alarm includes the Wait % field (of the PI record), with which you can monitor the disk I/O delay status. If the value of this field is large, you might need to take appropriate action, such as checking for a process that is performing too many I/O operations to update a database, for example.

For details, see 1.3.3(3)(a) Monitoring template.

• Monitoring the disk I/O status

You can use the Disk Service Time alarm provided by the monitoring template to monitor the disk I/O.

The alarm includes the Avg Service Time field (of the PI_DEVD record), which enables you to check for a process that is inputting or outputting a very large amount of information.

For details, see 1.3.3(3)(a) Monitoring template.

• Monitoring the disk busy rate

You can use the Disk Busy % alarm provided by the monitoring template to monitor the disk busy rate.

You can monitor the disk busy rate by using an alarm to check whether excessive paging (reading and writing of pages by processes) is occurring.

If the disk busy rate (the Busy % field of the PI_DEVD record) continues to be at or above the threshold, you might need to take action. For example, you might need to identify the processes that frequently request disk I/O operations, and then distribute the processing of these processes.

When you monitor the disk busy rate, monitoring the disk I/O delay status, disk I/O status, and disk congestion is also recommended.

For details, see 1.3.3(3)(a) Monitoring template.

• Monitoring disk congestion

You can use the Disk Queue alarm provided by the monitoring template to monitor disk congestion.

You can monitor disk congestion by using an alarm to check whether I/O requests have been excessive.

If the disk congestion level (the Queue Length field of the PI_DEVD record) continues to be at or above the threshold, you might need to take action. For example, you might need to identify those processes that frequently request disk I/O, and then distribute the processing of the processes.

When you monitor disk congestion, monitoring the disk I/O delay status, disk I/O status, and disk busy rate is also recommended.

For details, see 1.3.3(3)(a) Monitoring template.

(4) Network

This subsection explains how to monitor network performance.

(a) Overview

You can monitor network information to check the response time of system functionality.

Continuous monitoring of network data traffic allows you to plan network reconfiguration or expansion.

The following table lists and describes the principal records and fields related to monitoring network performance.

Table 1-5 Principal fields related to network monitoring

Record Field Description (example)
PI_NIND, PI_NINS Pkts Rcvd/sec The number of packets received per second. If the value of this field is large, many packets have been received successfully.
PI_NIND, PI_NINS Pkts Xmitd/sec The number of packets sent per second. If the value of this field is large, many packets have been sent successfully.
PI_NIND Max Transmission Unit The maximum packet size. In an environment in which an MTU is automatically allocated, if the value of this field is large (1500 or more), splitting of sent or received data occurs. If the value of this field is small (below 1500), the number of control signals and blocks increases. This increase could cause a network bottleneck.

(b) Monitoring methods

• Monitoring for data traffic that exceeds the NIC bandwidth (the maximum amount of data that can be transferred per unit of time)

You can use the Network Rcvd/sec alarm provided by the monitoring template to monitor the bandwidth of a network interface card.

You can monitor the number of packets sent or received over the network by using an alarm to monitor the bandwidth of a network interface card (NIC).

If the number of packets continues to be at or above the threshold, you might need to upgrade the NIC or the physical network.

For details, see 1.3.3(4)(a) Monitoring template.
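
To show how a packets-per-second value can be derived, the following sketch compares two samples of one interface line from the Linux /proc/net/dev file. It assumes the usual column order of that file, in which the received packet count is the second counter and the sent packet count is the tenth counter after the interface name. The function name pkt_rate is hypothetical, and the result is not necessarily identical to the Pkts Rcvd/sec and Pkts Xmitd/sec field values.

```shell
#!/bin/sh
# pkt_rate PREV_LINE CUR_LINE INTERVAL_SEC   (hypothetical helper, not a PFM command)
# PREV_LINE and CUR_LINE are the /proc/net/dev lines of one interface,
# sampled INTERVAL_SEC seconds apart.
pkt_rate() {
    printf '%s\n%s\n' "$1" "$2" | awk -v secs="$3" '
        { sub(/^[ \t]+/, ""); sub(/:/, " ") }   # normalize "  eth0:123" to "eth0 123"
        NR == 1 { rx = $3; tx = $11; next }     # $3 = packets received, $11 = packets sent
        { printf "%s rcvd/sec=%.1f xmitd/sec=%.1f\n", $1, ($3 - rx) / secs, ($11 - tx) / secs }'
}
```

For example, you could capture the line for an interface (here assumed to be named eth0) with grep 'eth0:' /proc/net/dev at the start and end of a 60-second interval, and then pass both lines and the value 60 to pkt_rate.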

(5) Processes

This subsection explains how to monitor process performance.

(a) Overview

Because system functionality is provided by individual processes, understanding the operating status of processes is essential for stable system operation.

If one of the processes that provides system functionality terminates abnormally, the system stops with serious consequences. In order to detect such an abnormal condition early and take appropriate action, it is necessary to monitor the status of processes, including their generation and disappearance.

Note that PFM - Agent for Platform performs a process check at the same intervals that information is collected. Accordingly, the time that the disappearance of a process is detected is the time that PFM - Agent for Platform collects information, not the actual time that the process disappeared.

The following table lists and describes the principal records and fields related to the monitoring of processes.

Table 1-6 Principal fields related to the monitoring of processes

Record Field Description (example)
PI_WGRP Process Count The number of processes. If the value of this field is at or below the threshold (the minimum number of processes that need to be running), some or all of the required processes are not running.#
PD_PDI Program The name of a process. If this record is not collected, the process is not running.

#
The /opt/jp1pc/agtu/agent/wgfile file must be set up to collect this record.

(b) Monitoring methods

• Monitoring process disappearance

You can use the Process End alarm provided by the monitoring template to monitor process disappearance.

If a process terminates abnormally, the system stops with serious consequences. You can monitor the disappearance of processes by using an alarm, enabling prompt recovery of the system.

For details, see 1.3.3(5)(a) Monitoring template.

• Monitoring process generation

You can use the Process Alive alarm provided by the monitoring template to monitor process generation.

You can use an alarm to monitor the generation of processes for each application or the status of scheduled processes, enabling you to check the operation status of the production system.

By setting a workgroup in the wgfile file and using the PI_WGRP record, you can perform several types of monitoring. For example, you can monitor the following items: process generation, process disappearance, the number of processes that have the same name, the number of processes for each application, and the number of processes activated for each user.

For details, see 1.3.3(5)(a) Monitoring template.
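
The idea of counting processes that have the same name can be sketched with standard commands, independently of the wgfile file and the PI_WGRP record. In the following example, the function name check_process_count is hypothetical, and the rule that a count below the minimum is reported as abnormal is an assumption made for illustration.

```shell
#!/bin/sh
# check_process_count NAME MIN   (hypothetical helper, not a PFM command)
# Reads one command name per line from standard input, counts the lines
# that match NAME, and reports "abnormal" if the count is below MIN.
check_process_count() {
    awk -v name="$1" -v min="$2" '
        $1 == name { n++ }
        END {
            state = (n + 0 < min + 0) ? "abnormal" : "normal"
            printf "%s count=%d %s\n", name, n, state
        }'
}
```

The input could be produced with ps -e -o comm= | check_process_count httpd 2. As with PFM - Agent for Platform, such a check detects the disappearance of a process only at the time the check runs.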

(6) Example of collecting information about used ports

PFM - Agent for Platform provides functionality for converting user-specific performance data that users have output to text files (user-created data) into a format that can be stored in records provided by PFM - Agent for Platform (user data files). For details about user-specific performance data, see 4.2.3 Settings for collecting user-specific performance data.

The following shows an example for collecting used port information in PI_UPIB records as user-specific performance data. The following table shows the format in which used port information is stored.

Table 1-7 Format for user-created data

Option Value
tt "TCP"
ks The host name
lr The total number of TCP ports for the host
lr The number of currently active TCP ports for the host
lr The number of listening TCP ports for the host

To collect information:

  1. Create a shell script for collecting information about used ports.
    In this example, a shell script is used to collect information about used ports. The following shows an example of creating a shell script.

    Example of creating a shell script in Linux (/homework/sample.sh):
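    #!/bin/sh
    # Write the header lines of the user-created data (product name, format version, option line).
    echo "Product Name=PFM-Agent for Platform (UNIX)" > /homework/userdata.tcp
    echo "FormVer=0001" >> /homework/userdata.tcp
    echo "tt ks lr lr lr" >> /homework/userdata.tcp
    # Total number of TCP ports (subtract the two header lines that netstat prints)
    ALL_TCP=`netstat -at | wc -l`
    ALL_TCP=`expr $ALL_TCP - 2`
    # Number of currently active TCP ports (connections in the ESTABLISHED state)
    ACTIVE_TCP=`netstat -at | grep ESTABLISHED | wc -l`
    # Number of listening TCP ports (sockets in the LISTEN state)
    LISTEN_TCP=`netstat -at | grep LISTEN | wc -l`
    # Append one data line: "TCP", host name, total, active, listening
    echo "TCP `uname -n` $ALL_TCP $ACTIVE_TCP $LISTEN_TCP" >> /homework/userdata.tcp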
    Note
    Note that the example shell script shown here was created for Linux. This shell script might not operate correctly in other OSs, and might not always operate correctly in Linux due to differences in environments.
  2. Execute the shell script created in step 1.
    The following shows the user-created data created as a result of executing the shell script.

    User-created data (/homework/userdata.tcp):
    Product Name=PFM-Agent for Platform (UNIX)
    FormVer=0001
    tt ks lr lr lr
    TCP jp1ps05 15 3 12
  3. Convert the user-created data created in step 2 to a user data file.
    The following shows an example of executing the jpcuser command to convert user-created data into a user data file.

    Example of jpcuser command execution:
    /opt/jp1pc/agtu/agent/jpcuser/jpcuser PI_UPIB -file /homework/userdata.tcp
  4. Use PFM - Agent for Platform to collect the user data file output in step 3.
    When PFM - Agent for Platform collects records, the contents of the user data file are stored in user records.
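
Before converting user-created data, it can be useful to check that each data line has as many values as the option line. The following sketch performs such a check. It assumes the layout shown in the example above (two header lines, one option line, then data lines with values separated by spaces and no spaces inside a value). The function name check_userdata is hypothetical, and this check is not part of the jpcuser command.

```shell
#!/bin/sh
# check_userdata   (hypothetical helper, not a PFM command)
# Reads user-created data from standard input. Line 3 is assumed to be the
# option line (for example, "tt ks lr lr lr"); every later line is assumed
# to be a data line with the same number of space-separated values.
check_userdata() {
    awk '
        NR == 3 { ncols = NF; next }
        NR > 3 && NF != ncols {
            bad++
            printf "line %d: expected %d values, found %d\n", NR, ncols, NF
        }
        END { if (!bad) print "ok"; exit (bad ? 1 : 0) }'
}
```

For the file created in step 2, check_userdata < /homework/userdata.tcp would print ok.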

(7) Example of collecting performance data from multiple hosts on which PFM products are not installed

You can use the user-created data collection functionality provided by PFM - Agent for Platform to collect performance data specific to hosts on which PFM products are not installed. You can also monitor the status of multiple hosts at the same time by converting the performance data for the hosts into a single user data file. In this case, a script such as a shell script needs to be prepared because user-created data will be created on each host on which PFM products have not been installed. The following shows an example for collecting performance data from hosts on which PFM products are not installed, and outputting the data as PFM - Agent for Platform record information.

(a) Collected data

The following example obtains information using the user-created data created in (6) Example of collecting information about used ports.

(b) Prerequisites

The prerequisites for collecting performance data from multiple hosts on which PFM products are not installed are as follows:

(c) Procedures for data collection

The following figure shows the flow of data collection for hosts on which PFM products are not installed.

Figure 1-5 Flow of data collection for hosts on which PFM products are not installed

[Figure]

The following uses the numbering in the figure to explain processing. To collect performance data from multiple hosts, perform these steps for each host.

  1. Create user-created data for hosts on which PFM products are not installed.
    Execute the script to collect performance data, and generate user-created data. The user-created data generated in (6) Example of collecting information about used ports is used here.
  2. Copy files between remote hosts.
    Copy the user-created data created in step 1 to the hosts on which PFM products are installed. Here, user-created data is copied to the /nfshome/ area shared between hosts by using NFS. The following shows an example of the cp command.

    Example of the cp command:
    #/homework/sample.sh
    #cp /homework/userdata.tcp /nfshome/userdata.tcp
    Note
    When collecting user-created data from multiple hosts, make sure that the file names are unique. If file names are duplicated, files might be overwritten during file copying.
  3. Execute the jpcuser command on hosts on which PFM products are installed.
    Execute the jpcuser command on hosts on which PFM products are installed to convert the user-created data copied in step 2 to user data files. The following shows an example in which the user-created data obtained through steps 1 and 2 from the hosts on which PFM products are not installed is converted into a single user data file.

    Example of the jpcuser command:
    /opt/jp1pc/agtu/agent/jpcuser/jpcuser PI_UPIB -file user-created-data-1 -file user-created-data-2 -file user-created-data-3
  4. Collect record data for hosts on which PFM products are installed.
    On the hosts on which PFM products are installed, collect the contents of the user data file output in step 3 as record data.
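
Steps 2 and 3 can be sketched with two small helper functions. The first builds a file name that includes the host name so that copies from different hosts do not overwrite each other, and the second assembles a jpcuser command line with one -file option per user-created data file, as in the example in step 3. The function names and the naming scheme userdata.host-name.tcp are hypothetical. The second function only prints the command line and does not execute it.

```shell
#!/bin/sh
# unique_copy_name HOST   (hypothetical helper)
# Prints a copy destination in the shared /nfshome/ area that is unique per host.
unique_copy_name() {
    echo "/nfshome/userdata.$1.tcp"
}

# jpcuser_command FILE...   (hypothetical helper)
# Prints the jpcuser command line that converts all FILEs into one
# user data file for the PI_UPIB record.
jpcuser_command() {
    cmd="/opt/jp1pc/agtu/agent/jpcuser/jpcuser PI_UPIB"
    for f in "$@"; do
        cmd="$cmd -file $f"
    done
    echo "$cmd"
}
```

On each host on which PFM products are not installed, the copy in step 2 could then be written as cp /homework/userdata.tcp "$(unique_copy_name "$(uname -n)")".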

All Rights Reserved. Copyright (C) 2009, Hitachi, Ltd.