Job Management Partner 1/Performance Management - Agent Option for Platform Description, User's Guide and Reference



1.3.2 How to monitor performance

The following explains how to monitor performance for each system resource, and provides examples of performance data collection.

Organization of this subsection
(1) Processor
(2) Memory
(3) Disks
(4) Network
(5) Processes and services
(6) Event logs
(7) Active Directory monitoring examples
(8) Example of collecting information about used ports
(9) Example of collecting performance data from multiple hosts on which PFM products are not installed

(1) Processor

This subsection explains how to monitor processor performance.

(a) Overview

By monitoring processor performance, you can check performance trends for the entire system.

In Windows, as illustrated in the following figure, processes are executed in two processor access modes: user mode and kernel mode. This figure provides an overview of the Windows architecture.

Figure 1-1 Overview of the Windows architecture

[Figure]

You can also check performance trends for the entire system by monitoring the number of queued jobs.

Jobs, such as processes, are executed by the CPU according to the schedule determined by the OS. The number of queued jobs is the number of jobs that are waiting to be executed by the CPU. When the overall system load is high, the number of queued jobs tends to increase.

The monitoring templates provide functionality such as CPU Usage alarms and CPU Status (Multi-Agent) reports.

To monitor processor performance in more detail than the monitoring templates allow, you can also monitor the processor usage per processor, the processor usage per process, the processor queue count, and the hardware interrupts sent to the processor.

The following table lists and describes the principal records and fields related to processor monitoring.

Table 1-1 Principal records and fields related to processor monitoring

| Record | Field | Description (example) |
|---|---|---|
| PI_PCSR | CPU % | The CPU usage for a processor. If the value of this field continues to be at or above the threshold (normally 85%), the processor might be a system bottleneck.# |
| PI_PCSR | Privileged CPU % | The CPU usage for a processor executed in privileged mode. If the value of the CPU % field in the PI_PCSR record continues to be at or above the threshold, there might be a problem with a specific application process (including a service) or system process (including a service).# |
| PI_PCSR | User CPU % | The CPU usage for a processor executed in user mode. If the value of the CPU % field in the PI_PCSR record continues to be at or above the threshold, there might be a problem with a specific application process (including a service).# |
| PI_SVRQ | Queue Length | The current length of the CPU server operation queue. If the value of this field continues to be at or above the threshold (2), the processor is busy. |
| PI | Processor Queue Length | The number of threads ready to be executed in the processor queue. If the value of this field continues to be at or above the threshold (2), the processor is congested. |
| PI | CPU % | The processor usage (%). That is, the percentage of time that the processor was executing non-idle threads. The maximum value is 100, even in a multi-processor environment. |
| PI | Privileged CPU % | The CPU usage in kernel mode. If the value of the CPU % field in the PI record continues to be at or above the threshold, there might be a problem with a specific application process (including a service) or system process (including a service). |
| PI | User CPU % | The CPU usage in user mode. If the value of the CPU % field in the PI record continues to be at or above the threshold, there might be a problem with a specific application process (including a service). |
| PI | Total Interrupts/sec | The number of hardware interrupts processed per second. If the value of this field has increased substantially when there are no system activities, there might be a hardware problem (for example, a slow device is burdening the processor with hardware interrupts). |
| PI_PCSR | Interrupts/sec | The number of hardware interrupts processed by a processor per second. This field is used when the Total Interrupts/sec field in the PI record is monitored for each processor. |

#
This field is used when monitoring is to be performed for each processor.

In a multi-processor environment, the system CPU usage is represented by the average usage of all CPUs. Therefore, check the CPU usage for each CPU.

To identify processes that are causing a bottleneck, check the CPU usage for each process.

The following table lists and describes the principal records and fields related to process monitoring in a multi-processor environment.

Table 1-2 Principal records and fields related to processor monitoring in a multi-processor environment

| Record | Field | Description (example) |
|---|---|---|
| PD_PDI | CPU % | The CPU usage for a process. If the value of this field continues to be at or above the threshold, the process might be a processor bottleneck.# |
| PD_PDI | Privileged CPU % | The percentage of time that the process was using the processor in privileged mode. If the value of the CPU % field continues to be at or above the threshold and the Privileged CPU % value is close to the CPU % value, an API function issued by the process might be a processor bottleneck.# |
| PD_PDI | User CPU % | The CPU usage for a process executed in user mode. If the value of the CPU % field continues to be at or above the threshold and the User CPU % value is close to the CPU % value, the process's own processing might be a processor bottleneck.# |

#
In a multi-processor environment, the maximum usage value that can be displayed is equal to the number of processors x 100 (%).
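
As a rough illustration of the note above, the following Python sketch converts a per-process CPU % value to the 0 to 100 scale used for system-wide values. The function name and arguments are hypothetical and are not part of PFM - Agent for Platform; the only fact taken from this manual is that the maximum displayed value equals the number of processors x 100.

```python
def normalize_process_cpu(cpu_percent, processor_count):
    """Scale a per-process CPU % value (PD_PDI record) to a 0-100 range.

    Hypothetical helper: in a multi-processor environment the displayed
    value can be as large as processor_count * 100.
    """
    if processor_count < 1:
        raise ValueError("processor_count must be at least 1")
    max_value = processor_count * 100.0
    if not 0 <= cpu_percent <= max_value:
        raise ValueError("cpu_percent is outside the displayable range")
    # Dividing by the number of processors gives the share of total capacity.
    return cpu_percent / processor_count
```

For example, a process that shows 200% on a four-processor host is using half of the total processor capacity.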

(b) Monitoring methods

• Monitoring processor usage

System-wide processor usage can be monitored using the CPU Usage alarm provided by the monitoring templates.

The processor usage (the CPU % field of the PI record) allows you to monitor the processor load status. For details, see 1.3.3(1)(a) Monitoring template.

• Monitoring processor congestion

In addition to processor usage, you can monitor processor congestion (the number of queued requests) to monitor the processor load status.

Monitoring both processor congestion and processor usage is an effective way to monitor the processor load status.

If a value at or above the threshold is displayed for the processor usage and queue length (Queue Length field in the PI_SVRQ record), the processor is probably congested.

Note that the threshold for the number of threads in the queue (Processor Queue Length field in the PI record) is about 2. If a value of 10 or more is displayed for this value, the system capacity is being exceeded. This value can be used as a guideline for determining whether to upgrade the processor or whether to add processors.

For definition examples, see 1.3.3(1)(b) Definition examples other than for monitoring templates.
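
The combined check of processor usage and processor congestion can be sketched in Python as follows. This is an illustration only: the function name, the input format (lists of sampled field values), and the rule that every sample in the window must reach the threshold are assumptions made for this sketch. The default values (85% for CPU %, 2 and 10 for Processor Queue Length) are the guideline values given in this subsection.

```python
def classify_processor_load(cpu_percent_samples, queue_length_samples,
                            cpu_threshold=85.0, queue_threshold=2,
                            capacity_limit=10):
    """Classify processor load from sampled CPU % and Processor Queue
    Length values (hypothetical helper, not a PFM function)."""
    if not cpu_percent_samples or not queue_length_samples:
        raise ValueError("at least one sample of each field is required")
    # "Continues to be at or above the threshold" is modeled here as
    # every sample in the window reaching the threshold.
    cpu_high = all(v >= cpu_threshold for v in cpu_percent_samples)
    queue_high = all(v >= queue_threshold for v in queue_length_samples)
    over_capacity = all(v >= capacity_limit for v in queue_length_samples)
    if over_capacity:
        return "capacity exceeded"
    if cpu_high and queue_high:
        return "congested"
    if cpu_high or queue_high:
        return "watch"
    return "normal"
```

Requiring both conditions before reporting congestion reflects the recommendation to monitor usage and queue length together rather than either value alone.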

• Checking processes whose processor usage is high

If you decide that a bottleneck might have occurred after monitoring processor usage and processor congestion, use a real-time report (the CPU % field of the PD_PDI record) to find processes that are monopolizing the processor.

If no such processes exist, the system environment is inadequate for the processing. In this case, you might need to upgrade the processor or add processors.

For definition examples, see 1.3.3(1)(b) Definition examples other than for monitoring templates.

(2) Memory

This subsection explains how to monitor memory performance.

(a) Overview

You can monitor memory performance to detect physical memory shortages and incorrect process operations.

Memory consists of physical memory and a paging file, as illustrated below. However, because the causes of bottlenecks are not limited to a small amount of physical memory or a small paging file, the paging status, page faults, and other items related to efficient memory usage must be monitored as well.

The following figure illustrates the configuration of the memory space.

Figure 1-2 Conceptual diagram of the memory space

[Figure]

Insufficient physical memory degrades overall system performance. Memory areas not accessed by programs for a long time are saved to the paging file, and are loaded into physical memory on demand. Physical memory is efficiently used in this manner. Note, however, that paging file access is markedly slower than physical memory access. Therefore, frequent paging or frequent page faults will considerably delay system processing.

Because paging often occurs even in normal processing, measure performance when the system is operating stably before attempting to determine proper thresholds.

The Available Memory alarm is provided by the monitoring templates. If you want to perform more detailed monitoring, see the following table, which lists and describes the principal records and fields related to memory monitoring.

Table 1-3 Principal fields related to memory monitoring

| Record | Field | Description (example) |
|---|---|---|
| PI | Pages/sec | The number of operations that caused paging per second. If the value of this field continues to be at or above the threshold (5), memory might be a system bottleneck. Note, however, that if this status is temporary, the maximum allowable value can be 20, depending on the case. |
| PI | Page Faults/sec | The number of page faults occurring per second. If the value of this field continues to be at or above the threshold (5), memory might be a system bottleneck. |
| PI | Data Map Hits % | The percentage of the number of requests that mapped a page to the file system cache. If the value of this field continues to be low, memory might be a system bottleneck. |
| PI | Total Physical Mem Mbytes | The amount of physical memory. |
| PI | Available Mbytes | The amount of available physical memory. |
| PI | Used Physical Mem Mbytes | The amount of physical memory in use. |
| PI | % Physical Mem | Physical memory usage. |
| PI | Commit Limit Mbytes | The amount of virtual memory. |
| PI | Non Committed Mbytes | The amount of available virtual memory. |
| PI | Committed Mbytes | The amount of virtual memory in use. If the value of this field continues to be at or above the threshold (the Total Physical Mem Mbytes field value of the PI record), a larger amount of memory might be required. |
| PI | % Committed Bytes in Use | Virtual memory usage. If the value of this field continues to be at or above the threshold (determined by the system load status), the paging file might need to be expanded. |
| PD_PAGF | % Usage | Paging file usage. If the value of this field continues to be at or above the threshold (determined by the system load status), the paging file might need to be expanded. |

The cause of a system memory shortage is not always physical memory itself. A problem with a program can also cause a shortage. By monitoring memory usage for each process, you can identify the cause of shortages. If there is a process improperly occupying memory or if the amount of memory used by a process continues to increase steadily, the program running the process is likely to be defective.

The following table lists and describes the principal records and fields related to monitoring the memory usage of a specific process.

Table 1-4 Principal fields related to the memory monitoring for each process

| Record | Field | Description (example) |
|---|---|---|
| PI | Pool Nonpaged Bytes | The amount of physical memory that is being used and cannot be paged out. If the value of this field continues to increase even when server activities are not increasing, a process causing a memory leak might exist. |
| PD_PDI | Page Faults/sec | The number of page faults occurring per second. A process causing a bottleneck can be detected from the frequency of process-specific page faults. |
| PD_PDI | Pool Nonpaged Kbytes, Pool Paged Kbytes, Working Set Kbytes, Page File Kbytes, Private Kbytes, Handle Count | The amount of other types of memory and the number of handles being used. If the value of any of these fields continues to increase, a process causing a memory leak might exist. |

(b) Monitoring methods

• Monitoring the amount of available physical memory

The unused size for physical memory (Available Mbytes field in the PI record) can be monitored using the Available Memory alarm provided by the monitoring templates.

For details, see 1.3.3(2)(a) Monitoring template.

• Monitoring the usage status of virtual memory

You can use the usage status of virtual memory as a guideline for determining whether to increase physical memory.

Even when memory usage is temporarily high, if the high load status does not persist, performance degradation might be permissible. Therefore, monitoring both the load status of virtual memory and the usage status of virtual memory is recommended.

If the amount of used virtual memory (the Committed Mbytes field of the PI record) is larger than the total amount of physical memory (the Total Physical Mem Mbytes field of the PI record), more memory might be required.

For definition examples, see 1.3.3(2)(b) Definition examples other than for monitoring templates.
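
The comparison between used virtual memory and physical memory can be sketched in Python as follows. The function names and the sampled-input format are assumptions for illustration; only the comparison itself (Committed Mbytes larger than Total Physical Mem Mbytes) comes from this subsection.

```python
def virtual_memory_pressure(committed_mbytes, total_physical_mbytes):
    """Return Committed Mbytes as a ratio of Total Physical Mem Mbytes."""
    if total_physical_mbytes <= 0:
        raise ValueError("total_physical_mbytes must be positive")
    return committed_mbytes / total_physical_mbytes


def more_memory_might_be_required(committed_samples, total_physical_mbytes):
    """True if every sample of Committed Mbytes exceeds physical memory.

    Checking every sample (an assumption of this sketch) ignores a
    temporary peak, in line with the advice not to react to short-lived
    high usage.
    """
    return bool(committed_samples) and all(
        virtual_memory_pressure(c, total_physical_mbytes) > 1.0
        for c in committed_samples
    )
```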

• Monitoring the load status of virtual memory

You can use the load status of virtual memory as a guideline for determining whether to increase physical memory.

Even when memory usage is temporarily high, if the high load status does not persist, performance degradation might be permissible. Therefore, monitoring both the load status of virtual memory and the usage status of virtual memory is recommended.

If the number of page faults (the Page Faults/sec field of the PI record) is at or above the threshold, the amount of memory allocated on the server might be less than the amount of memory secured by applications.

If the paging frequency (the Pages/sec field of the PI record) is at or above the threshold, the amount of physical memory might be insufficient.

For definition examples, see 1.3.3(2)(b) Definition examples other than for monitoring templates.
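
The distinction between sustained paging and a temporary peak can be sketched in Python as follows. The function name, the return labels, and the use of a fixed sample window are assumptions for illustration; the values 5 and 20 are the guideline values listed for the Pages/sec field in Table 1-3.

```python
def paging_status(pages_per_sec_samples, sustained_threshold=5.0,
                  temporary_limit=20.0):
    """Classify sampled Pages/sec values (hypothetical helper)."""
    if not pages_per_sec_samples:
        raise ValueError("at least one sample is required")
    # Sustained load: every sample is at or above the threshold.
    if all(v >= sustained_threshold for v in pages_per_sec_samples):
        return "sustained paging"
    # Otherwise the load is temporary; flag only peaks above the limit.
    if max(pages_per_sec_samples) > temporary_limit:
        return "temporary peak above limit"
    return "normal"
```

Because paging also occurs in normal processing, the default values here are starting points only; thresholds should be derived from measurements taken while the system is operating stably.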

• Checking whether a memory leak has occurred

A memory leak, which decreases the amount of available memory, might prevent the entire system from operating correctly. You can detect memory leaks by checking the line graph of a historical report for whether the amount of nonpaged-pool memory (the Pool Nonpaged Bytes field of the PI record) is increasing steadily.

If the amount of nonpaged-pool memory (the Pool Nonpaged Bytes field of the PI record) is increasing steadily even when the number of active processes has not changed, a process causing a memory leak might exist.

For definition examples, see 1.3.3(2)(b) Definition examples other than for monitoring templates.
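
Reading a steady increase from a line graph can be approximated in code. The following Python sketch treats a series of Pool Nonpaged Bytes values as steadily increasing when it ends higher than it started and most consecutive samples rise. The function name and the 90% default are assumptions chosen for illustration, not values from this manual.

```python
def is_steadily_increasing(samples, min_rising_fraction=0.9):
    """Return True if the series rises in most steps and ends higher.

    Hypothetical helper for values such as Pool Nonpaged Bytes taken
    from a historical report.
    """
    if len(samples) < 3:
        raise ValueError("at least three samples are required")
    # Differences between consecutive samples.
    steps = [later - earlier for earlier, later in zip(samples, samples[1:])]
    rising = sum(1 for step in steps if step > 0)
    return samples[-1] > samples[0] and rising / len(steps) >= min_rising_fraction
```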

• Monitoring the amount of memory used by processes

If you think a memory leak has occurred, you can identify the process that is causing the memory leak.

To do so, while server activity is not increasing, use a real-time report to monitor the amount of memory used by each process for a period of several minutes to several tens of minutes. For this monitoring, you can use, for example, the Working Set Kbytes field of the PD_PDI record. Then, in the displayed line graph, check for a process whose memory usage is increasing.

If you identify a process causing a memory leak, you can contact the vendor or take other appropriate action.

For definition examples, see 1.3.3(2)(b) Definition examples other than for monitoring templates.
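
The same comparison can be made on exported values instead of a line graph. The following Python sketch ranks processes by the growth of their Working Set Kbytes values over the observation period. The function name, the dictionary input format, and the process names in the usage example are hypothetical.

```python
def rank_processes_by_growth(working_set_by_process):
    """Return (process, growth) pairs sorted by growth, largest first.

    working_set_by_process maps a process name to its Working Set Kbytes
    samples in time order (hypothetical input format). Processes whose
    memory usage did not grow are omitted.
    """
    growth = {
        name: samples[-1] - samples[0]
        for name, samples in working_set_by_process.items()
        if len(samples) >= 2
    }
    growing = [(name, delta) for name, delta in growth.items() if delta > 0]
    return sorted(growing, key=lambda item: item[1], reverse=True)


# Usage with invented sample data:
ranking = rank_processes_by_growth({
    "app.exe": [1000, 1500, 2100],
    "svc.exe": [800, 790, 805],
    "idle.exe": [50, 50, 50],
})
```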

(3) Disks

This subsection explains how to monitor disk performance.

(a) Overview

You can monitor disk performance to detect disk resource shortages and bottlenecks caused by a disk. Continuous monitoring of disk performance allows you to check for trends in increased disk space usage so that you can determine an appropriate configuration for the system or determine when the system configuration should be expanded.

A disk stores programs, the data used by the programs, and other data. If the amount of free disk space becomes insufficient, data might be lost or the system response might slow down.

If a program that is performing a disk I/O operation must pause (that is, wait for a response), the disk is becoming a bottleneck.

A disk bottleneck can cause any of several types of performance degradation, such as slow process response. For this reason, it is important to check that disk performance is not degrading.

If you think a disk bottleneck has occurred, first make sure that the disk is not fragmented. Next, make sure that there is enough free disk space by making sure that no invalid files are occupying disk space. If invalid files exist, you must identify the programs that created the files and take appropriate action.

The Disk Space alarm is provided by the monitoring templates. If you want to perform more detailed monitoring, see the following table, which lists and describes the principal records and fields related to the monitoring of disk performance.

Table 1-5 Principal fields related to disk monitoring

| Record | Field | Description (example) |
|---|---|---|
| PI_LOGD, PI_PHYD | % Disk Time | The disk busy rate. If the value of this field continues to be at or above the threshold (50% or more, or close to 100%), the load on the disk is high. |
| PI_LOGD, PI_PHYD | Current Disk Queue Length | The number of queued requests. If the value of this field continues to be at or above the threshold (3), the disk is congested. |
| PI_LOGD, PI_PHYD | Avg Disk Bytes/Xfer | The number of bytes transferred between disks in one I/O operation. The larger the value of this field, the more efficiently the system is operating. |
| PI_LOGD, PI_PHYD | Disk Bytes/sec | The number of bytes transferred between disks per second. The larger the value of this field, the more efficiently the system is operating. |
| PI_LOGD | % Free Space | The percentage of free disk space. If the percentage is low, the amount of free disk space is insufficient. |
| PI_LOGD | Free Mbytes | The amount of available disk space. If the value of this field is small, the amount of free disk space is insufficient. |

(b) Monitoring methods

• Monitoring the percentage of free logical-disk space

The percentage of the amount of free space on a logical disk can be monitored using the Disk Space alarm provided by the monitoring templates.

When the percentage of free logical-disk space is near or at the threshold value (the % Free Space field of the PI_LOGD record), file defragmentation might be affected.

If the disk capacity is large, the system might operate normally even when the percentage of free logical-disk space is near or at the threshold value. Therefore, monitoring the amount of free logical-disk space, as well as the percentage, is recommended.

For details, see 1.3.3(3)(a) Monitoring template.

• Monitoring the amount of free logical-disk space

The amount of free space on a logical disk can be monitored using the Logical Disk Free alarm provided by the monitoring templates.

You can effectively detect a low disk space level by using an alarm to monitor the amount of free logical-disk space.

The threshold for the amount of free logical-disk space (the Free Mbytes field of the PI_LOGD record) can be used as a guideline for determining whether to take action, such as deleting unnecessary files, compressing files, or adding a disk.

For details, see 1.3.3(3)(a) Monitoring template.
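
Why both the percentage and the amount of free space are worth checking can be illustrated with a short Python sketch. The function name, the return labels, and the threshold values in the usage example are hypothetical; this manual does not prescribe specific values for these fields.

```python
def free_space_status(free_percent, free_mbytes,
                      percent_threshold, mbytes_threshold):
    """Combine % Free Space and Free Mbytes (PI_LOGD record) into one status.

    Hypothetical helper: a value at or below its threshold counts as low.
    """
    low_percent = free_percent <= percent_threshold
    low_mbytes = free_mbytes <= mbytes_threshold
    if low_percent and low_mbytes:
        return "low"
    if low_percent:
        # Typical of a large disk: small percentage, ample absolute space.
        return "low percentage only"
    if low_mbytes:
        return "low amount only"
    return "ok"


# A large disk with 4% free still has about 200,000 MB available.
status = free_space_status(4.0, 200_000,
                           percent_threshold=5.0, mbytes_threshold=1024)
```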

• Monitoring the disk busy rate

You can use the Disk Busy % alarm provided by the monitoring templates to monitor the disk busy rate.

You can monitor the disk busy rate by using an alarm to check whether excessive paging (reading and writing of pages by processes) is occurring.

If the disk busy rate (the % Disk Time field of the PI_PHYD or PI_LOGD record) continues to be at or above the threshold, you might need to take action. For example, you might need to identify the processes that frequently request disk I/O operations, and then distribute the processing of these processes.

When you monitor the disk busy rate, monitoring disk congestion is also recommended.

For details, see 1.3.3(3)(a) Monitoring template.

• Monitoring disk congestion

Disk congestion can be monitored using the Logical Disk Queue alarm or Physical Disk Queue alarm provided by the monitoring templates.

You can monitor disk congestion by using an alarm to check whether I/O requests have been excessive.

If the disk congestion level (the Current Disk Queue Length field of the PI_PHYD or PI_LOGD record) continues to be at or above the threshold, you might need to take action. For example, you might need to identify those processes that frequently request disk I/O, and then distribute the processing of the processes.

When you monitor disk congestion, monitoring the disk busy rate is also recommended.

For details, see 1.3.3(3)(a) Monitoring template.
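
Monitoring the disk busy rate and disk congestion together can be sketched in Python as follows. The function name and the input format (a dictionary of samples per disk instance, keyed by field name) are assumptions for illustration; the default thresholds (50% and 3) are the guideline values from Table 1-5.

```python
def disks_needing_attention(samples_by_disk, busy_threshold=50.0,
                            queue_threshold=3):
    """Return the disks whose busy rate and queue length both stay high.

    samples_by_disk maps a disk instance name to a list of dictionaries
    with the keys "% Disk Time" and "Current Disk Queue Length"
    (hypothetical input format).
    """
    flagged = []
    for disk, samples in samples_by_disk.items():
        if not samples:
            continue  # no data collected for this disk
        busy = all(s["% Disk Time"] >= busy_threshold for s in samples)
        queued = all(s["Current Disk Queue Length"] >= queue_threshold
                     for s in samples)
        if busy and queued:
            flagged.append(disk)
    return sorted(flagged)
```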

(4) Network

This subsection explains how to monitor network performance.

(a) Overview

You can monitor network information to check the response time of system functionality.

Continuous monitoring of network data traffic allows you to plan network reconfiguration or expansion.

The following table lists and describes the principal records and fields related to monitoring of the network performance.

Table 1-6 Principal fields related to network monitoring

| Record | Field | Description (example) |
|---|---|---|
| PI_NETI | Bytes Total/sec | The amount of data sent and received per second. In an environment that always uses an NIC, if the value of this field frequently falls below the threshold (the larger the value, the better), the NIC might be a bottleneck.# |
| PI_NETI | Bytes Sent/sec | The amount of data sent per second. In an environment that always uses an NIC, if the value of this field frequently falls below the threshold (the larger the value, the better), the NIC might be a bottleneck.# |
| PI | Bytes Rcvd/sec | The amount of data received per second. Compare the number of bytes that the server receives from the network to the total bandwidth of the NIC (the maximum amount of data that can be transferred per unit of time over the network). If the number of bytes is equal to or greater than 50% of the total bandwidth, the network connection might be a bottleneck. |

#
If the value of this field is large, a large amount of data has been transferred successfully.

(b) Monitoring methods

• Monitoring for data traffic that exceeds the NIC bandwidth (the maximum amount of data that can be transferred per unit of time)

You can use the Network Received alarm provided by the monitoring templates to monitor the bandwidth of a network interface card.

You can monitor network traffic by using an alarm to monitor the bandwidth of a network interface card (NIC).

If the data traffic continues to be at or above the threshold, you might need to upgrade the NIC or the physical network.

For details, see 1.3.3(4)(a) Monitoring template.
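
The 50% guideline for received data can be worked through in Python as follows. The function names are hypothetical, and the sketch assumes that the NIC bandwidth is stated in bits per second (as link speeds usually are) while Bytes Rcvd/sec is in bytes per second, so the received value is multiplied by 8 before the comparison.

```python
def nic_utilization(bytes_rcvd_per_sec, nic_bandwidth_bits_per_sec):
    """Return received traffic as a fraction of the NIC bandwidth."""
    if nic_bandwidth_bits_per_sec <= 0:
        raise ValueError("nic_bandwidth_bits_per_sec must be positive")
    # Convert bytes to bits so that both values use the same unit.
    return bytes_rcvd_per_sec * 8 / nic_bandwidth_bits_per_sec


def might_be_bottleneck(bytes_rcvd_per_sec, nic_bandwidth_bits_per_sec,
                        ratio=0.5):
    """True if received traffic is at or above the given share of bandwidth."""
    return nic_utilization(bytes_rcvd_per_sec,
                           nic_bandwidth_bits_per_sec) >= ratio
```

Under these assumptions, a 1 Gbit/s NIC reaches the 50% guideline at 62,500,000 bytes received per second.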

(5) Processes and services

This subsection explains how to monitor process performance and service performance.

(a) Overview

Because system functionality is provided by individual processes and services, understanding the operating status of processes and services is essential for stable system operation.

If one of the processes or services that provide system functionality terminates abnormally, the system stops with serious consequences. In order to detect such an abnormal condition early and take appropriate action, it is necessary to monitor the status of processes and services including their generation and disappearance.

Note that PFM - Agent for Platform performs a process check at the same intervals that information is collected. Accordingly, the time that the disappearance of a process is detected is the time that PFM - Agent for Platform collects information, not the actual time that the process disappeared.
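
The effect of the collection interval can be illustrated with a short Python sketch that compares the process names seen in two consecutive collections. This is a simplified model, not the agent's implementation: the function name, the input format, and the timestamp string are hypothetical. The point it shows is that a change is attributed to the collection time, not to the moment the process actually started or ended.

```python
def diff_process_snapshots(previous, current, collected_at):
    """Compare process names from two consecutive collections.

    previous and current are iterables of process names; collected_at is
    the time of the current collection (hypothetical input format).
    """
    previous_names = set(previous)
    current_names = set(current)
    return {
        # The detection time is the collection time of the later snapshot.
        "collected_at": collected_at,
        "disappeared": sorted(previous_names - current_names),
        "generated": sorted(current_names - previous_names),
    }
```

A process that starts and ends between two collections appears in neither snapshot, so a check of this kind cannot detect it.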

The following table lists and describes the principal records and fields related to the monitoring of processes and services.

Table 1-7 Principal fields related to the monitoring of processes and services

| Record | Field | Description (example) |
|---|---|---|
| PI_WGRP | Process Count | The number of processes. If the value of this field is at or below the threshold (the minimum number of processes that need to be activated), some or all of the required processes are inactive.# |
| PD_PDI | Program | The name of a process. If this record is not collected, the process is inactive. |
| PD_SVC | Service Name, Display Name, State | The name and status of a service. If the status of the application service (process) is not RUNNING, the service is inactive. |

#
The collection data addition utility must be set up to collect this record.

(b) Monitoring methods

• Monitoring process disappearance

You can use the Process End alarm provided by the monitoring templates to monitor process disappearance.

If a process terminates abnormally, the system stops with serious consequences. You can monitor the disappearance of processes by using an alarm, enabling prompt recovery of the system.

For details, see 1.3.3(5)(a) Monitoring template.

• Monitoring process generation

You can use the Process Alive alarm provided by the monitoring templates to monitor process generation.

You can use an alarm to monitor the generation of processes for each application or the status of scheduled processes, enabling you to check the operating status of the production system.

By using the PI_WGRP record and specifying the workgroup settings of the collection data addition utility, you can perform several types of monitoring. For example, you can monitor the following items: process generation, process disappearance, the number of processes that have the same name, the number of processes for each application, and the number of processes activated for each user.

For details, see 1.3.3(5)(a) Monitoring template.

• Monitoring for service stoppages

Service stoppage can be monitored using the Service (Service Nm) alarm or Service (Display Nm) alarm provided by the monitoring templates.

If a service terminates abnormally, the production system stops with serious consequences.

You can monitor a service for stoppages by using an alarm, enabling prompt recovery of the system.

For details, see 1.3.3(5)(a) Monitoring template.

(6) Event logs

This subsection explains how to monitor event logs.

(a) Overview

The OS and applications output errors, warnings, and other types of events to Event Viewer. By monitoring the Event Viewer event logs, you can detect a problem with the OS or an abnormal process operation, enabling prompt recovery of the system.

The following table lists and describes the principal records and fields related to the monitoring of the event logs.

Table 1-8 Principal fields related to the event log monitoring

| Record | Field | Description (example) |
|---|---|---|
| PD_ELOG | Log Name | The event log type. Event logs include the following types of logs: Application, Security, and System. |
| PD_ELOG | Event Type Name | The event type identification name, such as Error or Warning. |
| PD_ELOG | Source Name | The name of the application that output the event. This information identifies the application that output the event. |
| PD_ELOG | Event ID | The event ID. This information uniquely identifies each logged event for an application. |
| PD_ELOG | Description | The description (details) of the event. |

(b) Monitoring methods

• Monitoring all error and warning events output to the event logs

All errors and warnings output to the event log can be monitored using the Event Log (all) alarm provided by the monitoring templates.

You can use an alarm to monitor the error and warning events output to the event logs.

For details, see 1.3.3(6)(a) Monitoring template.
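
The selection of error and warning events can be sketched in Python as a filter over event records. The function name and the representation of each event as a dictionary keyed by PD_ELOG field names are assumptions for illustration; they do not describe how the alarm is implemented.

```python
def select_error_and_warning_events(events, log_names=None):
    """Return the events whose Event Type Name is Error or Warning.

    events is a list of dictionaries keyed by PD_ELOG field names
    (hypothetical input format). If log_names is given, only events from
    those logs (for example, {"System"}) are returned.
    """
    wanted_types = {"Error", "Warning"}
    selected = []
    for event in events:
        if event["Event Type Name"] not in wanted_types:
            continue
        if log_names is not None and event["Log Name"] not in log_names:
            continue
        selected.append(event)
    return selected
```

Passing a set such as {"System"} as log_names narrows the same filter to a single log.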

• Monitoring an MSCS cluster

The operation of an MSCS cluster can be monitored using the Event Log (System) alarm provided by the monitoring templates.

You can use an alarm to monitor the events output by MSCS.

For details, see 1.3.3(6)(a) Monitoring template.

(7) Active Directory monitoring examples

The following explains how to monitor Active Directory on PFM - Agent for Platform, and gives monitoring examples.

(a) Active Directory monitoring information

For versions 08-11 and later of PFM - Agent for Platform, Active Directory Overview (PI_AD) records can be used to collect Active Directory monitoring information. PI_AD records can be referenced to monitor the execution status and results of replication, the session connection status, the database cache hit rate, and the wait time required for database log output. This allows users to check the load and to confirm that Active Directory is operating normally.

The following explains the Active Directory configuration and monitoring information used for each monitoring objective.

The following figure shows the configuration for Active Directory.

Figure 1-3 Active Directory configuration

[Figure]

The following explains the monitoring information for each monitoring objective, according to the numbers in the figure.

Table 1-9 Monitoring information for monitoring point 1

| Monitoring objective | Bottleneck | Monitoring method and countermeasure example | PI_AD record monitoring field |
|---|---|---|---|
| Monitoring whether any invalid login attempts have been performed | User login | If the number of authentication requests is significantly larger than the number of currently logged-in users, this might be due to invalid login attempts (users for which login fails continuously). If invalid login is a possibility, take appropriate measures. | Kerberos Authentications, NTLM Authentications, LDAP Client Sessions |
| Preventing degraded performance when distributing requests from users across multiple domain controllers | User login | Obtain the number of connection sessions to each domain controller, and compare it with the number of logins. Based on the comparison, redistribute the users on each domain controller, to spread the load. | LDAP Client Sessions |

Table 1-10 Monitoring information for monitoring point 2

| Monitoring objective | Bottleneck | Monitoring method and countermeasure example | PI_AD record monitoring field |
|---|---|---|---|
| Monitoring for databases significantly impacting Active Directory performance | Active Directory database cache | In the following cases, increase the amount of memory allocated to the cache: • The value of the Cache % Hit field or Table Open Cache % Hit field is at or below the baseline. • The value of the Cache Page Fault Stalls/sec field or Table Open Cache Misses/sec field is at or above the baseline. | Cache % Hit, Cache Page Fault Stalls/sec, Cache Page Faults/sec, Cache Size, Table Open Cache % Hit, Table Open Cache Hits/sec, Table Open Cache Misses/sec, Table Opens/sec |
| Monitoring for databases significantly impacting Active Directory performance | Active Directory database log buffer | If the value of the Log Record Stalls/sec field is at or above the baseline, increase the amount of memory allocated to the log buffer. | Log Record Stalls/sec, Log Threads Waiting, Log Writes/sec |

Table 1-11 Monitoring information for monitoring point 3

| Monitoring objective | Bottleneck | Monitoring method and countermeasure example | PI_AD record monitoring field |
|---|---|---|---|
| Monitoring the replication status to prevent large amounts of communication between domain controllers due to replication | Intrasite DC communication | Fields related to traffic are monitored to see whether they are at or above the baseline. Consider the following countermeasures if the fields are at or above the baseline: • Increase the speed of the network. • Reschedule intrasite replication for when CPU usage is low. | DRA In Total/sec, DRA Out Total/sec |
| Preventing loss of Active Directory functionality performance due to replication, as well as file loss or damage due to folder contention | Intrasite DC communication | If the value of the DRA Sync Requests Made field minus the value of the DRA Sync Requests Successful field increases monotonically, Active Directory functionality might be degraded.#1 In this case, reschedule intrasite replication for when CPU usage is low. | DRA In Total/sec, DRA Out Total/sec |
| Preventing large amounts of intrasite network traffic from occurring | Intrasite DC communication | If the SAM Password Changes/sec field is at or above the baseline, password change requests might be causing a network traffic bottleneck.#2 In this case, redistribute the users on each domain controller, to spread the load. | SAM Password Changes/sec |

#1
Degradation in response time is often the cause of increased amounts of requests waiting for file replication. When monitoring DRA-related fields, if the value of the DRA Sync Requests Made field minus the value of the DRA Sync Requests Successful field does not increase consistently, this can be deemed normal, as file replication is not failing.

#2
Large amounts of password change requests cause increased network traffic. The SAM Password Changes/sec field is monitored, and if it is less than the number of password changes expected for users, there should be no issues.
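
The check described in #1 can be sketched as follows. This is a minimal illustration in Python, assuming that the values of the DRA Sync Requests Made and DRA Sync Requests Successful fields have already been exported as two lists of samples in collection order. The function name and the sample values are hypothetical and are not part of PFM - Agent for Platform, and "increases monotonically" is interpreted here as a rise at every collection interval.

```python
def pending_sync_increases_monotonically(made, successful):
    """Return True if (made - successful) rises at every collection interval."""
    if len(made) != len(successful):
        raise ValueError("both fields need the same number of samples")
    # Requests that were made but have not (yet) succeeded, per sample.
    pending = [m - s for m, s in zip(made, successful)]
    # Fewer than two samples cannot show a trend.
    if len(pending) < 2:
        return False
    return all(later > earlier for earlier, later in zip(pending, pending[1:]))


# Hypothetical samples, oldest first.
made = [100, 140, 190, 250]
successful = [98, 130, 170, 215]
print(pending_sync_increases_monotonically(made, successful))
```

A result of True corresponds to the case in which Active Directory functionality might be degraded. A difference that fluctuates or stays flat corresponds to the case that #1 describes as normal.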

Table 1-12 Monitoring information for monitoring point 4

Monitoring objective Bottleneck Monitoring method and countermeasure example PI_AD record monitoring field
Monitoring network traffic between sites DC communication between sites If the byte count is at or above the baseline after compression, consider the following countermeasures:
  • Reschedule replication between sites to when CPU usage is low.
  • Integrate the sites.
DRA In Total/sec,
DRA Out Total/sec
Zone transfer Monitor to check whether zone transfer is consuming network bandwidth between sites. If network bandwidth between sites is being consumed, consider integrating the sites. Zone Transfer Failure, Zone Transfer Request Received,
Zone Transfer SOA Request Sent,
Zone Transfer Success
Monitoring the replication status to prevent large amounts of communication between domain controllers due to replication DC communication between sites Fields related to traffic are monitored to see whether they are at or above the baseline. Consider the following countermeasures if the fields are at or above the baseline:
  • Increase the speed of the network.
  • Reschedule replication between sites to when CPU usage is low.
DRA In Total/sec,
DRA Out Total/sec

(b) Prerequisites for collecting Active Directory monitoring information

To obtain performance data about Active Directory, first install Active Directory. Active Directory monitoring information cannot be obtained for environments in which Active Directory is not enabled. For details about how to perform installation, see Installing Active Directory in 5. Records.

(c) Monitoring Active Directory

To determine whether Active Directory is running properly, several basic performance information alarms are created to perform constant monitoring. If the status of these alarms is Abnormal or Warning, issues can be resolved by analyzing detailed reports. The following explains monitoring for basic performance information.

• Monitoring the operation status of the domain controller on which Active Directory is running

The basic performance of a server on which Active Directory is running is greatly affected by the performance of Active Directory itself. The following shows the alarm and reports for monitoring the operation status of servers on which Active Directory is running:

• Monitoring performance information specific to Active Directory

The following shows the PI_AD record fields for monitoring performance information specific to Active Directory.

(d) Active Directory monitoring example

When performance related to Active Directory degrades, PI_AD records can be collected and monitored to help resolve issues. The following describes the items monitored to identify bottlenecks when various problems occur:

The following explains monitoring examples for when the above problems occur. Note that these monitoring examples are for reference, and might differ depending on the user environment. Adjust the thresholds and other settings to suit the user environment.

• When the domain controller load is constantly high

High load on a domain controller is often due to frequent disk access by the Active Directory database. In this case, the issue can be resolved by revising the memory cache or buffer allocation.

Monitoring the Active Directory database cache
With Active Directory databases, records can be accessed without incurring file operations on disk by setting an appropriate cache size. This cache usage can be monitored to adjust the cache, and increase database access performance. The following table describes the fields monitored for database cache usage.

Table 1-13 Fields monitored for database cache usage

Field Description
Cache % Hit The percentage of database file page requests performed without incurring file operations, by using the database cache.
Cache Page Fault Stalls/sec The number of page faults per second for which service could not be received, because there was no page allocated from the database cache.
Cache Page Faults/sec The number of database file page requests per second required because the database cache manager allocated a new page from the database cache.
Cache Size The amount of system memory used to maintain information frequently used by the database cache manager from database files.
Table Open Cache % Hit The percentage of database tables opened using cached schema information.
Table Open Cache Hits/sec The number of database tables opened per second using cached schema information.
Table Open Cache Misses/sec The number of database tables opened per second without using cached schema information.
Table Opens/sec The number of database tables opened per second.

Monitoring examples
When the following conditions are satisfied, performance might degrade due to insufficient cache capacity:
  • Cache % Hit and Table Open Cache % Hit fall below the baseline.
  • Cache Page Fault Stalls/sec rises above the baseline.

Countermeasure example
Increase the amount of memory allocated to the Active Directory database cache.
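
The monitoring conditions above can be expressed as a simple check. The following Python sketch assumes that one collected sample and the corresponding baseline values are available as dictionaries keyed by PI_AD field name. The function name, the dictionaries, and all numeric values are hypothetical, and the two listed conditions are treated as needing to hold at the same time.

```python
def cache_may_be_insufficient(sample, baseline):
    """Return True if the cache-related conditions all hold for one sample."""
    # Both hit ratios have fallen below their baselines.
    hits_low = (
        sample["Cache % Hit"] < baseline["Cache % Hit"]
        and sample["Table Open Cache % Hit"] < baseline["Table Open Cache % Hit"]
    )
    # Page fault stalls have risen above their baseline.
    stalls_high = (
        sample["Cache Page Fault Stalls/sec"]
        > baseline["Cache Page Fault Stalls/sec"]
    )
    return hits_low and stalls_high


# Hypothetical baseline and sample values.
baseline = {
    "Cache % Hit": 95.0,
    "Table Open Cache % Hit": 90.0,
    "Cache Page Fault Stalls/sec": 1.0,
}
sample = {
    "Cache % Hit": 82.5,
    "Table Open Cache % Hit": 71.0,
    "Cache Page Fault Stalls/sec": 4.2,
}
print(cache_may_be_insufficient(sample, baseline))
```

Baseline values depend on the environment, so the numbers here only show the shape of the comparison.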

Monitoring the status of database log writes
The wait time for writing logs can be reduced by monitoring the buffer usage status for database logs, and adjusting the capacity of the log buffer accordingly. Unlike the information from Active Directory database cache monitoring, this is information about log buffer performance.

Table 1-14 Fields for monitoring the status of database log writes

Field Description
Log Record Stalls/sec The number of log records per second that could not be added due to lack of log buffer space.
Log Threads Waiting The number of threads standing by for writing log buffer data to log files, while waiting for database update to complete.
Log Writes/sec The number of times per second that log buffer data is written to log files.

Monitoring examples
When the following condition is satisfied, performance might degrade due to insufficient log buffer space:
  • Log Record Stalls/sec rises above the baseline.

Countermeasure example
Increase the amount of memory allocated to the log buffer.

• When logins are concentrated on a specific domain

Check the following fields to determine the number of sessions currently in use by Active Directory.

Table 1-15 Fields for monitoring the number of current sessions

Field Description
AB Client Sessions The number of client sessions for the connected address book.
LDAP Client Sessions The number of sessions for the connected LDAP client.

Monitoring example
When the following condition is satisfied, logins are likely concentrated on a specific domain:
  • LDAP Client Sessions rises above the baseline.

Countermeasure example
  • Even out the number of users allocated to each domain controller.
  • Distribute the number of users, such as by increasing the number of domain controllers.

• When intrasite network load is high

Intrasite network load might be high because Active Directory is performing large-scale replication within the site. The following table lists the fields for monitoring intrasite replication.

Table 1-16 Fields for monitoring intrasite replication traffic

Field Monitoring target Description
DRA In Not Compress Inbound replication The number of bytes for uncompressed data (amount of input).
DRA In Not Compress/sec Inbound replication The number of bytes per second for uncompressed data (input frequency).
DRA Out Not Compress Outbound replication The number of bytes for uncompressed data (amount of output).
DRA Out Not Compress/sec Outbound replication The number of bytes per second for uncompressed data (output frequency).

Monitoring example
When the following conditions are satisfied, intrasite network load might be high due to replication traffic within the site:
  • DRA In Not Compress/sec and DRA Out Not Compress/sec rise above the baseline.

Countermeasure example
Distribute the load, such as by increasing the number of domain controllers.

• When network load between sites is high

The network load between sites might be high because Active Directory is performing a large amount of replication between sites. Unlike intrasite replication, communication for replication between sites involves compression. The replication operation itself does not change. The following fields are for monitoring replication traffic between sites.

Table 1-17 Fields for monitoring replication traffic between sites

Field Monitoring target Description
DRA In After Compress Inbound replication The number of bytes for compressed data (amount of input).
DRA In After Compress/sec Inbound replication The number of bytes per second for compressed data (frequency of input).
DRA In Before Compress Inbound replication The number of bytes for uncompressed data (amount of input).
DRA In Before Compress/sec Inbound replication The number of bytes per second for uncompressed data (frequency of input).
DRA Out After Compress Outbound replication The number of bytes for compressed data (amount of output).
DRA Out After Compress/sec Outbound replication The number of bytes per second for compressed data (frequency of output).
DRA Out Before Compress Outbound replication The number of bytes for uncompressed data (amount of output).
DRA Out Before Compress/sec Outbound replication The number of bytes per second for uncompressed data (frequency of output).

Monitoring example
When the following conditions are satisfied, network load between sites might be high due to replication traffic:
  • DRA In After Compress/sec, DRA In Before Compress/sec, DRA Out After Compress/sec, and DRA Out Before Compress/sec rise above the baseline.

Countermeasure example
  • Schedule replication between sites when CPU usage is low.
  • Consider integrating the sites, to reduce communication between the sites.
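
When reading these fields together, it can help to compare the After Compress and Before Compress values to see how much the replication data shrinks before it is sent between sites. The following Python sketch is an illustration only: the function name and the byte counts are hypothetical, and the calculation is ordinary arithmetic rather than functionality provided by PFM - Agent for Platform.

```python
def compression_ratio(before_bytes, after_bytes):
    """Return after/before as a fraction, or None when nothing was replicated."""
    if before_bytes <= 0:
        return None
    return after_bytes / before_bytes


# Hypothetical values of DRA In Before Compress and DRA In After Compress.
before = 2_000_000
after = 500_000
ratio = compression_ratio(before, after)
print(f"compressed size is {ratio:.0%} of the original")
```

A ratio close to 1 suggests that compression is saving little bandwidth, so rescheduling replication or integrating the sites is likely to matter more than the compression itself.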

Hint
Replication is functionality for distributing the load of a database management system. If multiple copies of the database are distributed across the network, the load on lines and machines is reduced. Replication functionality can be used with Active Directory to provide advanced directory services while distributing load across machines.
Replication is an important part of directory services using Active Directory. By monitoring replication traffic, the current load can be better understood to determine any necessary steps to be taken.
Active Directory operates on the assumption that the network connection within a site is fast and reliable. Accordingly, data is not compressed when intrasite replication is performed, which avoids the overhead of compression processing.
However, when replication is performed between the domain controllers of sites, costs can be incurred due to the distances involved in normal communication between sites. This is why data is compressed when replication is performed between sites.

(8) Example of collecting information about used ports

PFM - Agent for Platform provides functionality to convert user-specific performance data output by users to text files (user-created data) into a format that can be stored in records provided by PFM - Agent for Platform (user data files). For details about user-specific performance data, see 3.2.6 Settings for collecting user-specific performance data.

The following shows an example for collecting used port information in PI_UPIB records as user-specific performance data. The following table describes the format in which used port information is stored.

Table 1-18 Format for user-created data

Option Value
tt TCP
ks The host name
lr The total number of TCP ports for the host
lr The number of currently active TCP ports for the host
lr The number of listening TCP ports for the host

To collect information:

  1. Create a batch operation for collecting information about used ports.
    In this example, a batch operation is used to collect information about used ports, as shown below.

    Batch creation example in Windows 2003 (D:\homework\sample.bat):
    @echo off
    rem Write the header lines and the option line of the user-created data.
    echo Product Name=PFM-Agent for Platform (Windows) > D:\homework\userdata.tcp
    echo FormVer=0001 >> D:\homework\userdata.tcp
    echo tt ks lr lr lr >> D:\homework\userdata.tcp
    rem Write the host name and the three TCP port counts to a temporary file, one value per line.
    hostname > D:\homework\userdata.tmp
    netstat -ap tcp | find "TCP" /C >> D:\homework\userdata.tmp
    netstat -ap tcp | find "ESTABLISHED" /C >> D:\homework\userdata.tmp
    netstat -ap tcp | find "LISTENING" /C >> D:\homework\userdata.tmp
    rem Read the four values back into variables, then delete the temporary file.
    (
    set /p ks=
    set /p lr1=
    set /p lr2=
    set /p lr3=
    ) < D:\homework\userdata.tmp
    del D:\homework\userdata.tmp
    rem Append the data line.
    echo TCP %ks% %lr1% %lr2% %lr3% >> D:\homework\userdata.tcp
    Note
    As the example batch operation shown here was created for Windows 2003, it might not operate correctly in other OSs, and might not always operate correctly on Windows 2003 due to differences in environments.
  2. Execute the batch operation created in step 1.
    The following shows the user-created data created as a result of batch execution.

    User-created data (D:\homework\userdata.tcp):
    Product Name=PFM-Agent for Platform (Windows)
    FormVer=0001
    tt ks lr lr lr
    TCP jp1ps05 15 3 12
  3. Convert the user-created data created in step 2 to a user data file.
    The following shows an example of executing the jpcuser command to convert user-created data into a user data file.

    Example of jpcuser command execution:
    "C:\Program Files\HITACHI\jp1pc\agtt\agent\jpcuser\jpcuser" PI_UPIB 
    -file D:\homework\userdata.tcp
  4. Use PFM - Agent for Platform to collect the user data file output in step 3.
    When PFM - Agent for Platform collects records, the contents of the user data file are stored in user records.
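
For environments in which a batch operation is not convenient, the counting performed in step 1 can also be sketched in Python. The following is a minimal illustration, not a replacement supplied with the product. It assumes that the text output of netstat has already been captured as a string, and it counts lines containing "TCP", "ESTABLISHED", and "LISTENING" in the same way as the find command in the batch example. The function name and the sample netstat text are hypothetical.

```python
def build_user_created_data(host_name, netstat_output):
    """Format captured netstat text as user-created data (tt ks lr lr lr)."""
    lines = netstat_output.splitlines()
    # Count matching lines, as find "..." /C does in the batch example.
    total = sum(1 for line in lines if "TCP" in line)
    established = sum(1 for line in lines if "ESTABLISHED" in line)
    listening = sum(1 for line in lines if "LISTENING" in line)
    return "\n".join([
        "Product Name=PFM-Agent for Platform (Windows)",
        "FormVer=0001",
        "tt ks lr lr lr",
        f"TCP {host_name} {total} {established} {listening}",
    ]) + "\n"


# Hypothetical netstat output captured on the monitored host.
sample_output = """
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    jp1ps05:135            jp1ps05:0              LISTENING
  TCP    jp1ps05:445            jp1ps05:0              LISTENING
  TCP    jp1ps05:1031           server1:3389           ESTABLISHED
"""
print(build_user_created_data("jp1ps05", sample_output), end="")
```

The resulting text has the same layout as the user-created data shown in step 2, so it can be passed to the jpcuser command in the same way after it is written to a file.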

(9) Example of collecting performance data from multiple hosts on which PFM products are not installed

You can use the user-created data collection functionality provided by PFM - Agent for Platform to collect performance data specific to hosts on which PFM products are not installed. You can also monitor the status of multiple hosts at the same time by converting the performance data for the hosts into a single user data file. In this case, a script such as a shell script needs to be prepared because user-created data will be created on each host on which PFM products have not been installed. The following shows an example for collecting performance data from hosts on which PFM products are not installed, and outputting it as PFM - Agent for Platform record information.

(a) Collection data

The following example obtains information using the user-created data created in (8) Example of collecting information about used ports.

(b) Prerequisites

The prerequisites for collecting performance data from multiple hosts on which PFM products are not installed are as follows:

(c) Procedures for data collection

The following figure shows the flow of data collection for hosts on which PFM products are not installed.

Figure 1-4 Flow of data collection for hosts on which PFM products are not installed

[Figure]

The following uses the numbering in the figure to explain processing. To collect performance data from multiple hosts, perform these steps for each host.

To collect data:

  1. Create user-created data for hosts on which PFM products are not installed.
    Execute the script to collect performance data, and generate user-created data. The user-created data generated in (8) Example of collecting information about used ports is used here.
  2. Copy files between remote hosts.
    Copy the user-created data created in step 1 to the hosts on which PFM products are installed. Here, user-created data is copied to the F:\nethome\ area shared between hosts, using network drive allocation. The following shows an example of the copy command.

    Example of the copy command:
    copy D:\homework\userdata.tcp F:\nethome\userdata.tcp
    Note
    When collecting user-created data from multiple hosts, make sure that the file names are unique. If file names are duplicated, files might be overwritten during file copying.
  3. Execute the jpcuser command on hosts on which PFM products are installed.
    Execute the jpcuser command on hosts on which PFM products are installed to convert the user-created data copied in step 2 to user data files. The following shows an example in which the user-created data from hosts without PFM from steps 1 and 2 is converted into a single user data file.

    Example jpcuser command:
    "C:\Program Files\HITACHI\jp1pc\agtt\agent\jpcuser\jpcuser" PI_UPIB 
    -file user-created-data-1 -file user-created-data-2 -file user-created-data-3
  4. Collect record data for hosts on which PFM products are installed.
    For hosts on which PFM products are installed, collect the contents of the user data file output in step 3 as record data.
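
The note in step 2 about unique file names and the jpcuser command line in step 3 can be combined in a short helper. The following Python sketch only builds the argument list and does not execute anything. The naming scheme userdata_<host>.tcp, the host names, and the function names are hypothetical. The jpcuser path, the PI_UPIB record ID, the -file option, and the F:\nethome\ folder are taken from the examples above.

```python
from pathlib import PureWindowsPath

JPCUSER = r"C:\Program Files\HITACHI\jp1pc\agtt\agent\jpcuser\jpcuser"


def shared_file_name(host_name, share_dir=r"F:\nethome"):
    """Return a per-host file path so that copies do not overwrite each other."""
    return str(PureWindowsPath(share_dir) / f"userdata_{host_name}.tcp")


def jpcuser_arguments(host_names):
    """Build the jpcuser argument list with one -file option per host."""
    if len(set(host_names)) != len(host_names):
        raise ValueError("host names must be unique to keep file names unique")
    args = [JPCUSER, "PI_UPIB"]
    for host in host_names:
        args += ["-file", shared_file_name(host)]
    return args


# Hypothetical hosts on which PFM products are not installed.
for arg in jpcuser_arguments(["hostA", "hostB", "hostC"]):
    print(arg)
```

Building the list from host names keeps the copy destination in step 2 and the -file arguments in step 3 consistent with each other.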

All Rights Reserved. Copyright (C) 2009, Hitachi, Ltd.