Hitachi

JP1 Version 12 JP1/Performance Management Planning and Configuration Guide


2.4.3 Examining the items to be monitored

The purpose of an operation monitoring system is to detect, before problems occur, that the system is approaching a critical state, and to prevent those problems. As such, examining the items to be monitored in the system is of the utmost importance. When examining items to be monitored, you must decide which items to monitor and how to monitor them. When selecting the items to be monitored, see the manual for each PFM - Agent or PFM - RM.

The process flow for examining items to be monitored is as follows:

  1. Examine alarms.

    After deciding what items are to be monitored in the system, you need to decide the thresholds for the items. For example, to keep a shared server from malfunctioning, you can monitor the percentage of free space on the server's logical disk drive, and decide an appropriate threshold.

    Next, you decide the triggers for sending alarms. The possible triggers are as follows:

    • When a monitored item exceeds the preset threshold for the first time

    • Each time alarm evaluation is performed as long as a monitored item exceeds the preset threshold

    You also decide the method for notifying the system administrator when a threshold is reached. For example, a notification can be sent by email or SNMP trap.

  2. Examine reports.

    So that you can analyze the cause of a problem and understand the current status, decide how reports are to be displayed when a threshold is exceeded and an alarm occurs. For example, decide both the items to be shown and the display method, such as a bar graph showing the ten logical hard drives with the least amount of free space.
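The report example above (the logical drives with the least free space) can be sketched as a simple selection over collected values. This is only an illustration: the function name, the drive names, and the percentages are hypothetical and are not taken from Performance Management.

```python
from typing import Dict, List, Tuple


def least_free_drives(free_pct_by_drive: Dict[str, float],
                      count: int = 10) -> List[Tuple[str, float]]:
    """Return up to `count` (drive, free %) pairs, least free space first."""
    # Sort by the free-space percentage in ascending order and keep the first `count`.
    return sorted(free_pct_by_drive.items(), key=lambda item: item[1])[:count]


# Hypothetical sample values, for illustration only.
sample = {"C:": 12.5, "D:": 48.0, "E:": 3.2, "F:": 75.1}
print(least_free_drives(sample, count=2))
```

A bar graph would then plot these pairs, with the drives on one axis and the free-space percentage on the other.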

When selecting items to be monitored, you can use the monitoring template provided by Performance Management. Using the monitoring template as is, or customizing part of its definition, reduces the work the system administrator must do to define the items to be monitored.

Point:

Selecting parameters for the programs being monitored is not just a technical matter. We recommend that you also consider the job characteristics and the operating structure of the system.

(1) Examining alarms

(a) How to set the threshold

Performance Management can issue an alarm event when the performance data collected by PFM - Agent or PFM - RM reaches a pre-defined threshold. The system administrator needs to decide, for each item being monitored, the value that causes an alarm event when it is exceeded. With Performance Management, the conditions that cause an alarm event to be issued can be defined for specified periods of time.

For example, you could set up the following configurations:

  • Specify separate settings for the processes to be monitored during the day and during the night

  • During the daytime, when a system operator is constantly monitoring from a monitoring center, specify notification by a blinking icon on the monitoring console. At night, specify that an email is to be sent to the system administrator's mobile phone.

The system administrator decides the time periods during which the system is to be monitored.
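The day and night example can be sketched as a choice of notification method by time of day. This is a minimal illustration, not Performance Management's configuration format: the 09:00 to 18:00 daytime window, the function name, and the method labels are all assumptions made for the example.

```python
from datetime import time

# Hypothetical daytime window; the real periods are decided by the system administrator.
DAY_START = time(9, 0)
DAY_END = time(18, 0)


def notification_method(now: time) -> str:
    """Return the notification method to use at the given time of day."""
    if DAY_START <= now < DAY_END:
        # An operator is watching the monitoring console during the day.
        return "console_icon"
    # Nobody is at the console at night, so email the administrator.
    return "email"


print(notification_method(time(14, 30)))
print(notification_method(time(2, 15)))
```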

Performance Management can be configured so that an alarm event does not occur if the threshold is exceeded only because of a temporary load increase. If you configure Performance Management to send a notification only after the threshold has been exceeded a certain number of times over a certain number of monitoring intervals, then, when monitoring CPU usage for example, notification occurs only when the CPU is under a heavy, continuous load. By suppressing notifications for temporary load increases, you can ensure that alarms are issued efficiently, taking into account the system attributes.
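The suppression of temporary load increases can be sketched as a sliding-window check. The sketch assumes one reading of the setting, namely "notify when the threshold has been exceeded at least n times within the last m evaluations". The class name, parameter names, and sample values are hypothetical and do not represent the product's implementation.

```python
from collections import deque


class DampedAlarm:
    """Notify only when the threshold is exceeded at least `occurrences`
    times within the last `intervals` evaluations (assumed semantics)."""

    def __init__(self, threshold: float, occurrences: int, intervals: int):
        if not 1 <= occurrences <= intervals:
            raise ValueError("occurrences must be between 1 and intervals")
        self.threshold = threshold
        self.occurrences = occurrences
        # Sliding window holding one exceeded/not-exceeded flag per evaluation.
        self.recent = deque(maxlen=intervals)

    def evaluate(self, value: float) -> bool:
        """Record one sample and return True if a notification is due."""
        self.recent.append(value > self.threshold)
        return sum(self.recent) >= self.occurrences


# Hypothetical CPU usage samples (%): one brief spike, then a sustained load.
alarm = DampedAlarm(threshold=90.0, occurrences=3, intervals=5)
samples = [95, 40, 50, 45, 60, 92, 96, 97]
results = [alarm.evaluate(s) for s in samples]
print(results)
```

With these values, the single spike at the start does not produce a notification; only the last sample, which is the third consecutive value above the threshold, does.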

(b) Triggers for sending alarms

You can send alarm events when the triggers below occur. The system administrator needs to select the appropriate triggers based on the monitoring requirements.

  • Send an alarm when a monitored item exceeds the preset threshold for the first time.

    In this case, further options are available:

    • Send an alarm if the alarm status changes.

    • For multi-instance records, send an alarm if the alarm status for an instance in a record changes.

  • Send an alarm each time alarm evaluation is performed as long as a monitored item exceeds the preset threshold.
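The difference between the two main triggers can be sketched as follows. This is a simplification under stated assumptions: the alarm status is reduced to a single exceeded/not-exceeded flag, the per-instance option for multi-instance records is not modeled, and the names `Trigger`, `should_notify`, and `notifications` are hypothetical.

```python
from enum import Enum
from typing import List


class Trigger(Enum):
    ON_FIRST_EXCEED = "first"      # notify when the threshold is first exceeded
    EVERY_EVALUATION = "always"    # notify at every evaluation while exceeded


def should_notify(trigger: Trigger, previous_exceeded: bool,
                  current_exceeded: bool) -> bool:
    """Decide whether one alarm evaluation sends an alarm."""
    if trigger is Trigger.EVERY_EVALUATION:
        return current_exceeded
    # ON_FIRST_EXCEED: only the transition from normal to exceeded notifies.
    return current_exceeded and not previous_exceeded


def notifications(trigger: Trigger, exceeded_flags: List[bool]) -> List[bool]:
    """Apply should_notify over a sequence of evaluation results."""
    previous = False
    sent = []
    for current in exceeded_flags:
        sent.append(should_notify(trigger, previous, current))
        previous = current
    return sent


flags = [False, True, True, False, True]
print(notifications(Trigger.ON_FIRST_EXCEED, flags))
print(notifications(Trigger.EVERY_EVALUATION, flags))
```

The first trigger keeps the number of notifications low, while the second keeps reminding the administrator for as long as the condition persists.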

(c) Action to be taken if an alarm is sent

The system administrator needs to decide the following in case a monitored item reaches a critical level: how to locate the problem, the primary measures to be taken, the person who takes them, and how to resolve the problem if the primary measures fail. Performance Management can automatically execute the following actions when an alarm event occurs:

  • Notify the system administrator by email

  • Send an SNMP trap

  • Execute commands, including net send (messenger service) and wall, to notify the system administrator without sending an email or SNMP trap

  • Issue a JP1 event to link with other JP1 products

The system administrator examines the measures to be taken when the operation monitoring system sends an alarm, including use of the functions above.

The following figure shows an example of the process to follow when an alarm occurs.

Figure 2‒10: Example of the process to follow when a monitored system reaches a critical state

[Figure]

Point:

When an alarm event occurs, if the system administrator wants to automatically execute a recovery program and return the system to the normal operating status, we recommend that the system be set up to issue a JP1 event and link with systems such as a job management system.

(2) Examining reports

(a) What type of report to use

Performance Management can create a real-time report to indicate the current operation status and a historical report to show long-term trends in the operation status. Based on the performance data, the system administrator examines what types of report need to be created in order to check the operation status. Creating easy-to-understand reports allows problems with the system to be understood correctly.

Performance Management can display reports on a daily, weekly, monthly, or yearly basis. You can configure Performance Management to display a report from the alarm icon when an alarm event occurs, or define an association between related reports that allows you to open a different report using a drill-down operation.

Performance Management can also display multiple reports combined on the same graph, allowing you to determine the operating status of the overall system in a comprehensive manner.

If reports need to be output regularly in order to analyze operations over a long period of time, you also need to consider the formats in which to output the reports. In Performance Management, you can display reports in the PFM - Web Console GUI, or use an operation command (the jpcrpt command) to output the report to a text file in CSV or HTML format.

For details about how to define and output a report, see the chapter that explains the creation of reports for operation analysis in the JP1/Performance Management User's Guide.

(b) What items to save in the database

The system administrator decides the following regarding the performance data collected by monitoring agents:

  • Whether to record the performance data in the Store database

    To display the performance data as a historical report, the performance data to be displayed must be configured so that it is saved in the Store database.

  • Performance data collection interval and timing

    When many items are monitored, system performance might suffer because data collection and recording processing might be concentrated at specific time periods. In such a case, the load on the system can be distributed over time if the performance data collection is staggered by item.

    For example, if the performance data for two items is being collected every minute, the offset of one can be configured to 0 seconds and the other to 20 seconds in order to stagger the collection starting time by 20 seconds. When changing the offset, consider the load due to performance data collection and then specify the setting.

    Figure 2‒11: Example of the performance data collection interval and offset configuration

    [Figure]
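The interval and offset example can be worked through with a short calculation. The sketch below lists the collection start times for two items collected every 60 seconds with offsets of 0 and 20 seconds. The function name and the 180-second horizon are assumptions made for illustration; the sketch does not read or write any Performance Management setting.

```python
from typing import List


def collection_times(interval_s: int, offset_s: int, horizon_s: int) -> List[int]:
    """Return collection start times, in seconds from a common reference
    time, that fall before `horizon_s`."""
    if not 0 <= offset_s < interval_s:
        raise ValueError("offset must be at least 0 and less than the interval")
    return list(range(offset_s, horizon_s, interval_s))


item_a = collection_times(interval_s=60, offset_s=0, horizon_s=180)
item_b = collection_times(interval_s=60, offset_s=20, horizon_s=180)
print(item_a)  # [0, 60, 120]
print(item_b)  # [20, 80, 140]
```

Because the two lists share no start time, the collection and recording load for the two items is spread out rather than concentrated at the same moment.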

For details about how to record the performance data, see the chapter that describes the management of operation monitoring data in the JP1/Performance Management User's Guide.