Hitachi

JP1 Version 12 JP1/Integrated Management 2 - Manager Overview and System Design Guide


3.7.4 Suggestion of response actions appropriate to the system status

When a failure occurs in the system, conventional procedures require a series of steps before the failure can be addressed: reviewing the operational rules, identifying the business operations affected by the failure, checking and determining the system operation status, determining the actions that can properly address the failure, and implementing an appropriate action. As a result, a considerable amount of time passes before any action is actually taken, which places a heavy burden on system operators.

The suggestion of response actions appropriate to the system status, which is based on the established operational rules and past operational records, offers a quick way to execute a suggested response action appropriate to the status of the system. It works like this: When one of the abnormalities being monitored for is detected, an operator requests suggestions from JP1/IM. In response to this, the Intelligent Integrated Management Base suggests several response actions that are appropriate to the current status of the system. All that the operator has to do to address the abnormality is select a suggestion and execute the corresponding response action.

This solution minimizes the effort required to check the operational rules and to identify the business operations affected by the problem, and considerably reduces the time it takes to address failures in the system. It also reduces human errors and operational mistakes, and enables a smooth transition from system operation that relies heavily on human input to system operation management that does not depend on the skills of individual operators.

The following figure shows an overview of the process through which the Intelligent Integrated Management Base suggests response actions that are appropriate to the system status.

Figure 3‒20: Overview of how response actions appropriate to the system status are suggested

[Figure]

You can set the operational rules (suggestion activation criteria and response actions) to be used to suggest response actions appropriate to the system status by using the suggestion definition files. The use of linked products and the implementation of plug-ins enable advanced operation.

The following figure shows the relationship between operational rules and suggestion definition files.

Figure 3‒21: Relationship between operational rules and suggestion definition files

[Figure]

The following table describes the types of information items that can be set as suggestion activation criteria. You can combine these information items by using an OR condition or AND condition.

Table 3‒11: Information items that can be set as suggestion activation criteria

  • JP1 event: Judgment of each attribute; correlations between multiple events

  • Performance information (trend information): The number of time-series data items whose values have increased

  • Relation: Relations between jobs and infrastructures, or between jobs

  • Plug-in: Execution results of plug-ins, REST APIs, and commands
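The AND/OR combination of activation criteria can be illustrated with a minimal Python sketch. This is not the suggestion definition file format or any JP1/IM API: the `Criterion`, `AllOf`, and `AnyOf` classes, the `count_increased` helper, and the status fields (`severity`, `trend`, `plugin_exit_code`) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

# Hypothetical snapshot of the system status; not a JP1/IM data structure.
Status = Dict[str, object]


@dataclass
class Criterion:
    """One activation criterion: a named predicate over the status."""
    name: str
    check: Callable[[Status], bool]


@dataclass
class AllOf:
    """AND condition: every child criterion must hold."""
    children: List["Node"]


@dataclass
class AnyOf:
    """OR condition: at least one child criterion must hold."""
    children: List["Node"]


Node = Union[Criterion, AllOf, AnyOf]


def count_increased(series: Dict[str, List[float]]) -> int:
    """Count time series whose latest value is above their first value."""
    return sum(1 for v in series.values() if len(v) >= 2 and v[-1] > v[0])


def evaluate(node: Node, status: Status) -> bool:
    """Recursively evaluate a criterion tree against the status."""
    if isinstance(node, Criterion):
        return node.check(status)
    if isinstance(node, AllOf):
        return all(evaluate(child, status) for child in node.children)
    if isinstance(node, AnyOf):
        return any(evaluate(child, status) for child in node.children)
    raise TypeError(f"unsupported node type: {type(node).__name__}")


# Illustrative rule: (event severity is Error or Warning) AND
# (at least two metrics are rising OR a plug-in check succeeded).
RULE = AllOf([
    Criterion("severity", lambda s: s["severity"] in {"Error", "Warning"}),
    AnyOf([
        Criterion("trend", lambda s: count_increased(s["trend"]) >= 2),
        Criterion("plugin", lambda s: s.get("plugin_exit_code") == 0),
    ]),
])

STATUS: Status = {
    "severity": "Error",
    "trend": {"cpu": [40, 55, 70], "mem": [60, 58, 61], "disk": [30, 30, 30]},
    "plugin_exit_code": 1,
}

if __name__ == "__main__":
    print(evaluate(RULE, STATUS))
```

In this example the rule is satisfied because the severity matches and two of the three series (cpu and mem) end higher than they started, even though the plug-in criterion does not hold.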

The following table describes the response actions that can be executed. Note that the information items referenced by suggestion activation criteria can be used as variables.

Table 3‒12: Response actions that can be executed

  • Change of the event status: Changes can be made to the event status of JP1 events.

  • Window display: The Web window can be displayed to check the business impact and linked products.

  • Plug-ins, etc.: Plug-ins, REST APIs, and commands can be executed.
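The idea of reusing criterion information as variables in a response action can be sketched as simple placeholder substitution. The sketch below uses Python's standard `string.Template`; the `$name` placeholder syntax, the `render_action` function, and the variable names are assumptions for illustration only and do not represent the variable notation used in suggestion definition files.

```python
from string import Template
from typing import Mapping


def render_action(action_template: str, criterion_values: Mapping[str, str]) -> str:
    """Fill placeholders in an action description with criterion values.

    Raises KeyError if the template refers to a value that was not collected,
    so an incomplete action is never produced silently.
    """
    return Template(action_template).substitute(criterion_values)


if __name__ == "__main__":
    # Hypothetical values gathered while judging the activation criteria.
    values = {"host": "web01", "event_id": "E-123"}
    print(render_action("Check jobs on $host affected by event $event_id", values))
```

Failing on a missing value is a deliberate choice here: a response action built from partial information is usually worse than no action at all.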

You can define suggested response actions in suggestion definition files and apply them by executing the jddupdatesuggestion command. In the tree view in the Integrated Operation Viewer window, the [Figure] icon is displayed beside the IM management nodes for which the Suggestion tab can be displayed. Furthermore, the following suggestion icons are displayed beside IM management nodes where a failure defined as Emergency/Alert/Critical, Error, or Warning has occurred:

When a user clicks an IM management node with one of these icons displayed beside it, the suggestion definitions mapped to the clicked IM management node are acquired, and a judgment is made as to whether the JP1 permission granted to the logged-in user matches the suggestion display criteria defined in the suggestion definitions. If the user's permission level matches the suggestion display criteria, a list of response actions is displayed on the Suggestion tab.

When the user clicks the Suggestion button, the Intelligent Integrated Management Base automatically determines the system status, and activates and displays the suggestion definitions matching the suggestion activation criteria so that one of them can be executed as the response action.

After examining the details of the defined criteria and checking the specifics of the response actions that can be executed by the activated suggestions, the user determines which suggestion to execute. When the user clicks the Execute button, the response action is executed. This completes the response action necessary to address the failure.
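The flow described above (acquire the definitions mapped to the node, filter them by the user's permission, activate those whose criteria match the system status, then execute the selected one) can be sketched as follows. This is a simplified model, not the product's implementation: the `SuggestionDefinition` class, the permission names, the node ID, and the status fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class SuggestionDefinition:
    """Hypothetical in-memory form of one suggestion definition."""
    name: str
    node_id: str                                   # IM management node it is mapped to
    display_permissions: FrozenSet[str]            # who may see it on the Suggestion tab
    is_activated: Callable[[Dict[str, object]], bool]  # activation criteria
    action: Callable[[], str]                      # response action to execute


def displayable(defs: List[SuggestionDefinition], node_id: str,
                user_permissions: Set[str]) -> List[SuggestionDefinition]:
    """Step 1: definitions mapped to the node that the user is allowed to see."""
    return [d for d in defs
            if d.node_id == node_id and d.display_permissions & user_permissions]


def activated(defs: List[SuggestionDefinition],
              status: Dict[str, object]) -> List[SuggestionDefinition]:
    """Step 2: definitions whose activation criteria match the current status."""
    return [d for d in defs if d.is_activated(status)]


DEFS = [
    SuggestionDefinition("mark-processed", "node-1", frozenset({"operator", "admin"}),
                         lambda s: s["cpu"] < 50, lambda: "event marked as Processed"),
    SuggestionDefinition("investigate-cpu", "node-1", frozenset({"admin"}),
                         lambda s: s["cpu"] >= 90, lambda: "opened investigation view"),
]

if __name__ == "__main__":
    shown = displayable(DEFS, "node-1", {"operator"})   # user clicks the node
    active = activated(shown, {"cpu": 30})              # user clicks Suggestion
    print([d.name for d in active])
    print(active[0].action())                           # user clicks Execute
```

Keeping the permission filter and the activation check as separate steps mirrors the two user actions in the text: clicking the node and clicking the Suggestion button.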


(1) Criterion information cache

The criterion information collected for judgment of suggestion activation criteria can be cached. This reduces the load placed on the location where criterion information originates from. Whether to create or refer to the cache is specified in the suggestion definition file on a suggestion-activation-criterion basis. If this is not specified, the cache is not created or referred to.

Each suggestion activation criterion has a key specified in it. The criterion information is cached on a per-key basis. When judgment of a suggestion activation criterion having the same key has to be made while the corresponding cache entry has still not expired, the cache entry is used for judgment instead of collecting the relevant criterion information. Caches are available on a per-system basis. This means the same cache is referenced and updated by all users logging in to the same system.
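The per-key caching behavior described above can be sketched as a small expiring cache shared by all callers. This is a minimal illustration under stated assumptions: the `CriterionCache` class, the `ttl_seconds` expiry period, and the injectable `clock` are hypothetical and do not reflect how JP1/IM stores or configures its cache.

```python
import threading
import time
from typing import Callable, Dict, Tuple


class CriterionCache:
    """System-wide cache of criterion information, keyed per criterion key."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()  # the cache is shared by all users
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get_or_collect(self, key: str, collect: Callable[[], object]) -> object:
        """Return the cached value for key, collecting it only if missing or expired."""
        with self._lock:
            now = self._clock()
            entry = self._entries.get(key)
            if entry is not None and now < entry[0]:
                return entry[1]                    # entry has not expired: reuse it
            value = collect()                      # e.g. run a command or call a REST API
            self._entries[key] = (now + self._ttl, value)
            return value


if __name__ == "__main__":
    cache = CriterionCache(ttl_seconds=60)
    calls = []

    def collect() -> int:
        calls.append(1)
        return len(calls)

    print(cache.get_or_collect("cpu-usage", collect))  # collected
    print(cache.get_or_collect("cpu-usage", collect))  # served from the cache
```

Holding the lock while collecting keeps the sketch simple and guarantees that concurrent judgments with the same key trigger only one collection, at the cost of serializing collections for different keys.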

You can view the response actions executed last time in the Suggestion details area on the Suggestion tab. Alternatively, you can acquire the same information by using the previous execution history acquisition API.

The history of response actions executed in the past, including those executed last time, is output to the response action execution history file (jddSuggestionHistory.log). The start, end, and failure of response actions are output as JP1 events.

For suggestion activation criteria that place a load on the host where commands or REST APIs are executed, the criterion information cache can reduce the number of times those commands and REST APIs are run. Consider using the criterion information cache for such criteria.

(2) Suggestion template

JP1/IM offers several suggestion templates to cater to the operational needs that are expected based on past inquiries and requests from customers. You can apply the suggestion templates to the system by using the procedure below.

For details about the individual suggestion templates, see the manuals of the respective products.

  1. See the documentation of the product to which you want to apply the suggestion template, and check the prerequisites and procedure for doing so.

  2. To apply all the suggestion templates, specify the en or ja suggestion template file storage folder according to the language you use, and execute the jddupdatesuggestion command.

  3. To apply only some of the suggestion templates, or to customize a suggestion template, perform either of the following procedures:

    • Remove or edit the conf file in the en or ja suggestion template file storage folder, specify the en or ja folder, and then execute the jddupdatesuggestion command.

    • Copy the conf file for the suggestion template you want to apply into a given directory, specify the directory where the conf file is copied, and then execute the jddupdatesuggestion command.
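The second procedure (copying only the conf files you want into a separate directory) can be sketched as below. The `stage_templates` function and all paths are hypothetical; the actual storage folder is given in Appendix A, and the sketch deliberately stops before running the jddupdatesuggestion command, whose syntax should be taken from the command reference.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def stage_templates(template_dir: Path, names: Iterable[str],
                    work_dir: Path) -> List[Path]:
    """Copy the selected template conf files into work_dir and return the copies.

    template_dir stands for the en or ja suggestion template storage folder;
    work_dir is the directory to specify when applying the definitions.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    staged: List[Path] = []
    for name in names:
        source = template_dir / name
        if source.suffix != ".conf" or not source.is_file():
            raise FileNotFoundError(f"not a template conf file: {source}")
        staged.append(Path(shutil.copy2(source, work_dir / name)))
    return staged


if __name__ == "__main__":
    import tempfile

    # Demonstration with throwaway folders and made-up file names.
    with tempfile.TemporaryDirectory() as tmp:
        templates = Path(tmp) / "en"
        templates.mkdir()
        (templates / "sample_a.conf").write_text("# hypothetical template\n")
        (templates / "sample_b.conf").write_text("# hypothetical template\n")
        copies = stage_templates(templates, ["sample_a.conf"], Path(tmp) / "work")
        print([p.name for p in copies])
```

Working on copies leaves the shipped template folder untouched, so the original templates remain available if a customization needs to be redone.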

For details about the storage location of the suggestion template files, see Appendix A. Files and Directories.

For details about the jddupdatesuggestion command, see jddupdatesuggestion in Chapter 1. Commands in the manual JP1/Integrated Management 2 - Manager Command, Definition File and API Reference.

The rest of this subsection describes operation examples that use the suggestion templates.

(a) Suggestions provided to help investigate the business impact of the problem and respond to it (linkage with JP1/AJS3 and JP1/PFM)

Suppose that after the occurrence of a JP1 event reporting the shutdown of a host, JP1/PFM detects that the host is still down. In this case, suggestions are made as to how to identify the business operations that are potentially affected by this problem. Quickly identifying the business operations affected by the shutdown of the host can help avoid abnormal termination of jobs and prevent the problem from causing widespread impacts on business operations.

Figure 3‒22: Overview of the suggestion of methods to check the business operations that are potentially affected by the shutdown of the host

[Figure]

By selecting one of the root jobnets retrieved by the search (that is, the root jobnets that are potentially affected by the problem) and opening JP1/AJS3 - Web Console, you can check the details of the selected root jobnet. By checking the detailed execution status of the root jobnet, you can take actions such as suspending the root jobnet or making temporary changes to it.

To search for and check the execution agents that are potentially affected by the problem, filter the retrieved execution agents by name of the target host whose shutdown has been detected, and then open JP1/AJS3 - Web Console. In this way, you can manipulate execution agents that are relevant to your operation.

(b) Suggestions provided to help determine whether corrective action must be taken (linkage with JP1/PFM)

The Intelligent Integrated Management Base automatically evaluates the current resource status from CPU usage levels and suggests response actions appropriate to that status. For example, when CPU usage is constantly high, it suggests ways to investigate the cause of high CPU usage, and when CPU usage is normal, it suggests marking the event as Processed.
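The decision between the two suggestions can be illustrated with a small sketch. The thresholds, the `suggest_for_cpu` function, and the returned labels are assumptions for illustration; the source does not state how "constantly high" or "normal" is judged, nor what happens when usage is only intermittently high (this sketch returns no suggestion in that case).

```python
from typing import Optional, Sequence


def suggest_for_cpu(samples: Sequence[float],
                    high_threshold: float = 90.0,
                    sustained_ratio: float = 0.8) -> Optional[str]:
    """Pick a suggestion label from recent CPU usage samples (percent).

    - "investigate-cpu-cause": usage is at or above high_threshold in at least
      sustained_ratio of the samples (treated here as "constantly high").
    - "mark-event-processed": every sample is below high_threshold ("normal").
    - None: usage is only intermittently high; the sketch makes no suggestion.
    """
    if not samples:
        raise ValueError("at least one CPU usage sample is required")
    high = sum(1 for s in samples if s >= high_threshold)
    if high / len(samples) >= sustained_ratio:
        return "investigate-cpu-cause"
    if high == 0:
        return "mark-event-processed"
    return None


if __name__ == "__main__":
    print(suggest_for_cpu([95, 96, 97, 92, 99]))
    print(suggest_for_cpu([20, 30, 25]))
    print(suggest_for_cpu([95, 20, 30]))
```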

Figure 3‒23: Overview of suggestions provided to help determine whether corrective action must be taken

[Figure]

When there are multiple JP1 events that no longer need to be addressed because the processes in question have recovered their normal state, you can mark all such JP1 events as Processed in one go. In this way, you can directly escalate only abnormal metrics without having to go through the trouble of isolating them first.

Figure 3‒24: Overview of a suggestion to mark all JP1 events that no longer need to be addressed as Processed in one go

[Figure]

(c) Suggestion to take alternative action to accommodate maintenance (linkage with JP1/PFM)

Based on the pattern of issued JP1 events, which is determined from health checks and the frequency of alarms, the Intelligent Integrated Management Base judges whether maintenance is likely to be in progress and suggests suspending monitoring. Suspended monitoring can be resumed from the same window in which it was suspended.

Figure 3‒25: Overview of suggestion to take alternative action to accommodate maintenance

[Figure]