4.6.3 Example of execution for predictive error detection in the performance of systems running monitored services and the corrective action support methodology (working with Performance Management)
This subsection uses an example, based on given conditions, to explain how to use SLM to detect warning signs of errors in the performance of systems running monitored services and how to take corrective action (when working with Performance Management).
(1) Prerequisites
The conditions for this execution example are as follows:
Registration of monitored services and the setup required for predictive error detection have been completed and monitoring has already started.
The following figure shows the relationship among personnel involved in this task.
Figure 4‒13: Relationship among personnel involved in predictive error detection in the performance of systems running monitored services and the corrective action support methodology (execution example)
Person who monitors all services
Instructs the monitor to perform monitoring. If notified of a warning sign of a system performance error, this person investigates the cause. Upon determining that further investigation is needed, this person asks the system administrator to investigate.
Monitor
Uses the Home window to monitor the monitoring items for all monitored services that have been set up by the person who monitors all services. In the event of a warning or error, this person reports it immediately to the person who monitors all services.
System administrator
If requested by the person who monitors all services, this person investigates the status of the system that is providing the monitored service, such as a host or middleware, and takes corrective action.
(2) Predictive error detection in the performance of a monitored service
- Tasks in SLM
While the monitor was checking the status of monitored services in the Home window, a warning was displayed that indicated a warning sign of a service performance error.
The following figure shows a display example of the Home window when a warning is displayed for a monitored service.
Figure 4‒14: Display example of the Home window that contains a warning for a monitored service
Details of the warning displayed in this figure are as follows:
- When detected: 2020-02-14 01:14:00
- Type: OUTLIER
- Details: UPPER LIMIT
- Service group: Group01
- Service: Service01
- Monitored target: Agent01
- Monitor item: CPU<Drive name>=<C>
This warning indicates that the value of CPU<Drive name>=<C> for Service01 in Group01 obtained at 01:14:00 on February 14, 2020, was out of range (it exceeded the upper limit) and differed significantly from the usual value for the monitored service.
This example indicates that an abnormality was also detected on the monitored host.
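The meaning of an OUTLIER warning with the detail UPPER LIMIT can be illustrated with a minimal sketch. This manual section does not state how SLM derives the usual range, so the rule below (mean of past values plus k standard deviations) is an assumption chosen only for illustration, and the function names and sample values are hypothetical.

```python
from statistics import mean, stdev

def upper_limit(baseline, k=3.0):
    # Assumed rule (not SLM's documented method): the upper limit is the
    # mean of the usual values plus k sample standard deviations.
    return mean(baseline) + k * stdev(baseline)

def classify(value, baseline, k=3.0):
    # Return a warning record if the value exceeds the upper limit,
    # otherwise None. Field names mirror the warning details listed above.
    limit = upper_limit(baseline, k)
    if value > limit:
        return {"type": "OUTLIER", "details": "UPPER LIMIT", "limit": limit}
    return None

# Made-up CPU readings standing in for the "usual value" of the service.
usual_cpu = [21.0, 23.5, 22.0, 24.0, 22.5, 23.0]

warning = classify(95.0, usual_cpu)   # far above the usual range
no_warning = classify(23.0, usual_cpu)  # within the usual range
```

In this sketch, a reading that stays inside the usual range produces no record, while a reading well above it produces a record of type OUTLIER with the detail UPPER LIMIT.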
- Results of the task
The monitor reported the warning to the person who monitors all services.
Because the warning might lead to an error if left unattended, the person who monitors all services decided to take corrective action immediately.
(3) Corrective action taken after a warning sign was detected in the performance of a monitored service
- Tasks in SLM
After being notified of the warning displayed in the Home window, the person who monitors all services decided to use the Troubleshoot window to investigate the timing of the event detected as a warning, and then take corrective action.
The following figure shows a display example of the Troubleshoot window in which a warning is displayed for a monitored service.
Figure 4‒15: Display example of the Troubleshoot window in which a warning is displayed for a monitored service
This performance chart of CPU<Drive name>=<C> indicates that the event causing the warning occurred between 01:04:52 and 01:48:52.
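As a quick check of the timing described above, the following sketch computes the length of the event window and confirms that the detection time shown in the Home window (01:14:00) falls inside it. It assumes that both chart times are on the same date as the detection, February 14, 2020, which the chart description does not state explicitly.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Assumption: the chart times are on the same date as the detected warning.
start = datetime.strptime("2020-02-14 01:04:52", FMT)
end = datetime.strptime("2020-02-14 01:48:52", FMT)
detected = datetime.strptime("2020-02-14 01:14:00", FMT)

duration = end - start                   # length of the event window
in_window = start <= detected <= end     # was the warning raised inside it?

print(duration, in_window)
```

Under that assumption, the event window is 44 minutes long and the warning was detected about 9 minutes after it began.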
The person who monitors all services decided to display configuration information to check system performance. The following figure shows a display example of the Troubleshoot window that displays the configuration information.
Figure 4‒16: Display example of Troubleshoot window displaying the configuration information
In this example, a warning occurred concerning the CPU of Agent01. This indicates that some problem occurred in the computer that is providing the monitored service.
- Results of tasks
The details of the warning and the timing of the event causing the warning, which became clear from the data provided in the Troubleshoot window, indicate that this is most likely a system performance problem. Therefore, the person who monitors all services contacted the system administrator and requested a root cause investigation and corrective action.
(4) Verifying the system performance after taking corrective action
- Tasks in SLM
After the system administrator took corrective action based on the results of the root cause investigation, the person who monitors all services decided to use the Real-time Monitor window to verify that system performance had returned to normal.
The following figure shows a display example of the Real-time Monitor window showing that the system performance has returned to normal after corrective action was taken.
Figure 4‒17: Display example of the Real-time Monitor window showing that system performance has returned to normal
As shown in this figure, when system performance has returned to normal, the (normal) icon is displayed in the System performance information area.
- Results of tasks
The person who monitors all services has verified that service performance and system performance have returned to normal. This concludes the handling of the warning sign of an error in a monitored service.