4.1 Analyzing the root cause of a failure
When a failure occurs, the Monitoring Manager uses its root cause analysis functionality to examine and filter the correlations among the large number of resulting events. It analyzes the failure against the Layer 2 and Layer 3 topologies to identify the root cause, and then reports that root cause as an incident. The Monitoring Manager also manages the progress of incident handling (the lifecycle state), from the occurrence of the problem to its resolution.
The following example of monitoring a network device (a router) shows how the root cause is analyzed.
If the Router03 node goes down, the many interfaces and IP addresses on Router03 stop responding.
As a result, a large number of failure events occur for the interface failures and the unresponsive IP addresses.
The Monitoring Manager determines that the lack of response from the IP addresses was caused by the interface failures, and suppresses the corresponding incidents.
Because neighboring nodes have lost communication with Router03, the Monitoring Manager determines that the root cause is the Router03 node going down. It also determines that the interface failures were caused by the node going down, and associates them with the Router03 node-down incident.
The Router03 node going down is reported as the root-cause incident.
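The Router03 walkthrough above can be sketched in a few lines of Python. This is only an illustration of the correlation steps as described, not the Monitoring Manager's actual implementation. The `Event` and `Incident` classes, the `analyze` function, the `nodes_lost_by_neighbors` parameter, the interface names, and the IP addresses are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    kind: str                      # "interface_down" or "address_not_responding"
    node: str                      # node that owns the interface or address
    interface: str                 # interface the event concerns
    address: Optional[str] = None  # set only for address events


@dataclass
class Incident:
    description: str
    correlated: list = field(default_factory=list)  # symptoms tied to this cause


def analyze(events, nodes_lost_by_neighbors):
    """Return (incidents to report, suppressed events).

    nodes_lost_by_neighbors is the set of nodes that neighboring nodes can
    no longer communicate with (assumed to be known from polling).
    """
    down_ifaces = {(e.node, e.interface) for e in events
                   if e.kind == "interface_down"}

    # An address that stops responding on a failed interface is only a
    # symptom of that interface failure, so its incident is suppressed.
    suppressed = [e for e in events
                  if e.kind == "address_not_responding"
                  and (e.node, e.interface) in down_ifaces]
    remaining = [e for e in events if e not in suppressed]

    # A node that its neighbors cannot reach is treated as down, and the
    # remaining events on that node are associated with the node-down cause.
    incidents = {n: Incident(f"Node down: {n}")
                 for n in sorted(nodes_lost_by_neighbors)}
    for e in remaining:
        if e.node in nodes_lost_by_neighbors:
            incidents[e.node].correlated.append(e)
        else:
            key = (e.node, e.interface, e.address)
            incidents[key] = Incident(f"{e.kind}: {e.node}/{e.interface}")
    return list(incidents.values()), suppressed


events = [
    Event("interface_down", "Router03", "if1"),
    Event("interface_down", "Router03", "if2"),
    Event("address_not_responding", "Router03", "if1", "192.0.2.1"),
    Event("address_not_responding", "Router03", "if2", "192.0.2.5"),
]
incidents, suppressed = analyze(events, {"Router03"})
for incident in incidents:
    print(incident.description, "with", len(incident.correlated), "correlated events")
```

In this sketch, the four raw events collapse into a single reported incident for Router03. The two interface failures are attached to it, and the two address events are suppressed.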
In addition, the Monitoring Manager uses Layer 2 topology information to analyze the root cause even when a failure affects multiple nodes that make up a network. The following table shows examples of root cause analysis based on a Layer 2 topology network configuration.
| Analysis of a Layer 2 topology | Description |
|---|---|
| Normal time | The Monitoring Manager is connected to the top-level switch S1, and all of the monitored networks are in the normal status. |
| When a failure occurs in the top-level switch | Detailed failure: The top-level switch S1 went down.<br>Events that occur:<br>The Monitoring Manager handles this situation as follows:<br>As a result, the Monitoring Manager reports only the failure of S1 as the root-cause incident. |
| When a failure occurs in a middle-level switch | Detailed failure: The middle-level switch C2 went down.<br>Events that occur:<br>The Monitoring Manager handles this situation as follows:<br>As a result, the Monitoring Manager reports only the failure of C2 as the root-cause incident. |
The Monitoring Manager can also analyze many other relationships between events and their root causes.
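The two failure cases in the table can be illustrated with a small reachability sketch. It assumes a simplified rule: walking the Layer 2 topology outward from the Monitoring Manager, the first unresponsive node on each path is a root cause, and unresponsive nodes that can only be reached through a failed node are not reported. Only S1 and C2 appear in the table. The other node names (`C1`, `A1` to `A4`), the tree layout, and the `find_root_causes` function are hypothetical, and the product's actual analysis may differ.

```python
from collections import deque

# Hypothetical Layer 2 topology: the manager attaches to top-level switch S1,
# which connects middle-level switches C1 and C2, each with two lower nodes.
TOPOLOGY = {
    "manager": ["S1"],
    "S1": ["manager", "C1", "C2"],
    "C1": ["S1", "A1", "A2"],
    "C2": ["S1", "A3", "A4"],
    "A1": ["C1"], "A2": ["C1"],
    "A3": ["C2"], "A4": ["C2"],
}


def find_root_causes(topology, start, unresponsive):
    """Split unresponsive nodes into root causes and nodes hidden behind them."""
    root_causes = set()
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in topology[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            if neighbor in unresponsive:
                # First failed node on a path from the manager: report it,
                # and do not search past it.
                root_causes.add(neighbor)
            else:
                queue.append(neighbor)
    # Everything else that is unresponsive sits behind a failed node.
    hidden = set(unresponsive) - root_causes
    return sorted(root_causes), sorted(hidden)


# Top-level switch S1 goes down: no monitored node responds.
print(find_root_causes(TOPOLOGY, "manager",
                       {"S1", "C1", "C2", "A1", "A2", "A3", "A4"}))
# (['S1'], ['A1', 'A2', 'A3', 'A4', 'C1', 'C2'])

# Middle-level switch C2 goes down: C2 and the nodes behind it stop responding.
print(find_root_causes(TOPOLOGY, "manager", {"C2", "A3", "A4"}))
# (['C2'], ['A3', 'A4'])
```

A breadth-first search fits here because it stops at each failed node. If the topology had redundant paths, a node that is still reachable some other way would be judged on its own status rather than hidden.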