4.1 Analyzing the root cause of a failure

When a failure occurs, a large number of events are generated. The Monitoring Manager uses its root cause analysis functionality to investigate how these events are correlated and to filter them. It analyzes the failure based on the Layer 2 topology and Layer 3 topology to identify the root cause, and then reports the root cause as an incident. The Monitoring Manager also manages the progress of incident handling (the lifecycle state), from the occurrence of the problem to its resolution.

The following example, in which a network device (a router) is monitored, shows how the root cause is analyzed.

[Figure]

If the Router03 node goes down, the many interfaces and IP addresses on Router03 stop responding.
A large number of failure events occur due to interface failures and no response from IP addresses.
The Monitoring Manager decides that the lack of responses from IP addresses was caused by the interface failures, and then suppresses the corresponding incidents.
Based on the fact that communication was lost at neighboring nodes, the Monitoring Manager decides the root cause is the Router03 node going down. The Monitoring Manager also decides that the interface failures were caused by the node going down, and associates the interface failures with the Router03 node going down.
The Router03 node going down is reported as the root-cause incident.
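The correlation steps above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the Monitoring Manager's actual implementation: the `Incident` class, the `correlate` function, the interface names, and the addresses are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    source: str                 # node, interface, or IP address name
    kind: str                   # "node_down", "interface_down", "address_not_responding"
    suppressed: bool = False
    children: list = field(default_factory=list)


def correlate(node, interfaces, neighbors_lost_contact):
    """Return the incidents to report for one node.

    `interfaces` maps each failed interface (hypothetical names) to the
    list of its IP addresses that stopped responding.
    """
    interface_incidents = []
    for if_name, addresses in interfaces.items():
        if_incident = Incident(if_name, "interface_down")
        # Address failures are explained by the interface failure,
        # so they are suppressed and attached to that interface.
        for addr in addresses:
            if_incident.children.append(
                Incident(addr, "address_not_responding", suppressed=True)
            )
        interface_incidents.append(if_incident)

    if not neighbors_lost_contact:
        # No evidence that the whole node is down (assumption of this sketch):
        # each interface failure is reported on its own.
        return interface_incidents

    # Neighboring nodes lost communication: the node going down is the
    # root cause, and the interface failures are associated with it.
    root = Incident(node, "node_down", children=interface_incidents)
    return [root]


reported = correlate(
    "Router03",
    {"if1": ["192.0.2.1"], "if2": ["192.0.2.5", "192.0.2.9"]},
    neighbors_lost_contact=True,
)
print([(i.kind, i.source) for i in reported])  # [('node_down', 'Router03')]
```

Only one incident reaches the operator, while the interface and address failures remain available as associated details under it.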

In addition, the Monitoring Manager effectively uses Layer 2 topology information to analyze the root cause even for multiple nodes that compose a network. The following table shows examples of analyzing the root cause by using a Layer 2 topology network configuration.


Analysis of a Layer 2 topology

Description

Normal time

[Figure]

The Monitoring Manager is connected to the top-level switch S1, and all monitored networks are in the normal status.

When a failure occurs in the top-level switch

[Figure]

Detailed failure: The top-level switch S1 went down.

Events that occur:

Communication with S1 is unavailable.
Communication with other switches via S1 is unavailable.

The Monitoring Manager handles this situation as follows:

Detects the failure of the S1 node.
Decides that the failure of S1 caused communication via S1 to be unavailable, suppresses the corresponding incidents, and then treats the status of the switches that can be reached only via S1 as unclear.

As a result, the Monitoring Manager reports only the failure of S1 as the root-cause incident.

When a failure occurs in a middle-level switch

[Figure]

Detailed failure: The middle-level switch C2 went down.

Events that occur:

Communication with C2 is unavailable.
Each node interface connecting with C2 went down.

The Monitoring Manager handles this situation as follows:

Detects the failure of the C2 node.
Decides that each interface connecting with C2 went down because of the failure of C2, and then suppresses the corresponding incidents.

As a result, the Monitoring Manager reports only the failure of C2 as the root-cause incident.
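The two failure cases in the table can be approximated with a reachability check over the Layer 2 topology. The following is a minimal sketch under stated assumptions: the adjacency map is invented for illustration (only S1 and C2 are named in this section; `C1`, `E1`, `E2`, and `E3` are hypothetical), and the function names are not part of the product. The sketch also applies the "unclear" rule to any node cut off by the failed switch, which the table states only for the S1 case.

```python
from collections import deque

# Hypothetical Layer 2 adjacency, as seen from the Monitoring Manager.
TOPOLOGY = {
    "manager": ["S1"],
    "S1": ["manager", "C1", "C2"],
    "C1": ["S1", "E1"],
    "C2": ["S1", "E2", "E3"],
    "E1": ["C1"],
    "E2": ["C2"],
    "E3": ["C2"],
}


def analyze(topology, start, failed):
    """Classify every node, given one failed node, as seen from `start`."""
    # Breadth-first search that never passes through the failed node.
    reachable = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in topology[node]:
            if peer != failed and peer not in reachable:
                reachable.add(peer)
                queue.append(peer)

    status = {}
    for node in topology:
        if node == failed:
            status[node] = "root cause"   # the only incident reported
        elif node in reachable:
            status[node] = "normal"
        else:
            status[node] = "unclear"      # cannot be checked; incidents suppressed
    return status


def suppressed_interfaces(topology, failed):
    """Interfaces on neighboring nodes that face the failed node."""
    return [f"{peer}->{failed}" for peer in topology[failed]]


print(analyze(TOPOLOGY, "manager", "S1"))
print(suppressed_interfaces(TOPOLOGY, "C2"))
```

With `S1` as the failed node, every other switch in this sample topology is classified as unclear, because nothing can be reached without passing through `S1`. With `C2` as the failed node, the interfaces that face `C2` are the ones whose incidents would be suppressed.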

The Monitoring Manager can also analyze many other relationships between events and their root causes.
