D.6 Failure scenarios

The following subsections describe the network fault scenarios that the NNMi Causal Engine analyzes and explain how each failure is diagnosed. The following table defines the incidents indicated by these scenarios, together with other incident examples:

Table D‒1: Incident definition

Incident name: AddressNotResponding
Description: The IPv4 address does not respond to ICMP. Possible reasons include the following:
  1. The node is down.
  2. Because of an error in the configuration of a device such as a router, some IPv4 addresses cannot be reached.

Incident name: InterfaceDown
Description: The interface is not up.

Incident name: ConnectionDown
Description: Both (or all) connection endpoints are down.

Incident name: NodeDown
Description: This incident indicates that the NmsApa service has determined that the node is down based on the following analysis:
  • 100% of the IPv4 addresses assigned to this node cannot be reached.
  • The SNMP agent installed on this node is not responding.
  • At least two neighboring devices can be reached but report a problem with connectivity to this node.

Incident name: NodeOrConnectionDown
Description: This incident indicates that the node is not responding to ICMP or SNMP queries. Because only one of the neighboring interfaces is down, the NmsApa service cannot determine whether the node is down or the connection is down.
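
Conceptually, the table above maps the evidence gathered by state polling and neighbor analysis to an incident name. The Python sketch below illustrates only that mapping; the data class fields and the classify_unreachable_node helper are this example's own names, not NNMi APIs, and the real Causal Engine evaluates far more state.

```python
# Illustrative only: hypothetical names, not NNMi classes or APIs.
from dataclasses import dataclass

@dataclass
class NodeEvidence:
    addresses_responding: int      # polled IPv4 addresses that answered ICMP
    snmp_agent_responding: bool    # did the SNMP agent answer?
    neighbors_reporting_down: int  # reachable neighbors whose interface to this node is down

def classify_unreachable_node(ev: NodeEvidence) -> str:
    """Map the evidence in Table D-1 to an incident name (simplified)."""
    if ev.addresses_responding > 0 or ev.snmp_agent_responding:
        # The node is still partly reachable; only address- or agent-level
        # incidents (AddressNotResponding, SNMPAgentNotResponding) apply.
        return "AddressNotResponding"
    if ev.neighbors_reporting_down >= 2:
        # At least two reachable neighbors report lost connectivity to the node.
        return "NodeDown"
    # Only one neighboring interface is down: node-down and connection-down
    # cannot be distinguished.
    return "NodeOrConnectionDown"

print(classify_unreachable_node(
    NodeEvidence(addresses_responding=0,
                 snmp_agent_responding=False,
                 neighbors_reporting_down=1)))   # -> NodeOrConnectionDown
```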


(1) SNMP agent not responding to SNMP queries

[Figure]

Scenario: The SNMP agent is not responding. For example, the community string for this SNMP agent has been changed, or NNMi's communication configuration settings have not yet been updated, but the node is operational (IPv4 address can be pinged).

Root cause: The SNMP agent is not responding.

Incident: An SNMPAgentNotResponding incident is generated.

Status: The SNMP agent is in Critical status.

Conclusion: SNMPAgentNotResponding

Effect: The node status is Minor. The conclusion on the node is UnresponsiveAgentInNode. All polled interfaces have Unknown status because they cannot be managed by NNMi. The conclusion on each interface is InterfaceUnmanageable.
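
To confirm this situation from the management server, you could query the agent yourself. The following is a minimal sketch that assumes the Net-SNMP snmpget command is installed; the host address and community string are placeholders.

```python
# Assumes the Net-SNMP "snmpget" command is available on the management server.
import subprocess

def snmp_agent_responds(host: str, community: str, timeout_s: int = 3) -> bool:
    """True if the agent answers an SNMPv2c GET for sysObjectID.0."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community,
         "-t", str(timeout_s), "-r", "1",
         host, "1.3.6.1.2.1.1.2.0"],            # SNMPv2-MIB::sysObjectID.0
        capture_output=True, text=True)
    return result.returncode == 0

# A False result with the old community string and True with the new one
# reproduces the transition between scenarios (1) and (2).
print(snmp_agent_responds("192.0.2.1", "new-community"))
```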

(2) SNMP agent responding to SNMP queries

[Figure]

Scenario: This scenario continues the previous (1) SNMP agent not responding to SNMP queries scenario. An NNMi administrator has updated the community configuration settings to include the new community string. The SNMP agent for the managed node starts responding to SNMP queries.

Root cause: The SNMP agent is responding.

Incident: None generated. The SNMPAgentNotResponding incident is closed.

Status: The SNMP agent is in Normal status.

Conclusion: SNMPAgentResponding

Effect: The node status is Normal. The conclusion on the node is ResponsiveAgentInNode. InterfaceUnmanageable is cleared from all polled interfaces and the interfaces return to their previous status.

(3) IPv4 address not responding to ICMP

[Figure]

Scenario: IPv4 address 1 on Server 1 (S1) is not responding. For example, the route on Router 1 (R1) has changed from Interface 1 to Interface 2, so that packets destined for Interface 1 on Server 1 are now routed out of Interface 2 on Router 1. The associated interface is operational, and the node can be reached because you can ping some IPv4 addresses. The SNMP agent is up.

Root cause: IPv4 address is not responding.

Incident: An AddressNotResponding incident is generated.

Status: IPv4 address is in Critical status.

Conclusion: AddressNotResponding

Effect: The node status is Minor. The conclusion on the node is SomeUnresponsiveAddressesInNode.
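
A quick way to reproduce this symptom outside NNMi is to ping the individual addresses of the node. The sketch below uses the system ping command with Linux-style options; the addresses are hypothetical.

```python
import subprocess

def address_responds(ipv4: str) -> bool:
    """Send one ICMP echo request with a 2-second wait (Linux ping options)."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", ipv4],
        capture_output=True).returncode == 0

# Hypothetical addresses for Server 1 in the figure:
for addr in ("192.0.2.10", "192.0.2.11"):
    print(addr, "responds" if address_responds(addr) else "does not respond")
```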

(4) IPv4 address responding to ICMP

[Figure]

Scenario: This scenario continues the previous (3) IPv4 address not responding to ICMP scenario. The IPv4 address is now responding, the associated interface is operational, and the node can be reached (for example, you can ping some IPv4 addresses, or the SNMP agent is up, or both).

Root cause: IPv4 address is responding.

Incident: None generated. The AddressNotResponding incident is closed.

Status: The IPv4 address is in Normal status.

Conclusion: AddressResponding

Effect: The node status is Normal. The conclusion on the node is ResponsiveAddressesInNode.

(5) Interface is operationally down

[Figure]

Scenario: Interface 1 for R1 is operationally down (ifOperStatus=down) and administratively up (ifAdminStatus=up). Router 1 sends a LinkDown trap. Router 1 can be reached because some IPv4 addresses, such as IPv4 address 2, respond to ping. The SNMP agent is up. IPv4 address 1 is associated with Interface 1 and has stopped responding to ICMP.

Root cause: The interface is down.

Incident: An InterfaceDown incident is generated. The LinkDown incident is correlated beneath the InterfaceDown incident.

Status: The interface is in Critical status.

Conclusion: InterfaceDown

Effect: The node status is Minor. The conclusion on the node is InterfacesDownInNode. No AddressNotResponding incident is associated with the IPv4 address.
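
The conclusions in scenarios (5) through (8) follow from the IF-MIB ifAdminStatus/ifOperStatus pair that NNMi polls. The sketch below shows that mapping; the OID constants come from IF-MIB, while the helper name is this example's own.

```python
# OIDs are from IF-MIB; for both columns, 1 = up and 2 = down.
IF_ADMIN_STATUS = "1.3.6.1.2.1.2.2.1.7"   # IF-MIB::ifAdminStatus.<ifIndex>
IF_OPER_STATUS  = "1.3.6.1.2.1.2.2.1.8"   # IF-MIB::ifOperStatus.<ifIndex>

def interface_conclusion(admin_up: bool, oper_up: bool) -> str:
    if not admin_up:
        return "InterfaceDisabled"   # scenario (7): administratively down
    if not oper_up:
        return "InterfaceDown"       # scenario (5): admin up, oper down
    return "InterfaceUp"             # scenarios (6) and (8): fully up

print(interface_conclusion(admin_up=True, oper_up=False))   # -> InterfaceDown
```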

(6) Interface is operationally up

[Figure]

Scenario: This scenario continues the previous (5) Interface is operationally down scenario. Interface 1 for R1 is now operationally up (ifOperStatus=up). The node can be reached. All of its IPv4 addresses respond to ping. The SNMP agent is up.

Root cause: The interface is up.

Incident: None generated. The InterfaceDown incident is closed.

Status: The interface is in Normal status.

Conclusion: InterfaceUp

Effect: The node status is Normal. The conclusion on the node is InterfacesUpInNode.

(7) Interface is administratively down

[Figure]

Scenario: Interface 1 for R1 is administratively down (ifAdminStatus=down), but the node can be reached. For example, Interface 2 responds to ping and the SNMP agent is up. Disabling Interface 1 for R1 brings that interface operationally down. The IPv4 address associated with this interface, IPv4 address 1, stops responding to ICMP.

Root cause: Interface 1 for R1 is disabled.

Incident: None generated.

Status: The interface is in Disabled status.

Conclusion: InterfaceDisabled

Effect: The IPv4 address associated with Interface 1 for R1 has a status of Disabled. The conclusion on the IPv4 address is AddressDisabled.

(8) Interface is administratively up

[Figure]

Scenario: This scenario continues the previous (7) Interface is administratively down scenario. Interface 1 for R1 is now administratively up (ifAdminStatus=up). The node can be reached because some of the IPv4 addresses of that interface respond to ping. The SNMP agent is up. Enabling Interface 1 for R1 brings it operationally up. The IPv4 address associated with this interface starts responding to ICMP.

Root cause: The interface is enabled.

Incident: None generated.

Status: The interface is in Normal status.

Conclusion: InterfaceEnabled

Effect: The IPv4 address associated with Interface 1 for R1 has a status of Enabled. The conclusion on the IPv4 address is AddressEnabled.

(9) Connection is operationally down

[Figure]

Scenario: The connection between the interface on Switch 3 connecting to Switch 1 (If13) and the interface on Switch 1 connecting to Switch 3 (If31) is down. Traffic flows from the Management server through Switch 1 (SW1) and Switch 2 (SW2). Both If13 and If31 are marked down.

Root cause: The connection between If13 and If31 is down.

Incident: A ConnectionDown incident is generated. The InterfaceDown incidents from If13 and If31 are correlated beneath the ConnectionDown incident.

Status: The connection is in Critical status.

Conclusion: ConnectionDown
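
As Table D-1 states, a connection is concluded down only when both (or all) of its endpoint interfaces are down. A brief illustration (the helper name is this sketch's own):

```python
def connection_conclusion(endpoint_interfaces_up: list) -> str:
    """ConnectionDown only when every endpoint interface is down."""
    return "ConnectionUp" if any(endpoint_interfaces_up) else "ConnectionDown"

print(connection_conclusion([False, False]))   # If13 and If31 both down -> ConnectionDown
```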

(10) Connection is operationally up

[Figure]

Scenario: This scenario continues the previous (9) Connection is operationally down scenario. The connection between If13 and If31 is now up.

Root cause: The connection between If13 and If31 is up.

Incident: None generated. The ConnectionDown incident is closed.

Status: The connection is in Normal status.

Conclusion: ConnectionUp

(11) Directly connected node is down

[Figure]

Scenario: Access switches ASW11, ASW12, ASW21, and ASW22 are redundantly connected to the distribution routers as shown. Distribution routers DR1 and DR2 are directly connected to each other. Distribution router DR1 goes down.

Root cause: Node DR1 is down according to neighbor analysis.

Incident: A NodeDown incident is generated. The InterfaceDown incidents from one-hop neighbors are correlated beneath the NodeDown incident.

Status: The node is in Critical status.

Conclusion: NodeDown

(12) Directly connected node is up

[Figure]

Scenario: This scenario continues the previous (11) Directly connected node is down scenario. Distribution router DR1 comes back up.

Root cause: Node DR1 is up.

Incident: None generated. The NodeDown incident is closed.

Status: The node is in Normal status.

Conclusion: NodeUp

(13) Indirectly connected node is down

[Figure]

Note

The diagram is conceptual. It does not represent an actual NNMi topology map or workspace view.

Scenario: This scenario can occur with any indirect connection where NNMi cannot discover the intermediate devices. In this example, Routers R1 and R2 appear to be directly connected in NNMi topology maps, but in reality these two routers are indirectly connected through optical repeaters (because the optical repeaters do not respond to SNMP or ICMP queries, they are not discovered by NNMi).

Router 2 becomes unreachable, either because its connected interface is down or because the connection between the optical repeaters is down. The interface on Router 1 that indirectly connects it to Router 2 is still up because its optical repeater is still up.

Root cause: Router 2 is down according to neighbor analysis.

Incident: A NodeDown incident is generated.

Status: Router R2 is in Critical status.

Conclusion: NodeDown

(14) Indirectly connected node is up

[Figure]

Note

The diagram is conceptual. It does not represent an actual NNMi topology map or workspace view.

Scenario: This scenario continues the previous (13) Indirectly connected node is down scenario. The failed connection comes back up. Router 2 becomes reachable.

Root cause: The connection between Router 1 and Router 2 is up.

Incident: None generated. The NodeDown incident is closed.

Status: Router 2 is in Normal status. The connection is in Normal status.

Conclusion: NodeUp

(15) Directly connected node is down and creates a shadow

[Figure]

Scenario: Router 2 (R2) goes down as shown above.

Root cause: Node (Router 2) is down according to NNMi's neighbor analysis.

Incident: A NodeDown incident is generated. The InterfaceDown incidents from one-hop neighbors are correlated beneath the NodeDown incident.

Status: The node is in Critical status.

Conclusion: NodeDown

Effect: All of the access switches are unreachable. The status of all nodes in the shadow is Unknown and the conclusion on each of them is NodeUnmanageable.
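
The "shadow" is simply the set of nodes that can no longer be reached from the management server once Router 2 is removed from the topology. The sketch below illustrates that reachability computation on a made-up topology; it is not NNMi's implementation.

```python
from collections import deque

def shadow(topology: dict, mgmt: str, down: str) -> set:
    """Nodes unreachable from the management server when 'down' is removed."""
    reachable, queue = {mgmt}, deque([mgmt])
    while queue:
        for nbr in topology[queue.popleft()]:
            if nbr != down and nbr not in reachable:
                reachable.add(nbr)
                queue.append(nbr)
    return set(topology) - reachable - {down}

# Hypothetical topology: the access switches sit behind Router 2.
topo = {
    "MgmtServer": {"R1"}, "R1": {"MgmtServer", "R2"},
    "R2": {"R1", "ASW1", "ASW2"},
    "ASW1": {"R2"}, "ASW2": {"R2"},
}
print(shadow(topo, "MgmtServer", "R2"))   # -> {'ASW1', 'ASW2'}: Unknown status
```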

(16) Directly connected node is up, clearing the shadow

[Figure]

Scenario: This scenario continues the previous (15) Directly connected node is down and creates a shadow scenario. Router 2 comes back up.

Root cause: Node Router 2 is up.

Incident: None generated. The NodeDown incident is closed by the NodeUp incident.

Status: The node is in Normal status.

Conclusion: NodeUp

Effect: All of the access switches are now reachable. The status of all nodes in the shadow is Normal.

(17) Important node is unreachable

Scenario: A node that is part of the Important Nodes node group cannot be reached.

Note

You must add a node to the Important Nodes node group before the NmsApa service can analyze it. If a node becomes unreachable before being added to the Important Nodes node group, the NmsApa service does not generate a NodeDown incident.

Root cause: The node is down. The NmsApa service does not do neighbor analysis, but concludes that the node is down because it was marked as important.

Incident: A NodeDown incident is generated. There are no correlated incidents.

Status: The node is in Critical status.

Conclusion: NodeDown

(18) Important node is reachable

Scenario: This scenario continues the previous (17) Important node is unreachable scenario. The important node comes back up and can be reached.

Root cause: The node is up.

Incident: None generated. The NodeDown incident is closed by the NodeUp incident.

Status: The node is in Normal status.

Conclusion: NodeUp

(19) Node or connection is down

[Figure]

Scenario: There is no redundancy for Router 2 (R2). Either Router 2 is down or the connection between Router 1 (R1) and Router 2 is down.

Root cause: The node or the connection is down.

Incident: The NodeOrConnectionDown incident is generated. The source node in this scenario is Router 2.

Status: The node is in Critical status. The connection is in Minor status.

Conclusion: NodeOrConnectionDown

(20) Node or connection is up

[Figure]

Scenario: This scenario continues the previous (19) Node or connection is down scenario. Router 2 is now up.

Root cause: The node is up.

Incident: None generated. The NodeOrConnectionDown incident is closed.

Status: The node is in Normal status. The connection is in Normal status.

Conclusion: NodeUp

(21) Island group is down

[Figure]

Note

The diagram is conceptual. It does not represent an actual NNMi topology map or workspace view.

Scenario: NNMi has partitioned your network into two island groups. The NNMi management server is connected to a node in Island Group 1. Island Group 2 has become unreachable due to problems in your service provider's WAN.

Note

Island groups contain highly connected sets of nodes that are not connected or are only minimally connected to the rest of the network. For example, NNMi can identify multiple island groups for an enterprise network with geographically distributed sites connected by a WAN. Island groups are created by NNMi and cannot be modified by the user. For details about island groups, see NNMi Console in NNMi Help.

Root cause: Island Group 2 is down according to neighbor analysis.

Incident: The IslandGroupDown incident is generated. NNMi chooses a representative node from Island Group 2 as the source node for the incident.

Status: The status of Island Group 2 is set to Unknown. Objects in Island Group 2 have Unknown status. The connecting interface from Island Group 1 is up because the connection from the interface to the WAN is still up.

Conclusion: Not applicable for island groups.
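
As a rough approximation, island groups correspond to the connected components of the discovered topology; NNMi's own grouping also accounts for minimally connected sets and cannot be modified by the user. The sketch below only illustrates the idea on a made-up topology.

```python
def island_groups(topology: dict) -> list:
    """Connected components of a node-adjacency map (illustrative only)."""
    seen, groups = set(), []
    for start in topology:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(topology[node] - component)
        seen |= component
        groups.append(component)
    return groups

# Two sites joined only through an undiscovered provider WAN appear as two groups:
topo = {"A1": {"A2"}, "A2": {"A1"}, "B1": {"B2"}, "B2": {"B1"}}
print(island_groups(topo))   # -> [{'A1', 'A2'}, {'B1', 'B2'}]
```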

(22) Island group is up

[Figure]

Note

The diagram is conceptual. It does not represent an actual NNMi topology map or workspace view.

Scenario: This scenario continues the previous (21) Island group is down scenario. The service provider's WAN problems are fixed, and Island Group 2 can be reached.

Root cause: The WAN connection to Island Group 2 is back up.

Incident: None generated. The IslandGroupDown incident is closed.

Status: The status for Island Group 2 is set to Normal. Objects in Island Group 2 return to Normal status.

Conclusion: Not applicable for island groups.

(23) Link aggregated ports (NNMi Advanced)

Aggregator is up

[Figure]

Scenario: All ports within the port aggregator are operationally and administratively up.

Root cause: All operational and administrative states are up.

Incident: No incident is generated.

Status: The status of the aggregator is set to Normal.

Conclusion: AggregatorUp

Aggregator is degraded

[Figure]

Scenario: Some (but not all) ports within the port aggregator are operationally down.

Root cause: Operational states on some ports are down.

Incident: An AggregatorDegraded incident is generated.

Status: The status of the aggregator is set to Minor.

Conclusion: AggregatorDegraded

Aggregator is down

[Figure]

Scenario: All ports within the port aggregator are operationally down.

Root cause: Operational states on all ports are down.

Incident: An AggregatorDown incident is generated.

Status: The status of the aggregator is set to Critical.

Conclusion: AggregatorDown
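
The three aggregator scenarios reduce to a simple rule over the operational states of the member ports. The helper below is illustrative only; the names are this sketch's own.

```python
def aggregator_state(ports_oper_up: list) -> tuple:
    """Return (status, conclusion) for a port aggregator."""
    if all(ports_oper_up):
        return "Normal", "AggregatorUp"
    if any(ports_oper_up):
        return "Minor", "AggregatorDegraded"
    return "Critical", "AggregatorDown"

print(aggregator_state([True, True, False]))   # -> ('Minor', 'AggregatorDegraded')
```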

(24) Link aggregated connections (NNMi Advanced)

Link aggregated connection is up

[Figure]

Scenario: All port aggregator members of the connection are up.

Root cause: The aggregator is up on all members of the connection.

Incident: No incident is generated.

Status: The status of the aggregated connection is set to Normal.

Conclusion: AggregatorLinkUp

Link aggregated connection is degraded

[Figure]

Scenario: Some (but not all) port aggregator members of the connection are down.

Root cause: The aggregator is down on some members of the connection.

Incident: An AggregatorLinkDegraded incident is generated.

Status: The status of the aggregated connection is set to Minor.

Conclusion: AggregatorLinkDegraded

Link aggregated connection is down

[Figure]

Scenario: All port aggregator members of the connection are down.

Root cause: The aggregator is down on all members of the connection.

Incident: An AggregatorLinkDown incident is generated.

Status: The status of the aggregated connection is set to Critical.

Conclusion: AggregatorLinkDown

(25) Router redundancy groups: HSRP and VRRP (NNMi Advanced)

Router redundancy group has no primary

[Figure]

Scenario: A router redundancy group does not have a primary member. A properly functioning HSRP or VRRP router redundancy group must have one operational primary router and one operational secondary router.

Root cause: This scenario could be the result of an interface failure on the primary router while the secondary was not active, or of a misconfiguration of the router redundancy group.

Incident: An RrgNoPrimary incident is generated. The correlation nature of RrgNoPrimary is service impact. If there is an identified root cause such as InterfaceDown, the InterfaceDown incident is correlated under the RrgNoPrimary incident as an impact correlation.

Status: The status of the router redundancy group is set to Critical.

Conclusion: RrgNoPrimary

Router redundancy group has multiple primaries

[Figure]

Scenario: A router redundancy group has multiple routers reporting as the primary router. A properly functioning HSRP or VRRP router redundancy group must have only one operational primary router.

Root cause: This scenario could be due to a faulty configuration of the router redundancy group.

Incident: An RrgMultiplePrimary incident is generated. The correlation nature of RrgMultiplePrimary is service impact.

Status: The status of the router redundancy group is set to Major.

Conclusion: RrgMultiplePrimary

Router redundancy group has failed over

[Figure]

Scenario: A router redundancy group has had a failure on the primary router and the secondary router has taken over as primary. The standby usually becomes the secondary, which is not a problem; the group is functioning as intended. The incident generated for this scenario is for informational purposes to report that the group has had a failover.

Root Cause: This scenario is most likely due to a failure on the primary router.

Incident: An RrgFailover incident is generated. The correlation nature of RrgFailover is service impact. If an identified root cause such as InterfaceDown exists, the InterfaceDown incident is correlated under the RrgFailover incident as an impact correlation.

Status: No status is generated.

Conclusion: RrgFailover

Router redundancy group has no secondary

[Figure]

Scenario: A router redundancy group has had a failure on the secondary router. Either there is no standby or the standby did not take over as the secondary.

Root Cause: This scenario could be due to an interface failure on the router or some misconfiguration of the router redundancy group.

Incident: An RrgNoSecondary incident is generated. The correlation nature of RrgNoSecondary is service impact. If an identified root cause such as InterfaceDown exists, the correlation nature between the RrgNoSecondary and InterfaceDown incidents is service impact.

Status: The status of the router redundancy group is set to Minor.

Conclusion: RrgNoSecondary

Router redundancy group has multiple secondaries

[Figure]

Scenario: A router redundancy group has multiple routers reporting as the secondary router. A properly functioning HSRP or VRRP router redundancy group must have only one operational secondary router.

Root Cause: This scenario could be due to misconfiguration of the router redundancy group.

Incident: An RrgMultipleSecondary incident is generated. The correlation nature of RrgMultipleSecondary is service impact.

Status: The status of the router redundancy group is set to Minor.

Conclusion: RrgMultipleSecondary

Router redundancy group has degraded

[Figure]

Scenario: The router redundancy group has experienced some change. The group is functioning, and there is one primary router and one secondary router, but there is some non-normal condition that could be an issue. For example, there might be several routers not in Standby state.

Root Cause: This scenario could be due to some misconfiguration of the router redundancy group.

Incident: An RrgDegraded incident is generated. The correlation nature of RrgDegraded is service impact.

Status: The status of the router redundancy group is set to Warning.

Conclusion: RrgDegraded
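
Apart from RrgFailover, which reports an event rather than a state, the conclusions above can be summarized as a function of how many members currently report the primary and secondary roles. The sketch below is illustrative; the helper name is hypothetical.

```python
def rrg_conclusion(primaries: int, secondaries: int) -> str:
    """Map member role counts to the conclusions above (RrgFailover excluded)."""
    if primaries == 0:
        return "RrgNoPrimary"           # status Critical
    if primaries > 1:
        return "RrgMultiplePrimary"     # status Major
    if secondaries == 0:
        return "RrgNoSecondary"         # status Minor
    if secondaries > 1:
        return "RrgMultipleSecondary"   # status Minor
    # One primary and one secondary: Normal, or RrgDegraded (status Warning)
    # if some other member is in a non-normal state.
    return "Normal or RrgDegraded"

print(rrg_conclusion(primaries=1, secondaries=0))   # -> RrgNoSecondary
```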

(26) Node component scenarios

Fan failure or malfunctioning

Scenario: A fan sensor detects a failed fan in a chassis.

Incident: A FanOutOfRangeOrMalfunctioning incident is generated.

Status: The status of the fan sensor node component is Critical. A status of Major is propagated to the node.

Conclusion: FanOutOfRangeOrMalfunctioning

Power supply failure or malfunctioning

Scenario: A power supply sensor detects a failed power supply in a chassis.

Incident: A PowerSupplyOutOfRangeOrMalfunctioning incident is generated.

Status: The status of the power supply node component is Critical. A status of Major is propagated to the node.

Conclusion: PowerSupplyOutOfRangeOrMalfunctioning

Temperature exceeded or malfunctioning

Scenario: A temperature sensor detects a high temperature in a chassis.

Incident: A TemperatureOutOfRangeOrMalfunctioning incident is generated.

Status: The status of the temperature sensor node component is Critical. The status of the node does not change.

Conclusion: TemperatureOutOfRangeOrMalfunctioning

Voltage out of range or malfunctioning

Scenario: A voltage sensor detects a voltage problem in a chassis.

Incident: A VoltageOutOfRangeOrMalfunctioning incident is generated.

Status: The status of the voltage sensor node component is Critical. The status of the node does not change.

Conclusion: VoltageOutOfRangeOrMalfunctioning
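
As a summary of the four node component scenarios above, the lookup below shows (in illustrative Python form) which sensor types propagate a Major status to the node and which leave the node status unchanged.

```python
# Summary of the scenarios in (26); names are this sketch's own, not NNMi's.
SENSOR_EFFECT_ON_NODE = {
    "fan":          "Major",       # FanOutOfRangeOrMalfunctioning
    "power_supply": "Major",       # PowerSupplyOutOfRangeOrMalfunctioning
    "temperature":  "unchanged",   # TemperatureOutOfRangeOrMalfunctioning
    "voltage":      "unchanged",   # VoltageOutOfRangeOrMalfunctioning
}

def node_status_after(sensor: str) -> str:
    return SENSOR_EFFECT_ON_NODE.get(sensor, "unchanged")

print(node_status_after("fan"))   # -> Major
```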