16.4.2 Application failover scenarios
Several kinds of problems can cause the active NNMi management server to stop sending heartbeats, which in turn can initiate a failover.
This subsection describes, scenario by scenario, whether and how the NNMi management server fails over when an error occurs.
| Possible error (category) | Possible error | Error location: active server | Error location: standby server |
|---|---|---|---|
| Server | Server failure#1 | Scenario 1 | Scenario 6 |
| Server | OS stoppage | Scenario 2 | Scenario 6 |
| Process | nnmcluster stoppage | Scenario 3 | Scenario 6 |
| Process | Stoppage of process other than nnmcluster | Scenario 5 | N/A#2 |
| Network | Communication error | Scenario 4 | Scenario 4 |
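
The matrix above can be read as a lookup from an error and its location to a scenario number. The following is a minimal Python sketch of that reading. The key strings and the helpers `scenario_for` and `causes_failover` are illustrative names chosen here and are not part of NNMi.

```python
# Illustrative encoding of the scenario matrix; all names are hypothetical.
SCENARIOS = {
    ("server failure", "active"): 1,
    ("server failure", "standby"): 6,
    ("os stoppage", "active"): 2,
    ("os stoppage", "standby"): 6,
    ("nnmcluster stoppage", "active"): 3,
    ("nnmcluster stoppage", "standby"): 6,
    ("other process stoppage", "active"): 5,
    ("other process stoppage", "standby"): None,  # N/A in the table
    ("communication error", "active"): 4,
    ("communication error", "standby"): 4,
}

# Scenarios that this subsection lists under "(1) When a failover occurs".
FAILOVER_SCENARIOS = {1, 2, 3}


def scenario_for(error, location):
    """Return the scenario number for an error at a location, or None for N/A."""
    return SCENARIOS[(error.lower(), location.lower())]


def causes_failover(error, location):
    """Return True if the combination maps to scenarios 1 through 3."""
    return scenario_for(error, location) in FAILOVER_SCENARIOS
```

For example, `scenario_for("OS stoppage", "active")` returns 2, and `causes_failover("Communication error", "standby")` returns `False` because scenario 4 is described under "(2) When no failover occurs".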
(1) When a failover occurs
In the following scenarios 1 through 3, NNMi fails over to the standby server if automatic failover is enabled and continues to monitor the network:
- Scenario 1: The active NNMi management server fails.
  A hardware or OS error causes the active server to stop without going through the OS's shutdown process. The standby server detects that the other server has stopped, becomes the active server, automatically starts NNMi, and continues to monitor the network. If the original active server is started again, it runs as the standby server.
- Scenario 2: The system administrator shuts down or restarts the active NNMi management server.
  The active server stops after going through the OS's shutdown process. The standby server detects that the other server has stopped, becomes the active server, automatically starts NNMi, and continues to monitor the network. If the original active server is started again, it runs as the standby server.
  Note that if the NNMi management server runs a UNIX operating system and a termination script runs when the OS is shut down, the ovstop command is executed automatically. This disables application failover, so failover does not occur.
- Scenario 3: The NNMi administrator shuts down the cluster.
  The cluster manager (nnmcluster process) has been stopped, either by the administrator or by some other cause. The standby server detects that the other server has stopped, becomes the active server, automatically starts NNMi, and continues to monitor the network.
- Important note
  If some factor stops only the nnmcluster process on the active server while the other NNMi processes keep running, the state is the same as in scenario 3. As a result, NNMi might be active on both the original active server and the new active server. In such a case, recover from the problem by restarting the OS of the original active server.
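
The standby server's behavior in scenarios 1 through 3 can be summarized as a toy model: when heartbeats from the active server stay silent for too long and automatic failover is enabled, the standby server becomes active. The sketch below is only an illustration under stated assumptions. The `StandbyNode` class, the `heartbeat_timeout` parameter, and the `tick` method are hypothetical and do not reflect how nnmcluster is implemented or configured.

```python
from dataclasses import dataclass


@dataclass
class StandbyNode:
    """Toy model of a standby server's view of the cluster (not NNMi code)."""

    heartbeat_timeout: float  # hypothetical: seconds of silence tolerated
    auto_failover_enabled: bool = True
    role: str = "standby"
    last_heartbeat: float = 0.0

    def on_heartbeat(self, now: float) -> None:
        # Record when the most recent heartbeat from the active server arrived.
        self.last_heartbeat = now

    def tick(self, now: float) -> str:
        # Become active only if heartbeats have been silent for longer than
        # the timeout and automatic failover is enabled.
        silent = now - self.last_heartbeat > self.heartbeat_timeout
        if self.role == "standby" and silent and self.auto_failover_enabled:
            self.role = "active"
        return self.role
```

With `StandbyNode(heartbeat_timeout=10.0)` and a last heartbeat at time 0, `tick(5.0)` leaves the node in the standby role and `tick(20.0)` switches it to active. If `auto_failover_enabled` is `False`, which corresponds to the case where failover has been disabled, the node stays in the standby role.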
(2) When no failover occurs
Failover does not occur for events other than the scenarios described in 16.4.2(1) When a failover occurs. The following are examples:
- Scenario 4: The network connection between the active and the standby NNMi management servers fails.
  The two servers can no longer communicate with each other. Because the cluster managers (nnmcluster processes) cannot exchange heartbeats, the following states result:
  - The active server determines that the other server has stopped and continues to run.
  - The standby server also determines that the other server has stopped, becomes active, and starts NNMi.
  In scenario 4, both NNMi management servers run in the active state. When network communication is restored, the two NNMi management servers automatically negotiate which server will be the new active server. The other server becomes the standby server and stops NNMi.
  In HA configurations based on cluster software, the state in which both servers are active causes a problem called split brain. Application failover uses a different framework, so it recovers without any problem once communication is restored, for the following reasons:
  - When communication is restored, one of the servers becomes the standby server, which restores the normal configuration.
  - In application failover, the database does not use a shared disk. Instead, the standby server requests the active server to transfer the database and then synchronizes with it. Therefore, no consistency problem occurs even if NNMi runs on both servers.
- Scenario 5: The NNMi processes have stopped.
  If an NNMi process other than the cluster manager (nnmcluster process) stops for any reason, failover does not occur.
  This is because the cluster managers monitor each other's servers through heartbeat communications but do not monitor the NNMi processes on the local server.
  If you want failover to occur whenever an NNMi process stops, use an HA configuration based on cluster software.
- Scenario 6: A failure occurred on the standby server.
  An error described in scenarios 1 through 3 (server failure, OS stoppage, or stoppage of the nnmcluster process) has occurred on the standby server. In such a case, the standby server is removed as a cluster configuration member, but NNMi on the active server continues to run and continues to monitor the network.
- Important note
  No report is sent even when the standby server becomes unavailable (resulting in single-server operation).
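
The sequence in scenario 4 (both servers active while the link is down, then one server returning to standby and resynchronizing its database from the active server) can be sketched as a small state model. This is an illustration only. The `Server` class and both functions are hypothetical, and the `winner_name` argument stands in for the outcome of NNMi's automatic negotiation, whose rule this subsection does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy model of one NNMi management server (not NNMi code)."""

    name: str
    role: str
    nnmi_running: bool
    database: dict = field(default_factory=dict)


def on_link_lost(active, standby):
    # Each side concludes that the other has stopped. The active server keeps
    # running; the standby server becomes active and starts NNMi.
    standby.role = "active"
    standby.nnmi_running = True


def on_link_restored(a, b, winner_name):
    # winner_name is an assumption: it represents whichever server the
    # negotiation selects as the new active server.
    if winner_name not in (a.name, b.name):
        raise ValueError("winner_name must name one of the two servers")
    winner, loser = (a, b) if a.name == winner_name else (b, a)
    winner.role = "active"
    loser.role = "standby"
    loser.nnmi_running = False
    # No shared disk: the standby server receives a copy of the active
    # server's database and replaces its own contents with it.
    loser.database = dict(winner.database)
    return winner, loser
```

In this model, any changes the losing server made to its own database copy while both servers were active are replaced by the active server's data during resynchronization, which is why the two copies cannot end up inconsistent.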