18.4.1 Application failover behavior
The figure below shows the application failover configuration for two NNMi management servers using the NNMi database. Refer to this figure while reading the rest of this chapter.
A database error might result if you remove a standby server from the cluster, run that server as a stand-alone server, and then add it back into the cluster. If this occurs, run the following command from the command line:
nnmcluster dbsync
NNMi 11-00 includes a streaming replication feature within application failover, whereby database transactions are sent from the active server to the standby server, keeping the standby server in sync with the active server. This eliminates the need to import database transaction logs on the standby server during failover (as was required in earlier NNMi versions), greatly reducing the time the standby server needs to take over as the active server. Another benefit of this feature is that database backup files are sent from one node to the other only when needed: because transaction files are transmitted regularly, large database backup files rarely need to be sent.
After you start both servers (active and standby), the standby server detects the active server and requests a database backup from the active server, but does not start network monitoring. This database backup is stored as a single ZIP file. If the standby server already has a ZIP file from a previous cluster connection, and NNMi finds that the file is still synchronized with the active server, the file is not retransmitted.
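The backup-reuse decision described above can be sketched as follows. This is an illustrative model, not NNMi code: the function names and the idea of comparing a checksum against the active server's copy are assumptions made for illustration only.

```python
# Illustrative sketch (not NNMi code): how a standby server might decide
# whether the full database backup ZIP must be retransmitted, or whether a
# ZIP kept from a previous cluster connection can be reused.
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """Checksum used to compare the local ZIP with the active server's copy."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def needs_backup_transfer(local_zip: Path, active_checksum: str) -> bool:
    """Return True if the active server must retransmit the full backup ZIP.

    The local ZIP is reused only if it exists and still matches the copy on
    the active server (i.e., it is already synchronized).
    """
    if not local_zip.exists():
        return True                      # no previous backup: request one
    return checksum(local_zip) != active_checksum  # stale backup: request again
```

The same idea applies regardless of how the servers actually compare the files; the point is that an already-synchronized backup is never sent twice.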
While both the active and standby servers are running, the active server periodically sends database transaction logs to the standby server. You can modify the frequency of this data transfer by changing the value of the com.hp.ov.nms.cluster.timeout.archive parameter in the nms-cluster.properties file. These transaction logs accumulate on the standby server and are available whenever the standby server needs to become active.
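For example, the archive interval is set in nms-cluster.properties as shown below. The value and its units here are assumptions for illustration; consult the NNMi documentation for the valid range and units before changing this parameter.

```properties
# nms-cluster.properties (illustrative fragment; the value shown is an
# assumption -- verify valid settings in the NNMi documentation)
com.hp.ov.nms.cluster.timeout.archive=900
```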
The standard data transfer frequency is as follows:
A full backup of the database is transferred every 6 hours.
Transaction logs (database update information) are transferred every 15 minutes. When a large volume of data in the database is updated, transaction logs might be transferred more frequently.
Updates made while a data transfer is in progress are not reflected in that transfer.
When the standby server receives a full database backup from the active server, it places the information into the NNMi database. The standby server also creates a recovery.conf file to inform the NNMi database that it must incorporate all received transaction logs before it becomes available to other services. If the active server becomes unavailable for any reason, the standby server becomes active by executing the ovstart command to start the NNMi services. The standby NNMi management server imports the transaction logs before starting the remaining NNMi services.
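The takeover sequence in the preceding paragraph can be sketched as an ordered procedure. This is a conceptual model only: the function and step names are assumptions, and NNMi performs these steps internally (via the recovery.conf file and the ovstart command), not through user-visible code.

```python
# Conceptual sketch (not NNMi code) of the ordering described above:
#   1. the recovery.conf file tells the database to replay transaction logs
#   2. all received transaction logs are incorporated
#   3. only then do the remaining NNMi services start
def take_over(pending_transaction_logs: list[str]) -> list[str]:
    """Return the ordered steps the standby performs when it becomes active."""
    steps = ["write recovery.conf"]
    # All received transaction logs are imported before other services start.
    steps += [f"import {log}" for log in pending_transaction_logs]
    steps.append("start remaining NNMi services")
    return steps
```

The key invariant, per the text, is that no transaction log is imported after the remaining services have started.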
Database files are stored under the following directory:
Windows: %NnmDataDir%shared\nnm\databases\
Linux: $NnmDataDir/shared/nnm/databases/
In the application failover configuration, three directories (Postgres, Postgres_standby, and Postgres.OLD) are created under this directory. These directories are used for the following purposes:
Postgres: Stores the database data used during operation or for standby purposes.
Postgres_standby: Stores data sent from the active server to the standby server.
Postgres.OLD: Used by the standby server for saving old Postgres data when new data is received.
If the active server fails, the standby server begins discovery and polling activities. This transition keeps NNMi monitoring and polling your network while you diagnose and repair the failed system.
- Important
NNMi performs resynchronization after a failover in an application failover configuration. This might delay the updating of statuses and incidents.
During resynchronization, a message similar to the following might be displayed; this does not indicate a problem:
Updating of statuses and incidents was delayed because the Causal Engine queue is full. This might be due to resynchronization performed after a failover, restoration of a backup, or manual resynchronization in an application failover configuration.