19.8.1 Common configuration mistakes
Some common HA configuration mistakes are listed here:
-
Disk configuration is not valid.
-
VCS or SCS: If a resource cannot be probed, the configuration is incorrect. If a disk cannot be probed, the disk might no longer be accessible to the operating system.
-
Test the disk configuration manually, and confirm against the HA product documentation that the configuration is appropriate.
-
The disk is in use and cannot be started for the HA resource group.
Always check that the disk is not activated before starting the HA resource group.
-
WSFC network configuration is not valid.
If network traffic flows across multiple NICs, RDP sessions fail when you start programs that consume a large amount of network bandwidth, such as the NNMi ovjboss process.
-
Some HA products do not restart automatically at startup.
Review the HA product documentation for details about how to configure automatic restart at startup.
-
NFS or other access is added directly to the OS.
The HA resource group configuration must manage this access. Do not configure it directly in the operating system.
-
Being in the shared disk mount point during a failover or when the HA resource group is being placed offline.
HA kills any processes that prevent the shared disk from being unmounted. Make sure that no shell or process has its working directory in the shared disk mount point when a failover occurs or when the resource group is taken offline.
-
Reusing the HA cluster virtual IP address as the HA resource virtual IP address.
This works on one system and not the other. Configure a virtual IP address for the HA resource group that is different from the HA cluster virtual IP address.
-
Timeouts are too short.
If the products are misbehaving, the HA product might time out the HA resource and cause a failover.
In WSFC, check the value of the Time to wait for resource to start setting. NNMi sets this value to 15 minutes, but you can increase it.
-
Maintenance mode is not being used.
Maintenance mode was created for debugging HA failures. If you attempt to bring a resource group online on a system and it fails over shortly thereafter, use maintenance mode to keep the resource group online so that you can see what is failing.
-
Cluster logs are not being used.
Cluster logs reveal many common configuration mistakes. Review them when you diagnose an HA failure.
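Two of the mistakes listed above (working inside the shared disk mount point, and reusing the HA cluster virtual IP address for the HA resource) can be checked before a resource group is started or taken offline. The following Python sketch is illustrative only: it is not part of NNMi or of any HA product, and the function names, the mount point `/nnm/shared`, and the IP addresses are hypothetical placeholders to replace with values from your own cluster.

```python
import ipaddress
import os
from pathlib import Path
from typing import List, Optional


def is_inside(path: str, mount_point: str) -> bool:
    """Return True if path is the mount point itself or lies beneath it."""
    p = Path(os.path.realpath(path))
    m = Path(os.path.realpath(mount_point))
    return p == m or m in p.parents


def same_address(cluster_vip: str, resource_vip: str) -> bool:
    """Return True if both strings denote the same IP address."""
    return ipaddress.ip_address(cluster_vip) == ipaddress.ip_address(resource_vip)


def preflight(mount_point: str, cluster_vip: str, resource_vip: str,
              cwd: Optional[str] = None) -> List[str]:
    """Collect warnings for two of the common mistakes described above.

    All arguments are supplied by the caller; nothing is read from the
    HA product itself. cwd defaults to this process's working directory.
    """
    problems = []
    current = cwd if cwd is not None else os.getcwd()
    if is_inside(current, mount_point):
        problems.append(
            "Working directory %s is inside the shared disk mount point %s"
            % (current, mount_point)
        )
    if same_address(cluster_vip, resource_vip):
        problems.append(
            "HA resource virtual IP address %s reuses the HA cluster "
            "virtual IP address" % resource_vip
        )
    return problems
```

For example, `preflight("/nnm/shared", "10.0.0.5", "10.0.0.6")` returns an empty list when the current directory is outside the mount point and the two addresses differ. The check covers only the calling process; other processes that hold files open on the shared disk still have to be found with operating system tools.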