6.13.3 Creating a server monitoring command
A server monitoring command is used to monitor program status. If you create a server monitoring command, you can use HA Monitor to restart the server and perform automatic hot standby processing in the event of a server failure. If the server consists of only those UAPs that are monitored by the monitor-mode program management function, there is no need to create a server monitoring command.
This subsection explains when a server monitoring command is called and how to create one. For details about how to monitor servers, see 3.2.1 Monitoring a server in the monitor mode (for the active server). For details about the operating environment of commands that are executed from HA Monitor, see 6.22 Operating environment of commands that are executed from HA Monitor.
(1) Timing of calling a server monitoring command
HA Monitor executes the server monitoring command that you created with superuser permissions.
HA Monitor starts the command after startup of the active server is completed in the active system.
(2) How to create a server monitoring command (if the command is specified for the ptrlcmd_ex operand in the server environment definition)
Create a server monitoring command so that the following conditions are met:
-
If a particular operating environment is required for monitoring programs, set it. In RHEL7 or a later version, you can specify environment variables in the Unit settings file. However, the Unit settings file is overwritten when an overwrite installation of HA Monitor is performed. Therefore, we recommend that you specify an operating environment (environment variables) within the shell.
For details about the environment variables that are inherited when the server monitoring command is executed, see (2) How to create a server termination command in 6.13.2 Creating a server termination command.
-
The command must be able to receive the SIGTERM signal.
HA Monitor stops the monitoring command by sending the SIGTERM signal when the server is terminated normally, when planned hot-standby switchover is performed, or when execution of the monitoring command times out. Therefore, the command must be able to receive the SIGTERM signal.
-
Create a server monitoring command so that its execution does not loop indefinitely.
If processing of the monitoring command loops indefinitely for some reason (for example, because periodic monitoring is performed within the monitoring command), HA Monitor detects the loop as a slowdown or hang of the monitoring command. As a result, HA Monitor assumes that a server failure occurred on the monitor-mode server.
-
The monitoring command must return the monitoring result (EXIT code) as the termination result.
The following table describes the termination results (EXIT codes) that can be used. Note that EXIT codes 30 to 49 cannot be used because they have been reserved by HA Monitor.
Table 6-12: Termination results (EXIT codes) that can be used according to the status of the monitoring-target job processing

| Status of the monitoring-target job processing | Termination results (EXIT codes) that can be used | Description |
|---|---|---|
| Normal | 0 | If this EXIT code is returned, HA Monitor assumes that the monitoring-target job processing is in the normal status or has been recovered from a failure. Make sure that the command returns this code if the monitoring-target job processing is operating normally. |
| Status check failed, job processing error, or indeterminable | 1 to 9 | If this EXIT code is returned, HA Monitor re-executes the monitoring command. HA Monitor operates as specified by the ptrlcmd_ex_retry operand in the server environment definition. Note that the re-execution interval is the same value that is specified for the ptrlcmd_ex_inter operand in the server environment definition. Make sure that the command returns these codes if a problem that seems to be temporary has occurred for the target job processing (for example, a status check failed, a job processing error occurred, or the status is indeterminable). |
| Failure | 10 to 19 | If this EXIT code is returned, HA Monitor does not re-execute the monitoring command and assumes that a server failure occurred on the active server. HA Monitor operates as specified by the servexec_retry operand in the server environment definition. Because HA Monitor does not re-execute the monitoring command, make sure that the command returns these codes in cases such as when it is possible to determine that the job processing has failed. |
| Failure | 20 to 29 | If this EXIT code is returned, HA Monitor does not re-execute the monitoring command and performs hot-standby switchover, assuming that a server failure occurred on the active server. HA Monitor immediately performs hot-standby switchover, ignoring the specification of the servexec_retry operand in the server environment definition. Make sure that the command returns these codes in cases such as when HA Monitor can judge that job processing cannot continue even by restarting the server due to a disk failure or another permanent failure. |
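The EXIT code ranges in the table can be kept in one place in the monitoring command. The following is a minimal sketch, not part of the HA Monitor sample files. The function name exit_code_for and the result names ok, transient, failed, and permanent are assumptions introduced for illustration; only the code ranges come from the table.

```shell
#!/bin/sh
# Sketch only: exit_code_for and the symbolic result names are
# hypothetical. The EXIT code ranges follow the table above.
exit_code_for() {
    case "$1" in
        ok)        echo 0  ;;   # normal
        transient) echo 1  ;;   # 1 to 9: HA Monitor re-executes the command
        failed)    echo 10 ;;   # 10 to 19: server failure, servexec_retry applies
        permanent) echo 20 ;;   # 20 to 29: immediate hot-standby switchover
        *)         echo 1  ;;   # unknown result: treat as indeterminable
    esac
}
```

A monitoring command that stores its finding in a variable, for example result, could then end with exit "$(exit_code_for "$result")". Codes 30 to 49 are deliberately never produced because they are reserved by HA Monitor.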
The following provides an overview of processing for each EXIT code.
If the EXIT code is 0 (normal), HA Monitor starts the monitoring command at regular intervals (the interval is specified for the ptrlcmd_ex_inter operand in the server environment definition). The following figure shows an overview of failure monitoring for a monitor-mode server (in the case where the EXIT code 0 is returned).
The command returns an EXIT code of 1 to 9 when, for example, a status check failed, a job processing error occurred, or the status is indeterminable. If the number of times one of these codes is returned exceeds the value specified for the ptrlcmd_ex_retry operand in the server environment definition, HA Monitor assumes that the server has failed. The following figure shows an overview of failure monitoring for a monitor-mode server (in the case where an EXIT code of 1 to 9 is returned). Note that this figure is an example in which 2 is specified for the ptrlcmd_ex_retry operand in the server environment definition.
If the command returns the EXIT code 10 to 19 or 20 to 29 (failure), HA Monitor does not re-execute the monitoring command and determines that a server failure occurred on the active server. The following figure shows an overview of failure monitoring for a server in the monitor mode (if the command returns the EXIT code 10 to 19 or 20 to 29).
If the command does not return a result within the time specified for the ptrlcmd_ex_tmout operand in the server environment definition, HA Monitor assumes that a slowdown or hang of job processing occurred. In this case, HA Monitor determines that a server failure occurred on the active server. The following figure shows an overview of failure monitoring for a server in the monitor mode (in the case where no result is returned within the time specified for the ptrlcmd_ex_tmout operand in the server environment definition).
Specify the server monitoring command that you created for the ptrlcmd_ex operand in the server environment definition. In addition, specify the following operands in the server environment definition:
-
ptrlcmd_ex_inter (cannot be omitted)
-
ptrlcmd_ex_retry (can be omitted)
-
ptrlcmd_ex_tmout (can be omitted)
For details about how to check the server logs if a server failure is detected by the server monitoring command, see 7.3 Checking the server logs.
The patrol_ex.sh file is provided as a sample file for a server monitoring command in the HA Monitor sample file directory. Please use this file if necessary.
#!/bin/sh
#******************************************************************************
#* *
#* Linux(x86) HA Monitor *
#* This is a sample of the patrol command. *
#* This is specified in the ptrlcmd_ex operand. *
#* (For monitor mode server) *
#* Attention: This can not be specified in the patrolcommand operand. *
#* *
#* All Rights Reserved. Copyright (C) 2017, Hitachi, Ltd. *
#* *
#******************************************************************************
set -x
# The object program to monitor
PROGRAM=/home/xxxx/yyyy
# The definition of command
PS=/bin/ps
GREP=/bin/grep
# Is the object program to monitor operating ?
EXIST=`$PS -efl | $GREP $PROGRAM | $GREP -v grep`
# When the object program to monitor is not operating,
# the variable EXIST is empty.
if [ "$EXIST" = "" ]
then
    # This patrol command terminates, because the object program
    # to monitor is not operating.
    exit 10
fi
# Please describe commands and processes that can confirm that applications and tasks are running.
# For example, a command that throws a request to an application and receives a response.
# If it is a database, describe SQL to access the database.
# The end result (EXIT code) is returned according to the command execution result.
# For details, refer to the manual.
#
# ex)
# Describe the SQL to access the database.
# if [ "$?" -ne "0" ]
# then
# exit 1
# fi
exit 0
In the server monitoring command, the EXIST=`$PS -efl | $GREP $PROGRAM | $GREP -v grep` entry checks whether a process of the monitored program exists. You can check for the process, for example, from the output of the ps OS command.
In the section indicated by Describe the SQL to access the database, specify a command or another process that allows you to confirm that the application or job processing is operating. For example, you can specify a command that issues a request to an application and receives a reply, or an SQL statement that accesses a database. Return the termination result (EXIT code) according to the execution result of that command. For details about the termination result (EXIT code), see Table 6-12 Termination results (EXIT codes) that can be used according to the status of the monitoring-target job processing.
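The application-level check described above can itself slow down or hang, in which case the monitoring command would not return before the time specified for ptrlcmd_ex_tmout. The following sketch shows one way to bound such a check, assuming the GNU coreutils timeout command is available. The function name bounded_check, the CHECK_LIMIT value, and the check_app command are assumptions introduced for illustration and are not part of the patrol_ex.sh sample. As in the commented example in the sample, a failed check is reported with EXIT code 1.

```shell
#!/bin/sh
# Sketch only: bounded_check and CHECK_LIMIT are hypothetical names.
CHECK_LIMIT=5   # seconds; assumed to be shorter than ptrlcmd_ex_tmout

# Run an application-level check command with a time limit.
#   usage: bounded_check <seconds> <command> [args...]
# Returns 0 if the check succeeds, or 1 if it fails or does not finish
# within the limit.
bounded_check() {
    limit=$1
    shift
    if timeout "$limit" "$@" >/dev/null 2>&1
    then
        return 0    # the check succeeded: normal
    fi
    return 1        # failed or timed out: let HA Monitor re-execute
}

# In a monitoring command, the last lines might read
# (check_app is a hypothetical check command):
#   bounded_check "$CHECK_LIMIT" check_app
#   exit $?
```

Whether a timed-out check should be reported as a temporary problem (1 to 9) or as a failure (10 to 19) depends on the job processing being monitored. This sketch treats it as temporary.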
(3) How to create a server monitoring command (if the command is specified for the patrolcommand operand in the server environment definition)
Create a server monitoring command so that the following conditions are met:
-
If a particular operating environment is required for monitoring programs, set it. In RHEL7 or a later version, you can specify environment variables in the Unit settings file. However, the Unit settings file is overwritten when an overwrite installation of HA Monitor is performed. Therefore, we recommend that you specify an operating environment (environment variables) within the shell.
For details about the environment variables that are inherited when the server monitoring command is executed, see (2) How to create a server termination command in 6.13.2 Creating a server termination command.
-
The command must monitor the monitoring-target programs periodically.
-
If the command detects a program failure during monitoring, it must terminate (by exit).
Any termination code can be used.
-
The command must be able to receive the SIGTERM signal.
HA Monitor stops the server monitoring command by sending the SIGTERM signal to the monitoring command. Therefore, the command must be able to receive the SIGTERM signal.
Specify the server monitoring command that you created for the patrolcommand operand in the server environment definition.
The patrol.sh file is provided as a sample file for a server monitoring command in the HA Monitor sample file directory. Please use this file if necessary.
The following shows an example of creating a B-shell-based monitoring command.
#!/bin/sh
# The object program to monitor
PROGRAM=/home/xxxx/yyyy
# The definition of command
PS=/bin/ps
GREP=/bin/grep
# Main loop
while true
do
    # Is the object program to monitor operating ?
    EXIST=`$PS -efl | $GREP $PROGRAM | $GREP -v grep`
    # When the object program to monitor is not operating,
    # the variable EXIST is empty.
    if [ "$EXIST" = "" ]
    then
        # This patrol command terminates, because the object program
        # to monitor is not operating.
        exit
    fi
    # The monitoring is continued, because the object program
    # to monitor is operating.
    sleep 5
done
In the preceding coding example, the server monitoring command has an internal loop and checks inside the loop whether a process of the monitored program exists. You can check for the process, for example, from the output of the ps OS command. If the process exists, the command sleeps and then repeats the check. If the process does not exist, the command terminates.
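A loop-based monitoring command spends most of its time in sleep, and a shell runs a signal trap only after the current foreground command finishes. The following sketch shows one way to react to the SIGTERM signal without waiting for the sleep to end, by sleeping in the background and waiting for it. The function names program_is_running and monitor_loop and the exit status used in the trap are assumptions introduced for illustration and are not part of the patrol.sh sample.

```shell
#!/bin/sh
# Sketch only: the function names and the exit status used on SIGTERM
# are hypothetical.
PROGRAM=/home/xxxx/yyyy     # placeholder path, as in the example above
PS=/bin/ps
GREP=/bin/grep
sleep_pid=

# On SIGTERM, stop the pending sleep (if any) and leave at once.
trap 'kill "$sleep_pid" 2>/dev/null; exit 0' TERM

# Succeeds while the ps output contains an entry for $PROGRAM.
program_is_running() {
    $PS -efl | $GREP -- "$PROGRAM" | $GREP -v grep >/dev/null
}

# Repeat a check command every <interval> seconds until it fails.
#   usage: monitor_loop <interval> <command> [args...]
monitor_loop() {
    interval=$1
    shift
    while "$@"
    do
        # Sleep in the background and wait for it, so that the shell
        # can run the TERM trap without waiting for the sleep to end.
        sleep "$interval" &
        sleep_pid=$!
        wait "$sleep_pid"
    done
}

# A monitoring command built on these functions might end with:
#   monitor_loop 5 program_is_running
#   exit
```

Because monitor_loop returns as soon as the check fails, the exit that follows it plays the same role as the exit inside the loop in the preceding example.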