Hitachi

For Linux(R) (x86) Systems HA Monitor Cluster Software


6.13.3 Creating a server monitoring command

A server monitoring command is used to monitor program status. If you create a server monitoring command, you can use HA Monitor to restart the server and perform automatic hot standby processing in the event of a server failure. If the server consists of only those UAPs that are monitored by the monitor-mode program management function, there is no need to create a server monitoring command.

This subsection explains when a server monitoring command is called and how to create one. For details about how to monitor servers, see 3.2.1 Monitoring a server in the monitor mode (for the active server). For details about the operating environment of commands that are executed from HA Monitor, see 6.22 Operating environment of commands that are executed from HA Monitor.

(1) Timing of calling a server monitoring command

HA Monitor executes the server monitoring command that you created with superuser permissions.

The server monitoring command is started after startup of the active server is completed in the active system.

(2) How to create a server monitoring command (if the command is specified for the ptrlcmd_ex operand in the server environment definition)

Create a server monitoring command so that the following conditions are met:

The following provides an overview of processing for each EXIT code.

If the EXIT code is 0 (normal), HA Monitor starts the monitoring command at regular intervals (the interval is specified for the ptrlcmd_ex_inter operand in the server environment definition). The following figure shows an overview of failure monitoring for a monitor-mode server (in the case where the EXIT code 0 is returned).

Figure 6‒38: Failure monitoring for a monitor-mode server (in the case where the EXIT code 0 is returned)

[Figure]

The command returns an EXIT code of 1 to 9 when a status check fails, a job processing error occurs, or the status cannot be determined. If the number of times one of these codes is returned exceeds the value specified for the ptrlcmd_ex_retry operand in the server environment definition, HA Monitor determines that the server has failed. The following figure shows an overview of failure monitoring for a monitor-mode server (in the case where the EXIT code 1 to 9 is returned). Note that this figure is an example in which 2 is specified for the ptrlcmd_ex_retry operand in the server environment definition.

Figure 6‒39: Failure monitoring for a monitor-mode server (in the case where the EXIT code 1 to 9 is returned)

[Figure]

If the command returns the EXIT code 10 to 19 or 20 to 29 (failure), HA Monitor does not re-execute the monitoring command and determines that a server failure occurred on the active server. The following figure shows an overview of failure monitoring for a server in the monitor mode (if the command returns the EXIT code 10 to 19 or 20 to 29).

Figure 6‒40: Failure monitoring for a monitor-mode server (in the case where the EXIT code 10 to 19 or 20 to 29 is returned)

[Figure]

If the command does not return a result within the time specified for the ptrlcmd_ex_tmout operand in the server environment definition, HA Monitor assumes that job processing has slowed down or hung. In this case, HA Monitor determines that a server failure occurred on the active server. The following figure shows an overview of failure monitoring for a monitor-mode server (in the case where no result is returned even after the time specified for the ptrlcmd_ex_tmout operand in the server environment definition passes).

Figure 6‒41: Failure monitoring for a monitor-mode server (in the case where no result is returned even after the time specified for the ptrlcmd_ex_tmout operand passes)

[Figure]
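
The handling of each EXIT code described above can be summarized in the following sketch. It only restates the rules in this subsection and is not HA Monitor's actual implementation. The function name classify_exit_code and the labels it prints are hypothetical, and codes outside the range 0 to 29 are labeled undefined only because this subsection does not describe them.

```shell
#!/bin/sh
# Illustrative only: maps a monitoring command's EXIT code to the handling
# described in this subsection. The function name and labels are hypothetical.
classify_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        # Normal: the command is started again after the ptrlcmd_ex_inter interval.
        echo "normal"
    elif [ "$code" -ge 1 ] && [ "$code" -le 9 ]; then
        # Status check failed: counted against the ptrlcmd_ex_retry value.
        echo "retry"
    elif [ "$code" -ge 10 ] && [ "$code" -le 29 ]; then
        # Failure: the command is not re-executed; a server failure is assumed.
        echo "failure"
    else
        # Not covered by this subsection.
        echo "undefined"
    fi
}
```

For example, classify_exit_code 10 prints failure, which corresponds to the exit 10 statement used in the sample file when the monitored process is missing.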

Specify the server monitoring command that you created for the ptrlcmd_ex operand in the server environment definition. In addition, specify the following operands in the server environment definition:

For details about how to check the server logs if a server failure is detected by the server monitoring command, see 7.3 Checking the server logs.

The patrol_ex.sh file is provided as a sample file for a server monitoring command in the HA Monitor sample file directory. Please use this file if necessary.

#!/bin/sh
#******************************************************************************
#*                                                                            *
#*    Linux(x86) HA Monitor                                                   *
#*    This is a sample of the patrol command.                                 *
#*    This is specified in the ptrlcmd_ex operand.                            *
#*    (For monitor mode server)                                               *
#*    Attention: This can not be specified in the patrolcommand operand.      *
#*                                                                            *
#*    All Rights Reserved. Copyright (C) 2017, Hitachi, Ltd.                  *
#*                                                                            *
#******************************************************************************
set -x
# The object program to monitor
PROGRAM=/home/xxxx/yyyy

# The definition of command
PS=/bin/ps
GREP=/bin/grep

# Is the object program to monitor operating ?
EXIST=`$PS -efl | $GREP $PROGRAM | $GREP -v grep`

# When the object program to monitor is not operating,
# the variable EXIST is empty.
if [ "$EXIST" = "" ]
then
    # This patrol command terminates, because the object program
    # to monitor is not operating.
    exit 10
fi

# Please describe commands and processes that can confirm that applications and tasks are running.
# For example, a command that throws a request to an application and receives a response,
# If it is a database, describe SQL to access the database.
# The end result (EXIT code) is returned according to the command execution result.
# For details, refer to the manual.
#
# ex)
#        Describe the SQL to access the database.
#        if [ "$?" -ne "0" ]
#        then
#                  exit 1
#        fi

exit 0

In the server monitoring command, the EXIST=`$PS -efl | $GREP $PROGRAM | $GREP -v grep` entry checks whether a process of the monitored program exists. You can check for the process, for example, from the output of the ps OS command.

In the section indicated by Describe the SQL to access the database, specify a command or another process that allows you to confirm that the application or job processing is operating. For example, you can specify a command that issues a request to an application and receives a reply or an SQL statement that accesses a database. The termination result (EXIT code) is returned as the execution result of the command. For details about the termination result (EXIT code), see Table 6-13 Termination results (EXIT codes) that can be used according to the status of the monitoring-target job processing.
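
As a minimal sketch of how that section might be filled in, the following helper runs a probe command and maps its result to an EXIT code in the 1 to 9 range. The function name probe_status and the probe command passed to it are assumptions made for illustration; they are not part of the sample file or of HA Monitor.

```shell
#!/bin/sh
# Hypothetical helper: runs a probe command and maps its result to the
# EXIT code that the monitoring command should return.
probe_status() {
    # "$@" is the probe command, for example a client call that issues a
    # request to the application or an SQL statement (assumption).
    if "$@" >/dev/null 2>&1; then
        return 0    # Probe succeeded: job processing is operating.
    else
        return 1    # Status check failed: EXIT code in the 1 to 9 range.
    fi
}

# Possible use inside the monitoring command, shown as a comment so that
# this sketch has no side effects (the path is a placeholder):
#   probe_status /path/to/your_probe_command || exit 1
```

Returning 1 rather than a code of 10 or higher means that a single failed probe is treated as a retryable status check failure instead of an immediate server failure.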

(3) How to create a server monitoring command (if the command is specified for the patrolcommand operand in the server environment definition)

Create a server monitoring command so that the following conditions are met:

Specify the server monitoring command that you created for the patrolcommand operand in the server environment definition.

The patrol.sh file is provided as a sample file for a server monitoring command in the HA Monitor sample file directory. Please use this file if necessary.

The following shows an example of a monitoring command written for the Bourne shell.

#!/bin/sh
# The object program to monitor
PROGRAM=/home/xxxx/yyyy
 
# The definition of command
PS=/bin/ps
GREP=/bin/grep
 
# Main loop
while true
do
    # Is the object program to monitor operating ?
    EXIST=`$PS -efl | $GREP $PROGRAM | $GREP -v grep`
 
    # When the object program to monitor is not operating,
    # the variable EXIST is empty.
    if [ "$EXIST" = "" ]
    then
        # This patrol command terminates, because the object program
        # to monitor is not operating.
        exit
    fi
 
    # The monitoring is continued, because the object program
    # to monitor is operating.
    sleep 5
done

In the preceding coding example, the server monitoring command has an internal loop and checks whether a program process exists inside the loop. You can check the process, for example, from the output results of the ps OS command. If a program process exists, the command continues processing of the loop to continue checking whether the process is alive.
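
The loop structure described above can also be written with the liveness check and the polling interval passed as parameters, which makes the loop easier to try out in isolation. This is only a sketch of the same idea; the function name monitor_until_gone and its arguments are hypothetical and do not appear in the patrol.sh sample file.

```shell
#!/bin/sh
# Hypothetical refactoring of the loop in the preceding example: the
# liveness check and the polling interval are parameters, not hard-coded.
monitor_until_gone() {
    interval=$1
    shift
    # "$@" is the liveness check. The loop continues while it succeeds
    # and ends as soon as it fails, that is, when the process is gone.
    while "$@"
    do
        sleep "$interval"
    done
    return 0
}

# Possible use, shown as a comment so that the sketch does not block:
#   is_alive() { /bin/ps -efl | /bin/grep /home/xxxx/yyyy | /bin/grep -v grep >/dev/null; }
#   monitor_until_gone 5 is_alive
#   exit
```

As in the preceding example, the function returns only when the check no longer finds the process, at which point the monitoring command terminates.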