Data acquisition settings using failure detection time commands (Systems for executing J2EE applications)

This subsection describes how to specify the settings to acquire the data for troubleshooting using the failure detection time commands. Note that you can collect the material acquired by the failure detection time commands as the snapshot log.

There are two types of failure detection time commands: commands that the system provides and commands that the user creates.

With the default settings, when an error occurs in a logical server, the failure detection time commands provided by the system are executed, and thread dumps and trace based performance analysis files are acquired. The snapshot log is collected before the logical server where the error occurred is terminated. For the information that can be acquired by executing the failure detection time commands provided by the system, see 2.3.2(1) Information that can be acquired by executing the failure detection time commands provided by the system.

To change the operation settings of the failure detection time commands provided by the system, (1) Environment settings in the Management Server and (2) Environment settings in the Administration Agent are necessary. Also, when using the user created failure detection time commands, (1) Environment settings in the Management Server, (2) Environment settings in the Administration Agent, and (3) Creating a command file of the user created failure detection time commands are necessary. Respective settings are described in points from (1) to (3).

Note: To collect the material acquired by the user created failure detection time commands as the snapshot log, you need to add the collection destination of that material to the snapshot log collection destination. For details about addition of the snapshot log collection destination, see 3.3.3(3) Customizing the snapshot log collection destination.

Organization of this subsection: (1) Environment settings in the Management Server; (2) Environment settings in the Administration Agent; (3) Creating a command file of the user created failure detection time commands

(1) Environment settings in the Management Server

Use mserver.properties (environment settings file of Management Server) to specify the operation of the failure detection time commands.

Specify the operation of the failure detection time commands in the following keys.

Key	Description	Setting requirement (System)	Setting requirement (User)
com.cosminexus.mngsvr.sys_cmd.abnormal_end.enabled	Specify whether to use system provided failure detection time commands. The default setting is true (use).	O	--
com.cosminexus.mngsvr.usr_cmd.abnormal_end.enabled	Specify whether to use the user created failure detection time commands. The default setting is false (do not use).	--	R
com.cosminexus.mngsvr.sys_cmd.abnormal_end.timeout	Specify the waiting period for termination of system provided failure detection time commands. If the command does not terminate even after the specified time lapses, the user recovery process continues.	O	--
com.cosminexus.mngsvr.usr_cmd.abnormal_end.timeout	Specify the waiting period for termination of the user created failure detection time commands.	--	O
com.cosminexus.mngsvr.snapshot.auto_collect.enabled	Specify whether to acquire the snapshot log when an error occurs or for batch restart. The default setting is true (acquire the snapshot log).	O	O
com.cosminexus.mngsvr.snapshot.collect.point	Specify one of the following as the snapshot log collection timing: before terminating the logical server, or before restarting the J2EE server. The default timing is before terminating the logical server.	O	O

Legend: System: Setting requirement when using the system provided failure detection time commands; User: Setting requirement when using the user created failure detection time commands; R: Required; O: Required only when changing the default settings; --: Not required

(2) Environment settings in the Administration Agent

Use adminagent.properties (Administration Agent property file) to specify the material to be acquired by the failure detection time commands.

In the following keys of adminagent.properties, specify the number of snapshot log files to be collected, whether the system provided failure detection time commands acquire each type of material, and the path of the user created failure detection time commands. For details on the files defining the snapshot log collection target, see 3.3.3(3) Customizing the snapshot log collection destination.

Key	Description	Setting requirement (System)	Setting requirement (User)
adminagent.snapshotlog.num_snapshots	Specify the number of snapshot log files to be collected as the primary delivery data for each logical server.	O	O
adminagent.snapshotlog.listfile.2.num_snapshots	Specify the number of snapshot log files to be collected as the secondary delivery data for each logical server.	O	O
adminagent.j2ee.sys_cmd.abnormal_end.threaddump	Specify whether to acquire thread dumps using the system provided failure detection time commands.	O	--
adminagent.sys_cmd.abnormal_end.prftrace	Specify whether to acquire the trace based performance analysis file using the system provided failure detection time commands.	O	--
adminagent.logical-server-type.usr_cmd.abnormal_end	Specify the path of failure detection time commands to be executed for each type of logical server.	--	R

Legend: System: Setting requirement when using the system provided failure detection time commands; User: Setting requirement when using the user created failure detection time commands; R: Required; O: Required only when changing the default settings; --: Not required
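
As a rough aid for checking the settings in (1) and (2), the following shell sketch reads the two property files and reports whether the user created failure detection time commands are enabled and whether a command path is defined for a given logical server type. The function names get_prop and check_user_cmd_settings are hypothetical, the file locations are passed as arguments because they depend on your installation, and the sketch assumes simple key=value lines.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY ($2) from properties FILE ($1).
# Assumes plain "key=value" lines; the last definition wins.
get_prop() {
    # Escape dots so that they match literally in the sed pattern.
    key_re=$(printf '%s' "$2" | sed 's/\./\\./g')
    sed -n "s/^[[:space:]]*${key_re}[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Hypothetical check: $1 = mserver.properties, $2 = adminagent.properties,
# $3 = logical server type as used in the key name (for example, ctm).
check_user_cmd_settings() {
    enabled=$(get_prop "$1" com.cosminexus.mngsvr.usr_cmd.abnormal_end.enabled)
    cmd_path=$(get_prop "$2" "adminagent.$3.usr_cmd.abnormal_end")
    if [ "$enabled" != "true" ]; then
        echo "user created commands are not enabled"
        return 1
    fi
    if [ -z "$cmd_path" ]; then
        echo "no command path is set for $3"
        return 1
    fi
    echo "command for $3: $cmd_path"
}
```

For example, calling check_user_cmd_settings with the paths of the two files and ctm as the third argument prints the configured command path, or a message naming the missing setting.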

(3) Creating a command file of the user created failure detection time commands

You can code the user created failure detection time commands in a command file (batch file or shell script file). In the command file, you can reference the environment variables described in the following table to execute commands using information about the logical server where the error occurred and about the error itself.

Table 3-7 Environment variables that you can code in the command file of the user created failure detection time commands

Environment variable	Description
COSMI_MNG_LSNAME	Logical server name of the logical server where the error occurred. When an error occurs in the naming service of the logical CTM, the logical server name of logical CTM will be set up.
COSMI_MNG_RSNAME	Actual server name of the logical server where the error occurred. For a logical server other than a J2EE server or SFO server, the logical server name is set.
COSMI_MNG_LSPID	Process IDs to be monitored when the logical server starts. When monitoring multiple process IDs on an indirectly started logical user server, the process IDs are specified, demarcated by commas (,) in the order in which the process IDs are acquired by the command executed for acquiring the process IDs when the logical user server is started.
COSMI_MNG_LSARGS	Command line when the logical server is started.
COSMI_MNG_TIME_SUSPENDED	Time at which the hang-up was detected, expressed as the elapsed time in milliseconds since 00:00:00 on January 1, 1970, Coordinated Universal Time (UTC). Note that the value is set only if a hang-up is detected.
COSMI_MNG_TIME_TERMINATED	Time at which the abnormal termination (process down) was detected, expressed as the elapsed time in milliseconds since 00:00:00 on January 1, 1970, Coordinated Universal Time (UTC). Note that the value is not set if a hang-up occurs.
COSMI_MNG_WEB_SYSTEM	Web system affiliated to the logical server where an error occurs. The value is not required if you do not use the Smart Composer function.
COSMI_MNG_TIER	Physical tier affiliated to the logical server where an error occurs. The value is not required if you do not use the Smart Composer function.
COSMI_MNG_UNIT	Service unit affiliated to the logical server where an error occurs. The value is not required if you do not use the Smart Composer function.
COSMI_MNG_HWS	Cosminexus HTTP Server installation directory.
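
When COSMI_MNG_LSPID holds several process IDs demarcated by commas, a command file has to split the value before passing each ID to another command. The following is a minimal shell sketch of that step. The name split_pids is a hypothetical helper, and the loop only prints each ID so that you can substitute the command you need.

```shell
#!/bin/sh
# Hypothetical helper: print each process ID of a comma-separated list ($1)
# on its own line. The body runs in a subshell so that the changes to IFS
# and to the shell options do not leak into the caller.
split_pids() (
    IFS=,
    set -f        # disable filename expansion while splitting
    set -- $1     # unquoted on purpose: the shell splits on commas
    for pid in "$@"; do
        echo "$pid"
    done
)

# Process every monitored process ID in the order in which it was listed.
for pid in $(split_pids "${COSMI_MNG_LSPID:-}"); do
    echo "process ID to handle: $pid"
done
```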

The Management Server cannot acquire the standard output and standard error output of the commands executed as failure detection time commands. To keep the standard output and standard error output of a command, write the information to a file during command execution.
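
One way to do this in a shell script is to group the commands and redirect their output to a file. In the following sketch, write_failure_log and LOG_DIR (with its default directory) are hypothetical names chosen for illustration. Only the COSMI_MNG_* variables come from Table 3-7.

```shell
#!/bin/sh
# Hypothetical helper: append the failure context to a log file.
# LOG_DIR is an assumed variable; it is not set by the Management Server.
write_failure_log() {
    log_dir="${LOG_DIR:-/tmp/failure_cmd_logs}"
    mkdir -p "$log_dir" || return 1
    log_file="$log_dir/${COSMI_MNG_LSNAME:-unknown}.log"
    {
        echo "logical server: ${COSMI_MNG_LSNAME:-}"
        echo "real server: ${COSMI_MNG_RSNAME:-}"
        echo "process IDs: ${COSMI_MNG_LSPID:-}"
        # COSMI_MNG_TIME_TERMINATED is not set when a hang-up occurs.
        if [ -n "${COSMI_MNG_TIME_TERMINATED:-}" ]; then
            echo "cause: abnormal termination"
        else
            echo "cause: hang-up"
        fi
    } >>"$log_file" 2>&1
}

write_failure_log
```

Any other command placed inside the braces has its standard output and standard error output appended to the same file.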

(a) Examples of obtaining the user dump or core dump

The following examples describe the execution of the drwtsn32 or kill command when an error is detected in the J2EE server and the collection of user dumps or core dumps:

In Windows

 
rem Determine whether the error occurred because the process is down or hung up, based on whether the time at which the process down was detected is set.
if defined COSMI_MNG_TIME_TERMINATED goto END
 
rem Acquire a user dump because the error occurred due to hang-up of the process.
"C:\WINDOWS\system32\drwtsn32.exe" -p %COSMI_MNG_LSPID%
 
:END

In UNIX

 
#!/bin/sh
 
# Determine whether the error occurred because the process is down or hung up, based on whether the time at which the process down was detected is set.
if [ "$COSMI_MNG_TIME_TERMINATED" = "" ] ; then
 
# Acquire a core dump because the error occurred due to hang-up of the process.
/bin/kill -6 $COSMI_MNG_LSPID
fi

(b) Example of obtaining the thread dump

The following is an example of the case in which the cjdumpsv command is executed to obtain the J2EE server (real server name: J2EEServer) thread dump when a Web server error occurs.

In this example, the cjdumpsv command is executed multiple times so that you can check how the status of each thread changes over time. As a guideline, execute the cjdumpsv command about ten times at intervals of three seconds.

In Windows

 
rem Determine from the environment variables whether the error occurred because the process is down or hung up.
if defined COSMI_MNG_TIME_TERMINATED goto END
 
rem Acquire the thread dump because the error occurred due to the hung-up process.
set COUNT=10
set INTERVAL=3000
for /l %%n in (1,1,%COUNT%) do (
"C:\Program Files\Hitachi\Cosminexus\CC\server\bin\cjdumpsv.exe" J2EEServer
 if not "%%n" == "%COUNT%" (
 rem Stand by until the next thread dump is collected.(milliseconds)
 echo WScript.sleep %INTERVAL% > sleep.vbs
 "C:\WINDOWS\system32\cscript.exe" sleep.vbs > NUL
 del sleep.vbs
 )
)
:END

In UNIX

 
#!/bin/sh
 
# Determine whether the error has occurred because the process is down or hung up, from the environment variables.
if [ "$COSMI_MNG_TIME_TERMINATED" = "" ] ; then
 
# Acquire the thread dump because the error has occurred due to the hung-up process.
COUNT=10
INTERVAL=3
for num in `seq $COUNT`
do
 /opt/Cosminexus/CC/server/bin/cjdumpsv J2EEServer
 if [ "$num" -ne "$COUNT" ]; then
 # Stand by until the next thread dump is collected.(Seconds)
 sleep $INTERVAL
 fi
done
fi

(c) Operating user-created failure detection time commands

The logical CTM starts, stops, and monitors two processes: the global CORBA Naming Service and the CTM daemon. Within the logical server, a different command is executed depending on whether the error is detected in the global CORBA Naming Service or in the CTM daemon.

When detecting an error in global CORBA Naming Service
The command specified in the adminagent.naming.usr_cmd.abnormal_end key is executed.
When detecting an error in CTM daemon
The command specified in the adminagent.ctm.usr_cmd.abnormal_end key is executed.