uCosminexus Application Server, Maintenance and Migration Guide
This subsection describes how to specify the settings to acquire the data for troubleshooting using the failure detection time commands. Note that you can collect the material acquired by the failure detection time commands as the snapshot log.
There are two types of failure detection time commands: commands that the system provides and commands that the user creates.
According to the default settings, when an error occurs in a logical server, the failure detection time commands provided by the system are executed and thread dumps and trace based performance analysis are acquired. The snapshot log is collected before terminating the logical server where the error occurs. For the information that can be acquired by executing the failure detection time commands provided by the system, see 2.3.2(1) Information that can be acquired by executing the failure detection time commands provided by the system.
To change the operation settings of the failure detection time commands provided by the system, (1) Environment settings in the Management Server and (2) Environment settings in the Administration Agent are necessary. Also, when using the user created failure detection time commands, (1) Environment settings in the Management Server, (2) Environment settings in the Administration Agent, and (3) Creating a command file of the user created failure detection time commands are necessary. Respective settings are described in points from (1) to (3).
(1) Environment settings in the Management Server
Use mserver.properties (environment settings file of Management Server) to specify the operation of the failure detection time commands.
Specify the operation of the failure detection time commands in the following keys.
Key | Description | Setting requirement (System) | Setting requirement (User) |
---|---|---|---|
com.cosminexus.mngsvr.sys_cmd.abnormal_end.enabled | Specify whether to use system provided failure detection time commands. The default setting is true (use). | O | -- |
com.cosminexus.mngsvr.usr_cmd.abnormal_end.enabled | Specify whether to use the user created failure detection time commands. The default setting is false (do not use). | -- | R |
com.cosminexus.mngsvr.sys_cmd.abnormal_end.timeout | Specify the waiting period for termination of system provided failure detection time commands. If the command does not terminate even after the specified time lapses, the user recovery process continues. | O | -- |
com.cosminexus.mngsvr.usr_cmd.abnormal_end.timeout | Specify the waiting period for termination of the user created failure detection time commands. | -- | O |
com.cosminexus.mngsvr.snapshot.auto_collect.enabled | Specify whether to acquire the snapshot log when an error occurs or for batch restart. The default setting is true (acquire the snapshot log). | O | O |
com.cosminexus.mngsvr.snapshot.collect.point | Specify the snapshot log collection timing. | O | O |
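To check which behavior is in effect, the keys above can be read from the properties file with the documented defaults as fallbacks. The following shell sketch illustrates one way to do this. The helper name get_prop and the default file path are assumptions made for illustration, and the parsing handles only simple key=value lines (no line continuations or escape sequences).

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): print the value of key $2
# in properties file $1, or the default $3 if the key is not set.
get_prop() {
    # Escape the dots in the key so that sed matches them literally.
    key_re=$(printf '%s' "$2" | sed 's/[.]/\\./g')
    value=$(sed -n "s/^[[:space:]]*${key_re}[[:space:]]*=[[:space:]]*//p" "$1" 2>/dev/null | tail -n 1)
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf '%s\n' "$3"
    fi
}

# Assumption: the file path is passed as the first argument.
PROPS=${1:-mserver.properties}

# The fallback values are the defaults stated in the table above.
echo "system commands enabled: $(get_prop "$PROPS" com.cosminexus.mngsvr.sys_cmd.abnormal_end.enabled true)"
echo "user commands enabled:   $(get_prop "$PROPS" com.cosminexus.mngsvr.usr_cmd.abnormal_end.enabled false)"
echo "snapshot auto collect:   $(get_prop "$PROPS" com.cosminexus.mngsvr.snapshot.auto_collect.enabled true)"
```

If the same key appears more than once, this sketch reports the last occurrence, which may or may not match how the Management Server resolves duplicates.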
(2) Environment settings in the Administration Agent
Use adminagent.properties (Administration Agent property file) to specify the material to be acquired by the failure detection time commands.
In the following keys of adminagent.properties, specify the number of snapshot logs to be acquired, whether each type of material is collected by the failure detection time commands, and the path of the failure detection time commands. For details on the files defining the snapshot log collection target, see 3.3.3(3) Customizing the snapshot log collection destination.
Key | Description | Setting requirement (System) | Setting requirement (User) |
---|---|---|---|
adminagent.snapshotlog.num_snapshots | Specify the number of snapshot log files to be collected as the primary delivery data for each logical server. | O | O |
adminagent.snapshotlog.listfile.2.num_snapshots | Specify the number of snapshot log files to be collected as the secondary delivery data for each logical server. | O | O |
adminagent.j2ee.sys_cmd.abnormal_end.threaddump | Specify whether to acquire thread dumps using the system provided failure detection time commands. | O | -- |
adminagent.sys_cmd.abnormal_end.prftrace | Specify whether to acquire the trace based performance analysis file using the system provided failure detection time commands. | O | -- |
adminagent.logical-server-type.usr_cmd.abnormal_end | Specify the path of failure detection time commands to be executed for each type of logical server. | -- | R |
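Before relying on a user created command, it can help to confirm that each configured path points to an executable file. The following shell sketch lists every usr_cmd.abnormal_end entry in a properties file and reports whether the file is executable. The helper name check_usr_cmds and the default file path are assumptions made for illustration, and the sketch assumes that each value is a plain UNIX path written on a single line.

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): for each
# adminagent.<logical-server-type>.usr_cmd.abnormal_end entry in file $1,
# print the type token and whether the configured command file is executable.
check_usr_cmds() {
    sed -n 's/^[[:space:]]*adminagent\.\([^.]*\)\.usr_cmd\.abnormal_end[[:space:]]*=[[:space:]]*\(.*\)$/\1 \2/p' "$1" 2>/dev/null |
    while read -r type path; do
        if [ -x "$path" ]; then
            echo "$type: OK $path"
        else
            echo "$type: NOT EXECUTABLE $path"
        fi
    done
}

# Assumption: the file path is passed as the first argument.
check_usr_cmds "${1:-adminagent.properties}"
```

The type token is printed exactly as it appears in the key, so the sketch does not need to know the list of valid logical server types.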
(3) Creating a command file of the user created failure detection time commands
You can code the user created failure detection time commands in a command file (batch file or shell script file). In the command file, you can use the environment variables described in the following table to execute commands based on information about the logical server where the error occurred and information related to the error.
Table 3-7 Environment variables that you can code in the command file of the user created failure detection time commands
Environment variable | Description |
---|---|
COSMI_MNG_LSNAME | Logical server name of the logical server where the error occurred. When an error occurs in the naming service of the logical CTM, the logical server name of logical CTM will be set up. |
COSMI_MNG_RSNAME | Actual server name of the logical server where the error occurred. For a logical server other than a J2EE server or SFO server, the logical server name is set. |
COSMI_MNG_LSPID | Process IDs to be monitored when the logical server starts. When monitoring multiple process IDs on an indirectly started logical user server, the process IDs are specified, demarcated by commas (,) in the order in which the process IDs are acquired by the command executed for acquiring the process IDs when the logical user server is started. |
COSMI_MNG_LSARGS | Command line when the logical server is started. |
COSMI_MNG_TIME_SUSPENDED | Time at which a hang-up is detected, expressed as the elapsed time (unit: ms) since 00:00:00 on January 1, 1970, Coordinated Universal Time (UTC). Note that the value is set only if a hang-up is detected. |
COSMI_MNG_TIME_TERMINATED | Time at which abnormal termination (process down) is detected, expressed as the elapsed time (unit: ms) since 00:00:00 on January 1, 1970, Coordinated Universal Time (UTC). Note that the value is not set if a hang-up occurs. |
COSMI_MNG_WEB_SYSTEM | Web system affiliated to the logical server where an error occurs. The value is not required if you do not use the Smart Composer function. |
COSMI_MNG_TIER | Physical tier affiliated to the logical server where an error occurs. The value is not required if you do not use the Smart Composer function. |
COSMI_MNG_UNIT | Service unit affiliated to the logical server where an error occurs. The value is not required if you do not use the Smart Composer function. |
COSMI_MNG_HWS | Cosminexus HTTP Server installation directory. |
The Management Server cannot acquire the standard output and standard error output of the commands executed as failure detection time commands. To keep the standard output and standard error output of a command, have the command file write them to a file during command execution.
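Because the output is not captured, a command file usually redirects its own output to a file. The following shell sketch records several of the environment variables from Table 3-7 in a log file, converts the millisecond timestamps to seconds, and splits the comma-separated process IDs. The log location /tmp/failure_cmd.log and the helper names ms_to_sec and split_pids are assumptions made for illustration. Any writable path can be used instead.

```shell
#!/bin/sh
# Assumption: LOGFILE is a hypothetical location; any writable path works.
LOGFILE=${LOGFILE:-/tmp/failure_cmd.log}

# Convert a millisecond value (as set in COSMI_MNG_TIME_*) to whole seconds.
ms_to_sec() {
    echo $(( $1 / 1000 ))
}

# Print each process ID of a comma-separated list on its own line.
split_pids() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Everything written inside the braces goes to the log file,
# including error messages (2>&1).
{
    echo "logical server: ${COSMI_MNG_LSNAME:-unknown}"
    echo "real server:    ${COSMI_MNG_RSNAME:-unknown}"
    if [ -n "$COSMI_MNG_TIME_TERMINATED" ]; then
        echo "cause: process down, detected at $(ms_to_sec "$COSMI_MNG_TIME_TERMINATED") s since 1970-01-01 UTC"
    elif [ -n "$COSMI_MNG_TIME_SUSPENDED" ]; then
        echo "cause: hang-up, detected at $(ms_to_sec "$COSMI_MNG_TIME_SUSPENDED") s since 1970-01-01 UTC"
    fi
    for pid in $(split_pids "$COSMI_MNG_LSPID"); do
        echo "monitored pid: $pid"
    done
} >> "$LOGFILE" 2>&1
```

The process-down check comes first because COSMI_MNG_TIME_TERMINATED is not set when a hang-up occurs, which is the same test the examples in this subsection use to tell the two cases apart.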
The following examples describe the execution of the drwtsn32 or kill command when an error is detected in the J2EE server and the collection of user dumps or core dumps:
Windows (batch file):

    rem Determine whether the error has occurred because the process is down or hung up,
    rem from the date and time at which it is detected that the process is down.
    if defined COSMI_MNG_TIME_TERMINATED goto END
    rem Acquire a user dump because the error has occurred due to hang-up of the process.
    "C:\WINDOWS\system32\drwtsn32.exe" -p %COSMI_MNG_LSPID%
    :END

UNIX (shell script):

    #!/bin/sh
    # Determine whether the error has occurred because the process is down or hung up,
    # from the date and time at which it is detected that the process is down.
    if [ "$COSMI_MNG_TIME_TERMINATED" = "" ] ; then
        # Acquire a core dump because the error occurred due to hang-up of the process.
        /bin/kill -6 $COSMI_MNG_LSPID
    fi
The following is an example of the case in which the cjdumpsv command is executed to obtain the J2EE server (real server name: J2EEServer) thread dump when a Web server error occurs.
In this example, the cjdumpsv command is executed multiple times so that you can check how the status of each thread changes over time. As a guideline, execute the cjdumpsv command about ten times at three-second intervals.
Windows (batch file):

    rem Determine whether the error has occurred because the process is down or hung up,
    rem from the environment variables.
    if defined COSMI_MNG_TIME_TERMINATED goto END
    rem Acquire the thread dump because the error has occurred due to the hung-up process.
    set COUNT=10
    set INTERVAL=3000
    for /l %%n in (1,1,%COUNT%) do (
        "C:\Program Files\Hitachi\Cosminexus\CC\server\bin\cjdumpsv.exe" J2EEServer
        if not "%%n" == "%COUNT%" (
            rem Stand by until the next thread dump is collected. (milliseconds)
            echo WScript.sleep %INTERVAL% > sleep.vbs
            "C:\WINDOWS\system32\cscript.exe" sleep.vbs > NUL
            del sleep.vbs
        )
    )
    :END

UNIX (shell script):

    #!/bin/sh
    # Determine whether the error has occurred because the process is down or hung up,
    # from the environment variables.
    if [ "$COSMI_MNG_TIME_TERMINATED" = "" ] ; then
        # Acquire the thread dump because the error has occurred due to the hung-up process.
        COUNT=10
        INTERVAL=3
        for num in `seq $COUNT`
        do
            /opt/Cosminexus/CC/server/bin/cjdumpsv J2EEServer
            if [ "$num" -ne "$COUNT" ]; then
                # Stand by until the next thread dump is collected. (seconds)
                sleep $INTERVAL
            fi
        done
    fi
The logical CTM starts, stops, and monitors two processes: the global CORBA Naming Service and the CTM daemon. Within the logical server, different commands are executed depending on whether the error is detected in the global CORBA Naming Service or in the CTM daemon.
Moreover, because an error can be detected in either of the two processes (the CTM daemon or the global CORBA Naming Service) in the logical CTM, the log that reports the startup of the failure detection time commands for the logical server (CTM) is output to the Management Server log.
All Rights Reserved. Copyright (C) 2013, Hitachi, Ltd.