7.4.2 Collecting error information
This subsection describes the error information that is available when a failure occurs and how to collect it.
-
HA Monitor error information
Use the troubleshooting information collection command (monts command) to collect HA Monitor error information.
-
OS error information and statistical information
Use OS commands to collect OS error information and statistical information.
This subsection explains the HA Monitor error information and the troubleshooting information collection command (monts command). For details about the OS error information and statistical information, see the OS documentation.
(1) HA Monitor error information
HA Monitor provides the error information listed below. Use the troubleshooting information collection command (monts command) to collect all of this information. The data collected by the monts command is not compressed.
-
Definition information
Definition files (all files under /opt/hitachi/HAmon/etc)
-
Module trace information
Core files (all files beginning with /opt/hitachi/HAmon/core)
-
Message log information
System log files (all files beginning with /var/log/messages)
-
Memory information
If HA Monitor is running, its memory information is collected.
-
Trace information
Trace files (all files under /opt/hitachi/HAmon/spool)
The trace files store the history of HA Monitor operations and the execution results of the commands that have been issued.
You can use the trace files listed below to determine the causes of errors. Note that /opt/hitachi/HAmon/spool also stores the files that HA Monitor requires for operation. Do not change or delete these files, and do not perform any operations on this directory.
Table 7‒4: Files used to determine the causes of HA Monitor errors
-
File name: sms, oldsms
Description: These files collect information about host and server failures and slowdowns. When the trace information in the sms file reaches 100 KB, it wraps around: the contents of the sms file are saved to the backup file named oldsms, and the sms file is then cleared.
Purpose: Determining the causes of host and server failures
-
File name: server-alias-name.fslog, server-alias-name.fslog_old
Description: These files collect the execution results of the OS commands (fsck (or xfs_repair), mount, fuser, and umount) that are executed when HA Monitor switches file systems. When the file size during trace collection exceeds the value specified in the fs_log_size operand in the HA Monitor environment settings, a backup file named server-alias-name.fslog_old is created.
Purpose: Determining the causes of file system switchover errors
-
File name: volume-group-name.vglog, volume-group-name.vglog_old
Description: These files collect the execution results of the OS command (vgchange command) that is executed when HA Monitor connects or disconnects volume groups. When the file size during trace collection exceeds 65,535 bytes, a backup file named volume-group-name.vglog_old is created.
Purpose: Determining the causes of shared disk connection errors
-
Monitoring history
Monitoring history file (/opt/hitachi/HAmon/history/patrol_history)
This file is used to collect host and server slowdown information. It provides more detailed information about slowdown periods than the SMS trace information file.
Monitoring history is useful when you re-evaluate the definition files and the system configuration to prevent failures that might be caused by heavy load. For details about the operation method using monitoring history to reduce failures, see 7.6 Operation for preventing failures caused by heavy load.
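Because the data collected by the monts command is not compressed, it can help to know roughly how much data the locations listed above hold. The following Python sketch is not part of HA Monitor and does not reproduce what the monts command does internally. It only counts the files under the documented paths and adds up their sizes. The function name inventory, the LOCATIONS table, and the root parameter are assumptions made for this illustration.

```python
import glob
import os

# Path patterns taken from the list above, written relative to a root directory.
# The dictionary itself is a hypothetical helper, not an HA Monitor data structure.
LOCATIONS = {
    "Definition information": "opt/hitachi/HAmon/etc/**",
    "Module trace information": "opt/hitachi/HAmon/core*",
    "Message log information": "var/log/messages*",
    "Trace information": "opt/hitachi/HAmon/spool/**",
    "Monitoring history": "opt/hitachi/HAmon/history/patrol_history",
}


def inventory(root):
    """Return {category: (file_count, total_bytes)} for the documented locations.

    root is the directory that stands in for "/" (for example, a copy of the
    host's file system). Only file metadata is read; nothing is modified.
    """
    result = {}
    for category, pattern in LOCATIONS.items():
        paths = glob.glob(os.path.join(root, pattern), recursive=True)
        files = [p for p in paths if os.path.isfile(p)]
        total = sum(os.path.getsize(p) for p in files)
        result[category] = (len(files), total)
    return result
```

A call such as inventory("/mnt/host-copy") (the path is hypothetical) returns one entry per category. Running the sketch against a copy rather than the live host is in keeping with the caution about not performing operations on /opt/hitachi/HAmon/spool.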
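The wrap-around behavior described for the trace files in Table 7‒4 is a form of size-based rotation with a single backup file. The following Python sketch illustrates that general technique only. It is not HA Monitor's implementation, the function name append_trace is hypothetical, and treating 100 KB as 102,400 bytes is an assumption. Run it only against scratch files, never against the files under /opt/hitachi/HAmon/spool.

```python
import os

# 100 KB threshold from the description of the sms file.
# Interpreting KB as 1,024 bytes is an assumption of this sketch.
SMS_LIMIT = 100 * 1024


def append_trace(path, backup_path, record, limit=SMS_LIMIT):
    """Append record (bytes) to path, rotating to backup_path when path is full.

    If path has already reached limit bytes, it is moved to backup_path
    (replacing any earlier backup), and the record starts a new, empty file.
    """
    if os.path.exists(path) and os.path.getsize(path) >= limit:
        os.replace(path, backup_path)  # the previous backup, if any, is overwritten
    with open(path, "ab") as f:
        f.write(record)
```

With one backup file, at most two generations of trace data exist at a time, so the older generation is lost at each rotation. This is one reason to collect error information soon after a failure.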
(2) Using the monts HA Monitor command to collect error information
If you execute the troubleshooting information collection command (monts command) after a failure, you can store the collected error information as archive files or onto a portable medium.
You must execute the monts command with superuser permissions. The following figure shows the procedure for collecting error information by using the troubleshooting information collection command (monts command).
When you execute the monts command, you can specify whether the error information is to be stored as an archive file or onto a portable medium.
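Before running the monts command, a short check can confirm that the superuser requirement stated above is met. The following Python sketch is an illustration only. The function name preflight is hypothetical, and the assumption that the monts command can be found through the PATH environment variable might not hold on your system. The sketch does not execute the monts command and does not assume any of its options.

```python
import os
import shutil


def preflight(euid=None, command="monts"):
    """Return a list of problems that would prevent running the command.

    euid defaults to the effective user ID of the current process
    (os.geteuid is available on Linux and other UNIX systems).
    """
    problems = []
    if euid is None:
        euid = os.geteuid()
    if euid != 0:
        problems.append("superuser permissions are required")
    # Assumption: the command is reachable through PATH or given as a full path.
    if shutil.which(command) is None:
        problems.append(f"{command} was not found on PATH")
    return problems
```

An empty list means that both checks passed. Otherwise, each string describes one condition to resolve before collecting the error information.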