Job Management Partner 1/Automatic Job Management System 3 Administration Guide
This subsection describes the flow of processing when an error occurs in JP1/AJS3 - Manager and a failover is performed, and how information is inherited when a start condition or an event job is defined.
- Organization of this subsection
- (1) Flow of processing after node switching
(1) Flow of processing after node switching
The following figure shows the processing if node switching occurs in JP1/AJS3 - Manager during operation.
Figure 11-4 Processing if node switching occurs in JP1/AJS3 - Manager
The flow of events in the system processing is as follows:
1. When the failover occurs, the system kills the jobnets and jobs being executed by JP1/AJS3 - Manager. The statuses of the jobnets and jobs being executed by JP1/AJS3 - Agent remain Now running.
This status is managed in the JP1/AJS3 database on the shared disk.
2. The contents of the JP1/AJS3 database are inherited by the secondary node.
3. The JP1/AJS3 - Manager service on the secondary node starts.
4. The statuses of jobs and jobnets change automatically when the JP1/AJS3 service starts, according to the service start mode.
To check the service start mode, execute the following command, and then check the value output for the STARTMODE environment setting parameter (a brief example appears after step 5).
/opt/jp1base/bin/jbsgetcnf \
-h {JP1_DEFAULT|logical-host-name}# \
-c JP1AJSMANAGER \
-n scheduler-service-name
The status changes and the subsequent flow of system processing are described below for each service start mode.
- #
- In the {JP1_DEFAULT|logical-host-name} part, specify JP1_DEFAULT if the host is a physical host, and the logical host name if the host is a logical host.
For details about the STARTMODE environment setting parameter that changes the service start mode, see 2.2 Setting up the scheduler service environment in the Job Management Partner 1/Automatic Job Management System 3 Configuration Guide 2.
- When the service start mode is set to Cold-start:
JP1/AJS3 - Manager on the secondary node inherits only the definition information for the jobnets and jobs as it was immediately before the failover occurred. All jobnets are placed in the Not registered status. To restart operation, re-register the jobnets for execution.
Perform a cold start when it is safer to restart the jobnets from the beginning than to have the operator check the job statuses. Before doing so, make sure that there is no harm in starting identical jobs or executing the same job twice.
- When the service start mode is set to Warm-start:
JP1/AJS3 - Manager on the secondary node inherits the statuses immediately before the failover occurred. When the service starts, it changes the status of each job that was Waiting to execute, Now queuing, or Now running to reflect the actual result: if the job had not been executed, the status becomes Not executed + Ended; if the job was being executed or its status cannot be acquired, the status becomes Unknown end status.
The status of the jobnet changes to Interrupted.
Jobnets that had not started will start on schedule. For jobnets that terminated abnormally because of the warm start, check the changed statuses and then manually re-execute them. If a start condition was being monitored, JP1/AJS3 - Manager on the secondary node inherits the events received before the error occurred.
Perform a warm start when you want the operator to check the statuses of the jobs that were being executed and decide whether to continue operation.
- When the service start mode is set to Hot-start:
JP1/AJS3 - Manager on the secondary node inherits the statuses immediately before the failover occurred. It obtains information about the jobs in Now running status from the servers where the jobs were running, and automatically reproduces the actual status of each job where possible.
If the actual status of each job is acquired successfully, the jobnet resumes execution automatically as defined, without needing to be re-executed. If a start condition was being monitored, JP1/AJS3 - Manager on the secondary node inherits information about the events received before the failover occurred.
If information cannot be obtained from the servers where the jobs were running, the jobs are placed in the Ended abnormally status. In this case, check the job statuses and manually re-execute the jobnet.
Specify a hot start when you want operation to resume automatically after a failover.
For details about the setting procedure, see 4.2 Environment setting parameter settings in the Job Management Partner 1/Automatic Job Management System 3 Configuration Guide 1.
5. If necessary, manually re-execute the jobs and jobnets whose statuses were changed in step 4, and then resume system operation.
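The following is a minimal sketch of checking the start mode and changing it from the command line, assuming a hypothetical logical host node0 and scheduler service AJSROOT1. The output line shown and the jajs_config usage are illustrative only; see the Configuration Guide references above for the authoritative procedure and command locations.
# Check the current start mode of scheduler service AJSROOT1 on logical host node0.
# The value of STARTMODE in the output is cold, warm, or hot.
/opt/jp1base/bin/jbsgetcnf -h node0 -c JP1AJSMANAGER -n AJSROOT1
# Relevant output line (format shown here is an assumption):
# "STARTMODE"="warm"
# Change the start mode to hot, and then restart the JP1/AJS3 service on the
# logical host so that the new setting takes effect.
jajs_config -k "[node0\JP1AJSMANAGER\AJSROOT1]" "STARTMODE"="hot"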
- Operating a cluster system when a start condition is changed:
- If you change a start condition during operation, the change takes effect from the next execution schedule. Therefore, if node switching occurs in JP1/AJS3 - Manager on the primary node and the secondary node takes over the processing, monitoring that is already in progress continues with the old start condition.
- For example, suppose that schedule rule 1 defines 11:00 as the start time and schedule rule 2 defines 13:00 as the start time.
- If you change the start condition at 11:30, schedule rule 1 is monitored using the old start condition and schedule rule 2 is monitored using the new start condition.
- If node switching occurs between 11:00 and 12:00, monitoring based on schedule rule 1 is inherited using the old start condition (only if the restart occurs within the valid time period), and schedule rule 2 is monitored using the new start condition.
- Operating a cluster system while JP1/AJS3 - View is connected:
- The ajsmonsvr process is generated when JP1/AJS3 - View connects. If an ajsmonsvr process that accesses the shared disk remains at node switching, the shared disk cannot be unmounted. To stop the ajsmonsvr process, stop the ajsinetd process.
- Note that the cluster software forcibly terminates any process accessing the shared disk at node switching, so you do not normally need to stop the ajsinetd process explicitly. However, stop the ajsinetd process if the forced termination causes a problem, such as an unwanted message being displayed.
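For reference, the following is a minimal sketch of checking for leftover processes before the shared disk is unmounted. The process names are those mentioned above; the use of ps and grep is an assumption about the environment, not a documented JP1/AJS3 procedure.
# Check whether processes that can hold the shared disk are still running
# (ajsmonsvr is started per JP1/AJS3 - View connection; ajsinetd accepts those connections).
ps -ef | grep -e ajsmonsvr -e ajsinetd | grep -v grep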
- Operating a cluster system with JP1/AJS3 Console:
- At a failover, the processes of JP1/AJS3 Console end as described below, so they may remain for a while. To end a JP1/AJS3 Console process immediately, configure the cluster software to restart (stop and then start) the JP1/AJS3 Console services at a failover.
- For JP1/AJS3 Console Manager
The ajscmmonsvr and ajscmstatd processes run while JP1/AJS3 Console Manager is being accessed from JP1/AJS3 Console View, and they remain for a while at a failover. Because these processes use the shared disk, they are forcibly ended when the cluster software takes the shared disk offline (depending on the specifications of the cluster software). Alternatively, the processes stop automatically when a communication error is detected.
- For JP1/AJS3 Console Agent
The ajscagtd process runs while JP1/AJS3 Console Agent is being accessed from JP1/AJS3 Console Manager, and it remains for a while at a failover. This process stops automatically when a communication error is detected.
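As above, a process listing is a simple way to confirm whether these processes have already ended; this is a sketch based on the process names in this subsection, not a documented procedure.
# Check for JP1/AJS3 Console processes that may remain after a failover
# (ajscmmonsvr and ajscmstatd on the Console Manager host, ajscagtd on the Console Agent host).
ps -ef | grep -e ajscmmonsvr -e ajscmstatd -e ajscagtd | grep -v grep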
- Operating a cluster system while a submit job is executed:
- If a failover occurs while a submit job registered by a job execution control command is being executed on JP1/AJS3 - Manager, the job is forcibly terminated. Note, however, that if the end of the job is not reported, the job is placed in the Waiting to execute, Being held, or Killed status, according to the setting that was in effect when the job was submitted. For a job submitted by the jpqjobsub command, the job is placed in the status specified in the -rs option (the default is Being held).