Hitachi

JP1 Version 12 JP1/Automatic Job Management System 3 Administration Guide


10.2.1 Node switching caused by an error in JP1/AJS3 - Manager

This subsection describes the flow of processing when an error occurs in JP1/AJS3 - Manager and a failover is performed, and how information is inherited when a start condition or an event job is defined.

The following figure shows the processing that is performed when node switching occurs in JP1/AJS3 - Manager during operation.

Figure 10‒4: Processing if node switching occurs in JP1/AJS3 - Manager

[Figure]

The flow of events in the system processing is as follows:

  1. An error occurs on the primary node, and the JP1/AJS3 service of JP1/AJS3 - Manager stops.

  2. The contents of the shared disk are inherited by the secondary node.

  3. The JP1/AJS3 service of JP1/AJS3 - Manager starts on the secondary node.

  4. The status of jobs and jobnets changes automatically when the JP1/AJS3 service starts, according to the service start mode.

    To check the service start mode, execute the following command, and then check the value output for the STARTMODE environment setting parameter.

    /opt/jp1base/bin/jbsgetcnf 
                            -h {JP1_DEFAULT|logical-host-name}# \
                            -c JP1AJSMANAGER \
                            -n scheduler-service-name
    #: In the {JP1_DEFAULT|logical-host-name} part, specify JP1_DEFAULT if the host is a physical host, or the logical host name if the host is a logical host.

    The status changes and the processing flow of the system after the changes are described below for each service start mode.

    • When the service start mode is set to Cold-start:

      The secondary node system manager inherits only the definition information for jobnets and jobs as it was immediately before the failover. All jobnets are placed in the Not registered status. To restart the operation, re-register the jobnets for execution.

      Perform a cold start when it is safer to restart the jobnets from the beginning than to have the operator check the job statuses. Also, make sure that no problems will occur if identical jobs start at the same time or a job is executed twice.

    • When the service start mode is set to Warm-start:

      The secondary node system manager inherits the statuses that existed immediately before the failover. When the service starts, the status of each job that was in Waiting to execute, Now queuing, or Now running status is changed to reflect its actual state: a job that had not been executed is placed in Not executed + Ended status, and a job that was being executed, or whose status cannot be acquired, is placed in Unknown end status.

      The status of the jobnet is changed to Interrupted status.

      Jobnets that had not started will start on schedule. For jobnets that terminated abnormally because of the warm start, check the changed statuses and then manually re-execute them. If a start condition was being monitored, the secondary node system manager inherits the events received before the error occurred.

      Perform a warm start when you want to have the operator check the statuses of the jobs that were being executed to decide whether to continue the operation.

    • When the service start mode is set to Hot-start:

      The secondary node system manager inherits the status immediately before the failover occurs. The secondary node system manager gets information about the jobs in Now running status from the servers where the jobs were running, and automatically reproduces the actual status of each job if possible.

      If the actual status of each job is successfully acquired, the jobnet resumes execution automatically as defined, without needing to be re-executed. If a start condition was being monitored, the secondary node system manager inherits information about events received before the failover occurred.

      If the secondary node system manager fails to get information from the servers where the jobs were running, the jobs are placed in Ended abnormally status. In this case, you must check the job statuses and manually re-execute the jobnet.

      Specify a hot start when you want operation to resume automatically after a failover.

    For details about the STARTMODE environment setting parameter that changes the service start mode, see 20.4 Setting up the scheduler service environment in the JP1/Automatic Job Management System 3 Configuration Guide.

    For details about the setting procedure, see 4.2 Environment setting parameter settings in the JP1/Automatic Job Management System 3 Configuration Guide.

    For details about job statuses when a failover occurs, see 6.2.1(3) Jobnet and job statuses for each start mode.

  5. If necessary, manually re-execute the jobs and jobnets whose statuses were changed in step 4, and then resume system operation.
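The status transitions described in step 4 can be summarized as a small lookup. The following Python sketch is illustrative only (it is not a JP1/AJS3 API); it encodes the documented rules for a job that was in Waiting to execute, Now queuing, or Now running status when the failover occurred.

```python
# Illustrative encoding of the post-failover status transitions described
# above. This is not a JP1/AJS3 API; it only restates the documented rules.

def status_after_failover(start_mode, was_executing, status_acquired=True):
    """Return the status shown after the JP1/AJS3 service restarts.

    start_mode      -- "cold", "warm", or "hot" (the STARTMODE value)
    was_executing   -- True if the job was actually running on an agent
    status_acquired -- hot start only: whether the actual status could be
                       obtained from the server where the job was running
    """
    if start_mode == "cold":
        # A cold start discards statuses; only definitions are inherited.
        return "Not registered"
    if start_mode == "warm":
        # Unexecuted jobs end as Not executed + Ended; executing jobs
        # (or jobs whose status cannot be acquired) end as Unknown end status.
        return "Unknown end status" if was_executing else "Not executed + Ended"
    if start_mode == "hot":
        if was_executing and not status_acquired:
            return "Ended abnormally"
        # Otherwise the actual status is reproduced and execution resumes.
        return "actual status (execution resumes)"
    raise ValueError(f"unknown start mode: {start_mode!r}")

print(status_after_failover("warm", was_executing=False))  # Not executed + Ended
print(status_after_failover("hot", was_executing=True, status_acquired=False))  # Ended abnormally
```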

Operating a cluster system when a start condition is changed:

If you change a start condition during operation, the change takes effect from the next execution schedule. Therefore, if node switching occurs in JP1/AJS3 - Manager on the active node and the secondary node takes over the processing, monitoring that was already in progress continues with the old start condition.

For example, imagine that schedule rule 1 defines 11:00 as the start time and schedule rule 2 defines 13:00 as the start time.

If you change the start condition at 11:30, schedule rule 1 (whose monitoring started at 11:00, before the change) continues to be monitored using the old start condition, and schedule rule 2 is monitored using the new start condition.

If node switching occurs between 11:00 and 12:00, monitoring for schedule rule 1 is inherited using the old start condition (only if the restart occurs within the valid time period). Schedule rule 2 is monitored using the new start condition.
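The schedule-rule behavior above can be illustrated with a small hypothetical helper (not part of JP1/AJS3): a rule whose monitoring had already started before the start condition was changed keeps the old condition, while rules that start later pick up the new one.

```python
from datetime import time

# Hypothetical helper illustrating which start condition applies to each
# schedule rule after a change made during operation (not a JP1/AJS3 API).
def condition_for_rule(rule_start, change_time):
    """Return "old" if monitoring for this rule began before the change."""
    return "old" if rule_start <= change_time else "new"

rules = {"schedule rule 1": time(11, 0), "schedule rule 2": time(13, 0)}
change = time(11, 30)  # start condition changed at 11:30
for name, start in rules.items():
    print(name, "->", condition_for_rule(start, change))
# schedule rule 1 -> old   (monitoring started at 11:00, before the change)
# schedule rule 2 -> new   (monitoring starts at 13:00, after the change)
```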

Operating a cluster system while JP1/AJS3 - View is connected:

The ajsmonsvr process is generated when JP1/AJS3 - View is connected. If an ajsmonsvr process that is accessing the shared disk remains at the time of node switching, the shared disk cannot be unmounted. To stop the ajsmonsvr process, stop the ajsinetd process.

Note that the cluster software forcibly terminates any process that is accessing the shared disk at node switching, so you do not normally need to stop the ajsinetd process explicitly. However, stop the ajsinetd process if the forced termination causes a problem, such as a message being output when the process is terminated.

Operating a cluster system while a submit job is executing:

When a failover has occurred during execution of a submit job registered by a job execution control command, if the job is being executed with JP1/AJS3 - Manager, the job is forcibly terminated. Note, however, that if termination of the job is not reported, the status of the job becomes Waiting to execute, Being held, or Killed according to the specified setting in effect when the job was submitted. If the job was submitted by the jpqjobsub command, the status of the job becomes the status specified in the -rs option. The default is Being held.