Hitachi

JP1 Version 12 JP1/Automatic Job Management System 3 System Design (Configuration) Guide


2.5.5 Considering reduction of job distribution delay

When a job is distributed simultaneously from the manager host to multiple execution agents, if a communication failure has occurred on three or more of those agents, job distribution to the other execution agents might be delayed. When this occurs, the manager host can check the status of communication with each execution agent to suppress job distribution to execution agents on which a communication failure has occurred. As a result, delays in job distribution can be reduced. In JP1/AJS3, this function is called the job distribution delay reduction function.

For details about how to enable the job distribution delay reduction function, see 21.5 Setting up the job distribution delay reduction function in the JP1/Automatic Job Management System 3 Configuration Guide.

(1) Overview of the job distribution delay reduction function

If the job distribution delay reduction function is enabled, the manager host manages the status of each execution agent. The following table describes the statuses of execution agents.

Table 2‒42: List of statuses of execution agents

| No. | Status | Description |
|---|---|---|
| 1 | Not checked | The status of the execution agent has not been identified yet. Job distribution to the execution agent is possible. |
| 2 | Connectable | The manager host can normally communicate with the execution agent. Job distribution to the execution agent is possible. |
| 3 | Unconnectable | The execution agent cannot receive jobs because a communication failure has occurred. Job distribution to the execution agent is not possible. (Jobs to be distributed to this execution agent enter the Queuing status on the manager host.) |
| 4 | Unavailable | Job distribution to the execution agent is explicitly suppressed by using the ajsagtalt command. Job distribution to the execution agent is not possible. (Jobs to be distributed to this execution agent enter the Queuing status on the manager host.) |
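
The four statuses and their effect on job distribution can be modeled as a small lookup. The following Python sketch is illustrative only and is not part of JP1/AJS3; the names `AgentStatus`, `can_distribute`, and `job_outcome` are hypothetical.

```python
from enum import Enum


class AgentStatus(Enum):
    """Execution agent statuses listed in Table 2-42."""
    NOT_CHECKED = "Not checked"
    CONNECTABLE = "Connectable"
    UNCONNECTABLE = "Unconnectable"
    UNAVAILABLE = "Unavailable"


# Statuses for which the table says job distribution is possible.
DISTRIBUTABLE = frozenset({AgentStatus.NOT_CHECKED, AgentStatus.CONNECTABLE})


def can_distribute(status: AgentStatus) -> bool:
    """Return True if a job may be distributed to an agent in this status."""
    return status in DISTRIBUTABLE


def job_outcome(status: AgentStatus) -> str:
    """Describe what happens to a job destined for an agent in this status."""
    if can_distribute(status):
        return "distributed to the execution agent"
    return "Queuing on the manager host"
```

For example, `job_outcome(AgentStatus.UNAVAILABLE)` returns `"Queuing on the manager host"`, matching row 4 of the table.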

When no communication failures have been detected on any execution hosts, the manager host assumes the status of each execution agent to be Not checked (the default status). The manager host distributes a job to execution agents with the Not checked status.

Upon detecting a failure in communication with an execution agent, the manager host starts monitoring the status of the execution agents. The following describes the status transitions of execution agents:

  1. The manager host checks the communication status of execution agents whose status is Not checked or Connectable and for which jobs are queuing, as well as the communication status of execution agents whose status is Not checked or Connectable and that are connected to a group of execution agents for which jobs are queuing. This is called the communication status check.

    The status of the execution agents changes to one of the following according to the result of the communication status check:

    • Connectable

    • Unconnectable

  2. The manager host distributes the job to the execution agents for which the status is Connectable, and does not distribute the job to the execution agents for which the status is Unconnectable.

    The job distributed to Connectable execution agents is executed on the execution agents. One hour after the status of an execution agent changes to Connectable, the status changes to Not checked.

    A job that could not be distributed to an execution agent for which the status is Unconnectable remains queuing, and waits for the communication failure on the execution agent to be corrected. If an execution agent has not recovered when the wait time for error recovery elapses# (in the case of the event job, even if the retransmission of the unreported information is complete), the job status changes to Failed to start.

    #

    For details about the wait time for error recovery of agents, see 6.2.12 Changing the wait time for recovery when an agent has failed in the JP1/Automatic Job Management System 3 Configuration Guide (for Windows) or see 15.2.12 Changing the wait time for recovery when an agent has failed in the JP1/Automatic Job Management System 3 Configuration Guide (for UNIX).

    The following figure shows an overview of distributing a job according to the status of execution agents.

    Figure 2‒51: Distribution of a job according to the status of execution agents

    [Figure]

  3. The manager host periodically polls Unconnectable execution agents to check the connection status. This is called the communication recovery check.

    The communication recovery check is repeated until the execution agent recovers from the communication failure.

  4. The status of each execution agent changes to one of the following statuses according to the result of repeated communication recovery checks:

    • Connectable

      When an execution agent recovers from a communication failure, the execution agent's status changes to Connectable, and job distribution starts. If there are no execution agents whose status is Unconnectable, the communication recovery check ends.

    • Not checked

      If an execution agent does not recover from a communication failure within 24 hours (at the default settings), the execution agent's status changes from Unconnectable to Not checked, and the communication recovery check stops.
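
The status transitions in steps 1 to 4 can be sketched as a simple state machine. This Python sketch is an illustration under stated assumptions, not the product's implementation: the `Agent` class and both functions are hypothetical, time is a plain number of seconds, and a repeated check that yields the same status is assumed not to restart the timer (the text does not say either way).

```python
from dataclasses import dataclass

CONNECTABLE_RESET_S = 60 * 60          # one hour, as stated in step 2
UNCONNECTABLE_RESET_S = 24 * 60 * 60   # 24 hours, the default in step 4


@dataclass
class Agent:
    name: str
    status: str = "Not checked"
    since: float = 0.0  # time (in seconds) at which the current status began


def apply_check(agent: Agent, reachable: bool, now: float) -> None:
    """Record the result of a communication status or recovery check."""
    if agent.status == "Unavailable":
        return  # the checks described above do not cover Unavailable agents
    new_status = "Connectable" if reachable else "Unconnectable"
    if new_status != agent.status:
        agent.status, agent.since = new_status, now


def apply_timers(agent: Agent, now: float) -> None:
    """Return an agent to Not checked once its status has aged out."""
    limits = {"Connectable": CONNECTABLE_RESET_S,
              "Unconnectable": UNCONNECTABLE_RESET_S}
    limit = limits.get(agent.status)
    if limit is not None and now - agent.since >= limit:
        agent.status, agent.since = "Not checked", now
```

In this model, an agent that fails a check at time 0 and never recovers stays Unconnectable until `apply_timers` is called with `now` at or beyond 86,400 seconds, at which point it returns to Not checked.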

Supplementary note
  • If an execution agent group is specified as the job destination, the job is distributed to the execution agents with the highest priority among the Connectable or Not checked execution agents associated with the group.

  • Even if JP1/AJS3 version 11-10 or earlier is installed on the job-destination agent hosts, the agent hosts are subject to communication status check and communication recovery check.
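
The selection rule for execution agent groups in the first note can be sketched as follows. This is a simplified Python illustration: the function name and tuple layout are hypothetical, a larger number is assumed to mean a higher priority, and any other criteria that the product applies when choosing among agents are ignored.

```python
def pick_candidates(group):
    """Return the names of the agents a job could be distributed to.

    group: list of (agent_name, priority, status) tuples.
    Assumption: a larger priority value means a higher priority.
    """
    eligible = [a for a in group if a[2] in ("Connectable", "Not checked")]
    if not eligible:
        return []  # no distributable agent; the job would remain queued
    top = max(priority for _, priority, _ in eligible)
    return [name for name, priority, _ in eligible if priority == top]
```

With a group of `("agtA", 16, "Unconnectable")`, `("agtB", 8, "Connectable")`, and `("agtC", 8, "Not checked")`, the sketch skips `agtA` despite its higher priority and returns `agtB` and `agtC`.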

(2) Jobs subject to job distribution delay reduction

The job distribution delay reduction function can be applied to the following types of jobs:

(3) Monitoring interval of the job distribution delay reduction function

By using the environment setting parameters for the job distribution delay reduction function, you can set the connection timeout period, the interval at which to perform communication recovery checks, and the amount of time after which communication checks are to stop. The following figure shows the time periods that can be set by using environment setting parameters.

Figure 2‒52: Time periods that can be set by using environment setting parameters

[Figure]

  1. The time required to carry out a communication status check or communication recovery check (connection timeout). This value is specified by using the AGMCONNECTTIMEOUT environment setting parameter. The default is 10 seconds.

  2. The interval at which to perform a communication recovery check. This value is specified by using the AGMINTERVALFORRECOVER environment setting parameter. The default is 180 seconds.

  3. The time before the status of an execution agent changes from Unconnectable to Not checked. This value is specified by using the AGMERRAGTSTATRESETTIME environment setting parameter. The default is 24 hours.
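
As a rough worked example of how the three periods relate, the following Python sketch estimates how many communication recovery checks fit into the reset period at the default settings. It assumes that checks start at fixed intervals and ignores the time each check takes; the variable and function names are hypothetical, and all values are expressed in seconds for convenience.

```python
# Default values quoted in the list above, expressed in seconds.
connect_timeout_s = 10            # AGMCONNECTTIMEOUT
recovery_interval_s = 180         # AGMINTERVALFORRECOVER
status_reset_s = 24 * 60 * 60     # AGMERRAGTSTATRESETTIME (24 hours)


def recovery_checks_before_reset(interval_s: int, reset_s: int) -> int:
    """Approximate number of recovery checks before the status is reset."""
    return reset_s // interval_s


print(recovery_checks_before_reset(recovery_interval_s, status_reset_s))  # 480
```

Under these assumptions, an execution agent that never recovers is polled about 480 times before its status returns to Not checked.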

(4) Forcibly terminating a job when the job distribution delay reduction function is used

If the job distribution delay reduction function is used, job distribution is not delayed even when a communication failure occurs during distribution of a forced-termination request for a job. In the same way that jobs are distributed, the request is distributed to the execution agents whose status is Connectable or Not checked. Distribution of the request is suppressed for execution agents whose status is Unconnectable or Unavailable. If the forced-termination request for a job is suppressed, the status of the job changes to Ended abnormally, and the message KAVU4221-E is output to the integrated trace log. Note that user programs executed by the job are not forcibly terminated.
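
The handling of a forced-termination request described above can be summarized in a short Python sketch. The function name and the returned dictionary are hypothetical and serve only to restate the paragraph.

```python
def route_forced_termination(agent_status: str) -> dict:
    """Summarize what happens to a forced-termination request."""
    if agent_status in ("Connectable", "Not checked"):
        # The request is distributed in the same way as a job.
        return {"distributed": True, "job_status": None, "message": None}
    # Unconnectable or Unavailable: the request is suppressed. The user
    # program executed by the job is not forcibly terminated.
    return {"distributed": False,
            "job_status": "Ended abnormally",
            "message": "KAVU4221-E"}
```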

(5) When you want to stop an execution agent for a scheduled purpose

If the job distribution delay reduction function is enabled and you want to stop an execution agent for a scheduled purpose such as maintenance, specify the following settings for the execution agent you want to stop, in order to prevent delays in job distribution to other running execution agents:

If you do not change the status of the execution agent to Unavailable, the communication status check reports the status to be Unconnectable. In this case, because the communication recovery check continues for the execution agent that has stopped, recovery detection might be delayed until the status of the execution agent changes to Not checked. To redistribute a job to an execution agent whose status you changed to Unavailable, use the ajsagtalt command to change the status of the execution agent to Not checked.

Note that you can also perform these operations from JP1/AJS3 - Web Console.

(6) Relationship between the job transfer restriction and the job distribution delay reduction function

The following table describes the job status transitions that occur based on the job transfer restriction status and the status of the execution agent according to the job distribution delay reduction function. For details about job transfer restriction, see 5.2 Restricting job transfer in the manual JP1/Automatic Job Management System 3 Overview.

Table 2‒43: Job status transitions that occur when job transfer restriction is used in conjunction with the job distribution delay reduction function

| Job transfer restriction status | Execution agent status | Job status transition | Event job status transition |
|---|---|---|---|
| Effective | Not checked, Connectable | Now queuing → Now running → Ended | Now queuing → Now running → Ended (The transfer of event jobs cannot be restricted.) |
| Effective | Unconnectable, Unavailable | Now queuing → Failed to start (The status changes to Failed to start after the wait time for the error recovery of agents has elapsed.) | Now queuing → Failed to start |
| Ineffective | Not checked, Connectable | Immediately Failed to start (Now queuing jobs have the same status as Effective jobs.) | Now queuing → Now running → Ended (The transfer of event jobs cannot be restricted.) |
| Ineffective | Unconnectable, Unavailable | Immediately Failed to start (The status of Now queuing jobs changes to Failed to start after the wait time for the error recovery of agents has elapsed.) | Now queuing → Failed to start |
| Hold | Not checked, Connectable | Now queuing | Now queuing → Now running → Ended (The transfer of event jobs cannot be restricted.) |
| Hold | Unconnectable, Unavailable | Now queuing | Now queuing → Failed to start |
| Blockade | Not checked, Connectable | Immediately Failed to start (Now queuing jobs have the same status as Hold jobs.) | Now queuing → Now running → Ended (The transfer of event jobs cannot be restricted.) |
| Blockade | Unconnectable, Unavailable | Immediately Failed to start (Now queuing jobs have the same status as Hold jobs.) | Now queuing → Failed to start |
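
One way to read Table 2‒43 is as a decision function. The Python sketch below encodes that reading; it is an interpretation of the table rather than product code, and the function name and its `already_queuing` and `event_job` parameters are hypothetical.

```python
RUN = ["Now queuing", "Now running", "Ended"]
FAIL_AFTER_WAIT = ["Now queuing", "Failed to start"]  # after the wait time
FAIL_NOW = ["Failed to start"]                        # immediately
STAY_QUEUED = ["Now queuing"]


def transition(restriction, agent_status, already_queuing=False, event_job=False):
    """Return the sequence of job statuses implied by Table 2-43."""
    reachable = agent_status in ("Not checked", "Connectable")
    by_agent = RUN if reachable else FAIL_AFTER_WAIT
    # The transfer of event jobs cannot be restricted, so only the
    # execution agent status matters for them.
    if event_job or restriction == "Effective":
        return by_agent
    if restriction == "Ineffective":
        # Jobs that are already queuing behave as under Effective.
        return by_agent if already_queuing else FAIL_NOW
    if restriction == "Hold":
        return STAY_QUEUED
    if restriction == "Blockade":
        # Jobs that are already queuing behave as under Hold.
        return STAY_QUEUED if already_queuing else FAIL_NOW
    raise ValueError(f"unknown job transfer restriction status: {restriction}")
```

For example, `transition("Blockade", "Connectable")` yields an immediate Failed to start, whereas the same call with `event_job=True` yields the normal Now queuing, Now running, Ended sequence.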

(7) Note on the job distribution delay reduction function

One hour after the status of an execution agent is determined to be Connectable, the status of the execution agent changes to Not checked. Suppose that a communication failure occurs on three or more execution agents determined to be Connectable before their status changes from Connectable to Not checked. In this case, an attempt to simultaneously distribute jobs to those execution agents might cause a delay in job distribution.

If you are going to enable the job distribution delay reduction function, do not stop the agent monitoring process (ajsagtmond). If you stop this process, the following operations are performed: