Hitachi

JP1 Version 12 JP1/Automatic Job Management System 3 Overview


5.4.8 Monitoring the status of registered jobs

The manager host monitors the agent hosts and the jobs registered for execution on them. In JP1/AJS3, each agent host automatically reports the status of its jobs to the manager host. If the manager host fails, or if a communication error occurs between the manager host and the agent host that is executing a job, the job status might not be reported correctly. JP1/AJS3 performs monitoring so that it can recover from such errors.

Organization of this subsection

(1) Monitoring active jobs
(2) Monitoring execution hosts (agents)
(3) Monitoring jobs on another system
(4) Error detection and recovery time at a job execution host

(1) Monitoring active jobs

The manager host polls active jobs at five-minute intervals. As each job finishes, the manager host places it in Ended status when the agent host notifies it that the job has ended. If the manager host fails to receive this notification because of a temporary communication error or other fault, it detects that the job has ended through polling.

If polling fails because of a communication or other error, and the status of the active job cannot be verified for 12 to 30 minutes (depending on the monitoring interval set for the agent host and the execution start time of the job), the manager host changes the status of the job as described below.

In a configuration where jobs are executed by multiple agent hosts, the manager host checks the status of active jobs on a host-by-host basis. As a result, the frequency of communication increases in proportion to the number of agent hosts assumed to have failed, and more time is required to detect job errors.
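As a conceptual illustration only (this is not JP1/AJS3's implementation; the names, the 5-minute interval constant, and the 30-minute cutoff are simplifying assumptions standing in for the 12-to-30-minute window described above), the decision made for one active job on each poll can be modeled like this:

from dataclasses import dataclass

POLL_INTERVAL_SEC = 300        # assumed polling interval (5 minutes)
MAX_UNVERIFIED_SEC = 30 * 60   # assumed upper bound of the 12-30 minute window

@dataclass
class ActiveJob:
    name: str
    last_verified_sec: float   # when the job status was last confirmed successfully

def on_poll(job: ActiveJob, now_sec: float, poll_succeeded: bool) -> str:
    # If the poll (or the agent's own end notification) succeeds, the status is current.
    if poll_succeeded:
        job.last_verified_sec = now_sec
        return "status verified; keep current status"
    # If the status has been unverifiable for too long, the manager gives up and
    # changes the job status (Killed, return code -1, as described in the text).
    if now_sec - job.last_verified_sec >= MAX_UNVERIFIED_SEC:
        return "change status to Killed"
    return "could not verify; retry on the next polling cycle"

job = ActiveJob("JOB001", last_verified_sec=0)
print(on_poll(job, now_sec=1800, poll_succeeded=False))   # 30 minutes unverified -> Killed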

If the following message is output to the integrated trace log, there might be a delay in the checking of the job status:

KAVU4222-E The job confirmation request to the agent (agent-host-name) failed. (reason code:reason-code)

To check the frequency of communication between the manager host and managed agent hosts, check the number of KAVU4222-E messages that were output to the integrated trace log during the relevant time period.

If the status of a job defined in a jobnet cannot be verified, the manager host places it in Killed status and sets its return code to -1. If the job was executed by the jpqjobsub command, its status changes to the status specified in the -rs option. For details, see jpqjobsub in 4. Commands Used for Special Operation in the manual JP1/Automatic Job Management System 3 Command Reference.

At this time, the following message appears in the integrated trace log:

KAVU4534-W No response was received from the agent (agent-host-name), so the status of job (job-number) was changed to recovered (status).

The manager host monitors the status of standard jobs (other than QUEUE jobs executed from another system), HTTP connection jobs, action jobs, and custom jobs.

(2) Monitoring execution hosts (agents)

If the manager host gets no response from the agent host when it attempts to register a job for execution, it recognizes that an error has occurred on the agent host or that the agent host has stopped. On detecting the failed or stopped status, the manager host continues to poll the agent host at five-minute intervals. While the agent host remains in failed or stopped status, jobs are queued and wait for the agent host to recover. When the manager host detects that the agent host has recovered (from failed or stopped status), it resumes registering jobs for execution on that host. However, if the agent host does not recover within 10 minutes after execution registration of a job on that agent host fails, the job is placed in Failed to start status#. Because the manager host checks the job status for each agent host, the frequency of communication increases in proportion to the number of agent hosts being monitored for failure.
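As a rough sketch of the timing rule above (a simplified model with assumed names, not JP1/AJS3's actual logic), a queued job's outcome depends on whether the agent recovers within the 10-minute wait while the manager keeps polling every 5 minutes:

from typing import Optional

POLL_INTERVAL_MIN = 5    # manager polls a failed or stopped agent every 5 minutes
RECOVERY_WAIT_MIN = 10   # jobs wait up to 10 minutes for the agent to recover

def queued_job_outcome(agent_recovery_min: Optional[float]) -> str:
    """agent_recovery_min: minutes until the agent recovers, or None if it never does."""
    if agent_recovery_min is not None and agent_recovery_min < RECOVERY_WAIT_MIN:
        # In this model, recovery is noticed on the next 5-minute poll after it happens.
        polls_needed = int(agent_recovery_min // POLL_INTERVAL_MIN) + 1
        return f"job runs (recovery detected around minute {polls_needed * POLL_INTERVAL_MIN})"
    return "job is placed in Failed to start status after the 10-minute wait"

print(queued_job_outcome(3))     # agent recovers quickly -> job runs
print(queued_job_outcome(None))  # agent never recovers  -> Failed to start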

The time required to handle a communication error or timeout includes the time required to detect that the agent host has recovered normally.

If the following message is output to the integrated trace log, there might be a delay in the checking of the operating status of the agent host:

KAVU4223-E The operating status confirmation request to the agent (agent-host-name) failed. (reason code:reason-code)

To check the frequency of communication between the manager host and managed agent hosts, check the number of KAVU4223-E messages that were output to the integrated trace log during the relevant time period.

At this time, the following message appears in the integrated trace log:

KAVU4593-W An executable agent does not exist.

Applicable jobs here are standard jobs (jobs other than QUEUE jobs running in another system), HTTP connection jobs, action jobs, and custom jobs.

#

If multiple execution agent hosts are connected to an execution agent group and no jobs can be run because all of those hosts are in the failed or stopped state, the jobs are placed in Failed to start status. However, if any of the execution agent hosts is not in the failed or stopped state, the jobs are not placed in Failed to start status, because the situation is not assumed to be abnormal. This applies even when the usage rate of an execution agent is 100%.
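A minimal sketch of this rule, assuming a simple list of per-agent states (the state names are illustrative, not JP1/AJS3 identifiers):

def group_jobs_fail_to_start(agent_states: list) -> bool:
    """Jobs fail to start only if every agent in the execution agent group
    is in the failed or stopped state; one healthy agent keeps jobs waiting,
    even if its usage rate is 100%."""
    return all(state in ("failed", "stopped") for state in agent_states)

print(group_jobs_fail_to_start(["failed", "stopped"]))          # True  -> Failed to start
print(group_jobs_fail_to_start(["failed", "busy_at_100pct"]))   # False -> jobs keep waiting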

(3) Monitoring jobs on another system

To check the status of a job registered for execution on another system (JP1/NQSEXEC or JP1/OJE, for example), the manager host polls the job at five-minute intervals. If no response is received for approximately one hour or longer, the following error message is output to the integrated trace log and the job is placed in abnormally terminated status:

KAVU6218-W In the job status notification process, the job information was not acquired because the error occurred during TCP/IP communication. But the job might have ended normally. (manager-descriptor, job-number)

The other system might not support the functionality for notifying the manager whenever a status change occurs in a job. In this case, it might take as long as five minutes for the manager host to acquire the job status through polling. See the relevant system documentation to find out whether the other system supports the status notification functionality.

The manager host does not perform five-minute polling of a submit job that was registered by the jpqjobsub command for execution on another system. To check the job status, use the jpqjobget command.

Note on linking with JP1/NQSEXEC

Notification of job termination is not performed when a job execution request is sent from JP1/AJS3 to JP1/NQSEXEC version 05-20 or earlier. In this case, JP1/AJS3 polls the job executed by the linked JP1/NQSEXEC at five-minute intervals. Consequently, it might take as long as five minutes for the job's status to change after completion. Be aware of how this will affect the execution monitoring interval if there are succeeding jobs.

The JP1/AJS3 notification functionality is supported in JP1/NQSEXEC version 06-00 and later. When execution by JP1/NQSEXEC finishes, the job's status is immediately reported to JP1/AJS3.

If you are using JP1/NQSEXEC version 05-20 or earlier, the length of time taken to detect job termination could have a major impact on operations. We recommend that you upgrade to JP1/NQSEXEC 06-00 or later, or migrate to JP1/AJS3.

(4) Error detection and recovery time at a job execution host

If a communication error or failure occurs on the agent host executing a job (standard job, HTTP connection job, action job, or custom job), JP1/AJS3 does not immediately assume that the job has terminated abnormally. Instead, the manager host waits for a certain amount of time and then retries, waiting for the system error or communication error on the agent host to be corrected. This grace time prevents disruption of job processing should a temporary and recoverable error occur.

For some applications, you might want errors to be promptly detected and speedily corrected, rather than waiting for recovery. Errors can be rapidly detected if you reduce the TCP/IP communication time or recovery wait time. For details on how to reduce the time until an error is detected, see 6.2.12 Changing the wait time for recovery when an agent has failed in the JP1/Automatic Job Management System 3 Configuration Guide (Windows) or 15.2.12 Changing the wait time for recovery when an agent has failed in the JP1/Automatic Job Management System 3 Configuration Guide (UNIX).

The time taken to detect an error on the agent host differs at job transfer and at job execution. This is explained next.

(a) Error detection and recovery time at job transfer

The manager host uses TCP/IP communication to send jobs to an agent host. If the agent host has not started or if a network error occurs, a TCP/IP connection error results. However, because of the retry interval allowed, it might take as long as five minutes for the manager host to detect the error. The agent host where the connection error occurred is assumed to have failed. The manager host does not attempt to transfer subsequent jobs via TCP/IP while the agent host remains in failed status.

If the agent host is in failed status, all jobs wait for recovery for the specified recovery wait time (default 10 minutes). During this time, jobs are placed in Now queuing status (or Waiting to execute status in the case of a submit job). If the agent host has not recovered from the error when the wait time elapses, all jobs are placed in Failed to start status. The length of time from when a job is registered for execution until it enters Failed to start status depends on whether TCP/IP communication is performed, as follows:

  • Job transferred to the agent host before error detection (TCP/IP communication is performed)

    Communication time on the TCP/IP connection (approx. 5 minutes 10 seconds maximum)#1

    + recovery wait time at the agent host (10 minutes) = 15 minutes 10 seconds maximum

  • Job submitted to the agent host after error detection (TCP/IP communication is not performed)

    Recovery wait time at the agent host (10 minutes)

    Note that depending on the job execution status, such as when multiple jobs have been submitted to the execution agent after error detection, it might take 10 minutes or more to place each job in the Failed to start status.

    For example, suppose that three jobs are in the Now queuing status and it takes 10 minutes to place one job in the Failed to start status. In this case, it might take a total of approximately 30 minutes for all jobs to be placed in the Failed to start status.
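The following sketch works through these timings with the default values (a maximum communication time of 5 minutes 10 seconds, as explained in #1, and the 10-minute recovery wait); the variable names are illustrative only:

TCP_COMM_MAX_SEC = 310    # maximum TCP/IP communication time (5 min 10 s), see #1
RECOVERY_WAIT_SEC = 600   # default recovery wait time at the agent host (10 min)

# Case 1: the job is transferred before the error is detected
# (TCP/IP communication is performed and must fail first).
case1_sec = TCP_COMM_MAX_SEC + RECOVERY_WAIT_SEC
print(case1_sec / 60)     # 15.17 -> about 15 minutes 10 seconds maximum

# Case 2: the job is submitted after the error is detected
# (no TCP/IP communication; only the recovery wait applies).
case2_sec = RECOVERY_WAIT_SEC
print(case2_sec / 60)     # 10.0 -> 10 minutes

# If several jobs are queued after error detection and each takes about
# 10 minutes to be placed in Failed to start status, the times add up:
queued_jobs = 3
print(queued_jobs * RECOVERY_WAIT_SEC / 60)   # 30.0 -> about 30 minutes in total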

(b) Error detection and recovery time at job execution

On receiving notification from the agent host that a job has started execution, the manager host changes its status to Now running and starts checking the job status by polling the agent host at the set interval (default 300 seconds (5 minutes)). At this time, TCP/IP communication is used to pass information between processes. If the agent host has not started or if a network error occurs, a TCP/IP connection error results. However, because of the retry interval allowed, it might take as long as 310 seconds (5 minutes 10 seconds) for the manager host to detect the error.#1

Note that because the manager host checks the job status for each agent host, the number of times the manager host communicates with agent hosts increases in proportion to the number of agent hosts that are managed.

If the agent host recovers from the connection error within the recovery wait time set for the agent host (default 10 minutes), the manager host resumes polling. If the agent host has not recovered when the wait time elapses, the manager host changes the job status to Killed.#2 Therefore, it takes roughly 12 to 30 minutes in total for a job error to be detected from the time it actually occurred on the agent host.#3

#1

For TCP/IP connections, retries are performed by default. Thus, the following settings determine the time from when a connection error occurs until the connection finally times out:

  • Connection timeout

    Default 90 seconds

  • Connection retry interval

    Default 20 seconds

  • Number of connection retries

    Default 2 retries

Even if a connection error occurs immediately, two retries are attempted at the default 20-second interval. Therefore, the communication time could be from roughly 40 seconds to as long as 310 seconds (5 minutes 10 seconds). For details on setting a connection timeout, connection retry interval and number of connection retries, see 6.2.8 Changing the timeout period, interval of retries, and number of retries for TCP/IP connections in the JP1/Automatic Job Management System 3 Configuration Guide (Windows) or 15.2.8 Changing the timeout period, interval of retries, and number of retries for TCP/IP connections in the JP1/Automatic Job Management System 3 Configuration Guide (UNIX).
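The 40-second and 310-second bounds follow directly from these defaults; the sketch below works through the arithmetic (variable names are illustrative):

CONNECT_TIMEOUT_SEC = 90   # connection timeout (default)
RETRY_INTERVAL_SEC = 20    # connection retry interval (default)
RETRY_COUNT = 2            # number of connection retries (default)

# Worst case: every attempt (the initial try plus each retry) runs to the full timeout.
worst_sec = CONNECT_TIMEOUT_SEC * (RETRY_COUNT + 1) + RETRY_INTERVAL_SEC * RETRY_COUNT
print(worst_sec)   # 310 -> 5 minutes 10 seconds

# Best case: every attempt fails immediately, so only the retry intervals remain.
best_sec = RETRY_INTERVAL_SEC * RETRY_COUNT
print(best_sec)    # 40 -> roughly 40 seconds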

#2

If the job was executed by the jpqjobsub command, its status changes to the status specified in the -rs option. For details, see jpqjobsub in 4. Commands Used for Special Operation in the manual JP1/Automatic Job Management System 3 Command Reference.

#3

With the default settings, the total time is calculated as follows:

Approximate total time to detect an error =

(agent monitoring interval x 2 retries)

+ (communication time x 3 times)

+ time from error occurrence until the job status is first verified

The communication time increases in proportion to the number of agent hosts that are managed as failed. For example, if there are 10 failed agent hosts, communication might take a maximum of 3,100 seconds (about 50 minutes) at the default settings.

When monitoring active jobs, the manager host checks the status of one job with each poll. If multiple jobs are being executed on an agent host, one job will be killed on detection of an error. The manager host then starts polling the next job. The time taken to detect the error from the time a poll starts is the same for every job.

For example, suppose three jobs are running and it takes 20 minutes to detect an error in one job. To detect the error and kill all three jobs, it would take at least 60 minutes.
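With the default settings, the calculation in note #3 can be sketched as follows. It is a rough model (the assumption that the first status check happens at most one monitoring interval after the error is an illustrative one), but it reproduces the 12-to-30-minute range and the multi-agent and multi-job figures given in the text:

MONITOR_INTERVAL_SEC = 300               # agent monitoring interval (default 5 minutes)
COMM_MIN_SEC, COMM_MAX_SEC = 40, 310     # communication time bounds from note #1

def total_detection_sec(comm_sec: float, first_check_delay_sec: float) -> float:
    """Approximate total time to detect an error (note #3):
    (agent monitoring interval x 2 retries) + (communication time x 3)
    + time from error occurrence until the job status is first verified."""
    return MONITOR_INTERVAL_SEC * 2 + comm_sec * 3 + first_check_delay_sec

print(total_detection_sec(COMM_MIN_SEC, 0) / 60)                     # 12.0 -> about 12 minutes
print(total_detection_sec(COMM_MAX_SEC, MONITOR_INTERVAL_SEC) / 60)  # 30.5 -> about 30 minutes

# Communication time grows with the number of agent hosts managed as failed:
failed_agents = 10
print(failed_agents * COMM_MAX_SEC)      # 3100 seconds, about 50 minutes

# Jobs are checked one per poll, so detection is serialized across jobs:
jobs, per_job_detection_min = 3, 20
print(jobs * per_job_detection_min)      # 60 -> at least 60 minutes for all three jobs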

Depending on the system, rather than waiting for recovery after a communication error in this way, it is sometimes better to kill all jobs currently running on the agent host, enabling immediate detection of the error and rapid recovery. For details about implementing this setup, see 6.2.20 Placing all running jobs in an end status when a communication error occurs in the JP1/Automatic Job Management System 3 Configuration Guide (Windows) or 15.2.19 Placing all running jobs in an end status when a communication error occurs in the JP1/Automatic Job Management System 3 Configuration Guide (UNIX).