

5.4.8 Monitoring the status of registered jobs

The manager host monitors the agent hosts and polls the jobs that are registered for execution. In JP1/AJS3, the status of each job is automatically reported by the agent host to the manager host. If the manager host fails or if a communication error occurs between the manager host and the agent host that is executing a job, the job status might not be reported correctly. JP1/AJS3 performs monitoring to enable recovery from such errors.

Organization of this subsection

(1) Monitoring active jobs
(2) Monitoring execution hosts (agents)
(3) Monitoring jobs on another system
(4) Error detection and recovery time at a job execution host

(1) Monitoring active jobs

The manager host polls active jobs at five-minute intervals. When the agent host notifies the manager host that a job has ended, the manager host places that job in Ended status. If the manager host fails to receive the notification because of a temporary communication error or other fault, it detects that the job has ended by polling its status. If polling also fails because of a communication or other error, and the status of the active job cannot be verified for 12 to 30 minutes (depending on the monitoring interval set for the agent host and the execution start time of the job), the manager host changes the job's status. In a configuration where jobs are executed by multiple agent hosts, the manager host checks the status of active jobs on a host-by-host basis, so the frequency of communication increases in proportion to the number of agent hosts that are being managed for failures.

If the status of a job defined in a jobnet cannot be verified, the manager host places it in Killed status and sets its return code to -1. If the job was executed by the jpqjobsub command, its status changes to the status specified in the -rs option. For details, see jpqjobsub in 3. Commands Used for Special Operation in the manual Job Management Partner 1/Automatic Job Management System 3 Command Reference 2.

At this time, the following message appears in the integrated trace log:

KAVU4534-W No response was received from the agent (agent-host-name), so the status of job (job-number) was changed to recovered (status).

The manager host monitors the status of standard jobs (other than QUEUE jobs executed from another system), action jobs, and custom jobs.
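
The following Python sketch is a simplified illustration of this behavior, not JP1/AJS3 internals; the function name and parameters are hypothetical, and the values are the defaults described in this subsection and in (4) below.

  from datetime import timedelta

  # Simplified illustration (not JP1/AJS3 internals) of how the status of an
  # active job is resolved. The manager host polls at five-minute intervals;
  # 12 minutes is the lower bound of the 12-to-30-minute verification window.
  UNVERIFIED_LIMIT = timedelta(minutes=12)

  def resolve_active_job_status(end_notified, poll_succeeded, unverified_for):
      """Hypothetical helper: the status the manager host would assign to an active job."""
      if end_notified:
          return "Ended"            # the agent host reported that the job ended
      if poll_succeeded:
          return "Now running"      # status verified by polling
      if unverified_for >= UNVERIFIED_LIMIT:
          # Status could not be verified for 12 to 30 minutes: the job is placed
          # in Killed status with return code -1, and KAVU4534-W is logged.
          return "Killed"
      return "Now running"          # keep polling at the next interval

  # Example: no end notification, and polling has failed for 15 minutes.
  print(resolve_active_job_status(False, False, timedelta(minutes=15)))  # Killed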

(2) Monitoring execution hosts (agents)

If the manager host gets no response from the agent host when it attempts to register a job for execution, it recognizes that an error has occurred on the agent host or that the agent host has stopped. On detection of failed or stopped status, the manager host continues to poll the agent host at five-minute intervals. While the agent host remains in failed or stopped status, jobs are queued and wait for the agent host to recover. When the manager host detects that the agent host has recovered (from failed or stopped status), it resumes registering jobs for execution on that host. However, if the agent host does not recover for 10 or more minutes after execution registration of a job on that agent host fails, the job is placed in Failed to start status. Because the manager host checks the status of the agent hosts individually, the frequency of communication increases in proportion to the number of agent hosts that are being managed for failures.

At this time, the following message appears in the integrated trace log:

KAVU4593-W An executable agent does not exist.

This monitoring applies to standard jobs (other than QUEUE jobs executed from another system), action jobs, and custom jobs.
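
As a rough sketch of this behavior (an illustration only, not JP1/AJS3 internals; the function and its parameters are hypothetical), a job queued for a failed agent host either waits, is registered again when the agent recovers, or is placed in Failed to start status once the 10-minute recovery limit passes:

  from datetime import timedelta

  # Simplified illustration (not JP1/AJS3 internals) of the behavior described
  # in (2), using the default values from this subsection.
  RECOVERY_LIMIT = timedelta(minutes=10)   # time allowed for the agent host to recover

  def queued_job_status(time_since_registration_failed, agent_recovered):
      """Hypothetical helper: what happens to a job queued for a failed agent host."""
      if agent_recovered:
          return "Registered for execution on the recovered agent host"
      if time_since_registration_failed >= RECOVERY_LIMIT:
          return "Failed to start"   # the agent host did not recover within 10 minutes
      return "Queued, waiting for the agent host to recover"

  # Example: the agent host has been down for 12 minutes since registration failed.
  print(queued_job_status(timedelta(minutes=12), agent_recovered=False))
  # -> Failed to start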

(3) Monitoring jobs on another system

To check the status of a job registered for execution on another system (JP1/NQSEXEC or JP1/OJE, for example), the manager host polls the job at five-minute intervals. If no response is received for approximately one hour or longer, the following error message is output to the integrated trace log and the job is placed in Ended abnormally status:

KAVU6218-W In the job status notification process, the job information was not acquired because the error occurred during TCP/IP communication. But the job might have ended normally. (manager-descriptor, job-number)

The other system might not support the functionality for notifying the manager whenever a status change occurs in a job. In this case, it might take as long as five minutes for the manager host to acquire the job status through polling. See the relevant system documentation to find out whether the other system supports the status notification functionality.

The manager host does not perform five-minute polling of a submit job that was registered by the jpqjobsub command for execution on another system. To check the job status, use the jpqjobget command.

Note on linking with JP1/NQSEXEC

Notification of job termination is not performed when a job execution request is sent from JP1/AJS3 to JP1/NQSEXEC version 05-20 or earlier. In this case, JP1/AJS3 polls the job executed by the linked JP1/NQSEXEC at five-minute intervals. Consequently, it might take as long as five minutes for the job's status to change after completion. Be aware of how this will affect the execution monitoring interval if there are succeeding jobs.

The JP1/AJS3 notification functionality is supported in JP1/NQSEXEC version 06-00 and later. When execution by JP1/NQSEXEC finishes, the job's status is immediately reported to JP1/AJS3.

If you are using JP1/NQSEXEC version 05-20 or earlier, the length of time taken to detect job termination could have a major impact on operations. We recommend that you upgrade to JP1/NQSEXEC 06-00 or later, or migrate to JP1/AJS3.

(4) Error detection and recovery time at a job execution host

If a communication error or failure occurs on the agent host executing a job (standard job, action job, or custom job), JP1/AJS3 does not immediately assume that the job has terminated abnormally. Instead, the manager host waits for a certain amount of time and then retries, waiting for the system error or communication error on the agent host to be corrected. This grace time prevents disruption of job processing should a temporary and recoverable error occur.

For some applications, you might want errors to be promptly detected and speedily corrected, rather than waiting for recovery. Errors can be rapidly detected if you reduce the TCP/IP communication time or recovery wait time. For details on how to reduce the time until an error is detected, see 6.2.12 Changing the wait time for recovery when an agent has failed in the Job Management Partner 1/Automatic Job Management System 3 Configuration Guide 1 (Windows) or 15.2.12 Changing the wait time for recovery when an agent has failed in the Job Management Partner 1/Automatic Job Management System 3 Configuration Guide 1 (UNIX).

The time taken to detect an error on the agent host differs between job transfer and job execution, as explained below.

(a) Error detection and recovery time at job transfer

The manager host uses TCP/IP communication to send jobs to an agent host. If the agent host has not started or if a network error occurs, a TCP/IP connection error results. However, because of the retry interval allowed, it might take as long as five minutes for the manager host to detect the error. The agent host where the connection error occurred is assumed to have failed. The manager host does not attempt to transfer subsequent jobs via TCP/IP while the agent host remains in failed status.

If the agent host is in failed status, all jobs wait for recovery for the specified recovery wait time (default 10 minutes). During this time, jobs are placed in Now queuing status (or Waiting to execute status in the case of a submit job). If the agent host has not recovered from the error when the wait time elapses, all jobs are placed in Failed to start status. The length of time from when a job is registered for execution until it enters Failed to start status depends on whether TCP/IP communication is performed, as follows:

  • Job transferred to the agent host before error detection (TCP/IP communication is performed)

    Communication time on the TCP/IP connection (up to approx. 5 minutes 10 seconds)#1

    + recovery wait time at the agent host (10 minutes) = 15 minutes 10 seconds maximum

  • Job waiting for transfer to the agent host after error detection (TCP/IP communication is not performed)

    Recovery wait time at the agent host (10 minutes)
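
A worked calculation of the two cases above, using the default values from this subsection (the maximum communication time of 5 minutes 10 seconds is explained in note #1):

  from datetime import timedelta

  # Worked example of the two cases above, using the default values.
  MAX_TCP_COMM_TIME = timedelta(minutes=5, seconds=10)  # up to 310 seconds (see #1)
  RECOVERY_WAIT = timedelta(minutes=10)                 # recovery wait time (default)

  # Case 1: the job is transferred before the error is detected, so the TCP/IP
  # communication time (including retries) is added to the recovery wait time.
  print(MAX_TCP_COMM_TIME + RECOVERY_WAIT)  # 0:15:10 = 15 minutes 10 seconds maximum

  # Case 2: the agent host is already known to have failed, so no TCP/IP
  # communication is attempted and only the recovery wait time applies.
  print(RECOVERY_WAIT)                      # 0:10:00 = 10 minutes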

(b) Error detection and recovery time at job execution

On receiving notification from the agent host that a job has started execution, the manager host changes its status to Now running and starts checking the job status by polling the agent host at the set interval (default 300 seconds (5 minutes)). At this time, TCP/IP communication is used to pass information between processes. If the agent host has not started or if a network error occurs, a TCP/IP connection error results. However, because of the retry interval allowed, it might take as long as 310 seconds (5 minutes 10 seconds) for the manager host to detect the error.#1

If the connection error is detected before the recovery wait time set for the agent host (default 10 minutes) elapses, the manager host resumes polling and waits for the agent host to recover. If the connection error is still detected after the wait time elapses, the manager host changes the job status to Killed.#2 Therefore, it takes roughly 12 to 30 minutes in total for a job error to be detected from the time it actually occurred on the agent host.#3

#1

Retries are performed for TCP/IP connections by default. The following time settings therefore apply from the time a connection error occurs until the connection times out:

  • Timeout value for TCP/IP connection

    Default 90 seconds

  • Retry count for TCP/IP connection

    Default 2 retries

  • Retry interval for TCP/IP connection

    Default 20 seconds

Even if a connection error occurs immediately, two retries are attempted at the default 20-second interval. Therefore, the communication time could be from roughly 40 seconds to as long as 310 seconds (5 minutes 10 seconds). For details on setting a retry interval and retry count for TCP/IP connection, see 6.2.8 Changing the interval and number of retry attempts when a TCP/IP connection error occurs in the Job Management Partner 1/Automatic Job Management System 3 Configuration Guide 1 (Windows) or 15.2.8 Changing the interval and number of retry attempts when a TCP/IP connection error occurs in the Job Management Partner 1/Automatic Job Management System 3 Configuration Guide 1 (UNIX).
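
The 40-second and 310-second figures follow directly from these defaults, as the following small calculation shows:

  # How the 40-second to 310-second range follows from the default values above.
  TIMEOUT = 90          # timeout value for one TCP/IP connection attempt (seconds)
  RETRIES = 2           # retry count
  RETRY_INTERVAL = 20   # retry interval (seconds)

  attempts = 1 + RETRIES

  # Worst case: every attempt waits for the full timeout before failing.
  worst = attempts * TIMEOUT + RETRIES * RETRY_INTERVAL
  print(worst)          # 310 seconds (5 minutes 10 seconds)

  # Best case: every attempt fails immediately, so only the retry intervals remain.
  best = RETRIES * RETRY_INTERVAL
  print(best)           # 40 seconds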

#2

If the job was executed by the jpqjobsub command, its status changes to the status specified in the -rs option. For details, see jpqjobsub in 3. Commands Used for Special Operation in the manual Job Management Partner 1/Automatic Job Management System 3 Command Reference 2.

#3

With the default settings, the total time is calculated as follows:

Approximate total time to detect an error =

(agent monitoring interval x 2 retries)

+ (communication time x 3 times)

+ time from error occurrence until the job status is first verified
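
Substituting the default values into this formula (an agent monitoring interval of 300 seconds, a communication time of 40 to 310 seconds as described in note #1, and assuming that the time until the job status is first verified ranges from zero up to one monitoring interval) reproduces the 12-to-30-minute range given in (b):

  # Substituting the default values into the formula above.
  MONITOR_INTERVAL = 300          # agent monitoring interval (seconds)
  COMM_MIN, COMM_MAX = 40, 310    # TCP/IP communication time range (see #1)

  # Assumption: the time from the occurrence of the error until the job status
  # is first verified ranges from zero up to one monitoring interval.
  minimum = MONITOR_INTERVAL * 2 + COMM_MIN * 3 + 0
  maximum = MONITOR_INTERVAL * 2 + COMM_MAX * 3 + MONITOR_INTERVAL

  print(minimum / 60)   # 12.0 -> roughly 12 minutes
  print(maximum / 60)   # 30.5 -> roughly 30 minutes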

When monitoring active jobs, the manager host checks the status of one job with each poll. If multiple jobs are being executed on an agent host, one job will be killed on detection of an error. The manager host then starts polling the next job. The time taken to detect the error from the time a poll starts is the same for every job.

For example, suppose three jobs are running and it takes 20 minutes to detect an error in one job. To detect the error and kill all three jobs, it would take at least 60 minutes.

Depending on the system, rather than waiting for recovery after a communication error in this way, it is sometimes better to kill all jobs currently running on the agent host, enabling immediate detection of the error and rapid recovery. For details about implementing this setup, see 6.2.20 Placing all running jobs in an end status when a communication error occurs in the Job Management Partner 1/Automatic Job Management System 3 Configuration Guide 1 (Windows) or 15.2.19 Placing all running jobs in an end status when a communication error occurs in the Job Management Partner 1/Automatic Job Management System 3 Configuration Guide 1 (UNIX).