2.6.7 Troubleshooting problems related to a flexible job distributed by a load balancer
This subsection describes how to troubleshoot problems that are related to a flexible job distributed by a load balancer.
(1) The status of a flexible job changes from "Now running" to "Ended abnormally" when monitoring ends
Possible causes are as follows:
- The execution result of a job might not be returned from the destination agent due to a communication error or one of the following network setting errors:
  - Error in the security settings for the load balancer
  - Error in the firewall settings
  - Failure in name resolution
- The load balancer might fail while a job is being executed on the destination agent.
- While a job is being executed on the destination agent, the destination agent might be deleted by a scale-in operation.
- While a job is being executed on the destination agent, the destination agent might fail.
- While a job is being executed on the destination agent, the manager host might fail over and the execution result might be reported from the destination agent before the JP1/AJS3 Autonomous Agent Messenger service starts.
Correct the problem as follows:
- If a load balancer is used, make sure that the load balancer is running and the network settings are correct.
- Take action by referring to the execution result of the user program.
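As a rough illustration of the network checks listed above, the following Python sketch tests whether a host name resolves and whether a TCP connection to it can be opened. It is not a JP1/AJS3 tool. The function name `check_endpoint` is hypothetical, and the host name and port number you pass to it are assumptions that must be replaced with the values actually used in your environment.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Return (addresses, reachable, detail) for a TCP endpoint.

    host and port are placeholders for this sketch; they are not
    JP1/AJS3 defaults.
    """
    # Step 1: name resolution. A failure here points to DNS or hosts-file settings.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [], False, f"name resolution failed: {exc}"
    addresses = sorted({info[4][0] for info in infos})

    # Step 2: TCP connection. A refusal or timeout here may indicate that the
    # target is down or that a firewall or security setting blocks the port.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return addresses, True, "connected"
    except OSError as exc:
        return addresses, False, f"connection failed: {exc}"
```

Running the function from the manager host against the load balancer, and from the destination agent back toward the manager host, might help narrow down which direction of communication is affected. A successful connection only shows that the port is reachable; it does not confirm that the service behind it is working correctly.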
(2) The status of a flexible job changes to "Killed"
Possible causes are as follows:
- While a job is being executed on the destination agent, the relay agent might fail.
- While a job is being executed on the destination agent, the relay agent might fail over.
- While a job is being executed on the destination agent, the manager host might fail.
Correct the problem as follows:
- Make sure that the necessary services are running on the relay agent and manager host.
- Take action by referring to the execution result of the user program.
(3) A user program terminates abnormally
Check the return value of the user program and the message that it sent to the standard error output. If necessary, re-execute the flexible job.
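If you want to reproduce the failure outside JP1/AJS3, one option is to run the user program by hand on the destination agent and capture the same two pieces of information. The following Python sketch shows one way to do that. The function name `run_user_program` and the command passed to it are hypothetical, and a manual run might not match the environment (user, working directory, environment variables) in which the flexible job executes.

```python
import subprocess


def run_user_program(command, timeout=60):
    """Run a command and return (return_code, stderr_text).

    command is a list such as ["/path/to/program", "arg"]; the path is a
    placeholder for this sketch, not a JP1/AJS3 setting.
    """
    completed = subprocess.run(
        command,
        capture_output=True,  # collect standard output and standard error
        text=True,            # decode the output as text instead of bytes
        timeout=timeout,      # raises subprocess.TimeoutExpired if exceeded
    )
    return completed.returncode, completed.stderr
```

A nonzero return code together with the captured standard error text usually gives a starting point for deciding whether re-executing the job is appropriate.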
(4) The status of a flexible job changes to "Failed to start"
Possible causes are as follows:
- The specification of relay agents might be incorrect.
- A connection attempt might have been refused by the connection source restriction function enabled on a relay agent.
Remove the cause of the problem according to the message that is output.