Feature Request
Resolution: Unresolved
2.4, 2.5
- Currently, Receptor surfaces an error in the job's stdout if the pod fails to start because of ImagePullBackOff, but this is not enough information for the user to debug why their EE is not being pulled. To get the actual events from OpenShift, you have to catch the pod before it is deleted and look at its related events.
Receptor detail:Error creating pod: container failed to start, ImagePullBackOff
- The customer needs more context to debug ImagePullBackOff errors, which can have causes as mundane as a misspelled image tag, an incorrect credential, or an SSL certificate issue with an insecure registry. In the SaaS/managed app space especially, getting the reason may require SRE intervention, because end users often won't have access to the underlying OpenShift/AKS cluster.
- Receptor already explicitly handles ImagePullBackOff in https://github.com/ansible/receptor/blob/122aa3e74aae8929741a7f6c7629237d8b2bf66b/pkg/workceptor/kubernetes.go#L186-L230 . Instead of just returning the error when we run out of retries, make a request like
GET /api/v1/namespaces/<namespace>/events?fieldSelector=reason=ImagePullBackOff,involvedObject.name=<pod_name>
and include the event messages that indicate the reason for the ImagePullBackOff.
- This would then be displayed correctly in the job stdout. To also get it into the job explanation, something like https://github.com/ansible/awx/pull/15689 may need to go in, as controller currently has some issues populating the job_explanation field correctly.
(Filing this under the controller component because there is no receptor component in this project, and in the end we should test it with controller to make sure the user sees the message.)