-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
4.18, 4.19, 4.20, 4.21
-
Quality / Stability / Reliability
-
False
Description of problem:
Metal jobs that fail to mirror an image at the start of their run do not terminate immediately. Instead, they run all subsequent steps, such as tests, and report the failure only at the very end, which wastes significant compute resources and time. For instance, the example job [1] ran for over 5 hours and 45 minutes before finally failing, even though the critical error occurred at the beginning. A "fail-fast" mechanism for image mirroring would save considerable resources and give developers much faster feedback on job failures, especially on k8s bump PRs.
[1] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_kubernetes/2464/pull-ci-openshift-kubernetes-release-4.19-e2e-metal-ipi-ovn-ipv6/1967784233418100736
Version-Release number of selected component (if applicable):
All versions
How reproducible:
When there is a new image that needs mirroring (which happens often in kube bumps).
Steps to Reproduce:
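1. Open a PR that introduces a new image that needs mirroring (for example, a kube bump).
2. Run a metal job against that PR, such as the e2e-metal-ipi-ovn-ipv6 job in [1].
3. Observe a run in which image mirroring fails at the start of the job.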
Actual results:
The job logs the image mirror failure but continues to execute all subsequent steps. The job runs for its entire duration (or until another step fails) and only then reports the final "Failed" status. This can take hours, as seen in the example provided.
Expected results:
The job should detect the image mirror failure, immediately terminate, and report the error. The job status should change to "Failed" within minutes of starting.
Additional info:
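The expected behavior can be illustrated with a minimal sketch. It is not taken from the actual CI step implementation: the function `run_fail_fast` and the step names `mirror-images`, `install-cluster`, and `run-e2e-tests` are hypothetical and are used only to show the control flow being requested, where the first failing step stops the run.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
# All names here are hypothetical and chosen for illustration only.
Step = Tuple[str, Callable[[], bool]]


def run_fail_fast(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first one that reports failure.

    Returns (overall_success, names_of_steps_that_were_started).
    """
    started: List[str] = []
    for name, action in steps:
        started.append(name)
        if not action():
            # Fail fast: do not start any later step.
            return False, started
    return True, started


# Simulated job in which the mirroring step fails at the start.
job_steps: List[Step] = [
    ("mirror-images", lambda: False),
    ("install-cluster", lambda: True),
    ("run-e2e-tests", lambda: True),
]

ok, started = run_fail_fast(job_steps)
print(ok, started)
```

With this control flow, the install and test steps are never started once mirroring has failed, so the job can report "Failed" right away instead of hours later.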