Bug
Resolution: Done-Errata
Major
4.12.z
None
Low
No
False
Description of problem:
When we configure a MachineConfig with an osImageURL that cannot be pulled, the pools are not degraded as they should be. The MCD logs show that the pool will be marked as "Degraded", but instead it is marked as "Working":

E0525 10:36:49.013820 201317 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc-54054-fake-image:latest failed: Error: initializing source docker://quay.io/openshifttest/tc-54054-fake-image:latest: reading manifest latest in quay.io/openshifttest/tc-54054-fake-image: name unknown: repository not found : exit status 125]
I0525 10:36:49.023498 201317 daemon.go:510] Transitioned from state: Done -> Working
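For reference, the pool-level Degraded condition can be checked directly while reproducing; a minimal sketch using the standard oc CLI (the worker pool is used here as an example):

# Degraded condition currently reported by the worker MachineConfigPool
oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'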
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-05-24-182756
How reproducible:
Always. Sometimes the pools eventually become degraded after 20 or 40 minutes, sometimes they never do.
Steps to Reproduce:
1. Create these MCs (a short apply/watch sketch follows these steps):

kind: MachineConfig
apiVersion: machineconfiguration.openshift.io/v1
metadata:
  labels:
    machineconfiguration.openshift.io/role: "master"
  name: "fake-image-tc54054-master"
spec:
  osImageURL: "quay.io/openshifttest/tc-54054-fake-image:latest"
---
kind: MachineConfig
apiVersion: machineconfiguration.openshift.io/v1
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker"
  name: "fake-image-tc54054-worker"
spec:
  osImageURL: "quay.io/openshifttest/tc-54054-fake-image:latest"

2. Wait for the nodes to be degraded
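A minimal sketch of applying the manifests above and watching the pools, assuming they are saved to a file named fake-image-tc54054.yaml (hypothetical name):

# Apply both MachineConfigs and watch the pools for a Degraded=True condition
oc apply -f fake-image-tc54054.yaml
oc get mcp -w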
Actual results:
In the MCD logs we can see that the MCD reports that the pool should be degraded, but instead of setting state=Degraded it sets state=Working:

E0525 10:36:49.013820 201317 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc-54054-fake-image:latest failed: Error: initializing source docker://quay.io/openshifttest/tc-54054-fake-image:latest: reading manifest latest in quay.io/openshifttest/tc-54054-fake-image: name unknown: repository not found : exit status 125]
I0525 10:36:49.023498 201317 daemon.go:510] Transitioned from state: Done -> Working
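The two log lines above can be followed live on the affected node's MCD pod; a sketch, with the pod name as a placeholder:

# Tail the MCD log and keep only the degradation / state-transition messages
oc -n openshift-machine-config-operator logs -f <machine-config-daemon-pod> -c machine-config-daemon | grep -E 'Marking Degraded|Transitioned from state'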
Expected results:
The pool should be degraded when, after all the retries, we can't pull the osImage.
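For reference, the per-node state the MCD reports can be read from the node annotation it maintains; a sketch, with the node name as a placeholder:

# Should report "Degraded" once the retries are exhausted, instead of "Working"
oc get node <node-name> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}'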
Additional info:
This problem does not happen in 4.13 or 4.14. It looks like the pools can randomly become degraded eventually, or not (I haven't waited longer than 40 minutes or so). A must-gather from an attempt waiting approx. 25 minutes is linked in the first comment of this issue. The issue is very similar to: https://issues.redhat.com/browse/OCPBUGS-3001
depends on: OCPBUGS-1761 osImages that cannot be pulled do not set the node as Degraded properly (Closed)
links to: RHBA-2024:0485 OpenShift Container Platform 4.12.z bug fix update