OpenShift Bugs · OCPBUGS-14071

Pools are not degraded when we configure an osImage that cannot be pulled


Details

    • Bug
    • Resolution: Done-Errata
    • Major
    • 4.12.z
    • 4.12.z
    • None
    • Low
    • No
    • False

    Description

      Description of problem:

When we configure an MC with an osImage that cannot be pulled, the pools are not degraded as they should be.

The MCD logs show that the pool will be marked as "Degraded", but it is instead marked as "Working".
      
      E0525 10:36:49.013820  201317 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc-54054-fake-image:latest failed: Error: initializing source docker://quay.io/openshifttest/tc-54054-fake-image:latest: reading manifest latest in quay.io/openshifttest/tc-54054-fake-image: name unknown: repository not found
      : exit status 125]
      I0525 10:36:49.023498  201317 daemon.go:510] Transitioned from state: Done -> Working
      
      
      
      

      Version-Release number of selected component (if applicable):

      4.12.0-0.nightly-2023-05-24-182756

      How reproducible:

Always. Sometimes the pools eventually become degraded after 20 to 40 minutes; sometimes they never do.

      Steps to Reproduce:

      1. Create these MCs
      
      kind: MachineConfig
      apiVersion: machineconfiguration.openshift.io/v1
      metadata:
        labels:
          machineconfiguration.openshift.io/role: "master"
        name: "fake-image-tc54054-master"
      spec:
        osImageURL: "quay.io/openshifttest/tc-54054-fake-image:latest"
      
      ---
      
      kind: MachineConfig
      apiVersion: machineconfiguration.openshift.io/v1
      metadata:
        labels:
          machineconfiguration.openshift.io/role: "worker"
        name: "fake-image-tc54054-worker"
      spec:
        osImageURL: "quay.io/openshifttest/tc-54054-fake-image:latest"
      
      
      
2. Wait for the pools to be degraded
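
Not part of the original report, but a quick way to watch the pool state while reproducing. These are standard `oc` commands; the manifest file name is an assumption (save the two MachineConfigs above as `fake-image-mcs.yaml`):

```shell
# Apply the MachineConfigs from step 1 (file name assumed)
oc apply -f fake-image-mcs.yaml

# Watch the pool conditions; with this bug the pools keep reporting
# Updating/Working instead of Degraded=True
oc get mcp master worker

# Print only the Degraded condition status for the worker pool
oc get mcp worker \
  -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'

# Follow the machine-config-daemon logs to see the "Marking Degraded due to"
# error followed by the Done -> Working transition
oc -n openshift-machine-config-operator logs \
  -l k8s-app=machine-config-daemon -c machine-config-daemon --tail=50
```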
      
      

      Actual results:

In the MCD logs we can see that the daemon reports that the pool should be degraded, but instead of setting state=Degraded it sets state=Working:
      
      
      E0525 10:36:49.013820  201317 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc-54054-fake-image:latest failed: Error: initializing source docker://quay.io/openshifttest/tc-54054-fake-image:latest: reading manifest latest in quay.io/openshifttest/tc-54054-fake-image: name unknown: repository not found
      : exit status 125]
      I0525 10:36:49.023498  201317 daemon.go:510] Transitioned from state: Done -> Working
      
      
      

      Expected results:

The pool should be degraded when, after all the retries, the osImage cannot be pulled.

      Additional info:

      This problem does not happen in 4.13 and 4.14.
      
It looks like the pools can randomly become degraded eventually, or not (I haven't waited longer than 40 minutes or so).

A must-gather from an attempt that waited approximately 25 minutes is linked in the first comment of this issue.
      
      The issue is very similar to: https://issues.redhat.com/browse/OCPBUGS-3001

       

People

    jkyros@redhat.com John Kyros
    sregidor@redhat.com Sergio Regidor de la Rosa
    Votes: 0
    Watchers: 9