OpenShift Bugs · OCPBUGS-14071

Pools are not degraded when we configure an osImage that cannot be pulled


Details

    • Bug
    • Resolution: Done-Errata
    • Major
    • 4.12.z
    • 4.12.z
    • None
    • Low
    • No
    • False

    Description

      Description of problem:

When we configure an MC with an osImage that cannot be pulled, the pools are not degraded as they should be.

The MCD logs show that the pool will be marked as "Degraded", but it is instead marked as "Working".
      
      E0525 10:36:49.013820  201317 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc-54054-fake-image:latest failed: Error: initializing source docker://quay.io/openshifttest/tc-54054-fake-image:latest: reading manifest latest in quay.io/openshifttest/tc-54054-fake-image: name unknown: repository not found
      : exit status 125]
      I0525 10:36:49.023498  201317 daemon.go:510] Transitioned from state: Done -> Working
      
      
      
      

      Version-Release number of selected component (if applicable):

      4.12.0-0.nightly-2023-05-24-182756

      How reproducible:

Always. Sometimes the pools eventually become degraded after 20 to 40 minutes; sometimes they never do.

      Steps to Reproduce:

      1. Create these MCs
      
      kind: MachineConfig
      apiVersion: machineconfiguration.openshift.io/v1
      metadata:
        labels:
          machineconfiguration.openshift.io/role: "master"
        name: "fake-image-tc54054-master"
      spec:
        osImageURL: "quay.io/openshifttest/tc-54054-fake-image:latest"
      
      ---
      
      kind: MachineConfig
      apiVersion: machineconfiguration.openshift.io/v1
      metadata:
        labels:
          machineconfiguration.openshift.io/role: "worker"
        name: "fake-image-tc54054-worker"
      spec:
        osImageURL: "quay.io/openshifttest/tc-54054-fake-image:latest"
      
      
      
2. Wait for the pools to be degraded
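
Not part of the original report, but a quick way to watch the pool state while reproducing. These are standard `oc` commands; the manifest file name is an assumption (save the two MachineConfigs above as `fake-image-mcs.yaml`):

```shell
# Apply the MachineConfigs from step 1 (file name assumed)
oc apply -f fake-image-mcs.yaml

# Watch the pool conditions; with this bug the pools keep reporting
# Updating/Working instead of Degraded=True
oc get mcp master worker

# Print only the Degraded condition status for the worker pool
oc get mcp worker \
  -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'

# Follow the machine-config-daemon logs to see the "Marking Degraded due to"
# error followed by the Done -> Working transition
oc -n openshift-machine-config-operator logs \
  -l k8s-app=machine-config-daemon -c machine-config-daemon --tail=50
```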
      
      

      Actual results:

In the MCD logs we can see that the daemon reports that the pool should be degraded, but instead of setting state=Degraded it sets state=Working:
      
      
      E0525 10:36:49.013820  201317 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc-54054-fake-image:latest failed: Error: initializing source docker://quay.io/openshifttest/tc-54054-fake-image:latest: reading manifest latest in quay.io/openshifttest/tc-54054-fake-image: name unknown: repository not found
      : exit status 125]
      I0525 10:36:49.023498  201317 daemon.go:510] Transitioned from state: Done -> Working
      
      
      

      Expected results:

The pool should be degraded when, after all the retries, the osImage cannot be pulled.

      Additional info:

      This problem does not happen in 4.13 and 4.14.
      
It looks like the pools can randomly become degraded eventually, or not (I haven't waited longer than 40 minutes or so).

A must-gather from an attempt that waited approximately 25 minutes is linked in the first comment of this issue.
      
      The issue is very similar to: https://issues.redhat.com/browse/OCPBUGS-3001

       

People

    jkyros@redhat.com John Kyros
    sregidor@redhat.com Sergio Regidor de la Rosa
    Votes: 0
    Watchers: 9