OpenShift Bugs: OCPBUGS-18969

SNO fails install because image-registry operator is degraded - "Degraded: The registry is removed..."


    • Sprint 243
    • Release Note Text:
      * Previously, the Image Registry pruner relied on a cluster role that was managed by the openshift-apiserver. This could cause the pruner job to intermittently fail during an upgrade. Now, the Image Registry Operator is responsible for creating the pruner cluster role, which resolves the issue. (link:https://issues.redhat.com/browse/OCPBUGS-18969[*OCPBUGS-18969*])
    • Release Note Type: Bug Fix
    • Done
    • 9/19: telco prioritization pending triage
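
      The release note above states that the Image Registry Operator now creates the pruner cluster role itself instead of relying on openshift-apiserver. A minimal verification sketch on a cluster that already carries the fix (the exact clusterrolebinding name is not confirmed here, hence the grep):

      # oc get clusterrole system:image-pruner -o yaml
      # oc get clusterrolebinding | grep -i image-pruner

      If the operator owns the role as described, the first command should succeed even early in the install, before the pruner job runs.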

      Description of problem:

      While installing many SNOs via ZTP using ACM, two SNOs failed to complete the install because the image-registry cluster operator was degraded during the install process.
      
      # cat clusters | xargs -I % sh -c "echo '%'; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion"
      vm01831
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       False         18h     Error while reconciling 4.14.0-rc.0: the cluster operator image-registry is degraded
      vm02740
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       False         18h     Error while reconciling 4.14.0-rc.0: the cluster operator image-registry is degraded
      
      # cat clusters | xargs -I % sh -c "echo '%'; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get co image-registry"
      vm01831
      NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      image-registry   4.14.0-rc.0   True        False         True       18h     Degraded: The registry is removed...
      vm02740
      NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      image-registry   4.14.0-rc.0   True        False         True       18h     Degraded: The registry is removed...
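
      The table output truncates the Degraded message; a sketch (using one of the affected clusters from above) for dumping the full condition text:

      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig get co image-registry -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'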
      
      Both clusters showed the image-pruner job pod in an Error state:
      # cat clusters | xargs -I % sh -c "echo '%'; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get po -n openshift-image-registry"
      vm01831
      NAME                                               READY   STATUS    RESTARTS   AGE
      cluster-image-registry-operator-5d497944d4-czn64   1/1     Running   0          18h
      image-pruner-28242720-w6jmv                        0/1     Error     0          18h
      node-ca-vtfj8                                      1/1     Running   0          18h
      vm02740
      NAME                                               READY   STATUS    RESTARTS      AGE
      cluster-image-registry-operator-5d497944d4-lbtqw   1/1     Running   1 (18h ago)   18h
      image-pruner-28242720-ltqzk                        0/1     Error     0             18h
      node-ca-4fntj                                      1/1     Running   0             18h
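
      For completeness, a sketch for confirming that the failing pod belongs to the operator-managed pruner CronJob and for checking the pruner configuration (the ImagePruner resource is assumed to be the default singleton named "cluster"):

      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig -n openshift-image-registry get cronjob,job
      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig get imagepruner cluster -o yaml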

       

      Version-Release number of selected component (if applicable):

      Deployed SNO OCP - 4.14.0-rc.0
      Hub 4.13.11
      ACM - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52

      How reproducible:

      Rare; only 2 clusters were found in this state after the test.

      Steps to Reproduce:

      1. Install many SNO clusters (4.14.0-rc.0) via ZTP using ACM.
      2. Wait for the installs to complete.
      3. Check the clusterversion and the image-registry cluster operator status on each cluster.
      

      Actual results:

      Two of the SNO clusters fail to complete installation; the image-registry cluster operator reports Degraded with "The registry is removed..." and the image-pruner job pod is in an Error state.

      Expected results:

      All SNO clusters complete installation with the image-registry cluster operator Available and not Degraded.

      Additional info:

      It appears some permissions were missing: the pruner service account could not list pods because the system:image-pruner cluster role was not found:
      
      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig logs -n openshift-image-registry image-pruner-28242720-w6jmv
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #1 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #2 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #3 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #4 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #5 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
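
      A sketch for confirming the missing RBAC on the affected cluster and, once the system:image-pruner cluster role is present again, re-running the pruner by hand (the job name image-pruner-manual is arbitrary):

      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig auth can-i list pods --all-namespaces --as=system:serviceaccount:openshift-image-registry:pruner
      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig -n openshift-image-registry create job image-pruner-manual --from=cronjob/image-pruner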
      
      

            fmissi Flavian Missi
            akrzos@redhat.com Alex Krzos
            Wen Wang Wen Wang