OpenShift Bugs / OCPBUGS-18969

SNO fails install because image-registry operator is degraded - "Degraded: The registry is removed..."

    • Sprint 243
    • Release Note Type: Bug Fix
    • Release Note Text:
      * Previously, the Image Registry pruner relied on a cluster role that was managed by the openshift-apiserver. This could cause the pruner job to intermittently fail during an upgrade. Now, the Image Registry Operator is responsible for creating the pruner cluster role, which resolves the issue. (link:https://issues.redhat.com/browse/OCPBUGS-18969[*OCPBUGS-18969*])
    • 9/19: telco prioritization pending triage

      Description of problem:

      While installing many SNOs via ZTP using ACM, two SNOs failed to complete install because the image-registry was degraded during the install process.
      
      # cat clusters | xargs -I % sh -c "echo '%'; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion"
      vm01831
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       False         18h     Error while reconciling 4.14.0-rc.0: the cluster operator image-registry is degraded
      vm02740
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       False         18h     Error while reconciling 4.14.0-rc.0: the cluster operator image-registry is degraded
      
      # cat clusters | xargs -I % sh -c "echo '%'; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get co image-registry"
      vm01831
      NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      image-registry   4.14.0-rc.0   True        False         True       18h     Degraded: The registry is removed...
      vm02740
      NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      image-registry   4.14.0-rc.0   True        False         True       18h     Degraded: The registry is removed...
      
      Both clusters showed the image-pruner job pod in an error state:
      # cat clusters | xargs -I % sh -c "echo '%'; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get po -n openshift-image-registry"
      vm01831
      NAME                                               READY   STATUS    RESTARTS   AGE
      cluster-image-registry-operator-5d497944d4-czn64   1/1     Running   0          18h
      image-pruner-28242720-w6jmv                        0/1     Error     0          18h
      node-ca-vtfj8                                      1/1     Running   0          18h
      vm02740
      NAME                                               READY   STATUS    RESTARTS      AGE
      cluster-image-registry-operator-5d497944d4-lbtqw   1/1     Running   1 (18h ago)   18h
      image-pruner-28242720-ltqzk                        0/1     Error     0             18h
      node-ca-4fntj                                      1/1     Running   0             18h
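
      The failed job itself can be inspected with the usual commands; a small diagnostic sketch, assuming the job name follows the usual CronJob naming (the pod name above minus its random suffix):

      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig -n openshift-image-registry describe job image-pruner-28242720
      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig -n openshift-image-registry logs image-pruner-28242720-w6jmv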

       

      Version-Release number of selected component (if applicable):

      Deployed SNO OCP - 4.14.0-rc.0
      Hub 4.13.11
      ACM - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52

      How reproducible:

      Rare; only 2 of the many clusters deployed were found in this state after the test.

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

      Seems like some permissions might have been lacking:
      
      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig logs -n openshift-image-registry image-pruner-28242720-w6jmv
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #1 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #2 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #3 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #4 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #5 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
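
      The role and service account named in the error can be checked directly; a quick diagnostic sketch using only the names from the message above:

      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig get clusterrole system:image-pruner
      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig auth can-i list pods --as=system:serviceaccount:openshift-image-registry:pruner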
      
      


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Critical: OpenShift Container Platform 4.15.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:7198


            Wen Wang added a comment -

            Did a regression test with version 4.15.0-0.nightly-2023-10-09-101435; the upgrade from 4.14 to 4.15 succeeded, so closing this.


            Wen Wang added a comment - edited

            fmissi, I will do the upgrade and regression test when a 4.15 build is available; the latest build does not yet include the fix.


            OpenShift Jira Bot added a comment -

            Hi fmissi,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the bug to Verified.


            Flavian Missi added a comment -

            Hi wewang@redhat.com, we're about to merge the second PR that should fix this.

            The change moves creation of the "system:image-pruner" cluster role from the openshift-apiserver into the image registry operator.

            Since this is a difficult bug to reproduce, I think we can focus on regression tests for the pruner. We need to cover both the fresh install and upgrade cases.

            Let me know if you need more info, and thanks!
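
            A manual pruner run is an easy way to exercise this without waiting for the schedule; a sketch, assuming the pruner CronJob keeps its usual name, image-pruner (inferred from the pod names above):

            # oc -n openshift-image-registry create job --from=cronjob/image-pruner image-pruner-manual
            # oc -n openshift-image-registry wait --for=condition=complete job/image-pruner-manual --timeout=300s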


            Flavian Missi added a comment - edited

            Hi bzvonar@redhat.com, I wouldn't expect anyone other than the registry team to get this done, but we are happy to take in contributions.

            I'm also uncertain about backporting the change once it lands on the main branch; it's not something I would do without a lot of consideration.


            Flavian Missi added a comment -

            Thanks akrzos@redhat.com!

            The problem here is that the openshift-apiserver is currently responsible for creating a cluster role that the pruner relies on. This cluster role should instead be created by the registry operator (which also manages the pruner), so that there is no interdependence between components that might be updated simultaneously.
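
            For context, the RBAC pieces involved can be listed with standard commands; a minimal sketch using the names from the pruner error:

            # oc get clusterrole system:image-pruner
            # oc -n openshift-image-registry get serviceaccount pruner
            # oc get clusterrolebinding -o wide | grep image-pruner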

            Alex Krzos added a comment -

            fmissi I attached a successfully installed SNO must-gather (vm00001, version 4.14.0-rc.1). Let me know if you need anything else.

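
            For reference, a must-gather like the one attached can be collected as follows (the destination directory is just an example):

            # oc --kubeconfig /root/hv-vm/kc/vm00001/kubeconfig adm must-gather --dest-dir=/tmp/must-gather-vm00001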

            Flavian Missi added a comment -

            Okay, so it looks like the system:image-pruner cluster role is created by openshift-apiserver:

            2023-09-13T00:13:35.527957614Z I0913 00:13:35.527791       1 storage_rbac.go:226] created clusterrole.rbac.authorization.k8s.io/system:image-pruner 

            And the pruner job tries to run for the first time:

            2023-09-13T00:01:35.766589593Z Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found 

            And the last try:

            2023-09-13T00:09:07.529941880Z Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found 

            So the required role for the pruner to work is created AFTER the pruner job runs and gives up retrying.
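
            That ordering can be confirmed on an affected cluster by comparing the role's creation time against the pruner pod's log timestamps; a small sketch:

            # oc get clusterrole system:image-pruner -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
            # oc -n openshift-image-registry logs image-pruner-28242720-w6jmv --timestamps | tail -n 2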


            Flavian Missi added a comment -

            So the pruner should run whether the registry is removed or not.

            As you pointed out, the problem does seem to be a missing ClusterRole (system:image-pruner).

            It should be created by openshift-apiserver; I'm currently trying to find out why that didn't happen.
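
            On a fresh install, the moment the role appears can be recorded with a simple poll loop; a minimal sketch:

            # while ! oc get clusterrole system:image-pruner 2>/dev/null; do sleep 10; done; echo "role created: $(date -u)"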

