  1. OpenShift Bugs
  2. OCPBUGS-39574

OLM catalogsource pods do not recover from node failure when registryPoll is none


    • Bug
    • Resolution: Done-Errata
    • Major
    • 4.17.0
    • 4.14.z, 4.15.z, 4.17.0, 4.16.z
    • OLM
    • Important
    • No
    • YellowJacket OLM Sprint 259
    • 1
    • Rejected
    • False
    • * Previously, catalog source pods could not recover from a cluster node failure when the `registryPoll` was unset. With this fix, OLM updates its logic for checking for dead pods, and as a result, catalog source pods now recover from node failures as expected. (link:https://issues.redhat.com/browse/OCPBUGS-39574[*OCPBUGS-39574*])
    • Bug Fix
    • Done

      Description of problem:

      The pod of a CatalogSource that has no registryPoll configured was not recreated during a node failure.

      jiazha-mac:~ jiazha$ oc get pods
      NAME                                    READY   STATUS        RESTARTS       AGE
      certified-operators-rcs64               1/1     Running       0              123m
      community-operators-8mxh6               1/1     Running       0              123m
      marketplace-operator-769fbb9898-czsfn   1/1     Running       4 (117m ago)   136m
      qe-app-registry-5jxlx                   1/1     Running       0              106m
      redhat-marketplace-4bgv9                1/1     Running       0              123m
      redhat-operators-ww5tb                  1/1     Running       0              123m
      test-2xvt8                              1/1     Terminating   0              12m
      
      jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide 
      NAME         READY   STATUS    RESTARTS   AGE    IP            NODE                                          NOMINATED NODE   READINESS GATES
      test-2xvt8   1/1     Running   0          7m6s   10.129.2.26   qe-daily-417-0708-cv2p6-worker-westus-gcrrc   <none>           <none>
      
      jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
      NAME                                          STATUS     ROLES    AGE    VERSION
      qe-daily-417-0708-cv2p6-worker-westus-gcrrc   NotReady   worker   116m   v1.30.2+421e90e

      Version-Release number of selected component (if applicable):

           Cluster version is 4.17.0-0.nightly-2024-07-07-131215

      How reproducible:

          Always

      Steps to Reproduce:

          1. Create a CatalogSource without registryPoll configured.
      
      jiazha-mac:~ jiazha$ cat cs-32183.yaml 
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      metadata:
        name: test
        namespace: openshift-marketplace
      spec:
        displayName: Test Operators
        image: registry.redhat.io/redhat/redhat-operator-index:v4.16
        publisher: OpenShift QE
        sourceType: grpc
      
      jiazha-mac:~ jiazha$ oc create -f cs-32183.yaml 
      catalogsource.operators.coreos.com/test created
      
      jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide 
      NAME         READY   STATUS    RESTARTS   AGE     IP            NODE                                          NOMINATED NODE   READINESS GATES
      test-2xvt8   1/1     Running   0          3m18s   10.129.2.26   qe-daily-417-0708-cv2p6-worker-westus-gcrrc   <none>           <none>
      
      
          2. Stop the kubelet on the node to simulate a node failure.
      jiazha-mac:~ jiazha$ oc debug node/qe-daily-417-0708-cv2p6-worker-westus-gcrrc 
      Temporary namespace openshift-debug-q4d5k is created for debugging node...
      Starting pod/qe-daily-417-0708-cv2p6-worker-westus-gcrrc-debug-v665f ...
      To use host binaries, run `chroot /host`
      Pod IP: 10.0.128.5
      If you don't see a command prompt, try pressing enter.
      sh-5.1# chroot /host
      sh-5.1# systemctl stop kubelet; sleep 600; systemctl start kubelet
      
      
      Removing debug pod ...
      Temporary namespace openshift-debug-q4d5k was removed.
      
      jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
      NAME                                          STATUS     ROLES    AGE    VERSION
      qe-daily-417-0708-cv2p6-worker-westus-gcrrc   NotReady   worker   115m   v1.30.2+421e90e
      
      
          3. Check whether this CatalogSource's pod is recreated.
      
          

      Actual results:

      No new pod was generated. 

      jiazha-mac:~ jiazha$ oc get pods
      NAME                                    READY   STATUS        RESTARTS       AGE
      certified-operators-rcs64               1/1     Running       0              123m
      community-operators-8mxh6               1/1     Running       0              123m
      marketplace-operator-769fbb9898-czsfn   1/1     Running       4 (117m ago)   136m
      qe-app-registry-5jxlx                   1/1     Running       0              106m
      redhat-marketplace-4bgv9                1/1     Running       0              123m
      redhat-operators-ww5tb                  1/1     Running       0              123m
      test-2xvt8                              1/1     Terminating   0              12m
      
      

      Once the node recovered, a new pod was generated.

      
      jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
      NAME                                          STATUS   ROLES    AGE    VERSION
      qe-daily-417-0708-cv2p6-worker-westus-gcrrc   Ready    worker   127m   v1.30.2+421e90e
      
      jiazha-mac:~ jiazha$ oc get pods 
      NAME                                    READY   STATUS    RESTARTS       AGE
      certified-operators-rcs64               1/1     Running   0              127m
      community-operators-8mxh6               1/1     Running   0              127m
      marketplace-operator-769fbb9898-czsfn   1/1     Running   4 (121m ago)   140m
      qe-app-registry-5jxlx                   1/1     Running   0              109m
      redhat-marketplace-4bgv9                1/1     Running   0              127m
      redhat-operators-ww5tb                  1/1     Running   0              127m
      test-wqxvg                              1/1     Running   0              27s 

      Expected results:

      During the node failure, a new catalog source pod should be generated.

          

      Additional info:

      Hi Team,

      After further investigation of the operator-lifecycle-manager source code, we found the cause.

      • Commit [1] tries to fix this issue by adding a "force delete dead pod" step to the ensurePod() function.
      • ensurePod() is called by EnsureRegistryServer() [2].
      • However, syncRegistryServer() returns immediately without calling EnsureRegistryServer() if the catalog source has no registryPoll [3].
      • The CatalogSource objects generated when we build a catalog image by following the documentation [4] do not define registryPoll:
        apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        metadata:
          name: redhat-operator-index
          namespace: openshift-marketplace
        spec:
          image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
          sourceType: grpc
        
      • So the catalog pod created by such a CatalogSource cannot recover.
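
      The control flow described in the bullets above can be illustrated with a small, self-contained Go model. This is only a sketch of the reported behavior, not the actual OLM code: the types catalogSource and pod and the functions syncWithEarlyReturn and syncAlwaysChecking are hypothetical names invented for illustration, and ensurePod here is a toy stand-in for the real function of the same name. The second sync variant shows one possible shape of a dead-pod check that does not depend on registryPoll; it is not claimed to be the fix that was shipped.

```go
package main

import "fmt"

// catalogSource is a hypothetical, simplified stand-in for the real CatalogSource type.
type catalogSource struct {
	name            string
	hasRegistryPoll bool // true when spec.updateStrategy.registryPoll is set
}

// pod is a hypothetical, simplified stand-in for the catalog pod.
type pod struct {
	name string
	dead bool // true when the pod is stuck on an unreachable node
}

// ensurePod is a toy model of the dead-pod cleanup described for commit [1]:
// a dead pod is replaced, a healthy pod is kept unchanged.
func ensurePod(p pod) pod {
	if p.dead {
		return pod{name: p.name + "-replacement"}
	}
	return p
}

// syncWithEarlyReturn models the reported behavior: without registryPoll
// the sync returns before ensurePod is ever reached.
func syncWithEarlyReturn(cs catalogSource, p pod) pod {
	if !cs.hasRegistryPoll {
		return p // dead pod is left in place
	}
	return ensurePod(p)
}

// syncAlwaysChecking models a sync whose dead-pod check does not depend
// on registryPoll (an assumed shape, for comparison only).
func syncAlwaysChecking(cs catalogSource, p pod) pod {
	return ensurePod(p)
}

func main() {
	cs := catalogSource{name: "test", hasRegistryPoll: false}
	stuck := pod{name: "test-2xvt8", dead: true}

	fmt.Printf("early return:    %+v\n", syncWithEarlyReturn(cs, stuck))
	fmt.Printf("always checking: %+v\n", syncAlwaysChecking(cs, stuck))
}
```

      In this model, the pod stays dead under syncWithEarlyReturn when hasRegistryPoll is false, which mirrors the Terminating pod seen in the reproduction, while setting hasRegistryPoll to true reaches ensurePod and replaces it.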

      We also verified that the catalog pod is recreated on another node if we add a registryPoll configuration to the CatalogSource as follows (the lines marked with <==).

      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      metadata:
        name: redhat-operator-index
        namespace: openshift-marketplace
      spec:
        image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
        sourceType: grpc
        updateStrategy:   <==
          registryPoll:   <==
            interval: 10m <==
      

      registryPoll is not a required field for a CatalogSource.
      So fixing the issue only in EnsureRegistryServer(), as commit [1] does, is not sufficient.

      [1] https://github.com/operator-framework/operator-lifecycle-manager/pull/3201/files
      [2] https://github.com/joelanford/operator-lifecycle-manager/blob/82f499723e52e85f28653af0610b6e7feff096cf/pkg/controller/registry/reconciler/grpc.go#L290
      [3] https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/catalog/operator.go#L1009
      [4] https://docs.openshift.com/container-platform/4.16/operators/admin/olm-managing-custom-catalogs.html

            anik120 Anik Bhattacharjee
            rhn-support-jiazha Jian Zhang
            Jian Zhang Jian Zhang
            Alex Dellapenta Alex Dellapenta