  1. OpenShift Bugs
  2. OCPBUGS-31850

[release-4.16] OLM: Catalog Pods CrashLoopBackOff after Cluster `WakesUp` from Hibernating


    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • 4.15.z
    • OLM / Registry
    • No
    • Rejected
    • False

      Description of problem:

         If a cluster is put into hibernation via ACM/Hive and, while the cluster is asleep, the digest of any of the 4 catalogs that ship with OCP changes (i.e. a new bundle is added to the catalog tag, 4.15 in this case), then when the cluster is woken up the pods of the updated catalog(s) go into CrashLoopBackOff and leave the cluster in an unusable state. By unusable we mean that no other operator subscriptions or catalog sources can be applied to the cluster.

      Version-Release number of selected component (if applicable):

          OCP 4.15

      How reproducible:

         See below 

      Steps to Reproduce:

          1. Create a 4.15 cluster with ACM/Hive.
          2. Wait for it to become Healthy.
          3. Put the cluster into Hibernation.
          4. While the cluster is asleep, add a new bundle to an existing catalog (yes, I understand catalogs are immutable, and this would result in a new catalog) such that the digest changes.
          5. Wake the cluster up via ACM/Hive.
          6. Either (both yield similar logs/results):
             a: Note that the pods in the `openshift-marketplace` namespace are in CrashLoopBackOff state.
             b: Create a new subscription and note that this fails, since the catalogs are unhealthy.
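
          Step 6a can be checked programmatically. The following is a minimal sketch, assuming the pod list has been exported with `oc get pods -n openshift-marketplace -o json`; the function name `crashlooping_pods` is hypothetical and not part of any OpenShift tooling.

```python
def crashlooping_pods(pod_list: dict) -> list:
    """Return names of pods with at least one container waiting in CrashLoopBackOff.

    `pod_list` is the parsed JSON of a Kubernetes PodList object.
    """
    names = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for container_status in statuses:
            # A crash-looping container reports state.waiting.reason.
            waiting = container_status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                names.append(pod["metadata"]["name"])
                break  # one crash-looping container is enough to flag the pod
    return names
```

          To use it, load the exported JSON with `json.load` and pass the resulting dictionary to the function; an empty result means no pod in the namespace is currently crash-looping.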
      
          

       

      Actual results:

          New catalogs, subscriptions, operators can't be applied to the cluster.

      Expected results:

          I'd expect that when a cluster wakes up the catalogs are healthy, regardless of whether they have a different digest than when the cluster went to sleep.

      Additional info:

          Code in question (i.e. the code throwing the error): https://github.com/operator-framework/operator-registry/blob/master/pkg/cache/json.go#L181-L194

       

       

      Log from a certified-operators pod (note it could be any catalog pod; we have examples from marketplace as well):
      time="2024-04-02T01:01:22Z" level=info msg="starting pprof endpoint" address="localhost:6060"
      time="2024-04-02T01:01:22Z" level=fatal msg="cache requires rebuild: cache reports digest as \"2e210f20d7ad085a\", but computed digest is \"9d0c54855f748780\""
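
      The fatal message above comes from a comparison between the digest stored with the cache and a digest computed at startup. The following is a simplified sketch of that kind of check, not the operator-registry implementation: the names `fnv1a_64_hex`, `check_integrity`, and `CacheIntegrityError` are hypothetical, and a 64-bit FNV-1a hash is assumed here only because it yields a 16-hex-digit value like those in the log.

```python
FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime


def fnv1a_64_hex(data: bytes) -> str:
    """Return the 64-bit FNV-1a hash of `data` as 16 lowercase hex digits."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) % (1 << 64)
    return f"{h:016x}"


class CacheIntegrityError(Exception):
    """Raised when the stored digest does not match the computed one."""


def check_integrity(cached_digest: str, catalog_content: bytes) -> None:
    """Compare the digest recorded in the cache with one computed from content.

    Assumption: the catalog content is available as a single byte string.
    """
    computed = fnv1a_64_hex(catalog_content)
    if computed != cached_digest:
        raise CacheIntegrityError(
            f'cache requires rebuild: cache reports digest as "{cached_digest}", '
            f'but computed digest is "{computed}"'
        )
```

      In this sketch, a cache built from the old catalog content fails the check as soon as the content changes, which mirrors the scenario in the report: the digest recorded before hibernation no longer matches what is computed after wake-up.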
      Custom Catalog Subscription
      {"apiVersion":"operators.coreos.com/v1alpha1","kind":"Subscription","metadata":{"creationTimestamp":"2024-04-01T22:35:35Z","generation":1,"labels":{"operators.coreos.com/nginx-ingress-operator.nginx-ingress-operator":""},"managedFields":[{"apiVersion":"operators.coreos.com/v1alpha1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:labels":{".":{},"f:operators.coreos.com/nginx-ingress-operator.nginx-ingress-operator":{}}}},"manager":"Go-http-client","operation":"Update","time":"2024-04-01T22:35:35Z"},{"apiVersion":"operators.coreos.com/v1alpha1","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{".":{},"f:channel":{},"f:name":{},"f:source":{},"f:sourceNamespace":{}}},"manager":"preflight","operation":"Update","time":"2024-04-01T22:35:35Z"},{"apiVersion":"operators.coreos.com/v1alpha1","fieldsType":"FieldsV1","fieldsV1":{"f:status":{".":{},"f:catalogHealth":{},"f:conditions":{},"f:lastUpdated":{}}},"manager":"catalog","operation":"Update","subresource":"status","time":"2024-04-01T22:38:35Z"}],"name":"nginx-ingress-operator","namespace":"nginx-ingress-operator","resourceVersion":"40266","uid":"b5ee2e64-c43e-4062-bbb8-4a1f68518753"},"spec":{"channel":"alpha","name":"nginx-ingress-operator","source":"nginx-ingress-operator","sourceNamespace":"nginx-ingress-operator"},"status":{"catalogHealth":[{"catalogSourceRef":{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","name":"nginx-ingress-operator","namespace":"nginx-ingress-operator","resourceVersion":"39191","uid":"a1e95a54-2302-4cd1-9ad3-1c352c8f1379"},"healthy":true,"lastUpdated":"2024-04-01T22:36:10Z"},{"catalogSourceRef":{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","name":"certified-operators","namespace":"openshift-marketplace","resourceVersion":"38032","uid":"5180bff4-d2e2-45e3-a24e-bb37826feef5"},"healthy":true,"lastUpdated":"2024-04-01T22:36:10Z"},{"catalogSourceRef":{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","name":"community-operators","namespace":"openshift-marketplace","resourceVersion":"38073","uid":"8e2e195a-cb09-42ac-b931-666f798ab68f"},"healthy":true,"lastUpdated":"2024-04-01T22:36:10Z"},{"catalogSourceRef":{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","name":"redhat-marketplace","namespace":"openshift-marketplace","resourceVersion":"38078","uid":"11781d2c-03c0-48bb-b29a-06b9a5e1990f"},"healthy":true,"lastUpdated":"2024-04-01T22:36:10Z"},{"catalogSourceRef":{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","name":"redhat-operators","namespace":"openshift-marketplace","resourceVersion":"38079","uid":"3ea5a9a4-039b-4646-9f47-484e13589e83"},"healthy":true,"lastUpdated":"2024-04-01T22:36:10Z"}],"conditions":[{"message":"[failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 172.30.209.124:50051: connect: connection refused\", failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 172.30.243.69:50051: connect: connection refused\"]","reason":"ErrorPreventedResolution","status":"True","type":"ResolutionFailed"},{"lastTransitionTime":"2024-04-01T22:36:10Z","message":"all available catalogsources are healthy","reason":"AllCatalogSourcesHealthy","status":"False","type":"CatalogSourcesUnhealthy"}],"lastUpdated":"2024-04-01T22:38:34Z"}}
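
      Note that in the Subscription above every `catalogHealth` entry reports `healthy: true` and `CatalogSourcesUnhealthy` is `False`, while the `ResolutionFailed` condition is `True` with "connection refused" errors for certified-operators and redhat-marketplace. A small helper can surface both views side by side; this is a sketch only, and the function name `summarize_subscription` is hypothetical.

```python
def summarize_subscription(sub: dict) -> dict:
    """Extract unhealthy catalogs and active resolution failures from a Subscription.

    `sub` is the parsed JSON of an operators.coreos.com/v1alpha1 Subscription.
    """
    status = sub.get("status", {})
    # Catalogs that the catalogHealth list itself marks as unhealthy.
    unhealthy = [
        entry["catalogSourceRef"]["name"]
        for entry in status.get("catalogHealth", [])
        if not entry.get("healthy", False)
    ]
    # Messages of ResolutionFailed conditions that are currently true.
    failures = [
        cond.get("message", "")
        for cond in status.get("conditions", [])
        if cond.get("type") == "ResolutionFailed" and cond.get("status") == "True"
    ]
    return {"unhealthy_catalogs": unhealthy, "resolution_failures": failures}
```

      For the Subscription shown here, such a summary would list no unhealthy catalogs but one resolution failure, which suggests that `catalogHealth` alone is not a reliable signal for this problem.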

       

      Custom CatalogSource
      {"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","metadata":{"creationTimestamp":"2024-04-01T22:35:35Z","generation":1,"managedFields":[{"apiVersion":"operators.coreos.com/v1alpha1","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{".":{},"f:displayName":{},"f:icon":{".":{},"f:base64data":{},"f:mediatype":{}},"f:image":{},"f:secrets":{},"f:sourceType":{}}},"manager":"preflight","operation":"Update","time":"2024-04-01T22:35:35Z"},{"apiVersion":"operators.coreos.com/v1alpha1","fieldsType":"FieldsV1","fieldsV1":{"f:status":{".":{},"f:connectionState":{".":{},"f:address":{},"f:lastConnect":{},"f:lastObservedState":{}},"f:registryService":{".":{},"f:createdAt":{},"f:port":{},"f:protocol":{},"f:serviceName":{},"f:serviceNamespace":{}}}},"manager":"catalog","operation":"Update","subresource":"status","time":"2024-04-01T22:36:03Z"}],"name":"nginx-ingress-operator","namespace":"nginx-ingress-operator","resourceVersion":"39191","uid":"a1e95a54-2302-4cd1-9ad3-1c352c8f1379"},"spec":{"displayName":"nginx-ingress-operator","icon":{"base64data":"","mediatype":""},"image":"quay.io/operator-pipeline-prod/nginx-ingress-operator-index:v4.16-36a87cabd459f7be3258a7e60ef53751ea737de4","secrets":["registry-auth-keys"],"sourceType":"grpc"},"status":{"connectionState":{"address":"nginx-ingress-operator.nginx-ingress-operator.svc:50051","lastConnect":"2024-04-01T22:36:03Z","lastObservedState":"READY"},"registryService":{"createdAt":"2024-04-01T22:35:37Z","port":"50051","protocol":"grpc","serviceName":"nginx-ingress-operator","serviceNamespace":"nginx-ingress-operator"}}}
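
      The custom CatalogSource above reports `lastObservedState` as `READY`, unlike the default catalogs whose gRPC endpoints refuse connections. As a rough sketch, the connection state of any CatalogSource dump could be read like this; the function names are hypothetical.

```python
def catalog_connection(catalog_source: dict) -> tuple:
    """Return (address, lastObservedState) from a CatalogSource's status.

    `catalog_source` is the parsed JSON of an
    operators.coreos.com/v1alpha1 CatalogSource; missing fields yield None.
    """
    state = catalog_source.get("status", {}).get("connectionState", {})
    return state.get("address"), state.get("lastObservedState")


def is_ready(catalog_source: dict) -> bool:
    """True when the last observed gRPC connection state is READY."""
    return catalog_connection(catalog_source)[1] == "READY"
```

      Running the same check against the four default CatalogSources in `openshift-marketplace` would help show which of them are affected after wake-up.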

      Must-gather from Prow (may be from a different operator test run than the one shown above)
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-redhat-openshift-ecosystem-certified-operators-prod-ocp-4.15-preflight-prod-claim/1774927027938267136/

      Slack Discussion:
      https://redhat-internal.slack.com/archives/C3VS0LV41/p1711751930041179

       

            rh-ee-cchantse Catherine Chan-Tse
            acornett@redhat.com Adam Cornett
            Jia Fan Jia Fan