OpenShift Bugs / OCPBUGS-64631

CatalogSource reporting TRANSIENT_FAILURE when --olm-catalog-placement=Guest


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • 4.21.0
    • HyperShift
    • Quality / Stability / Reliability
    • Important
    • Approved

      Description of problem:

      Catalog sources fail to start when the HostedCluster uses spec.olmCatalogPlacement: guest. This affects both the default catalog sources (e.g. certified-operators) and custom sources.
      
      This is a regression introduced by https://github.com/openshift/operator-framework-olm/pull/1129

      Version-Release number of selected component (if applicable):

          4.21 (since Oct 25)

      How reproducible:

          Always

      Steps to Reproduce:

          1. Start a hosted cluster using:
               hypershift create cluster aws ... 
               --olm-catalog-placement=Guest
               --release-image=<a 4.21 nightly from Oct 25 or later>
          2. In the guest cluster, check the Catalog Sources in the openshift-marketplace namespace and the available package manifests.
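
The check in step 2 can be scripted. The following is a minimal sketch, assuming the JSON comes from `oc get catalogsource -n openshift-marketplace -o json` run against the guest cluster. The function name `unhealthy_catalog_sources` and the `UNKNOWN` placeholder are illustrative and not part of OLM; the sample data is reduced to the fields the function reads.

```python
import json


def unhealthy_catalog_sources(catalog_list: dict) -> list[tuple[str, str]]:
    """Return (name, state) for every CatalogSource whose connection is not READY."""
    unhealthy = []
    for item in catalog_list.get("items", []):
        name = item["metadata"]["name"]
        # status.connectionState may be missing before the first reconcile;
        # "UNKNOWN" is a placeholder chosen for this sketch.
        state = (
            item.get("status", {})
            .get("connectionState", {})
            .get("lastObservedState", "UNKNOWN")
        )
        if state != "READY":
            unhealthy.append((name, state))
    return unhealthy


# Sample input shaped like the CatalogSource list returned by `oc ... -o json`.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "redhat-operators"},
   "status": {"connectionState": {"lastObservedState": "TRANSIENT_FAILURE"}}},
  {"metadata": {"name": "healthy-example"},
   "status": {"connectionState": {"lastObservedState": "READY"}}}
]}
""")

for name, state in unhealthy_catalog_sources(sample):
    print(f"{name}: {state}")
```

An empty result would correspond to the expected behavior (all Catalog Sources READY).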
          

      Actual results:

      Catalog Source in guest cluster:

      - apiVersion: operators.coreos.com/v1alpha1
        kind: CatalogSource
        metadata:
          annotations:
            target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
          creationTimestamp: "2025-11-04T08:13:07Z"
          generation: 1
          labels:
            hypershift.openshift.io/managed: "true"
          name: redhat-operators
          namespace: openshift-marketplace
          resourceVersion: "35089"
          uid: 9795a184-7a8e-4198-bb6a-057038116b7b
        spec:
          displayName: Red Hat Operators
          grpcPodConfig:
            securityContextConfig: restricted
          icon:
            base64data: ""
            mediatype: ""
          image: registry.redhat.io/redhat/redhat-operator-index:v4.20
          priority: -100
          publisher: Red Hat
          sourceType: grpc
          updateStrategy:
            registryPoll:
              interval: 10m
        status:
          connectionState:
            address: redhat-operators.openshift-marketplace.svc:50051
            lastConnect: "2025-11-04T09:34:33Z"
            lastObservedState: TRANSIENT_FAILURE
          latestImageRegistryPoll: "2025-11-04T09:45:56Z"
          registryService:
            createdAt: "2025-11-04T08:32:13Z"
            port: "50051"
            protocol: grpc
            serviceName: redhat-operators
            serviceNamespace: openshift-marketplace
      
      
      
      

      Many Pods in the openshift-marketplace namespace of the guest cluster are started and terminated in quick succession. It takes several minutes for them to settle down to just 4 running pods.

       ᐅ oc klock pods
      NAME                        READY   STATUS              RESTARTS   AGE
      certified-operators-4bz7x   0/1     Terminating         0          43s
      certified-operators-5pvds   0/1     Terminating         0          36s
      certified-operators-bggc8   0/1     Terminating         0          28s
      certified-operators-c6tkb   0/1     Terminating         0          54s
      certified-operators-gwwq7   0/1     ContainerCreating   0          1s
      certified-operators-hkm4c   0/1     Terminating         0          20s
      certified-operators-qrgh8   0/1     ContainerCreating   0          54s
      certified-operators-tg9qt   0/1     Terminating         0          49s
      certified-operators-vr2bp   0/1     Terminating         0          11s
      community-operators-4wv48   0/1     Terminating         0          41s
      community-operators-5nfb6   0/1     Terminating         0          27s
      community-operators-ck76l   0/1     Terminating         0          47s
      community-operators-h85jp   0/1     Terminating         0          54s
      community-operators-hq7pq   0/1     Terminating         0          35s
      community-operators-lhkr8   0/1     Terminating         0          19s
      community-operators-nk8mq   0/1     ContainerCreating   0          54s
      community-operators-wff2v   0/1     Terminating         0          9s
      redhat-marketplace-294l9    0/1     Terminating         0          46s
      redhat-marketplace-5298j    0/1     Terminating         0          52s
      redhat-marketplace-7thcg    0/1     Error               0          16s
      redhat-marketplace-c2shh    0/1     Terminating         0          24s
      redhat-marketplace-fv5jw    0/1     Terminating         0          40s
      redhat-marketplace-gnrkf    0/1     Running             0          6s
      redhat-marketplace-qwmdx    0/1     ContainerCreating   0          52s
      redhat-marketplace-sv9lj    0/1     Terminating         0          32s
      redhat-operators-4dqgf      0/1     Terminating         0          45s
      redhat-operators-5vtxt      0/1     Terminating         0          38s
      redhat-operators-bkpx6      0/1     Terminating         0          14s
      redhat-operators-h2kx7      0/1     ContainerCreating   0          51s
      redhat-operators-hq8rg      0/1     ContainerCreating   0          5s
      redhat-operators-ptgmf      0/1     Terminating         0          51s
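
To quantify the churn rather than eyeball it, a pod listing like the one above can be tallied per catalog. This is a minimal sketch, assuming whitespace-separated `oc get pods` style output with a header row and pod names of the form `<catalog>-<suffix>`; `summarize_pod_churn` is a hypothetical helper name, and the sample is a short excerpt of the listing.

```python
from collections import Counter


def summarize_pod_churn(table: str) -> dict[str, Counter]:
    """Count pod statuses per catalog from `oc get pods`-style text output."""
    summary: dict[str, Counter] = {}
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore blank or truncated rows
        pod_name, status = fields[0], fields[2]
        # Drop the generated suffix after the last '-' to get the catalog name.
        catalog = pod_name.rsplit("-", 1)[0]
        summary.setdefault(catalog, Counter())[status] += 1
    return summary


excerpt = """
NAME                        READY   STATUS              RESTARTS   AGE
redhat-marketplace-7thcg    0/1     Error               0          16s
redhat-marketplace-gnrkf    0/1     Running             0          6s
redhat-marketplace-sv9lj    0/1     Terminating         0          32s
redhat-operators-4dqgf      0/1     Terminating         0          45s
"""

for catalog, counts in summarize_pod_churn(excerpt).items():
    print(catalog, dict(counts))
```

In a healthy cluster each catalog would be expected to show a single Running pod; several Terminating or ContainerCreating entries per catalog indicate the restart loop described here.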

      catalog-operator Pod log in the management cluster repeats this:

      time="2025-10-31T13:31:20Z" level=info msg="evaluating current pod" catalogsource.name=redhat-marketplace catalogsource.namespace=openshift-marketplace correctHash=true correctImages=true current-pod.name=redhat-marketplace-hgkpc current-pod.namespace=openshift-marketplace id=SFOdj
      
      time="2025-10-31T13:31:20Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" catalogsource.name=redhat-marketplace catalogsource.namespace=openshift-marketplace correctHash=true correctImages=true current-pod.name=redhat-marketplace-hgkpc current-pod.namespace=openshift-marketplace id=SFOdj
      
      time="2025-10-31T13:31:20Z" level=error msg="error ensuring registry server: could not ensure update pod" catalogsource.name=redhat-marketplace catalogsource.namespace=openshift-marketplace error="catalog polling: redhat-marketplace not ready for update: update pod redhat-marketplace-pw7qp has not yet reported ready" id=SFOdj
      
      time="2025-10-31T13:31:20Z" level=error msg="error ensuring registry server: ensure update pod error is not of type UpdateNotReadyErr" catalogsource.name=redhat-marketplace catalogsource.namespace=openshift-marketplace error="catalog polling: redhat-marketplace not ready for update: update pod redhat-marketplace-pw7qp has not yet reported ready" id=SFOdj
      
      time="2025-10-31T13:31:20Z" level=info msg="requeueing registry server for catalog update check: update pod not yet ready" catalogsource.name=redhat-marketplace catalogsource.namespace=openshift-marketplace id=SFOdj
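
The repeating entries are easier to compare once the key=value fields are split out. The following is a minimal sketch, assuming each log entry is on its own line in the key=value format shown above and that quoted values contain no apostrophes or nested quotes (`shlex` would reject those). `parse_logfmt` and `errors_by_catalog` are hypothetical helper names, not part of catalog-operator.

```python
import shlex


def parse_logfmt(line: str) -> dict[str, str]:
    """Split one key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip stray tokens that carry no '='
            fields[key] = value
    return fields


def errors_by_catalog(lines: list[str]) -> dict[str, list[str]]:
    """Group the msg of every level=error entry by catalogsource.name."""
    grouped: dict[str, list[str]] = {}
    for line in lines:
        if not line.strip():
            continue
        fields = parse_logfmt(line)
        if fields.get("level") == "error":
            name = fields.get("catalogsource.name", "<unknown>")
            grouped.setdefault(name, []).append(fields.get("msg", ""))
    return grouped


# One error entry and one info entry, shortened from the log above.
log_lines = [
    'time="2025-10-31T13:31:20Z" level=error msg="error ensuring registry server: '
    'could not ensure update pod" catalogsource.name=redhat-marketplace id=SFOdj',
    'time="2025-10-31T13:31:20Z" level=info msg="evaluating current pod" '
    'catalogsource.name=redhat-marketplace id=SFOdj',
]

for catalog, messages in errors_by_catalog(log_lines).items():
    print(catalog, messages)
```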

      There are no package manifests available in the guest cluster.

      Expected results:

          Catalog Sources READY, package manifests available.

      Additional info:

          I have identified https://github.com/openshift/operator-framework-olm/pull/1129 as the source of the problem.
      
      The nightly OCP build from Oct 24 still works; the nightly from Oct 25 does not.
      
      I also created a custom OCP build based on the Oct 25 nightly, replacing the operator-lifecycle-manager and operator-registry images with custom images built from commit https://github.com/openshift/operator-framework-olm/commit/6e79ccc19197da354249f4753449fad3037b1c9e (a commit from before pull/1129 was merged). That build works fine (Catalog Sources READY).

              Unassigned
              Martin Gencur (mgencur@redhat.com)