  1. OpenShift Bugs
  2. OCPBUGS-19559

SNO Operators fail to install because of "failed to populate resolver cache from source" (from disabled catalogsources)


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • None
    • 4.14
    • OLM
    • Important
    • No
    • Rejected
    • False

      Description of problem:

      While installing and applying the du profile to many SNOs at scale using ZTP and ACM, some number of SNOs fail to complete rolling out the du profile because the operators are not installed.
      
      In this test the du profile is applied to these disconnected SNOs and includes a manifest to disableAllDefaultSources:
      
      apiVersion: config.openshift.io/v1
      kind: OperatorHub
      metadata:
          name: cluster
          annotations:
              ran.openshift.io/ztp-deploy-wave: "1"
      spec:
          disableAllDefaultSources: true
      
      and subsequently applies a manifest that adds a new catalogsource pointing to the disconnected registry:
      
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      metadata:
        annotations:
          target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
        name: rh-du-operators
        namespace: openshift-marketplace
      spec:
        displayName: disconnected-redhat-operators
        image: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/olm-mirror/redhat-operator-index:v4.13
        publisher: Red Hat
        sourceType: grpc
        updateStrategy:
          registryPoll:
            interval: 1h
      
      Afterwards, namespaces, subscriptions, and operatorgroups are applied to generate an installplan so that the du profile operators are installed (ptp, sriov, local storage, and cluster logging in this test).
      
      However, in every test run I find a small subset of clusters that seem to have hit some sort of caching or race condition error. Their subscription object displays a message showing that resolution failed on one of the catalogsources that was removed earlier via the disableAllDefaultSources configuration.
      
      Example with the latest 8 clusters that failed:
      # cat ../install-data/cgu.TimedOut | xargs -I % oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get subscriptions -n openshift-local-storage local-storage-operator -o json | jq '.status.conditions[] | select(.type=="ResolutionFailed") | .message' -r
      failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
      failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup certified-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
      failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
      failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
      failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup certified-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
      failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
      failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
      failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-marketplace.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
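
Every message above names one of the default catalog sources, never rh-du-operators. The tally below is a minimal sketch of how such messages could be grouped by source. The helper name `tally_failed_sources` and the regular expression are illustrative assumptions based only on the message format shown in this report; they are not part of OLM or `oc`.

```python
import re
from collections import Counter

# Captures "<name>/<namespace>" from the ResolutionFailed message format
# seen in this report. The pattern is an assumption about that format only.
SOURCE_RE = re.compile(
    r"failed to populate resolver cache from source ([^/\s]+)/([^:\s]+):"
)


def tally_failed_sources(messages):
    """Count how often each catalog source name appears in the messages."""
    counts = Counter()
    for message in messages:
        match = SOURCE_RE.search(message)
        if match:
            counts[match.group(1)] += 1
    return counts


# Source names copied in order from the eight messages in this report;
# the remainder of each message is shortened here.
sample = [
    f"failed to populate resolver cache from source {name}/openshift-marketplace: "
    "failed to list bundles"
    for name in [
        "community-operators", "certified-operators", "community-operators",
        "redhat-operators", "certified-operators", "community-operators",
        "redhat-operators", "redhat-marketplace",
    ]
]

counts = tally_failed_sources(sample)
for name, count in counts.most_common():
    print(f"{count} {name}")
```

On a real run, the input would be the one-message-per-line output of the `jq` command above.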
      
      Example full subscription from the first SNO in the failure list:
      
      # oc --kubeconfig /root/hv-vm/kc/vm00026/kubeconfig get subscription -A
      NAMESPACE                          NAME                                  PACKAGE                  SOURCE            CHANNEL
      openshift-local-storage            local-storage-operator                local-storage-operator   rh-du-operators   stable
      openshift-logging                  cluster-logging                       cluster-logging          rh-du-operators   stable
      openshift-ptp                      ptp-operator-subscription             ptp-operator             rh-du-operators   stable
      openshift-sriov-network-operator   sriov-network-operator-subscription   sriov-network-operator   rh-du-operators   stable
      
      # oc --kubeconfig /root/hv-vm/kc/vm00026/kubeconfig get subscription -n openshift-local-storage local-storage-operator -o yaml
      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        annotations:
          scale-test-label-1: value1
          scale-test-label-2: value2
        creationTimestamp: "2023-09-21T01:13:26Z"
        generation: 1
        labels:
          operators.coreos.com/local-storage-operator.openshift-local-storage: ""
        name: local-storage-operator
        namespace: openshift-local-storage
        resourceVersion: "22382"
        uid: 491d6728-7d4c-4fc8-88eb-fc929e369906
      spec:
        channel: stable
        installPlanApproval: Manual
        name: local-storage-operator
        source: rh-du-operators
        sourceNamespace: openshift-marketplace
      status:
        catalogHealth:
        - catalogSourceRef:
            apiVersion: operators.coreos.com/v1alpha1
            kind: CatalogSource
            name: rh-du-operators
            namespace: openshift-marketplace
            resourceVersion: "21326"
            uid: 6723403b-e245-4dbc-9436-f0dc83e9d192
          healthy: true
          lastUpdated: "2023-09-21T01:13:27Z"
        conditions:
        - lastTransitionTime: "2023-09-21T01:13:27Z"
          message: all available catalogsources are healthy
          reason: AllCatalogSourcesHealthy
          status: "False"
          type: CatalogSourcesUnhealthy
        - message: 'failed to populate resolver cache from source community-operators/openshift-marketplace:
            failed to list bundles: rpc error: code = Unavailable desc = connection error:
            desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc
            on [fd02::a]:53: server misbehaving"'
          reason: ErrorPreventedResolution
          status: "True"
          type: ResolutionFailed
        lastUpdated: "2023-09-21T01:13:44Z"
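
The subscription above reports its own catalog source (rh-du-operators) as healthy while the ResolutionFailed condition names a different source. The sketch below flags that mismatch on a Subscription that has already been parsed into a dictionary, for example from `-o json` output. The function name `unexpected_failed_source` and the regular expression are hypothetical helpers written for this report, and the sample dictionary is a trimmed copy of the vm00026 object.

```python
import re

# Captures "<name>/<namespace>" after "from source" in the condition message.
# This pattern is an assumption based on the message shown in this report.
SOURCE_RE = re.compile(r"from source ([^/\s]+)/([^:\s]+):")


def unexpected_failed_source(subscription):
    """Return the source named in an active ResolutionFailed condition when it
    differs from spec.source; return None otherwise."""
    wanted = subscription["spec"]["source"]
    conditions = subscription.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") != "ResolutionFailed" or cond.get("status") != "True":
            continue
        match = SOURCE_RE.search(cond.get("message", ""))
        if match and match.group(1) != wanted:
            return match.group(1)
    return None


# Trimmed copy of the Subscription from vm00026 shown above.
subscription = {
    "spec": {
        "name": "local-storage-operator",
        "source": "rh-du-operators",
        "sourceNamespace": "openshift-marketplace",
    },
    "status": {
        "conditions": [
            {
                "type": "CatalogSourcesUnhealthy",
                "status": "False",
                "reason": "AllCatalogSourcesHealthy",
                "message": "all available catalogsources are healthy",
            },
            {
                "type": "ResolutionFailed",
                "status": "True",
                "reason": "ErrorPreventedResolution",
                "message": (
                    "failed to populate resolver cache from source "
                    "community-operators/openshift-marketplace: failed to list "
                    "bundles: rpc error: code = Unavailable"
                ),
            },
        ]
    },
}

result = unexpected_failed_source(subscription)
print(result)
```

For the object above, the function returns the disabled default source rather than None, which is the symptom described in this bug.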
      
      
      

       

      Version-Release number of selected component (if applicable):

      Hub OCP is 4.13.12
      Deployed SNOs are 4.13 and 4.14, and previous versions as well; however, the data included in this bug is from a 4.14 CI build.
      ACM - 2.9.0-DOWNSTREAM-2023-09-19-20-56-31

      How reproducible:

      Reproducibility varies from scale run to scale run, but is on the order of 6 to 30 clusters per run, which represents somewhere between 0.2% and 0.8% of all clusters that have had the du profile successfully initialized. However, this failure accounts for 100% of the failures when applying the du profile. (If we fix this, we shouldn't experience any more du profile failures with this test.)
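
The arithmetic behind those figures can be sketched as follows. The report does not state the total number of clusters per run, so the totals computed here are only what the quoted counts and percentages would imply, assuming the low count pairs with the low percentage and the high count with the high one. Both function names are illustrative.

```python
def failure_rate_percent(failed, total):
    """Failure rate as a percentage of all clusters in a run."""
    return 100.0 * failed / total


def implied_total(failed, rate_percent):
    """Cluster count implied by a failure count and a failure rate."""
    return failed / (rate_percent / 100.0)


# Counts and percentages quoted in this report; the pairing is an assumption.
low_total = implied_total(6, 0.2)
high_total = implied_total(30, 0.8)

print(round(low_total), round(high_total))
print(round(failure_rate_percent(6, low_total), 2))
print(round(failure_rate_percent(30, high_total), 2))
```

Under that pairing, both ends of the range imply a run size in the low thousands of clusters.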

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

       

        1. must-gather-vm00026.tar.gz
          23.94 MB
        2. must-gather-vm00167.tar.gz
          23.99 MB
        3. must-gather-vm00736.tar.gz
          26.24 MB
        4. must-gather-vm01141.tar.gz
          22.36 MB
        5. must-gather-vm01811.tar.gz
          27.20 MB
        6. must-gather-vm02670.tar.gz
          28.39 MB
        7. must-gather-vm02968.tar.gz
          25.13 MB
        8. must-gather-vm03100.tar.gz
          27.55 MB

            agreene1991 Alexander Greene
            akrzos@redhat.com Alex Krzos
            bruno andrade bruno andrade