-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
4.14
-
Important
-
No
-
Rejected
-
False
-
-
Description of problem:
While installing and applying the du profile to many SNOs at scale using ZTP and ACM, some number of SNOs fail to complete rolling out the du profile because the operators are not installed. In this test, the du profile is applied to these disconnected SNOs and includes a manifest that sets disableAllDefaultSources:

    apiVersion: config.openshift.io/v1
    kind: OperatorHub
    metadata:
      name: cluster
      annotations:
        ran.openshift.io/ztp-deploy-wave: "1"
    spec:
      disableAllDefaultSources: true

It subsequently applies a manifest that adds a new catalogsource pointing to the disconnected registry:

    apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      annotations:
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
      name: rh-du-operators
      namespace: openshift-marketplace
    spec:
      displayName: disconnected-redhat-operators
      image: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/olm-mirror/redhat-operator-index:v4.13
      publisher: Red Hat
      sourceType: grpc
      updateStrategy:
        registryPoll:
          interval: 1h

Afterwards, namespaces, subscriptions, and operatorgroups are applied to generate an installplan for the du profile operators to be installed (ptp, sriov, local storage, and cluster logging in this test).

However, in every test run I find a small subset of clusters that seem to have hit some sort of caching or race condition error: the subscription object shows a message saying it failed to resolve one of the catalogsources that was removed earlier via the disableAllDefaultSources configuration.
Example with the latest 8 clusters that failed:

    # cat ../install-data/cgu.TimedOut | xargs -I % oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get subscriptions -n openshift-local-storage local-storage-operator -o json | jq '.status.conditions[] | select(.type=="ResolutionFailed") | .message' -r
    failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
    failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup certified-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
    failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
    failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
    failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup certified-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
    failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
    failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
    failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-marketplace.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"

Example full subscription from the first SNO in the failure list:

    # oc --kubeconfig /root/hv-vm/kc/vm00026/kubeconfig get subscription -A
    NAMESPACE                          NAME                                  PACKAGE                  SOURCE            CHANNEL
    openshift-local-storage            local-storage-operator                local-storage-operator   rh-du-operators   stable
    openshift-logging                  cluster-logging                       cluster-logging          rh-du-operators   stable
    openshift-ptp                      ptp-operator-subscription             ptp-operator             rh-du-operators   stable
    openshift-sriov-network-operator   sriov-network-operator-subscription   sriov-network-operator   rh-du-operators   stable

    # oc --kubeconfig /root/hv-vm/kc/vm00026/kubeconfig get subscription -n openshift-local-storage local-storage-operator -o yaml
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      annotations:
        scale-test-label-1: value1
        scale-test-label-2: value2
      creationTimestamp: "2023-09-21T01:13:26Z"
      generation: 1
      labels:
        operators.coreos.com/local-storage-operator.openshift-local-storage: ""
      name: local-storage-operator
      namespace: openshift-local-storage
      resourceVersion: "22382"
      uid: 491d6728-7d4c-4fc8-88eb-fc929e369906
    spec:
      channel: stable
      installPlanApproval: Manual
      name: local-storage-operator
      source: rh-du-operators
      sourceNamespace: openshift-marketplace
    status:
      catalogHealth:
      - catalogSourceRef:
          apiVersion: operators.coreos.com/v1alpha1
          kind: CatalogSource
          name: rh-du-operators
          namespace: openshift-marketplace
          resourceVersion: "21326"
          uid: 6723403b-e245-4dbc-9436-f0dc83e9d192
        healthy: true
        lastUpdated: "2023-09-21T01:13:27Z"
      conditions:
      - lastTransitionTime: "2023-09-21T01:13:27Z"
        message: all available catalogsources are healthy
        reason: AllCatalogSourcesHealthy
        status: "False"
        type: CatalogSourcesUnhealthy
      - message: 'failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"'
        reason: ErrorPreventedResolution
        status: "True"
        type: ResolutionFailed
      lastUpdated: "2023-09-21T01:13:44Z"
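To triage many failed clusters at once, it may help to tally which catalog source each ResolutionFailed message names and to flag the ones that are default sources. The following is a minimal sketch, not part of the original test tooling: the function names are hypothetical, the input is assumed to be the parsed JSON of a Subscription (as returned by `oc get subscription ... -o json`), and the regular expression assumes the message wording seen in this report.

```python
import re
from collections import Counter

# Default catalog sources that disableAllDefaultSources is expected to remove.
DEFAULT_SOURCES = {
    "redhat-operators",
    "certified-operators",
    "community-operators",
    "redhat-marketplace",
}

# Matches "from source <name>/<namespace>:" as worded in the messages above
# (assumption: the wording is the same across the versions under test).
SOURCE_RE = re.compile(r"from source (?P<name>[^/\s]+)/(?P<namespace>[^:\s]+):")


def resolution_failures(subscription):
    """Return the messages of all ResolutionFailed conditions.

    Mirrors the jq filter used in this report: it selects on the
    condition type only.
    """
    conditions = subscription.get("status", {}).get("conditions", [])
    return [
        c.get("message", "")
        for c in conditions
        if c.get("type") == "ResolutionFailed"
    ]


def failing_sources(messages):
    """Count the catalog sources named in the given failure messages."""
    counts = Counter()
    for message in messages:
        match = SOURCE_RE.search(message)
        if match:
            counts[match.group("name")] += 1
    return counts


def stale_default_sources(counts):
    """Return the failing sources that are default sources, sorted by name."""
    return sorted(name for name in counts if name in DEFAULT_SOURCES)
```

Feeding the eight messages from the failed clusters through `failing_sources` shows that every failure names a default source and none names rh-du-operators, which is consistent with the resolver still holding references to the removed sources.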
Version-Release number of selected component (if applicable):
Hub OCP is 4.13.12. Deployed SNOs are 4.13/4.14, and previous versions as well; however, the data included in this bug is from a 4.14 CI build.
ACM: 2.9.0-DOWNSTREAM-2023-09-19-20-56-31
How reproducible:
Reproducibility varies from scale run to scale run, but is on the order of 6 to 30 clusters per run, which represents somewhere between 0.2% and 0.8% of all clusters that have been successfully du profile initialized. However, this failure represents 100% of the failures when applying the du profile. (If we fix this, we shouldn't experience any more du profile failures with this test.)
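As a rough sanity check on these figures, the failure percentage and the cluster count it implies can be computed as below. This is only an illustrative sketch: the report does not state the total number of clusters per run, so pairing 6 failures with 0.2% and 30 failures with 0.8% is an assumption, and the function names are hypothetical.

```python
def failure_rate_percent(failed, initialized):
    """Percentage of du-profile-initialized clusters that failed the rollout."""
    if initialized <= 0:
        raise ValueError("initialized must be positive")
    return 100.0 * failed / initialized


def implied_total(failed, rate_percent):
    """Cluster count implied by a failure count and a failure percentage."""
    if rate_percent <= 0:
        raise ValueError("rate_percent must be positive")
    return round(failed * 100 / rate_percent)


# Assumed pairing of the reported bounds (not stated in the report).
low_total = implied_total(6, 0.2)
high_total = implied_total(30, 0.8)
```

Under that assumed pairing, the reported bounds correspond to runs of a few thousand clusters each.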
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
- is duplicated by: OCPBUGS-8659 The Catalog Operator attempts to connect to deleted catalogSources (Closed)