-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
4.14
-
Quality / Stability / Reliability
-
False
-
-
None
-
Important
-
No
-
None
-
None
-
Rejected
Description of problem:
While installing and applying the du profile to many SNOs at scale using ZTP and ACM, some number of SNOs fail to complete rolling out the du profile because the operators are not installed.
In this test, the du profile applied to these disconnected SNOs includes a manifest that sets disableAllDefaultSources:
apiVersion: config.openshift.io/v1
kind: OperatorHub
metadata:
  name: cluster
  annotations:
    ran.openshift.io/ztp-deploy-wave: "1"
spec:
  disableAllDefaultSources: true
and subsequently applies a manifest that adds a new catalogsource pointing to the disconnected registry:
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
  name: rh-du-operators
  namespace: openshift-marketplace
spec:
  displayName: disconnected-redhat-operators
  image: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/olm-mirror/redhat-operator-index:v4.13
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 1h
Afterwards, namespaces, subscriptions, and operatorgroups are applied to generate an installplan for the du profile operators (ptp, sriov, local storage, and cluster logging in this test).
However, in every test run a small subset of clusters appears to hit a caching or race condition error. On these clusters the subscription object shows a message that it failed to resolve one of the default catalogsources, even though those sources were removed earlier via the disableAllDefaultSources configuration.
Example with the latest 8 clusters that failed:
# cat ../install-data/cgu.TimedOut | xargs -I % oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get subscriptions -n openshift-local-storage local-storage-operator -o json | jq '.status.conditions[] | select(.type=="ResolutionFailed") | .message' -r
failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup certified-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup certified-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-marketplace.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
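To see which removed default sources the resolver is still trying to reach, the messages above can be tallied by source name. The following is a minimal sketch, not part of the original test tooling: the function name `tally_failed_sources` and the regular expression are illustrative assumptions based only on the message format shown above.

```python
import re
from collections import Counter

# Assumed message shape (taken from the output above):
#   "failed to populate resolver cache from source <name>/<namespace>: ..."
SOURCE_RE = re.compile(
    r"failed to populate resolver cache from source ([^/\s]+)/([^:\s]+):"
)


def tally_failed_sources(messages):
    """Count ResolutionFailed messages per catalog source name.

    `messages` is any iterable of message strings, for example the lines
    printed by the jq pipeline above. Lines that do not match the assumed
    format are ignored.
    """
    counts = Counter()
    for message in messages:
        match = SOURCE_RE.search(message)
        if match:
            counts[match.group(1)] += 1
    return counts
```

The output of the jq command could be passed in line by line (for example by iterating over `sys.stdin`); every source name this reports other than rh-du-operators is one of the default sources that disableAllDefaultSources should have removed.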
Example full subscription from the first SNO in the failure list:
# oc --kubeconfig /root/hv-vm/kc/vm00026/kubeconfig get subscription -A
NAMESPACE                          NAME                                  PACKAGE                  SOURCE            CHANNEL
openshift-local-storage            local-storage-operator                local-storage-operator   rh-du-operators   stable
openshift-logging                  cluster-logging                       cluster-logging          rh-du-operators   stable
openshift-ptp                      ptp-operator-subscription             ptp-operator             rh-du-operators   stable
openshift-sriov-network-operator   sriov-network-operator-subscription   sriov-network-operator   rh-du-operators   stable
# oc --kubeconfig /root/hv-vm/kc/vm00026/kubeconfig get subscription -n openshift-local-storage local-storage-operator -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  annotations:
    scale-test-label-1: value1
    scale-test-label-2: value2
  creationTimestamp: "2023-09-21T01:13:26Z"
  generation: 1
  labels:
    operators.coreos.com/local-storage-operator.openshift-local-storage: ""
  name: local-storage-operator
  namespace: openshift-local-storage
  resourceVersion: "22382"
  uid: 491d6728-7d4c-4fc8-88eb-fc929e369906
spec:
  channel: stable
  installPlanApproval: Manual
  name: local-storage-operator
  source: rh-du-operators
  sourceNamespace: openshift-marketplace
status:
  catalogHealth:
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: rh-du-operators
      namespace: openshift-marketplace
      resourceVersion: "21326"
      uid: 6723403b-e245-4dbc-9436-f0dc83e9d192
    healthy: true
    lastUpdated: "2023-09-21T01:13:27Z"
  conditions:
  - lastTransitionTime: "2023-09-21T01:13:27Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'failed to populate resolver cache from source community-operators/openshift-marketplace:
      failed to list bundles: rpc error: code = Unavailable desc = connection error:
      desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc
      on [fd02::a]:53: server misbehaving"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2023-09-21T01:13:44Z"
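The subscription above shows the signature of this failure: catalogHealth lists only rh-du-operators as healthy, yet the ResolutionFailed condition names a different source. A minimal sketch of a check for that mismatch follows, assuming the subscription has been loaded into a Python dict from `oc get subscription ... -o json`. The helper name `stale_source_failures` is hypothetical and not part of any OLM or oc tooling.

```python
def stale_source_failures(subscription):
    """Return ResolutionFailed messages that blame a source other than the
    one the subscription is configured to use.

    `subscription` is the parsed JSON of a Subscription object. The check
    only compares strings: it looks for "<spec.source>/<spec.sourceNamespace>"
    inside each active ResolutionFailed message.
    """
    spec = subscription.get("spec", {})
    own_source = f'{spec.get("source")}/{spec.get("sourceNamespace")}'

    stale = []
    for condition in subscription.get("status", {}).get("conditions", []):
        if condition.get("type") != "ResolutionFailed":
            continue
        if condition.get("status") != "True":
            continue
        message = condition.get("message", "")
        # A message that never mentions the configured source points at
        # some other catalog source, such as a disabled default one.
        if own_source not in message:
            stale.append(message)
    return stale
```

A non-empty result for a cluster whose only catalogsource is rh-du-operators would match the behavior reported here; an empty result means any resolution failure involves the configured source itself.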
Version-Release number of selected component (if applicable):
Hub OCP is 4.13.12. Deployed SNOs are 4.13/4.14 (and previous versions as well); however, the data included in this bug is from a 4.14 CI build. ACM - 2.9.0-DOWNSTREAM-2023-09-19-20-56-31
How reproducible:
Reproducibility varies from scale run to scale run, but is on the order of 6 to 30 clusters per run, which represents between 0.2% and 0.8% of all clusters that have been successfully du profile initialized. However, this failure accounts for 100% of the failures when applying the du profile. (If we fix this, we shouldn't experience any more du profile failures with this test.)
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
- is duplicated by: OCPBUGS-8659 The Catalog Operator attempts to connect to deleted catalogSources (Closed)