Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: 4.14
Component/s: OLM
Labels:
- perfscale-telco-5g
- telco-5g

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
Important
Regression:
No

Target Backport Versions:
None
Target Version:
None
Release Blocker:
Rejected
Sprint:
None

RH Private Keywords:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

While installing many SNOs at scale with ZTP and ACM and applying the du profile to them, a small number of SNOs fail to finish rolling out the du profile because the operators are never installed.

In this test the du profile is applied to these disconnected SNOs and includes a manifest to disableAllDefaultSources:

apiVersion: config.openshift.io/v1
kind: OperatorHub
metadata:
    name: cluster
    annotations:
        ran.openshift.io/ztp-deploy-wave: "1"
spec:
    disableAllDefaultSources: true

and subsequently applies a manifest that adds a new catalogsource pointing at the disconnected registry:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
  name: rh-du-operators
  namespace: openshift-marketplace
spec:
  displayName: disconnected-redhat-operators
  image: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/olm-mirror/redhat-operator-index:v4.13
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 1h

Afterwards, namespaces, subscriptions, and operatorgroups are applied to generate an installplan so that the du profile operators can be installed (ptp, sriov, local storage, and cluster logging in this test).

However, in every test run a small subset of clusters appears to hit some sort of caching or race condition: the subscription object on those clusters reports that it failed to resolve one of the default catalogsources that was removed earlier via the disableAllDefaultSources configuration.

Example with the latest 8 clusters that failed:
# cat ../install-data/cgu.TimedOut | xargs -I % oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get subscriptions -n openshift-local-storage local-storage-operator -o json | jq '.status.conditions[] | select(.type=="ResolutionFailed") | .message' -r
failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup certified-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup certified-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source redhat-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-operators.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup redhat-marketplace.openshift-marketplace.svc on [fd02::a]:53: server misbehaving"
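
To see which of the removed default sources appear across the failed clusters, the messages above can be tallied by source name. This is only a minimal sketch, assuming the messages arrive one per line on stdin (for example, piped from the command above); the function name `tally_failed_sources` is hypothetical and not part of any tool used in this test.

```shell
#!/bin/sh
# Hypothetical helper: count ResolutionFailed messages per catalog source.
# Reads messages on stdin, one per line, in the form
#   "failed to populate resolver cache from source <name>/<namespace>: ..."
# and prints "<name>: <count>" for each source, sorted by name.
tally_failed_sources() {
    sed -n 's|^failed to populate resolver cache from source \([^/]*\)/.*|\1|p' \
        | sort | uniq -c | awk '{print $2 ": " $1}'
}
```

Appending `| tally_failed_sources` to the command above would, for the eight messages shown, report community-operators: 3, certified-operators: 2, redhat-operators: 2, and redhat-marketplace: 1. Every one of them is a default source that disableAllDefaultSources should have removed.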

Example full subscription from the first SNO in the failure list:

# oc --kubeconfig /root/hv-vm/kc/vm00026/kubeconfig get subscription -A
NAMESPACE                          NAME                                  PACKAGE                  SOURCE            CHANNEL
openshift-local-storage            local-storage-operator                local-storage-operator   rh-du-operators   stable
openshift-logging                  cluster-logging                       cluster-logging          rh-du-operators   stable
openshift-ptp                      ptp-operator-subscription             ptp-operator             rh-du-operators   stable
openshift-sriov-network-operator   sriov-network-operator-subscription   sriov-network-operator   rh-du-operators   stable

# oc --kubeconfig /root/hv-vm/kc/vm00026/kubeconfig get subscription -n openshift-local-storage local-storage-operator -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  annotations:
    scale-test-label-1: value1
    scale-test-label-2: value2
  creationTimestamp: "2023-09-21T01:13:26Z"
  generation: 1
  labels:
    operators.coreos.com/local-storage-operator.openshift-local-storage: ""
  name: local-storage-operator
  namespace: openshift-local-storage
  resourceVersion: "22382"
  uid: 491d6728-7d4c-4fc8-88eb-fc929e369906
spec:
  channel: stable
  installPlanApproval: Manual
  name: local-storage-operator
  source: rh-du-operators
  sourceNamespace: openshift-marketplace
status:
  catalogHealth:
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: rh-du-operators
      namespace: openshift-marketplace
      resourceVersion: "21326"
      uid: 6723403b-e245-4dbc-9436-f0dc83e9d192
    healthy: true
    lastUpdated: "2023-09-21T01:13:27Z"
  conditions:
  - lastTransitionTime: "2023-09-21T01:13:27Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'failed to populate resolver cache from source community-operators/openshift-marketplace:
      failed to list bundles: rpc error: code = Unavailable desc = connection error:
      desc = "transport: Error while dialing dial tcp: lookup community-operators.openshift-marketplace.svc
      on [fd02::a]:53: server misbehaving"'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed
  lastUpdated: "2023-09-21T01:13:44Z"
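
The subscription above references only rh-du-operators, yet its ResolutionFailed condition names community-operators. A check along the following lines could flag that mismatch. It is a minimal sketch, assuming spec.source and the condition message have already been extracted as single-line strings (for example with oc and jq, as in the earlier command); the function name `stale_source` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether a ResolutionFailed message refers to a
# catalog source other than the one the subscription is configured to use.
# Usage: stale_source SPEC_SOURCE MESSAGE
# Prints the unexpected source name and returns 0 on a mismatch;
# prints nothing and returns 1 otherwise.
stale_source() {
    spec_source=$1
    # Pull the source name out of "... from source <name>/<namespace>: ..."
    failed=$(printf '%s\n' "$2" \
        | sed -n 's|^failed to populate resolver cache from source \([^/]*\)/.*|\1|p')
    if [ -n "$failed" ] && [ "$failed" != "$spec_source" ]; then
        printf '%s\n' "$failed"
        return 0
    fi
    return 1
}
```

For the subscription shown, calling `stale_source rh-du-operators "$message"` with the ResolutionFailed message would print community-operators, which is the symptom described in this bug.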

Version-Release number of selected component (if applicable):

Hub OCP is 4.13.12
Deployed SNOs are 4.13/4.14 (and previous versions as well); however, the data included in this bug is from a 4.14 CI build.
ACM - 2.9.0-DOWNSTREAM-2023-09-19-20-56-31

How reproducible:

How often this reproduces varies from scale run to scale run, but it is on the order of 6 to 30 clusters per run, which represents somewhere between 0.2% and 0.8% of all clusters that were successfully initialized with the du profile. However, this failure does represent 100% of the failures for applying the du profile. (If we fix this, we shouldn't experience any more du profile failures with this test.)
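
The percentages above are simply the failed clusters divided by the total number of clusters in the run. A minimal sketch of that arithmetic follows; the function name `failure_pct` is hypothetical, and the total used in the usage note is an assumed round number, not a figure from this test.

```shell
#!/bin/sh
# Hypothetical helper: percentage of clusters that failed in a run.
# Usage: failure_pct FAILED TOTAL
# Prints the percentage with two decimals; fails if TOTAL is not positive.
failure_pct() {
    awk -v failed="$1" -v total="$2" 'BEGIN {
        if (total <= 0) exit 1
        printf "%.2f\n", failed * 100 / total
    }'
}
```

For example, `failure_pct 6 3000` prints 0.20, so 6 failures in an assumed run of 3000 clusters corresponds to the low end of the range quoted above.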

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

must-gather-vm01141.tar.gz   2023/09/21 2:40 PM   22.36 MB   Alex Krzos
must-gather-vm00026.tar.gz   2023/09/21 2:40 PM   23.94 MB   Alex Krzos
must-gather-vm00167.tar.gz   2023/09/21 2:40 PM   23.99 MB   Alex Krzos
must-gather-vm00736.tar.gz   2023/09/21 2:40 PM   26.24 MB   Alex Krzos
must-gather-vm01811.tar.gz   2023/09/21 2:42 PM   27.20 MB   Alex Krzos
must-gather-vm02968.tar.gz   2023/09/21 2:46 PM   25.13 MB   Alex Krzos
must-gather-vm02670.tar.gz   2023/09/21 2:46 PM   28.39 MB   Alex Krzos
must-gather-vm03100.tar.gz   2023/09/21 2:48 PM   27.55 MB   Alex Krzos

is duplicated by

OCPBUGS-8659 The Catalog Operator attempts to connect to deleted catalogSources

Closed
