OCPBUGS-57430

[release-4.15] high snapshot rate on redhat-operators, OLM operator install hangs. RPC DeadlineExceeded while listing bundles.


    • Quality / Stability / Reliability
    • False
    • None
    • None
    • Important
    • None
    • Rejected
    • Lillipup Sprint 272, Mewtwo Sprint 273, Nidoran Sprint 274, Oddish Sprint 275
    • 4
    • Done
    • Bug Fix
    • Previously, the catalog-operator captured catalog snapshots every 5 minutes. On clusters with many namespaces and subscriptions, and with the larger catalogsources available in 4.15 and 4.16, the snapshots would start failing and the failures would cascade across the catalogsources, causing CPU load spikes and effectively preventing operators from being installed or upgraded.
      With this change, the cache lifetime is 30 minutes, which allows enough time for snapshot attempts to complete without undue load on the catalogsource pods.
    • None
    • None

      This is a clone of issue OCPBUGS-57429. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-57428. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-57427. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-57352. The following is the description of the original issue:

      This is the release-4.19 clone to backport the interval change for refreshing the catalog cache data for catalog-operator from 5 minutes to 30 minutes. 

      --------------

       

      When trying to install an operator, the following warning appears:

      "Warning alert: CatalogSource health unknown. This operator cannot be updated. The health of CatalogSource "redhat-operators" is unknown. It may have been disabled or removed from the cluster."
      

       

      1. The underlying error in the logs is:
        msg="error encountered while listing bundles: rpc error: code = DeadlineExceeded desc = context deadline exceeded" catalog="{redhat-operators openshift-marketplace}
      2. As discussed, we could not reproduce this locally. We attempted multiple times to simulate the same gRPC connection and the exact API call, and every attempt succeeded.
      3. The suspected cause is therefore a network issue on the customer’s cluster. We need full cooperation from a qualified cluster/network professional on the customer side who knows their exact configuration, along with a detailed network dump and analysis of what happened at the moment OLM hit this timeout.
      4. We cannot proceed with the investigation based on the information we currently have.

              rh-ee-jkeister Jordan Keister
              openshift-crt-jira-prow OpenShift Prow Bot
              None
              None
              Xia Zhao Xia Zhao
              None