OpenShift Bugs / OCPBUGS-43966

High snapshot rate on redhat-operators; OLM operator install hangs. RPC DeadlineExceeded while listing bundles.


    • Incidents & Support
    • Important
    • Customer Escalated
    • In Progress
    • Bug Fix
      Previously, the catalog-operator captured catalog snapshots every 5 minutes. In clusters with many namespaces and subscriptions, and with the larger catalogsources available in 4.15 and 4.16, snapshots would start failing and the failures would cascade across catalogsources (causing CPU load spikes), making it effectively impossible to upgrade or install operators. With this change, the cache lifetime is 30 minutes, which allows plenty of time for attempts to resolve without putting undue load on the catalogsource pods.
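      For illustration only, here is a minimal sketch of the TTL-gated snapshot cache behavior the note above describes, assuming a simple per-catalog cache keyed by catalog name. The names (snapshotTTL, snapshotCache, the list callback) are hypothetical and are not the actual catalog-operator code; the point is that repeated resolutions within the 30-minute window reuse the cached snapshot instead of re-listing bundles from every catalogsource pod.

{code:go}
package main

import (
	"fmt"
	"sync"
	"time"
)

// snapshotTTL illustrates the change described above: a catalog snapshot is
// reused for 30 minutes instead of being retaken every 5 minutes.
const snapshotTTL = 30 * time.Minute

// catalogSnapshot is a hypothetical cached view of a CatalogSource's bundles.
type catalogSnapshot struct {
	bundles []string // placeholder for the resolved bundle data
	expiry  time.Time
}

func (s *catalogSnapshot) expired(now time.Time) bool {
	return s == nil || now.After(s.expiry)
}

// snapshotCache keeps one snapshot per catalog and only re-runs the expensive
// bundle-listing call once the cached copy has aged out.
type snapshotCache struct {
	mu    sync.Mutex
	byKey map[string]*catalogSnapshot
	list  func(catalog string) ([]string, error) // stand-in for the costly gRPC listing
}

func (c *snapshotCache) get(catalog string) ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if snap := c.byKey[catalog]; !snap.expired(time.Now()) {
		return snap.bundles, nil // fresh enough: no extra load on the catalog pod
	}
	bundles, err := c.list(catalog)
	if err != nil {
		return nil, err
	}
	c.byKey[catalog] = &catalogSnapshot{bundles: bundles, expiry: time.Now().Add(snapshotTTL)}
	return bundles, nil
}

func main() {
	cache := &snapshotCache{
		byKey: map[string]*catalogSnapshot{},
		list: func(catalog string) ([]string, error) {
			fmt.Println("listing bundles from", catalog) // would be the expensive RPC
			return []string{"example-operator.v1.0.0"}, nil
		},
	}
	// Repeated resolutions within the 30-minute window hit the cache;
	// the catalog is only listed once.
	for i := 0; i < 3; i++ {
		bundles, _ := cache.get("redhat-operators")
		fmt.Println(bundles)
	}
}
{code}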

      When trying to install an operator, the following is logged:

      "Warning alert: CatalogSource health unknown. This operator cannot be updated. The health of CatalogSource "redhat-operators" is unknown. It may have been disabled or removed from the cluster."

      1. The underlying error in the logs is {{msg="error encountered while listing bundles: rpc error: code = DeadlineExceeded desc = context deadline exceeded" catalog="{redhat-operators openshift-marketplace}"}} (see the sketch after this list).
      2. As discussed, we could not reproduce this locally; we attempted multiple times to simulate the relevant gRPC connection and the exact API call, and those attempts succeeded for us.
      3. The suspected cause is therefore a network issue on the customer's cluster. We need full cooperation from a qualified cluster/network professional on the customer side who knows their exact configuration, and a detailed network dump/analysis of what actually happened at the point in time when OLM got this timeout.
      4. We cannot proceed with the investigation based on the information we currently have.
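      For reference, a minimal, self-contained sketch of how the DeadlineExceeded in item 1 arises: the client bounds the bundle-listing call with a context deadline, and a catalog pod or network path that cannot answer in time surfaces as codes.DeadlineExceeded. The listBundles function and the timeout values below are placeholders, not the actual OLM code.

{code:go}
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// listBundles is a hypothetical stand-in for the bundle-listing RPC that the
// catalog-operator issues against a CatalogSource pod. Here it simply blocks
// until the simulated "server" answers or the caller's deadline expires.
func listBundles(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second): // simulated slow catalog pod / stalled network path
		return nil
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded once the deadline passes
	}
}

func main() {
	// Bound the listing call with a deadline, as a gRPC client would.
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	if err := listBundles(ctx); err != nil {
		// A deadline hit on a real gRPC call is reported as codes.DeadlineExceeded,
		// which is the code quoted in the catalog-operator log line above.
		if errors.Is(err, context.DeadlineExceeded) || status.Code(err) == codes.DeadlineExceeded {
			fmt.Printf("error encountered while listing bundles: %v\n", err)
		}
	}
}
{code}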

              rh-ee-jkeister Jordan Keister
              rhn-support-jshivers Jacob Shivers
              Xia Zhao
              Votes: 3
              Watchers: 29
