OpenShift Bugs / OCPBUGS-43966

High snapshot rate on redhat-operators; OLM Operator install hangs. RPC DeadlineExceeded while listing bundles.


    • Incidents & Support
    • Important
    • Customer Escalated
    • Done
    • Bug Fix
      * Before this update, the Catalog Operator scheduled catalog snapshots every 5 minutes. On clusters with many namespaces and subscriptions, snapshots would fail and the failures would cascade across catalog sources. As a result, the spikes in CPU load effectively blocked installing and updating Operators. With this update, catalog snapshots are scheduled every 30 minutes to allow enough time for the snapshots to resolve. (link:https://issues.redhat.com/browse/OCPBUGS-43966[OCPBUGS-43966])

      When trying to install an Operator, the following warning alert is shown:

      "Warning alert:CatalogSource health unknown This operator cannot be updated. The health of CatalogSource "redhat-operators" is unknown. It may have been disabled or removed from the cluster.CatalogSource CSView CatalogSource

       

      1. The underlying error in the logs is {{msg="error encountered while listing bundles: rpc error: code = DeadlineExceeded desc = context deadline exceeded" catalog="{redhat-operators openshift-marketplace} }}
      2. As discussed, we could not reproduce this locally; we attempted multiple times to simulate the equivalent gRPC connection and the exact API call, and those attempts succeeded for us (a sketch of that call follows this list).
      3. The suspected cause is therefore a network issue on the customer’s cluster. We need full cooperation from a qualified cluster/network professional on the customer side who knows their exact configuration, plus a detailed network dump/analysis of what actually happened at the point in time when OLM hit this timeout.
      4. We cannot proceed with the investigation based on the information currently available.
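
      For reference, the kind of simulation mentioned in point 2 can be sketched roughly as below. This is only a minimal illustration, assuming the catalog pod's gRPC port has been forwarded locally (for example with {{oc port-forward -n openshift-marketplace <redhat-operators-pod> 50051:50051}}) and using the operator-registry client package; the one-minute deadline is an illustrative value, not necessarily the exact timeout OLM applies.

{code:go}
// Sketch: list bundles from a catalog source gRPC endpoint, the same call that
// returned DeadlineExceeded in the catalog operator logs.
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	"github.com/operator-framework/operator-registry/pkg/api"
)

func main() {
	// Plaintext connection to the forwarded catalog source port (assumption: 50051).
	conn, err := grpc.Dial("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to dial catalog source: %v", err)
	}
	defer conn.Close()

	client := api.NewRegistryClient(conn)

	// Deadline on the whole listing; a slow or unreachable catalog pod surfaces
	// here as "rpc error: code = DeadlineExceeded".
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	stream, err := client.ListBundles(ctx, &api.ListBundlesRequest{})
	if err != nil {
		log.Fatalf("error encountered while listing bundles: %v", err)
	}

	count := 0
	for {
		bundle, err := stream.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatalf("error encountered while listing bundles: %v", err)
		}
		count++
		_ = bundle // fields such as CsvName / PackageName could be inspected here
	}
	fmt.Printf("listed %d bundles\n", count)
}
{code}

      If an equivalent call also times out from inside the cluster network, that would support the network-issue hypothesis; if it succeeds there, load or scheduling on the catalog pod itself becomes the more likely factor.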

              rh-ee-jkeister Jordan Keister
              rhn-support-jshivers Jacob Shivers
              Xia Zhao
              Votes: 3
              Watchers: 30
