
[release-4.15] OLM catalog pods do not recover from node failure

      * Previously, default Operator Lifecycle Manager (OLM) catalog pods remained in a terminating state when the node they were running on had an outage. With this release, the OLM catalog pods that are backed by a `CatalogSource` correctly recover from planned and unplanned node maintenance. (link:https://issues.redhat.com/browse/OCPBUGS-35305[*OCPBUGS-35305*])

      Observed behavior: Default OpenShift OLM catalog pods do not survive an outage of the node they are running on. The pods remain in a terminating state, despite the tolerations that should move them away from unresponsive nodes after 5 minutes at the latest.
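The observed state can be checked from a pod's JSON (for example, the output of `oc get pod -o json`). The following is a minimal sketch, not the OLM fix itself. It assumes the catalog pods carry the standard Kubernetes tolerations for the `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` taints with `tolerationSeconds: 300`, which matches the 5 minutes mentioned above. The function names, the sample pod, and the one-minute slack are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Taints that Kubernetes places on a node that stops responding.
# Assumption: the catalog pods tolerate these for a bounded time (300 s by default).
NODE_FAILURE_TAINTS = ("node.kubernetes.io/not-ready", "node.kubernetes.io/unreachable")


def parse_k8s_time(value: str) -> datetime:
    """Parse a UTC timestamp such as 2024-06-10T12:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def eviction_delay_seconds(pod: dict):
    """Return the shortest tolerationSeconds set for the node-failure taints,
    or None if no such toleration sets tolerationSeconds."""
    delays = [
        t["tolerationSeconds"]
        for t in pod.get("spec", {}).get("tolerations", [])
        if t.get("key") in NODE_FAILURE_TAINTS and "tolerationSeconds" in t
    ]
    return min(delays) if delays else None


def is_stuck_terminating(pod: dict, now: datetime,
                         slack: timedelta = timedelta(minutes=1)) -> bool:
    """A pod counts as stuck when it is marked for deletion (deletionTimestamp set)
    and still exists more than `slack` after that timestamp."""
    ts = pod.get("metadata", {}).get("deletionTimestamp")
    if ts is None:
        return False
    return now - parse_k8s_time(ts) > slack


# Hypothetical pod that was marked for deletion but never went away.
sample_pod = {
    "metadata": {"name": "example-catalog-abcde",
                 "deletionTimestamp": "2024-06-10T12:05:00Z"},
    "spec": {"tolerations": [
        {"key": "node.kubernetes.io/not-ready", "effect": "NoExecute", "tolerationSeconds": 300},
        {"key": "node.kubernetes.io/unreachable", "effect": "NoExecute", "tolerationSeconds": 300},
    ]},
}

if __name__ == "__main__":
    now = datetime(2024, 6, 10, 12, 10, tzinfo=timezone.utc)
    print(eviction_delay_seconds(sample_pod), is_stuck_terminating(sample_pod, now))
```

A pod that reports a bounded toleration yet stays marked for deletion well past that window matches the symptom in this report: the eviction was requested, but the unreachable node never confirmed that the pod stopped.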

      Impact: Operators can no longer be installed or updated from catalogs whose pods were running on a node that has gone down.

      Expected behavior: The catalog pods are automatically rescheduled onto the remaining nodes and their gRPC API endpoint recovers as a result.
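One way to verify the expected recovery is to inspect each `CatalogSource` object after the node outage. The sketch below assumes that the `CatalogSource` status exposes the gRPC connection state under `status.connectionState.lastObservedState` and that a serving catalog reports `READY`. Treat that field path and value as assumptions to confirm against your cluster. The function names and the sample objects are hypothetical.

```python
def catalog_is_serving(catalog_source: dict) -> bool:
    """Return True when the catalog's last observed gRPC connection state is READY.

    Assumption: the state is found at status.connectionState.lastObservedState.
    A missing status is treated as not serving.
    """
    state = (
        catalog_source.get("status", {})
        .get("connectionState", {})
        .get("lastObservedState")
    )
    return state == "READY"


def catalogs_needing_attention(catalog_sources: list) -> list:
    """Return the sorted names of catalogs whose endpoint has not recovered."""
    return sorted(
        cs["metadata"]["name"]
        for cs in catalog_sources
        if not catalog_is_serving(cs)
    )


# Hypothetical objects, shaped like items from `oc get catalogsource -o json`.
sample_catalogs = [
    {"metadata": {"name": "example-catalog-a"},
     "status": {"connectionState": {"lastObservedState": "READY"}}},
    {"metadata": {"name": "example-catalog-b"},
     "status": {"connectionState": {"lastObservedState": "TRANSIENT_FAILURE"}}},
    {"metadata": {"name": "example-catalog-c"}},
]

if __name__ == "__main__":
    print(catalogs_needing_attention(sample_catalogs))
```

With the fix in place, the list should become empty once the replacement pods are running on the remaining nodes. A catalog that stays in the list points to a pod that was not rescheduled.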

              pegoncal@redhat.com Per Goncalves da Silva
              DanielMesser Daniel Messer
              Jian Zhang Jian Zhang
