
[release-4.15] OLM catalog pods do not recover from node failure


    • Yes
    • Rejected
    • False
      * Previously, default Operator Lifecycle Manager (OLM) catalog pods remained in a terminating state when the node they were running on had an outage. With this release, OLM catalog pods that are backed by a `CatalogSource` correctly recover from planned and unplanned node maintenance. (link:https://issues.redhat.com/browse/OCPBUGS-35305[*OCPBUGS-35305*]).

      Observed behavior: Default OpenShift OLM catalog pods do not survive an outage of the node they are running on. The pods remain in a terminating state, despite the tolerations that should move them away from unresponsive nodes after 5 minutes at the latest.

      Impact: Operators can no longer be installed or updated from catalogs whose pods were running on a node that has gone down.

      Expected behavior: The catalog pods are automatically rescheduled onto the remaining nodes, and their gRPC API endpoints recover as a result.
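
      Illustration (not part of the original report): the "5 minutes" in the observed behavior most likely refers to the `tolerationSeconds: 300` tolerations that Kubernetes adds by default for the `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` taints. The Python sketch below is a simplified model of that timing, not OLM or Kubernetes code. The function name `eviction_deadline` and the dict-shaped pod are assumptions made for illustration, and the matching logic ignores `operator: Equal` values and multiple matching tolerations.

```python
from datetime import datetime, timedelta, timezone

# Taint keys that Kubernetes places on nodes that stop reporting healthy status.
NOT_READY = "node.kubernetes.io/not-ready"
UNREACHABLE = "node.kubernetes.io/unreachable"


def eviction_deadline(pod, taint_key, tainted_at):
    """Estimate when a NoExecute taint should evict the pod (hypothetical helper).

    Returns None when the pod tolerates the taint indefinitely, and
    tainted_at when no toleration matches (eviction is then immediate).
    Simplification: only the first matching toleration is considered.
    """
    for tol in pod.get("spec", {}).get("tolerations", []):
        # A toleration with no key and operator Exists matches every taint.
        matches_key = tol.get("key") == taint_key or (
            "key" not in tol and tol.get("operator") == "Exists"
        )
        # An omitted effect matches all effects, including NoExecute.
        matches_effect = tol.get("effect") in (None, "NoExecute")
        if matches_key and matches_effect:
            seconds = tol.get("tolerationSeconds")
            if seconds is None:
                return None  # tolerated forever, never evicted by this taint
            return tainted_at + timedelta(seconds=seconds)
    return tainted_at


# Example pod carrying the assumed default 300-second toleration.
catalog_pod = {
    "spec": {
        "tolerations": [
            {
                "key": UNREACHABLE,
                "operator": "Exists",
                "effect": "NoExecute",
                "tolerationSeconds": 300,
            }
        ]
    }
}
node_lost_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(eviction_deadline(catalog_pod, UNREACHABLE, node_lost_at))
```

      Under this model the pod is marked for eviction 5 minutes after the node becomes unreachable. The model stops at the eviction time and does not cover what happens to the pod afterwards, which is where the report observes the pods remaining in a terminating state instead of being rescheduled.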

            pegoncal@redhat.com Per Goncalves da Silva
            DanielMesser Daniel Messer
            Jian Zhang Jian Zhang
