OCPBUGS-32183

OLM catalog pods do not recover from node failure


    • Bug
    • Resolution: Done-Errata
    • Major
    • 4.16.0
    • 4.14, 4.15
    • OLM
    • Yes
    • Rasputin OLM Sprint 252
    • 1
    • Rejected
    • False
      * Previously, default {olm} catalog pods backed by a `CatalogSource` object did not survive an outage of the node that they were running on. The pods remained in a terminating state, despite the tolerations that should have moved them away from the unresponsive node. As a result, Operators could no longer be installed or updated from the related catalogs. This bug fix updates {olm} so that catalog pods that get stuck in this state are deleted. Catalog pods now correctly recover from planned or unplanned node maintenance. (link:https://issues.redhat.com/browse/OCPBUGS-32183[*OCPBUGS-32183*])
    • Bug Fix
    • Done

      Observed behavior: Default OpenShift OLM catalog pods do not survive an outage of the node that they are running on. The pods remain in a terminating state, despite the tolerations that should move them away from unresponsive nodes after 5 minutes at the latest.

      Impact: Operators can no longer be installed or updated from catalogs whose pods were running on a node that has gone down.

      Expected behavior: The catalog pods are automatically rescheduled onto the remaining nodes, and their gRPC API endpoints recover as a result.
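
The release note says the fix deletes catalog pods that get stuck terminating after their node goes down. The following is a minimal sketch of that idea only, not OLM's actual implementation. `CatalogPod`, its fields, `is_stuck_terminating`, and `pods_to_delete` are hypothetical names introduced for illustration, and the model assumes that a pod counts as stuck once its deletion deadline has passed while its node is still unreachable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class CatalogPod:
    """Hypothetical, simplified view of a catalog pod (not a Kubernetes API type)."""
    name: str
    # Time by which the pod was expected to be gone; None if it is not terminating.
    deletion_deadline: Optional[datetime]
    # Whether the node hosting the pod is currently reachable (assumed input).
    node_reachable: bool


def is_stuck_terminating(pod: CatalogPod, now: datetime) -> bool:
    """Return True if the pod is terminating, overdue, and on an unreachable node."""
    if pod.deletion_deadline is None:
        return False  # not terminating at all
    if pod.node_reachable:
        return False  # the node can still complete the termination itself
    return now > pod.deletion_deadline


def pods_to_delete(pods: List[CatalogPod], now: datetime) -> List[str]:
    """Names of pods a controller would delete so replacements can be scheduled."""
    return [p.name for p in pods if is_stuck_terminating(p, now)]


now = datetime(2024, 4, 1, 12, 0, tzinfo=timezone.utc)
pods = [
    CatalogPod("catalog-a", now - timedelta(minutes=10), node_reachable=False),
    CatalogPod("catalog-b", None, node_reachable=True),
    CatalogPod("catalog-c", now + timedelta(seconds=30), node_reachable=False),
]
print(pods_to_delete(pods, now))
```

In this sketch only `catalog-a` qualifies: it is past its deadline and its node is unreachable. `catalog-c` is still inside its grace window, so a controller following this logic would leave it alone for now.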

              jlanford@redhat.com Joe Lanford
              DanielMesser Daniel Messer
              Jian Zhang Jian Zhang
