OpenShift Bugs / OCPBUGS-1998

Cluster monitoring fails to achieve new level during upgrade w/ unavailable node


    • Bug
    • Resolution: Done
    • Normal
    • None
    • 4.12
    • Monitoring
    • None
    • Moderate
    • None
    • MON Sprint 225, MON Sprint 226, MON Sprint 227, MON Sprint 228, MON Sprint 229
    • 5
    • False

      Description of problem:

      During the upgrade of build02, a worker node was unavailable. As a result, one of the monitoring operator's daemonsets failed to fully roll out: one of its pods never started running because the node was not available. This meant the monitoring operator never achieved the new level, thereby blocking the upgrade.
      
      see:
      https://coreos.slack.com/archives/C03G7REB4JV/p1663698229312909?thread_ts=1663676443.155839&cid=C03G7REB4JV
      
      and the full upgrade post mortem:
      https://docs.google.com/document/d/1N5ulciLzGHq09ouEWObGXz7iDmPmhdM6walZur1ZRbs/edit#
      
      

       

      Version-Release number of selected component (if applicable):

      4.12 EC to EC upgrade

      How reproducible:

      Always

      Steps to Reproduce:

      1. Create a cluster with an unavailable node (shut down the node in the cloud provider). At least for now, the Machine API reports the node as unavailable but does not remove or restart it; this is being addressed.
      2. Upgrade the cluster
      3. See that the upgrade gets stuck on the monitoring operator
      

      Actual results:

      The upgrade gets stuck until the unavailable node is deleted or fixed.

      Expected results:

      The upgrade completes.

      Additional info:

      Miciah Masters had some suggestions on how the operator can better determine whether it has achieved the new level in this sort of situation. The DNS operator appears to handle this properly (it also runs a daemonset with pods expected on all nodes in the cluster).
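
      To make the difference concrete, here is a minimal Python sketch contrasting a strict rollout check with a node-aware one. It is not the monitoring operator's or the DNS operator's actual code, and it does not claim to reproduce Miciah's suggestions, which are not recorded in this report. The `Node` and `Pod` types, their fields, and both function names are hypothetical and exist only for illustration; the assumption is that a daemonset counts as "at the new level" once every node that is currently Ready hosts an updated, available pod.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    ready: bool  # hypothetical flag: True if the node currently reports Ready


@dataclass
class Pod:
    node_name: str
    updated: bool    # pod was created from the new daemonset template
    available: bool  # pod is running and ready


def strict_rollout_done(nodes: List[Node], pods: List[Pod]) -> bool:
    """Require an updated, available pod on every node, Ready or not."""
    by_node = {p.node_name: p for p in pods}
    return all(
        n.name in by_node
        and by_node[n.name].updated
        and by_node[n.name].available
        for n in nodes
    )


def tolerant_rollout_done(nodes: List[Node], pods: List[Pod]) -> bool:
    """Require an updated, available pod only on nodes that are Ready."""
    ready_nodes = [n for n in nodes if n.ready]
    if not ready_nodes:
        # With no Ready nodes nothing can be verified, so do not claim success.
        return False
    return strict_rollout_done(ready_nodes, pods)


# Scenario from this report: one worker is shut down, so its pod never runs.
nodes = [
    Node("worker-0", ready=True),
    Node("worker-1", ready=True),
    Node("worker-2", ready=False),
]
pods = [
    Pod("worker-0", updated=True, available=True),
    Pod("worker-1", updated=True, available=True),
    Pod("worker-2", updated=True, available=False),
]

print("strict:", strict_rollout_done(nodes, pods))
print("tolerant:", tolerant_rollout_done(nodes, pods))
```

      Under these assumptions the strict check never succeeds while the node stays down, which matches the stuck upgrade described above, whereas the node-aware check lets the rollout be reported as complete. A real operator would likely still want to surface the unavailable node through some other signal rather than ignore it silently.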

              janantha@redhat.com Jayapriya Pai
              bparees@redhat.com Ben Parees
              Junqi Zhao