-
Bug
-
Resolution: Done
-
Normal
-
None
-
4.12
-
None
-
Moderate
-
None
-
MON Sprint 225, MON Sprint 226, MON Sprint 227, MON Sprint 228, MON Sprint 229
-
5
-
False
-
Description of problem:
During the upgrade of build02, a worker node was unavailable. One of the monitoring operator's daemonsets failed to fully rollout as a result (one of the pods never started running, since the node wasn't available). This meant the monitoring operator never achieved the new level, thereby blocking the upgrade. see: https://coreos.slack.com/archives/C03G7REB4JV/p1663698229312909?thread_ts=1663676443.155839&cid=C03G7REB4JV and the full upgrade post mortem: https://docs.google.com/document/d/1N5ulciLzGHq09ouEWObGXz7iDmPmhdM6walZur1ZRbs/edit#
Version-Release number of selected component (if applicable):
4.12 ec to ec upgrade
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster w/ an unavailable node (shutdown the node in the cloud provider. Machineapi at least right now (it's being addressed) will end up reporting the node as unavailable, but not removing it or restarting it) 2. Upgrade the cluster 3. See that the upgrade gets stuck on the monitoring operator
Actual results:
upgrade gets stuck until the unavailable node is deleted or fixed
Expected results:
upgrade completes
Additional info:
Miciah Masters had some suggestions on how the operator can better handle determining if it has achieved the new level, in the face of these sorts of situation. The DNS operator appears to handle this properly (it also runs a daemonset w/ pods expected on all nodes in the cluster).