OpenShift Bugs / OCPBUGS-1998

Cluster monitoring fails to achieve new level during upgrade w/ unavailable node


    • Bug
    • Resolution: Done
    • Normal
    • None
    • 4.12
    • Monitoring
    • None
    • Moderate
    • MON Sprint 225, MON Sprint 226, MON Sprint 227, MON Sprint 228, MON Sprint 229
    • 5
    • False
    • Hide



      Description of problem:

      During the upgrade of build02, a worker node was unavailable. As a result, one of the monitoring operator's daemonsets failed to fully roll out (one of its pods never started running, since its node was unavailable). This meant the monitoring operator never reported the new level, which blocked the upgrade. See the full upgrade post mortem for details.


      Version-Release number of selected component (if applicable):

      4.12 ec to ec upgrade

      How reproducible:


      Steps to Reproduce:

      1. Create a cluster with an unavailable node (shut down the node in the cloud provider; the machine API, at least right now (this is being addressed), will report the node as unavailable but will neither remove nor restart it)
      2. Upgrade the cluster
      3. See that the upgrade gets stuck on the monitoring operator

      Actual results:

      The upgrade stays stuck until the unavailable node is deleted or recovered.

      Expected results:

      The upgrade completes.

      Additional info:

      Miciah Masters had some suggestions on how the operator could better determine whether it has achieved the new level in the face of these sorts of situations. The DNS operator appears to handle this properly (it also runs a daemonset with pods expected on all nodes in the cluster).
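
      The gist of the suggested fix is to stop requiring every scheduled pod to be available before declaring the new level. A minimal sketch of such a check, in plain Go with a hypothetical struct whose fields mirror `appsv1.DaemonSetStatus` (the real operator would read these from the API server; `rolloutDone` and `maxUnavailable` are names invented here for illustration):

      ```go
      package main

      import "fmt"

      // daemonSetStatus mirrors the appsv1.DaemonSetStatus fields relevant
      // to judging a rollout. Field names match the upstream type; the
      // struct itself is a stand-in so the sketch is self-contained.
      type daemonSetStatus struct {
          DesiredNumberScheduled int32
          UpdatedNumberScheduled int32
          NumberAvailable        int32
          NumberUnavailable      int32
      }

      // rolloutDone reports whether the daemonset has reached the new level
      // while tolerating up to maxUnavailable pods (e.g. pods stuck on an
      // unreachable node): every scheduled pod must already be on the
      // updated template, and unavailable pods must fit within the budget.
      func rolloutDone(s daemonSetStatus, maxUnavailable int32) bool {
          if s.UpdatedNumberScheduled < s.DesiredNumberScheduled {
              return false // some nodes still run the old template
          }
          return s.NumberUnavailable <= maxUnavailable
      }

      func main() {
          // Six nodes, one of them shut down: its pod counts as updated
          // (nothing old is running there) but unavailable.
          s := daemonSetStatus{
              DesiredNumberScheduled: 6,
              UpdatedNumberScheduled: 6,
              NumberAvailable:        5,
              NumberUnavailable:      1,
          }
          fmt.Println(rolloutDone(s, 1)) // tolerant check: true, upgrade proceeds
          fmt.Println(rolloutDone(s, 0)) // strict check: false, upgrade blocks
      }
      ```

      With a strict `NumberAvailable == DesiredNumberScheduled` condition the single dead node blocks the level forever; the tolerant variant lets the operator report the new level while still surfacing the unavailable pod through a degraded condition.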

            janantha@redhat.com Jayapriya Pai
            bparees@redhat.com Ben Parees
            Junqi Zhao Junqi Zhao