OCPBUGS-33891

Avoid degrading a node over a brief apiserver disruption



      Description of problem:

      While working on the tooling that surfaces cluster problems during upgrades, we discussed the following finding, which seems to be common during updates (at least in 4.16 development):

      Nodes briefly go degraded, and their reason annotation contains the following string:

      failed to set annotations on node: unable to update node "&Node{ObjectMeta:{      0 0001-01-01 00:00:00 +00
      00 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacit
      y:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeI
      nfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage
      {},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}": Patch "https://api-int.evakhoni-1215.qe.devcluster.openshift.com:6443/api/v1/nodes/<node>": read tcp 10.0.26.198:41196
      ->10.0.15.142:6443: read: connection reset by peer
      

      Setting aside the not-so-useful Go structure dump, we see that the controller considers the node degraded because it failed to patch it:

      failed to set annotations on node: unable to update node...
      

      likely because the ongoing upgrade caused a brief apiserver disruption:

       Patch "https://api-int.evakhoni-1215.qe.devcluster.openshift.com:6443/api/v1/nodes/<node>": read tcp 10.0.26.198:41196
      ->10.0.15.142:6443: read: connection reset by peer
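      To illustrate why this failure looks tolerable: the "read: connection reset by peer" error unwraps to a plain connection-reset error that a client can recognize as transient. The snippet below is a minimal, hypothetical Go sketch (isTransient and the simulated error are illustrations only, not MCO code) of how such an error could be classified:

      package main

      import (
          "errors"
          "fmt"
          "net"
          "syscall"
      )

      // isTransient is a hypothetical helper: it reports whether an error from an
      // apiserver request looks like a short-lived network blip rather than a
      // persistent failure.
      func isTransient(err error) bool {
          // "read: connection reset by peer" unwraps to syscall.ECONNRESET.
          if errors.Is(err, syscall.ECONNRESET) {
              return true
          }
          // Timeouts on the underlying connection are also worth retrying.
          var netErr net.Error
          return errors.As(err, &netErr) && netErr.Timeout()
      }

      func main() {
          // Simulate the kind of wrapped error a client-go Patch call returns for
          // "read tcp ...: read: connection reset by peer".
          err := fmt.Errorf("Patch %q: %w", "https://api-int.example:6443/api/v1/nodes/<node>",
              &net.OpError{Op: "read", Net: "tcp", Err: syscall.ECONNRESET})
          fmt.Println(isTransient(err)) // true, so a retry would be reasonable
      }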
      

      Although the apiserver is not supposed to drop connections like this, the MCO could be a little more robust and retry with back-off instead of immediately propagating the node as degraded (it is also a little ironic that this degraded state is surfaced via an annotation on the node saying that the controller failed to set an annotation on the node). Tolerating minor disruptions like this would make the degraded conditions that actually matter less noisy.
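      If the MCO retried instead, one option is client-go's retry.OnError helper with an exponential back-off. The sketch below is a rough illustration only; the function name, back-off values, and merge-patch payload are assumptions made for the example, not the MCO's actual implementation:

      package example

      import (
          "context"
          "errors"
          "fmt"
          "syscall"
          "time"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/types"
          "k8s.io/apimachinery/pkg/util/wait"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/util/retry"
      )

      // patchNodeAnnotationWithRetry is a hypothetical helper: it patches a single
      // node annotation and retries with exponential back-off when the failure
      // looks like a brief apiserver disruption, instead of giving up on the first
      // connection reset.
      func patchNodeAnnotationWithRetry(ctx context.Context, client kubernetes.Interface, node, key, value string) error {
          backoff := wait.Backoff{Steps: 5, Duration: 200 * time.Millisecond, Factor: 2.0, Jitter: 0.1}
          patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{%q:%q}}}`, key, value))

          return retry.OnError(backoff,
              // Retry only errors that look transient, such as the connection
              // reset seen in this bug; anything else fails immediately.
              func(err error) bool { return errors.Is(err, syscall.ECONNRESET) },
              func() error {
                  _, err := client.CoreV1().Nodes().Patch(ctx, node, types.MergePatchType, patch, metav1.PatchOptions{})
                  return err
              })
      }

      Because retry.OnError only retries while the classifier returns true and stops once the back-off budget is exhausted, a genuinely persistent failure would still surface the node as degraded.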

      Version-Release number of selected component (if applicable):

      The condition is seen often during 4.15 to 4.16 updates; this specific case was a 4.15.12 -> 4.16.0-0.nightly-2024-05-08-222442 update.

      How reproducible:

      Often.

      Steps to Reproduce:

      1. Update a cluster and monitor all nodes.

      Actual results:

      Nodes briefly go degraded, with the reason being a failed apiserver call.

      Expected results:

      Nodes should not go degraded unless a problematic state persists.
