- Bug
- Resolution: Unresolved
- Normal
- None
- 4.17.0
- Moderate
- No
- False
Description of problem:
While working on the tooling that surfaces cluster problems during upgrades, we discussed the following finding that seems to be common during updates (at least in 4.16 development):
Nodes briefly go degraded, and the reason annotation on the node contains the following:
failed to set annotations on node: unable to update node "&Node{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}": Patch "https://api-int.evakhoni-1215.qe.devcluster.openshift.com:6443/api/v1/nodes/<node>": read tcp 10.0.26.198:41196->10.0.15.142:6443: read: connection reset by peer
Setting aside the not-so-useful Go structure dump, we see the controller considers the node degraded because it failed to patch it:
failed to set annotations on node: unable to update node...
likely because the ongoing upgrade caused a brief apiserver disruption:
Patch "https://api-int.evakhoni-1215.qe.devcluster.openshift.com:6443/api/v1/nodes/<node>": read tcp 10.0.26.198:41196->10.0.15.142:6443: read: connection reset by peer
Although the apiserver is not supposed to drop connections like this, MCO could be a little more robust and retry with some kind of back-off instead of propagating the node as degraded right away (it is also a little ironic that this degraded state is surfaced via an annotation on the node saying the controller failed to set an annotation on the node). Tolerating minor disruptions like this would make degraded conditions less noisy.
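A minimal sketch (not the actual MCO code) of what such a back-off retry could look like, using client-go's retry.OnError with its default back-off; the package name, the setNodeAnnotationsWithRetry helper, and the isTransient error classification are illustrative assumptions:

package daemon // illustrative package name, not the real MCO layout

import (
	"context"
	"encoding/json"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// isTransient reports whether an error looks like a short-lived apiserver
// hiccup (connection reset, timeout, 429/503) rather than a persistent problem.
func isTransient(err error) bool {
	return utilnet.IsConnectionReset(err) ||
		apierrors.IsServerTimeout(err) ||
		apierrors.IsTooManyRequests(err) ||
		apierrors.IsServiceUnavailable(err)
}

// setNodeAnnotationsWithRetry patches annotations onto a node, retrying with
// exponential back-off on transient errors; only after the retries are
// exhausted would the caller go on to mark the node degraded.
func setNodeAnnotationsWithRetry(ctx context.Context, client kubernetes.Interface, nodeName string, annotations map[string]string) error {
	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{"annotations": annotations},
	})
	if err != nil {
		return err
	}
	return retry.OnError(retry.DefaultBackoff, isTransient, func() error {
		_, patchErr := client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
		return patchErr
	})
}

Only errors that look transient (connection resets, timeouts, 429/503) are retried, so a persistent failure would still surface as degraded once the back-off is exhausted.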
Version-Release number of selected component (if applicable):
The condition is seen often during 4.15 to 4.16 updates; this specific case was 4.15.12 -> 4.16.0-0.nightly-2024-05-08-222442
How reproducible:
often
Steps to Reproduce:
1. Update a cluster and monitor all nodes
Actual results:
nodes briefly go degraded, with the reason being a failed apiserver call
Expected results:
nodes are not marked degraded unless a problematic state persists
- split from: OTA-1245 post-merge testing: OTA-1165 - worker node status (Closed)
- links to