OCPBUGS-33891

Avoid degrading a node over a brief apiserver disruption



      Description of problem:

      While working on the tooling that surfaces cluster problems during upgrades, we discussed the following finding, which seems to be common during updates (at least in 4.16 development):

      Nodes briefly go degraded, and their reason annotation contains the following string:

      failed to set annotations on node: unable to update node "&Node{ObjectMeta:{      0 0001-01-01 00:00:00 +00
      00 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacit
      y:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeI
      nfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage
      {},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}": Patch "https://api-int.evakhoni-1215.qe.devcluster.openshift.com:6443/api/v1/nodes/<node>": read tcp 10.0.26.198:41196
      ->10.0.15.142:6443: read: connection reset by peer
      

      Setting aside the not-so-useful Go structure dump, we see that the controller considers the node degraded because it failed to patch it:

      failed to set annotations on node: unable to update node...
      

      likely because the ongoing upgrade caused a brief apiserver disruption:

       Patch "https://api-int.evakhoni-1215.qe.devcluster.openshift.com:6443/api/v1/nodes/<node>": read tcp 10.0.26.198:41196
      ->10.0.15.142:6443: read: connection reset by peer
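      To illustrate why this failure looks tolerable: the "read: connection reset by peer" error unwraps to a plain connection-reset error that a client can recognize as transient. The snippet below is a minimal, hypothetical Go sketch (isTransient and the simulated error are illustrations only, not MCO code) of how such an error could be classified:

      package main

      import (
          "errors"
          "fmt"
          "net"
          "syscall"
      )

      // isTransient is a hypothetical helper: it reports whether an error from an
      // apiserver request looks like a short-lived network blip rather than a
      // persistent failure.
      func isTransient(err error) bool {
          // "read: connection reset by peer" unwraps to syscall.ECONNRESET.
          if errors.Is(err, syscall.ECONNRESET) {
              return true
          }
          // Timeouts on the underlying connection are also worth retrying.
          var netErr net.Error
          return errors.As(err, &netErr) && netErr.Timeout()
      }

      func main() {
          // Simulate the kind of wrapped error a client-go Patch call returns for
          // "read tcp ...: read: connection reset by peer".
          err := fmt.Errorf("Patch %q: %w", "https://api-int.example:6443/api/v1/nodes/<node>",
              &net.OpError{Op: "read", Net: "tcp", Err: syscall.ECONNRESET})
          fmt.Println(isTransient(err)) // true, so a retry would be reasonable
      }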
      

      Although the apiserver is not supposed to drop connections like this, the MCO could be a little more robust and retry with back-off instead of immediately propagating the node as degraded (it is also a little ironic that this degraded state is surfaced via an annotation on the node saying that the controller failed to set an annotation on the node). Tolerating minor disruptions like this would make the degraded conditions that actually matter less noisy.
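      If the MCO retried instead, one option is client-go's retry.OnError helper with an exponential back-off. The sketch below is a rough illustration only; the function name, back-off values, and merge-patch payload are assumptions made for the example, not the MCO's actual implementation:

      package example

      import (
          "context"
          "errors"
          "fmt"
          "syscall"
          "time"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/types"
          "k8s.io/apimachinery/pkg/util/wait"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/util/retry"
      )

      // patchNodeAnnotationWithRetry is a hypothetical helper: it patches a single
      // node annotation and retries with exponential back-off when the failure
      // looks like a brief apiserver disruption, instead of giving up on the first
      // connection reset.
      func patchNodeAnnotationWithRetry(ctx context.Context, client kubernetes.Interface, node, key, value string) error {
          backoff := wait.Backoff{Steps: 5, Duration: 200 * time.Millisecond, Factor: 2.0, Jitter: 0.1}
          patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{%q:%q}}}`, key, value))

          return retry.OnError(backoff,
              // Retry only errors that look transient, such as the connection
              // reset seen in this bug; anything else fails immediately.
              func(err error) bool { return errors.Is(err, syscall.ECONNRESET) },
              func() error {
                  _, err := client.CoreV1().Nodes().Patch(ctx, node, types.MergePatchType, patch, metav1.PatchOptions{})
                  return err
              })
      }

      Because retry.OnError only retries while the classifier returns true and stops once the back-off budget is exhausted, a genuinely persistent failure would still surface the node as degraded.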

      Version-Release number of selected component (if applicable):

      The condition is seen often during 4.15 to 4.16 updates; this specific case was a 4.15.12 -> 4.16.0-0.nightly-2024-05-08-222442 update.

      How reproducible:

      Often.

      Steps to Reproduce:

      1. Update a cluster and monitor all nodes.

      Actual results:

      Nodes briefly go degraded, with the reason being a failed apiserver call.

      Expected results:

      Nodes should not go degraded unless a problematic state persists.
