OpenShift Bugs / OCPBUGS-31255

Degraded worker pool when applying configs to master and worker pools at the same time


      Description of problem:

      
      When we apply a configuration to the master pool and to the worker pool at the same time, and this configuration involves rebooting the nodes, the worker pool intermittently reports a degraded status.
      
      After a few minutes the worker pool stops being degraded without any manual intervention. The time needed for the degradation to clear is variable, but in our tests it is usually about 4 minutes.
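      
      For reference, the transient Degraded condition can be checked directly on the pool while the configurations roll out. The command below is just an illustrative way to print it:
      
      $ oc get mcp worker -o jsonpath='{range .status.conditions[?(@.type=="Degraded")]}{.status}{" "}{.message}{"\n"}{end}'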
      
          

      Version-Release number of selected component (if applicable):

      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.16.0-0.nightly-2024-03-20-153747   True        False         3h12m   Cluster version is 4.16.0-0.nightly-2024-03-20-153747
      
          

      How reproducible:

      Intermittent
          

      Steps to Reproduce:

          1. Create 2 scripts, one to apply configurations to the worker and master pools in an endless loop, and another one to watch the worker pool's status
      
      
      ENDLESS UPDATE LOOP SCRIPT
      
      # cat continuous_update.sh
      # Alternately creates and deletes a MachineConfig on each pool so that master and worker
      # are continuously reconfigured (and rebooted) at the same time.
      function checkMCP {
          MCP=$1
          MC_NAME=mc-reproducer-$MCP
      
          # Only act once the pool has finished its previous update
          if [ "$(oc get mcp $MCP -o jsonpath='{.status.conditions[?(@.type=="Updated")].status}')" == "True" ]; then
              echo "MCP $MCP updated."
      
              if oc get mc $MC_NAME ; then
                  echo "Deleting $MC_NAME"
                  oc delete mc $MC_NAME
              else
                  echo "Creating $MC_NAME"
                  oc create -f - << EOF
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: $MCP
        name: $MC_NAME
      spec:
        config:
          ignition:
            version: 3.1.0
        extensions:
        - usbguard
        - kerberos
        - kernel-devel
        - sandboxed-containers
      EOF
              fi
          fi
      }
      
      while true; do
          checkMCP master
          checkMCP worker
          sleep 20
      done
      
      WATCH WORKER'S STATUS SCRIPT
      
      # cat watcher.sh
      # Prints the date and dumps the worker pool whenever its Degraded condition becomes True
      while true; do
          if [ "$(oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}')" == "True" ]; then
              date
              oc get mcp worker -o yaml
          fi
      
          sleep 1
      done
      
          2. Execute the continuous_update.sh script created in step 1 in one shell, and execute the watcher.sh script created in step 1 in a different shell (an illustrative way to run them is shown after step 3).
      
      
          3. Eventually (it can take several hours if you are unlucky) the watcher.sh script will start reporting a degraded status in the worker pool.
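      
      For reference, an illustrative way to run the two scripts from step 2 (paths and log file names are arbitrary):
      
      $ chmod +x continuous_update.sh watcher.sh
      $ ./continuous_update.sh > update.log 2>&1 &
      $ ./watcher.sh | tee worker-degraded.log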
      
      
      
          

      Actual results:

      
      The worker pool eventually becomes degraded for a few minutes (approximately 4-5 minutes) and then it automatically returns to a non-degraded status.
      
      
      The most common NodeDegraded reason is this one:
      
        - lastTransitionTime: "2024-03-21T09:18:49Z"
          message: 'Node ip-10-0-18-78.us-east-2.compute.internal is reporting: "error setting
            node''s state to Working: unable to update node \"&Node{ObjectMeta:{      0
            0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}\":
            Patch \"https://api-int.sregidor-a4.qe.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-18-78.us-east-2.compute.internal\":
            read tcp 10.0.18.78:59590->10.0.44.94:6443: read: connection reset by peer"'
          reason: 1 nodes are reporting degraded status on sync
          status: "True"
          type: NodeDegraded
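      
      The error above appears to come from the machine-config-daemon on the node failing to patch its own Node object against the API server. An illustrative way to correlate the condition with the daemon logs (the label and container name are the usual ones for the MCD daemonset; adjust if they differ):
      
      NODE=ip-10-0-18-78.us-east-2.compute.internal
      MCD_POD=$(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=$NODE -o name)
      oc logs -n openshift-machine-config-operator $MCD_POD -c machine-config-daemon | grep -i "setting node"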
      
      
      
      But we have also seen this one in our tests, happening much less frequently:
      
      
        - lastTransitionTime: "2024-03-20T11:25:28Z"
          message: 'Node ip-10-0-16-30.us-east-2.compute.internal is reporting: "error running
            rpm-ostree update --install usbguard --install krb5-workstation --install libkadm5
            --install kernel-devel --install kernel-headers --install kata-containers: error:
            Creating importer: Failed to invoke skopeo proxy method OpenImage: remote error:
            can''t talk to a V1 container registry\n: exit status 1"'
          reason: 1 nodes are reporting degraded status on sync
          status: "True"
          type: NodeDegraded
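      
      This second error is rpm-ostree failing to install the extension packages. As an illustrative check, the rpm-ostree state on the node from the message above can be inspected with oc debug:
      
      $ oc debug node/ip-10-0-16-30.us-east-2.compute.internal -- chroot /host rpm-ostree status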
      
      
      
          

      Expected results:

      No pool should be degraded.
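      
      As an illustrative check, both pools should keep reporting Degraded=False for the whole duration of the update loop:
      
      $ oc get mcp master worker -o jsonpath='{range .items[*]}{.metadata.name}{": Degraded="}{.status.conditions[?(@.type=="Degraded")].status}{"\n"}{end}'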
      
      
          

      Additional info:

      
          
