- Bug
- Resolution: Won't Do
- Minor
- None
- 4.16
- Low
- None
- MCO Sprint 256
- 1
- False
Description of problem:
When we apply a configuration to the master pool and to the worker pool at the same time, and this configuration involves rebooting the nodes, the worker pool intermittently reports a degraded status. After a few minutes the worker pool stops being degraded without any manual intervention. The time it takes to clear the degradation varies, but in our tests it is usually about 4 minutes.
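For a quick manual check of whether the worker pool is currently in this state (a minimal sketch, not part of the reproducer below), the relevant conditions can be read directly from the pool:

$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'
$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}'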
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-03-20-153747   True        False         3h12m   Cluster version is 4.16.0-0.nightly-2024-03-20-153747
How reproducible:
Intermittent
Steps to Reproduce:
1. Create 2 scripts: one to apply configurations to the worker and master pools in an endless loop, and another one to watch the worker pool's status.

ENDLESS UPDATE LOOP SCRIPT

# cat continuous_update.sh
function checkMCP {
    MCP=$1
    MC_NAME=mc-reproducer-$MCP
    # Once the pool reports Updated, toggle the reproducer MachineConfig:
    # delete it if it exists, otherwise create it.
    if [ $(oc get mcp $MCP -o jsonpath='{.status.conditions[?(@.type=="Updated")].status}') == "True" ]; then
        echo "MCP $MCP updated."
        if oc get mc $MC_NAME ; then
            echo "Deleting $MC_NAME"
            oc delete mc $MC_NAME
        else
            echo "Creating $MC_NAME"
            oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: $MCP
  name: $MC_NAME
spec:
  config:
    ignition:
      version: 3.1.0
  extensions:
    - usbguard
    - kerberos
    - kernel-devel
    - sandboxed-containers
EOF
        fi
    fi
}

while true; do
    checkMCP master
    checkMCP worker
    sleep 20
done

WATCH WORKER'S STATUS SCRIPT

# cat watcher.sh
while true; do
    if [ $(oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}') == "True" ]; then
        date
        oc get mcp worker -o yaml
    fi
    sleep 1
done

2. Execute the continuous_update.sh script created in step 1 in one shell, and execute the watcher.sh script created in step 1 in a different shell.
3. Eventually (it can take several hours if unlucky) the watcher.sh script will start reporting a degraded status in the worker pool.
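After stopping both scripts, a cleanup along these lines should return the pools to their original state (a sketch; mc-reproducer-master and mc-reproducer-worker are the MachineConfig names generated by the script above, and the 45m timeout is an arbitrary choice):

$ oc delete mc mc-reproducer-master mc-reproducer-worker --ignore-not-found
$ oc wait mcp master worker --for=condition=Updated --timeout=45m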
Actual results:
The worker pool is eventually degraded for a few minutes (approximately 4-5 minutes) and then it automatically returns to a non-degraded status.

The most common NodeDegraded reason is this one:

- lastTransitionTime: "2024-03-21T09:18:49Z"
  message: 'Node ip-10-0-18-78.us-east-2.compute.internal is reporting: "error setting node''s state to Working: unable to update node \"&Node{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}\": Patch \"https://api-int.sregidor-a4.qe.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-18-78.us-east-2.compute.internal\": read tcp 10.0.18.78:59590->10.0.44.94:6443: read: connection reset by peer"'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded

But we have also seen this one in our tests, happening with much lower frequency:

- lastTransitionTime: "2024-03-20T11:25:28Z"
  message: 'Node ip-10-0-16-30.us-east-2.compute.internal is reporting: "error running rpm-ostree update --install usbguard --install krb5-workstation --install libkadm5 --install kernel-devel --install kernel-headers --install kata-containers: error: Creating importer: Failed to invoke skopeo proxy method OpenImage: remote error: can''t talk to a V1 container registry\n: exit status 1"'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
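While the pool is degraded, the underlying error can also be checked from the machine-config-daemon pod running on the affected node; a hedged example using the node name from the condition above (pod names differ per cluster):

$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -o wide --field-selector spec.nodeName=ip-10-0-18-78.us-east-2.compute.internal
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod-from-previous-command> -c machine-config-daemon --tail=100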
Expected results:
No pool should be degraded.
Additional info: