OCPBUGS-31255
Degraded worker pool when applying configs to master and worker pools at the same time

    Description

      Description of problem:

      
      When we apply a configuration to the master pool and to the worker pool at the same time, and this configuration involves rebooting the nodes, the worker pool intermittently reports a degraded status.
      
      After a few minutes the worker pool stops being degraded without any manual intervention. The time it takes to recover varies, but in our tests it is usually about 4 minutes.
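      
      For reference, a quick way to check the condition from the CLI (a minimal sketch, not part of the original scripts; it assumes cluster-admin access and the default "worker" pool name):
      
      # Assumption: cluster-admin access; "worker" is the default pool name.
      # Prints the Degraded status together with its lastTransitionTime, so the
      # duration of the degradation can be measured.
      $ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{" since "}{.status.conditions[?(@.type=="Degraded")].lastTransitionTime}{"\n"}'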
      
          

      Version-Release number of selected component (if applicable):

      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.16.0-0.nightly-2024-03-20-153747   True        False         3h12m   Cluster version is 4.16.0-0.nightly-2024-03-20-153747
      
          

      How reproducible:

      Intermittent
          

      Steps to Reproduce:

          1. Create 2 scripts: one to apply configurations to the worker and master pools in an endless loop, and another one to watch the worker pool's status
      
      
      ENDLESS UPDATE LOOP SCRIPT
      
      # cat continuous_update.sh
      function checkMCP {
          MCP=$1
          MC_NAME=mc-reproducer-$MCP
      
          # Only act once the pool reports Updated=True
          if [ "$(oc get mcp $MCP -o jsonpath='{.status.conditions[?(@.type=="Updated")].status}')" == "True" ]; then
              echo "MCP $MCP updated."
      
              # Alternate between deleting and creating the MachineConfig
              if oc get mc $MC_NAME ; then
                  echo "Deleting $MC_NAME"
                  oc delete mc $MC_NAME
              else
                  echo "Creating $MC_NAME"
                  oc create -f - << EOF
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: $MCP
        name: $MC_NAME
      spec:
        config:
          ignition:
            version: 3.1.0
        extensions:
        - usbguard
        - kerberos
        - kernel-devel
        - sandboxed-containers
      EOF
              fi
          fi
      }
      
      while true; do
          checkMCP master
          checkMCP worker
          sleep 20
      done
      
      WATCH WORKER'S STATUS SCRIPT
      
      # cat watcher.sh
      while true; do
          # Dump the full pool status whenever the worker pool reports Degraded=True
          if [ "$(oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}')" == "True" ]; then
              date
              oc get mcp worker -o yaml
          fi
      
          sleep 1
      done
      
          2. Execute the continuous_update.sh script created in step 1 in one shell, and execute the watcher.sh script in a different shell.
      
      
          3. Eventually (it can take several hours if unlucky) the watcher.sh script will start reporting a degraded status in the worker pool.
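      
      Optionally, the NodeDegraded message can be extracted directly to see which node is degraded and why (a minimal sketch, not part of the watcher.sh script above; assumes the same cluster-admin access):
      
      # Assumption: optional helper, not part of the original reproducer.
      $ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}{"\n"}'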
      
      
      
          

      Actual results:

      
      The worker pool eventually becomes degraded for a few minutes (approximately 4-5 minutes) and then automatically returns to a non-degraded status.
      
      
      The most common NodeDegraded reason is this one:
      
        - lastTransitionTime: "2024-03-21T09:18:49Z"
          message: 'Node ip-10-0-18-78.us-east-2.compute.internal is reporting: "error setting
            node''s state to Working: unable to update node \"&Node{ObjectMeta:{      0
            0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}\":
            Patch \"https://api-int.sregidor-a4.qe.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-18-78.us-east-2.compute.internal\":
            read tcp 10.0.18.78:59590->10.0.44.94:6443: read: connection reset by peer"'
          reason: 1 nodes are reporting degraded status on sync
          status: "True"
          type: NodeDegraded
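      
      To help triage, this is a sketch of how the machine-config-daemon logs on the reporting node can be checked for the connection reset (assumptions: cluster-admin access, node name taken from the condition above, default MCD pod and container names):
      
      # Assumptions: cluster-admin access; node name copied from the condition above.
      $ NODE=ip-10-0-18-78.us-east-2.compute.internal
      $ POD=$(oc -n openshift-machine-config-operator get pods -o wide | awk -v n="$NODE" '$0 ~ n && /machine-config-daemon/ {print $1}')
      $ oc -n openshift-machine-config-operator logs "$POD" -c machine-config-daemon | grep -i 'connection reset'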
      
      
      
      But we have also seen this one in our tests, though much less frequently:
      
      
        - lastTransitionTime: "2024-03-20T11:25:28Z"
          message: 'Node ip-10-0-16-30.us-east-2.compute.internal is reporting: "error running
            rpm-ostree update --install usbguard --install krb5-workstation --install libkadm5
            --install kernel-devel --install kernel-headers --install kata-containers: error:
            Creating importer: Failed to invoke skopeo proxy method OpenImage: remote error:
            can''t talk to a V1 container registry\n: exit status 1"'
          reason: 1 nodes are reporting degraded status on sync
          status: "True"
          type: NodeDegraded
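      
      Once the pool recovers, the extension install can be re-checked on the affected node with something like this (a sketch, not part of the original report; the node name is taken from the condition above and oc debug access to the node is assumed):
      
      # Assumptions: node name copied from the condition above; oc debug access to the node.
      $ oc debug node/ip-10-0-16-30.us-east-2.compute.internal -- chroot /host rpm-ostree status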
      
      
      
          

      Expected results:

      No pool should be degraded.
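      
      Expressed as a one-shot check (a minimal sketch, not part of the original report; assumes cluster-admin access):
      
      # Assumption: this check is an illustration only; --timeout=0s makes oc wait check once without waiting.
      $ oc wait mcp/master mcp/worker --for=condition=Degraded=False --timeout=0s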
      
      
          

      Additional info:

      
          
