OpenShift Bugs / OCPBUGS-45677

Machine Config Operator is reporting as "not available" because a drain is failing due to restrictive PDB



      Description of problem:

      Machine Config Operator is reporting as "Down" and triggering ClusterOperatorDown alerts because of a Machine that is Unschedulable and marked for deletion.
      
      machine-config                             4.14.41   False       False         True       83m      Cluster not available for [{operator 4.14.41}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 26, updated: 26, ready: 25, unavailable: 1)]
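
      For reference, the condition text and daemonset counts above can be checked directly with something like:

      oc get clusteroperator machine-config -o yaml
      oc get daemonset machine-config-daemon -n openshift-machine-config-operator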
      
      
      oc get nodes
      ...
      ip-192-168-204-116.ec2.internal   Ready,SchedulingDisabled   worker         273d     v1.27.16+03a907c
      ...
      
      
      oc describe node ip-192-168-204-116.ec2.internal 
      ...
      Taints:             ToBeDeletedByClusterAutoscaler=1733410619:NoSchedule
                          node.kubernetes.io/unschedulable:NoSchedule
                          DeletionCandidateOfClusterAutoscaler=1733410019:PreferNoSchedule
      Unschedulable:      true
      ...
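
      To confirm the Machine backing this node is actually mid-deletion (rather than just cordoned), its deletionTimestamp can be checked; the machine name here is taken from the drain log below:

      oc get machine prod-lu-w4slg-worker-32g-us-east-1a-bbdzj -n openshift-machine-api -o jsonpath='{.metadata.deletionTimestamp}'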
      
      oc logs -n openshift-machine-api machine-api-controllers-xxxxxxx-xxxx -c machine-controller --since 5m
      ...
      I1205 15:31:54.109252       1 recorder.go:104] events "msg"="Node drain requeued: [error when waiting for pod \"pod1\" in namespace \"namespace1\" to terminate: global timeout reached: 20s, error when waiting for pod \"pod2\" in namespace \"namespace2\" to terminate: global timeout reached: 20s, error when waiting for pod \"node-exporter-t5588\" in namespace \"openshift-monitoring\" to terminate: global timeout reached: 20s, error when waiting for pod \"tuned-fhqdm\" in namespace \"openshift-cluster-node-tuning-operator\" to terminate: global timeout reached: 20s, error when waiting for pod \"machine-config-daemon-477lq\" in namespace \"openshift-machine-config-operator\" to terminate: global timeout reached: 20s, error when waiting for pod \"multus-59rdb\" in namespace \"openshift-multus\" to terminate: global timeout reached: 20s, error when waiting for pod \"collector-lwnwj\" in namespace \"openshift-logging\" to terminate: global timeout reached: 20s, error when waiting for pod \"sdn-7c8bf\" in namespace \"openshift-sdn\" to terminate: global timeout reached: 20s]" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"prod-lu-w4slg-worker-32g-us-east-1a-bbdzj","uid":"229c9547-618f-4cf0-847a-ecbbaaec2ece","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"2447741058"} "reason"="DrainRequeued" "type"="Normal"
      ...
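
      Since the title points at a restrictive PDB, the budgets covering the two non-daemonset pods (namespaces redacted as namespace1/namespace2 above) can be listed; an ALLOWED DISRUPTIONS of 0 would explain the repeated requeues:

      oc get pdb -A
      oc get pdb -n namespace1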
      
      ---
      
      Acknowledging that most of the pods in the log above belong to daemonsets and won't be removed until the node itself is deleted, there are two pods that are failing to drain.
      
      I also acknowledge this cluster is on version 4.14 and this may have been fixed in a newer version; however, I couldn't find anything through a quick JIRA search indicating that. If that is the case, please let me know and we can close this.
      
      Additionally, one of the DS pods for this node is in a Terminating state while the other is in a CreateContainerError state, which I think is just a perfect storm of errors to hit this case:
      
      oc get pods -n openshift-machine-config-operator -o wide | grep ip-192-168-204-116.ec2.internal
      
      machine-config-daemon-477lq                            0/2     Terminating            2              2d     192.168.204.116   ip-192-168-204-116.ec2.internal   <none>           <none>
      machine-config-daemon-wx4kc                            1/2     CreateContainerError   0 (108m ago)   112m   192.168.204.116   ip-192-168-204-116.ec2.internal   <none>           <none>
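
      How long the old pod has been stuck deleting can be read from its metadata, e.g.:

      oc get pod machine-config-daemon-477lq -n openshift-machine-config-operator -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'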
      
      ---
      
      oc describe pod -n openshift-machine-config-operator machine-config-daemon-wx4kc
      
      ...
      Warning  Failed     12m (x12 over 48m)     kubelet            (combined from similar events): Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container runtime creation: context deadline exceeded: error reserving ctr name k8s_machine-config-daemon_machine-config-daemon-wx4kc_openshift-machine-config-operator_2bb365e7-d8e0-4091-9fd8-01a454b1587f_1 for id 61120a4c1283df360d747d703e6ec4b6fbf5fdd2f8b5f49fe9d177dc3978d1af: name is reserved
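
      One way to look at the CRI-O side of the "name is reserved" error, assuming debug access to the node, is something like:

      oc debug node/ip-192-168-204-116.ec2.internal
      # inside the debug shell:
      chroot /host
      crictl ps -a | grep machine-config-daemon
      journalctl -u crio | grep -i "name is reserved"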
          

      Version-Release number of selected component (if applicable):

      4.14.41
          

      How reproducible:

      
          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      Machine Config Operator throws an error that the daemonset cannot be rolled out to all nodes because the node in question is marked "SchedulingDisabled" while it is being scaled down by the cluster autoscaler.
          

      Expected results:

      Machine Config Operator should exclude Machines that are in a deleting state from its desired counts and should not raise alerts based on machines that are in the process of being deleted.
          

      Additional info:

      As mentioned previously, I think this is a perfect storm of edge cases: the node needs to be unschedulable (here because the cluster autoscaler is scaling down and deleting the node), something needs to prevent the rollout of the machine-config-daemon pod, and, since the machine is deleting, something needs to block that deletion (here, the drain failing against a restrictive PDB).
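
      For completeness, the unschedulable flag and autoscaler taints that make up the first condition can be read straight from the node object, e.g.:

      oc get node ip-192-168-204-116.ec2.internal -o jsonpath='{.spec.unschedulable}{"\n"}{range .spec.taints[*]}{.key}{"\n"}{end}'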
          

              team-mco Team MCO
              iamkirkbater Kirk Bater
              Sergio Regidor de la Rosa