- Bug
- Resolution: Unresolved
- Normal
- None
- 4.14.z
- None
- False
Description of problem:
The Machine Config Operator is reporting as "Down", triggering ClusterOperatorDown alerts, because of a Machine whose node is unschedulable and marked for deletion:

machine-config   4.14.41   False   False   True   83m   Cluster not available for [{operator 4.14.41}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 26, updated: 26, ready: 25, unavailable: 1)]

oc get nodes
...
ip-192-168-204-116.ec2.internal   Ready,SchedulingDisabled   worker   273d   v1.27.16+03a907c
...

oc describe node ip-192-168-204-116.ec2.internal
...
Taints:        ToBeDeletedByClusterAutoscaler=1733410619:NoSchedule
               node.kubernetes.io/unschedulable:NoSchedule
               DeletionCandidateOfClusterAutoscaler=1733410019:PreferNoSchedule
Unschedulable: true
...

oc logs -n openshift-machine-api machine-api-controllers-xxxxxxx-xxxx -c machine-controller --since 5m
...
I1205 15:31:54.109252       1 recorder.go:104] events "msg"="Node drain requeued: [error when waiting for pod \"pod1\" in namespace \"namespace1\" to terminate: global timeout reached: 20s, error when waiting for pod \"pod2\" in namespace \"namespace2\" to terminate: global timeout reached: 20s, error when waiting for pod \"node-exporter-t5588\" in namespace \"openshift-monitoring\" to terminate: global timeout reached: 20s, error when waiting for pod \"tuned-fhqdm\" in namespace \"openshift-cluster-node-tuning-operator\" to terminate: global timeout reached: 20s, error when waiting for pod \"machine-config-daemon-477lq\" in namespace \"openshift-machine-config-operator\" to terminate: global timeout reached: 20s, error when waiting for pod \"multus-59rdb\" in namespace \"openshift-multus\" to terminate: global timeout reached: 20s, error when waiting for pod \"collector-lwnwj\" in namespace \"openshift-logging\" to terminate: global timeout reached: 20s, error when waiting for pod \"sdn-7c8bf\" in namespace \"openshift-sdn\" to terminate: global timeout reached: 20s]" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"prod-lu-w4slg-worker-32g-us-east-1a-bbdzj","uid":"229c9547-618f-4cf0-847a-ecbbaaec2ece","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"2447741058"} "reason"="DrainRequeued" "type"="Normal"
...

---

Acknowledging that most of the pods in the log above belong to daemonsets and won't be removed until the node is actually deleted, there are still two pods that are failing to drain. I also acknowledge this cluster is on 4.14 and this may have been fixed in a newer version; however, a quick JIRA search didn't turn up anything indicating that. If it has been fixed, please let me know and we can close this.

Additionally, one of the machine-config-daemon pods on this node is stuck Terminating while its replacement is in CreateContainerError, which I think is the perfect storm of errors needed to hit this case:

oc get pods -n openshift-machine-config-operator -o wide | grep ip-192-168-204-116.ec2.internal
machine-config-daemon-477lq   0/2   Terminating            2              2d     192.168.204.116   ip-192-168-204-116.ec2.internal   <none>   <none>
machine-config-daemon-wx4kc   1/2   CreateContainerError   0 (108m ago)   112m   192.168.204.116   ip-192-168-204-116.ec2.internal   <none>   <none>

---

oc describe pod -n openshift-machine-config-operator machine-config-daemon-wx4kc
...
Warning  Failed  12m (x12 over 48m)  kubelet  (combined from similar events): Error: kubelet may be retrying requests that are timing out in CRI-O due to system load.
Currently at stage container runtime creation: context deadline exceeded: error reserving ctr name k8s_machine-config-daemon_machine-config-daemon-wx4kc_openshift-machine-config-operator_2bb365e7-d8e0-4091-9fd8-01a454b1587f_1 for id 61120a4c1283df360d747d703e6ec4b6fbf5fdd2f8b5f49fe9d177dc3978d1af: name is reserved
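For reference, the blocked node can also be spotted programmatically. Below is a minimal client-go sketch (illustrative only, not part of any operator; the kubeconfig location and the output format are assumptions) that lists nodes which are cordoned or carry the cluster autoscaler's ToBeDeletedByClusterAutoscaler taint:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: kubeconfig at the default ~/.kube/config location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		deleting := false
		for _, t := range n.Spec.Taints {
			// Taint applied by the cluster autoscaler to nodes it is scaling down.
			if t.Key == "ToBeDeletedByClusterAutoscaler" {
				deleting = true
				break
			}
		}
		if n.Spec.Unschedulable || deleting {
			fmt.Printf("%s  unschedulable=%v  autoscaler-deleting=%v\n",
				n.Name, n.Spec.Unschedulable, deleting)
		}
	}
}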
Version-Release number of selected component (if applicable):
4.14.41
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
The Machine Config Operator reports an error that the daemonset cannot be rolled out to all nodes, because the node in question is marked SchedulingDisabled while the cluster autoscaler scales it down.
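For illustration only, this is roughly the shape of check that produces the failure above (a sketch assuming client-go; it is not the operator's actual waitForDaemonsetRollout code): the rollout is only treated as complete when every scheduled copy of machine-config-daemon is updated and ready, so a single node that cannot run the pod keeps the operator degraded.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ds, err := client.AppsV1().DaemonSets("openshift-machine-config-operator").
		Get(context.TODO(), "machine-config-daemon", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	s := ds.Status
	// Mirrors the status string in the error message:
	// (desired: 26, updated: 26, ready: 25, unavailable: 1)
	fmt.Printf("desired: %d, updated: %d, ready: %d, unavailable: %d\n",
		s.DesiredNumberScheduled, s.UpdatedNumberScheduled, s.NumberReady, s.NumberUnavailable)

	ready := s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
	fmt.Println("rollout complete:", ready)
}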
Expected results:
The Machine Config Operator should exclude Machines that are in a deleting state from its desired counts and should not fire alerts because of machines that are in the process of being deleted.
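A minimal sketch of the kind of accounting this report is asking for (hypothetical; rolloutHealthy and the taint-based heuristic are my own illustration, not MCO code): discount nodes the cluster autoscaler has already marked for deletion before deciding whether the daemonset rollout is degraded.

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// rolloutHealthy is a hypothetical helper: it ignores nodes that the cluster
// autoscaler has tainted for deletion when judging the machine-config-daemon
// rollout, so a machine stuck in the deleting state does not degrade the
// operator or trigger ClusterOperatorDown.
func rolloutHealthy(ds *appsv1.DaemonSet, nodes []corev1.Node) bool {
	deleting := int32(0)
	for _, n := range nodes {
		for _, t := range n.Spec.Taints {
			if t.Key == "ToBeDeletedByClusterAutoscaler" {
				deleting++
				break
			}
		}
	}
	s := ds.Status
	// Nodes that are on their way out should not count toward "desired".
	effectiveDesired := s.DesiredNumberScheduled - deleting
	if effectiveDesired < 0 {
		effectiveDesired = 0
	}
	return s.NumberReady >= effectiveDesired
}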
Additional info:
As mentioned above, I think this is a perfect storm of edge cases: the node has to be unschedulable (here because the cluster autoscaler is scaling it down and deleting it), something has to prevent the machine-config-daemon pod from rolling out on that node, and, since the machine is being deleted, something has to block the deletion from completing.