OCPBUGS-45677 (OpenShift Bugs)

Machine Config Operator is reporting as "not available" because a drain is failing due to restrictive PDB





      Description of problem:

      Machine Config Operator is reporting as "Down", triggering ClusterOperatorDown alerts, because of a Machine whose node is unschedulable and marked for deletion.
      machine-config                             4.14.41   False       False         True       83m      Cluster not available for [{operator 4.14.41}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 26, updated: 26, ready: 25, unavailable: 1)]
      oc get nodes
      ip-192-168-204-116.ec2.internal   Ready,SchedulingDisabled   worker         273d     v1.27.16+03a907c
      oc describe node ip-192-168-204-116.ec2.internal 
      Taints:             ToBeDeletedByClusterAutoscaler=1733410619:NoSchedule
      Unschedulable:      true
      oc logs -n openshift-machine-api machine-api-controllers-xxxxxxx-xxxx -c machine-controller --since 5m
      I1205 15:31:54.109252       1 recorder.go:104] events "msg"="Node drain requeued: [error when waiting for pod \"pod1\" in namespace \"namespace1\" to terminate: global timeout reached: 20s, error when waiting for pod \"pod2\" in namespace \"namespace2\" to terminate: global timeout reached: 20s, error when waiting for pod \"node-exporter-t5588\" in namespace \"openshift-monitoring\" to terminate: global timeout reached: 20s, error when waiting for pod \"tuned-fhqdm\" in namespace \"openshift-cluster-node-tuning-operator\" to terminate: global timeout reached: 20s, error when waiting for pod \"machine-config-daemon-477lq\" in namespace \"openshift-machine-config-operator\" to terminate: global timeout reached: 20s, error when waiting for pod \"multus-59rdb\" in namespace \"openshift-multus\" to terminate: global timeout reached: 20s, error when waiting for pod \"collector-lwnwj\" in namespace \"openshift-logging\" to terminate: global timeout reached: 20s, error when waiting for pod \"sdn-7c8bf\" in namespace \"openshift-sdn\" to terminate: global timeout reached: 20s]" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"prod-lu-w4slg-worker-32g-us-east-1a-bbdzj","uid":"229c9547-618f-4cf0-847a-ecbbaaec2ece","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"2447741058"} "reason"="DrainRequeued" "type"="Normal"
      Most of the pods in the log above belong to daemonsets and won't be removed until the node is actually drained, but two pods (pod1 and pod2) are failing to drain.
      I also acknowledge that this cluster is on version 4.14 and that this may have been fixed in a newer version; however, I couldn't find anything indicating this through a quick Jira search. If it has been fixed, please let me know and we can close this.
      Additionally, one of the DS pods for this node is in a Terminating state while the other is in a CreateContainerError state, which I think is just a perfect storm of errors to hit this case:
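The pods named in the DrainRequeued event can be pulled out of the message as (namespace, pod) pairs. The following is a minimal sketch in Python, assuming the message format shown in the log line above; the function name `pods_blocking_drain` is hypothetical, and the sample message is abbreviated from that log line.

```python
import re

# Matches the fragments found in the DrainRequeued message above, with or
# without the backslash-escaped quotes that appear in the raw log line.
_POD_RE = re.compile(
    r'error when waiting for pod \\?"(?P<pod>[^"\\]+)\\?" '
    r'in namespace \\?"(?P<ns>[^"\\]+)\\?" to terminate'
)


def pods_blocking_drain(message: str) -> list[tuple[str, str]]:
    """Return (namespace, pod) pairs named in a drain-requeue message.

    Hypothetical helper for illustration; it only parses text and does
    not talk to the cluster.
    """
    return [(m.group("ns"), m.group("pod")) for m in _POD_RE.finditer(message)]


# Abbreviated sample taken from the log line in this report.
sample = (
    'Node drain requeued: [error when waiting for pod \\"pod1\\" in namespace '
    '\\"namespace1\\" to terminate: global timeout reached: 20s, error when '
    'waiting for pod \\"machine-config-daemon-477lq\\" in namespace '
    '\\"openshift-machine-config-operator\\" to terminate: '
    'global timeout reached: 20s]'
)

for namespace, pod in pods_blocking_drain(sample):
    print(f"{namespace}/{pod}")
```

Whether a listed pod belongs to a daemonset cannot be read from this message; that would need the pod's ownerReferences.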
      oc get pods -n openshift-machine-config-operator -o wide | grep ip-192-168-204-116.ec2.internal
      machine-config-daemon-477lq                            0/2     Terminating            2              2d   ip-192-168-204-116.ec2.internal   <none>           <none>
      machine-config-daemon-wx4kc                            1/2     CreateContainerError   0 (108m ago)   112m   ip-192-168-204-116.ec2.internal   <none>           <none>
      oc describe pod -n openshift-machine-config-operator machine-config-daemon-wx4kc
      Warning  Failed     12m (x12 over 48m)     kubelet            (combined from similar events): Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container runtime creation: context deadline exceeded: error reserving ctr name k8s_machine-config-daemon_machine-config-daemon-wx4kc_openshift-machine-config-operator_2bb365e7-d8e0-4091-9fd8-01a454b1587f_1 for id 61120a4c1283df360d747d703e6ec4b6fbf5fdd2f8b5f49fe9d177dc3978d1af: name is reserved

      Version-Release number of selected component (if applicable):


      How reproducible:


      Steps to Reproduce:


      Actual results:

      Machine Config Operator throws an error that the daemonset cannot be rolled out to all nodes, because the node in question is marked "SchedulingDisabled" while it is being autoscaled down.

      Expected results:

      Machine Config Operator should exclude Machines that are in a deleting state from the desired counts and should not raise alerts for machines that are in the process of being deleted.
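The expected behavior could be sketched as follows. This is an illustration in Python only, not the Machine Config Operator's actual logic: the `Node` type and both function names are hypothetical, and the sketch assumes a node counts as "being deleted" when it carries the `ToBeDeletedByClusterAutoscaler` taint shown in the `oc describe node` output in this report.

```python
from dataclasses import dataclass, field

# Taint key taken from the `oc describe node` output in this report.
DELETION_TAINT = "ToBeDeletedByClusterAutoscaler"


@dataclass
class Node:
    """Hypothetical, minimal stand-in for a node and its daemon pod state."""
    name: str
    daemon_ready: bool
    taint_keys: set[str] = field(default_factory=set)


def is_being_deleted(node: Node) -> bool:
    # Assumption: the autoscaler taint is the signal that the node is going away.
    return DELETION_TAINT in node.taint_keys


def rollout_counts(nodes: list[Node]) -> tuple[int, int]:
    """Return (desired, ready), skipping nodes that are being deleted."""
    counted = [n for n in nodes if not is_being_deleted(n)]
    ready = sum(1 for n in counted if n.daemon_ready)
    return len(counted), ready


# 25 healthy nodes plus the node from this report, whose daemon pod is not ready.
nodes = [Node(f"worker-{i}", daemon_ready=True) for i in range(25)]
nodes.append(
    Node("ip-192-168-204-116.ec2.internal", daemon_ready=False,
         taint_keys={DELETION_TAINT})
)

desired, ready = rollout_counts(nodes)
print(f"desired: {desired}, ready: {ready}")
```

With the counts from this report (desired: 26, ready: 25), excluding the tainted node gives 25 desired and 25 ready, so the rollout would be treated as complete.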

      Additional info:

      As mentioned previously, I think this is a perfect storm of edge cases: the node needs to be unschedulable (in this case because it is being scaled down and deleted), something needs to prevent the rollout of the machine config daemonset pod, and, since the machine is being deleted, something needs to block the deletion.

              team-mco Team MCO
              iamkirkbater Kirk Bater
              Sergio Regidor de la Rosa Sergio Regidor de la Rosa
