OpenShift Bugs / OCPBUGS-45677

Machine Config Operator is reporting as "not available" because a drain is failing due to restrictive PDB



      Description of problem:

      Machine Config Operator is reporting as "Down" and triggering ClusterOperatorDown alerts because of a Machine that is Unschedulable and marked for deletion.
      
      machine-config                             4.14.41   False       False         True       83m      Cluster not available for [{operator 4.14.41}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 26, updated: 26, ready: 25, unavailable: 1)]
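
      For reference, the condition text and daemonset counts above can be checked directly with something like:

      oc get clusteroperator machine-config -o yaml
      oc get daemonset machine-config-daemon -n openshift-machine-config-operator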
      
      
      oc get nodes
      ...
      ip-192-168-204-116.ec2.internal   Ready,SchedulingDisabled   worker         273d     v1.27.16+03a907c
      ...
      
      
      oc describe node ip-192-168-204-116.ec2.internal 
      ...
      Taints:             ToBeDeletedByClusterAutoscaler=1733410619:NoSchedule
                          node.kubernetes.io/unschedulable:NoSchedule
                          DeletionCandidateOfClusterAutoscaler=1733410019:PreferNoSchedule
      Unschedulable:      true
      ...
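
      To confirm the Machine backing this node is actually mid-deletion (rather than just cordoned), its deletionTimestamp can be checked; the machine name here is taken from the drain log below:

      oc get machine prod-lu-w4slg-worker-32g-us-east-1a-bbdzj -n openshift-machine-api -o jsonpath='{.metadata.deletionTimestamp}'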
      
      oc logs -n openshift-machine-api machine-api-controllers-xxxxxxx-xxxx -c machine-controller --since 5m
      ...
      I1205 15:31:54.109252       1 recorder.go:104] events "msg"="Node drain requeued: [error when waiting for pod \"pod1\" in namespace \"namespace1\" to terminate: global timeout reached: 20s, error when waiting for pod \"pod2\" in namespace \"namespace2\" to terminate: global timeout reached: 20s, error when waiting for pod \"node-exporter-t5588\" in namespace \"openshift-monitoring\" to terminate: global timeout reached: 20s, error when waiting for pod \"tuned-fhqdm\" in namespace \"openshift-cluster-node-tuning-operator\" to terminate: global timeout reached: 20s, error when waiting for pod \"machine-config-daemon-477lq\" in namespace \"openshift-machine-config-operator\" to terminate: global timeout reached: 20s, error when waiting for pod \"multus-59rdb\" in namespace \"openshift-multus\" to terminate: global timeout reached: 20s, error when waiting for pod \"collector-lwnwj\" in namespace \"openshift-logging\" to terminate: global timeout reached: 20s, error when waiting for pod \"sdn-7c8bf\" in namespace \"openshift-sdn\" to terminate: global timeout reached: 20s]" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"prod-lu-w4slg-worker-32g-us-east-1a-bbdzj","uid":"229c9547-618f-4cf0-847a-ecbbaaec2ece","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"2447741058"} "reason"="DrainRequeued" "type"="Normal"
      ...
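
      Since the title points at a restrictive PDB, the budgets covering the two non-daemonset pods (namespaces redacted as namespace1/namespace2 above) can be listed; an ALLOWED DISRUPTIONS of 0 would explain the repeated requeues:

      oc get pdb -A
      oc get pdb -n namespace1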
      
      ---
      
      Acknowledging that most of the pods in the log above belong to daemonsets and won't be removed until the node itself is deleted, there are two pods that are failing to drain.
      
      I also acknowledge this cluster is on version 4.14 and this may have been fixed in a newer version; however, I couldn't find anything through a quick JIRA search indicating that. If that is the case, please let me know and we can close this.
      
      Additionally, one of the DS pods for this node is in a Terminating state while the other is in a CreateContainerError state, which I think is just a perfect storm of errors to hit this case:
      
      oc get pods -n openshift-machine-config-operator -o wide | grep ip-192-168-204-116.ec2.internal
      
      machine-config-daemon-477lq                            0/2     Terminating            2              2d     192.168.204.116   ip-192-168-204-116.ec2.internal   <none>           <none>
      machine-config-daemon-wx4kc                            1/2     CreateContainerError   0 (108m ago)   112m   192.168.204.116   ip-192-168-204-116.ec2.internal   <none>           <none>
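
      How long the old pod has been stuck deleting can be read from its metadata, e.g.:

      oc get pod machine-config-daemon-477lq -n openshift-machine-config-operator -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'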
      
      ---
      
      oc describe pod -n openshift-machine-config-operator machine-config-daemon-wx4kc
      
      ...
      Warning  Failed     12m (x12 over 48m)     kubelet            (combined from similar events): Error: kubelet may be retrying requests that are timing out in CRI-O due to system load. Currently at stage container runtime creation: context deadline exceeded: error reserving ctr name k8s_machine-config-daemon_machine-config-daemon-wx4kc_openshift-machine-config-operator_2bb365e7-d8e0-4091-9fd8-01a454b1587f_1 for id 61120a4c1283df360d747d703e6ec4b6fbf5fdd2f8b5f49fe9d177dc3978d1af: name is reserved
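
      One way to look at the CRI-O side of the "name is reserved" error, assuming debug access to the node, is something like:

      oc debug node/ip-192-168-204-116.ec2.internal
      # inside the debug shell:
      chroot /host
      crictl ps -a | grep machine-config-daemon
      journalctl -u crio | grep -i "name is reserved"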
          

      Version-Release number of selected component (if applicable):

      4.14.41
          

      How reproducible:

      
          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      Machine Config Operator throws an error that the daemonset cannot be rolled out to all nodes because the node in question is marked "SchedulingDisabled" while it is being scaled down by the cluster autoscaler.
          

      Expected results:

      Machine Config Operator should exclude Machines that are in a deleting state from its desired counts and should not raise alerts based on machines that are in the process of being deleted.
          

      Additional info:

      As mentioned previously, I think this is a perfect storm of edge cases: the node needs to be unschedulable (here because the cluster autoscaler is scaling down and deleting the node), something needs to prevent the rollout of the machine-config-daemon pod, and, since the machine is deleting, something needs to block that deletion (here, the drain failing against a restrictive PDB).
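
      For completeness, the unschedulable flag and autoscaler taints that make up the first condition can be read straight from the node object, e.g.:

      oc get node ip-192-168-204-116.ec2.internal -o jsonpath='{.spec.unschedulable}{"\n"}{range .spec.taints[*]}{.key}{"\n"}{end}'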
          

              team-mco Team MCO
              iamkirkbater Kirk Bater
              Sergio Regidor de la Rosa