OCPBUGS-15531

[4.14] DaemonSet fails to scale down during the rolling update when maxUnavailable=0


      Description of problem:

      The OpenShift DNS daemonset uses the RollingUpdate strategy with the "maxSurge" parameter set to a non-zero value, which requires the "maxUnavailable" parameter to be zero. When the user replaces the toleration in the daemonset's template spec (via the OpenShift DNS config API), swapping the toleration that allows scheduling on master nodes for any other toleration, the new pods still try to get scheduled on the master nodes. The old pods on the nodes that are still tolerated may be recreated, but only if the controller happens to process them before any pod on a node that is no longer tolerated.
      
      The new pods should not be scheduled on nodes that are not tolerated by the new daemonset's template spec. The daemonset controller should simply delete the old pods from the nodes that can no longer be tolerated, and recreate the old pods on the nodes that are still tolerated according to the rolling update parameters.
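
      For context, the toleration replacement described above would typically be made through the DNS operator's node placement API rather than on the daemonset directly. A minimal sketch, assuming the standard dns.operator/default resource and its spec.nodePlacement.tolerations field (the "test-taint" key matches the reproduction below):

      $ oc patch dns.operator/default --type=merge \
          -p '{"spec":{"nodePlacement":{"tolerations":[{"key":"test-taint","operator":"Exists"}]}}}'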
      

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Always
      

      Steps to Reproduce:
      1. Create a daemonset that tolerates the "node-role.kubernetes.io/master" taint and has the following rolling update parameters (a minimal generic reproduction manifest is sketched after this output):

      $ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.updateStrategy
      rollingUpdate:
        maxSurge: 10%
        maxUnavailable: 0
      type: RollingUpdate
      
      $ oc  -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
      - key: node-role.kubernetes.io/master
        operator: Exists
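
      A minimal generic daemonset that reproduces the same update strategy and toleration outside of openshift-dns; the name, labels, and image below are illustrative only:

      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: repro-ds
      spec:
        selector:
          matchLabels:
            app: repro-ds
        updateStrategy:
          type: RollingUpdate
          rollingUpdate:
            maxSurge: 10%      # non-zero surge requires maxUnavailable: 0
            maxUnavailable: 0
        template:
          metadata:
            labels:
              app: repro-ds
          spec:
            tolerations:
            - key: node-role.kubernetes.io/master   # replaced in step 3
              operator: Exists
            containers:
            - name: pause
              image: registry.k8s.io/pause:3.9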
      

      2. Let the daemonset be scheduled on all the target nodes (e.g. all masters and all workers):

      $ oc -n openshift-dns get pods  -o wide | grep dns-default
      dns-default-6bfmf     2/2     Running   0          119m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>           <none>
      dns-default-9cjdf     2/2     Running   0          2m35s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>           <none>
      dns-default-c6j9x     2/2     Running   0          119m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>           <none>
      dns-default-fhqrs     2/2     Running   0          2m12s   10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>           <none>
      dns-default-lx2nf     2/2     Running   0          119m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>           <none>
      dns-default-mmc78     2/2     Running   0          112m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>           <none>
      

      3. Update the daemonset's tolerations by removing "node-role.kubernetes.io/master" and adding any other toleration (a non-existent taint works too); see the patch sketch after this output:

      $ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
      - key: test-taint
        operator: Exists
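
      For the generic reproduction manifest from step 1, the equivalent toleration swap can be applied with a patch; a hypothetical example (the resource name matches that sketch, not the cluster's daemonset):

      $ oc patch ds repro-ds --type=merge \
          -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"test-taint","operator":"Exists"}]}}}}'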
      

      Actual results:

      $ oc -n openshift-dns get pods  -o wide | grep dns-default
      dns-default-6bfmf     2/2     Running   0          124m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>           <none>
      dns-default-76vjz     0/2     Pending   0          3m2s    <none>        <none>                                     <none>           <none>
      dns-default-9cjdf     2/2     Running   0          7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>           <none>
      dns-default-c6j9x     2/2     Running   0          124m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>           <none>
      dns-default-fhqrs     2/2     Running   0          7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>           <none>
      dns-default-lx2nf     2/2     Running   0          124m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>           <none>
      dns-default-mmc78     2/2     Running   0          117m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>           <none>
      

      Expected results:

      $ oc -n openshift-dns get pods  -o wide | grep dns-default
      dns-default-9cjdf     2/2     Running   0          7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>           <none>
      dns-default-fhqrs     2/2     Running   0          7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>           <none>
      dns-default-mmc78     2/2     Running   0          7m54s   10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>           <none>
      

      Additional info:
      Upstream issue: https://github.com/kubernetes/kubernetes/issues/118823
      Slack discussion: https://redhat-internal.slack.com/archives/CKJR6200N/p1687455135950439
