Bug
Resolution: Done-Errata
Critical
4.13
Important
No
Approved
False
N/A
Release Note Not Required
Description of problem:
The OpenShift DNS daemonset uses the RollingUpdate strategy with the "maxSurge" parameter set to a non-zero value, which requires the "maxUnavailable" parameter to be zero. When the user replaces the toleration in the daemonset's template spec (via the OpenShift DNS config API), swapping the one that allows scheduling on master nodes for any other toleration, the new pods still try to get scheduled on the master nodes. The old pods on nodes that are still tolerated may get recreated, but only if the controller happens to process them before any pod on a no-longer-tolerated node. New pods should not be scheduled on nodes that the new daemonset template spec does not tolerate. The daemonset controller should simply delete the old pods from nodes that can no longer be tolerated, while the old pods on nodes that are still tolerated should be recreated according to the rolling update parameters.
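For context, a minimal standalone DaemonSet sketch that combines the same constraints (non-zero maxSurge, maxUnavailable of zero, and a master toleration); the name, namespace, and image here are illustrative placeholders, not the actual dns-default manifest:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: repro-ds            # hypothetical name, for reproduction only
  namespace: default
spec:
  selector:
    matchLabels:
      app: repro-ds
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10%         # a non-zero surge requires maxUnavailable to be 0
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: repro-ds
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # placeholder workload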
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a daemonset that tolerates the "node-role.kubernetes.io/master" taint and has the following rolling update parameters:
$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.updateStrategy
rollingUpdate:
  maxSurge: 10%
  maxUnavailable: 0
type: RollingUpdate
$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: node-role.kubernetes.io/master
  operator: Exists
2. Let the daemonset be scheduled on all target nodes (e.g., all masters and all workers):
$ oc -n openshift-dns get pods -o wide | grep dns-default
dns-default-6bfmf   2/2   Running   0   119m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>   <none>
dns-default-9cjdf   2/2   Running   0   2m35s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>   <none>
dns-default-c6j9x   2/2   Running   0   119m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>   <none>
dns-default-fhqrs   2/2   Running   0   2m12s   10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>   <none>
dns-default-lx2nf   2/2   Running   0   119m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>   <none>
dns-default-mmc78   2/2   Running   0   112m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>   <none>
3. Update the daemonset's tolerations by removing "node-role.kubernetes.io/master" and adding any other toleration (a nonexistent taint works too; a patch sketch follows the output below):
$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: test-taint
  operator: Exists
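Note that for the managed dns-default daemonset the toleration has to be changed through the OpenShift DNS config API rather than by editing the daemonset directly (the operator would revert a direct edit). A sketch, assuming the spec.nodePlacement.tolerations field of dns.operator/default:

$ oc patch dns.operator/default --type=merge \
    -p '{"spec":{"nodePlacement":{"tolerations":[{"key":"test-taint","operator":"Exists"}]}}}'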
Actual results:
$ oc -n openshift-dns get pods -o wide | grep dns-default
dns-default-6bfmf   2/2   Running   0   124m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>   <none>
dns-default-76vjz   0/2   Pending   0   3m2s    <none>        <none>                                     <none>   <none>
dns-default-9cjdf   2/2   Running   0   7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>   <none>
dns-default-c6j9x   2/2   Running   0   124m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>   <none>
dns-default-fhqrs   2/2   Running   0   7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>   <none>
dns-default-lx2nf   2/2   Running   0   124m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>   <none>
dns-default-mmc78   2/2   Running   0   117m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>   <none>
Expected results:
$ oc -n openshift-dns get pods -o wide | grep dns-default
dns-default-9cjdf   2/2   Running   0   7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>   <none>
dns-default-fhqrs   2/2   Running   0   7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>   <none>
dns-default-mmc78   2/2   Running   0   7m54s   10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>   <none>
Additional info:
Upstream issue: https://github.com/kubernetes/kubernetes/issues/118823
Slack discussion: https://redhat-internal.slack.com/archives/CKJR6200N/p1687455135950439
- blocks
  OCPBUGS-13209 After custom tolerations of dns pod, the new pod stuck in pending state (Closed)
- depends on
  OCPBUGS-19452 DaemonSet fails to scale down during the rolling update when maxUnavailable=0 (Closed)
- is cloned by
  OCPBUGS-19452 DaemonSet fails to scale down during the rolling update when maxUnavailable=0 (Closed)
  OCPBUGS-19885 [4.13] DaemonSet fails to scale down during the rolling update when maxUnavailable=0 (Closed)
- is depended on by
  OCPBUGS-19885 [4.13] DaemonSet fails to scale down during the rolling update when maxUnavailable=0 (Closed)
- links to
  RHSA-2023:5006 OpenShift Container Platform 4.14.z security update