Bug
Resolution: Done-Errata
Critical
4.13
Important
No
Rejected
False
Description of problem:
The OpenShift DNS daemonset uses the RollingUpdate strategy with the "maxSurge" parameter set to a nonzero value, which requires the "maxUnavailable" parameter to be zero. When the user replaces the toleration in the daemonset's template spec (via the OpenShift DNS config API), swapping the toleration that allows scheduling on master nodes for any other toleration, the new pods still try to schedule on the master nodes. Old pods on nodes that are still tolerated may happen to be recreated, but only if the rollout reaches them before any pod on a node that is no longer tolerated; once a surge pod targets such a node, it stays Pending and the rollout stalls. New pods should never be scheduled on nodes that the new daemonset's template spec does not tolerate. The daemonset controller should simply delete the old pods from the nodes that can no longer be tolerated, while the old pods on nodes that are still tolerated should be recreated according to the rolling update parameters.
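For reference, a minimal sketch of the toleration replacement described above, performed through the DNS operator config. This assumes the cluster's default DNS config object (dns.operator/default) and its spec.nodePlacement.tolerations field; the "test-taint" key is an arbitrary example, not a real taint:
# replaces the daemonset's tolerations wholesale; "test-taint" is an arbitrary example key
$ oc patch dns.operator/default --type=merge \
    -p '{"spec":{"nodePlacement":{"tolerations":[{"key":"test-taint","operator":"Exists"}]}}}'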
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a daemonset which tolerates the "node-role.kubernetes.io/master" taint and has the following rolling update parameters (a minimal standalone manifest with the same parameters is sketched after these steps):
$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.updateStrategy
rollingUpdate:
  maxSurge: 10%
  maxUnavailable: 0
type: RollingUpdate
$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: node-role.kubernetes.io/master
  operator: Exists
2. Let the daemonset be scheduled on all the target nodes (e.g. all masters and all workers):
$ oc -n openshift-dns get pods -o wide | grep dns-default
dns-default-6bfmf   2/2   Running   0   119m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>   <none>
dns-default-9cjdf   2/2   Running   0   2m35s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>   <none>
dns-default-c6j9x   2/2   Running   0   119m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>   <none>
dns-default-fhqrs   2/2   Running   0   2m12s   10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>   <none>
dns-default-lx2nf   2/2   Running   0   119m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>   <none>
dns-default-mmc78   2/2   Running   0   112m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>   <none>
3. Update the daemonset's tolerations by removing "node-role.kubernetes.io/master" and adding any other toleration (a nonexistent taint key works too):
$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: test-taint
  operator: Exists
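The issue is not specific to the DNS daemonset; any daemonset with these rolling update parameters should reproduce it. A minimal sketch of such a manifest (the repro-ds name and the pause image are illustrative assumptions, not taken from this report):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: repro-ds            # hypothetical name, for illustration only
spec:
  selector:
    matchLabels:
      app: repro-ds
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10%         # a nonzero maxSurge requires maxUnavailable: 0
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: repro-ds
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # any always-available image works
Replacing the "node-role.kubernetes.io/master" toleration in this template with an arbitrary key such as "test-taint" should leave the old master-node pods running and a new surge pod stuck in Pending, as shown under Actual results below.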
Actual results:
$ oc -n openshift-dns get pods -o wide | grep dns-default
dns-default-6bfmf   2/2   Running   0   124m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>   <none>
dns-default-76vjz   0/2   Pending   0   3m2s    <none>        <none>                                     <none>   <none>
dns-default-9cjdf   2/2   Running   0   7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>   <none>
dns-default-c6j9x   2/2   Running   0   124m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>   <none>
dns-default-fhqrs   2/2   Running   0   7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>   <none>
dns-default-lx2nf   2/2   Running   0   124m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>   <none>
dns-default-mmc78   2/2   Running   0   117m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>   <none>
Expected results:
$ oc -n openshift-dns get pods -o wide | grep dns-default
dns-default-9cjdf   2/2   Running   0   7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>   <none>
dns-default-fhqrs   2/2   Running   0   7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>   <none>
dns-default-mmc78   2/2   Running   0   7m54s   10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>   <none>
Additional info:
Upstream issue: https://github.com/kubernetes/kubernetes/issues/118823
Slack discussion: https://redhat-internal.slack.com/archives/CKJR6200N/p1687455135950439
clones: OCPBUGS-15531 [4.14] DaemonSet fails to scale down during the rolling update when maxUnavailable=0 (Closed)
depends on: OCPBUGS-15531 [4.14] DaemonSet fails to scale down during the rolling update when maxUnavailable=0 (Closed)
links to: RHBA-2023:5467 OpenShift Container Platform 4.13.z bug fix update