- Bug
- Resolution: Unresolved
- Normal
- None
- 4.12, 4.14, 4.15
- None
- Moderate
- No
- False
Description of problem:
When a NoExecute taint is removed from the node and then re-applied, we expect the pod to be deleted after 120 seconds, since `tolerationSeconds` on the pod is set to 120. But in some race conditions `TaintManagerEviction` cancels the deletion of the pod.
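The expected timeline can be sketched with a minimal model (illustrative only, not the actual taint-manager code in kube-controller-manager): re-applying the taint should start a fresh 120-second countdown, and only a subsequent taint removal should cancel it.

```python
# Minimal model of NoExecute eviction scheduling (an assumption for
# illustration; this is NOT the real kube-controller-manager implementation).

class TaintEvictionModel:
    def __init__(self):
        self.eviction_deadline = None  # absolute time at which the pod is evicted

    def taint_added(self, now, toleration_seconds):
        # The pod tolerates the taint for toleration_seconds, then is evicted.
        self.eviction_deadline = now + toleration_seconds

    def taint_removed(self, now):
        # Removing the taint cancels any pending eviction. The reported bug is
        # that this cancellation can race with a later taint_added and win.
        self.eviction_deadline = None

    def pod_evicted(self, now):
        return self.eviction_deadline is not None and now >= self.eviction_deadline

m = TaintEvictionModel()
m.taint_added(now=0, toleration_seconds=120)    # taint the node
m.taint_removed(now=60)                         # remove the taint: cancel is correct here
m.taint_added(now=100, toleration_seconds=120)  # taint the node again
print(m.pod_evicted(now=150))   # False: still within tolerationSeconds
print(m.pod_evicted(now=230))   # True: expected eviction at t = 100 + 120
```

In the model, the second `taint_added` always wins; the bug report describes the stale cancellation from the first removal winning instead.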
Version-Release number of selected component (if applicable):
4.14, 4.12
How reproducible:
Intermittent
Steps to Reproduce:
1. Install a 4.14 cluster.
2. Create a project with the name `test`.
3. Set PSA on the project by running the command `oc label ns/test security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/audit=privileged pod-security.kubernetes.io/warn=privileged --overwrite`
4. Create a pod using the YAML below:

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    name: tolerationseconds-1
  name: tolerationseconds-1
spec:
  containers:
  - image: "quay.io/openshifttest/hello-openshift@sha256:4200f438cf2e9446f6bcff9d67ceea1f69ed07a2f83363b7fb52529f7ddd8a83"
    imagePullPolicy: IfNotPresent
    name: tolerationseconds-1
    ports:
    - containerPort: 8080
      protocol: TCP
    resources: {}
    securityContext:
      capabilities: {}
      privileged: false
    terminationMessagePath: /dev/termination-log
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  serviceAccount: ""
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 120
```

5. Verify the pod is running and get the name of the node it is running on.
6. Add a taint to that node using the command `oc adm taint node <nodename> key1=value1:NoExecute`
7. Wait 60 seconds, then remove the taint using the command `oc adm taint node <nodename> key1:NoExecute-`
8. Verify the pod is still running on the node.
9. Taint the node again using the command `oc adm taint node <nodename> key1=value1:NoExecute`
10. Verify the pod is running and that it is removed after 120 seconds, as `tolerationSeconds` has been set to 120.
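Steps 6 through 10 can be sketched as a script (a sketch only: it assumes an OpenShift cluster reachable via `oc`, and `worker-0` is a hypothetical node name; with `DRY_RUN=1`, the default, it just echoes the commands):

```shell
#!/bin/sh
# Repro sequence for the taint/untaint/re-taint race.
# DRY_RUN=1 (default) echoes commands; set DRY_RUN=0 and NODE to run for real.
DRY_RUN="${DRY_RUN:-1}"
NODE="${NODE:-worker-0}"   # hypothetical node name; substitute your own

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

repro() {
  run oc adm taint node "$NODE" key1=value1:NoExecute   # step 6: taint the node
  run sleep 60                                          # step 7: wait 60 seconds
  run oc adm taint node "$NODE" key1:NoExecute-         # step 7: remove the taint
  run oc adm taint node "$NODE" key1=value1:NoExecute   # step 9: taint again
  run oc get pod tolerationseconds-1 -n test -w         # step 10: watch the pod
}

repro
```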
Actual results:
The pod starts to be deleted, but `TaintManagerEviction` cancels the deletion.
Expected results:
`TaintManagerEviction` should not cancel the pod deletion as the pod has been marked for eviction.
Additional info:
Logs:
- 4.14 OCP cluster build log: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-nutanix-ipi-compact-f28-destructive/1710907965994700800/artifacts/nutanix-ipi-compact-f28-destructive/openshift-extended-test-disruptive/build-log.txt
- 4.14 must-gather.tar: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-nutanix-ipi-compact-f28-destructive/1710907965994700800/artifacts/nutanix-ipi-compact-f28-destructive/gather-must-gather/artifacts/must-gather.tar
- 4.12 OCP cluster build log: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-azure-mag-ipi-fullyprivate-f28-destructive/1702903496543571968/artifacts/azure-mag-ipi-fullyprivate-f28-destructive/openshift-extended-test-disruptive/build-log.txt
- 4.12 must-gather log: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-azure-mag-ipi-fullyprivate-f28-destructive/1702903496543571968/artifacts/azure-mag-ipi-fullyprivate-f28-destructive/gather-must-gather/artifacts/must-gather/must-gather.log
- clones: OCPBUGS-20322 Race condition on pod eviction due to taint manager (New)