OpenShift Bugs: OCPBUGS-23318

Race condition on pod eviction due to taint manager


    • Bug
    • Resolution: Unresolved
    • Normal
    • None
    • 4.12, 4.14, 4.15
    • kube-scheduler
    • None
    • Moderate
    • No
    • False

      Description of problem:

      When a `NoExecute` taint is added to a node, removed, and then added again, we expect the pod to be deleted once its `tolerationSeconds` (set to 120 in the pod spec) elapses. In some race conditions, however, `TaintManagerEviction` cancels the deletion of the pod.

      Version-Release number of selected component (if applicable):

      4.14, 4.12

      How reproducible:

      Intermittent

      Steps to Reproduce:

      1. Install a 4.14 cluster.
      2. Create a project named `test`.
      3. Set the Pod Security Admission (PSA) labels on the project by running `oc label ns/test security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/audit=privileged pod-security.kubernetes.io/warn=privileged --overwrite`
      4. Create a pod using the YAML below:
      apiVersion: v1
      kind: Pod
      metadata:
        labels:
          name: tolerationseconds-1
        name: tolerationseconds-1
      spec:
        containers:
          - image: "quay.io/openshifttest/hello-openshift@sha256:4200f438cf2e9446f6bcff9d67ceea1f69ed07a2f83363b7fb52529f7ddd8a83"
            imagePullPolicy: IfNotPresent
            name: tolerationseconds-1
            ports:
              - containerPort: 8080
                protocol: TCP
            resources: {}
            securityContext:
              capabilities: {}
              privileged: false
            terminationMessagePath: /dev/termination-log
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        serviceAccount: ""
        tolerations:
          - key: "key1"
            operator: "Equal"
            value: "value1"
            effect: "NoExecute"
            tolerationSeconds: 120
      5. Verify that the pod is running and note the name of the node it is running on.
      6. Add a taint to that node using the command `oc adm taint node <nodename> key1=value1:NoExecute`
      7. Wait 60 seconds, then remove the taint using the command `oc adm taint node <nodename> key1:NoExecute-`
      8. Verify that the pod is still running on the node.
      9. Taint the node again using the command `oc adm taint node <nodename> key1=value1:NoExecute`
      10. Verify that the pod is still running. It should be removed 120 seconds after the taint is re-applied, because `tolerationSeconds` is set to 120.
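
      The taint sequence in steps 6, 7 and 9 can be scripted so that the timing is repeatable. The following is only a minimal sketch: the helper names `taint_commands` and `run_repro` are hypothetical, the `oc` invocations are copied from the steps above, and the pause between removing and re-adding the taint is assumed to be zero because the steps do not specify one. The pod checks in steps 5, 8 and 10 are not automated here.

```python
import subprocess
import time

TAINT = "key1=value1:NoExecute"   # taint applied in steps 6 and 9
TAINT_REMOVE = "key1:NoExecute-"  # removal form used in step 7


def taint_commands(node):
    """Return (delay_before_seconds, argv) pairs for steps 6, 7 and 9."""
    return [
        (0, ["oc", "adm", "taint", "node", node, TAINT]),
        (60, ["oc", "adm", "taint", "node", node, TAINT_REMOVE]),
        # Assumption: no pause before re-tainting; the steps give none.
        (0, ["oc", "adm", "taint", "node", node, TAINT]),
    ]


def run_repro(node, runner=subprocess.run, sleep=time.sleep):
    """Run the taint sequence against a node; runner and sleep are injectable."""
    for delay, argv in taint_commands(node):
        sleep(delay)
        runner(argv, check=True)
```

      Calling `run_repro("<nodename>")` on a logged-in cluster would issue the three commands in order. Because the issue is intermittent, a single run may not trigger it.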

      Actual results:

      Deletion of the pod is initiated, but we see that `TaintManagerEviction` cancels the pod deletion.
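
      One way to confirm this symptom is to look for cancellation events in the pod's namespace. The sketch below filters the parsed JSON from `oc get events -n test -o json`. It assumes that the cancellation is reported as an event with reason `TaintManagerEviction` and a message starting with "Cancelling deletion". The helper name `cancelled_evictions` is hypothetical, and the sample data is illustrative only, not taken from the logs attached to this issue.

```python
def cancelled_evictions(events):
    """Return messages of TaintManagerEviction events that cancel a deletion."""
    return [
        event.get("message", "")
        for event in events.get("items", [])
        if event.get("reason") == "TaintManagerEviction"
        and event.get("message", "").startswith("Cancelling deletion")
    ]


# Illustrative sample shaped like `oc get events -o json` output (not real logs).
sample = {
    "items": [
        {"reason": "TaintManagerEviction",
         "message": "Marking for deletion Pod test/tolerationseconds-1"},
        {"reason": "TaintManagerEviction",
         "message": "Cancelling deletion of Pod test/tolerationseconds-1"},
        {"reason": "Pulled", "message": "Container image already present"},
    ]
}

print(cancelled_evictions(sample))
```

      A cancellation message that appears after the second taint was applied would match the behavior reported here.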

      Expected results:

      `TaintManagerEviction` should not cancel the pod deletion as the pod has been marked for eviction.
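
      The expected timing can be illustrated with a toy model of a per-pod eviction timer. This is a simplified sketch, not the taint manager's actual code: the class `EvictionTimers` and its methods are hypothetical, and time is passed in explicitly as seconds. It assumes the second taint is applied immediately after the first is removed at t=60.

```python
class EvictionTimers:
    """Toy model of per-pod eviction timers (not the real taint manager)."""

    def __init__(self):
        self._deadlines = {}  # pod name -> absolute deletion time in seconds

    def taint_added(self, pod, now, toleration_seconds):
        # Start a countdown unless one is already pending for this pod.
        self._deadlines.setdefault(pod, now + toleration_seconds)

    def taint_removed(self, pod):
        # Removing the taint cancels any pending deletion.
        self._deadlines.pop(pod, None)

    def due(self, now):
        # Return and clear the pods whose deadline has passed.
        evicted = sorted(p for p, t in self._deadlines.items() if t <= now)
        for p in evicted:
            del self._deadlines[p]
        return evicted


timers = EvictionTimers()
pod = "tolerationseconds-1"
timers.taint_added(pod, now=0, toleration_seconds=120)   # step 6
timers.taint_removed(pod)                                # step 7, at t=60
timers.taint_added(pod, now=60, toleration_seconds=120)  # step 9
print(timers.due(now=120))  # [] (the first countdown was cancelled)
print(timers.due(now=180))  # ['tolerationseconds-1']
```

      In this model the cancellation from step 7 only affects the first countdown. The reported behavior looks as if a cancellation is applied to the countdown started in step 9 instead.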

      Additional info:

      Logs (OCP 4.14 cluster): https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-nutanix-ipi-compact-f28-destructive/1710907965994700800/artifacts/nutanix-ipi-compact-f28-destructive/openshift-extended-test-disruptive/build-log.txt

      Must-gather.tar (OCP 4.14 cluster): https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-nutanix-ipi-compact-f28-destructive/1710907965994700800/artifacts/nutanix-ipi-compact-f28-destructive/gather-must-gather/artifacts/must-gather.tar

      Logs (OCP 4.12 cluster): https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-azure-mag-ipi-fullyprivate-f28-destructive/1702903496543571968/artifacts/azure-mag-ipi-fullyprivate-f28-destructive/openshift-extended-test-disruptive/build-log.txt

      Must-gather log (OCP 4.12 cluster): https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.12-amd64-nightly-azure-mag-ipi-fullyprivate-f28-destructive/1702903496543571968/artifacts/azure-mag-ipi-fullyprivate-f28-destructive/gather-must-gather/artifacts/must-gather/must-gather.log

            fkrepins@redhat.com Filip Krepinsky
            knarra@redhat.com Rama Kasturi Narra
            Rama Kasturi Narra