OpenShift Bugs / OCPBUGS-14006

Pod stuck in terminating state because of PVC unmount error

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • Affects Version/s: 4.12.z
    • Component/s: Storage / Kubernetes

    Description

      Description of problem:

      As part of a CI run, the command oc delete namespace psapuser95 got stuck for more than 4 hours, because of a Pod stuck in the Terminating state:

      > psapuser95                                         psapuser95-0                                                               0/2     Terminating   0               4h4m    10.130.14.11   ip-10-0-137-26.us-west-2.compute.internal    <none>   
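
      To confirm that the namespace deletion was blocked by this Pod (and not, say, by another resource's finalizer), the namespace termination conditions can be inspected. A minimal sketch, assuming the same kubeconfig/cluster access as the CI run:

        # Show the namespace's termination conditions; a stuck namespace reports
        # which resources and finalizers are still pending.
        oc get namespace psapuser95 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.message}{"\n"}{end}'

        # List the Pods still present in the namespace.
        oc get pods -n psapuser95 -o wide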

      In this node's journal, I can see this error printed once:

      May 23 01:49:14.085337 ip-10-0-137-26 kubenswrapper[1450]: E0523 01:49:14.085317    1450 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-0bfe06e4e80bff5f6 podName:755344ed-a210-4f52-89b9-df22edf30fbd nodeName:}" failed. No retries permitted until 2023-05-23 01:49:14.585294249 +0000 UTC m=+4675.366414088 (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "psapuser95" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-0bfe06e4e80bff5f6") pod "755344ed-a210-4f52-89b9-df22edf30fbd" (UID: "755344ed-a210-4f52-89b9-df22edf30fbd") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/755344ed-a210-4f52-89b9-df22edf30fbd/volumes/kubernetes.io~csi/pvc-e59644eb-9c41-4294-9d19-b366b0eb7e4e/mount": unmount failed: exit status 32 

      and this message is printed hundreds of thousands of times:

      May 23 03:28:27.202330 ip-10-0-137-26 kubenswrapper[1450]: E0523 03:28:27.202294    1450 reconciler.go:208] "operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume \"psapuser95\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0bfe06e4e80bff5f6\") pod \"755344ed-a210-4f52-89b9-df22edf30fbd\" (UID: \"755344ed-a210-4f52-89b9-df22edf30fbd\") : UnmountVolume.NewUnmounter failed for volume \"psapuser95\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0bfe06e4e80bff5f6\") pod \"755344ed-a210-4f52-89b9-df22edf30fbd\" (UID: \"755344ed-a210-4f52-89b9-df22edf30fbd\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/755344ed-a210-4f52-89b9-df22edf30fbd/volumes/kubernetes.io~csi/pvc-e59644eb-9c41-4294-9d19-b366b0eb7e4e/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/755344ed-a210-4f52-89b9-df22edf30fbd/volumes/kubernetes.io~csi/pvc-e59644eb-9c41-4294-9d19-b366b0eb7e4e/vol_data.json]: open /var/lib/kubelet/pods/755344ed-a210-4f52-89b9-df22edf30fbd/volumes/kubernetes.io~csi/pvc-e59644eb-9c41-4294-9d19-b366b0eb7e4e/vol_data.json: no such file or directory" err="UnmountVolume.NewUnmounter failed for volume \"psapuser95\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0bfe06e4e80bff5f6\") pod \"755344ed-a210-4f52-89b9-df22edf30fbd\" (UID: \"755344ed-a210-4f52-89b9-df22edf30fbd\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/755344ed-a210-4f52-89b9-df22edf30fbd/volumes/kubernetes.io~csi/pvc-e59644eb-9c41-4294-9d19-b366b0eb7e4e/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/755344ed-a210-4f52-89b9-df22edf30fbd/volumes/kubernetes.io~csi/pvc-e59644eb-9c41-4294-9d19-b366b0eb7e4e/vol_data.json]: open /var/lib/kubelet/pods/755344ed-a210-4f52-89b9-df22edf30fbd/volumes/kubernetes.io~csi/pvc-e59644eb-9c41-4294-9d19-b366b0eb7e4e/vol_data.json: no such file or directory" 
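
      For future occurrences, the state of the volume on the node can be checked directly. A minimal sketch, assuming debug access to the node; the node name, pod UID, volume ID and paths below are taken from the messages above:

        # Open a debug shell on the affected node and switch to the host filesystem.
        oc debug node/ip-10-0-137-26.us-west-2.compute.internal
        chroot /host

        # Is the CSI volume still mounted? ("unmount failed: exit status 32" from
        # umount typically means the target is busy or no longer mounted.)
        findmnt /var/lib/kubelet/pods/755344ed-a210-4f52-89b9-df22edf30fbd/volumes/kubernetes.io~csi/pvc-e59644eb-9c41-4294-9d19-b366b0eb7e4e/mount

        # Does the vol_data.json file the kubelet is trying to load still exist?
        ls -l /var/lib/kubelet/pods/755344ed-a210-4f52-89b9-df22edf30fbd/volumes/kubernetes.io~csi/pvc-e59644eb-9c41-4294-9d19-b366b0eb7e4e/

        # Kubelet-side history of operations on this EBS volume.
        journalctl -u kubelet | grep vol-0bfe06e4e80bff5f6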

      Version-Release number of selected component (if applicable):

      4.12.12

      How reproducible:

      Intermittent

      Steps to Reproduce:

      1. Create Pods with PVCs on AWS (default StorageClass); see the sketch below
      2. Delete the Pods
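
      A minimal sketch of these steps, assuming the cluster's default StorageClass is backed by the AWS EBS CSI driver; the namespace and object names (repro-ns, repro-pvc, repro-pod) are made up for illustration:

        # 1. Create a Pod with a PVC (default StorageClass).
        oc create namespace repro-ns
        oc apply -n repro-ns -f - <<'EOF'
        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: repro-pvc
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi
        ---
        apiVersion: v1
        kind: Pod
        metadata:
          name: repro-pod
        spec:
          containers:
          - name: main
            image: registry.access.redhat.com/ubi9/ubi
            command: ["sleep", "infinity"]
            volumeMounts:
            - name: data
              mountPath: /data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: repro-pvc
        EOF

        # 2. Delete the Pods (here via the namespace, as in the CI run) and watch
        #    for Pods stuck in Terminating.
        oc delete namespace repro-ns --wait=false
        oc get pods -n repro-ns -w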
      

      Actual results:

      The Pod stays stuck in the Terminating state forever.

      Expected results:

      The Pod gets deleted after a few seconds

      Additional info:

      The directory below contains various information about the state of the cluster and of the nodes.
      The 'must-gather' was *not* collected.

      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-psap_ci-artifacts/707/pull-ci-openshift-psap-ci-artifacts-main-ods-notebooks-long/1660794851358150656/artifacts/notebooks-long/destroy-clusters/artifacts/sutest__gather-extra/

    People

      Assignee: Unassigned
      Reporter: Kevin Pouget (kpouget2)
      QA Contact: Sunil Choudhary