OpenShift Bugs / OCPBUGS-11470

[Reliability] delete ns, pvc under the ns stuck in Terminating, failed to open volume data file


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Normal
    • Affects Version/s: 4.13, 4.12.0
    • Component/s: Storage / Kubernetes
    • Labels: Quality / Stability / Reliability
    • Severity: Moderate

      Description of problem:

      On an AWS OVN cluster, run the reliability test
      https://github.com/openshift/svt/tree/master/reliability-v2.
      During the test, namespaces are deleted while pods are still running under them. After such a deletion, a pod under the namespace can get stuck in Terminating status, and so does the namespace itself.
      
      kubelet logs:
      Apr 04 06:01:37 ip-10-0-184-246 kubenswrapper[2120]: I0404 06:01:37.155797    2120 kubelet.go:2235] "SyncLoop DELETE" source="api" pods="[testuser-9-0/mysql-1-lzbl7]"
      
      Apr 04 06:01:37 ip-10-0-184-246 kubenswrapper[2120]: I0404 06:01:37.156035    2120 kuberuntime_container.go:709] "Killing container with a grace period" pod="testuser-9-0/mysql-1-lzbl7" podUID=d369b3ba-7590-4c53-ba9b-f44a254d7ac9 containerName="mysql" containerID="cri-o://d3ff7d7d668a6b81d3dcd078730cf80b06db881210f826fc6d67b09d8fb9bb07" gracePeriod=30
      
      Apr 04 06:01:37 ip-10-0-184-246 kubenswrapper[2120]: E0404 06:01:37.228737    2120 event.go:267] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"mysql-1-lzbl7.1752a5f860bbc339", GenerateName:"", Namespace:"testuser-9-0", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"testuser-9-0", Name:"mysql-1-lzbl7", UID:"d369b3ba-7590-4c53-ba9b-f44a254d7ac9", APIVersion:"v1", ResourceVersion:"9324243", FieldPath:"spec.containers{mysql}"}, Reason:"Killing", Message:"Stopping container mysql", Source:v1.EventSource{Component:"kubelet", Host:"ip-10-0-184-246.us-east-2.compute.internal"}, FirstTimestamp:time.Date(2023, time.April, 4, 6, 1, 37, 156006713, time.Local), LastTimestamp:time.Date(2023, time.April, 4, 6, 1, 37, 156006713, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "mysql-1-lzbl7.1752a5f860bbc339" is forbidden: unable to create new content in namespace testuser-9-0 because it is being terminated' (will not retry!)

      Version-Release number of selected component (if applicable):

      Server Version: 4.13.0-rc.2
      Kubernetes Version: v1.26.2+dc93b13
      
      and
      
      Server Version: 4.12.0
      Kubernetes Version: v1.25.4+77bec7a

      How reproducible:

      Intermittent; it does not happen on all namespaces.
      It happened on about 1 to 10 namespaces during a 7-day run that included 10k+ namespace creations/deletions in total.
      Running 'oc delete po --force' successfully deletes the Terminating pod, after which the namespace is deleted too.
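      The force-delete workaround can be sketched as a tiny helper; the namespace/pod names in the example are the stuck ones from the outputs below, and the helper itself is hypothetical (not part of the reliability harness):

```shell
# Hypothetical helper: print the force-delete command for a pod stuck in
# Terminating. --force --grace-period=0 skips graceful container shutdown,
# so use it only as a last-resort cleanup.
force_delete_cmd() {
  ns="$1"; pod="$2"
  printf 'oc delete pod %s -n %s --force --grace-period=0\n' "$pod" "$ns"
}

# Print the command for one of the stuck pods; pipe to `sh` to run it.
force_delete_cmd testuser-0-1 postgresql-1-p4h4z
```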

      Steps to Reproduce:

      1. Install an AWS OVN cluster 
      2. Run reliability test 
      https://github.com/openshift/svt/tree/master/reliability-v2 
      3. Monitor whether any namespace cannot be deleted
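      Step 3 can be automated with a small filter over `oc get ns`; the only assumption is the default NAME/STATUS/AGE column order of `oc get ns --no-headers`:

```shell
# Print the names of namespaces whose STATUS column is Terminating.
list_terminating_ns() {
  awk '$2 == "Terminating" { print $1 }'
}

# Live cluster usage: oc get ns --no-headers | list_terminating_ns
# Demo on captured output from this bug:
printf '%s\n' \
  'testuser-0-1    Terminating   38h' \
  'testuser-11-0   Terminating   10h' \
  'default         Active        5d23h' | list_terminating_ns
```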
      

      Actual results:

      After deleting the namespace, the pod under it is stuck in Terminating status, and so is the namespace.
      
      % oc get ns | grep test | grep Terminating
      testuser-0-1                                       Terminating   38h
      testuser-11-0                                      Terminating   10h 
      
      Pods stuck in Terminating:
      % oc get po -n testuser-0-1 -o wide
      NAME                 READY   STATUS        RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
      postgresql-1-p4h4z   0/1     Terminating   0          38h   10.131.1.20   ip-10-0-219-79.us-east-2.compute.internal   <none>           <none>

      % oc get po -n testuser-11-0 -o wide
      NAME               READY   STATUS        RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
      database-1-9r9bl   0/1     Terminating   0          10h   10.129.2.91   ip-10-0-141-18.us-east-2.compute.internal   <none>           <none>
      
      PVCs stuck in Terminating:
      % oc get pvc -n testuser-0-1
      NAME         STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      postgresql   Terminating   pvc-386b4090-be64-4233-ae68-98030cdc8790   1Gi        RWO            gp3-csi        3d9h
      
      % oc get pvc -n testuser-11-0
      NAME       STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      database   Terminating   pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15   1Gi        RWO            gp3-csi        2d5h
      
      % oc get ns testuser-11-0 -o yaml
      apiVersion: v1
      kind: Namespace
      metadata:
        annotations:
          openshift.io/description: ""
          openshift.io/display-name: ""
          openshift.io/requester: testuser-11
          openshift.io/sa.scc.mcs: s0:c112,c39
          openshift.io/sa.scc.supplemental-groups: 1012510000/10000
          openshift.io/sa.scc.uid-range: 1012510000/10000
        creationTimestamp: "2023-04-09T23:57:00Z"
        deletionTimestamp: "2023-04-10T00:11:58Z"
        labels:
          kubernetes.io/metadata.name: testuser-11-0
          pod-security.kubernetes.io/audit: restricted
          pod-security.kubernetes.io/audit-version: v1.24
          pod-security.kubernetes.io/warn: restricted
          pod-security.kubernetes.io/warn-version: v1.24
          purpose: reliability
        name: testuser-11-0
        resourceVersion: "4694383"
        uid: 72df3f0d-b10f-4df6-83fc-cf652f24610e
      spec:
        finalizers:
        - kubernetes
      status:
        conditions:
        - lastTransitionTime: "2023-04-10T00:12:05Z"
          message: All resources successfully discovered
          reason: ResourcesDiscovered
          status: "False"
          type: NamespaceDeletionDiscoveryFailure
        - lastTransitionTime: "2023-04-10T00:12:05Z"
          message: All legacy kube types successfully parsed
          reason: ParsedGroupVersions
          status: "False"
          type: NamespaceDeletionGroupVersionParsingFailure
        - lastTransitionTime: "2023-04-10T00:12:34Z"
          message: 'Failed to delete all resource types, 1 remaining: unexpected items still
            remain in namespace: testuser-11-0 for gvr: /v1, Resource=pods'
          reason: ContentDeletionFailed
          status: "True"
          type: NamespaceDeletionContentFailure
        - lastTransitionTime: "2023-04-10T00:12:05Z"
          message: 'Some resources are remaining: persistentvolumeclaims. has 1 resource
            instances, pods. has 1 resource instances'
          reason: SomeResourcesRemain
          status: "True"
          type: NamespaceContentRemaining
        - lastTransitionTime: "2023-04-10T00:12:05Z"
          message: 'Some content in the namespace has finalizers remaining: kubernetes.io/pvc-protection
            in 1 resource instances'
          reason: SomeFinalizersRemain
          status: "True"
          type: NamespaceFinalizersRemaining
        phase: Terminating 
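      The conditions blocking the deletion can be pulled out of the YAML above with a plain awk filter (a sketch; it assumes only the `status:`/`type:` field layout shown, no jq or jsonpath required):

```shell
# Print the condition types whose status is "True" from
# `oc get ns <name> -o yaml` output.
true_conditions() {
  awk '
    /- lastTransitionTime:/    { t = ""; s = "" }   # start of a condition
    $1 == "status:" && NF == 2 { s = $2 }
    $1 == "type:"              { t = $2 }
    t != "" && s == "\"True\"" { print t; t = ""; s = "" }
  '
}

# Live cluster usage: oc get ns testuser-11-0 -o yaml | true_conditions
# Demo on a two-condition excerpt from the namespace above:
true_conditions <<'EOF'
  - lastTransitionTime: "2023-04-10T00:12:05Z"
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2023-04-10T00:12:05Z"
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
EOF
# prints: NamespaceContentRemaining
```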
      
      Kubelet log for the PVC pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15:
      
      Apr 12 05:53:06 ip-10-0-141-18 kubenswrapper[2122]: E0412 05:53:06.149497    2122 reconciler_common.go:166] "operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume \"database-data\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0bc8f9566de339aeb\") pod \"3316571f-1c19-45b7-9337-299cac8e4fdf\" (UID: \"3316571f-1c19-45b7-9337-299cac8e4fdf\") : UnmountVolume.NewUnmounter failed for volume \"database-data\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0bc8f9566de339aeb\") pod \"3316571f-1c19-45b7-9337-299cac8e4fdf\" (UID: \"3316571f-1c19-45b7-9337-299cac8e4fdf\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/vol_data.json]: open /var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/vol_data.json: no such file or directory" err="UnmountVolume.NewUnmounter failed for volume \"database-data\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0bc8f9566de339aeb\") pod \"3316571f-1c19-45b7-9337-299cac8e4fdf\" (UID: \"3316571f-1c19-45b7-9337-299cac8e4fdf\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/vol_data.json]: open /var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/vol_data.json: no such file or directory"
      
      Checking on the node: nothing exists under the "kubernetes.io~csi" folder.
      sh-5.1# ls /var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/
      ls: cannot access '/var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/': No such file or directory
      
      sh-5.1# ls /var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi
      
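      A node-side sweep for this partially-cleaned state can be sketched as below; /var/lib/kubelet is the default kubelet root (assumption: not relocated). Note it only catches volume directories that still exist but have lost vol_data.json; in this instance the PVC directory itself was already gone:

```shell
# Report CSI volume directories under the kubelet root that are missing
# vol_data.json, the file the unmounter failed to open in the log above.
check_vol_data() {
  root="${1:-/var/lib/kubelet}"
  for dir in "$root"/pods/*/volumes/kubernetes.io~csi/*/; do
    [ -d "$dir" ] || continue                 # glob matched nothing
    [ -f "${dir}vol_data.json" ] || echo "missing vol_data.json: $dir"
  done
}

check_vol_data /var/lib/kubelet
```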
      
      % oc get csidriver
      NAME              ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
      ebs.csi.aws.com   true             false            false             <unset>         false               Persistent   5d3h

      Expected results:

      No pod or PVC should get stuck in Terminating status when the namespace is deleted directly.

      Additional info:

      I opened a similar bug before, and that one was fixed: https://bugzilla.redhat.com/show_bug.cgi?id=2038780

      This bug is similar to 

      https://github.com/kubernetes/kubernetes/issues/116847 

      I added a comment 

      https://github.com/kubernetes/kubernetes/issues/116847#issuecomment-1504736538 

              Assignee: Hemant Kumar (hekumar@redhat.com)
              Reporter: Qiujie Li (rhn-support-qili)