Bug
Resolution: Cannot Reproduce
Normal
Affects Versions: 4.13, 4.12.0
Quality / Stability / Reliability
Description of problem:
On an AWS OVN cluster, run the reliability test https://github.com/openshift/svt/tree/master/reliability-v2. During the test, namespaces are deleted together with the pods under them. After a namespace is deleted, a pod under it gets stuck in Terminating status, and so does the namespace.

kubelet logs:

Apr 04 06:01:37 ip-10-0-184-246 kubenswrapper[2120]: I0404 06:01:37.155797 2120 kubelet.go:2235] "SyncLoop DELETE" source="api" pods="[testuser-9-0/mysql-1-lzbl7]"
Apr 04 06:01:37 ip-10-0-184-246 kubenswrapper[2120]: I0404 06:01:37.156035 2120 kuberuntime_container.go:709] "Killing container with a grace period" pod="testuser-9-0/mysql-1-lzbl7" podUID=d369b3ba-7590-4c53-ba9b-f44a254d7ac9 containerName="mysql" containerID="cri-o://d3ff7d7d668a6b81d3dcd078730cf80b06db881210f826fc6d67b09d8fb9bb07" gracePeriod=30
Apr 04 06:01:37 ip-10-0-184-246 kubenswrapper[2120]: E0404 06:01:37.228737 2120 event.go:267] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"mysql-1-lzbl7.1752a5f860bbc339", GenerateName:"", Namespace:"testuser-9-0", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"testuser-9-0", Name:"mysql-1-lzbl7", UID:"d369b3ba-7590-4c53-ba9b-f44a254d7ac9", APIVersion:"v1", ResourceVersion:"9324243", FieldPath:"spec.containers{mysql}"}, Reason:"Killing", Message:"Stopping container mysql", Source:v1.EventSource{Component:"kubelet", Host:"ip-10-0-184-246.us-east-2.compute.internal"}, FirstTimestamp:time.Date(2023, time.April, 4, 6, 1, 37, 156006713, time.Local), LastTimestamp:time.Date(2023, time.April, 4, 6, 1, 37, 156006713, time.Local), Count:1, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "mysql-1-lzbl7.1752a5f860bbc339" is forbidden: unable to create new content in namespace testuser-9-0 because it is being terminated' (will not retry!)
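One way to pull the relevant kubelet log lines from the affected node is sketched below; the node and pod names are the ones from this report and would differ in another reproduction:

# Sketch: dump the kubelet journal from the node that hosted the stuck pod and filter for it
oc adm node-logs ip-10-0-184-246.us-east-2.compute.internal -u kubelet | grep mysql-1-lzbl7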
Version-Release number of selected component (if applicable):
Server Version: 4.13.0-rc.2, Kubernetes Version: v1.26.2+dc93b13
Server Version: 4.12.0, Kubernetes Version: v1.25.4+77bec7a
How reproducible:
Intermittently; it does not happen on all namespaces. It occurred on roughly 1 to 10 namespaces during a 7-day run that performed 10k+ namespace creations/deletions in total. Running 'oc delete po --force' successfully deletes the Terminating pod, after which the namespace is deleted as well (see the sketch below).
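A minimal sketch of that force-delete workaround, using the pod and namespace names that appear later in this report (substitute whichever pod is actually stuck):

# Force-delete the stuck pod; the namespace deletion then completes on its own
oc delete pod postgresql-1-p4h4z -n testuser-0-1 --grace-period=0 --force
# Confirm the namespace is gone
oc get ns testuser-0-1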
Steps to Reproduce:
1. Install an AWS OVN cluster.
2. Run the reliability test https://github.com/openshift/svt/tree/master/reliability-v2.
3. Monitor whether any namespace cannot be deleted (a monitoring sketch follows this list).
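One way to monitor step 3, mirroring the commands used later in this report (the grep pattern assumes the test namespaces are prefixed with "test"):

# List namespaces that are stuck in Terminating
oc get ns | grep test | grep Terminating
# For any stuck namespace, list the pods and PVCs that are still present
oc get po,pvc -n <stuck-namespace> -o wide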
Actual results:
After deleting the namespaces, the pods under them are stuck in Terminating status, and so are the namespaces.

% oc get ns | grep test | grep Terminating
testuser-0-1    Terminating   38h
testuser-11-0   Terminating   10h

Pods stuck in Terminating:

% oc get po -n testuser-0-1 -o wide
NAME                 READY   STATUS        RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
postgresql-1-p4h4z   0/1     Terminating   0          38h   10.131.1.20   ip-10-0-219-79.us-east-2.compute.internal   <none>           <none>

% oc get po -n testuser-11-0 -o wide
NAME               READY   STATUS        RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
database-1-9r9bl   0/1     Terminating   0          10h   10.129.2.91   ip-10-0-141-18.us-east-2.compute.internal   <none>           <none>

PVCs stuck in Terminating:

% oc get pvc -n testuser-0-1
NAME         STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
postgresql   Terminating   pvc-386b4090-be64-4233-ae68-98030cdc8790   1Gi        RWO            gp3-csi        3d9h

% oc get pvc -n testuser-11-0
NAME       STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
database   Terminating   pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15   1Gi        RWO            gp3-csi        2d5h

% oc get ns testuser-11-0 -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: testuser-11
    openshift.io/sa.scc.mcs: s0:c112,c39
    openshift.io/sa.scc.supplemental-groups: 1012510000/10000
    openshift.io/sa.scc.uid-range: 1012510000/10000
  creationTimestamp: "2023-04-09T23:57:00Z"
  deletionTimestamp: "2023-04-10T00:11:58Z"
  labels:
    kubernetes.io/metadata.name: testuser-11-0
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: v1.24
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.24
    purpose: reliability
  name: testuser-11-0
  resourceVersion: "4694383"
  uid: 72df3f0d-b10f-4df6-83fc-cf652f24610e
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2023-04-10T00:12:05Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2023-04-10T00:12:05Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2023-04-10T00:12:34Z"
    message: 'Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: testuser-11-0 for gvr: /v1, Resource=pods'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2023-04-10T00:12:05Z"
    message: 'Some resources are remaining: persistentvolumeclaims. has 1 resource instances, pods. has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2023-04-10T00:12:05Z"
    message: 'Some content in the namespace has finalizers remaining: kubernetes.io/pvc-protection in 1 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating

kubelet log for the PVC pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15:

Apr 12 05:53:06 ip-10-0-141-18 kubenswrapper[2122]: E0412 05:53:06.149497 2122 reconciler_common.go:166] "operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume \"database-data\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0bc8f9566de339aeb\") pod \"3316571f-1c19-45b7-9337-299cac8e4fdf\" (UID: \"3316571f-1c19-45b7-9337-299cac8e4fdf\") : UnmountVolume.NewUnmounter failed for volume \"database-data\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0bc8f9566de339aeb\") pod \"3316571f-1c19-45b7-9337-299cac8e4fdf\" (UID: \"3316571f-1c19-45b7-9337-299cac8e4fdf\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/vol_data.json]: open /var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/vol_data.json: no such file or directory" err="UnmountVolume.NewUnmounter failed for volume \"database-data\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0bc8f9566de339aeb\") pod \"3316571f-1c19-45b7-9337-299cac8e4fdf\" (UID: \"3316571f-1c19-45b7-9337-299cac8e4fdf\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/vol_data.json]: open /var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/vol_data.json: no such file or directory"

Checking on the node, nothing is left under the "kubernetes.io~csi" folder:

sh-5.1# ls /var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/
ls: cannot access '/var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/pvc-6db43f60-2db3-4d92-a9f9-04fe711dea15/': No such file or directory
sh-5.1# ls /var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi

% oc get csidriver
NAME              ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
ebs.csi.aws.com   true             false            false             <unset>         false               Persistent   5d3h
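For reference, the node-level check above can be repeated with oc debug; the node name and pod UID below are the ones from this report and would differ in another reproduction:

# Open a debug shell on the node that hosted the stuck pod
oc debug node/ip-10-0-141-18.us-east-2.compute.internal
# Inside the debug pod, switch to the host filesystem and inspect the CSI volume directory
chroot /host
ls /var/lib/kubelet/pods/3316571f-1c19-45b7-9337-299cac8e4fdf/volumes/kubernetes.io~csi/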
Expected results:
No pods or PVCs should be left stuck in Terminating status when the namespace is deleted directly.
Additional info:
I opened a similar bug before, and it was fixed: https://bugzilla.redhat.com/show_bug.cgi?id=2038780
This bug is similar to https://github.com/kubernetes/kubernetes/issues/116847
I added a comment there: https://github.com/kubernetes/kubernetes/issues/116847#issuecomment-1504736538