-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
4.18.z
-
None
-
False
Description of problem:
During scheduled node drain operations (triggered by scaling a machine pool to zero), worker nodes frequently fail to complete the drain process. The node becomes permanently stuck in a Ready,SchedulingDisabled state and cannot be deleted. Analysis of node logs reveals a clear failure chain originating from a CSI volume unmount error (vol_data.json: no such file or directory), followed by container cleanup failures and cgroup cleanup timeouts. This behavior matches the upstream Kubernetes race condition bug #116847, where the kubelet's internal state becomes desynchronized after failing to unmount a CSI volume due to missing metadata, leading to a deadlock in the Pod and node termination sequence.
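For triage, affected nodes can be identified by grepping the kubelet journal for the two signatures shown under "Actual results" below; the node name here is just one of the examples from this report:
$ oc adm node-logs ip-10-136-9-40.ap-northeast-1.compute.internal -u kubelet \
    | grep -E 'vol_data.json: no such file or directory|Failed to delete cgroup paths'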
Version-Release number of selected component (if applicable):
ROSA HCP 4.18.25 (Kubernetes v1.31.13)
How reproducible:
Regularly during scheduled node drain operations (e.g., nightly cost-control scaling). The issue does not occur on every drain attempt, but each cycle it affects a seemingly random subset of nodes, which is consistent with a race condition.
Steps to Reproduce:
1. In a ROSA cluster, schedule a drain of worker nodes in a specific machine pool (e.g., by executing rosa edit machinepool -c $CLUSTER_NAME --enable-autoscaling=false --replicas=0 $POOL_NAME).
2. Monitor the node drain process and observe one or more nodes entering the SchedulingDisabled state (example monitoring commands follow this list).
3. After several minutes, observe that the affected node(s) remain in Ready,SchedulingDisabled status indefinitely and are not removed from the cluster.
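The monitoring in step 2 was done with standard watch commands along the lines of the following; the node name is a placeholder:
$ oc get nodes -w
$ oc get pods --all-namespaces --field-selector spec.nodeName=<problem-node-name> -o wide -w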
Actual results:
The node drain process hangs. Investigation of the affected node's logs (journalctl, kubelet) shows the following sequence:
1. CSI Unmount Failure: kubelet logs show UnmountVolume failed ... open ... /vol_data.json: no such file or directory for a Pod using an EBS CSI volume.
Feb 03 12:04:03.976507 ip-10-136-8-205 kubenswrapper[2426]: E0203 12:04:03.946590 2426 reconciler_common.go:156] "operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume \"files-storage\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0ea889142c34f3980\") pod \"0c6a643c-4565-4c41-802f-4bec8d38d563\" (UID: \"0c6a643c-4565-4c41-802f-4bec8d38d563\") : UnmountVolume.NewUnmounter failed for volume \"files-storage\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0ea889142c34f3980\") pod \"0c6a643c-4565-4c41-802f-4bec8d38d563\" (UID: \"0c6a643c-4565-4c41-802f-4bec8d38d563\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/0c6a643c-4565-4c41-802f-4bec8d38d563/volumes/kubernetes.io~csi/pvc-5398906b-9a70-45f0-a675-a6c54767bea8/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/0c6a643c-4565-4c41-802f-4bec8d38d563/volumes/kubernetes.io~csi/pvc-5398906b-9a70-45f0-a675-a6c54767bea8/vol_data.json]: open /var/lib/kubelet/pods/0c6a643c-4565-4c41-802f-4bec8d38d563/volumes/kubernetes.io~csi/pvc-5398906b-9a70-45f0-a675-a6c54767bea8/vol_data.json: no such file or directory" err="UnmountVolume.NewUnmounter failed for volume \"files-storage\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0ea889142c34f3980\") pod \"0c6a643c-4565-4c41-802f-4bec8d38d563\" (UID: \"0c6a643c-4565-4c41-802f-4bec8d38d563\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/0c6a643c-4565-4c41-802f-4bec8d38d563/volumes/kubernetes.io~csi/pvc-5398906b-9a70-45f0-a675-a6c54767bea8/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/0c6a643c-4565-4c41-802f-4bec8d38d563/volumes/kubernetes.io~csi/pvc-5398906b-9a70-45f0-a675-a6c54767bea8/vol_data.json]: open /var/lib/kubelet/pods/0c6a643c-4565-4c41-802f-4bec8d38d563/volumes/kubernetes.io~csi/pvc-5398906b-9a70-45f0-a675-a6c54767bea8/vol_data.json: no such file or directory"
Feb 06 12:03:57.761740 ip-10-136-9-40 kubenswrapper[2412]: E0206 12:03:57.744568 2412 reconciler_common.go:156] "operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume \"quarkus-test-data\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-085422e0ab77a6b82\") pod \"156c5f90-5d7a-4f35-833f-97eb126fb35a\" (UID: \"156c5f90-5d7a-4f35-833f-97eb126fb35a\") : UnmountVolume.NewUnmounter failed for volume \"quarkus-test-data\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-085422e0ab77a6b82\") pod \"156c5f90-5d7a-4f35-833f-97eb126fb35a\" (UID: \"156c5f90-5d7a-4f35-833f-97eb126fb35a\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/156c5f90-5d7a-4f35-833f-97eb126fb35a/volumes/kubernetes.io~csi/pvc-f8b68cb5-9086-489b-b763-f19d449f1d28/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/156c5f90-5d7a-4f35-833f-97eb126fb35a/volumes/kubernetes.io~csi/pvc-f8b68cb5-9086-489b-b763-f19d449f1d28/vol_data.json]: open /var/lib/kubelet/pods/156c5f90-5d7a-4f35-833f-97eb126fb35a/volumes/kubernetes.io~csi/pvc-f8b68cb5-9086-489b-b763-f19d449f1d28/vol_data.json: no such file or directory" err="UnmountVolume.NewUnmounter failed for volume \"quarkus-test-data\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-085422e0ab77a6b82\") pod \"156c5f90-5d7a-4f35-833f-97eb126fb35a\" (UID: \"156c5f90-5d7a-4f35-833f-97eb126fb35a\") : kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/156c5f90-5d7a-4f35-833f-97eb126fb35a/volumes/kubernetes.io~csi/pvc-f8b68cb5-9086-489b-b763-f19d449f1d28/mount]: kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/156c5f90-5d7a-4f35-833f-97eb126fb35a/volumes/kubernetes.io~csi/pvc-f8b68cb5-9086-489b-b763-f19d449f1d28/vol_data.json]: open /var/lib/kubelet/pods/156c5f90-5d7a-4f35-833f-97eb126fb35a/volumes/kubernetes.io~csi/pvc-f8b68cb5-9086-489b-b763-f19d449f1d28/vol_data.json: no such file or directory"
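The missing metadata file can be confirmed directly on the node with something like the following; the pod UID and PVC directory are taken from the log excerpt above and will differ per occurrence:
$ oc debug node/ip-10-136-9-40.ap-northeast-1.compute.internal -- chroot /host \
    ls -la /var/lib/kubelet/pods/156c5f90-5d7a-4f35-833f-97eb126fb35a/volumes/kubernetes.io~csi/pvc-f8b68cb5-9086-489b-b763-f19d449f1d28/
# expected to show the volume directory without a vol_data.json file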
2. State Contradiction: Immediately following the error, logs may show UnmountVolume.TearDown succeeded and Volume detached, indicating physical cleanup occurred but logical state is inconsistent.
Feb 06 12:03:57.761740 ip-10-136-9-40 kubenswrapper[2412]: I0206 12:03:57.755312 2412 operation_generator.go:803] UnmountVolume.TearDown succeeded for volume "kubernetes.io/csi/ebs.csi.aws.com^vol-085422e0ab77a6b82" (OuterVolumeSpecName: "quarkus-test-data") pod "156c5f90-5d7a-4f35-833f-97eb126fb35a" (UID: "156c5f90-5d7a-4f35-833f-97eb126fb35a"). InnerVolumeSpecName "pvc-f8b68cb5-9086-489b-b763-f19d449f1d28". PluginName "kubernetes.io/csi", VolumeGidValue ""
Feb 06 12:03:58.141442 ip-10-136-9-40 kubenswrapper[2412]: I0206 12:03:58.140776 2412 operation_generator.go:917] UnmountDevice succeeded for volume "pvc-f8b68cb5-9086-489b-b763-f19d449f1d28" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-085422e0ab77a6b82") on node "ip-10-136-9-40.ap-northeast-1.compute.internal"
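To cross-check the reported state against the API, checks along these lines were useful; the volume ID and node name are taken from the logs above:
$ oc get volumeattachments | grep vol-085422e0ab77a6b82
$ oc get node ip-10-136-9-40.ap-northeast-1.compute.internal -o jsonpath='{.status.volumesInUse}'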
3. Container Cleanup Failure: The container runtime (cri-o) repeatedly logs errors trying to kill non-existent containers:
Killing container ... failed: ... No such process. Feb 06 12:04:11.394410 ip-10-136-9-40 crio[2380]: time="2026-02-06 12:04:11.394282266Z" level=error msg="Killing container 39b8bc3f7f4bbc93ee94cfc9996d89a55e5d6f2a1c82a9baba1fe5f20110788c failed: `/usr/bin/crun --root /run/crun --systemd-cgroup kill 39b8bc3f7f4bbc93ee94cfc9996d89a55e5d6f2a1c82a9baba1fe5f20110788c KILL` failed: process not running: No such process\n : exit status 1"
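The container ID in these messages can be cross-checked against CRI-O's own view on the node, for example with crictl; the ID below is the one from the message above:
$ oc debug node/ip-10-136-9-40.ap-northeast-1.compute.internal -- chroot /host \
    crictl ps -a | grep 39b8bc3f7f4b
# checks whether CRI-O still tracks the container that kubelet keeps trying to kill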
4. Resource Cleanup Timeout: kubelet logs show Failed to delete cgroup paths ... Timed out while waiting for systemd to remove ...slice.
Feb 03 12:04:27.019493 ip-10-136-8-205 kubenswrapper[2426]: I0203 12:04:27.019412 2426 pod_container_manager_linux.go:210] "Failed to delete cgroup paths" cgroupName=["kubepods","besteffort","podddea8bb9-bddf-4f74-9beb-5294b09067b0"] err="unable to destroy cgroup paths for cgroup [kubepods besteffort podddea8bb9-bddf-4f74-9beb-5294b09067b0] : Timed out while waiting for systemd to remove kubepods-besteffort-podddea8bb9_bddf_4f74_9beb_5294b09067b0.slice"
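The leftover cgroup can be inspected on the node with systemd tooling; the slice name is the one from the message above:
$ oc debug node/ip-10-136-8-205.ap-northeast-1.compute.internal -- chroot /host \
    systemctl status kubepods-besteffort-podddea8bb9_bddf_4f74_9beb_5294b09067b0.slice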
5. Final State: The node remains in Ready,SchedulingDisabled. The associated Pod(s) may be stuck in Terminating. The drain process does not complete.
Expected results:
The node drain should complete successfully. All Pods should be gracefully evicted, all volumes unmounted, and the node should be removed from the cluster API. The node should transition out of the Ready state and be terminated in the cloud provider.
Additional info:
Example affected nodes:
2026/02/03 ip-10-136-8-205
2026/02/06 ip-10-136-9-40
Logs attached (collected with the following commands):
$ oc get nodes
$ oc get node <problem-node-name> -oyaml
$ oc describe node <problem-node-name>
$ oc get pods --all-namespaces --field-selector spec.nodeName=<problem-node-name> -o wide
$ oc get pdb -A
$ oc get events -A
$ rosa list machinepools -c <cluster-name>
$ rosa describe machinepool -c <cluster-name> <problem-machinepool-name>
$ oc adm node-logs ip-10-136-8-205.ap-northeast-1.compute.internal -u kubelet >> 9-node-logs.log
$ oc adm node-logs ip-10-136-9-40.ap-northeast-1.compute.internal --path=journal > node_journal.log