Bug
Resolution: Unresolved
Normal
4.18.z
Description of problem:
Customer encounters pods stuck terminating with the following error. When listing mounts on the host, the mount point does NOT show as mounted and no processes indicate use of the path. However, /proc/self/mountinfo does list the mount point. When we strace cri-o, it receives -1 EINVAL on the umount2() call for the relevant path.
Jan 24 06:51:04 worker crio[4437]: time="2026-01-24 06:51:04.487574617Z" level=warning msg="Failed to unmount container 9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f: replacing mount point \"/var/lib/containers/storage/overlay/804f846ce959cc4fffeb930421231dc2298146fa8721732101b9870aee1217e5/merged\": device or resource busy" id=863adbef-0903-4953-a16c-d9018a970671 name=/runtime.v1.RuntimeService/StopPodSandbox
Jan 24 06:51:04 worker kubenswrapper[4476]: E0124 06:51:04.487776 4476 log.go:32] "StopPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to stop infra container for pod sandbox 9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f: failed to unmount container 9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f: replacing mount point \"/var/lib/containers/storage/overlay/804f846ce959cc4fffeb930421231dc2298146fa8721732101b9870aee1217e5/merged\": device or resource busy" podSandboxID="9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f"
Jan 24 06:51:04 worker kubenswrapper[4476]: E0124 06:51:04.487830 4476 kuberuntime_manager.go:1479] "Failed to stop sandbox" podSandboxID={"Type":"cri-o","ID":"9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f"}
Jan 24 06:51:04 worker kubenswrapper[4476]: E0124 06:51:04.487877 4476 kubelet.go:2041] "Unhandled Error" err="failed to \"KillPodSandbox\" for \"0f025850-3739-43c8-9e28-74c4bd792148\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to stop infra container for pod sandbox 9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f: failed to unmount container 9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f: replacing mount point \\\"/var/lib/containers/storage/overlay/804f846ce959cc4fffeb930421231dc2298146fa8721732101b9870aee1217e5/merged\\\": device or resource busy\"" logger="UnhandledError"
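For reference, a minimal Go diagnostic sketch (not part of the customer's tooling; it assumes golang.org/x/sys/unix is available and that the "merged" path from the error above is passed as an argument). It checks whether the path is listed in the calling process's /proc/self/mountinfo and which errno umount2() returns, mirroring the EINVAL-vs-EBUSY discrepancy described above:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: mountcheck <merged-path>")
		os.Exit(1)
	}
	target := os.Args[1]

	// Is the path listed as a mount point (field 5) in this process's mountinfo?
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	listed := false
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // overlay entries can be long
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) > 4 && fields[4] == target {
			listed = true
			break
		}
	}
	fmt.Printf("listed in /proc/self/mountinfo: %v\n", listed)

	// Attempt the same umount2() the runtime makes. EINVAL means the kernel
	// does not consider the path a mount point in this mount namespace;
	// EBUSY means something still holds the mount.
	switch err := unix.Unmount(target, 0); err {
	case nil:
		fmt.Println("umount2 succeeded")
	case unix.EINVAL:
		fmt.Println("umount2 returned EINVAL (not a mount point here)")
	case unix.EBUSY:
		fmt.Println("umount2 returned EBUSY (mount still busy)")
	default:
		fmt.Printf("umount2 returned %v\n", err)
	}
}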
Version-Release number of selected component (if applicable):
OpenShift 4.18.24
cri-o-0-1.31.12-3.rhaos4.18.gitdc59c78
github.com/containers/storage v1.55.1-0.20250328180621-028e88d2378c
kernel-0-5.14.0-427.87.1.el9_4-x86_64
How reproducible:
Daily to weekly on two loaded clusters with StackRox collectors and Network Observability enabled. Sometimes happens on nodes that are only days old (elastic cloud environment).
Steps to Reproduce:
Unknown
Actual results:
Pods stuck in terminating state
Expected results:
Pods terminate and cleanup
Additional info:
This is a cluster running StackRox 4.8.4 and Network Observability, both of which I understand to exert additional pressure on kernel mount housekeeping. However, the customer has plenty of other clusters with those components that don't encounter this problem, so this may not be valuable debugging information. Container storage has to be reset and the node rebooted to clean things up; often they simply destroy the node, or a machine health check does so.

The following kernel warnings are seen in the logs, but we believe they are associated with debugging efforts rather than the root cause: they appear around the time a debug toolbox container is created, and they do not occur on all nodes where the problem is observed. No other kernel messages are logged that seem related to this.

Jan 24 07:49:34 worker kernel: overlayfs: upperdir is in-use as upperdir/workdir of another mount, accessing files from both mounts will result in undefined behavior.
Jan 24 07:49:34 worker kernel: overlayfs: workdir is in-use as upperdir/workdir of another mount, accessing files from both mounts will result in undefined behavior.
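As a side note on those overlayfs warnings, the sketch below (an illustration only, same assumption as above: run directly on the affected node) scans /proc/self/mountinfo for overlay mounts that share an upperdir=, which is the condition the kernel warns about when, for example, a debug toolbox container reuses another mount's upper/work directories:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	byUpper := map[string][]string{} // upperdir -> mount points using it
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // overlay option lists can be long
	for sc.Scan() {
		// mountinfo lines look like: "<ids...> <root> <mount point> ... - <fstype> <source> <super options>"
		pre, post, ok := strings.Cut(sc.Text(), " - ")
		if !ok || !strings.HasPrefix(post, "overlay ") {
			continue
		}
		preFields := strings.Fields(pre)
		postFields := strings.Fields(post)
		if len(preFields) < 5 || len(postFields) < 3 {
			continue
		}
		mountPoint := preFields[4]
		for _, opt := range strings.Split(postFields[2], ",") {
			if upper, found := strings.CutPrefix(opt, "upperdir="); found {
				byUpper[upper] = append(byUpper[upper], mountPoint)
			}
		}
	}
	// Report any upperdir referenced by more than one overlay mount.
	for upper, mounts := range byUpper {
		if len(mounts) > 1 {
			fmt.Printf("upperdir %s is shared by mounts: %v\n", upper, mounts)
		}
	}
}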
relates to: OCPBUGS-73733 Pod termination failed due to container storage unmount error (device or resource busy). (Closed)