Bug
Resolution: Unresolved
Normal
4.18.z
Description of problem:
Customer encounters pods stuck terminating with the following error. When listing mounts on the host, the mount point does NOT show as mounted and no processes indicate use of the path. However, /proc/self/mountinfo does list the mount point. When we strace cri-o, it receives -1 EINVAL on the umount2() call for the relevant path.
Jan 24 06:51:04 worker crio[4437]: time="2026-01-24 06:51:04.487574617Z" level=warning msg="Failed to unmount container 9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f: replacing mount point \"/var/lib/containers/storage/overlay/804f846ce959cc4fffeb930421231dc2298146fa8721732101b9870aee1217e5/merged\": device or resource busy" id=863adbef-0903-4953-a16c-d9018a970671 name=/runtime.v1.RuntimeService/StopPodSandbox
Jan 24 06:51:04 worker kubenswrapper[4476]: E0124 06:51:04.487776 4476 log.go:32] "StopPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to stop infra container for pod sandbox 9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f: failed to unmount container 9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f: replacing mount point \"/var/lib/containers/storage/overlay/804f846ce959cc4fffeb930421231dc2298146fa8721732101b9870aee1217e5/merged\": device or resource busy" podSandboxID="9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f"
Jan 24 06:51:04 worker kubenswrapper[4476]: E0124 06:51:04.487830 4476 kuberuntime_manager.go:1479] "Failed to stop sandbox" podSandboxID={"Type":"cri-o","ID":"9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f"}
Jan 24 06:51:04 worker kubenswrapper[4476]: E0124 06:51:04.487877 4476 kubelet.go:2041] "Unhandled Error" err="failed to \"KillPodSandbox\" for \"0f025850-3739-43c8-9e28-74c4bd792148\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to stop infra container for pod sandbox 9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f: failed to unmount container 9c634a49e087b3b7f6c48c8bf7fd6050d328bd5e985bd48e460de5743f98d78f: replacing mount point \\\"/var/lib/containers/storage/overlay/804f846ce959cc4fffeb930421231dc2298146fa8721732101b9870aee1217e5/merged\\\": device or resource busy\"" logger="UnhandledError"
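For reference, a minimal Go diagnostic sketch (not part of the customer's tooling; it assumes golang.org/x/sys/unix is available and that the "merged" path from the error above is passed as an argument). It checks whether the path is listed in the calling process's /proc/self/mountinfo and which errno umount2() returns, mirroring the EINVAL-vs-EBUSY discrepancy described above:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: mountcheck <merged-path>")
		os.Exit(1)
	}
	target := os.Args[1]

	// Is the path listed as a mount point (field 5) in this process's mountinfo?
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	listed := false
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // overlay entries can be long
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) > 4 && fields[4] == target {
			listed = true
			break
		}
	}
	fmt.Printf("listed in /proc/self/mountinfo: %v\n", listed)

	// Attempt the same umount2() the runtime makes. EINVAL means the kernel
	// does not consider the path a mount point in this mount namespace;
	// EBUSY means something still holds the mount.
	switch err := unix.Unmount(target, 0); err {
	case nil:
		fmt.Println("umount2 succeeded")
	case unix.EINVAL:
		fmt.Println("umount2 returned EINVAL (not a mount point here)")
	case unix.EBUSY:
		fmt.Println("umount2 returned EBUSY (mount still busy)")
	default:
		fmt.Printf("umount2 returned %v\n", err)
	}
}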
Version-Release number of selected component (if applicable):
OpenShift 4.18.24
cri-o-0-1.31.12-3.rhaos4.18.gitdc59c78
github.com/containers/storage v1.55.1-0.20250328180621-028e88d2378c
kernel-0-5.14.0-427.87.1.el9_4-x86_64
How reproducible:
Daily to weekly on two loaded clusters with StackRox collectors and Network Observability enabled. Sometimes happens on nodes that are only days old (elastic cloud environment).
Steps to Reproduce:
Unknown
Actual results:
Pods stuck in terminating state
Expected results:
Pods terminate and cleanup
Additional info:
This is a cluster running StackRox 4.8.4 and Network Observability, both of which I understand to exert additional pressure on kernel mount housekeeping. However, the customer has plenty of other clusters with those components that don't encounter this problem, so this may not be valuable debugging information. Container storage has to be reset and the node rebooted to clean things up; often they simply destroy the node, or a machine health check does so.

The following kernel warnings are seen in the logs, but we believe they are associated with debugging efforts rather than the root cause: they appear around the time a debug toolbox container is created, and they do not occur on all nodes where the problem is observed. No other kernel messages are logged that seem related to this.

Jan 24 07:49:34 worker kernel: overlayfs: upperdir is in-use as upperdir/workdir of another mount, accessing files from both mounts will result in undefined behavior.
Jan 24 07:49:34 worker kernel: overlayfs: workdir is in-use as upperdir/workdir of another mount, accessing files from both mounts will result in undefined behavior.
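As a side note on those overlayfs warnings, the sketch below (an illustration only, same assumption as above: run directly on the affected node) scans /proc/self/mountinfo for overlay mounts that share an upperdir=, which is the condition the kernel warns about when, for example, a debug toolbox container reuses another mount's upper/work directories:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	byUpper := map[string][]string{} // upperdir -> mount points using it
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // overlay option lists can be long
	for sc.Scan() {
		// mountinfo lines look like: "<ids...> <root> <mount point> ... - <fstype> <source> <super options>"
		pre, post, ok := strings.Cut(sc.Text(), " - ")
		if !ok || !strings.HasPrefix(post, "overlay ") {
			continue
		}
		preFields := strings.Fields(pre)
		postFields := strings.Fields(post)
		if len(preFields) < 5 || len(postFields) < 3 {
			continue
		}
		mountPoint := preFields[4]
		for _, opt := range strings.Split(postFields[2], ",") {
			if upper, found := strings.CutPrefix(opt, "upperdir="); found {
				byUpper[upper] = append(byUpper[upper], mountPoint)
			}
		}
	}
	// Report any upperdir referenced by more than one overlay mount.
	for upper, mounts := range byUpper {
		if len(mounts) > 1 {
			fmt.Printf("upperdir %s is shared by mounts: %v\n", upper, mounts)
		}
	}
}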
relates to: OCPBUGS-73733 Pod termination failed due to container storage unmount error (device or resource busy). (Closed)