- Bug
- Resolution: Unresolved
- Normal
- 4.18
Description of problem:
A race condition exists between the Kubelet cgroup manager and CRI-O. When a new pod is created, the cgroup manager receives an inotify event for the new cgroup path before the container is fully registered in CRI-O. Kubelet attempts to query the container status, receives a 404, and marks the pod as failed internally. This prevents all future synchronization for that pod.
Version-Release number of selected component (if applicable):
How reproducible:
Customer Environment
Steps to Reproduce:
1. Deploy a pod with multiple containers (e.g., NetApp Trident controller).
2. The issue occurs intermittently, often under high system load or when CNI setup takes more than 100 ms.
Actual results:
Kubelet Logs: manager.go:1169: "Failed to process watch event... Status 404 returned error can't find the container".
Pod Status: Stuck in Pending / ContainerCreating.
Status Fields: imageID is empty and PodReadyToStartContainers is False, even though crictl ps shows the container is actually Running.
Expected results:
Kubelet should implement a retry mechanism with exponential backoff when receiving a 404 error from the runtime during a cgroup watch event, allowing for the natural delay in CRI-O registration.
Additional info: