Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.16
Component/s: Node / CRI-O
Labels:
- blue
- triaged

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:

4.18, 4.19, 4.20.0
Target Version:

4.20.0
Release Blocker:
Rejected
Sprint:
None

Customer Impact:

Customer Escalated, Customer Facing

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

- A StatefulSet pod is stuck in the Terminating state.
- As per the doc https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#statefulset-considerations , this is expected behavior, as a StatefulSet pod needs stable storage.
- Why can't CRI-O tell that the pod has in fact been terminated (the process no longer exists) and finish the pod termination flow? Quoted workaround: 'Reload the cri-o systemd unit so that it can recognise the pod has terminated'
Logs:
Event logs:
~~~
 Warning  Unhealthy  3m4s (x811 over 132m)   kubelet  Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of e05838f64303680b4d2fd5d81788555650cf84c196d67086ca44e873dca12221 is running failed: container process not found
~~~
CRI-O logs (from the node journal):
~~~
Apr 08 18:05:42 dell-xyz crio[5637]: time="2025-04-08 18:05:42.926691442Z" level=warning msg="Stopping container e05838f64303680b4d2fd5d81788555650cf84c196d67086ca44e873dca12221 with stop signal timed out. Killing..."
Apr 08 18:07:43 dell-xyz crio[5637]: time="2025-04-08 18:07:43.218580890Z" level=info msg="Stopping container: e05838f64303680b4d2fd5d81788555650cf84c196d67086ca44e873dca12221 (timeout: 30s)" id=13d1e782-9694-4142-bc44-005f5e9326b3 name=/runtime.v1.RuntimeService/StopContainer
~~~
Grabbing the pid for the container's process:
~~~
crictl inspect e05838f64303680b4d2fd5d81788555650cf84c196d67086ca44e873dca12221 | jq .info.pid
1502294
~~~
But the process is not running:
~~~
[root@dell-xyz ~]# ps axo pid=,stat= | grep 1502294
[root@dell-xyz ~]# ps | grep 1502294
[root@dell-xyz ~]# ps aux | grep 1502294
[root@dell-xyz ~]# stat /proc/1502294
stat: cannot statx '/proc/1502294': No such file or directory
~~~
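
The manual check above can be wrapped in a small script for repeated use. This is a minimal sketch, assuming bash on a Linux node with `crictl` and `jq` available, as in the commands above. `pid_exists` and `check_container` are hypothetical helper names, not part of CRI-O or crictl. It only reports the mismatch between the PID recorded by the runtime and `/proc`; it does not fix the stuck pod.

```shell
#!/usr/bin/env bash
# Sketch only: compare the PID recorded by the runtime with what /proc shows.
# pid_exists and check_container are hypothetical helper names.

# Return 0 if the given PID currently has an entry under /proc, 1 otherwise.
pid_exists() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric (e.g. "null")
    esac
    [ -d "/proc/$1" ]
}

# Look up the PID recorded for a container and report whether it still exists.
check_container() {
    local cid="$1" pid
    # Same query as in the report: crictl inspect <id> | jq .info.pid
    pid="$(crictl inspect "$cid" | jq -r '.info.pid')"
    if pid_exists "$pid"; then
        echo "container $cid: PID $pid is still present"
        return 0
    fi
    echo "container $cid: recorded PID '$pid' not found under /proc"
    return 1
}
```

On the affected node, this would be run as root with `check_container <container-id>`, after sourcing the functions.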

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

Backend storage goes down for all StatefulSet pods on RHOCP 4.16 in the customer environment.

Actual results:

The pod is stuck in the Terminating state without any running process.

Expected results:

Pod termination is handled properly.

Additional info:

clones

OCPBUGS-55485 [4.20] Statefull set pod stuck in a Terminating State when backend storage went down

Closed

is depended on by

OCPBUGS-55485 [4.20] Statefull set pod stuck in a Terminating State when backend storage went down

Closed

Assignee: Node Team Bot Account

Reporter: Mayur Deore

Need Info From: None

Contributors: None

QA Contact: Min Li

Doc Contact: None

Votes: 0

Watchers: 2

Created: 2025/07/04 7:07 AM

Updated: 2025/09/13 6:18 PM

Resolved: 2025/07/04 7:09 AM
