OpenShift Bugs / OCPBUGS-58382

[parent] StatefulSet pod stuck in Terminating state when backend storage went down

    • Bug
    • Resolution: Done
    • Critical
    • None
    • 4.16
    • Node / CRI-O
    • Quality / Stability / Reliability
    • False
    • None
    • None
    • None
    • Rejected
    • None
    • Customer Escalated, Customer Facing

      Description of problem:

      - A StatefulSet pod is stuck in the Terminating state.
      - As per the Kubernetes documentation (https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#statefulset-considerations), this is expected behavior, because a StatefulSet pod needs stable storage.
      - Why can't CRI-O tell that the container has in fact terminated (the process no longer exists) and finish the pod termination flow? Suggested workaround: 'Reload the cri-o systemd unit so that it can recognise the pod has terminated'.
      Logs:
      Event logs:
      ~~~
       Warning  Unhealthy  3m4s (x811 over 132m)   kubelet  Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of e05838f64303680b4d2fd5d81788555650cf84c196d67086ca44e873dca12221 is running failed: container process not found
      ~~~
      CRI-O logs (from the node journal):
      ~~~
      Apr 08 18:05:42 dell-xyz crio[5637]: time="2025-04-08 18:05:42.926691442Z" level=warning msg="Stopping container e05838f64303680b4d2fd5d81788555650cf84c196d67086ca44e873dca12221 with stop signal timed out. Killing..."
      Apr 08 18:07:43 dell-xyz crio[5637]: time="2025-04-08 18:07:43.218580890Z" level=info msg="Stopping container: e05838f64303680b4d2fd5d81788555650cf84c196d67086ca44e873dca12221 (timeout: 30s)" id=13d1e782-9694-4142-bc44-005f5e9326b3 name=/runtime.v1.RuntimeService/StopContainer
      ~~~
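      To see how often a stop is being retried for the stuck container, the journal lines can be filtered by container ID. This is a minimal sketch, not part of the original report: the function name `count_stop_requests` is hypothetical, and the match string is taken from the `Stopping container: <id> (timeout: ...)` line shown above.

```shell
#!/bin/sh
# Count StopContainer request lines for one container ID.
# Reads CRI-O journal lines on stdin and prints the number of matches.
# The function name is hypothetical; the pattern mirrors the log line above.
count_stop_requests() {
    id="$1"
    # -F: fixed string, -c: count matching lines.
    # The trailing space avoids matching a longer ID that shares this prefix.
    # grep exits 1 when nothing matches, so "|| true" keeps the return code 0.
    grep -F -c -- "Stopping container: $id " || true
}

# Example (assumes the CRI-O systemd unit is named "crio"):
#   journalctl -u crio | count_stop_requests e05838f64303680b4d2fd5d81788555650cf84c196d67086ca44e873dca12221
```

      A count that keeps growing while the pod stays in Terminating would suggest that the stop request is being retried without the container state ever changing.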
      Grabbing the pid for the container's process:
      ~~~
      crictl inspect e05838f64303680b4d2fd5d81788555650cf84c196d67086ca44e873dca12221 | jq .info.pid
      1502294
      ~~~
      But the process is not running:
      ~~~
      [root@dell-xyz ~]# ps axo pid=,stat= | grep 1502294
      [root@dell-xyz ~]# ps | grep 1502294
      [root@dell-xyz ~]# ps aux | grep 1502294
      [root@dell-xyz ~]# stat /proc/1502294
      stat: cannot statx '/proc/1502294': No such file or directory
      ~~~
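      The manual checks above can be wrapped in one helper that reports whether a PID still exists. This is a minimal sketch under stated assumptions: the function name `pid_state` is hypothetical, and it only tests for a `/proc/<pid>` entry or a successful `kill -0`, which is the same evidence the `ps` and `stat` commands give.

```shell
#!/bin/sh
# Print "running" if the PID still exists, "gone" if it does not,
# or "invalid" if the argument is not a number.
# The function name is hypothetical, for illustration only.
pid_state() {
    pid="$1"
    case "$pid" in
        ''|*[!0-9]*) echo "invalid"; return 2 ;;
    esac
    # A live process has a /proc/<pid> directory; kill -0 sends no signal
    # and only checks that the process exists and can be signalled.
    if [ -d "/proc/$pid" ] || kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        echo "gone"
        return 1
    fi
}

# Example, using the PID that crictl reported above:
#   pid_state "$(crictl inspect <container-id> | jq .info.pid)"
```

      On the affected node this would be expected to print `gone` for PID 1502294, matching the `stat` output above.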

      Version-Release number of selected component (if applicable):

          

      How reproducible:

      Always

      Steps to Reproduce:

      Take the backend storage down for all StatefulSet pods on RHOCP 4.16 (observed in the customer environment).

      Actual results:

       The pod is stuck in the Terminating state without any running container process.

      Expected results:

        Pod termination is handled properly.

      Additional info:

          

              aos-node@redhat.com Node Team Bot Account
              rhn-support-mdeore Mayur Deore
              None
              None
              Min Li Min Li
              None