Bug
Resolution: Unresolved
Critical
None
4.22
Description of problem
During some fencing operations, such as running 'sudo pcs stonith fence <node>', the kubelet resource times out while trying to start. Because the failed start sets the failcount to INFINITY, Pacemaker never retries, so kubelet stays down. As a result, the fenced node never rejoins the cluster, and the cluster stays degraded until the failed resources are cleared with 'sudo pcs resource cleanup'. The affected component is the Pacemaker kubelet agent for Two Node OpenShift with Fencing (TNF).
Version-Release number of selected component
4.22
How reproducible
Sometimes
Steps to Reproduce
- Deploy TNF using TNT
- ssh to one of the nodes (e.g. ssh core@192.168.111.20)
- Run sudo pcs stonith fence <other-node>
- Monitor kubelet to see if it fails to start up in time
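The monitoring step above can be sketched as follows (a rough helper, not part of the reproduction tooling; node names and resource names match the cluster status output below, and the commands assume the pcs CLI available on TNF nodes):

```shell
# From the surviving node: watch the kubelet clone state and its failcount.
# Once the start times out, 'pcs resource failcount show' reports INFINITY.
watch -n 5 'sudo pcs status resources; sudo pcs resource failcount show kubelet'

# On the fenced node itself, check whether systemd is still activating kubelet:
sudo systemctl status kubelet --no-pager
```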
Actual results
The kubelet times out after 1m45s trying to start, and the failcount is set to INFINITY, preventing any retries. The fenced node never rejoins the cluster.
Output shows:
Failed Resource Actions:
  * kubelet start on master-1 could not be executed (Timed Out: start action for systemd unit kubelet did not complete in time) at Thu Jan 29 18:08:51 2026 after 1m45.001s

Failcounts for resource 'kubelet'
  master-1: INFINITY
Full cluster status after waiting 5 minutes post-fence:
Cluster name: TNF
Cluster Summary:
* Stack: corosync (Pacemaker is running)
* Current DC: master-0 (version 2.1.9-1.2.el9_6-49aab9983) - partition with quorum
* Last updated: Thu Jan 29 18:11:05 2026 on master-0
* Last change: Thu Jan 29 18:10:40 2026 by root via root on master-0
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ master-0 master-1 ]
Full List of Resources:
* Clone Set: kubelet-clone [kubelet]:
* kubelet (systemd:kubelet): FAILED master-1
* Started: [ master-0 ]
* master-0_redfish (stonith:fence_redfish): Started master-0
* master-1_redfish (stonith:fence_redfish): Started master-0
* Clone Set: etcd-clone [etcd]:
* etcd (ocf:heartbeat:podman-etcd): FAILED master-0
* Stopped: [ master-1 ]
Failed Resource Actions:
* kubelet start on master-1 could not be executed (Timed Out: start action for systemd unit kubelet did not complete in time) at Thu Jan 29 18:08:51 2026 after 1m45.001s
* etcd 30s-interval monitor on master-0 returned 'error' (master-0 must force a new cluster) at Thu Jan 29 18:09:07 2026
Expected results
Kubelet should either start successfully or be retried after a failed start. The node should automatically rejoin the cluster once fencing completes.
Additional info
Workaround: Run sudo pcs resource cleanup to clear the failed state and allow the node to rejoin the cluster.
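Beyond the cleanup workaround, a possible mitigation (a sketch based on standard Pacemaker behavior, not verified on TNF) is to stop treating a failed start as fatal and give the start action more headroom than the 1m45s seen above, so Pacemaker retries instead of pinning the failcount at INFINITY:

```shell
# Let Pacemaker retry failed starts instead of setting failcount to INFINITY
# (start-failure-is-fatal defaults to true).
sudo pcs property set start-failure-is-fatal=false

# Raise the kubelet start timeout above the current 1m45s.
sudo pcs resource update kubelet op start timeout=300s

# Optionally expire recorded failures so retries happen after a cool-down.
sudo pcs resource meta kubelet-clone failure-timeout=60s
```

Whether these settings are appropriate defaults for TNF would need to be evaluated separately; they are listed here only as a possible direction for a fix.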
Note: Journal dumps from both nodes will be attached showing the detailed timeline of the fencing operation and kubelet start attempts.