OpenShift Bugs: OCPBUGS-74679

Kubelet resource start timeout prevents automatic fencing recovery

    • Proposed
    • OCPEDGE Sprint 285

      Description of problem

      During some fencing operations, such as running 'sudo pcs stonith fence <node>', the kubelet resource on the fenced node times out while starting. Pacemaker never retries the start, so kubelet stays down, the fenced node never rejoins the cluster, and the cluster remains degraded until the failed state is cleared with 'sudo pcs resource cleanup'. The affected component is the pacemaker kubelet agent for Two Node OpenShift with Fencing (TNF).
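      For context, Pacemaker treats a failed start as fatal by default (the cluster property start-failure-is-fatal defaults to true), which is why a single start timeout drives the failcount to INFINITY with no retry. A hypothetical mitigation sketch, not the actual fix for this bug, using standard pcs commands and an assumed 300s timeout value:

```shell
# Sketch only: allow Pacemaker to retry a failed kubelet start.
# start-failure-is-fatal=true (the default) turns one start timeout
# into failcount=INFINITY on that node, blocking all retries.
sudo pcs property set start-failure-is-fatal=false

# Alternatively (or additionally), give the start action more headroom
# than the 1m45s seen in this report; 300s is an assumed value.
sudo pcs resource update kubelet op start timeout=300s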

      Version-Release number of selected component

      4.22

      How reproducible

      Sometimes

      Steps to Reproduce

      1. Deploy TNF using TNT
      2. ssh to one of the nodes (e.g. ssh core@192.168.111.20)
      3. Run sudo pcs stonith fence <other-node>
      4. Monitor kubelet to see if it fails to start up in time
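      The monitoring in step 4 can be done with standard pcs/systemd tooling; a sketch (the resource name kubelet comes from this report's environment):

```shell
# On the surviving node, watch the cluster react to the fence:
watch -n 5 'sudo pcs status --full'

# Once the fenced node reboots, follow kubelet's start attempt there:
sudo journalctl -u kubelet -f

# Check whether the start failure was recorded against the resource:
sudo pcs resource failcount show kubelet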

      Actual results

      The kubelet start times out after 1m45s, and the failcount is set to INFINITY, preventing any retries. The fenced node never rejoins the cluster.

      Output shows:

      Failed Resource Actions:
        * kubelet start on master-1 could not be executed (Timed Out: start action for systemd unit kubelet did not complete in time) at Thu Jan 29 18:08:51 2026 after 1m45.001s
      
      Failcounts for resource 'kubelet'
        master-1: INFINITY
      

      Full cluster status after waiting 5 minutes post-fence:

      Cluster name: TNF
      Cluster Summary:
        * Stack: corosync (Pacemaker is running)
        * Current DC: master-0 (version 2.1.9-1.2.el9_6-49aab9983) - partition with quorum
        * Last updated: Thu Jan 29 18:11:05 2026 on master-0
        * Last change:  Thu Jan 29 18:10:40 2026 by root via root on master-0
        * 2 nodes configured
        * 6 resource instances configured
      
      Node List:
        * Online: [ master-0 master-1 ]
      
      Full List of Resources:
        * Clone Set: kubelet-clone [kubelet]:
          * kubelet    (systemd:kubelet):     FAILED master-1
          * Started: [ master-0 ]
        * master-0_redfish    (stonith:fence_redfish):     Started master-0
        * master-1_redfish    (stonith:fence_redfish):     Started master-0
        * Clone Set: etcd-clone [etcd]:
          * etcd    (ocf:heartbeat:podman-etcd):     FAILED master-0
          * Stopped: [ master-1 ]
      
      Failed Resource Actions:
        * kubelet start on master-1 could not be executed (Timed Out: start action for systemd unit kubelet did not complete in time) at Thu Jan 29 18:08:51 2026 after 1m45.001s
        * etcd 30s-interval monitor on master-0 returned 'error' (master-0 must force a new cluster) at Thu Jan 29 18:09:07 2026
      

      Expected results

      Kubelet should start successfully, or be retried when the start fails. The node should automatically rejoin the cluster after fencing completes.

      Additional info

      Workaround: Run 'sudo pcs resource cleanup' to clear the failed state and allow the node to rejoin the cluster.
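      The workaround maps to the following commands; scoping the cleanup to the kubelet resource is an assumption (a bare 'pcs resource cleanup', as stated above, also works):

```shell
# Clear the recorded start failure (and the INFINITY failcount):
sudo pcs resource cleanup kubelet

# Confirm the clone starts on the previously fenced node:
sudo pcs status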

      Note: Journal dumps from both nodes will be attached showing the detailed timeline of the fencing operation and kubelet start attempts.

        1. master-0.txt
          7.75 MB
          Jeremy Poulin
        2. master-1.txt
          733 kB
          Jeremy Poulin
        3. master-1-var-log-pacemaker.txt
          1.75 MB
          Jeremy Poulin
        4. post-reboot-pacemaker.txt
          12 kB
          Jeremy Poulin

              rh-ee-fcappa Francesco Cappa
              jpoulin Jeremy Poulin
              Douglas Hensel Douglas Hensel