OpenShift Bugs / OCPBUGS-59102

After a node crash, the node is stuck in NotReady status because the certificate file is empty

    • Bug
    • Resolution: Won't Do
    • Minor
    • 4.19
    • Node / Kubelet
    • Quality / Stability / Reliability
    • Critical

      Description of problem:

      I'm testing node crash scenarios on a cluster running the latest 4.19 nightly build. I induce the crash by running echo c > /proc/sysrq-trigger on the node. After the crash, the node stays stuck in the NotReady state. On further analysis, the kubelet service logs complain about the /var/lib/kubelet/pki/kubelet-client-current.pem certificate, and the reason is that this file is empty.
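For anyone triaging a similar node, a minimal sketch of the check I did by hand is below. The function name cert_state is hypothetical (not part of any OpenShift tooling); the default path is the one from the kubelet logs above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the kubelet client certificate file.
# Prints one of: missing, empty, non-empty.
cert_state() {
  local cert="${1:-/var/lib/kubelet/pki/kubelet-client-current.pem}"
  if [ ! -e "$cert" ]; then
    echo "missing"      # path does not exist (or is a dangling symlink)
  elif [ ! -s "$cert" ]; then
    echo "empty"        # file exists but is zero bytes, as seen in this report
  else
    echo "non-empty"    # file has content; this does not validate the certificate
  fi
}
```

Run as root on the affected node, cert_state with no argument inspects the default path. A result of "empty" matches the condition described here; the kubelet logs themselves can be read with journalctl -u kubelet.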
      
      I also found some Pending CSRs:
      
      csr-66nc8                                        81s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Pending
      csr-ldvdv                                        68s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Pending
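A listing like the one above can be reduced to just the Pending CSR names with a small filter. This is only a sketch: pending_csr_names is a hypothetical name, and it assumes the condition is the last whitespace-separated column, as it is in the output shown here.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read `oc get csr` table rows on stdin and print the
# name (first column) of every row whose last column is exactly "Pending".
pending_csr_names() {
  awk '$NF == "Pending" { print $1 }'
}
```

Usage would be something like oc get csr --no-headers | pending_csr_names. For the two rows above it prints csr-66nc8 and csr-ldvdv, one per line.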
      
      Once I approved those CSRs, removed the empty /var/lib/kubelet/pki/kubelet-client-current.pem file, and restarted the kubelet service, the node became Ready.
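The manual recovery can be sketched roughly as follows. The function names approve_csrs and reset_empty_kubelet_cert are hypothetical, and the guard that refuses to touch a non-empty certificate is my own assumption about what is safe, not something the kubelet requires. The underlying commands (oc adm certificate approve, systemctl restart kubelet) are the ones I ran.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the manual workaround from this report.

# Run from a workstation with cluster-admin access.
# Approves each CSR name passed as an argument.
approve_csrs() {
  local csr
  for csr in "$@"; do
    oc adm certificate approve "$csr"
  done
}

# Run as root on the affected node.
# Removes the certificate file only if it exists and is zero bytes,
# then restarts kubelet. Leaves anything else untouched and returns 1.
reset_empty_kubelet_cert() {
  local cert="${1:-/var/lib/kubelet/pki/kubelet-client-current.pem}"
  if [ -e "$cert" ] && [ ! -s "$cert" ]; then
    rm -f -- "$cert"
    systemctl restart kubelet
  else
    echo "certificate is missing or not empty, leaving it alone: $cert" >&2
    return 1
  fi
}
```

For this report that would be approve_csrs csr-66nc8 csr-ldvdv from the workstation, followed by reset_empty_kubelet_cert on the node. The order follows what I did manually; I have not checked whether a different order works as well.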
      
      I have been running similar tests on almost every 4.19 build, but we have only seen this issue in recent builds (those from roughly the last 1.5 weeks).

      Version-Release number of selected component (if applicable):

      OCP: 4.19.0-0.nightly-2025-07-02-143253

      How reproducible:

      Near 100%

       

      Steps to Reproduce:

      1. Deploy a cluster with the latest 4.19 OCP build
      2. Crash some of the nodes using echo c > /proc/sysrq-trigger
      3. Observe the issue described above when the nodes recover
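To make step 3 observable without watching the console, something like the following polling sketch could be used after triggering the crash. Both function names, the 600-second default timeout, and the 10-second default interval are assumptions for illustration; the jsonpath query reads the Ready condition from the node status.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking node recovery after the crash in step 2.

# Print the status of the node's Ready condition (True, False, or Unknown).
node_ready_status() {
  oc get node "$1" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Poll until the node reports Ready=True, or give up after the timeout.
# Arguments: node name, timeout in seconds (default 600), interval in seconds (default 10).
wait_for_node_ready() {
  local node="$1" timeout="${2:-600}" interval="${3:-10}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ "$(node_ready_status "$node")" = "True" ]; then
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1
}
```

A non-zero return from wait_for_node_ready for a crashed node would correspond to the stuck NotReady state reported here. Step 2 itself is just the echo c > /proc/sysrq-trigger command, run as root on the node.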

      Actual results:

      The node is stuck in NotReady because /var/lib/kubelet/pki/kubelet-client-current.pem is empty

      Expected results:

      The node should recover successfully and return to Ready

      Additional info:

          

              rh-ee-kehannon Kevin Hannon
              mashetty@redhat.com Mahesh Shetty (Inactive)
              Min Li Min Li