OpenShift Bugs / OCPBUGS-61576

Kubelet serving certificates are not generated on nodes rebooted within 24 hours of initial cluster installation.


      Description of problem:

      When nodes of a newly installed OpenShift cluster are shut down and brought back online within the first 24 hours of the cluster's creation, they fail to rejoin the cluster. This appears to be caused by an interruption in the initial kubelet certificate bootstrapping and rotation process.
      
      The documented recovery procedure for expired certificates involves approving the initial node-bootstrapper CSR, which should then allow the cluster's auto-approver to sign the subsequent CSRs for the long-term client and serving certificates.
      
      However, in this scenario (a reboot before the first 24-hour rotation), the recovery stalls partway. After the node reboots and the bootstrap CSR is approved, the kubelet successfully obtains its client certificate (kubelet-client-current.pem), but it never creates a CSR for its serving certificate. This leaves the node in a broken state where it cannot fully initialize or become Ready.

      Version-Release number of selected component (if applicable):

          

      How reproducible:

      Always    

      Steps to Reproduce:

          1. Install a new OpenShift cluster. Crucially, within the first 24 hours of installation, shut down the nodes.
          2. Wait for more than 24 hours, then power the nodes back on.
          3. To ensure a clean re-bootstrap attempt for testing, manually reset the kubelet's PKI on the affected node:
             $ systemctl stop kubelet
             $ mv /var/lib/kubelet/pki/ /tmp/
             $ systemctl restart kubelet
          4. Observe and approve the pending bootstrap CSR:
             # Observe the new CSR from the node-bootstrapper
             $ oc get csr
             # Approve all pending CSRs
             $ oc get csr -o name | xargs oc adm certificate approve
          5. On the affected node, check the contents of the newly created /var/lib/kubelet/pki/ directory.
          6. List the CSRs again:
             $ oc get csr
          7. Observe that the second set of CSRs, i.e. those for the node certificates, is not created.
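
The approval command in the steps above passes every CSR name to `oc adm certificate approve`, including requests that are already approved. A minimal sketch of narrowing that to pending requests only, assuming the default tabular output of `oc get csr` with CONDITION as the last column; the helper name `pending_csr_names` is hypothetical and not part of `oc`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the tabular output of `oc get csr` on stdin and
# print the names of CSRs whose CONDITION is exactly "Pending".
# Assumption: CONDITION is the last whitespace-separated column of each row.
pending_csr_names() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage on a live cluster (not executed here; --no-run-if-empty is GNU xargs):
#   oc get csr | pending_csr_names | xargs --no-run-if-empty oc adm certificate approve
```

Filtering first makes it easier to see, after each approval round, whether any new request has actually appeared.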
      
          

      Actual results:

      1. A bootstrap CSR from system:serviceaccount:openshift-machine-config-operator:node-bootstrapper is created and successfully moves to an Approved,Issued state.
      
      2. The /var/lib/kubelet/pki/ directory is created on the node, but it only contains the client certificate (kubelet-client-....pem and the kubelet-client-current.pem symlink).
      
      3. The kubelet serving certificate (kubelet-server-current.pem) is never generated. 
      4. No subsequent CSR with the signer name kubernetes.io/kubelet-serving is ever created by the node.
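
One way to confirm result 4 is to list CSRs together with their signer names and filter for `kubernetes.io/kubelet-serving`. The sketch below assumes the input is one CSR per line in the form `name<TAB>signerName`, as a jsonpath query over `.metadata.name` and `.spec.signerName` would produce; the helper name `csrs_for_signer` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "name<TAB>signerName" lines on stdin and print the
# names of the CSRs whose signer matches the first argument.
csrs_for_signer() {
  awk -v signer="$1" '$2 == signer { print $1 }'
}

# Usage on a live cluster (not executed here):
#   oc get csr -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.signerName}{"\n"}{end}' \
#     | csrs_for_signer kubernetes.io/kubelet-serving
# Empty output would match the behaviour described in this report.
```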

      Expected results:

      1. After the bootstrap CSR is approved, the kubelet should proceed to create a second CSR for its serving certificate (kubernetes.io/kubelet-serving).
      
      2. This second CSR should be automatically approved by the cluster.
      
      3. Both kubelet-client-current.pem and kubelet-server-current.pem should be present in /var/lib/kubelet/pki/.
      
      4. The kubelet service should start successfully, and the node should rejoin the cluster in a Ready state.   
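
Expected result 3 can be checked directly on the node. The following is a minimal sketch, assuming the two file names given in this report; the function name `check_kubelet_pki` is hypothetical, and the directory defaults to /var/lib/kubelet/pki when no argument is given:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which kubelet certificate files exist in a PKI
# directory. Returns 0 only if both the client and serving certificates exist.
check_kubelet_pki() {
  local dir="${1:-/var/lib/kubelet/pki}"
  local missing=0
  local f
  for f in kubelet-client-current.pem kubelet-server-current.pem; do
    # -e follows symlinks, so a dangling *-current.pem symlink counts as missing.
    if [ -e "$dir/$f" ]; then
      echo "present: $f"
    else
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the affected node (not executed here):
#   check_kubelet_pki /var/lib/kubelet/pki
```

In the failure described here, the function would be expected to report the client certificate as present and the serving certificate as missing.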

      Additional info:

      The documentation states otherwise:
      - https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html-single/installing_on_any_platform/index#:~:text=The%20Ignition%20config%20files%20that%20the%20installation,if%20the%20certificate%20update%20runs%20during%20installation
      - https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/html/backup_and_restore/control-plane-backup-and-restore#dr-scenario-3-recovering-expired-certs_dr-recovering-expired-certs 

              rphillip@redhat.com Ryan Phillips
              rhn-support-vdurgam Vedant Durgam
              Sergio Regidor de la Rosa Sergio Regidor de la Rosa