-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.19
-
Quality / Stability / Reliability
-
False
-
-
None
-
Moderate
Description of problem:
When the nodes of a newly installed OpenShift cluster are shut down within the first 24 hours of the cluster's creation and later brought back online, they fail to rejoin the cluster. This appears to be caused by an interruption in the initial kubelet certificate bootstrapping and rotation process. The documented recovery procedure for expired certificates involves approving the initial node-bootstrapper CSR, which should then allow the cluster's auto-approver to sign the subsequent CSRs for the long-term client and serving certificates.
However, in this scenario (a shutdown before the first 24-hour rotation), the recovery procedure does not complete. After the node reboots and the bootstrap CSR is approved, the kubelet successfully obtains its client certificate (kubelet-client-current.pem), but it never creates a CSR for its serving certificate. This leaves the node in a broken state where it cannot fully initialize or become Ready.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install a new OpenShift cluster. Crucially, shut down the nodes within the first 24 hours after installation.
2. Wait more than 24 hours, then power the nodes back on.
3. To ensure a clean re-bootstrap attempt for testing, manually reset the kubelet's PKI on the affected node:
   $ systemctl stop kubelet
   $ mv /var/lib/kubelet/pki/ /tmp/
   $ systemctl restart kubelet
4. Observe and approve the pending bootstrap CSR:
   # Observe the new CSR from the node-bootstrapper
   $ oc get csr
   # Approve all pending CSRs
   $ oc get csr -o name | xargs oc adm certificate approve
5. On the affected node, check the contents of the newly created /var/lib/kubelet/pki/ directory.
6. List the CSRs again:
   $ oc get csr
7. Observe that the second set of CSRs, i.e. those for the node certificates, is not created.
Actual results:
1. A bootstrap CSR from system:serviceaccount:openshift-machine-config-operator:node-bootstrapper is created and successfully moves to an Approved,Issued state.
2. The /var/lib/kubelet/pki/ directory is created on the node, but it only contains the client certificate (kubelet-client-....pem and the kubelet-client-current.pem symlink).
3. The kubelet serving certificate (kubelet-server-current.pem) is never generated.
4. No subsequent CSR with the signer name kubernetes.io/kubelet-serving is ever created by the node.
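To make the absence of a kubernetes.io/kubelet-serving CSR easier to confirm, the CSR list can be tallied by signer name. The following is a minimal sketch, not part of the original report: the helper name count_csr_signers is hypothetical, and it assumes its input is whitespace-separated "NAME SIGNER" rows, such as those produced by the oc invocation shown in the trailing comment.

```shell
#!/bin/sh
# Sketch (hypothetical helper): tally CSR rows by signer name.
# Reads whitespace-separated "NAME SIGNER" rows on stdin, one CSR per line,
# and prints "<signer> <count>" lines sorted by signer.
count_csr_signers() {
    awk 'NF >= 2 { n[$2]++ } END { for (s in n) printf "%s %d\n", s, n[s] }' \
        | LC_ALL=C sort
}

# Example invocation against a live cluster (not executed here):
#   oc get csr --no-headers \
#     -o custom-columns=NAME:.metadata.name,SIGNER:.spec.signerName \
#     | count_csr_signers
```

If the behaviour described above is reproduced, no line for kubernetes.io/kubelet-serving that originates from the affected node would be expected in the output.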
Expected results:
1. After the bootstrap CSR is approved, the kubelet should proceed to create a second CSR for its serving certificate (kubernetes.io/kubelet-serving).
2. This second CSR should be automatically approved by the cluster.
3. Both kubelet-client-current.pem and kubelet-server-current.pem should be present in /var/lib/kubelet/pki/.
4. The kubelet service should start successfully, and the node should rejoin the cluster in a Ready state.
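The expected PKI state on the node can be checked with a small script. This is only an illustrative sketch: the function name check_kubelet_pki is hypothetical, and the default directory and the two file names are taken from the results listed above.

```shell
#!/bin/sh
# Sketch (hypothetical helper): report which kubelet certificate links exist.
# Prints one "<file>: present|missing" line per expected file and returns
# non-zero if any is missing. A dangling symlink counts as missing, because
# [ -e ] follows symlinks.
check_kubelet_pki() {
    dir=${1:-/var/lib/kubelet/pki}
    rc=0
    for f in kubelet-client-current.pem kubelet-server-current.pem; do
        if [ -e "$dir/$f" ]; then
            echo "$f: present"
        else
            echo "$f: missing"
            rc=1
        fi
    done
    return "$rc"
}
```

Run on the affected node as check_kubelet_pki (or with an explicit directory argument). In the failure described in this report, the client line would be expected to read "present" and the server line "missing".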
Additional info:
The documentation states otherwise:
- https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html-single/installing_on_any_platform/index#:~:text=The%20Ignition%20config%20files%20that%20the%20installation,if%20the%20certificate%20update%20runs%20during%20installation
- https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/html/backup_and_restore/control-plane-backup-and-restore#dr-scenario-3-recovering-expired-certs_dr-recovering-expired-certs