Bug
Resolution: Unresolved
4.17.0, 4.17.z, 4.16.z, 4.18.z
Quality / Stability / Reliability
Important
Customer Escalated, Customer Facing, Customer Reported
Description of problem:
Automatic certificate rotation can lead to cluster inaccessibility when a master node is temporarily unavailable during the rotation process. If one or more master nodes are in a "Not Ready" state and recovery is delayed, the certificate rotation mechanism fails to complete successfully across all masters. The certificate rotation failure eventually leads to expired certificates, preventing authentication and rendering the entire cluster inaccessible.
Version-Release number of selected component (if applicable):
How reproducible:
Always, when a master node is unavailable during the certificate rotation process
Steps to Reproduce:
1. Induce a failure on one master node, causing it to go into a "Not Ready" state (e.g., stopping essential services, network isolation).
2. Allow the automatic certificate rotation process to initiate while this master node is down.
3. Observe the failure of the certificate rotation to complete successfully due to the unavailable master.
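To confirm the node state in step 1 before waiting for rotation, a small helper can list which masters are not Ready. The sketch below is only an illustration: it assumes the input is the parsed JSON of a Kubernetes node list (for example, the output of `oc get nodes -o json`), and the function name `not_ready_masters` is hypothetical.

```python
# Role labels commonly set on control plane nodes; treating either one as
# "master" is an assumption of this sketch.
MASTER_LABELS = (
    "node-role.kubernetes.io/master",
    "node-role.kubernetes.io/control-plane",
)


def not_ready_masters(node_list):
    """Return the names of master nodes whose Ready condition is not "True".

    node_list is a dict shaped like a Kubernetes NodeList object.
    """
    names = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        labels = metadata.get("labels", {})
        if not any(label in labels for label in MASTER_LABELS):
            continue  # worker or other non-master node
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(metadata.get("name", "<unnamed>"))
    return names
```

Feeding it `json.load(sys.stdin)` from a script that receives the node list on standard input is one possible way to use it; an empty result means every master currently reports Ready.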
Actual results:
The automatic certificate rotation fails when not all master nodes are in a "Ready" state. Consequently, certificates are not renewed, leading to their expiration and rendering the cluster inaccessible due to authentication failures.
Expected results:
Given that OpenShift clusters are designed to tolerate the loss of one master node (2 out of 3 masters being "Ready" is sufficient for cluster operations), the automatic certificate rotation process should exhibit similar resiliency. Specifically, it is expected that:
- If a master node is in a "Not Ready" state during the rotation, the rotation process should temporarily skip this node.
- Upon the "Not Ready" node returning to a "Ready" state (e.g., after recovery and subsequent certificate approval processes like CSR), the certificate rotation should then be applied to that specific node to bring it up to date.
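The requested skip-then-catch-up behavior can be modeled in a few lines. This is a toy sketch of the expectation only, not a description of how the OpenShift operators rotate certificates; the `Master` class, the `cert_generation` counter, and the `rotate` and `reconcile` functions are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Master:
    name: str
    ready: bool
    cert_generation: int = 0  # hypothetical counter for the installed cert


def rotate(masters, target_generation):
    """Apply the new certificate to Ready masters and skip the rest.

    Returns the names of the masters that were skipped.
    """
    skipped = []
    for master in masters:
        if master.ready:
            master.cert_generation = target_generation
        else:
            skipped.append(master.name)
    return skipped


def reconcile(masters, target_generation):
    """Bring recovered masters up to date with the current certificate.

    Returns the names of the masters that were updated in this pass.
    """
    updated = []
    for master in masters:
        if master.ready and master.cert_generation < target_generation:
            master.cert_generation = target_generation
            updated.append(master.name)
    return updated
```

In this model, a rotation with one of three masters down still completes on the two healthy masters, and a later `reconcile` pass updates the third once it reports Ready again. Running `reconcile` repeatedly is harmless because masters already at the target generation are left alone.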
Additional info:
This behavior would ensure that the cluster remains accessible and highly available even during transient master node disruptions while maintaining up-to-date certificates across all healthy masters.