OCPBUGS-58227 (OpenShift Bugs)

Authentication fails when a master node is temporarily unavailable during the certificate rotation process

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • 4.17.0, 4.17.z, 4.16.z, 4.18.z
    • apiserver-auth
    • Quality / Stability / Reliability
    • Important
    • Customer Escalated, Customer Facing, Customer Reported

      Description of problem:

          Automatic certificate rotation can leave the cluster inaccessible when a master node is temporarily unavailable during the rotation process. If one or more master nodes are in a "Not Ready" state and recovery is delayed, the certificate rotation does not complete successfully across all masters. The certificates that were not rotated eventually expire, which prevents authentication and renders the entire cluster inaccessible.

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          Always, when a master node is unavailable during the certificate rotation process

      Steps to Reproduce:

          1. Induce a failure on one master node so that it enters a "Not Ready" state, for example by stopping essential services or isolating it from the network.

          2. Allow the automatic certificate rotation process to start while this master node is down.

          3. Observe that the certificate rotation does not complete because of the unavailable master.
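
To confirm the precondition in step 1 before waiting for rotation, the node list can be checked for masters that are not Ready. The following is a minimal sketch, not part of the original report. It assumes the input is the parsed JSON from `oc get nodes -o json`, and that masters carry one of the standard role labels; the function name `not_ready_masters` and the sample data are hypothetical.

```python
# Role labels commonly present on control-plane nodes.
MASTER_LABELS = (
    "node-role.kubernetes.io/master",
    "node-role.kubernetes.io/control-plane",
)


def not_ready_masters(node_list: dict) -> list[str]:
    """Return names of master nodes whose Ready condition is not "True"."""
    result = []
    for node in node_list.get("items", []):
        labels = node["metadata"].get("labels", {})
        if not any(label in labels for label in MASTER_LABELS):
            continue  # skip workers and other non-master nodes
        conditions = node.get("status", {}).get("conditions", [])
        # A missing Ready condition is treated like "Unknown".
        ready = next(
            (c["status"] for c in conditions if c["type"] == "Ready"),
            "Unknown",
        )
        if ready != "True":
            result.append(node["metadata"]["name"])
    return result


# Hypothetical sample shaped like a NodeList object.
SAMPLE = {
    "items": [
        {
            "metadata": {
                "name": "master-0",
                "labels": {"node-role.kubernetes.io/master": ""},
            },
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
        {
            "metadata": {
                "name": "master-2",
                "labels": {"node-role.kubernetes.io/master": ""},
            },
            "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]},
        },
        {
            "metadata": {
                "name": "worker-0",
                "labels": {"node-role.kubernetes.io/worker": ""},
            },
            "status": {"conditions": [{"type": "Ready", "status": "False"}]},
        },
    ]
}

print(not_ready_masters(SAMPLE))
```

In practice the dictionary would come from `json.loads(...)` applied to the command output rather than from a hard-coded sample.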

      Actual results:

          The automatic certificate rotation fails unless all master nodes are in a "Ready" state. As a result, certificates are not renewed; they expire, authentication fails, and the cluster becomes inaccessible.
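
To see how close a certificate is to the expiry described above, its `notAfter` timestamp can be compared with the current time. This is a small illustrative sketch only. It assumes the timestamp is in the format printed by `openssl x509 -noout -enddate` (after the `notAfter=` prefix); the report does not say which certificates are affected, so the helper name `remaining_validity` and the sample values are hypothetical.

```python
import ssl
from datetime import datetime, timedelta, timezone


def remaining_validity(not_after: str, now: datetime) -> timedelta:
    """Return the time left until notAfter; negative if already expired.

    not_after must look like "Jan  5 09:34:43 2018 GMT".
    now must be a timezone-aware datetime.
    """
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    expiry = datetime.fromtimestamp(expiry_epoch, tz=timezone.utc)
    return expiry - now


# Hypothetical values for illustration.
checked_at = datetime(2018, 1, 4, 9, 34, 43, tzinfo=timezone.utc)
left = remaining_validity("Jan  5 09:34:43 2018 GMT", checked_at)
print("expired" if left < timedelta(0) else f"valid for {left}")
```

A negative result means the certificate has already expired, which is the state in which authentication would be expected to fail.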

      Expected results:

          OpenShift clusters are designed to tolerate the loss of one master node (2 of 3 masters being "Ready" is sufficient for cluster operations), so the automatic certificate rotation process should be equally resilient.
          Specifically:
          - If a master node is in a "Not Ready" state during the rotation, the rotation process should temporarily skip this node.
          - Once the "Not Ready" node returns to a "Ready" state (e.g., after recovery and any subsequent certificate approval steps such as CSR approval), the certificate rotation should be applied to that node to bring it up to date.
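
The skip-and-catch-up behavior requested above can be sketched as a reconcile loop. This is an illustration of the expected logic under stated assumptions, not the operator's actual implementation: `RotationState`, `reconcile`, and the `rotate` callback are all hypothetical names, and node readiness is reduced to a boolean.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RotationState:
    """Hypothetical record of which masters carry the new certificate."""
    rotated: set[str] = field(default_factory=set)
    pending: set[str] = field(default_factory=set)


def reconcile(
    masters: dict[str, bool],
    state: RotationState,
    rotate: Callable[[str], None],
) -> RotationState:
    """Run one pass: rotate on Ready masters, defer the rest.

    masters maps node name to its Ready flag. rotate(name) is a
    hypothetical callback that applies the new certificate to one node.
    """
    for name, ready in masters.items():
        if name in state.rotated:
            continue  # already up to date
        if ready:
            rotate(name)
            state.rotated.add(name)
            state.pending.discard(name)
        else:
            # Skip for now; a later pass picks the node up once it is Ready.
            state.pending.add(name)
    return state


calls: list[str] = []
state = RotationState()

# First pass: master-2 is Not Ready, so it is skipped.
reconcile({"master-0": True, "master-1": True, "master-2": False}, state, calls.append)
first_pass_pending = set(state.pending)

# Second pass: master-2 has recovered, so only it is rotated.
reconcile({"master-0": True, "master-1": True, "master-2": True}, state, calls.append)
print(calls, state.pending)
```

The point of the design is that an unavailable node delays only its own rotation: healthy masters are rotated immediately, and the deferred node is handled on a later pass without repeating work on the others.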

      Additional info:

          This behavior would ensure that the cluster remains accessible and highly available even during transient master node disruptions while maintaining up-to-date certificates across all healthy masters.

              Assignee: Unassigned
              Reporter: Sameer Sardar (rhn-support-ssardar)