OpenShift Bugs / OCPBUGS-57444

If one master node stays down, cluster authentication fails due to a stuck authentication pod revision rollout


      Description of problem:

      If a master node goes down, cluster authentication is no longer highly available. The reason is that the authentication pods will reject authentication requests while they are in the middle of a revision rollout.

      $ oc get pods -n openshift-authentication -owide
      NAME                              READY   STATUS    RESTARTS   AGE
      oauth-openshift-8f66c8b57-j8p6m   1/1     Running   1          78d
      oauth-openshift-8f66c8b57-mcsdg   1/1     Running   1          78d
      oauth-openshift-8f66c8b57-sjchh   1/1     Running   0          78d
      oauth-openshift-d694495cb-9z65h   0/1     Pending   0          59m   <---------- Here
      

      As you can see, a new rollout has started: the Pending pod belongs to a different ReplicaSet, which is visible from the different pod-template hash in the oauth-openshift-XXXXX name.

      The rollout is stuck because the deployment is set to maxUnavailable = 1.
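
      To see where the rollout is stuck, a quick check (a sketch, assuming the default deployment name oauth-openshift that matches the pod names above) is to list the ReplicaSets and print the deployment's rollout strategy:

      $ oc -n openshift-authentication get replicasets
      $ oc -n openshift-authentication get deployment oauth-openshift -o jsonpath='{.spec.strategy}{"\n"}'

      The first command should show both the old and the new ReplicaSet (matching the two pod-template hashes above), the second the RollingUpdate parameters including maxUnavailable.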

      This was triggered after a master node had been down for some days and a new kubelet-serving-ca certificate was rotated.
      The cluster started to return 403 errors:

      Login failed (401 Unauthorized)

      This means that high availability is not guaranteed if one master node is down, and the situation has to be fixed manually by following the steps in the article I created.
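
      A simple way to confirm the degradation from the cluster side (not part of the original report, just a suggested check) is to look at the authentication ClusterOperator and the events in the namespace while the rollout is stuck:

      $ oc get clusteroperator authentication
      $ oc get events -n openshift-authentication --sort-by=.lastTimestamp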

       

      Version-Release number of selected component (if applicable):

      In general this is seen across various versions, such as the following (see the attached cases):

      • 4.12
      • 4.14
      • 4.16
      • 4.17

      Steps to Reproduce:

      1. Bring a master node down by any means (e.g. $ systemctl disable kubelet && systemctl stop kubelet)
      2. Delete the latest kubelet-serving-ca configmap under the openshift-kube-apiserver namespace to trigger a rotation. (This is just one example; any other certificate rotation, or anything else that triggers an authentication pod revision rollout, works as well. For instance, changing the log level forces a new deployment: oc patch authentications.operator/cluster --type=json -p '[{"op": "replace", "path": "/spec/logLevel", "value": "Trace"}]')
      3. Observe that the pods in the openshift-authentication namespace get stuck in a rollout that never completes (see the sketch after this list).
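
      As referenced in step 3, a minimal sketch of how to observe the stuck rollout (assuming the default deployment name oauth-openshift); the rollout status command keeps waiting and never reports success while the master stays down:

      $ oc -n openshift-authentication rollout status deployment/oauth-openshift
      $ oc -n openshift-authentication get pods -w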

      Actual results:

      Login no longer works.

      Expected results:
      Login should continue working even with one master down for a long period of time.
