-
Bug
-
Resolution: Unresolved
-
Undefined
-
4.14, 4.15, 4.16, 4.17.z, 4.18
-
Quality / Stability / Reliability
-
False
-
-
None
-
Critical
Description of problem:
If a master node goes down, cluster authentication is no longer highly available. The reason is that the authentication pods reject authentication requests while a revision rollout is in progress.
$ oc get pods -n openshift-authentication -owide
NAME                              READY   STATUS    RESTARTS   AGE
oauth-openshift-8f66c8b57-j8p6m   1/1     Running   1          78d
oauth-openshift-8f66c8b57-mcsdg   1/1     Running   1          78d
oauth-openshift-8f66c8b57-sjchh   1/1     Running   0          78d
oauth-openshift-d694495cb-9z65h   0/1     Pending   0          59m   <---------- Here
As you can see, a new rollout is in progress: the Pending pod belongs to a different ReplicaSet, which is visible from the different hash in its oauth-openshift-XXXXX name.
The rollout is stuck because the deployment is set to maxUnavailable = 1.
This was triggered after a master node had been down for several days and a new kubelet-serving-ca cert was rotated.
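To make the mixed-ReplicaSet state easier to spot, the pod listing above can be grouped by the hash segment of each pod name. The following is only a minimal sketch: summarize_oauth_pods is a hypothetical helper name, and it assumes pod names follow the <deployment>-<replicaset-hash>-<suffix> pattern seen in the listing.

```shell
# Hypothetical helper (not part of oc): group oauth-openshift pods by
# ReplicaSet hash and count how many are not in the Running state.
# Reads the table printed by `oc get pods` on stdin.
summarize_oauth_pods() {
  awk 'NR > 1 && $1 ~ /^oauth-openshift-/ {
         # Assumed name layout: <deployment>-<replicaset-hash>-<pod-suffix>
         n = split($1, parts, "-")
         hash = parts[n - 1]
         total[hash]++
         if ($3 != "Running") notrunning[hash]++
       }
       END {
         for (h in total)
           printf "%s pods=%d not_running=%d\n", h, total[h], notrunning[h] + 0
       }' | sort
}

# Usage on a live cluster (assumes `oc` is logged in):
#   oc get pods -n openshift-authentication | summarize_oauth_pods
```

More than one hash in the output, with the newest one reporting not_running pods, matches the stuck rollout described here.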
The cluster started to return 401 errors:
Login failed (401 Unauthorized)
This means that high availability is not guaranteed if one master node is down, and the problem has to be fixed manually by following the steps in the article I created.
Version-Release number of selected component (if applicable):
This has been seen in several versions, including the following (see attached cases):
- 4.12
- 4.14
- 4.16
- 4.17
Steps to Reproduce:
1. Bring a master node down by any means (e.g. $ systemctl disable kubelet && systemctl stop kubelet)
2. Delete the latest kubelet-serving-ca configmap in the openshift-kube-apiserver namespace to trigger a rotation. This is only an example: any other cert rotation, or anything else that triggers an authentication pod revision rollout, has the same effect. Alternatively, change the log level to force a new deployment: oc patch authentications.operator/cluster --type=json -p '[{"op": "replace", "path": "/spec/logLevel", "value": "TRACE" }]'
3. Observe that the pods in the openshift-authentication namespace get stuck in a rollout that never completes.
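After step 3, one way to confirm that the rollout is not progressing is to compare the deployment's replica counters. The sketch below is an illustration, not an official check: rollout_state is a hypothetical helper, the deployment name oauth-openshift is inferred from the pod names in this report, and the completeness test is simplified (it does not verify that old pods are gone).

```shell
# Hypothetical helper: rollout_state DESIRED UPDATED AVAILABLE
# Prints "complete" only when every desired replica is both updated and
# available (a simplified version of the usual rollout-complete condition).
rollout_state() {
  desired=$1
  updated=${2:-0}
  available=${3:-0}
  if [ "$updated" -eq "$desired" ] && [ "$available" -eq "$desired" ]; then
    echo "complete"
  else
    echo "in-progress (updated=$updated available=$available desired=$desired)"
  fi
}

# Usage on a live cluster (assumes `oc` is logged in; a counter missing
# from the deployment status would shift the arguments, so treat the
# result as a hint only):
#   rollout_state $(oc get deployment oauth-openshift \
#       -n openshift-authentication \
#       -o jsonpath='{.spec.replicas} {.status.updatedReplicas} {.status.availableReplicas}')
```

If the helper keeps reporting in-progress with the same counters over a long period, that is consistent with the rollout that never completes. The built-in oc rollout status deployment/oauth-openshift -n openshift-authentication command gives a similar view interactively.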
Actual results:
Login no longer works.
Expected results:
Login should continue working even with one master down for a long period of time.
- blocks
-
OCPBUGS-61896 [4.20] If one master node stays down authentication the cluster fails due to stuck authentication pod revision rollout
-
- POST
-
-
OCPBUGS-61895 If one master node stays down authentication the cluster fails due to stuck authentication pod revision rollout
-
- Closed
-
- is cloned by
-
OCPBUGS-61896 [4.20] If one master node stays down authentication the cluster fails due to stuck authentication pod revision rollout
-
- POST
-
-
OCPBUGS-61895 If one master node stays down authentication the cluster fails due to stuck authentication pod revision rollout
-
- Closed
-
- links to