OpenShift Bugs / OCPBUGS-57444

If one master node stays down, cluster authentication fails due to a stuck authentication pod revision rollout


      Description of problem:

      If a master node goes down, cluster authentication is no longer highly available. The reason is that the authentication pods will reject authentication requests while they are in the middle of a revision rollout.

      $ oc get pods -n openshift-authentication -owide
      NAME                              READY   STATUS    RESTARTS   AGE
      oauth-openshift-8f66c8b57-j8p6m   1/1     Running   1          78d
      oauth-openshift-8f66c8b57-mcsdg   1/1     Running   1          78d
      oauth-openshift-8f66c8b57-sjchh   1/1     Running   0          78d
      oauth-openshift-d694495cb-9z65h   0/1     Pending   0          59m   <---------- Here
      

      As you can see, a new rollout has started: the Pending pod belongs to a different ReplicaSet, which is visible from the different pod-template hash in the oauth-openshift-XXXXX name.

      The rollout is stuck because the deployment is set to maxUnavailable = 1.
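
      To see where the rollout is stuck, a quick check (a sketch, assuming the default deployment name oauth-openshift that matches the pod names above) is to list the ReplicaSets and print the deployment's rollout strategy:

      $ oc -n openshift-authentication get replicasets
      $ oc -n openshift-authentication get deployment oauth-openshift -o jsonpath='{.spec.strategy}{"\n"}'

      The first command should show both the old and the new ReplicaSet (matching the two pod-template hashes above), the second the RollingUpdate parameters including maxUnavailable.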

      This was triggered after a master node had been down for some days and a new kubelet-serving-ca certificate was rotated.
      The cluster started to return 403 errors:

      Login failed (401 Unauthorized)

      This means that high availability is not guaranteed if one master node is down, and the situation has to be fixed manually by following the steps in the article I created.
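
      A simple way to confirm the degradation from the cluster side (not part of the original report, just a suggested check) is to look at the authentication ClusterOperator and the events in the namespace while the rollout is stuck:

      $ oc get clusteroperator authentication
      $ oc get events -n openshift-authentication --sort-by=.lastTimestamp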

       

      Version-Release number of selected component (if applicable):

      In general this is seen across various versions, such as the following (see the attached cases):

      • 4.12
      • 4.14
      • 4.16
      • 4.17

      Steps to Reproduce:

      1. Bring a master node down by any means (e.g. $ systemctl disable kubelet && systemctl stop kubelet)
      2. Delete the latest kubelet-serving-ca configmap under the openshift-kube-apiserver namespace to trigger a rotation. (This is just one example; any other certificate rotation, or anything else that triggers an authentication pod revision rollout, works as well. For instance, changing the log level forces a new deployment: oc patch authentications.operator/cluster --type=json -p '[{"op": "replace", "path": "/spec/logLevel", "value": "Trace"}]')
      3. Observe that the pods in the openshift-authentication namespace get stuck in a rollout that never completes (see the sketch after this list).
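
      As referenced in step 3, a minimal sketch of how to observe the stuck rollout (assuming the default deployment name oauth-openshift); the rollout status command keeps waiting and never reports success while the master stays down:

      $ oc -n openshift-authentication rollout status deployment/oauth-openshift
      $ oc -n openshift-authentication get pods -w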

      Actual results:

      Login no longer works.

      Expected results:
      Login should continue working even with one master down for a long period of time.
