Bug · Critical · 4.13.0 · Quality / Stability / Reliability · Resolution: Cannot Reproduce · Rejected
We recently noticed that the `e2e-gcp-operator` job in cluster-kube-apiserver-operator CI is broken. It fails consistently because the cluster always ends up in an unstable state with many operators degraded. Looking at the failures, many requests end up with Unauthorized errors.
We were able to narrow the suspects down to two test cases that put the cluster into an unstable state:
The scenarios are very similar, and both involve the service account token logic.
Scenario 1: OCPBUGS-8475: public key invalidation
- delete the `next-bound-service-account-signing-key` secret in the openshift-kube-apiserver-operator namespace
- a new revision of kube-apiserver is created
- wait for the secret to be recreated by the operator with a new keypair
- check that the `bound-sa-token-signing-certs` configmap in the openshift-kube-apiserver namespace contains both the original key from step 1 and the new one created by the operator
- delete the `bound-sa-token-signing-certs` configmap
- wait for the configmap to be recreated by the operator and verify that it contains only the latest key
- BOOM - tokens signed with the original private key are no longer recognized as valid, because the matching public key has been dropped from the configmap (see the sketch after this list)
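For reference, a minimal sketch of these steps using the Python kubernetes client is below. The namespaces and object names come from the steps above; the polling loop, the timeouts, and the expected number of entries in the configmap data are assumptions for illustration, not part of the actual e2e test.

```python
# Minimal sketch of Scenario 1 with the Python kubernetes client.
# Namespaces and object names come from the reproduction steps; polling
# intervals and the expected entry counts are assumptions.
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

OPERATOR_NS = "openshift-kube-apiserver-operator"
TARGET_NS = "openshift-kube-apiserver"

# Delete the signing-key secret; the operator is expected to recreate it
# with a fresh keypair, which also rolls out a new kube-apiserver revision.
core.delete_namespaced_secret("next-bound-service-account-signing-key", OPERATOR_NS)

# Wait for the secret to be recreated by the operator.
for _ in range(60):
    try:
        core.read_namespaced_secret("next-bound-service-account-signing-key", OPERATOR_NS)
        break
    except client.exceptions.ApiException:
        time.sleep(5)

# The signing-certs configmap should now hold both the original and the new public key.
cm = core.read_namespaced_config_map("bound-sa-token-signing-certs", TARGET_NS)
print("public keys before deletion:", sorted(cm.data))   # expect two entries

# Delete the configmap and let the operator recreate it.
core.delete_namespaced_config_map("bound-sa-token-signing-certs", TARGET_NS)
time.sleep(60)

# After recreation only the newest public key is present, so tokens signed
# with the original private key can no longer be validated.
cm = core.read_namespaced_config_map("bound-sa-token-signing-certs", TARGET_NS)
print("public keys after recreation:", sorted(cm.data))  # expect a single entry
```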
Scenario 2: OCPBUGS-8476: force the revocation of all the service account token issuers
- set a first custom service account issuer in the Authentication cluster object: https://docs.openshift.com/container-platform/4.12/authentication/bound-service-account-tokens.html
- replace this custom service account issuer with a new one
- set the custom service account issuer to `""`, which revokes all previously configured issuers and keeps only the default `"https://kubernetes.default.svc"`
- BOOM - tokens minted for the revoked issuers are no longer accepted (see the sketch after this list)
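A minimal sketch of these issuer changes, again using the Python kubernetes client, is below. The custom issuer URLs are placeholders, and the patch body is an assumption based on the documented `spec.serviceAccountIssuer` field of the cluster Authentication object.

```python
# Minimal sketch of Scenario 2: cycling spec.serviceAccountIssuer on the
# cluster-scoped Authentication object. Issuer URLs are placeholders.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

def set_issuer(issuer: str) -> None:
    """Patch the cluster Authentication object with the given issuer."""
    custom.patch_cluster_custom_object(
        group="config.openshift.io",
        version="v1",
        plural="authentications",
        name="cluster",
        body={"spec": {"serviceAccountIssuer": issuer}},
    )

# Step 1: set a first custom issuer.
set_issuer("https://issuer-one.example.com")

# Step 2: replace it with a second custom issuer.
set_issuer("https://issuer-two.example.com")

# Step 3: clear the field; only the default "https://kubernetes.default.svc"
# issuer remains trusted, so tokens minted for the custom issuers are rejected.
set_issuer("")
```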