Bug
Resolution: Unresolved
Normal
4.20
Quality / Stability / Reliability
Moderate
When QE tested the upstream AllowUnsafeMalformedObjectDeletion feature, they noticed that the kas-operator started reporting Degraded with:
InstallerControllerDegraded: missing required resources: secrets: etcd-client-16,localhost-recovery-client-token-16,localhost-recovery-serving-certkey-16
While looking at the logs, we noticed that the controller was frantically churning through revisions; within the span of a few minutes it had already reached revision 100+.
This was caused by the test intentionally corrupting a secret. The full test script is at https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-80286, but the gist is:
- create a generic secret in a "test" namespace:
  oc create secret generic test-secret -n test
- corrupt the secret's stored value in etcd, from inside an etcd member pod:
  oc exec -it pod/etcd-ip-10-0-39-82.us-east-2.compute.internal -n openshift-etcd -- /bin/bash
  etcdctl put /kubernetes.io/secrets/test/test-secret "corrupted-data"
Some disruption was expected, since we knew informers would fail to list secrets/configmaps if one of them was corrupted/undecryptable, but we weren't expecting:
- A secret in a "test" namespace to affect kas-o
- The revision controller to run wild and start creating revisions over and over.
So there are potentially 2 issues here:
- The revision controller probably has a bug: its getters can't see that the required resources are present because their underlying informers can't get a fresh LIST from the kube-apiserver due to the corrupted object (see the sketch after this list).
- kas-o is watching resources it shouldn't.
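To make the first point concrete, here is a minimal, self-contained sketch using plain client-go, not the operator's actual wiring; the secret name and namespace are only taken from the error message above as examples. The idea is that a getter backed by an informer lister only sees what the informer cache holds, and that cache stays empty for as long as the initial LIST of secrets keeps failing because of the corrupted object.

  package main

  import (
  	"fmt"
  	"time"

  	"k8s.io/client-go/informers"
  	"k8s.io/client-go/kubernetes"
  	"k8s.io/client-go/tools/cache"
  	"k8s.io/client-go/tools/clientcmd"
  )

  func main() {
  	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
  	if err != nil {
  		panic(err)
  	}
  	client := kubernetes.NewForConfigOrDie(cfg)

  	// Cluster-wide secret informer (no namespace filter), similar in scope
  	// to what the operator effectively watches today.
  	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
  	secretInformer := factory.Core().V1().Secrets()
  	secretLister := secretInformer.Lister()

  	stopCh := make(chan struct{})
  	defer close(stopCh)
  	factory.Start(stopCh)

  	// While any secret in the watched scope is undecryptable, the initial
  	// LIST keeps failing and the cache never syncs; bound the wait so the
  	// sketch terminates instead of blocking forever.
  	timeout := make(chan struct{})
  	time.AfterFunc(30*time.Second, func() { close(timeout) })
  	if !cache.WaitForCacheSync(timeout, secretInformer.Informer().HasSynced) {
  		fmt.Println("secret informer never synced")
  	}

  	// A getter that consults only the lister then reports NotFound for a
  	// secret that does exist in the API server, because the cache is empty.
  	_, err = secretLister.Secrets("openshift-kube-apiserver").Get("etcd-client-16")
  	fmt.Println("lister Get:", err)
  }

If the revision controller's "missing required resources" check goes through lister-backed getters like this, it would explain how it can keep concluding that secrets are missing while they are actually present.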
The second issue is probably due to the fact that what we call `kubeInformersForNamespaces` in the codebase actually contains an informer factory that builds informers watching all namespaces (the empty string means no namespace filter in Kubernetes):
https://github.com/openshift/cluster-openshift-apiserver-operator/blob/6867bc1cff74ab2305a19a51f6e0bf1cff1a5954/pkg/operator/starter.go#L101-L102
and then this factory is used to build getters:
https://github.com/openshift/cluster-openshift-apiserver-operator/blob/6867bc1cff74ab2305a19a51f6e0bf1cff1a5954/pkg/operator/starter.go#L318-L319 and is also passed down to each individual controller, which probably also uses it incorrectly.
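As a rough illustration of what the empty-string namespace buys you, here is a plain client-go sketch (not the operator's actual code; the scoped namespace below is just an example) contrasting an all-namespaces factory with one scoped to a single namespace:

  package main

  import (
  	"time"

  	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  	"k8s.io/client-go/informers"
  	"k8s.io/client-go/kubernetes"
  	"k8s.io/client-go/tools/clientcmd"
  )

  func main() {
  	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
  	if err != nil {
  		panic(err)
  	}
  	client := kubernetes.NewForConfigOrDie(cfg)

  	// metav1.NamespaceAll is the empty string: no namespace filter, so the
  	// resulting secret informer LISTs/WATCHes secrets in every namespace,
  	// and a corrupted secret in an unrelated "test" namespace breaks its cache.
  	clusterWide := informers.NewSharedInformerFactoryWithOptions(
  		client, 10*time.Minute,
  		informers.WithNamespace(metav1.NamespaceAll),
  	)
  	_ = clusterWide.Core().V1().Secrets().Lister()

  	// A factory scoped to a namespace the operator actually cares about only
  	// LISTs/WATCHes that namespace, so objects elsewhere cannot poison the cache.
  	scoped := informers.NewSharedInformerFactoryWithOptions(
  		client, 10*time.Minute,
  		informers.WithNamespace("openshift-kube-apiserver"),
  	)
  	_ = scoped.Core().V1().Secrets().Lister()
  }

Restricting the factory to the namespaces the operator needs would keep a corrupted object in an arbitrary namespace from taking down its informers.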
Some additional context can be found in https://redhat-internal.slack.com/archives/CC3CZCQHM/p1740759235587259, but the thread is huge.
Split to: OCPBUGS-59626 kas-operator is watching secrets from all namespaces (POST)