OpenShift Bugs / OCPBUGS-59625

kas-operator is creating new revisions frantically when an object is corrupted


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: 4.20
    • Component/s: kube-apiserver
    • Quality / Stability / Reliability
    • Severity: Moderate

      When QE tested the AllowUnsafeMalformedObjectDeletion upstream feature, they noticed that the kas-operator started reporting degraded with:

      InstallerControllerDegraded: missing required resources: secrets: etcd-client-16,localhost-recovery-client-token-16,localhost-recovery-serving-certkey-16
      

      While looking at the logs, we noticed that the controller was frantically churning through revisions; within a few minutes it had already reached revision 100+.

      This was caused by the test intentionally corrupting a secret. The full test script is at https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-80286, but the gist is:

      1. create a generic secret in a "test" namespace
        oc create secret generic test-secret -n test
      2. corrupt the secret in etcd
        oc exec -it pod/etcd-ip-10-0-39-82.us-east-2.compute.internal -n openshift-etcd -- /bin/bash
        etcdctl put /kubernetes.io/secrets/test/test-secret "corrupted-data"
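
      For context, the feature under test exists precisely so that an admin can force-delete such an unreadable object. Below is a minimal Go sketch of that delete call, assuming the upstream 1.32 alpha API surface (the IgnoreStoreReadErrorWithClusterBreakingPotential field of metav1.DeleteOptions, gated behind AllowUnsafeMalformedObjectDeletion); the namespace and name come from the steps above:

        // Sketch: force-delete the corrupted secret once the
        // AllowUnsafeMalformedObjectDeletion feature gate is enabled.
        package main

        import (
            "context"
            "fmt"

            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            "k8s.io/client-go/kubernetes"
            "k8s.io/client-go/tools/clientcmd"
            "k8s.io/utils/ptr"
        )

        func main() {
            cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
            if err != nil {
                panic(err)
            }
            client := kubernetes.NewForConfigOrDie(cfg)

            // A regular delete fails because the apiserver cannot read the stored
            // object back; this option asks it to skip that read and delete anyway.
            err = client.CoreV1().Secrets("test").Delete(context.TODO(), "test-secret", metav1.DeleteOptions{
                IgnoreStoreReadErrorWithClusterBreakingPotential: ptr.To(true),
            })
            fmt.Println("delete result:", err)
        }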

      Some disruption was expected, since we knew that informers would fail to list secrets/configmaps if one of them was corrupted/undecryptable, but we weren't expecting:

      1. A secret in a "test" namespace to affect kas-o
      2. The revision controller to run wild and start creating revisions over and over.

      So there are potentially 2 issues here:

      1. The revision controller probably has a bug where its getters can't see that the required resources are present, because their underlying informers can't get a fresh LIST from kas due to the corrupted object.
      2. kas-o is watching resources it shouldn't.
        This is probably due to the fact that what we call `kubeInformersForNamespaces` in the codebase actually contains an informer factory that builds informers watching all namespaces (an empty string means no namespace filter in Kubernetes; see the sketch after this list):
        https://github.com/openshift/cluster-openshift-apiserver-operator/blob/6867bc1cff74ab2305a19a51f6e0bf1cff1a5954/pkg/operator/starter.go#L101-L102
        and this factory is then used to build getters:
        https://github.com/openshift/cluster-openshift-apiserver-operator/blob/6867bc1cff74ab2305a19a51f6e0bf1cff1a5954/pkg/operator/starter.go#L318-L319 and is also passed down to each individual controller, which probably also uses it incorrectly.
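
      To illustrate the namespace-scoping point, here is a minimal client-go sketch (not the operator's actual wiring; the "openshift-kube-apiserver" namespace below is only an example of a scoped factory). A SharedInformerFactory built with the empty namespace lists and watches secrets across every namespace, so a single corrupted secret in "test" fails its LIST and it never syncs, whereas a namespace-scoped factory is unaffected:

        package main

        import (
            "time"

            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            "k8s.io/client-go/informers"
            "k8s.io/client-go/kubernetes"
            "k8s.io/client-go/tools/clientcmd"
        )

        func main() {
            cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
            if err != nil {
                panic(err)
            }
            client := kubernetes.NewForConfigOrDie(cfg)

            stop := make(chan struct{})
            defer close(stop)

            // metav1.NamespaceAll is "", which is effectively what kubeInformersForNamespaces
            // ends up with: the secret informer LISTs/WATCHes secrets in all namespaces, so a
            // corrupted secret anywhere (e.g. test/test-secret) fails the LIST, the informer
            // never syncs, and every getter built on top of it reports resources as missing.
            allNamespaces := informers.NewSharedInformerFactoryWithOptions(
                client, 10*time.Minute, informers.WithNamespace(metav1.NamespaceAll))
            _ = allNamespaces.Core().V1().Secrets().Lister()
            allNamespaces.Start(stop)

            // A namespace-scoped factory only lists/watches its own namespace, so corruption
            // in an unrelated namespace like "test" cannot take it down.
            scoped := informers.NewSharedInformerFactoryWithOptions(
                client, 10*time.Minute, informers.WithNamespace("openshift-kube-apiserver"))
            _ = scoped.Core().V1().Secrets().Lister()
            scoped.Start(stop)
        }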

      Some additional context can be found in https://redhat-internal.slack.com/archives/CC3CZCQHM/p1740759235587259, but the thread is huge.

              Assignee: Unassigned
              Reporter: Damien Grisonnet (dgrisonn@redhat.com)
              QA Contact: Ke Wang