OpenShift Bugs / OCPBUGS-59625

kas-operator is creating new revisions frantically when an object is corrupted


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: 4.20
    • Component/s: kube-apiserver
    • Quality / Stability / Reliability
    • Severity: Moderate

      When QE tested the AllowUnsafeMalformedObjectDeletion upstream feature, they noticed that the kas-operator started reporting degraded with:

      InstallerControllerDegraded: missing required resources: secrets: etcd-client-16,localhost-recovery-client-token-16,localhost-recovery-serving-certkey-16
      

      While looking at the logs, we noticed that the controller was frantically churning through revisions; within a few minutes it had already reached revision 100+.

      This was caused by the test intentionally corrupting a secret. The full test script is at https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-80286, but the gist is:

      1. create a generic secret in a "test" namespace
        oc create secret generic test-secret -n test
      2. corrupt the secret in etcd
        oc exec -it pod/etcd-ip-10-0-39-82.us-east-2.compute.internal -n openshift-etcd -- /bin/bash
        etcdctl put /kubernetes.io/secrets/test/test-secret "corrupted-data"
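
      For context, the feature under test exists precisely so that an admin can force-delete such an unreadable object. Below is a minimal Go sketch of that delete call, assuming the upstream 1.32 alpha API surface (the IgnoreStoreReadErrorWithClusterBreakingPotential field of metav1.DeleteOptions, gated behind AllowUnsafeMalformedObjectDeletion); the namespace and name come from the steps above:

        // Sketch: force-delete the corrupted secret once the
        // AllowUnsafeMalformedObjectDeletion feature gate is enabled.
        package main

        import (
            "context"
            "fmt"

            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            "k8s.io/client-go/kubernetes"
            "k8s.io/client-go/tools/clientcmd"
            "k8s.io/utils/ptr"
        )

        func main() {
            cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
            if err != nil {
                panic(err)
            }
            client := kubernetes.NewForConfigOrDie(cfg)

            // A regular delete fails because the apiserver cannot read the stored
            // object back; this option asks it to skip that read and delete anyway.
            err = client.CoreV1().Secrets("test").Delete(context.TODO(), "test-secret", metav1.DeleteOptions{
                IgnoreStoreReadErrorWithClusterBreakingPotential: ptr.To(true),
            })
            fmt.Println("delete result:", err)
        }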

      Some disruption was expected, since we knew that informers would fail to list secrets/configmaps if one of them was corrupted/undecryptable, but we weren't expecting:

      1. A secret in a "test" namespace to affect kas-o
      2. The revision controller to run wild and start creating revisions over and over.

      So there are potentially 2 issues here:

      1. The revision controller probably has a bug where its getters can't see that the required resources are present, because their underlying informers can't get a fresh LIST from kas due to the corrupted object.
      2. kas-o is watching resources it shouldn't.
        This is probably due to the fact that what we call `kubeInformersForNamespaces` in the codebase actually contains an informer factory that builds informers watching all namespaces (an empty string means no namespace filter in Kubernetes; see the sketch after this list):
        https://github.com/openshift/cluster-openshift-apiserver-operator/blob/6867bc1cff74ab2305a19a51f6e0bf1cff1a5954/pkg/operator/starter.go#L101-L102
        and this factory is then used to build getters:
        https://github.com/openshift/cluster-openshift-apiserver-operator/blob/6867bc1cff74ab2305a19a51f6e0bf1cff1a5954/pkg/operator/starter.go#L318-L319 and is also passed down to each individual controller, which probably also uses it incorrectly.
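
      To illustrate the namespace-scoping point, here is a minimal client-go sketch (not the operator's actual wiring; the "openshift-kube-apiserver" namespace below is only an example of a scoped factory). A SharedInformerFactory built with the empty namespace lists and watches secrets across every namespace, so a single corrupted secret in "test" fails its LIST and it never syncs, whereas a namespace-scoped factory is unaffected:

        package main

        import (
            "time"

            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            "k8s.io/client-go/informers"
            "k8s.io/client-go/kubernetes"
            "k8s.io/client-go/tools/clientcmd"
        )

        func main() {
            cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
            if err != nil {
                panic(err)
            }
            client := kubernetes.NewForConfigOrDie(cfg)

            stop := make(chan struct{})
            defer close(stop)

            // metav1.NamespaceAll is "", which is effectively what kubeInformersForNamespaces
            // ends up with: the secret informer LISTs/WATCHes secrets in all namespaces, so a
            // corrupted secret anywhere (e.g. test/test-secret) fails the LIST, the informer
            // never syncs, and every getter built on top of it reports resources as missing.
            allNamespaces := informers.NewSharedInformerFactoryWithOptions(
                client, 10*time.Minute, informers.WithNamespace(metav1.NamespaceAll))
            _ = allNamespaces.Core().V1().Secrets().Lister()
            allNamespaces.Start(stop)

            // A namespace-scoped factory only lists/watches its own namespace, so corruption
            // in an unrelated namespace like "test" cannot take it down.
            scoped := informers.NewSharedInformerFactoryWithOptions(
                client, 10*time.Minute, informers.WithNamespace("openshift-kube-apiserver"))
            _ = scoped.Core().V1().Secrets().Lister()
            scoped.Start(stop)
        }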

      Some additional context can be found in https://redhat-internal.slack.com/archives/CC3CZCQHM/p1740759235587259, but the thread is huge.

              Assignee: Unassigned
              Reporter: Damien Grisonnet (dgrisonn@redhat.com)
              QA Contact: Ke Wang