-
Bug
-
Resolution: Done-Errata
-
Major
-
None
-
4.16
-
Critical
-
Yes
-
CLOUD Sprint 254
-
1
-
Approved
-
False
-
-
N/A
-
Release Note Not Required
-
Done
-
As of OpenShift 4.16, CRD management is more complex. This is an artifact of improvements made to feature gates and feature sets. deads@redhat.com and I agreed that, to avoid confusion, we should aim to stop having CRDs installed via operator repos, and, if their types live in o/api, install them from there instead.
We started this by moving the ControlPlaneMachineSet back to o/api, which is part of the MachineAPI capability.
Unbeknown to us at the time, the installer currently works such that all rendered resources are applied by the cluster-bootstrap tool, roughly here, and not by CVO.
Cluster-bootstrap is not capability aware, so it installed the CPMS CRD, which in turn broke the check in the CSR approver that stops it from crashing on clusters without MachineAPI.
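To illustrate how such a check can be fooled, here is a minimal sketch. It assumes the approver decides whether MachineAPI is present by looking for the machine.openshift.io API group rather than for the Machine kind itself; the source does not show the actual check, so this is only one plausible reading. The discovery data and helper names are hypothetical.

```python
# Hypothetical discovery data for a cluster where MachineAPI is disabled but
# the CPMS CRD was still applied by cluster-bootstrap.
discovery = {
    "machine.openshift.io/v1": ["ControlPlaneMachineSet"],
}


def group_present(discovery, group):
    """Group-level check: true if any version of the API group is served."""
    return any(gv.split("/")[0] == group for gv in discovery)


def kind_present(discovery, group_version, kind):
    """Kind-level check: true only if the exact kind is served."""
    return kind in discovery.get(group_version, [])


# The group-level check passes because of the stray CPMS CRD...
group_ok = group_present(discovery, "machine.openshift.io")
# ...but the Machine kind that the approver lists is not actually served.
machine_ok = kind_present(discovery, "machine.openshift.io/v1beta1", "Machine")
```

Under that assumption, the group-level check reports MachineAPI as available while listing Machines fails, which matches the "no matches for kind" errors in the approver logs.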
Options for moving forward include:
- Reverting the move (complex)
- Making the API render somehow understand capabilities and remove any CRD from a disabled cap
- Make the cluster-bootstrap tool filter for caps
I'm not sure at present whether the 2nd or 3rd option is better, nor am I sure how the "renderers" would learn which caps are enabled. Could the installer provide them as args in bootkube.sh.template?
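The filtering idea behind the 2nd and 3rd options can be sketched as follows. This is a rough illustration, not the cluster-bootstrap or o/api implementation. It assumes manifests declare their capability through the capability.openshift.io/name annotation, with multiple capabilities joined by "+", and that the enabled capabilities are handed in as a plain list (for example as args). The function names and sample manifests are hypothetical.

```python
CAPABILITY_ANNOTATION = "capability.openshift.io/name"


def required_capabilities(manifest):
    """Return the set of capabilities a manifest declares (empty if none)."""
    metadata = manifest.get("metadata") or {}
    annotations = metadata.get("annotations") or {}
    value = annotations.get(CAPABILITY_ANNOTATION, "")
    return {name for name in value.split("+") if name}


def filter_manifests(manifests, enabled_capabilities):
    """Keep only manifests whose declared capabilities are all enabled."""
    enabled = set(enabled_capabilities)
    return [m for m in manifests if required_capabilities(m) <= enabled]


# Hypothetical rendered manifests: one CRD tied to MachineAPI, one untagged.
manifests = [
    {
        "kind": "CustomResourceDefinition",
        "metadata": {
            "name": "controlplanemachinesets.machine.openshift.io",
            "annotations": {CAPABILITY_ANNOTATION: "MachineAPI"},
        },
    },
    {"kind": "Namespace", "metadata": {"name": "openshift-example"}},
]

enabled = ["NodeTuning", "ImageRegistry", "OperatorLifecycleManager"]
kept = filter_manifests(manifests, enabled)
kept_names = [m["metadata"]["name"] for m in kept]
```

Manifests without the annotation are always kept, so only resources explicitly tied to a disabled capability are dropped.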
Original bug below, description of what's happening above
Description of problem:
After running tests on an SNO with the Telco DU profile for a couple of hours, kubernetes.io/kubelet-serving CSRs in Pending state start showing up and accumulate over time.
Version-Release number of selected component (if applicable):
4.16.0-rc.1
How reproducible:
once so far
Steps to Reproduce:
1. Deploy SNO with DU profile with disabled capabilities:
installConfigOverrides: "{\"capabilities\":{\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [ \"NodeTuning\", \"ImageRegistry\", \"OperatorLifecycleManager\" ] }}"
2. Leave the node running tests overnight for a couple of hours
3. Check for Pending CSRs
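As a quick sanity check of what the override in step 1 enables, the sketch below parses the same JSON and lists the resulting capabilities. It only handles the baselineCapabilitySet "None" case used here and ignores any capabilities that might be enabled implicitly. The function name is hypothetical.

```python
import json

# The installConfigOverrides value from step 1, unescaped.
overrides = (
    '{"capabilities":{"baselineCapabilitySet": "None", '
    '"additionalEnabledCapabilities": '
    '[ "NodeTuning", "ImageRegistry", "OperatorLifecycleManager" ] }}'
)


def enabled_capabilities(overrides_json):
    """Return the explicitly enabled capabilities for a 'None' baseline."""
    caps = json.loads(overrides_json).get("capabilities", {})
    baseline = caps.get("baselineCapabilitySet")
    if baseline != "None":
        # Other baseline sets expand to version-specific lists not modelled here.
        raise ValueError(f"unsupported baseline in this sketch: {baseline!r}")
    return set(caps.get("additionalEnabledCapabilities", []))


enabled = enabled_capabilities(overrides)
machine_api_enabled = "MachineAPI" in enabled
```

With this configuration MachineAPI is not in the enabled set, so no machine.openshift.io resources are expected on the cluster.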
Actual results:
oc get csr -A | grep Pending | wc -l
27
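For a count that does not depend on grepping table output, the same check can be sketched against the JSON form of the CSR list (for example, the output of oc get csr -o json). A CSR is treated as Pending when it has neither an Approved nor a Denied condition. The sample data below is made up for illustration and the helper names are hypothetical.

```python
KUBELET_SERVING = "kubernetes.io/kubelet-serving"


def is_pending(csr):
    """A CSR is pending if it carries no Approved or Denied condition."""
    conditions = (csr.get("status") or {}).get("conditions") or []
    return not any(c.get("type") in {"Approved", "Denied"} for c in conditions)


def pending_kubelet_serving(csr_list):
    """Return names of pending CSRs for the kubelet-serving signer."""
    return [
        item["metadata"]["name"]
        for item in csr_list.get("items", [])
        if (item.get("spec") or {}).get("signerName") == KUBELET_SERVING
        and is_pending(item)
    ]


# Made-up sample: one pending and one approved kubelet-serving CSR.
sample = {
    "items": [
        {
            "metadata": {"name": "csr-example-1"},
            "spec": {"signerName": KUBELET_SERVING},
            "status": {},
        },
        {
            "metadata": {"name": "csr-example-2"},
            "spec": {"signerName": KUBELET_SERVING},
            "status": {"conditions": [{"type": "Approved", "status": "True"}]},
        },
    ]
}

pending = pending_kubelet_serving(sample)
```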
Expected results:
No pending CSRs
Also oc logs will return a tls internal error:
oc -n openshift-cluster-machine-approver --insecure-skip-tls-verify-backend=true logs machine-approver-866c94c694-7dwks
Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, machine-approver-controller
Error from server: Get "https://[2620:52:0:8e6::d0]:10250/containerLogs/openshift-cluster-machine-approver/machine-approver-866c94c694-7dwks/kube-rbac-proxy": remote error: tls: internal error
Additional info:
Checking the machine-approver-controller container logs on the node, we can see that reconciliation is failing because it cannot find the Machine API, which is disabled via the capabilities.
I0514 13:25:09.266546 1 controller.go:120] Reconciling CSR: csr-dw9c8
E0514 13:25:09.275585 1 controller.go:138] csr-dw9c8: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"
E0514 13:25:09.275665 1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-dw9c8" namespace="" name="csr-dw9c8" reconcileID="6f963337-c6f1-46e7-80c4-90494d21653c"
I0514 13:25:43.792140 1 controller.go:120] Reconciling CSR: csr-jvrvt
E0514 13:25:43.798079 1 controller.go:138] csr-jvrvt: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"
E0514 13:25:43.798128 1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-jvrvt" namespace="" name="csr-jvrvt" reconcileID="decbc5d9-fa10-45d1-92f1-1c999df956ff"
- blocks
-
OCPBUGS-34398 Cluster Bootstrap does not account for capabilities in rendered manifests
- Closed
- clones
-
OCPBUGS-33644 kubelet-serving CSRs in Pending state on SNO with Telco DU with disabled capabilities
- Closed
- is cloned by
-
OCPBUGS-34398 Cluster Bootstrap does not account for capabilities in rendered manifests
- Closed
- is related to
-
OCPBUGS-34059 Cluster Bootstrap cannot filter resources
- New
- links to
-
RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update