OpenShift Bugs / OCPBUGS-33882

Cluster Bootstrap does not account for capabilities in rendered manifests

    • Critical
    • Yes
    • CLOUD Sprint 254
    • 1
    • Approved
    • False

      As of OpenShift 4.16, CRD management is more complex. This is an artifact of improvements made to feature gates and feature sets. deads@redhat.com and I agreed that, to avoid confusion, we should stop installing CRDs from operator repos and, where their types live in o/api, install them from there instead.

      We started this by moving the ControlPlaneMachineSet (CPMS) CRD back to o/api. That CRD is part of the MachineAPI capability.

      Unbeknown to us at the time, the installer currently has all rendered resources applied by the cluster-bootstrap tool (roughly here), not by CVO.

      Cluster-bootstrap is not capability-aware, so it installed the CPMS CRD even though the MachineAPI capability was disabled. That in turn broke the check in the CSR approver that stops it from crashing on clusters without MachineAPI.
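To illustrate how a stray CRD could defeat such a check, here is a minimal sketch. It assumes, purely for illustration, that the approver decides whether MachineAPI is present by looking for the machine.openshift.io API group rather than for the Machine kind itself; the actual check in the approver is not shown in this report. The `discovery` mapping and both helper functions are hypothetical stand-ins for API discovery.

```python
# Hypothetical model of API discovery: group/version -> kinds served.
# On a cluster with MachineAPI disabled, the only thing in the
# machine.openshift.io group is the CPMS CRD applied by cluster-bootstrap.
discovery = {
    "machine.openshift.io/v1": {"ControlPlaneMachineSet"},
    "certificates.k8s.io/v1": {"CertificateSigningRequest"},
}


def group_present(discovery, group):
    """Coarse check: is any version of this API group served at all?"""
    return any(gv.split("/")[0] == group for gv in discovery)


def kind_served(discovery, group_version, kind):
    """Precise check: is this exact kind served at this group/version?"""
    return kind in discovery.get(group_version, set())


# The coarse check is satisfied by the CPMS CRD alone...
coarse = group_present(discovery, "machine.openshift.io")
# ...but the kind the approver actually lists is not there.
precise = kind_served(discovery, "machine.openshift.io/v1beta1", "Machine")

print(f"group present: {coarse}, Machine v1beta1 served: {precise}")
```

Under that assumption, the coarse check passes while listing Machine objects fails, which matches the "no matches for kind" errors in the logs further down.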

      Options for moving forward include:

      • Reverting the move (complex)
      • Making the API render step understand capabilities and remove any CRD that belongs to a disabled capability
      • Making the cluster-bootstrap tool filter manifests by capability

      I'm not sure at present whether the 2nd or the 3rd option is better, nor how the "renderers" would learn which capabilities are enabled. Could the installer provide them as args in bootkube.sh.template?
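Either of the last two options comes down to the same filtering step. A minimal sketch, assuming manifests carry the capability.openshift.io/name annotation that CVO uses, with multiple capabilities joined by "+", and assuming the enabled set has somehow been handed to the tool (how is exactly the open question above). The function names and the second sample manifest are hypothetical.

```python
CAPABILITY_ANNOTATION = "capability.openshift.io/name"


def required_capabilities(manifest):
    """Return the set of capabilities a manifest is annotated with."""
    annotations = manifest.get("metadata", {}).get("annotations") or {}
    value = annotations.get(CAPABILITY_ANNOTATION, "")
    return {cap for cap in value.split("+") if cap}


def filter_manifests(manifests, enabled):
    """Keep only manifests whose required capabilities are all enabled.

    Manifests without the annotation require nothing and are always kept.
    """
    enabled = set(enabled)
    return [m for m in manifests if required_capabilities(m) <= enabled]


manifests = [
    {
        "kind": "CustomResourceDefinition",
        "metadata": {
            "name": "controlplanemachinesets.machine.openshift.io",
            "annotations": {CAPABILITY_ANNOTATION: "MachineAPI"},
        },
    },
    {
        # Hypothetical CRD with no capability annotation.
        "kind": "CustomResourceDefinition",
        "metadata": {"name": "examples.example.openshift.io"},
    },
]

# Enabled set from the reproducer in this bug.
enabled = {"NodeTuning", "ImageRegistry", "OperatorLifecycleManager"}

kept = filter_manifests(manifests, enabled)
print([m["metadata"]["name"] for m in kept])
```

With the reproducer's capability set, the CPMS CRD is dropped and only the unannotated manifest is applied.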


      The original bug report is below; the description above explains what is happening.


      Description of problem:

      After running tests for a couple of hours on an SNO with the Telco DU profile, kubernetes.io/kubelet-serving CSRs in Pending state start showing up and accumulate over time.

      Version-Release number of selected component (if applicable):

      4.16.0-rc.1    

      How reproducible:

      once so far    

      Steps to Reproduce:

      1. Deploy SNO with the DU profile and with capabilities disabled:

          installConfigOverrides:  "{\"capabilities\":{\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [ \"NodeTuning\", \"ImageRegistry\", \"OperatorLifecycleManager\" ] }}"

      2. Leave the node running tests overnight

      3. Check for Pending CSRs
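To make the effect of the override in step 1 explicit, here is a small sketch that decodes the same JSON (unescaped) and lists what it enables. It only handles the baselineCapabilitySet "None" case used here and ignores any capabilities that might be enabled implicitly; the helper name is hypothetical.

```python
import json

# The installConfigOverrides value from step 1, with the shell escaping removed.
overrides = (
    '{"capabilities":{"baselineCapabilitySet": "None", '
    '"additionalEnabledCapabilities": '
    '["NodeTuning", "ImageRegistry", "OperatorLifecycleManager"]}}'
)


def explicitly_enabled(overrides_json):
    """Return the capabilities explicitly enabled by the override."""
    caps = json.loads(overrides_json).get("capabilities", {})
    if caps.get("baselineCapabilitySet") != "None":
        # Other baseline sets expand to version-specific lists not modelled here.
        raise ValueError("sketch only handles baselineCapabilitySet None")
    return set(caps.get("additionalEnabledCapabilities", []))


enabled = explicitly_enabled(overrides)
print(sorted(enabled))
print("MachineAPI enabled:", "MachineAPI" in enabled)
```

MachineAPI is not in the resulting set, so no Machine API CRDs should exist on this cluster.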
      

      Actual results:

      oc get csr -A | grep Pending | wc -l 
      27    
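The grep above counts every Pending CSR regardless of signer. A sketch of a narrower count over the JSON form of the same list (for example from `oc get csr -o json`), treating a CSR as pending when it has neither an Approved nor a Denied condition. The sample items and their names are made up for illustration and are not from this cluster.

```python
def is_pending(csr):
    """A CSR is pending when it has neither an Approved nor a Denied condition."""
    conditions = csr.get("status", {}).get("conditions") or []
    types = {c.get("type") for c in conditions}
    return not ({"Approved", "Denied"} & types)


def pending_by_signer(csr_list):
    """Count pending CSRs per spec.signerName."""
    counts = {}
    for csr in csr_list.get("items", []):
        if is_pending(csr):
            signer = csr.get("spec", {}).get("signerName", "")
            counts[signer] = counts.get(signer, 0) + 1
    return counts


# Hypothetical sample in the shape of a CertificateSigningRequest list.
sample = {
    "items": [
        {
            "metadata": {"name": "csr-example-a"},
            "spec": {"signerName": "kubernetes.io/kubelet-serving"},
            "status": {},
        },
        {
            "metadata": {"name": "csr-example-b"},
            "spec": {"signerName": "kubernetes.io/kubelet-serving"},
            "status": {"conditions": []},
        },
        {
            "metadata": {"name": "csr-example-c"},
            "spec": {"signerName": "kubernetes.io/kubelet-serving"},
            "status": {"conditions": [{"type": "Approved", "status": "True"}]},
        },
    ]
}

print(pending_by_signer(sample))
```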

      Expected results:

      No pending CSRs    
      
      Also, as part of the actual results, oc logs returns a TLS internal error:
      
      oc -n openshift-cluster-machine-approver --insecure-skip-tls-verify-backend=true logs machine-approver-866c94c694-7dwks 
      Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, machine-approver-controller
      Error from server: Get "https://[2620:52:0:8e6::d0]:10250/containerLogs/openshift-cluster-machine-approver/machine-approver-866c94c694-7dwks/kube-rbac-proxy": remote error: tls: internal error
      

      Additional info:

      Checking the machine-approver-controller container logs on the node, we can see that reconciliation is failing because it cannot find the Machine API, which is disabled via the capabilities.
      
      I0514 13:25:09.266546       1 controller.go:120] Reconciling CSR: csr-dw9c8
      E0514 13:25:09.275585       1 controller.go:138] csr-dw9c8: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"
      E0514 13:25:09.275665       1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-dw9c8" namespace="" name="csr-dw9c8" reconcileID="6f963337-c6f1-46e7-80c4-90494d21653c"
      I0514 13:25:43.792140       1 controller.go:120] Reconciling CSR: csr-jvrvt
      E0514 13:25:43.798079       1 controller.go:138] csr-jvrvt: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"
      E0514 13:25:43.798128       1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-jvrvt" namespace="" name="csr-jvrvt" reconcileID="decbc5d9-fa10-45d1-92f1-1c999df956ff" 
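When there are many such lines, the affected CSRs can be pulled out of the log mechanically. A sketch using the controller.go:138 message format seen above; the regular expression is my own and is only matched against these sample lines.

```python
import re

# Lines copied from the log excerpt above.
log_lines = [
    'I0514 13:25:09.266546       1 controller.go:120] Reconciling CSR: csr-dw9c8',
    'E0514 13:25:09.275585       1 controller.go:138] csr-dw9c8: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"',
    'I0514 13:25:43.792140       1 controller.go:120] Reconciling CSR: csr-jvrvt',
    'E0514 13:25:43.798079       1 controller.go:138] csr-jvrvt: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"',
]

# Error-level line: E<mmdd> <time> <pid> <file:line>] <csr>: Failed to list machines
FAILED_LIST = re.compile(
    r"^E\d{4} \S+\s+\d+ \S+\] (?P<csr>[^:\s]+): Failed to list machines"
)


def csrs_failing_machine_list(lines):
    """Return CSR names, in order of first appearance, that hit the error."""
    seen = []
    for line in lines:
        match = FAILED_LIST.match(line)
        if match and match.group("csr") not in seen:
            seen.append(match.group("csr"))
    return seen


failed = csrs_failing_machine_list(log_lines)
print(failed)
```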

            joelspeed Joel Speed
            joelspeed Joel Speed
            Zhaohua Sun Zhaohua Sun