OCPBUGS-25821: cert issues during or after 4.14 to 4.15 upgrade

    • Critical
    • No
    • 13
    • MCO Sprint 248, MCO Sprint 249, MCO Sprint 250, MCO Sprint 251, MCO Sprint 252
    • 5
    • Rejected
    • False
    • Regression in 4.15.
    • * Previously, when the `kube-apiserver` server Certificate Authority (CA) certificate was rotated, the Machine Config Operator (MCO) did not properly react to the rotation or update the on-disk kubelet kubeconfig. This meant that the kubelet and some pods on the node were eventually unable to communicate with the API server, causing the node to enter the `NotReady` state. With this release, the MCO reacts to the change, updates the on-disk kubeconfig so that authenticated communication with the API server continues after the certificate rotates, and restarts the kubelet and the machine-config daemon pod. The certificate authority has a 10-year validity, so this rotation should happen rarely and is generally non-disruptive. (link:https://issues.redhat.com/browse/OCPBUGS-25821[*OCPBUGS-25821*])
    • Bug Fix
    • Done

      Description of problem:

      Older clusters updating into or running 4.15.0-rc.0 (and possibly Engineering Candidates?) can have the Kube API server operator initiate certificate rollouts, including the api-int CA. Missing pieces in the pipeline for rolling the new CA out to kubelets and other consumers can lock the cluster up once the Kubernetes API servers transition to serving incoming requests with the new cert/CA pair. For example, nodes may go NotReady, with kubelets unable to report their status to an api-int endpoint that is now signed by a CA they do not yet trust.
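
      As an illustration of that failure mode (not part of the original report; the CA bundle path and the api-int address below are placeholder assumptions), here is a minimal Go sketch of how one might check whether the CA bundle a kubelet trusts still verifies the certificate api-int is currently serving:

// checkapica.go: verify the chain served by api-int against a local CA bundle,
// e.g. one extracted from /var/lib/kubelet/kubeconfig. The bundle path and the
// api-int address are illustrative assumptions.
package main

import (
    "crypto/tls"
    "crypto/x509"
    "fmt"
    "log"
    "os"
)

func main() {
    caPEM, err := os.ReadFile("ca-from-kubelet-kubeconfig.crt") // hypothetical extracted bundle
    if err != nil {
        log.Fatal(err)
    }
    pool := x509.NewCertPool()
    if !pool.AppendCertsFromPEM(caPEM) {
        log.Fatal("no CA certificates found in bundle")
    }

    // Connect without verification only to obtain the served chain...
    addr := "api-int.example.openshift.local:6443" // hypothetical api-int endpoint
    conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // ...then verify the served leaf against the local CA bundle, which is the
    // same trust check the kubelet's TLS client performs.
    certs := conn.ConnectionState().PeerCertificates
    opts := x509.VerifyOptions{Roots: pool, Intermediates: x509.NewCertPool()}
    for _, c := range certs[1:] {
        opts.Intermediates.AddCert(c)
    }
    if _, err := certs[0].Verify(opts); err != nil {
        fmt.Println("kubelet CA bundle does NOT trust api-int:", err)
        return
    }
    fmt.Println("kubelet CA bundle trusts what api-int is serving")
}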

      Version-Release number of selected component (if applicable):

      Seen in two updates from 4.14.6 to 4.15.0-rc.0. It is unclear whether Engineering Candidates were also exposed. 4.15.0-rc.1 and later will not be exposed because they have the fix for OCPBUGS-18761. They may still have broken logic for these CA rotations in place, but until the certificates are 8 or more years old, they will not trigger that broken logic.

      How reproducible:

      We're working on it. Maybe cluster-kube-apiserver-operator#1615.

      Actual results:

      Nodes go NotReady, with the kubelet failing to communicate with api-int because of `tls: failed to verify certificate: x509: certificate signed by unknown authority`.
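
      To make the error concrete, here is a self-contained Go sketch (illustrative only) that reproduces the same x509 failure: a serving certificate signed by a freshly rotated CA is checked against a trust pool that still holds only the old CA, mirroring a kubelet with a stale bundle.

// Reproduces "x509: certificate signed by unknown authority" locally.
package main

import (
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "crypto/x509"
    "crypto/x509/pkix"
    "fmt"
    "log"
    "math/big"
    "time"
)

// newCA creates a throwaway self-signed CA for the demonstration.
func newCA(cn string) (*x509.Certificate, *ecdsa.PrivateKey) {
    key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    tmpl := &x509.Certificate{
        SerialNumber:          big.NewInt(1),
        Subject:               pkix.Name{CommonName: cn},
        NotBefore:             time.Now(),
        NotAfter:              time.Now().Add(time.Hour),
        IsCA:                  true,
        KeyUsage:              x509.KeyUsageCertSign,
        BasicConstraintsValid: true,
    }
    der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
    if err != nil {
        log.Fatal(err)
    }
    cert, _ := x509.ParseCertificate(der)
    return cert, key
}

func main() {
    oldCA, _ := newCA("old-loadbalancer-serving-signer")
    newCACert, newCAKey := newCA("new-loadbalancer-serving-signer")

    // A stand-in api-int serving certificate, signed by the *new* CA.
    leafKey, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    leafTmpl := &x509.Certificate{
        SerialNumber: big.NewInt(2),
        Subject:      pkix.Name{CommonName: "api-int"},
        NotBefore:    time.Now(),
        NotAfter:     time.Now().Add(time.Hour),
    }
    der, err := x509.CreateCertificate(rand.Reader, leafTmpl, newCACert, &leafKey.PublicKey, newCAKey)
    if err != nil {
        log.Fatal(err)
    }
    leaf, _ := x509.ParseCertificate(der)

    // The client (kubelet stand-in) still trusts only the old CA.
    pool := x509.NewCertPool()
    pool.AddCert(oldCA)
    if _, err := leaf.Verify(x509.VerifyOptions{Roots: pool}); err != nil {
        fmt.Println("verification failed:", err) // x509: certificate signed by unknown authority
    }
}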

      Expected results:

      Happy certificate rollout.

      Additional info:

      Rolling the api-int CA is complicated, and we seem to be missing a number of steps. It's probably worth working out details in a GDoc or something where we have a shared space to fill out the picture.

      One piece is getting the api-int certificates out to the kubelet, where the flow seems to be:

      1. Kube API-server operator updates a Secret, like loadbalancer-serving-signer in openshift-kube-apiserver-operator (code).
      2. Kube API-server aggregates a number of certificates into the kube-apiserver-server-ca ConfigMap in the openshift-config-managed namespace (code).
      3. FIXME, possibly something in the Kube controller manager's ServiceAccount stack (and the serviceaccount-ca ConfigMap in openshift-kube-controller-manager) is handling getting the data from kube-apiserver-server-ca into node-bootstrapper-token?
      4. Machine-config operator consumes FIXME and writes a node-bootstrapper-token ServiceAccount Secret.
      5. Machine-config servers mount the node-bootstrapper-token Secret to /etc/mcs/bootstrap.
      6. Machine-config servers consume ca.crt from /etc/mcs/bootstrap-token and build a kubeconfig to serve in Ignition configs here as /etc/kubernetes/kubeconfig (code); a rough sketch of this templating follows the list.
      7. Bootimage Ignition lays down the MCS-served content into the local /etc/kubernetes/kubeconfig, but only when the node is first born.
      8. FIXME propagates /etc/kubernetes/kubeconfig to /var/lib/kubelet/kubeconfig (FIXME:code Possibly the kubelet via --bootstrap-kubeconfig).
      9. The kubelet consumes /var/lib/kubelet/kubeconfig and uses its CA trust store when connecting to api-int (code).
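
      As a rough sketch of the templating described in step 6 (hedged: the actual machine-config-server code differs, and the cluster name, server URL, and secret file names here are placeholder assumptions), embedding the CA from the bootstrap-token mount into a served kubeconfig might look like:

// Illustrative only: embed a CA bundle into a kubeconfig the way step 6
// describes. This is not the actual machine-config-server implementation.
package main

import (
    "encoding/base64"
    "fmt"
    "log"
    "os"
)

// Placeholder kubeconfig shape; server URL and names are assumptions.
const kubeconfigTemplate = `apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: https://api-int.example.openshift.local:6443
    certificate-authority-data: %s
contexts:
- name: kubelet
  context:
    cluster: local
    user: kubelet
current-context: kubelet
users:
- name: kubelet
  user:
    token: %s
`

func main() {
    ca, err := os.ReadFile("/etc/mcs/bootstrap-token/ca.crt") // mount path from the flow above
    if err != nil {
        log.Fatal(err)
    }
    token, err := os.ReadFile("/etc/mcs/bootstrap-token/token") // hypothetical token file name
    if err != nil {
        log.Fatal(err)
    }
    kubeconfig := fmt.Sprintf(kubeconfigTemplate,
        base64.StdEncoding.EncodeToString(ca), string(token))
    // This is the content that would be served in the Ignition config as
    // /etc/kubernetes/kubeconfig.
    if err := os.WriteFile("kubeconfig", []byte(kubeconfig), 0o600); err != nil {
        log.Fatal(err)
    }
}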

      That handles new-node creation, but not "the Kube API-server operator rolled the CA, and now we need to update existing nodes and `systemctl restart` their kubelets. And any pods using ServiceAccount kubeconfigs? And...?". This bug is about filling in those missing pieces in the cert-rolling pipeline (including having the Kube API server not use the new CA until it has been sufficiently rolled out to api-int clients, possibly including every ServiceAccount-consuming pod on the cluster), and anything else that seems broken with the early cert rolls.
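
      Purely as a hypothetical illustration of the kind of remediation this paragraph says is missing (not the actual MCO fix; the staging path is an assumption), an on-node reaction to a rolled api-int CA could look roughly like:

// Hypothetical sketch: react to an api-int CA rotation on an existing node by
// replacing the kubelet kubeconfig and restarting the kubelet. This is not the
// actual machine-config-daemon implementation.
package main

import (
    "log"
    "os"
    "os/exec"
)

func main() {
    // Assume some controller has staged an updated kubeconfig that embeds the
    // new CA (hypothetical path).
    updated, err := os.ReadFile("/run/updated-kubelet-kubeconfig")
    if err != nil {
        log.Fatal(err)
    }
    if err := os.WriteFile("/var/lib/kubelet/kubeconfig", updated, 0o600); err != nil {
        log.Fatal(err)
    }
    // Restart the kubelet so it reloads the trust bundle from the new kubeconfig.
    if out, err := exec.Command("systemctl", "restart", "kubelet").CombinedOutput(); err != nil {
        log.Fatalf("restarting kubelet: %v: %s", err, out)
    }
}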

      Somewhat relevant here is OCPBUGS-15367, which currently manages /etc/kubernetes/kubeconfig permissions in the machine-config daemon as a backstop for the file existing in the MCS-served Ignition config without being part of the rendered MachineConfig or the ControllerConfig stack.
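
      For completeness, that permissions backstop amounts to something like the following sketch (illustrative; the 0600 mode is an assumption, and the real logic lives in the machine-config daemon):

// Illustrative backstop: keep /etc/kubernetes/kubeconfig on a restrictive mode
// even though the file is laid down by Ignition rather than by a rendered
// MachineConfig.
package main

import (
    "log"
    "os"
)

func main() {
    const path = "/etc/kubernetes/kubeconfig"
    if _, err := os.Stat(path); err != nil {
        log.Fatal(err) // nothing to do if the file is missing or unreadable
    }
    if err := os.Chmod(path, 0o600); err != nil {
        log.Fatal(err)
    }
}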

              Yu Qi Zhang
              Alex Chvatal
              Sergio Regidor de la Rosa
              Shauna Diaz