
Bootstrap kubelet client cert should include system:serviceaccounts group

    • No
    • CLOUD Sprint 234
    • 1
    • Rejected
    • False
    • Setting blocker+ as this is an install time bug with no workaround.
    • * Previously, the bootstrap credentials used to request client credentials for control plane nodes did not include the generic, all service accounts group. As a result, the cluster machine approver ignored certificate signing requests (CSRs) created during this phase. In certain conditions, this prevented approval of CSRs during bootstrap and caused the installation to fail. With this release, the bootstrap credential includes the groups that the cluster machine approver expects for a service account. This change allows the machine approver to take over from the bootstrap CSR approver earlier in the cluster lifecycle and should reduce bootstrap failures related to CSR approval. (link:https://issues.redhat.com/browse/OCPBUGS-8349[*OCPBUGS-8349*])
    • Bug Fix
    • Done
    • Customer Escalated

      Description of problem:

      On a freshly installed cluster, the control-plane-machineset-operator begins rolling a new master node, but the machine remains in a Provisioned state and never joins as a node.
      
      Its status is:
      Drain operation currently blocked by: [{Name:EtcdQuorumOperator Owner:clusteroperator/etcd}]
      
      The cluster is left in this state until an admin manually removes the stuck master node, at which point a new master machine is provisioned and successfully joins the cluster.

      Version-Release number of selected component (if applicable):

      4.12.4

      How reproducible:

      Observed at least 4 times over the last week, but the steps to reproduce are unknown.

      Actual results:

      A master node remains in a stuck Provisioned state and requires manual deletion to unstick the control plane machine set process.

      Expected results:

      No manual interaction should be necessary.

      Additional info:

       

            [OCPBUGS-8349] Bootstrap kubelet client cert should include system:serviceaccounts group

            OpenShift Jira Automation Bot added a comment -

            Per the announcement sent regarding the removal of "Blocker" as an option in the Priority field, this issue (which was already closed at the time of the bulk update) had Priority = "Blocker." It is being updated to Priority = Critical. No additional fields were changed.

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: OpenShift Container Platform 4.14.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:5006

            Joel Speed added a comment -

            Yep, LGTM

            Zhaohua Sun added a comment -

            Move to verified.

            Joel Speed added a comment -

            Great, thanks rhn-support-zhsun. That shows my fix is working as expected, and we now believe the CSR approver should operate on these CSRs even when there are issues with CPMS!

            Zhaohua Sun added a comment -

            Thanks joelspeed. I set up a cluster (clusterversion: 4.14.0-0.ci-2023-03-29-023531) and checked that all node client CSRs (those from the node-bootstrapper service account) have 3 groups: `system:authenticated`, `system:serviceaccounts` and `system:serviceaccounts:openshift-machine-config-operator`. I also checked another 4.14 cluster that doesn't include this PR; there, some CSRs don't have `system:serviceaccounts`.

            Clusterversion: 4.14.0-0.ci-2023-03-29-023531

            $ oc get csr                                                                  
            NAME                                             AGE   SIGNERNAME                                    REQUESTOR                                                                         REQUESTEDDURATION   CONDITION
            csr-4pbv8                                        35m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Approved,Issued
            csr-57js9                                        35m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-176-67.us-east-2.compute.internal                             <none>              Approved,Issued
            csr-8bgwv                                        30m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Approved,Issued
            csr-c6v2x                                        35m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Approved,Issued
            csr-d5ztg                                        35m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Approved,Issued
            csr-mktz5                                        30m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Approved,Issued
            csr-nr5m4                                        27m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-196-121.us-east-2.compute.internal                            <none>              Approved,Issued
            csr-ph2tc                                        35m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-217-204.us-east-2.compute.internal                            <none>              Approved,Issued
            csr-pwfjl                                        35m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-145-18.us-east-2.compute.internal                             <none>              Approved,Issued
            csr-pz5zf                                        27m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Approved,Issued
            csr-r2pz2                                        30m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-149-2.us-east-2.compute.internal                              <none>              Approved,Issued
            csr-wrbwl                                        30m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-184-167.us-east-2.compute.internal                            <none>              Approved,Issued
            system:openshift:openshift-authenticator-7mzw9   33m   kubernetes.io/kube-apiserver-client           system:serviceaccount:openshift-authentication-operator:authentication-operator   <none>              Approved,Issued
            system:openshift:openshift-monitoring-dl5pq      33m   kubernetes.io/kube-apiserver-client           system:serviceaccount:openshift-monitoring:cluster-monitoring-operator            <none>              Approved,Issued
            $ oc get csr -o json > csr.json
            $ cat csr.json| grep "system:serviceaccounts:openshift-machine-config-operator" -C 1   
                                "system:serviceaccounts",
                                "system:serviceaccounts:openshift-machine-config-operator",
                                "system:authenticated"
            --
                                "system:serviceaccounts",
                                "system:serviceaccounts:openshift-machine-config-operator",
                                "system:authenticated"
            --
                                "system:serviceaccounts",
                                "system:serviceaccounts:openshift-machine-config-operator",
                                "system:authenticated"
            --
                                "system:serviceaccounts",
                                "system:serviceaccounts:openshift-machine-config-operator",
                                "system:authenticated"
            --
                                "system:serviceaccounts",
                                "system:serviceaccounts:openshift-machine-config-operator",
                                "system:authenticated"
            --
                                "system:serviceaccounts",
                                "system:serviceaccounts:openshift-machine-config-operator",
                                "system:authenticated"

            In clusterversion 4.14.0-0.nightly-2023-03-28-031439, which doesn't include this PR, some CSRs don't have `system:serviceaccounts`:

             

            $ oc get csr 
            NAME                                             AGE   SIGNERNAME                                    REQUESTOR                                                                         REQUESTEDDURATION   CONDITION
            csr-4zjsg                                        47m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Approved,Issued
            csr-88f8g                                        41m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-174-215.us-east-2.compute.internal                            <none>              Approved,Issued
            csr-bb27t                                        47m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-151-180.us-east-2.compute.internal                            <none>              Approved,Issued
            csr-cbwqz                                        40m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-158-163.us-east-2.compute.internal                            <none>              Approved,Issued
            csr-cgp2n                                        46m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Approved,Issued
            csr-fcvjj                                        41m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Approved,Issued
            csr-j2k2m                                        46m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-176-18.us-east-2.compute.internal                             <none>              Approved,Issued
            csr-v9vgj                                        46m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-204-200.us-east-2.compute.internal                            <none>              Approved,Issued
            csr-vm89n                                        40m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Approved,Issued
            csr-wf64r                                        41m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-207-104.us-east-2.compute.internal                            <none>              Approved,Issued
            csr-xfgg8                                        41m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Approved,Issued
            csr-xw7s2                                        46m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Approved,Issued
            system:openshift:openshift-authenticator-nr56z   44m   kubernetes.io/kube-apiserver-client           system:serviceaccount:openshift-authentication-operator:authentication-operator   <none>              Approved,Issued
            system:openshift:openshift-monitoring-p4qv7      44m   kubernetes.io/kube-apiserver-client           system:serviceaccount:openshift-monitoring:cluster-monitoring-operator            <none>              Approved,Issued
            $ oc get csr -o json > csr.json
            $ cat csr.json| grep "system:serviceaccounts:openshift-machine-config-operator" -C 1     
                            "groups": [
                                "system:serviceaccounts:openshift-machine-config-operator",
                                "system:authenticated"
            --
                            "groups": [
                                "system:serviceaccounts:openshift-machine-config-operator",
                                "system:authenticated"
            --
                                "system:serviceaccounts",
                                "system:serviceaccounts:openshift-machine-config-operator",
                                "system:authenticated"
            --
                                "system:serviceaccounts",
                                "system:serviceaccounts:openshift-machine-config-operator",
                                "system:authenticated"
            --
                                "system:serviceaccounts",
                                "system:serviceaccounts:openshift-machine-config-operator",
                                "system:authenticated"
            --
                            "groups": [
                                "system:serviceaccounts:openshift-machine-config-operator",
                                "system:authenticated"

             

            Joel Speed added a comment -

            For the benefit of QE: please check that, when the cluster bootstraps, all node client CSRs (those from the node-bootstrapper service account) have 3 groups associated with them: `system:authenticated`, `system:serviceaccounts` and `system:serviceaccounts:openshift-machine-config-operator`. If they do, the CSR approver will recognise them correctly and approve them without the need for the bootstrap CSR approval mechanism, which should fix this race condition.
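The check described above could be automated roughly as follows. This is a minimal sketch, not part of the fix: it assumes the input has the shape of `oc get csr -o json` (a list object whose items carry `spec.username` and `spec.groups`), and the function name and the sample CSR names are hypothetical.

```python
import json

# Requestor identity shown in the REQUESTOR column for node client CSRs.
NODE_BOOTSTRAPPER = (
    "system:serviceaccount:openshift-machine-config-operator:node-bootstrapper"
)

# The three groups the comment above asks QE to look for.
EXPECTED_GROUPS = {
    "system:authenticated",
    "system:serviceaccounts",
    "system:serviceaccounts:openshift-machine-config-operator",
}


def missing_groups_by_csr(csr_list):
    """Map each node-bootstrapper CSR name to the expected groups it lacks."""
    problems = {}
    for item in csr_list.get("items", []):
        spec = item.get("spec", {})
        if spec.get("username") != NODE_BOOTSTRAPPER:
            continue  # only node client CSRs are of interest here
        missing = EXPECTED_GROUPS - set(spec.get("groups", []))
        if missing:
            problems[item["metadata"]["name"]] = sorted(missing)
    return problems


# Hypothetical sample mirroring the two group layouts seen in this issue.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "csr-example-old"},
            "spec": {
                "username": NODE_BOOTSTRAPPER,
                "groups": [
                    "system:serviceaccounts:openshift-machine-config-operator",
                    "system:authenticated",
                ],
            },
        },
        {
            "metadata": {"name": "csr-example-new"},
            "spec": {
                "username": NODE_BOOTSTRAPPER,
                "groups": [
                    "system:serviceaccounts",
                    "system:serviceaccounts:openshift-machine-config-operator",
                    "system:authenticated",
                ],
            },
        },
    ]
}

print(json.dumps(missing_groups_by_csr(SAMPLE), indent=2))
```

To run it against a real cluster dump, load the file produced by `oc get csr -o json > csr.json` with `json.load` and pass the result to `missing_groups_by_csr`; an empty result means every node client CSR carries all three groups.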

            Joel Speed added a comment -

            To answer my own question above, there are differences between day 1 and day 2!

            The machine config server, when in bootstrap mode, copies the kubelet client credential from the bootstrap machine and appends it, so that any Machine that comes up during the bootstrap process uses that client credential to bootstrap rather than the service account token that is normally used. (Why it does this I have no idea; is the service account token not available at this point?)

            Anyway, that then led me to https://github.com/openshift/installer/blob/a24e632c60d2344a27b53b920b091add4b114495/pkg/asset/tls/kubelet.go#L176-L189, which is where this credential is generated. If we were to add the additional group to the certificate here, then the CSR approver would be able to approve the certificate, which would prevent strange races like this in the future.
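Why adding a group to the certificate changes what the CSR approver sees can be illustrated with a small model. This is a sketch under stated assumptions, not the installer's code: Kubernetes X.509 client authentication takes the username from the certificate's common name and the groups from its organization fields, and every authenticated request also gets `system:authenticated`. The "before" subject is inferred from the CSR output in this issue, and the `Subject` class, the helper names, and the approver's exact check are hypothetical.

```python
from dataclasses import dataclass

MCO_GROUP = "system:serviceaccounts:openshift-machine-config-operator"
ALL_SA_GROUP = "system:serviceaccounts"
NODE_BOOTSTRAPPER = (
    "system:serviceaccount:openshift-machine-config-operator:node-bootstrapper"
)


@dataclass(frozen=True)
class Subject:
    """Hypothetical stand-in for the subject of a client certificate."""

    common_name: str        # becomes the username
    organizations: tuple    # each entry becomes a group


def groups_seen_by_api_server(subject):
    """Groups attached to a request authenticated with this certificate."""
    return set(subject.organizations) | {"system:authenticated"}


# Assumption: the approver wants all three groups named in this issue.
APPROVER_EXPECTED = {"system:authenticated", ALL_SA_GROUP, MCO_GROUP}


def approver_would_recognise(subject):
    """True when the request carries every group the approver expects."""
    return APPROVER_EXPECTED <= groups_seen_by_api_server(subject)


# Inferred from the pre-fix CSRs, which lacked system:serviceaccounts.
before_fix = Subject(NODE_BOOTSTRAPPER, (MCO_GROUP,))
# With the additional group added to the certificate.
after_fix = Subject(NODE_BOOTSTRAPPER, (MCO_GROUP, ALL_SA_GROUP))

for label, subject in (("before", before_fix), ("after", after_fix)):
    print(label, sorted(groups_seen_by_api_server(subject)),
          approver_would_recognise(subject))
```

A service account token gets both service account groups from the API server automatically, which would explain why only CSRs made with the copied bootstrap certificate were missing `system:serviceaccounts`.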

            Joel Speed added a comment -

            Has anyone managed to understand why the CSR doesn't meet the expectations of the CSR approver? Day 2 should be no different from Day 1 with regard to the CSR approval flow. We know the CSR approver is running, but it is rejecting the certificates because they don't have the correct groups associated with them; yet the workers seem to be coming up correctly?

            Why is the API server not adding the service account group for certain requests? Is this a bug in the API server? I can't imagine there's a config drift between these early masters and workers, is there?

             

              joelspeed Joel Speed
              mbargenq Matt Bargenquast (Inactive)
              Zhaohua Sun Zhaohua Sun
              Jeana Routh Jeana Routh
              Matt Bargenquast (Inactive)