OpenShift Bugs / OCPBUGS-47508

Reissued kube-apiserver-client-kubelet CSRs can trigger the CSR reconcile limit on large scale-ups


    • Bug
    • Resolution: Unresolved
    • Major
    • 4.14.z, 4.15.z, 4.16.z, 4.17.z, 4.18.z

      Description of problem:

      When kubernetes.io/kube-apiserver-client-kubelet CSRs are not approved, they are recreated with exponential backoff, and the old CSRs are abandoned in a pending state. This is usually harmless because CSR approval is typically quick. During large machine scale-ups, however, adding IP addresses to the machine object (required for the node-to-machine mapping used in CSR validation) can be slow, which delays CSR approval. Eventually the number of pending CSRs can reach the maximum allowed, at which point CSR approval halts.
      The machine approver leaves invalid CSRs in the pending state rather than rejecting them. This behaviour is intentional: it avoids rejecting valid CSRs that another controller should approve, and it leaves room for manual approval by a system administrator.
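
      To see how many abandoned CSRs have accumulated, the pending kube-apiserver-client-kubelet CSRs can be listed by name. The following is a minimal sketch, assuming the `oc` CLI is logged in to the affected cluster; the function name `pending_kubelet_client_csrs` is hypothetical and not part of any tool. It treats a CSR with an empty status as pending.

```shell
# Hypothetical helper: print the names of pending CSRs for the
# kubernetes.io/kube-apiserver-client-kubelet signer.
# Assumption: a CSR with no status (no Approved/Denied condition) is pending.
pending_kubelet_client_csrs() {
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{if eq .spec.signerName "kubernetes.io/kube-apiserver-client-kubelet"}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}{{end}}'
}

# Usage (requires cluster access), e.g. to count them:
#   pending_kubelet_client_csrs | wc -l
```

      This only lists the CSRs; it does not approve or delete anything.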

      Version-Release number of selected component (if applicable):

          I expect this to be the issue across all versions

      How reproducible:

          Likely when scaling over 500 machines.

      Steps to Reproduce:

      https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/58057/rehearse-58057-pull-ci-openshift-ovn-kubernetes-release-4.16-ovncore-perfscale-aws-ovn-xlarge-cluster-density-v2/1865957686453997568

      Actual results:

          Machine approver logs:
      sr-6mxzf: failed to find machine for node ip-10-0-5-76.ec2.internal, cannot approve
      After the limit is reached:
      csr-zt477: Pending CSRs: 1074; Max pending allowed: 856. Difference between pending CSRs and machines > 100. Ignoring all CSRs as too many recent pending CSR
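
      The numbers in the second log line can be checked with a small sketch. It assumes, based only on the wording of that message, that the maximum is the machine count plus a fixed allowance of 100; under that assumption a maximum of 856 implies 756 machines. The function name and the formula are illustrative and may not match the machine approver's actual code.

```shell
# Hypothetical check, inferred from the log message above:
# approvals stop once pending CSRs exceed (machines + delta).
csr_limit_exceeded() {
  pending=$1
  machines=$2
  delta=${3:-100}   # assumed allowance, taken from "> 100" in the log
  max_pending=$((machines + delta))
  if [ "$pending" -gt "$max_pending" ]; then
    echo "Pending CSRs: $pending; Max pending allowed: $max_pending; approvals halted"
    return 0
  fi
  echo "Pending CSRs: $pending; Max pending allowed: $max_pending; approvals continue"
  return 1
}

# Values from the log: 1074 pending against an assumed 756 machines.
csr_limit_exceeded 1074 756 || true
# prints "Pending CSRs: 1074; Max pending allowed: 856; approvals halted"
```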

      Expected results:

          Cluster machine approver should handle scaling to 750 nodes at once.

      Additional info:

      Example machine timeline:
      Machine creationTimestamp: "2024-12-09T04:57:29Z"
      Last attempt to reconcile one of its CSRs before the CSR limit was reached: 2024-12-09T05:21:39.211911225Z
      Addresses added to machine: "2024-12-09T05:29:44Z"
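      # For each CSR YAML dump, print the certificate request's CN and the CSR creation time.
      for file in *.yaml; do
        ts=$(grep "creationTimestamp" "$file" | awk '{print $2}')
        # Decode the base64 request and extract the subject line containing the CN.
        cn=$(grep "request:" "$file" | awk '{print $2}' | base64 -d | openssl req -noout -text -in - | grep CN= || echo "No CN found")
        echo "$cn $ts"
      done | sort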
      
              Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T05:23:48Z"
              Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T05:23:58Z"
              Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T05:39:01Z"
              Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T05:54:05Z"
              Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T06:09:13Z"
              Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T06:24:30Z"
              Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T06:40:02Z"
              Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T06:56:02Z"
              Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T07:11:30Z"
              Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T07:26:58Z"
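
      The listing above shows ten CSRs for a single node. To get the same count for every node at once, the output of the loop can be aggregated by CN. This is a minimal sketch, assuming each line has the format shown above, with the CN as a whitespace-delimited `CN=...` field; the function name `count_csrs_per_node` is hypothetical.

```shell
# Hypothetical helper: read "Subject: ..., CN=<name> <timestamp>" lines on stdin
# and print "<count> <name>" per CN, highest count first.
count_csrs_per_node() {
  awk '{
         for (i = 1; i <= NF; i++)
           if ($i ~ /^CN=/) { sub(/^CN=/, "", $i); n[$i]++ }
       }
       END { for (cn in n) print n[cn], cn }' | sort -rn
}

# Usage: pipe the loop output into the helper instead of into plain sort:
#   for file in *.yaml; do ...; done | count_csrs_per_node
```

      Nodes with a high count are the ones whose CSRs were repeatedly reissued while earlier ones stayed pending.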
      

              Radek Manak (rmanak@redhat.com)
              Zhaohua Sun