OpenShift Bugs / OCPBUGS-46429

Too many pending CSRs lead to scaleup failures when scaling to 500 nodes

    • No
    • CLOUD Sprint 263, CLOUD Sprint 264
    • 2
    • False
      Previously, the certificate signing request (CSR) approver counted CSRs from other systems when it calculated whether it was overloaded and should stop approving certificates. In larger clusters, where other subsystems also used CSRs, the approver saw many unapproved CSRs that it was not responsible for and stopped approving new ones. With this release, the CSR approver counts only the CSRs that it can approve, using the signerName property as a filter. As a result, it stops approving new CSRs only when a large number of CSRs for the signerName values that it observes remain unapproved.
      ====
      Cause: The CSR approver included CSRs from other systems in its own calculation of whether it was overwhelmed and should stop approving certificates.
      Consequence: In larger clusters, where other subsystems use CSRs, the CSR approver determined that there were many unapproved CSRs and prevented further approvals.
      Fix: The CSR approver now counts only the CSRs that it can approve, using the signerName property as a filter.
      Result: The CSR approver prevents new approvals only when a large number of CSRs, for the signerName values that it observes, remain unapproved.
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-46425. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-36404. The following is the description of the original issue:

      Description of problem:
      machine-approver logs

      E0221 20:29:52.377443       1 controller.go:182] csr-dm7zr: Pending CSRs: 1871; Max pending allowed: 604. Difference between pending CSRs and machines > 100. Ignoring all CSRs as too many recent pending CSRs seen

      oc get csr |wc -l
      3818
      oc get csr |grep "node-bootstrapper" |wc -l
      2152
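The numbers in the log line above can be checked with a short sketch. It assumes that the limit is the machine count plus a fixed margin of 100, which is an inference from the "> 100" wording in the log message and not something confirmed here. The names `ASSUMED_MARGIN`, `max_pending_allowed`, and `approvals_blocked` are hypothetical and are not taken from the machine-approver source.

```python
# Margin taken from the "> 100" wording in the log line. Treating it as a
# fixed allowance added to the machine count is an assumption.
ASSUMED_MARGIN = 100


def max_pending_allowed(machine_count: int, margin: int = ASSUMED_MARGIN) -> int:
    """Return the assumed ceiling on pending CSRs for a given machine count."""
    return machine_count + margin


def approvals_blocked(pending: int, machine_count: int,
                      margin: int = ASSUMED_MARGIN) -> bool:
    """Return True when the pending count exceeds the assumed ceiling."""
    return pending > max_pending_allowed(machine_count, margin)


if __name__ == "__main__":
    pending_csrs = 1871      # "Pending CSRs: 1871" from the log line
    reported_limit = 604     # "Max pending allowed: 604" from the log line

    # Under the assumed formula, the limit implies this many machines.
    implied_machines = reported_limit - ASSUMED_MARGIN
    print(implied_machines)                                   # 504
    print(approvals_blocked(pending_csrs, implied_machines))  # True
```

Under that assumption, a limit of 604 corresponds to roughly 504 machines, and 1871 pending CSRs is well over the ceiling, which matches the "Ignoring all CSRs" message.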

      By approving the pending CSRs manually, I can get the cluster to scale up.

      One option is to raise maxPending to a higher value: https://github.com/openshift/cluster-machine-approver/blob/2d68698410d7e6239dafa6749cc454272508db19/pkg/controller/controller.go#L330 
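The release note for this issue describes a different remedy from raising the limit: counting only the CSRs whose signerName the approver handles. A minimal sketch of that idea follows, using plain dictionaries shaped like CSR objects. It is an illustration, not the approver's actual code. The set of handled signers is an assumption (the two names are standard Kubernetes kubelet signers, but whether the approver watches exactly these is not stated here), `example.com/other-signer` is a made-up signer, and all function names are hypothetical.

```python
from typing import Iterable, Optional, Set

# Assumption: the approver handles the two built-in kubelet signers.
ASSUMED_HANDLED_SIGNERS = {
    "kubernetes.io/kube-apiserver-client-kubelet",
    "kubernetes.io/kubelet-serving",
}


def is_pending(csr: dict) -> bool:
    """Treat a CSR as pending if it has no Approved or Denied condition."""
    conditions = (csr.get("status") or {}).get("conditions") or []
    return not any(c.get("type") in ("Approved", "Denied") for c in conditions)


def count_pending(csrs: Iterable[dict],
                  signers: Optional[Set[str]] = None) -> int:
    """Count pending CSRs, optionally restricted to the given signer names."""
    count = 0
    for csr in csrs:
        signer = (csr.get("spec") or {}).get("signerName")
        if signers is not None and signer not in signers:
            continue  # CSR belongs to another subsystem; do not count it
        if is_pending(csr):
            count += 1
    return count


# Hypothetical sample data: two pending CSRs (one from an unrelated signer)
# and one CSR that has already been approved.
SAMPLE_CSRS = [
    {"spec": {"signerName": "kubernetes.io/kube-apiserver-client-kubelet"},
     "status": {}},
    {"spec": {"signerName": "example.com/other-signer"},
     "status": {}},
    {"spec": {"signerName": "kubernetes.io/kubelet-serving"},
     "status": {"conditions": [{"type": "Approved"}]}},
]

if __name__ == "__main__":
    print(count_pending(SAMPLE_CSRS))                           # 2
    print(count_pending(SAMPLE_CSRS, ASSUMED_HANDLED_SIGNERS))  # 1
```

With the filter applied, the pending CSR from the unrelated signer no longer counts toward the ceiling, so CSRs from other subsystems cannot push the approver over its limit.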

       

              rmanak@redhat.com Radek Manak
              openshift-crt-jira-prow OpenShift Prow Bot
              Zhaohua Sun Zhaohua Sun