  1. OpenShift Request For Enhancement
  2. RFE-6772

Harden machine-approver logic vs. cloud-side delays in Machine provisioning


    • Feature Request
    • Resolution: Unresolved
    • openshift-4.17
    • machine-api

      1. Proposed title of this feature request

      Harden machine-approver logic vs. cloud-side delays in Machine provisioning

      2. What is the nature and description of the request?

      Pivoting from:

      • The CSR creation timestamp must be close to the Machine creation timestamp (currently within 2 hours)

      to using "the time the Machine was Provisioned". That way, the time the cloud took to provision the backing infrastructure would not count against the "reasonable CSR creation window". That might, however, open clusters up to attacks in which old Machines that took a while to provision have predictable acceptance criteria for too long. In that case, I'm back to thinking we should be more aggressive about deleting Machines that took too long to provision, e.g. by not counting slow-to-provision Machines (regardless of whether they eventually provisioned) against MHC short circuits, or something similar.

      An alternative would be adjusting the MachineHealthCheckUnterminatedShortCircuit handling to allow terminating (some? all?) old Provisioned Machines once they fall outside the approver window.

      3. Why does the customer need this?

      With the current logic, the following timeline can wedge the cluster's ability to create new nodes:

      1. Cloud capacity issues break new Machine provisioning.
      2. The outage goes on long enough to trip MachineHealthCheckUnterminatedShortCircuit.
      3. Eventually cloud capacity recovers, and Machines go from Provisioning to Provisioned and their kubelets come up and create CertificateSigningRequests.
      4. But by this point the Machines are too old, and the CSRs no longer get approved by the auto-approver.
      5. System wedges.

      4. List any affected packages or components.

      May be something the cloud team can do unilaterally?

              rh-ee-smodeel Subin M
              trking W. Trevor King