OpenShift Bugs / OCPBUGS-17485

Node in deleting state for too long because draining is blocked should be signalled better





      Description of problem:

      SRE investigated a [NodeNotReady alert|https://redhat.pagerduty.com/incidents/Q03OFJXEUBFLRU] and found that a HostedCluster's node/machine, marked NotReady/Unhealthy, had been stuck in the "Deleting" state for almost an hour. CAPI logs from the management cluster suggest a race between updating the unhealthy machine's security groups (which kept failing) and actually terminating the machine.

      Version-Release number of selected component (if applicable):

      Cluster Version: 4.12.22
      CAPI image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a70ac366dcbd7d3a2d091ec2e505547b035a7a670e6659ac45c6e2708adcdacb

      How reproducible:


      Steps to Reproduce:

      1. Observe an HCP worker node whose security groups CAPI is trying to update while receiving a "Client.InvalidPermission.Duplicate" error
      2. Make that node go Unhealthy
      3. Observe CAPI's reaction

      Actual results:

      CAPI tries and fails to update the security groups, then stops reconciling that node, so the unhealthy machine is neither deleted nor replaced. The deletion eventually goes through, but only about an hour later in this case.
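      For illustration only: one common way to keep a "rule already exists" error from blocking a reconcile loop is to treat it as an already-satisfied state. Whether that is the right fix for the provider in this report has not been established. The sketch below is self-contained Go; codedError, isDuplicatePermission and ensureIngressRule are hypothetical names, not types or functions from CAPI or any AWS SDK. It matches on the suffix "InvalidPermission.Duplicate" so that the "Client."-prefixed form quoted above is also recognised.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// codedError is a hypothetical stand-in for an SDK error that carries an
// error code. It is not a type from any AWS SDK.
type codedError struct {
	Code    string
	Message string
}

func (e *codedError) Error() string { return e.Code + ": " + e.Message }

// isDuplicatePermission reports whether err (or any error it wraps) carries
// a code ending in "InvalidPermission.Duplicate".
func isDuplicatePermission(err error) bool {
	var ce *codedError
	if !errors.As(err, &ce) {
		return false
	}
	return strings.HasSuffix(ce.Code, "InvalidPermission.Duplicate")
}

// ensureIngressRule calls authorize (a hypothetical callback that would issue
// the security group update) and treats a duplicate-rule error as success,
// because the desired rule is already present. Other errors are returned.
func ensureIngressRule(authorize func() error) error {
	err := authorize()
	if err == nil || isDuplicatePermission(err) {
		return nil
	}
	return fmt.Errorf("authorizing ingress rule: %w", err)
}

func main() {
	duplicate := func() error {
		return &codedError{Code: "Client.InvalidPermission.Duplicate", Message: "rule already exists"}
	}
	denied := func() error {
		return &codedError{Code: "UnauthorizedOperation", Message: "not allowed"}
	}
	fmt.Println("duplicate:", ensureIngressRule(duplicate))
	fmt.Println("denied:", ensureIngressRule(denied))
}
```

      With this shape, a duplicate rule no longer counts as a failed reconcile, while genuine failures such as a permissions error still surface to the caller.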

      Expected results:

      CAPI deletes and replaces the unhealthy machine within a few minutes
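      To make "within a few minutes" checkable, a stalled deletion could be flagged by comparing the machine's deletion timestamp against a threshold. This is a minimal sketch under stated assumptions, not how CAPI reports machine state: machineStatus, its fields, deletionStalled and the 10-minute threshold are all hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// machineStatus is a hypothetical, trimmed-down snapshot of a machine. The
// field names are illustrative and not taken from the CAPI API types.
type machineStatus struct {
	Name              string
	DeletionTimestamp *time.Time // nil while the machine is not being deleted
	ObservedAt        time.Time  // when this snapshot was taken
}

// deletionStalled returns how long the machine has been deleting and whether
// that duration exceeds threshold. A machine that is not being deleted is
// never considered stalled.
func deletionStalled(m machineStatus, threshold time.Duration) (time.Duration, bool) {
	if m.DeletionTimestamp == nil {
		return 0, false
	}
	elapsed := m.ObservedAt.Sub(*m.DeletionTimestamp)
	return elapsed, elapsed > threshold
}

func main() {
	now := time.Now()
	deletedAt := now.Add(-55 * time.Minute) // roughly the situation in this report
	m := machineStatus{Name: "worker-0", DeletionTimestamp: &deletedAt, ObservedAt: now}

	// The 10-minute threshold is an assumption, not a documented default.
	if elapsed, stalled := deletionStalled(m, 10*time.Minute); stalled {
		fmt.Printf("machine %s has been deleting for %s\n", m.Name, elapsed.Round(time.Minute))
	}
}
```

      A check like this could back an alert or a status condition, so that a node stuck in "Deleting" is reported as such rather than only as NotReady.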

      Additional info:

      It's not clear what made CAPI stop trying to reconcile the security group (after an hour of trying) and skip straight to deleting the unhealthy machine.

            agarcial@redhat.com Alberto Garcia Lamela
            abyrne.openshift Anthony Byrne
            Jie Zhao Jie Zhao
