OCPBUGS-17485 (project: OpenShift Bugs)

Node in deleting state for too long because draining is blocked should be signalled better

      Description of problem:

      SRE investigated a [NodeNotReady alert|https://redhat.pagerduty.com/incidents/Q03OFJXEUBFLRU] and found that a node/machine in a HostedCluster, marked NotReady/Unhealthy, had been stuck in the "Deleting" state for almost an hour. CAPI logs from the management cluster suggest a race between a failing attempt to update the unhealthy machine's security groups and the termination of that machine.
      

      Version-Release number of selected component (if applicable):

      Cluster Version: 4.12.22
      CAPI image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a70ac366dcbd7d3a2d091ec2e505547b035a7a670e6659ac45c6e2708adcdacb
      

      How reproducible:

      Unclear
      

      Steps to Reproduce:

      1. Observe an HCP worker node whose security groups CAPI is trying to update while receiving a "Client.InvalidPermission.Duplicate" error
      2. Make that node go Unhealthy
      3. Observe CAPI's reaction
      

      Actual results:

      CAPI tries and fails to update the security groups, then stops reconciling that node, so the unhealthy machine is neither deleted nor replaced. The deletion eventually does go through (about an hour later in this case).
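
To illustrate the failure mode described above, here is a minimal sketch of one common way to keep an "already exists" error from blocking a reconcile loop: treat the duplicate-rule error as a satisfied state rather than a failure. This is not taken from CAPI's code, and this report does not establish that CAPI does or should behave this way. `CloudAPIError`, `ensure_ingress_rule`, and the `authorize` callback are hypothetical names introduced only for this example; the error code string is the one quoted in the steps to reproduce.

```python
# Error code as quoted in this report.
DUPLICATE_RULE_CODE = "Client.InvalidPermission.Duplicate"


class CloudAPIError(Exception):
    """Hypothetical stand-in for a cloud provider API error (not a real SDK type)."""

    def __init__(self, code, message=""):
        super().__init__(f"{code}: {message}")
        self.code = code


def ensure_ingress_rule(authorize, rule):
    """Try to add a rule; return True if added, False if it already existed.

    `authorize` is a hypothetical callable that performs the provider call.
    Any error other than the duplicate-rule error is re-raised.
    """
    try:
        authorize(rule)
    except CloudAPIError as err:
        if err.code == DUPLICATE_RULE_CODE:
            # The desired rule is already present, so the goal is met.
            return False
        raise
    return True


def always_duplicate(rule):
    """Fake provider call that behaves like the one in this report."""
    raise CloudAPIError(DUPLICATE_RULE_CODE, "the specified rule already exists")


added = ensure_ingress_rule(always_duplicate, {"protocol": "tcp", "port": 443})
print("rule added" if added else "rule already present, continuing")
```

With this shape, a caller can proceed to later steps (such as deletion of an unhealthy machine) whether or not the rule had to be created.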
      

      Expected results:

      CAPI deletes and replaces the unhealthy machine within a few minutes
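
As a rough illustration of how a machine that stays in "Deleting" for too long could be signalled, the sketch below flags machines whose deletion has been pending beyond a threshold. It is only a sketch under stated assumptions: `MachineStatus` is a hypothetical, simplified record (not a CAPI type), the 10-minute default threshold is an arbitrary choice standing in for "a few minutes", and the timestamps are made-up sample data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class MachineStatus:
    """Hypothetical, simplified view of a machine; not a real CAPI type."""
    name: str
    phase: str
    deletion_requested_at: Optional[datetime]


def stuck_deleting(machines, now, threshold=timedelta(minutes=10)):
    """Return (name, elapsed) for machines in Deleting longer than threshold."""
    stuck = []
    for machine in machines:
        if machine.phase != "Deleting" or machine.deletion_requested_at is None:
            continue
        elapsed = now - machine.deletion_requested_at
        if elapsed > threshold:
            stuck.append((machine.name, elapsed))
    return stuck


# Made-up sample data for illustration only.
now = datetime(2023, 8, 1, 12, 0, tzinfo=timezone.utc)
machines = [
    MachineStatus("worker-a", "Deleting", now - timedelta(minutes=55)),
    MachineStatus("worker-b", "Deleting", now - timedelta(minutes=2)),
    MachineStatus("worker-c", "Running", None),
]

for name, elapsed in stuck_deleting(machines, now):
    print(f"{name} has been in Deleting for {elapsed}")
```

A check like this could feed an alert or a status condition, so that a blocked deletion shows up as its own signal rather than only as a NodeNotReady alert.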
      

      Additional info:

      It's not clear what made CAPI stop trying to reconcile the security group (after an hour of trying) and skip straight to deleting the unhealthy machine.
      

              agarcial@redhat.com Alberto Garcia Lamela
              abyrne.openshift Anthony Byrne
              Jie Zhao Jie Zhao