OpenShift Bugs / OCPBUGS-2117

[gcp] pre-emptible VM: machine-api-termination-handler not marking instance for deletion


    • Important
    • CLOUD Sprint 243, CLOUD Sprint 244, CLOUD Sprint 245, CLOUD Sprint 246
    • 4
    • Rejected
    • False
    • * Previously, the termination handler prematurely exited before marking a node for termination. This condition occurred based on the timing of when the termination signal was received by the controller. With this release, the possibility of early termination is accounted for by introducing an additional check for termination. (link:https://issues.redhat.com/browse/OCPBUGS-2117[*OCPBUGS-2117*])
    • Bug Fix
    • Done

      Description of problem:

      GCP preemptible VM termination is not being handled correctly by machine-api-termination-handler.
      

      Version-Release number of selected component (if applicable):

      Tested on both 4.10.22 and 4.11.2
      

      How reproducible:

      To reproduce the issue:
      
      Create a spot instance machine in GCP and stop the instance. The machine-api-termination-handler pod logs show no signal indicating the instance was terminated, although the machine list does show the TERMINATED status. As a result, pods are not gracefully moved off in the 90-second window before the node is turned off.
      
      We would expect a terminated node to wait for pods to move off (up to 90 seconds) and then shut down, instead of the node shutting down immediately.
      
      

      Steps to Reproduce:

      1. Create a spot instance machine in GCP.
      2. Stop the instance.
      3. Notice that the machine-api-termination-handler pod logs contain no signal signifying the instance was terminated.
      4. Note that the machine list does show the TERMINATED status.
      5. As a result, pods are not gracefully moved off in the 90-second window before the node is turned off.
      

      Actual results:

      The machine-api-termination-handler logs do not show a message such as "Instance marked for termination, marking Node for deletion"; instead, no signal is received from GCP.
      

      Expected results:

      A terminated node should wait for pods to move off (up to 90 seconds) and then shut down, instead of shutting down immediately.
      

      Additional info:
      Here is the code:
      https://github.com/openshift/machine-api-provider-gcp/blob/main/pkg/termination/termination.go#L96-L127

      #forum-cloud slack thread:
      https://coreos.slack.com/archives/CBZHF4DHC/p1656524730323259

      #forum-node slack thread:
       https://coreos.slack.com/archives/CK1AE4ZCK/p1656619821630479

            mimccune@redhat.com Michael McCune
            rhn-support-ddelcian Daniel Del Ciancio
            Zhaohua Sun Zhaohua Sun