OpenShift Bugs / OCPBUGS-19484

[aws] pre-emptible VM: machine-api-termination-handler not marking instance for deletion


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Normal
    • 4.10
    • Important
    • Sprint: CLOUD Sprint 243, CLOUD Sprint 244, CLOUD Sprint 245
    • 3
    • Rejected

      This is a clone of the GCP bug OCPBUGS-2117; the same problem also affects AWS. The description below contains GCP-specific information, but the same general problem exists in the termination handler for AWS as well.

      Description of problem:

      GCP preemptible VM termination is not being handled correctly by machine-api-termination-handler.
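      For the AWS case referenced in the title, a spot interruption is announced to the instance through the EC2 instance metadata service, which is the signal a node-local termination handler has to watch for. The following is a minimal, hypothetical sketch (not the handler's actual code) of checking the documented IMDS spot interruption endpoint; the endpoint paths and headers are the public IMDS ones, everything else here is illustrative.

```go
// Hypothetical sketch: query the EC2 IMDS spot interruption endpoint.
// A 404 means no interruption notice has been issued; a 200 with a JSON
// body means the instance is scheduled to be taken away.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

const imdsBase = "http://169.254.169.254/latest"

// imdsToken fetches an IMDSv2 session token (required when IMDSv2 is enforced).
func imdsToken(c *http.Client) (string, error) {
	req, err := http.NewRequest(http.MethodPut, imdsBase+"/api/token", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")
	resp, err := c.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("token request failed: %s", resp.Status)
	}
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	token, err := imdsToken(client)
	if err != nil {
		fmt.Println("could not reach IMDS (are you on an EC2 instance?):", err)
		return
	}

	req, _ := http.NewRequest(http.MethodGet, imdsBase+"/meta-data/spot/instance-action", nil)
	req.Header.Set("X-aws-ec2-metadata-token", token)
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("IMDS request failed:", err)
		return
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		body, _ := io.ReadAll(resp.Body)
		fmt.Println("spot interruption notice received:", string(body))
	case http.StatusNotFound:
		fmt.Println("no interruption notice (404) - nothing for a handler to act on yet")
	default:
		fmt.Println("unexpected IMDS status:", resp.Status)
	}
}
```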
      

      Version-Release number of selected component (if applicable):

      Tested on both 4.10.22 and 4.11.2
      

      How reproducible:

      To reproduce the issue:

      Create a spot instance Machine in GCP and stop the instance. The machine-api-termination-handler pod logs show no signal indicating the instance was terminated, although the Machine does show the TERMINATED status in the machines list. As a result, pods are not gracefully moved off during the 90-second window before the node is turned off.

      We would expect a terminated node to wait for pods to move off (up to 90 seconds) and then shut down, instead of shutting down immediately.
      
      

      Steps to Reproduce:

      1. Create a spot instance Machine in GCP.
      2. Stop the instance.
      3. Observe that the machine-api-termination-handler pod logs show no signal indicating the instance was terminated (see the diagnostic sketch after this list).
      4. Note that the machines list does show the TERMINATED status.
      5. Result: pods are not gracefully moved off during the 90-second window before the node is turned off.
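      As a diagnostic for step 3, the hypothetical one-shot check below (not part of the product) queries the documented GCP metadata key instance/preempted, which the linked termination handler code appears to watch. Running it on the node from step 1 shows directly what signal, if any, the instance received.

```go
// Hypothetical diagnostic: read the GCP metadata key that reports whether
// this instance has been preempted. Run on the spot/preemptible node itself.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	const url = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	// All GCP metadata requests must carry this header.
	req.Header.Set("Metadata-Flavor", "Google")

	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("metadata server unreachable (are you on a GCE instance?):", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// "TRUE" means GCP reported a preemption; "FALSE" means no preemption
	// signal was ever delivered to this instance.
	fmt.Printf("instance/preempted = %s\n", string(body))
}
```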
      

      Actual results:

      The machine-api-termination-handler logs never show a message such as "Instance marked for termination, marking Node for deletion"; no termination signal is received from GCP.
      

      Expected results:

      A terminated node should wait for pods to move off (up to 90 seconds) and then shut down, instead of shutting down immediately.
      

      Additional info:
      Here is the code:
      https://github.com/openshift/machine-api-provider-gcp/blob/main/pkg/termination/termination.go#L96-L127
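
      For reference, the linked handler code appears to follow a poll-then-mark pattern: query the metadata server until it reports a preemption, then mark the Node for deletion so pods are drained before the instance goes away. The sketch below is a simplified, self-contained illustration of that pattern, not the real implementation; markNodeForDeletion, the poll interval, and the timeout are placeholders.

```go
// Simplified illustration of the polling pattern in the linked termination
// handler code. The real handler uses the Machine API client to mark the
// Node for deletion; here that step is a placeholder callback.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

const preemptedURL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"

// preempted reports whether the GCP metadata server says this instance
// has been preempted.
func preempted(ctx context.Context, c *http.Client) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, preemptedURL, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Metadata-Flavor", "Google")
	resp, err := c.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return strings.EqualFold(strings.TrimSpace(string(body)), "TRUE"), nil
}

// run polls until a preemption is observed, then invokes markNodeForDeletion
// (a placeholder for the handler's real "mark Node for deletion" step).
func run(ctx context.Context, markNodeForDeletion func() error) error {
	client := &http.Client{Timeout: 2 * time.Second}
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			ok, err := preempted(ctx, client)
			if err != nil {
				fmt.Println("metadata poll failed:", err)
				continue
			}
			if ok {
				// This is the log line the bug report expects to see.
				fmt.Println("Instance marked for termination, marking Node for deletion")
				return markNodeForDeletion()
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	_ = run(ctx, func() error {
		// In the real handler this step would mark the backing Machine/Node
		// for deletion so pods are drained within the ~90s preemption window.
		return nil
	})
}
```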

      #forum-cloud slack thread:
      https://coreos.slack.com/archives/CBZHF4DHC/p1656524730323259

      #forum-node slack thread:
      https://coreos.slack.com/archives/CK1AE4ZCK/p1656619821630479

              mimccune@redhat.com Michael McCune
              rhn-support-ddelcian Daniel Del Ciancio
              Zhaohua Sun
              Votes: 0
              Watchers: 2