OpenShift Bugs / OCPBUGS-19484

[aws] pre-emptible VM: machine-api-termination-handler not marking instance for deletion


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Normal
    • 4.10
    • Important
    • Sprint: CLOUD Sprint 243, CLOUD Sprint 244, CLOUD Sprint 245
    • 3
    • Rejected

      This is a clone of the GCP bug OCPBUGS-2117; the same problem also affects AWS. The description below contains GCP-specific information, but the same general problem exists in the termination handler for AWS as well.

      Description of problem:

      GCP preemptible VM termination is not being handled correctly by machine-api-termination-handler.
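      For the AWS case referenced in the title, a spot interruption is announced to the instance through the EC2 instance metadata service, which is the signal a node-local termination handler has to watch for. The following is a minimal, hypothetical sketch (not the handler's actual code) of checking the documented IMDS spot interruption endpoint; the endpoint paths and headers are the public IMDS ones, everything else here is illustrative.

```go
// Hypothetical sketch: query the EC2 IMDS spot interruption endpoint.
// A 404 means no interruption notice has been issued; a 200 with a JSON
// body means the instance is scheduled to be taken away.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

const imdsBase = "http://169.254.169.254/latest"

// imdsToken fetches an IMDSv2 session token (required when IMDSv2 is enforced).
func imdsToken(c *http.Client) (string, error) {
	req, err := http.NewRequest(http.MethodPut, imdsBase+"/api/token", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", "21600")
	resp, err := c.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("token request failed: %s", resp.Status)
	}
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	token, err := imdsToken(client)
	if err != nil {
		fmt.Println("could not reach IMDS (are you on an EC2 instance?):", err)
		return
	}

	req, _ := http.NewRequest(http.MethodGet, imdsBase+"/meta-data/spot/instance-action", nil)
	req.Header.Set("X-aws-ec2-metadata-token", token)
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("IMDS request failed:", err)
		return
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		body, _ := io.ReadAll(resp.Body)
		fmt.Println("spot interruption notice received:", string(body))
	case http.StatusNotFound:
		fmt.Println("no interruption notice (404) - nothing for a handler to act on yet")
	default:
		fmt.Println("unexpected IMDS status:", resp.Status)
	}
}
```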
      

      Version-Release number of selected component (if applicable):

      Tested on both 4.10.22 and 4.11.2
      

      How reproducible:

      To reproduce the issue:

      Create a spot instance Machine in GCP and stop the instance. The machine-api-termination-handler pod logs show no signal indicating the instance was terminated, although the Machine does show the TERMINATED status in the machines list. As a result, pods are not gracefully moved off during the 90-second window before the node is turned off.

      We would expect a terminated node to wait for pods to move off (up to 90 seconds) and then shut down, instead of shutting down immediately.
      
      

      Steps to Reproduce:

      1. Create a spot instance Machine in GCP.
      2. Stop the instance.
      3. Observe that the machine-api-termination-handler pod logs show no signal indicating the instance was terminated (see the diagnostic sketch after this list).
      4. Note that the machines list does show the TERMINATED status.
      5. Result: pods are not gracefully moved off during the 90-second window before the node is turned off.
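      As a diagnostic for step 3, the hypothetical one-shot check below (not part of the product) queries the documented GCP metadata key instance/preempted, which the linked termination handler code appears to watch. Running it on the node from step 1 shows directly what signal, if any, the instance received.

```go
// Hypothetical diagnostic: read the GCP metadata key that reports whether
// this instance has been preempted. Run on the spot/preemptible node itself.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	const url = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	// All GCP metadata requests must carry this header.
	req.Header.Set("Metadata-Flavor", "Google")

	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("metadata server unreachable (are you on a GCE instance?):", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// "TRUE" means GCP reported a preemption; "FALSE" means no preemption
	// signal was ever delivered to this instance.
	fmt.Printf("instance/preempted = %s\n", string(body))
}
```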
      

      Actual results:

      The machine-api-termination-handler logs never show a message such as "Instance marked for termination, marking Node for deletion"; no termination signal is received from GCP.
      

      Expected results:

      A terminated node should wait for pods to move off (up to 90 seconds) and then shut down, instead of shutting down immediately.
      

      Additional info:
      Here is the code:
      https://github.com/openshift/machine-api-provider-gcp/blob/main/pkg/termination/termination.go#L96-L127
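
      For reference, the linked handler code appears to follow a poll-then-mark pattern: query the metadata server until it reports a preemption, then mark the Node for deletion so pods are drained before the instance goes away. The sketch below is a simplified, self-contained illustration of that pattern, not the real implementation; markNodeForDeletion, the poll interval, and the timeout are placeholders.

```go
// Simplified illustration of the polling pattern in the linked termination
// handler code. The real handler uses the Machine API client to mark the
// Node for deletion; here that step is a placeholder callback.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

const preemptedURL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"

// preempted reports whether the GCP metadata server says this instance
// has been preempted.
func preempted(ctx context.Context, c *http.Client) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, preemptedURL, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Metadata-Flavor", "Google")
	resp, err := c.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return strings.EqualFold(strings.TrimSpace(string(body)), "TRUE"), nil
}

// run polls until a preemption is observed, then invokes markNodeForDeletion
// (a placeholder for the handler's real "mark Node for deletion" step).
func run(ctx context.Context, markNodeForDeletion func() error) error {
	client := &http.Client{Timeout: 2 * time.Second}
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			ok, err := preempted(ctx, client)
			if err != nil {
				fmt.Println("metadata poll failed:", err)
				continue
			}
			if ok {
				// This is the log line the bug report expects to see.
				fmt.Println("Instance marked for termination, marking Node for deletion")
				return markNodeForDeletion()
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	_ = run(ctx, func() error {
		// In the real handler this step would mark the backing Machine/Node
		// for deletion so pods are drained within the ~90s preemption window.
		return nil
	})
}
```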

      #forum-cloud slack thread:
      https://coreos.slack.com/archives/CBZHF4DHC/p1656524730323259

      #forum-node slack thread:
      https://coreos.slack.com/archives/CK1AE4ZCK/p1656619821630479

              mimccune@redhat.com Michael McCune
              rhn-support-ddelcian Daniel Del Ciancio
              Zhaohua Sun
              Votes: 0
              Watchers: 2