OpenShift Bugs / OCPBUGS-2117

[gcp] pre-emptible VM: machine-api-termination-handler not marking instance for deletion


    • Important
    • CLOUD Sprint 243, CLOUD Sprint 244, CLOUD Sprint 245, CLOUD Sprint 246
    • 4
    • Rejected
    • False
    • * Previously, the termination handler prematurely exited before marking a node for termination. This condition occurred based on the timing of when the termination signal was received by the controller. With this release, the possibility of early termination is accounted for by introducing an additional check for termination. (link:https://issues.redhat.com/browse/OCPBUGS-2117[*OCPBUGS-2117*])
    • Bug Fix
    • Done

      Description of problem:

      GCP preemptible VM termination is not being handled correctly by machine-api-termination-handler.
      

      Version-Release number of selected component (if applicable):

      Tested on both 4.10.22 and 4.11.2
      

      How reproducible:

      To reproduce the issue:
      
      Create a spot instance machine in GCP and stop the instance. The machine-api-termination-handler pod logs show no signal indicating the instance was terminated, although the machine list does show the TERMINATED status. As a result, pods are not gracefully moved off in the 90-second window before the node is turned off.
      
      We would expect a terminated node to wait for pods to move off (up to 90 seconds) and then shut down, instead of the node shutting down immediately.
      
      

      Steps to Reproduce:

      1. Create a spot instance machine in GCP.
      2. Stop the instance.
      3. Notice that the machine-api-termination-handler pod logs contain no signal signifying the instance was terminated.
      4. Note that the machine list does show the TERMINATED status.
      5. As a result, pods are not gracefully moved off in the 90-second window before the node is turned off.
      

      Actual results:

      The machine-api-termination-handler logs do not show a message such as "Instance marked for termination, marking Node for deletion"; instead, no signal is received from GCP.
      

      Expected results:

      A terminated node should wait for pods to move off (up to 90 seconds) and then shut down, instead of shutting down immediately.
      

      Additional info:
      Here is the code:
      https://github.com/openshift/machine-api-provider-gcp/blob/main/pkg/termination/termination.go#L96-L127

      #forum-cloud slack thread:
      https://coreos.slack.com/archives/CBZHF4DHC/p1656524730323259

      #forum-node slack thread:
       https://coreos.slack.com/archives/CK1AE4ZCK/p1656619821630479

            mimccune@redhat.com Michael McCune
            rhn-support-ddelcian Daniel Del Ciancio
            Zhaohua Sun Zhaohua Sun